ISYS 1078/1079辅导、辅导via Canvas、讲解web/HTML语言、辅导web设计 解析C/C++编程|调试Matlab程
- 首页 >> Database Web Search Engines and Information Retrieval
ISYS 1078/1079
Assignment 1
Assessment Type Group assignment. Submit online via Canvas → Assignments
→ Assignment 1. Marks awarded for meeting requirements as
closely as possible. Clarifications/updates may be made via
announcements/relevant discussion forums.
Due Date Midday, Wednesday 21 August 2019 – Week 5
Marks 100 (20% of total course mark)
Overview
In this assignment, you will implement an inverted index and use it to store term occurrence
information. You will need to design and implement appropriate data structures,
write your data to disk, and be able to read it back.
This assignment is intended to give you practical programming experience for implementing
core information retrieval functionality, and to deepen your understanding of
the inverted index data structure.
The “Web Search Engines and Information Retrieval” Canvas contains further announcements
and a discussion board for this assignment. Please be sure to check these on a
regular basis – it is your responsibility to stay informed with regards to any announcements
or changes.
Learning Outcomes
This assessment relates to the following learning outcomes:
CLO1: apply information retreival principles to locate relevant information in large
collections of data
CLO2: understand and deploy efficient techniques for the indexing of document
objects that are to be retrieved
CLO3: implement features of retrieval systems for web-based and other search tasks
In addition it relates to the program learning outcomes of Enabling Knowledge; Critical
Analysis; Problem Solving; and Communication/Teamwork.Teaching Servers
Three CSIT teaching servers are available for your use:
(titan|saturn|jupiter).csit.rmit.edu.au
You can log in to these machines using ssh, for example:
where s1234567 is your student number. These servers host the document collection
and other data for the assignments in this course. You are encouraged to develop your
code on these machines. If you choose to develop your code elsewhere, it is your responsibility
to ensure that your assignment submission can be compiled and executed on
(titan|saturn|jupiter).csit.rmit.edu.au, as this is where your code will be run for
marking purposes. If your code cannot be complied and run on the teaching servers at
marking time, you will receive zero marks for the programming component.
You are required to make regular backups of all of your work. This is good practice, no
matter where you are developing your assignment solutions.
Academic Integrity and Plagiarism
Academic integrity is about honest presentation of your academic work. It means acknowledging
the work of others while developing your own insights, knowledge and ideas.
You should take extreme care that you have:
Acknowledged words, data, diagrams, models, frameworks and/or ideas of others
you have quoted (i.e. directly copied), summarised, paraphrased, discussed or mentioned
in your assessment through the appropriate referencing methods.
Provided a reference list of the publication details so your reader can locate the
source if necessary. This includes material taken from Internet sites. If you do not
acknowledge the sources of your material, you may be accused of plagiarism because
you have passed off the work and ideas of another person without appropriate
referencing, as if they were your own.
RMIT University treats plagiarism as a very serious offence constituting misconduct.
Plagiarism covers a variety of inappropriate behaviours, including:
Failure to properly document a source
Copyright material from the internet or databases
Collusion between students
For further information on our policies and procedures, please refer to the following:
https://www.rmit.edu.au/students/student-essentials/rights-and-responsibilities/
academic-integrity.
2General Requirements
This section contains information about the general requirements that your assignment
must meet. Please read all requirements carefully before you start.
You must implement your programs in Java, C, C++, or Python. Your programs
should be well written, using good coding style and including appropriate use of
comments. Your markers will look at your source code, and coding style may form
part of the assessment of this assignment.
You must include a plain text file called “README.txt” with your submission.
This file should include the name and student ID of all team members at the top.
It needs to clearly explain how to compile and run your programs on
(titan|saturn|jupiter).csit.rmit.edu.au. The clarity of the instructions and
usability of your programs may form part of the assessment of this assignment.
Your programs may be developed on any machine, but must compile and run
on the course machines, (titan|saturn|jupiter).csit.rmit.edu.au. If your
submission does not compile and run on these machines, it will not be marked.
The submitted programs must be your own code. You should not use existing
external libraries that implement advanced data structures. Where libraries (or in
the case of scripting languages, built-in features beyond simple low-level data types)
are used for data structures such as hash tables, they must be clearly attributed,
and it is up to you to demonstrate a clear understanding of how the library is
implemented in the discussion in your assignment report.
Paths should not be hard-coded.
Where your programs need to create auxiliary files, these should be stored in the
current working directory.
Please ensure that your submission follows the file naming rules specified in the
tasks below. File names are case sensitive, i.e. if it is specified that the file name is
gryphon, then that is exactly the file name you should submit; Gryphon, GRYPHON,
griffin, and anything else but gryphon will be rejected.
Assignment Teams
This assignment must be carried out in groups of two. It is up to you to find a partner
and form a team. Please try to compose your team carefully, as you should work with
the same person on Assignment 2, which builds on Assignment 1.
Once you have formed a group, you need to register it through Canvas prior to assignment
submission. Further details are provided in the “What to Submit, When, and How”
section of this document.
To manage your teamwork, you should use Trello (https://trello.com).
Each team member should sign up using their RMIT email address.
Each team must create a shared Trello board called “WSEIR Assignment 1”.
You must invite the teaching staff (lecturer and tutor) to join your board.
3To allow for flexibility, you may use Trello in the way that best suits your team. However,
as a minimum you must demonstrate activity that involves:
Creating cards that correspond to the main assignment components.
Assigning particular cards/tasks to each team member.
Showing regular progress on tasks over each week of the assignment period (e.g.
updating cards, and progressing them from “To Do” to “Doing” to “Done”).
Trello will be covered in more detail in the first tutorial.
Note that your assignment report submission must include a participation statement,
indicating the proportion of work contributed by each team member. This should reflect
your Trello board.
Programming Tasks
Have a look at the file /home/inforet/a1/latimes on
(titan|saturn|jupiter).csit.rmit.edu.au. It is part of the TREC TIPSTER test
collection, and consists of various articles from the LA Times newspaper.
Here is an example of the markup format:
LA010189-0001
...
...
...
...
The article headline.
...
The text content of the document.
Individual documents are wrapped in ... tags. These indicate the beginning
and end of each individual document in the collection file.
The ... tags contain a unique identifier for the document. You will
need to keep track of this to refer to documents in the collection.
The ... and ... tags contains the text
content of the document. This is the actual content of the document that you will need
to index.
Your task is to write two programs.
The first will index the collection, gather appropriate term distribution statistics,
and write the index data to disk. It is described in section 1.
4 The second program will load the previously created index data from disk into
memory, accept text terms as input, and return the appropriate term distribution
statistics for each term. It is described in section 2.
Note: the latimes collection file is hosted in a shared directory. The file is large (476M)
and you should not make a local copy of the file in your home directory on the teaching
servers, but instead access the file directly.
1 Indexing (40%)
Your indexing program must be called index and accept a number of optional commandline
arguments. The invocation specification should be:
% ./index [-p]
(or equivalent invocation in another programming language) where the optional flag -p
will print the terms being indexed, and the mandatory argument is the
collection to be indexed. These are described in more detail below.
Note that your implementation must be efficient, making use of appropriate algorithms
and data structures. This will be part of the assessment criteria.
1.1 Parsing Module (10%)
Your first task is to parse the content data contained in the HEADLINE and TEXT tags, tokenising
and extracting content terms, and removing any punctuation and excess markup
tags. Punctuation consists of any symbols that are not letters or numbers. You will
need to carefully consider issues of token normalisation (for example, how to deal with
acronyms and hyphenated words).
All content terms should be case-folded to lower-case as they are indexed.
Your program must be called index, and should accept an optional command-line argument
-p. This flag indicates that your program should print all content terms, in the
same order as in the original document, to the standard output stream, stdout. If the
flag is not specified, your program should produce no output to stdout. An example of
how your program might be run is as follows:
% ./index -p /home/inforet/a1/latimes
(or equivalent invocation).
As your parser encounters each new document, you will need to assign an ordinal
number as a document identifier. These can be assigned sequentially (i.e. the first
document is 0, the second is 1, and so on). This is how the documents will be
referred to in the inverted list information (see below).
You also need to track the unique document identifier specified in the
tags, and how these map to the integer identifiers that you assign. Your program
should therefore always write an output file to disk, called map, which contains this
mapping information. Auxiliary files such as this must be written to the current
working directory.
51.2 Stopping Module (10%)
Stopping is the process of excluding words that have grammatical functions, but contain
no meaning themselves, from consideration during indexing. Examples of such stopwords
are the, of, and and. A stoplist is available on
(titan|saturn|jupiter).csit.rmit.edu.au at /home/inforet/a1/stoplist.
Extend your program index with a module to stop the input terms. Your program should
accept an optional command-line argument, -s, where stoplist is a file of
stopwords that are to be excluded from indexing. An example of how your program
should be invoked is:
% ./index [-s] [-p]
Note that the -p option must still be available; when it is specified, your program should
print all content terms that are not stopwords to the standard output stream.
The content of the file must be stored in an efficient lookup structure, a hash
table. It is up to you to choose a suitable hash function for text strings. You will be
asked to explain your implementation in the report (see below).
1.3 Index Construction Module (20%)
Extend your program index to build an inverted index for the tokenised text. The
inverted index needs to store term occurrence information for all content terms that
occur in the file that is being indexed. For each term t, you need to store:
The document frequency, ft
, a count of the number of documents in the collection
in which term t occurs.
An inverted list for term t, which consists of postings of the form
< d, fd,t >
where:
– d is the document integer identifier in which t occurs
– fd,t is the within-document frequency, a count of how often t occurs in document
d
For example, if the term insomnia occurs in two documents in the collection, twice in
document 10, and three times in document 23, then the inverted list would be:
insomnia: 2 <10, 2> <23, 3>
Important: the punctuation symbols are only used to make the list human-readable. The
actual internal representation of the inverted list would just store a sequence of integers,
represented appropriately in memory and on disk; your stored lists must not include the
extra punctuation between items. See the lecture notes for further details on the inverted
index and inverted lists.
6You can assume that you will be working with collections that are small enough to fit in
main memory.
When your index program has finished constructing the inverted lists for each unique
term that occurs in the collection, the data should be written to disk. Your program
should write three data files in total. These files must be written to the current working
directory.
1. lexicon: Contains each unique term that occurs in the collection and a “pointer”
to the inverted list for that term (see below).
2. invlists: Contains the inverted list information, consisting only of numerical data.
Important: the inverted lists must be written to disk as binary integer data, not
as text/character data.
3. map: Contains the mapping information from document id numbers (as used in the
inverted lists) to the actual document names (as assigned in Section 1.1).
Since the lexicon and inverted lists are stored separately, your lexicon file will need to
include information about the file offset position in invlists where the inverted list for
the corresponding term occurs. Your program should not simply read the invlists file
sequentially from the start each time it is accessed.
It is up to you to design the particulars of the implementation. However, please read
the next section on Querying carefully first, as this is likely to have implications for how
you store your data. You will be asked to explain your implementation in a report (see
below).
2 Querying (20%)
Write a program called search that loads your index data, and searches it to retrieve the
inverted lists for particular terms. An example of how your program should be invoked
is:
% ./search
ISYS 1078/1079
Assignment 1
Assessment Type Group assignment. Submit online via Canvas → Assignments
→ Assignment 1. Marks awarded for meeting requirements as
closely as possible. Clarifications/updates may be made via
announcements/relevant discussion forums.
Due Date Midday, Wednesday 21 August 2019 – Week 5
Marks 100 (20% of total course mark)
Overview
In this assignment, you will implement an inverted index and use it to store term occurrence
information. You will need to design and implement appropriate data structures,
write your data to disk, and be able to read it back.
This assignment is intended to give you practical programming experience for implementing
core information retrieval functionality, and to deepen your understanding of
the inverted index data structure.
The “Web Search Engines and Information Retrieval” Canvas contains further announcements
and a discussion board for this assignment. Please be sure to check these on a
regular basis – it is your responsibility to stay informed with regards to any announcements
or changes.
Learning Outcomes
This assessment relates to the following learning outcomes:
CLO1: apply information retreival principles to locate relevant information in large
collections of data
CLO2: understand and deploy efficient techniques for the indexing of document
objects that are to be retrieved
CLO3: implement features of retrieval systems for web-based and other search tasks
In addition it relates to the program learning outcomes of Enabling Knowledge; Critical
Analysis; Problem Solving; and Communication/Teamwork.Teaching Servers
Three CSIT teaching servers are available for your use:
(titan|saturn|jupiter).csit.rmit.edu.au
You can log in to these machines using ssh, for example:
where s1234567 is your student number. These servers host the document collection
and other data for the assignments in this course. You are encouraged to develop your
code on these machines. If you choose to develop your code elsewhere, it is your responsibility
to ensure that your assignment submission can be compiled and executed on
(titan|saturn|jupiter).csit.rmit.edu.au, as this is where your code will be run for
marking purposes. If your code cannot be complied and run on the teaching servers at
marking time, you will receive zero marks for the programming component.
You are required to make regular backups of all of your work. This is good practice, no
matter where you are developing your assignment solutions.
Academic Integrity and Plagiarism
Academic integrity is about honest presentation of your academic work. It means acknowledging
the work of others while developing your own insights, knowledge and ideas.
You should take extreme care that you have:
Acknowledged words, data, diagrams, models, frameworks and/or ideas of others
you have quoted (i.e. directly copied), summarised, paraphrased, discussed or mentioned
in your assessment through the appropriate referencing methods.
Provided a reference list of the publication details so your reader can locate the
source if necessary. This includes material taken from Internet sites. If you do not
acknowledge the sources of your material, you may be accused of plagiarism because
you have passed off the work and ideas of another person without appropriate
referencing, as if they were your own.
RMIT University treats plagiarism as a very serious offence constituting misconduct.
Plagiarism covers a variety of inappropriate behaviours, including:
Failure to properly document a source
Copyright material from the internet or databases
Collusion between students
For further information on our policies and procedures, please refer to the following:
https://www.rmit.edu.au/students/student-essentials/rights-and-responsibilities/
academic-integrity.
2General Requirements
This section contains information about the general requirements that your assignment
must meet. Please read all requirements carefully before you start.
You must implement your programs in Java, C, C++, or Python. Your programs
should be well written, using good coding style and including appropriate use of
comments. Your markers will look at your source code, and coding style may form
part of the assessment of this assignment.
You must include a plain text file called “README.txt” with your submission.
This file should include the name and student ID of all team members at the top.
It needs to clearly explain how to compile and run your programs on
(titan|saturn|jupiter).csit.rmit.edu.au. The clarity of the instructions and
usability of your programs may form part of the assessment of this assignment.
Your programs may be developed on any machine, but must compile and run
on the course machines, (titan|saturn|jupiter).csit.rmit.edu.au. If your
submission does not compile and run on these machines, it will not be marked.
The submitted programs must be your own code. You should not use existing
external libraries that implement advanced data structures. Where libraries (or in
the case of scripting languages, built-in features beyond simple low-level data types)
are used for data structures such as hash tables, they must be clearly attributed,
and it is up to you to demonstrate a clear understanding of how the library is
implemented in the discussion in your assignment report.
Paths should not be hard-coded.
Where your programs need to create auxiliary files, these should be stored in the
current working directory.
Please ensure that your submission follows the file naming rules specified in the
tasks below. File names are case sensitive, i.e. if it is specified that the file name is
gryphon, then that is exactly the file name you should submit; Gryphon, GRYPHON,
griffin, and anything else but gryphon will be rejected.
Assignment Teams
This assignment must be carried out in groups of two. It is up to you to find a partner
and form a team. Please try to compose your team carefully, as you should work with
the same person on Assignment 2, which builds on Assignment 1.
Once you have formed a group, you need to register it through Canvas prior to assignment
submission. Further details are provided in the “What to Submit, When, and How”
section of this document.
To manage your teamwork, you should use Trello (https://trello.com).
Each team member should sign up using their RMIT email address.
Each team must create a shared Trello board called “WSEIR Assignment 1”.
You must invite the teaching staff (lecturer and tutor) to join your board.
3To allow for flexibility, you may use Trello in the way that best suits your team. However,
as a minimum you must demonstrate activity that involves:
Creating cards that correspond to the main assignment components.
Assigning particular cards/tasks to each team member.
Showing regular progress on tasks over each week of the assignment period (e.g.
updating cards, and progressing them from “To Do” to “Doing” to “Done”).
Trello will be covered in more detail in the first tutorial.
Note that your assignment report submission must include a participation statement,
indicating the proportion of work contributed by each team member. This should reflect
your Trello board.
Programming Tasks
Have a look at the file /home/inforet/a1/latimes on
(titan|saturn|jupiter).csit.rmit.edu.au. It is part of the TREC TIPSTER test
collection, and consists of various articles from the LA Times newspaper.
Here is an example of the markup format:
The article headline.
...
The text content of the document.
Individual documents are wrapped in
and end of each individual document in the collection file.
The
need to keep track of this to refer to documents in the collection.
The
content of the document. This is the actual content of the document that you will need
to index.
Your task is to write two programs.
The first will index the collection, gather appropriate term distribution statistics,
and write the index data to disk. It is described in section 1.
4 The second program will load the previously created index data from disk into
memory, accept text terms as input, and return the appropriate term distribution
statistics for each term. It is described in section 2.
Note: the latimes collection file is hosted in a shared directory. The file is large (476M)
and you should not make a local copy of the file in your home directory on the teaching
servers, but instead access the file directly.
1 Indexing (40%)
Your indexing program must be called index and accept a number of optional commandline
arguments. The invocation specification should be:
% ./index [-p]
(or equivalent invocation in another programming language) where the optional flag -p
will print the terms being indexed, and the mandatory argument
collection to be indexed. These are described in more detail below.
Note that your implementation must be efficient, making use of appropriate algorithms
and data structures. This will be part of the assessment criteria.
1.1 Parsing Module (10%)
Your first task is to parse the content data contained in the HEADLINE and TEXT tags, tokenising
and extracting content terms, and removing any punctuation and excess markup
tags. Punctuation consists of any symbols that are not letters or numbers. You will
need to carefully consider issues of token normalisation (for example, how to deal with
acronyms and hyphenated words).
All content terms should be case-folded to lower-case as they are indexed.
Your program must be called index, and should accept an optional command-line argument
-p. This flag indicates that your program should print all content terms, in the
same order as in the original document, to the standard output stream, stdout. If the
flag is not specified, your program should produce no output to stdout. An example of
how your program might be run is as follows:
% ./index -p /home/inforet/a1/latimes
(or equivalent invocation).
As your parser encounters each new document, you will need to assign an ordinal
number as a document identifier. These can be assigned sequentially (i.e. the first
document is 0, the second is 1, and so on). This is how the documents will be
referred to in the inverted list information (see below).
You also need to track the unique document identifier specified in the
tags, and how these map to the integer identifiers that you assign. Your program
should therefore always write an output file to disk, called map, which contains this
mapping information. Auxiliary files such as this must be written to the current
working directory.
51.2 Stopping Module (10%)
Stopping is the process of excluding words that have grammatical functions, but contain
no meaning themselves, from consideration during indexing. Examples of such stopwords
are the, of, and and. A stoplist is available on
(titan|saturn|jupiter).csit.rmit.edu.au at /home/inforet/a1/stoplist.
Extend your program index with a module to stop the input terms. Your program should
accept an optional command-line argument, -s
stopwords that are to be excluded from indexing. An example of how your program
should be invoked is:
% ./index [-s
Note that the -p option must still be available; when it is specified, your program should
print all content terms that are not stopwords to the standard output stream.
The content of the
table. It is up to you to choose a suitable hash function for text strings. You will be
asked to explain your implementation in the report (see below).
1.3 Index Construction Module (20%)
Extend your program index to build an inverted index for the tokenised text. The
inverted index needs to store term occurrence information for all content terms that
occur in the file that is being indexed. For each term t, you need to store:
The document frequency, ft
, a count of the number of documents in the collection
in which term t occurs.
An inverted list for term t, which consists of postings of the form
< d, fd,t >
where:
– d is the document integer identifier in which t occurs
– fd,t is the within-document frequency, a count of how often t occurs in document
d
For example, if the term insomnia occurs in two documents in the collection, twice in
document 10, and three times in document 23, then the inverted list would be:
insomnia: 2 <10, 2> <23, 3>
Important: the punctuation symbols are only used to make the list human-readable. The
actual internal representation of the inverted list would just store a sequence of integers,
represented appropriately in memory and on disk; your stored lists must not include the
extra punctuation between items. See the lecture notes for further details on the inverted
index and inverted lists.
6You can assume that you will be working with collections that are small enough to fit in
main memory.
When your index program has finished constructing the inverted lists for each unique
term that occurs in the collection, the data should be written to disk. Your program
should write three data files in total. These files must be written to the current working
directory.
1. lexicon: Contains each unique term that occurs in the collection and a “pointer”
to the inverted list for that term (see below).
2. invlists: Contains the inverted list information, consisting only of numerical data.
Important: the inverted lists must be written to disk as binary integer data, not
as text/character data.
3. map: Contains the mapping information from document id numbers (as used in the
inverted lists) to the actual document names (as assigned in Section 1.1).
Since the lexicon and inverted lists are stored separately, your lexicon file will need to
include information about the file offset position in invlists where the inverted list for
the corresponding term occurs. Your program should not simply read the invlists file
sequentially from the start each time it is accessed.
It is up to you to design the particulars of the implementation. However, please read
the next section on Querying carefully first, as this is likely to have implications for how
you store your data. You will be asked to explain your implementation in a report (see
below).
2 Querying (20%)
Write a program called search that loads your index data, and searches it to retrieve the
inverted lists for particular terms. An example of how your program should be invoked
is:
% ./search