CS 2461辅导、讲解Search Engine、辅导C/C++语言、C/C++程序讲解
- 首页 >> C/C++编程CS 2461 Project 5: Information Retrieval (a ‘Search Engine’)
In this project you will implement a document retrieval system that searches through a set of
documents to determine the relevance of the documents with respect to a search phrase/query. The goal
of this project is to further expose you to working with pointers, memory allocation and file I/O
operations in C. It builds on your knowledge of data structures and hash tables and linked lists in
particular – be sure to go over homework 7 and read the example on linked lists in the textbook
(Chapter 19) very carefully. This project has an extra credit option –first focus on completing the basic
specifications before moving to the extra credit component.
You must work on your own on this project – no code or algorithm collaboration of any kind allowed,
and you CANNOT use source code from any other source apart from the textbook or lecture notes.
(Note: We will be running your code through a code plagiarism detection tool to detect similarities.)
You can discuss general architecture of the system and C/Unix questions. You can use the linked list
and Hash table codes you implemented in Homework 7 (and linked list code from Software
Engineering CS 2113)– but make sure you have fixed any problems you had with the code (you don’t
want points taken off multiple times for the same errors!). Failure to adhere to these policies will
constitute a violation of GW’s Academic Integrity code and you will be charged with a violation – a
grade of 0 on the project and at least one grade lower on the course.
Important: This project requires planning and writing a substantial amount of C code as well as
significant amount of time testing. It is highly recommended that you start designing your solution first
on paper (using a flowchart to describe the modules in your system), then determining what data
structures you need, and then building the functions to manipulate your data structures. This is not a
project that you can put off till the weekend before it is due and then expect to have a working project
by the deadline. Section 1 describes the problem and the ranking (relevance) algorithm. Section 2
describes a simple example. Section 3 provides the project specifications and requirements, and Section
4 describes an extra credit option that leads to a more realistic system.
Problem Statement: Alice Nedwons is interested in the task of searching through (secret) documents
in a directory and identify “documents of interest” which are documents that contain specific keywords
or query words. However, the directory has a large number of files including files which have no
relevance to her search parameters, and manually inspecting every file’s contents will take an
unacceptable amount of time. Fortunately, she has decided to hire you for the job to you since she
knows that you have the necessary skills and knowledge to write an efficient program to automate the
search process and give her a set of relevant documents. Her plan rests on the assumption that all
relevant documents (stored as plain text files) are stored in one directory. And luckily for you, this
problem reduces to a special case of the document retrieval problem that can be solved using
techniques that you have already studied!
1. Algorithm for search and retrieval using the tf-idf ranking function
Your task is searching through documents (i.e., files/webpages) in a directory and identify “documents
of relevance” for a search phrase (i.e., search query) submitted by a user. For example, if the search
phrase is “computer architecture GW”, you need to find the documents (i.e., files) that not only contain
(some of) the words in the query but you would like to rank the documents in order of relevance. The
most relevant document would be ranked first, and the least relevant file would be ranked last. This is
similar (but not necessarily the same method) to how you search the web using a search engine (such as
Google) – the results from a search engine are sorted in order of relevance. (Google’s page rank
algorithm uses a very different technique from what we describe here.)
1.1 Overall Architecture of the System
A na?ve solution to the problem is to read all the documents (i.e., words in the documents) and create a
single (large) linked list. Then for each query keyword, you can search through the linked list.
Question (a): If we have N documents, and document Di consists of mi words, then how long does this
simple solution take to search for a query consisting of K words. Give your answer in big O notation. A
more efficient solution can be constructed using other data structures.
This problem is an instance of a document retrieval problem and a solution could be architected bya
composition of two main phases (steps) – the (1) training phase and (2) search phase. The first step, of
training, consists of reading in all the documents and creating an effective data structure that will be
used during the search process. The training step is nothing but a pre-processing phase of your system,
and the data structure you can use to help speed up the search is a hash table. Since we will be
searching for words in a search query, we can create a hash table that stores information of the form
<word, pointer to bucket>. Each bucket is a linked list where a node in the linked list must contain (i)
the word, (ii) document ID, and (iii) number of times that the word appears in that document. Thus a
document may appear in multiple buckets; but a word appears in a single bucket and a word can appear
multiple times in a document. Towards the end of the training phase, all documents have been read by
the program and the hash table created. The final step of the training phase is removal of “stop words”.
Once all documents have been read and the hash index created, the system determines “stop words” for
this set of documents and then removes them from the index. (Stop words are words that occur
frequently, such as articles and prepositions in English or words that do not help us in the relevance
ranking since they occur in most documents. Removing stop words can not only help improve
relevance ranking but can also help speed up the search process since the size of the data structure is
reduced.) Question (b): If we assume that the hash function maps words evenly across the buckets (i.e.,
each bucket gets same number of words), then what is the time complexity (big O notation) of this
improved solution? Use the same parameters as Question (a) for the size of each document and size of
the query set.
The second step is the search/retrieval and ranking phase. The user provides a search query consisting
of a number of words/term, and the program must return the document IDs in order of relevance. If the
query contains multiple words, we perform a hash table lookup for each word. A table lookup for a
word gives us the corresponding bucsket, and by searching through the bucket we can determine if the
word exists in any of the documents and if so then its frequency of occurrence (i.e., the count of the
number of times it appeared in that document). By performing this table lookup for each word in the
search query, we can compute the score or relevance of each document for the query. The higher the
score the more relevant the document, and the result lists the documents in decreasing order of
relevance. This is where the method used to determine the relevance ranking of a document comes into
play. In this project, we use the term frequency-inverse document frequency (tf-idf) score to determine
the relevance of a document. This method is described next.
1.2 The tf-idf Algorithm for Ranking
The simplest way to process a search query is to interpret it as a Boolean query – either a document
contains all the words or they do not. However, this usually results in too few or too many results and
further does not provide a ranking that returns the documents that may be most likely to be useful to the
user. We wish to assign a ‘score’ to how well a document matched the query, and to do this we need a
way to assign a score to a document-query pair. To do this, consider some questions such as ‘how
many times did a word occur in the document’, ‘where the word occurs and how important the word
is’. The goal of relevance functions (which is the ‘secret sauce’ of search engines) is to determine a
score that co-relates to the relevance of a document.
The term frequency-inverse document frequency (tf-idf) method is one of the most common weighting
algorithms used in many information retrieval and text search problems. This starts with a bag of words
model – the query is represented by the occurrence counts of each word in the query (which means the
order is lost – for example, “john is quicker than mary” and “mary is quicker than john” both have the
same representation in this model).
Thus, a query of size m can be viewed as a set of m search terms/words w1, w2,…wm.
Note: We use the ‘word’ and ‘term’ interchangeably in what follows.
Term Frequency: is a measure to quantify the frequency of occurrence of a word within a particular
document.
The term frequency tfw,i of a term (word) w in document i is a measure of frequency of the
word in the document.
Using raw frequency, this is the number of times that term (word) appears in the document i. Note:
There are variations of term frequency that are used in different search algorithms; for example, since
raw frequency may not always relate to relevance they divide the frequency by the number of words in
the document to get a normalized raw frequency. Further, some compute the log frequency weight of
term w as the log of tf. For this project, you use the simple definition of number of times the word
appears in the document. You could, if you prefer, use the normalized (divide by number of words in
the document) or logarithm scaled but clearly indicate in your report and code documentation which
metric you are using IF you are not using the simple definition. One of the problems with the tf score is
that common (and stop) words can get a high score – for example, terms like “and” “of” etc. In
addition, if a writer of a document wants a high score they can ‘bias’ the search engine by replicating
words in the document.
Inverse Document Frequency: Many times a rare term/word is more informative than frequent terms
– for example, stop words (such as “the” “for” etc.). So we consider how frequent the term occurs
across the documents being searched (i.e., in the database), and the document frequency dfw captures
this aspect in the score.
The document frequency dfw is the number of documents that contain the term w.
The inverse document frequency idfw of term w is defined as idfw = log (N/dfw), where N is the
total number of documents in the database (i.e., being searched). If dfw=0 then 1 is added to the
denominator to handle the divide by zero case, i.e., for this case idfw = log (N/(1+dfw)). (The
logarithm is used, instead of N/dfw to dampen the effect of idf.)
tf-idf weighting: The tf-idf score gives us a measure of how important is a word to a document among
a set of documents. It uses local (term frequency) and global (inverse document frequency) to scale
down the frequency of common terms and scale up the frequency/score of rare terms.
The tf-idf(w,i) weight of a term (word) w in document i is the product of its term frequency and inverse
document frequency.
tf-idf(w,i) = tfw,i× idfw
A search query (i.e., search phrase), submitted by a user, typically consists of a number of words/terms.
Therefore we have to determine the relevance, or rank, of the document for the entire search phrase
consisting of some m number of words w1, w2….wm , using the tf-idf scores for each word. The
relevance, or rank, Ri of document i for this search phrase consisting of m words, is defined as the
sum of the tf-idf scores for each of the m words.
Ri = tf idf (wj,i) j=1
m
∑
Some references for more information on tf-idf method for document retrieval.
H. Wu and R. Luk and K. Wong and K. Kwok. "Interpreting TF-IDF term weights as making
relevance decisions". ACM Transactions on Information Systems, 26 (3). 2008.
J. Ramos, “Using TF-IDF to determine word relevance in document queries”.
1.3 Dealing with Stop words.
An important part of the information retrieval algorithms involves dealing (removing) with stop words.
Stop words are words which do not play a role in determining the significance or relevance of a
document – these could be either insignificant words (for example, articles, prepositions etc.) or are
very common in the context of the documents being processed. Stop words are language dependent, as
well as context dependent, and there are a number of methods discussed in the literature to identify stop
words and to create a list of stop words for the English language. In this project, you will use a simple
heuristic (described in what follows) that identifies stop words based on the context (i.e., the set of
documents). Simply stated, the words that occur frequently across all documents could be tagged as a
stop word since they will have little value in helping rank this set of documents for a user query. In
terms of our metrics, term frequency and inverse document frequency, the lower the idf the greater the
probability that the word is a stop word. Simplifying this further, we can tag a word as a stop word if it
has a idf=0 (i.e., it appears in all documents). (Note: A better solution would be combine words with
low idf with a list of stop words consisting of articles, prepositions etc. which may or may not have a
low idf score for the set of documents you are processing.) In this project you will implement a stop
word removal function that will remove words from your hash index based on an idf score of 0 – i.e.,
after the hash index is built the words with idf=0 will be removed from that bucket thus resulting in a
final hash index that has no words with idf=0. This will lead to a more efficient search/query process –
Question (c) why does this lead to a more efficient search process ? If you want to build a more
realistic system, you can combine this algorithm along with a statically provided stop word list (i.e.,
common prepositions etc.) and look up the stop word list during insertion into the hash table.
2. A Simple Example
Suppose you are given documents D1.txt, D2.txt and D3.txt whose contents are as follows:
D1.txt computer architecture at GW is both torture and fun
D2.txt computer architecture refers to the hardware and software
architecture of a computer
D3.txt Greco roman architecture is influenced by both greek
architecture and roman architecture
The search query, consisting of three words/terms is:
computer architecture GW
2.1 Training (Pre-Processing) phase
D1 contains 9 words, and D2 contains 12 words, D3 contains 12 words. The word architecture is
common to all three documents, and computer is common to D1 and D2, while GW appears only in
D1.
Suppose a hash function with 4 buckets (this is only an example – we are not using the actual hash
function you will implement) will hash some these words as follows (note that there are collisions – for
example, both “computer” and “torture” are hashed to the same bucket):
Bucket Words
0 computer, torture,roman
1 is, fun, and, greek, Greco, GW
2 architecture, refers,hardware,
3 a, at, by, influenced, both,
The pre-processing of the documents D1. D2 and D3, will result in entries being placed into the hash
table. Each bucket contains a pointer to a linked list; there are as many linked lists as there are buckets
in the hash table. Note the mapping of words to buckets is strictly dependent on the hash function being
used, but each bucket will contain entries of the type described earlier and shown in the figure below.
Note that if a word from D2 gets mapped to the same bucket, it should be “added” to the data structures
(lists) for that bucket. If a word is repeated then its count should be incremented – for example, the
count of “computer” in D2 is 2 since “computer” appears twice in document D2.
2.2 Dealing with stop words:
There are two options for organizing data to facilitate both the search process as well as removing stop
words. The first (straightforward) approach is shown in Figure 1. In this case, each bucket has a linked
list and the same word can appear multiple times in the list but only once for each document. Further,
the term frequency of the word in that document is stored at the node. To compute the idf score for
each word (after reading in all the documents), the algorithm needs to traverse the linked list and
compute the idf score, and if idf=0 then it should remove all occurrences (from all documents) of that
word from the linked list. This option will let you use the hash map you created in Homework 7 to be
used directly without major modifications.
The second approach is shown in Figure 2. In this case, we have a linked list for each word and the
document frequency df score for that word can be stored in the node of the linked list for that bucket.
Thus, after reading in all documents, removing stop words can be done by examining the df scores at
each node (in the upper level linked list) and computing the idf score to determine if that entire linked
list needs to be deleted. If you choose to implement this option then you will need to change your hash
table implementation from Homework 7.
In the example, the word “and” appears in all three documents and its document frequency df =3 and
therefore idf = log (3/3)=0 and is identified as a stop word and needs to be removed from the index.
Question 1(d): Which of the two approaches is more efficient in terms of removing stop words and
why. Which option did you choose to implement.
2.2 Search/Retrieval phase:
During the search phase, the system takes a user query and searches for relevant documents. To
determine the relevance of each document, it uses the tf-idf scoring (ranking) technique. In our
example, our query set contains the words “computer” and “architecture” and “GW”. The retrieval
process starts off by searching for the first word in the query – “computer” and computes its score.
Since “computer” gets hashed to the first bucket (bucket 0). We search through this bucket and
compute the tf-idf score for every document for the word “computer”. In this example, I am using the
raw frequency (i.e., count) to determine the tf score.
The term frequency for the search term “computer” for each document is: tfcomputer,1=1,
tfcomputer,2=2, tfcomputer,3=0 (since computer does not occur in document D3). This step simply
searches for the word (in the appropriate bucket) and retrieves the frequency count stored at the
node in the list (if the word is found, else it is zero).
For the document frequency: the word “computer” occurs in 2 out of 3 documents therefore
dfcomputer=2.
Inverse document frequency: idfcomputer= log(3/2)= 0.17
The tf-idf score for the term “computer” for the three documents are:
o tf-idf(computer,1)= 1*0.17 = 0.17
o tf-idf(computer,2)= 2*0.17= 0.34
o tf-idf(computer,3)= 0
We can similarly compute the tf-idf terms for the other two terms in the search query – “architecture”
and “GW” and this gives us:
for the term “architecture”
o tf-idf(architecture,1) = 1* (log (3/3))=0
o tf-idf(architecture,2) = 2 * log (3/3)=0
o tf-idf(architecture,3) = 3* log (3/3)=0
for the term “GW”
o tf-idf(GW,1)= 1*log(3/1)= 1* 0.48= 0.48
o tf-idf(GW,2)= 0
o tf-idf(GW, 3) =0
Using the above tf-idf scores for each term we can compute the rank/relevance score (for the entire
query “computer architecture GW”) of each document as:
o R1 = 0.17 +0 + 0.48 = 0.65
o R2 = 0.34+0 = 0.34
o R3 = 0 + 0 = 0
Based on the above relevance ranking, the system would return D1,D2,and D3 in order of relevance. If
the term frequency for all search terms is zero, then that document should not be returned since none of
the terms in the search query appear in the document (in the example, D3 does contain “architecture”).
We can also have a perfect match if all words in the query appear in a document (i.e., no-zero term
frequency for all words in the query for a document).You have to figure out how to keep track of the
score and what data structure to use for this purpose.
3. Specification and Requirements
This project is worth 150 points.
A formal description of the problem can be stated as:
Task : Given a search query consisting of a number of words, retrieve (i.e., list) and rank documents in
that are relevant to this search query.
Input : A set of plain text documents (D1, D2... Dn) in a single directory that need to be stored and
indexed. A search query consisting of query words w1,.. wq.
Output : Listing of the (names) of documents ordered by descending order of relevance/score (i.e., the
most relevant document with the highest score first).
3.1 Assumptions
Following are some assumptions and conditions that your solution must satisfy:
For the sake of this assignment, assume there are at least three documents D1.txt, D2.txt and
D3.txt. You can design your solution with more than 3 documents – labeled D1.txt through
Dn.txt. If you choose to implement the extra credit version (reading files from a directory) then
the number of documents depends on the number of files in the directory.
You can assume each document contains several words, and you can assume no word is longer
than 20 characters (it is possible to design a solution without these assumptions).
The document only contains words from the English alphabet (i.e., no digits or special
characters from the keyboard).
For simplicity, you can assume all words are in lower case. But see if you can write a program
that is case insensitive – if you implement this option, then please indicate this clearly in your
documentation (code comments).
The query set (i.e., the search phrase/query) can be of an arbitrary length (and you can again
assume no word is longer than 20 characters). If you feel the need to simplify this and assume a
maximum number of words in the query, then clearly state this assumption – you will lose 2.5
points for this assumption.
The query set (of keywords) is entered by the user at run-time (after the pre-processing phase
when all the documents have been read) and you can assume they are entered on one line. The
program must prompt the user for the query keywords and then return the result of the search.
After returning the results the program will return to prompt for the next query set or for special
symbol # to exit the program. If you set a maximum size to the query set then include this in
your prompt.
The number of buckets in the hash table is specified at run-time by the user.
o If you make a simplifying assumption and assume that this size is specified statically at
compile time in the program you will incur a penalty of 5 points.
You should not make any assumptions on the contents of the documents or the query words.
3.2 What you have to implement: Requirements and Specifications
A hashing function that takes a string w as input and converts it into a number. Refer to
Homework 7 for the function specifications. A simple (and general) process is: Sum up the
ASCII values of all the characters in the string to get a number S and then apply the hash
function to get bucket b. For the hash function, you can choose the simple b= S% N (i.e., S
modulo N) function where N is the number of buckets in the hash table. You should explore if
there are other, better, hash functions you can choose for this application, and if you choose a
different hash function, you must then define that function in your documentation and why you
chose that function. Note: Hash functions using a modulo N function typically use a prime
number for N (the number of buckets). Why do you think this is the case ?
A function that inserts a string w and the associated document number i in the hash table (into
bucket) – refer to homework 7 for the function specification ( hm_put ). Take care to ensure
that the frequency of the string in that document is updated if the string has appeared before in
the document (i.e., if it has already been inserted into the table). This function will need to call
some of the functions you need to implement linked lists. If you need to refer to code to
implement linked list functions, then read Chapter 19 and you can use the code provided in the
book.
A function training for the “training” process, i.e., pre-processing, that takes a set of
documents as input and returns the populated hash table as output. Figure out the specifications
for the function.
A function read_query to read the search query.
A function rank in the search/retrieval process that computes the score for each document and
ranks them based on the tf-idf ranking algorithm. Your system should also determine if there is
no document with a match – i.e., if none of the words in the search query appear in any of the
documents.
A function stop_word that is part of (last step of) the training process that identifies stop
words and removes the stop words from the hash table and adjusts the hash table and lists
accordingly.
A main function that first calls the training process to read all the documents and create the
hash table.
o Note that main must first prompt user for the size of the hash table, i.e., “Enter number
of buckets: “
o Once the training phase is over, it will enter the retrieval (search) phase to search for the
keywords and find the documents that contain these keywords. main will prompt user
and asks for “Enter S for search” or “X to Exit”.
o If user enters S, then prompts for the query set (keywords entered on one line) and then
calls the read_query function to read the query set. The program then computes the
score (call function rank ) prints out the documents in order of relevance (i.e.,
descending order of scores), and returns to the main prompt (i.e., to prompt for another
search or to exit).
A makefile. Think carefully about how you want to construct the different modules and
therefore how you set up the makefile.
3.3 Grading and Submission Instructions:
You must turn in, using blackboard, a tar (or zip) file containing (1) a short document (report)
describing your implementation – show the flow chart, algorithms and how the different functions
interact with each other, (2) the source code files and (3) the makefile.
We will test the code on shell.seas.gwu.edu – so be sure your code works on shell (and gcc)
before submission.
If your code does not compile, you will receive a zero for the project.
If your code crashes during normal operation (i.e., the specifications of the project), then it can
result in a 50% penalty depending on the severity of the reason for the crash.
You are required to provide the makefile. How you break up your code into different files will
play a role in your grade. To run the code, we will use the make command – so make sure you
test your makefile before submitting.
You will be graded on both correctness (60%) as well as efficiency (30%) of your solution, in
addition to documentation and code style (10%).
o Efficiency refers to the time complexity of your algorithm and also includes data
structures you use and memory management (no memory leaks!).
o You must document your code – if you provide poor documentation then you lose 10%
of the grade. However, if you provide no documentation whatsoever then you will
be penalized 20%.
Any assumptions you make on the specification of the input and search process (if the provided
specifications do not cover your question) then you should state these clearly in the report and
in the comments in your code (in function main). Failure to do so may lead to your program
being graded as one that does not meet specifications. Additionally, if you make an assumption
that contradicts the specifications we provided then you are not meeting the project
specifications and points will be deducted.
Read the next section for extra credit options.
If you choose to use your one time late submission, you have 36 hours extra but will incur a 10% late
penalty (in addition to any other points taken off during grading).
4. Extra Credit Options
Consider adding this extra credit option after you have completed the basic project. It will help you
learn a few more C/Unix utilities (you can try to get an idea of how to query directories by running
some sample code before integrating into your project). There is no partial credit on the extra credit
options – you must implement the specified functionality fully, and to meet the specifications.
Option: Automatic reading of arbitrary number of documents: (10 points) In this project,
we hard coded the names of the file we are interested in, in the program itself. This can be quite
cumbersome when we have too many files that we need to search in. In a realistic scenario, we could
just specify a directory and the program just reads all the text files from that directory and uses them to
build the hash table. In order to do this, we make the following assumptions:
1. We are given a wild card string to search for within the directory.
2. The directory does not have sub-directories.
In Unix the function we use is glob which is defined in glob.h. So all you need to do is include it in
the same way as stdio.h and then you can call it as
glob_t result;
retval = glob( search_string, flags, error_func, result );
For further details on this function and a simple code snippet, please use the command "man glob" on
the command line. The results of the function call are in the variable result and the pathnames of the
files found in this directory can be found in the gl_pathv member variable of this structure.
As a part of your extra credit assignment, you are expected to use this function in your code to read in a
directory and a search string from the user and use this to read in all the specified files. For example, if
there is a directory called "sample" and it contains three files "a.txt" "b.txt" and "c.txt" and suppose that
the user enters the search string "sample/*.txt", you should use glob to obtain the filenames
"sample/a.txt" "sample/b.txt" "sample/c.txt". Your code should then use these file names and read in
their contents. The rest of the code remains unchanged.