DEGREES OF MSc, MSci, MEng, BEng, BSc, MA and MA (Social Sciences)
Information Retrieval (M)
COMPSCI5011
Wednesday 10 May 2023
1.
(a) The following corpus of documents has been processed by an IR system where stemming is not applied:
Doc1: Real estate speculation is of interest.
Doc2: Interest rates are increasing interest in home costs.
Doc3: Students have no real interest in interest rates.
Doc4: As interest rates fall, the real estate market is heating up.
Doc5: The government is considering increasing interest rates.
(i) Assume that the following terms are stopwords: an, as, and, are, do, in, is, of, not, the, up. Give the vector space model representations of documents Doc1 and Doc2, assuming that you are using (raw) term frequency to weight the terms. Show clearly the terms of your vectors as well as their weights. [2]
(ii) Consider the following query Q:
Q= interest rates
Provide the vector space model representation of Q, showing both the terms as well as their weights. [1]
Compute the cosine similarity between the query Q and Doc1 as well as the cosine similarity between Q and Doc2. Show your working. [2]
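A short Python sketch for checking the arithmetic in parts (i) and (ii). It assumes case-folding and simple punctuation stripping, conventions the question leaves implicit:

```python
import math

STOPWORDS = {"an", "as", "and", "are", "do", "in", "is", "of", "not", "the", "up"}

def tf_vector(text):
    """Raw term-frequency vector after case-folding, punctuation stripping
    and stopword removal."""
    vec = {}
    for tok in text.lower().replace(".", " ").replace(",", " ").split():
        if tok not in STOPWORDS:
            vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(v1, v2):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v2.get(t, 0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2)

doc1 = tf_vector("Real estate speculation is of interest.")
doc2 = tf_vector("Interest rates are increasing interest in home costs.")
q = tf_vector("interest rates")

print(doc1)                        # raw tf weights for Doc1
print(doc2)                        # raw tf weights for Doc2
print(round(cosine(q, doc1), 4))
print(round(cosine(q, doc2), 4))
```

Under these conventions Doc1 shares only "interest" with Q, giving cosine 1/(√2·2) ≈ 0.3536, while Doc2 scores 3/(√2·√8) = 0.75.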
(iii) Assume the same list of stopwords as (i) above. Construct an inverted file for
all the documents of the corpus, showing clearly the dictionary and posting list components. Your inverted file needs to store sufficient information for computing a simple tf*idf term weight, where wij = tfij *log2(N/dfi). [5]
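The inverted file in part (iii) can be built mechanically. A minimal sketch (same assumed tokenisation as above); the dictionary stores (term, df) and each posting stores (doc_id, tf), which is exactly what wij = tfij * log2(N/dfi) requires:

```python
from collections import defaultdict

STOPWORDS = {"an", "as", "and", "are", "do", "in", "is", "of", "not", "the", "up"}

docs = {
    "Doc1": "Real estate speculation is of interest.",
    "Doc2": "Interest rates are increasing interest in home costs.",
    "Doc3": "Students have no real interest in interest rates.",
    "Doc4": "As interest rates fall, the real estate market is heating up.",
    "Doc5": "The government is considering increasing interest rates.",
}

# term -> {doc_id: tf}: df is the posting-list length, tf is stored per doc.
postings = defaultdict(dict)
for doc_id, text in docs.items():
    for tok in text.lower().replace(".", " ").replace(",", " ").split():
        if tok not in STOPWORDS:
            postings[tok][doc_id] = postings[tok].get(doc_id, 0) + 1

for term in sorted(postings):
    print(f"{term} (df={len(postings[term])}) -> {sorted(postings[term].items())}")
```

Note in particular df(interest) = 5, df(rates) = 4, df(real) = 3 and df(estate) = 2, which feed directly into part (iv).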
(iv) Assuming the use of a best match ranking algorithm, rank all documents of the
corpus using their relevance scores for the following query:
real estate interest
Show your working. Note that log2(1.25)= 0.3219, log2(1.6666)= 0.7369, log2(2.5)= 1.3219 and log2(5)= 2.3219 (you may not need all of these). [3]
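The best-match ranking in part (iv) can be checked with a short script that scores each document by summing tf * log2(N/df) over the query terms (same assumed tokenisation as above):

```python
import math
from collections import defaultdict

STOPWORDS = {"an", "as", "and", "are", "do", "in", "is", "of", "not", "the", "up"}

docs = {
    "Doc1": "Real estate speculation is of interest.",
    "Doc2": "Interest rates are increasing interest in home costs.",
    "Doc3": "Students have no real interest in interest rates.",
    "Doc4": "As interest rates fall, the real estate market is heating up.",
    "Doc5": "The government is considering increasing interest rates.",
}

postings = defaultdict(dict)
for doc_id, text in docs.items():
    for tok in text.lower().replace(".", " ").replace(",", " ").split():
        if tok not in STOPWORDS:
            postings[tok][doc_id] = postings[tok].get(doc_id, 0) + 1

N = len(docs)
query = ["real", "estate", "interest"]

# Best-match scoring: score(d) = sum over query terms of tf * log2(N/df).
scores = defaultdict(float)
for term in query:
    df = len(postings.get(term, {}))
    if df == 0:
        continue
    idf = math.log2(N / df)   # note: "interest" occurs in every document, so idf = 0
    for doc_id, tf in postings[term].items():
        scores[doc_id] += tf * idf

for doc_id, score in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
    print(doc_id, round(score, 4))
```

Doc1 and Doc4 tie at the top (both match "real" and "estate"), Doc3 follows, and Doc2 and Doc5 score 0 because "interest" carries zero idf.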
(b) Consider the following non-interpolated recall-precision graph, showing the performance of an IR system on a given query Q. [Recall-precision graph not reproduced in this copy.] For this query Q, the IR system has returned 20 documents. Assume that Q has 16 relevant documents in the ground truth, not all of which have been retrieved by the system for the query.
(i) Compute the interpolated precision values for this query Q. Show your working. [4]
(ii) The IR system has returned 20 documents ranked from rank=1 to rank=20. In the tables below, indicate if the corresponding document at that rank was relevant (R) or non-relevant (X). [2]
Rank:  1   2   3   4   5   6   7   8   9   10
R/X:   __  __  __  __  __  __  __  __  __  __

Rank:  11  12  13  14  15  16  17  18  19  20
R/X:   __  __  __  __  __  __  __  __  __  __
(iii) Compute the Average Precision (AP) for query Q. [1]
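The mechanics behind parts (i) and (iii) can be sketched as below. The relevance judgements used here are a hypothetical run, not the exam's actual ranking (which depends on the graph not reproduced in this copy):

```python
def precision_recall_points(rels, num_relevant):
    """(recall, precision) observed after each rank in the run."""
    points, hits = [], 0
    for rank, rel in enumerate(rels, start=1):
        hits += rel
        points.append((hits / num_relevant, hits / rank))
    return points

def interpolated_precision(points, recall_levels):
    """Interpolated precision at r = max precision at any recall >= r."""
    return [max((p for rec, p in points if rec >= r), default=0.0)
            for r in recall_levels]

def average_precision(rels, num_relevant):
    """AP: precision at each relevant retrieved rank, averaged over ALL
    relevant documents (unretrieved relevant documents contribute zero)."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant

# Hypothetical run (1 = relevant, 0 = non-relevant) with 6 relevant docs
# in the ground truth, one of which is never retrieved.
rels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
points = precision_recall_points(rels, num_relevant=6)
levels = [i / 10 for i in range(11)]
ip = interpolated_precision(points, levels)
ap = average_precision(rels, num_relevant=6)
print([round(p, 4) for p in ip])
print(round(ap, 4))
```

Because one relevant document is never retrieved, interpolated precision falls to 0 at the highest recall levels, and AP is pulled down by the zero contribution of the missing document.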
2.
(a) Consider a collection A with 5000 documents with a lexicon size of 10000 words (number of unique words). Consider a collection B with 11250 documents and a lexicon size of 15000 words. Suppose that all documents have an average length of 200 words. We then add 2200 documents to each collection.
Use Heaps’ Law with β= 0.5 to estimate how many additional unique terms we would add to A & B. Show your working.
Explain whether the obtained estimates are as you would expect, given the number of documents in A & B, and why? [4]
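The Heaps' Law estimate reduces to fitting k from V = k * n^0.5 on the current collection and extrapolating to the enlarged one. A minimal worked check:

```python
def heaps_vocab(k, tokens, beta=0.5):
    """Heaps' law: V = k * n^beta, n counted in tokens."""
    return k * tokens ** beta

def added_unique_terms(vocab, n_docs, added_docs, avg_len=200, beta=0.5):
    tokens = n_docs * avg_len
    k = vocab / tokens ** beta           # fit k from the current collection
    new_vocab = heaps_vocab(k, (n_docs + added_docs) * avg_len, beta)
    return new_vocab - vocab

extra_a = added_unique_terms(10000, 5000, 2200)   # collection A
extra_b = added_unique_terms(15000, 11250, 2200)  # collection B
print(round(extra_a))   # 2000
print(round(extra_b))
```

Both collections happen to fit V = 10 * sqrt(n). A gains 2000 new unique terms, while the larger B gains only about 1401 from the same 2200 documents, consistent with Heaps' sublinear vocabulary growth: the more text a collection already contains, the more of the incoming words it has already seen.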
(b) Assume that a user's original query is "bargain books bargain DVDs really bargain books". The user examines two documents, d1 and d2. She judges d1, with content "great books bargain bestsellers really bargain books", relevant, and d2, with content "big big bargain DVDs", non-relevant.
Assume that the IR system is using the vector space model with raw term frequency (i.e. with no IDF). Moreover, for relevance feedback, suppose the use of Rocchio’s Relevance Feedback (RF) algorithm, with parameters α=1, β= 0.75, γ= 0.25.
(i) First, provide the original query vector. Moreover, provide the vector representations for d1 and d2. You need to clearly show the terms of the vector in addition to the weights. [2]
(ii) Next, provide the reformulated query that will be run by the IR system after the application of Rocchio’s RF algorithm. Terms in the reformulated query with negative weights can be dropped, i.e. their weights can be changed back to 0. Show your working. [3]
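Parts (i) and (ii) can be checked with a short script applying Rocchio's formula directly (case-folding assumed, so "DVDs" appears as "dvds"):

```python
from collections import Counter

def tf(text):
    """Raw term-frequency vector (case-folded)."""
    return Counter(text.lower().split())

q  = tf("bargain books bargain DVDs really bargain books")
d1 = tf("great books bargain bestsellers really bargain books")  # judged relevant
d2 = tf("big big bargain DVDs")                                  # judged non-relevant

alpha, beta, gamma = 1.0, 0.75, 0.25

# Rocchio with one relevant and one non-relevant document:
# q' = alpha*q + beta*d1 - gamma*d2, dropping non-positive weights.
q_new = {}
for term in set(q) | set(d1) | set(d2):
    weight = alpha * q[term] + beta * d1[term] - gamma * d2[term]
    if weight > 0:
        q_new[term] = weight

print(q_new)
```

"big" ends up with weight -0.5 and is dropped; "bargain" (3 + 0.75·2 - 0.25·1 = 4.25) dominates the reformulated query.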
(iii) The users of the IR system requested a new feature in the user interface,
where next to each returned document, they can click “Find documents like this one” to obtain more similar documents. You have been asked to re-use Rocchio’s algorithm to deploy such a feature and return similar documents. Explain which weight setting for α, β and γ you would use to deploy this new feature. Justify your answer. [2]
(c) Consider the following six documents.
Doc1: I like Terrier Terrier Terrier
Doc2: I like Terrier and I like the course
Doc3: I like the course
Doc4: I don’t like Terrier
Doc5: I like dogs
Doc6: I like puppies
Assume an IR system that uses the probabilistic Binary Independence Retrieval Model. Recall that in this model, the relevance score of a document to a query is computed using the following RSV formula (reconstructed here in plain text; pi is the probability that term i occurs in a relevant document and si the probability that it occurs in a non-relevant document):

RSV(d) = sum over query terms i present in d of log[ pi (1 - si) / ( si (1 - pi) ) ]
Assume a large corpus (N=1000), and that the terms like, Terrier, course, dogs and puppies only occur in the documents above (i.e. they do not occur in the remaining 994 documents).
(i) First, assume that there is no relevance information available to the system. Moreover, assume that pi is a constant, namely pi=0.5; i.e., half of all relevant documents will contain each query term. Now, consider the following query Q:
Q = Terrier
Calculate the Retrieval Status Value (RSV) of the six documents above for query Q, and rank the documents. Use log base 10. Show your workings and justify your answer. [3]
(ii) Next, assume that we have some information about term occurrences in the relevant and non-relevant documents for the above query Q. Indeed, through the use of relevance feedback, we were told that the only relevant documents for Q are: Doc1, Doc2, and Doc6. Everything else is non-relevant. Use this relevance information to compute the probabilities pi and si. Then, calculate the RSV between the query Q and the six documents, and rank the documents. [4]
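A sketch for checking parts (i) and (ii). It uses the unsmoothed estimates pi = ri/R and si = (dfi - ri)/(N - R); note that some treatments add 0.5 smoothing to these ratios, which would shift the numbers slightly:

```python
import math

docs = {
    "Doc1": "I like Terrier Terrier Terrier",
    "Doc2": "I like Terrier and I like the course",
    "Doc3": "I like the course",
    "Doc4": "I don't like Terrier",
    "Doc5": "I like dogs",
    "Doc6": "I like puppies",
}
N = 1000
query = {"terrier"}

def terms(text):
    """Binary (set-of-terms) document representation, case-folded."""
    return set(text.lower().split())

df = {t: sum(t in terms(text) for text in docs.values()) for t in query}

def rsv(doc_terms, p, s):
    """BIR RSV: sum over query terms in the document of log[p(1-s)/(s(1-p))]."""
    return sum(math.log10(p[t] * (1 - s[t]) / (s[t] * (1 - p[t])))
               for t in query & doc_terms)

# Part (i): no relevance info, p_i = 0.5, s_i approximated by df_i / N.
p1 = {t: 0.5 for t in query}
s1 = {t: df[t] / N for t in query}
scores_i = {d: round(rsv(terms(text), p1, s1), 4) for d, text in docs.items()}

# Part (ii): relevant set {Doc1, Doc2, Doc6}, unsmoothed estimates.
rel = {"Doc1", "Doc2", "Doc6"}
R = len(rel)
r = {t: sum(t in terms(docs[d]) for d in rel) for t in query}
p2 = {t: r[t] / R for t in query}                   # 2/3 for "terrier"
s2 = {t: (df[t] - r[t]) / (N - R) for t in query}   # 1/997 for "terrier"
scores_ii = {d: round(rsv(terms(text), p2, s2), 4) for d, text in docs.items()}

print(scores_i)
print(scores_ii)
```

In both parts the binary model gives Doc1, Doc2 and Doc4 identical scores (term frequency is ignored), and the relevant Doc6 still scores 0 because it contains no query term.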
(iii) Briefly provide the main take-away messages you might draw from your results in Parts (i) and (ii). [2]
3.
(a) We discussed a number of IR models in the class. These models can make a number of assumptions to simplify the relevance estimation of a document given a query. Provide 4 examples of assumptions made by IR models, which do not necessarily hold true. In particular, for each assumption, briefly provide a concrete and informative example showing why the assumption does not hold in a real-world search environment. [4]
(b) Explain why recording the upper bound contribution of a weighting model on a given posting list facilitates the deployment of techniques that improve the efficiency of a search engine. [4]
(c) Consider a query q, which returns all web pages shown in the hyperlink structure below. [Hyperlink graph not reproduced in this copy.]
(i) Write the adjacency matrix A for the above graph. [1]
(ii) Using the iterative HITS algorithm, provide the hub and authority scores for all the webpages of the above graph by running the algorithm for two iterations. Show the hub and authority scores for each page after each iteration. Show your workings. [4]
You can write matrices in plain text like a table, or write matrices row by row, e.g. ([a,b];[c,d]) shows a matrix with two columns and two rows where the first row is [a,b] and the second row is [c,d].
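The iterative HITS procedure asked for in (ii) can be sketched as follows, on a hypothetical 3-page graph (the exam's graph is not reproduced in this copy). Scores are normalised to sum to 1 after each update; some presentations normalise by the L2 norm instead, which changes the values but not the ranking:

```python
def hits(adj, iterations=2):
    """Iterative HITS on adjacency matrix adj (adj[i][j] = 1 iff page i
    links to page j): a = A^T h, then h = A a, renormalising each time."""
    n = len(adj)
    hub, auth = [1.0] * n, [1.0] * n
    for _ in range(iterations):
        auth = [sum(hub[i] for i in range(n) if adj[i][j]) for j in range(n)]
        total = sum(auth)
        auth = [a / total for a in auth]
        hub = [sum(auth[j] for j in range(n) if adj[i][j]) for i in range(n)]
        total = sum(hub)
        hub = [h / total for h in hub]
    return hub, auth

# Hypothetical graph: P1 -> P2, P1 -> P3, P2 -> P3, P3 -> P1
adj = [[0, 1, 1],
       [0, 0, 1],
       [1, 0, 0]]
hub, auth = hits(adj, iterations=2)
print([round(h, 4) for h in hub])
print([round(a, 4) for a in auth])
```

On this graph P3, with two in-links, emerges as the strongest authority, and P1, which points at both strong authorities, as the strongest hub.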
(d) Explain why dense retrieval approaches cannot easily handle long documents, and hence passages are preferred. Discuss two methods for aggregating passage scores, and the intuitions these approaches encode. How would they handle a long document where the relevant content for the query is spread throughout the document? [7]
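Two commonly discussed aggregation strategies are taking the maximum passage score (MaxP) and averaging (or summing) passage scores. A minimal sketch of the contrast, with hypothetical passage scores:

```python
def max_passage(passage_scores):
    """MaxP: document score = best single passage. Encodes the intuition
    that one highly relevant passage makes the document relevant."""
    return max(passage_scores)

def mean_passage(passage_scores):
    """Mean aggregation: rewards documents where relevance signal appears
    across many passages rather than in one spot."""
    return sum(passage_scores) / len(passage_scores)

concentrated = [0.1, 0.9, 0.1, 0.1]   # relevance concentrated in one passage
spread = [0.5, 0.5, 0.5, 0.5]         # relevance spread through the document

print(max_passage(concentrated), round(mean_passage(concentrated), 2))
print(max_passage(spread), round(mean_passage(spread), 2))
```

MaxP prefers the concentrated document (0.9 vs 0.5), while the mean prefers the spread one (0.5 vs 0.3): for a long document whose relevant content is distributed throughout, mean or sum aggregation accumulates the dispersed evidence that MaxP ignores.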