代做COMPSCI 753 Algorithms for Massive Data (Exam) 2021调试数据库编程
- 首页 >> WebCOMPSCI 753 (11/ 11/2021 17:00) Algorithms for Massive Data (Exam)
1 Locality-Sensitive Hashing
Given three documents S1, S2 , S3 and a customized query document q:
S1 = {3, 4, 5}, S2 = {0, 1, 2}, S3 = {0, 1, 3}, q = {2, 3, 4, h(y)};
h(y) = y mod 6.
where y is the last digit of your Student ID. For instance, suppose my Stu- dentID=xxxxx7, my query document would be q= {2, 3, 4, 1}.
1.1 Computing MinHash Signatures
1. Generate the bit-vector representation for {S1, S2 , S3 , q} in feature space {0, 1, 2, 3, 4, 5}. [1 mark]
2. Generate the MinHash matrix for {S1, S2 , S3 , q} using the following four MinHash functions. [2 marks]
h1 (x) = x mod 6
h2 (x) = (x + 1) mod 6
h3 (x) = (x + 3) mod 6
h4 (x) = (x + 5) mod 6
3. Consider the query q and estimate the signature-based Jaccard similari- ties: J(q, S1 ), J(q, S2 ), and J(q, S3 ). [1 mark]
1.2 Tuning Parameters for rNNS
In our lecture, we have learnt to formulate the collision probability (i.e., S- curve) given the number of bands b and the number of rows per band r as follows: Pr(s) = 1 - (1 - sr )b.
Consider three sets of parameters (r=2,b=10), (r=6,b=30), (r=10,b=50). The collision probabilities for similarity s in range of [0,1] for each (r ,b) are pro- vided accordingly as follows:
1. Which settings give at most 5% of false negatives for any 70%-similar pairs? Briefly explain the reason. [1 mark]
2. Which settings give at most 15% of false positives for any 30%-similar pairs? Briefly explain the reason. [1 mark]
1.3 c-Approximate Randomized rNNS [3 marks]
We have learnt that a family of functions H is called (d1 , d2 , p1 , p2 )-sensitive with collision probability p1 > p2 and c > 1 if the following conditions hold for any uniformly chosen h ∈ H and 8 x, y ∈ U :
– If d(x, y) ≤ r , Pr[h(x) = h(y)] ≥ p1 for similar points, and
– If d(x, y) ≥ cr , Pr[h(x) = h(y)] ≤ p2 for dissimilar points.
Consider a family transformation from (d1 , d2 , p1 , p2 )-sensitive to (d1 , d2 , 1 — (1 — p1(k))L , 1 — (1 — p2(k))L )-sensitive,where k and L refer to the number of hash functions and the number of hash tables, respectively. Briefly describe steps to achieve such transformation. What is the expected impact on probability bounds after the transformation?
2 Data Stream Algorithms
2.1 Misra-Gries Algorithm [1 mark]
Given the data stream below, perform. the Misra-Gries algorithm with k = 3 counters and present the summary, including the elements and its counter values, when the execution of the algorithm is finished.
S = {4, 36, 14, 36, 57, 36, 22, 57, 5, 57}
2.2 CountMin Sketch Algorithm [4 marks]
Given the following three hash functions, perform the CountMin Sketch al- gorithm on the same data stream S in Section 2.1 and present the (i) hash table, (ii) counter matrix, and (iii) estimated frequency of each element in a stream after processing all elements.
h1 (x) = x mod 3
h2 (x) = (3x + 1) mod 3
h3 (x) = (5x + 2) mod 3
2.3 Count Sketch Algorithm [4 marks]
Consider the same data stream S and hash functions in Section 2.2. Given the sign hash functions below, perform the Count Sketch algorithm and present the (i) hash table, (ii) counter matrix, and (iii) estimated frequency of each element in a stream after processing all elements.
s1 (x) = ((2x + 1) mod 3) mod 2
s2 (x) = ((3x + 2) mod 3) mod 2
s3 (x) = ((5x + 2) mod 3) mod 2
3 Algorithms for Graphs
3.1 Biased PageRank
Given a directed graph:
1. We have learnt the matrix formulation of PageRank r = M · r. Convert the above graph to a column-stochastic adjacency matrix. [1 mark]
2. Compute the PageRank of the graph above. [2 marks]
3. Let β = 0.8, calculate the biased PageRank with the teleport set S = {[the last digit of your student ID] mod 4}. [3 marks]
Note: The rank of each node should round to 3 decimal places.
3.2 Community Detection
1. Use two or three sentences to explain the diferences between modularity maximization and the Girvan-Newman algorithm for community detec- tion. [2 marks]
2. In the modularity maximization algorithm, which pair of nodes in the graph below should be merged in the first step to maximize the gain of modularity? Explain your answer. If there are multiple pairs of nodes with the same modularity gain, report all of them to get full mark. [4 marks]
3.3 In uence Maximization
1. Compute the influence spread of the seed set S = {1} using the Indepen- dent Cascade (IC) model on the following graph. [2 marks]
2. In the lecture, we have learnt a greedy algorithm to find a seed set for maximizing the influence based on the IC model. Starting with empty seed set, which node will be the first to be added to the seed set in the greedy algorithm? [2 marks]
4 Recommender Systems
4.1 Collaborative filtering
Given the following user-item interaction matrix:
1. Apply the basic user-based collaborative filtering (without considering bias) with cosine similarity. Give the top-1 recommended item to user u2 . [3 marks]
2. In the lecture, we have discussed how to model the rating bias including (i) the bias over all transactions; (ii) the bias of a user; and (iii) the bias of an item in collaborative filtering. What is the rating of user u1 to item p2 if all biases are considered? [3 marks]
Note: The predicted ratings should round to one decimal place.
4.2 Evaluation of recommender system
A recommender system generates a ranked list of items for a specific user u as (p3 ; p10 ; p5 ; p7 ; p1 ; p9 ; p2 ; p4 ; p6 ; p8 ). The ranked list contains all items that haven’t been purchased by the user in the training data. We find that the user only buys items p10 and p1 in the test data.
1. Compute the AUC for user u. [1 mark]
2. If top-3 items are returned to the user, what are the values of Precision@3 and Recall@3? [2 marks]
Note: The AUC, Precision@3 and Recall@3 should round to 2 decimal places.
4.3 Application of Recommendation Algorithms
A start-up company plans to build a system for recommending training courses to users. The company has 100,000 users and 500 courses. Each user in the database has a complete profile with age, gender, educational back- ground, and working experience. Each course has a short description of its content. Ninety percent of the users have taken one course, and the remaining users have taken more than two courses.
The data scientists in the company are considering three recommendation al- gorithms that we have learnt in the lectures: (a) user-based collaborative fil- tering; (b) item-based collaborative filtering; and (c) content-based approach.
1. Which of (a) and (b) is more appropriate for the above system? Explain the reason. [2 marks]
2. If you were one of the data scientists, which of the above methods will you use for recommending new courses to users? Explain the reason. [2 marks]
3. The company wants a single model to support (i) recommending existing courses to an arbitrary group of users; (ii) recommending existing courses to new users; and (iii) recommending new courses to existing users. Which of the above methods can be used? If none of them apply, give a solution based on the methods we learnt in the lectures. [3 marks]