Intro to Data Sci & Systems M

COMPSCI5089

Friday 20 December 2024

1. This question is concerned with the Linear Algebra part of the course.

Note: When answering this question, you are recommended to use either NumPy pseudo-code or LaTeX syntax (at your preference) for typing mathematical answers into Moodle. Incorrect syntax will not be penalised as long as it is clear and unambiguous. For example, the identity matrix could be written as: [[1,0],[0,1]] or

| 1 0 |

| 0 1 |

and the matrix inverse as A^-1 or inv(A).
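As an illustration of the NumPy-style notation mentioned in the note, the short sketch below builds the 2 × 2 identity matrix and inverts a matrix. The matrix `A` is a hypothetical example chosen only because it is invertible; it does not come from the question.

```python
import numpy as np

# Identity matrix, equivalent to writing [[1,0],[0,1]].
I = np.eye(2)

# Hypothetical invertible matrix, used only to illustrate the notation.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# Matrix inverse, equivalent to writing A^-1 or inv(A).
A_inv = np.linalg.inv(A)

# Multiplying a matrix by its inverse recovers the identity (up to rounding).
product = A @ A_inv
```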

Consider that you are working for a shipment company and studying the movements of parcels between 5 sites: A, B, C, D and E. The transitions between those sites every day are expressed in the following graph:

(a) (i) What is the adjacency matrix for this graph? Provide the corresponding matrix (Note: ensure that the edge weights are correctly encoded). [3]

(ii) Assume that at t = 0 you have the following distribution: A = 100, B = 10, C = 20, D = 0, E = 0. What would be the distribution at t = 1? [2]

(iii) How would you calculate the package distribution two days ago (x_{t=-2})? Detail the approach you would use, but you do not have to calculate the actual values. [2]

(iv) How would you transform this adjacency matrix to make the graph undirected (i.e., ensure that paths between any two nodes go both ways)? [2]

(b) What is a steady state of A? Explain two ways to calculate the steady state for this process. [3]
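Two common ways of finding a steady state are an eigendecomposition (taking the eigenvector whose eigenvalue is 1) and power iteration (applying the transition matrix repeatedly until the distribution stops changing). The sketch below illustrates both on a hypothetical 3-site transition matrix `P`, not the graph from this question. It assumes the convention x_{t+1} = P x_t with columns summing to 1; if the course uses the row convention, transpose accordingly.

```python
import numpy as np

# Hypothetical 3-site transition matrix (NOT the graph from the question).
# Assumed convention: P[i, j] is the fraction of parcels moving from site j
# to site i each day, so every column sums to 1 and x_{t+1} = P @ x_t.
P = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.6, 0.1],
              [0.2, 0.2, 0.6]])

# Way 1: eigendecomposition. The steady state is the eigenvector
# associated with the eigenvalue 1, rescaled so its entries sum to 1.
eigvals, eigvecs = np.linalg.eig(P)
k = np.argmin(np.abs(eigvals - 1.0))
steady_eig = np.real(eigvecs[:, k])
steady_eig = steady_eig / steady_eig.sum()

# Way 2: power iteration. Apply P repeatedly to any starting distribution.
x = np.array([1.0, 0.0, 0.0])
for _ in range(200):
    x = P @ x
steady_power = x
```

Both results should agree, and applying `P` once more to either of them should leave it unchanged.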

(c) Consider the 2 × 3 matrix A with the following SVD decomposition A = UΣV^T

where

(i) What are the singular values of A? [3]

(ii) How can you calculate the pseudo-inverse of A, A^+, from this decomposition? Explain all steps.

(Hint: We have (AB)^+ = B^+A^+ and if A is invertible, then A^+ = A^-1.) [5]
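A minimal sketch of the usual construction A^+ = V Σ^+ U^T follows, where Σ^+ is obtained by transposing Σ and replacing each non-zero singular value with its reciprocal. The 2 × 3 matrix below is a hypothetical example, not the matrix from the question, and the tolerance `tol` is an illustrative choice.

```python
import numpy as np

# Hypothetical 2 x 3 matrix (NOT the matrix from the question).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0]])

# Full SVD: U is 2x2, s holds the singular values, Vt is V^T (3x3).
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Sigma is 2x3, so Sigma^+ is 3x2 with 1/sigma_i on its diagonal
# for every non-zero singular value (zeros stay zero).
tol = 1e-12  # illustrative threshold for "non-zero"
Sigma_pinv = np.zeros((A.shape[1], A.shape[0]))
for i, sv in enumerate(s):
    if sv > tol:
        Sigma_pinv[i, i] = 1.0 / sv

# U and V are orthogonal, so their pseudo-inverses are their transposes.
A_pinv = Vt.T @ Sigma_pinv @ U.T
```

The result can be checked against `np.linalg.pinv(A)`, which performs an SVD-based computation of the same quantity.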

2. This question is concerned with the optimisation part of the course.

Hint: For the following questions, you can use ((1 0) (0 1))^-1 to represent a matrix inverse, to simplify your typing.

You are given the following linear least squares optimisation problem:

Minimise the cost function:

f(x) = ∥Ax − b∥^2

where:

• x ∈ R^2 is the vector of unknowns,

• A ∈ R^{2×2} is a matrix of known values,

• b ∈ R^2 is a vector of known values.

Given:

(a) Solve the least squares problem using the normal equations method to find the optimal solution x∗.

Hint: The normal equations are derived from the gradient of the least squares function, set to zero: A⊤Ax = A⊤b. And [5]
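The normal equations method can be sketched in NumPy as below. The values of `A` and `b` are hypothetical placeholders, not the ones given in the question. The sketch solves the linear system A⊤Ax = A⊤b directly rather than forming an explicit inverse, which is a common numerical preference and not something the question requires.

```python
import numpy as np

# Hypothetical values (NOT the A and b given in the question).
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
b = np.array([3.0, 1.0])

# Normal equations: (A^T A) x = A^T b.
AtA = A.T @ A
Atb = A.T @ b
x_star = np.linalg.solve(AtA, Atb)

# The gradient 2 A^T (A x - b) should vanish at the optimum.
grad_at_opt = 2.0 * A.T @ (A @ x_star - b)
```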

(b) Solve the least squares problem using gradient descent, starting from an initial guess x0 = and using a step size α = 0.5. Perform two iterations.

Hint: The gradient of the least squares cost function is given by ∇f(x) = 2A⊤(Ax − b). You can use \delta to represent ∇. [5]
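The update rule x ← x − α∇f(x) with the gradient from the hint can be sketched as follows. The step size α = 0.5 and the two iterations match the question; `A`, `b` and `x0` are hypothetical placeholders, and the helper name `gradient_descent` is introduced here for illustration only.

```python
import numpy as np

def gradient_descent(A, b, x0, alpha, n_iters):
    """Run n_iters steps of x <- x - alpha * 2 A^T (A x - b)."""
    x = np.asarray(x0, dtype=float)
    history = [x.copy()]
    for _ in range(n_iters):
        grad = 2.0 * A.T @ (A @ x - b)  # gradient of ||Ax - b||^2
        x = x - alpha * grad
        history.append(x.copy())
    return x, history

# Hypothetical values (NOT the ones given in the question).
A = np.array([[1.0, 0.0],
              [0.0, 0.5]])
b = np.array([1.0, 1.0])
x0 = np.array([0.0, 0.0])

x_final, history = gradient_descent(A, b, x0, alpha=0.5, n_iters=2)
# history[1] is the iterate after one step, history[2] after two steps.
```

For these placeholder values the iterates are (1, 0.5) after the first step and (1, 0.875) after the second, moving towards the least squares solution (1, 2).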

(c) Discuss the merits of stochastic gradient descent (SGD) for solving least squares problems, especially in the context of large datasets. [4]

(d) Now, consider the same least squares problem, but with the additional constraint that x1 + x2 = 1. Solve this constrained optimisation problem using the Lagrange multiplier method.

Hint: You don’t need to substitute the values of x, but describe overall the steps with formulas. [6]
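One way to organise the Lagrange multiplier steps is to write the Lagrangian L(x, λ) = ∥Ax − b∥^2 + λ(c⊤x − 1) with c = (1, 1), set its gradient with respect to x to zero (2A⊤Ax − 2A⊤b + λc = 0), and combine this with the constraint c⊤x = 1 into a single linear system in (x, λ). The sketch below solves that system for hypothetical `A` and `b`, not the values from the question; the sign convention for λ is an assumption.

```python
import numpy as np

# Hypothetical values (NOT the A and b given in the question).
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
b = np.array([3.0, 1.0])
c = np.array([1.0, 1.0])  # encodes the constraint x1 + x2 = 1

# Stationarity: 2 A^T A x + lambda * c = 2 A^T b
# Feasibility:  c^T x                  = 1
KKT = np.block([[2.0 * A.T @ A, c[:, None]],
                [c[None, :],    np.zeros((1, 1))]])
rhs = np.concatenate([2.0 * A.T @ b, [1.0]])

solution = np.linalg.solve(KKT, rhs)
x_constrained = solution[:2]
lagrange_multiplier = solution[2]
```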

3. This question is concerned with the probabilities part of the course.

Let us assume that for a user study you have recorded the gaze of users when navigating a webpage (i.e., you have recorded what part of the webpage they were looking at). As a result, you have obtained a database D of 100 gaze locations for 100 users. Each record provides you with the x and y coordinates of the user’s gaze location in the page. We will assume that the coordinates are normalized between 0 and 1, such that (0, 0) indicates the top left corner of the page and (1, 1) the bottom right corner.

(a) As a first attempt at the problem, you decide to assume that the users’ gaze locations are normally distributed. Explain:

(i) how this distribution would be parametrised, stating the dimensionality of each parameter; [2]

(ii) and how you would estimate those parameters from D? [3]
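For a two-dimensional normal distribution, the parameters are a mean vector of shape (2,) and a covariance matrix of shape (2, 2), and the maximum-likelihood estimates are the sample mean and the sample covariance (dividing by N). The sketch below assumes D is available as an N × 2 array; the data here are synthetic stand-ins generated for illustration, not real gaze recordings.

```python
import numpy as np

# Synthetic stand-in for the database D: N gaze points in [0, 1] x [0, 1].
rng = np.random.default_rng(0)
D = rng.uniform(0.0, 1.0, size=(100, 2))

# Maximum-likelihood estimate of the mean: a vector of shape (2,).
mu_hat = D.mean(axis=0)

# Maximum-likelihood estimate of the covariance: a (2, 2) matrix.
# The MLE divides by N; the unbiased estimator would divide by N - 1.
centered = D - mu_hat
Sigma_hat = centered.T @ centered / D.shape[0]
```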

(b) After initial experiments, your model appears to perform very poorly:

(i) In what case would this assumption of a normal distribution be obviously wrong? How would you identify that from the data in D? [2]

(ii) Propose an alternative model you could use and explain how it would be parametrised (stating all dimensions). [3]

(iii) How would you estimate those parameters? [3]

(c) Let us assume that out of the 100 users you recorded, 25 did actually buy something on the site, and that in addition to the users’ gaze, you have also recorded which users decided to buy something and which did not.

Using this data, how would you estimate how likely a user is to buy something given that they have gazed at a location g0? Explain all steps of your approach in detail. [7]
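One possible approach is Bayes' rule: P(buy | g0) = p(g0 | buy) P(buy) / [p(g0 | buy) P(buy) + p(g0 | no buy) P(no buy)], with the prior P(buy) estimated from the proportion of buyers (25 out of 100 here). The sketch below assumes a single Gaussian per class for the likelihoods, which is only one modelling choice; a mixture model or kernel density estimate could be substituted. The gaze data are synthetic, and the function names are introduced for illustration.

```python
import numpy as np

def fit_gaussian(points):
    """Maximum-likelihood mean (2,) and covariance (2, 2) of an (N, 2) array."""
    mu = points.mean(axis=0)
    Sigma = np.cov(points, rowvar=False, bias=True)
    return mu, Sigma

def gaussian_pdf(x, mu, Sigma):
    """Density of the multivariate normal N(mu, Sigma) evaluated at x."""
    d = mu.shape[0]
    diff = x - mu
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def posterior_buy(g0, gaze_buy, gaze_no):
    """P(buy | g0) via Bayes' rule with one Gaussian likelihood per class."""
    prior_buy = len(gaze_buy) / (len(gaze_buy) + len(gaze_no))
    lik_buy = gaussian_pdf(g0, *fit_gaussian(gaze_buy))
    lik_no = gaussian_pdf(g0, *fit_gaussian(gaze_no))
    evidence = lik_buy * prior_buy + lik_no * (1.0 - prior_buy)
    return lik_buy * prior_buy / evidence

# Synthetic data: 25 buyers and 75 non-buyers gazing at different regions.
rng = np.random.default_rng(0)
gaze_buy = rng.normal([0.7, 0.3], 0.1, size=(25, 2))
gaze_no = rng.normal([0.3, 0.6], 0.1, size=(75, 2))

p_near_buyers = posterior_buy(np.array([0.7, 0.3]), gaze_buy, gaze_no)
p_near_others = posterior_buy(np.array([0.3, 0.6]), gaze_buy, gaze_no)
```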

4. This question is concerned with the databases part of the course.

(a) You are given the following two relations:

• Course (C):

– Schema: Course(Id, Description, Credits)

– Attributes:

* Id: A 4-byte integer (primary key)

* Description: A 256-byte string

* Credits: A 1-byte integer

– Total Records (rC): 32

• Transcript (T):

– Schema: Transcript(StudentId, CourseId, Mark)

– Attributes:

* StudentId: A 4-byte integer (foreign key)

* CourseId: A 4-byte integer (foreign key to Course(Id))

* Mark: An 8-byte double-precision floating-point number

* Primary Key: Combination of StudentId and CourseId

– Total Records (rT ): 51,200

Assume: The size of a disk block is 4096 bytes. CourseId in T references Id in C.

(i) Calculate the number of blocks for storing each relation (C and T).

Hint: To simplify your typing, you can use ceil(x) and floor(x) to represent the ceiling and floor functions used to round numbers up or down to an integer. [6]
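The block calculation can be sketched as below, using the attribute sizes and record counts stated in the question. It assumes fixed-length, unspanned records (a record never crosses a block boundary) and no per-block header overhead; the helper name `blocks_needed` is introduced for illustration.

```python
import math

BLOCK_SIZE = 4096  # bytes, as stated in the question

def blocks_needed(record_size, n_records, block_size=BLOCK_SIZE):
    """Blocks required under an unspanned, fixed-length record layout."""
    # Blocking factor: whole records that fit in one block, floor(B / R).
    bfr = block_size // record_size
    # Number of blocks: ceil(r / bfr).
    return math.ceil(n_records / bfr)

# Course: Id (4) + Description (256) + Credits (1) bytes per record.
course_record_size = 4 + 256 + 1
blocks_course = blocks_needed(course_record_size, 32)

# Transcript: StudentId (4) + CourseId (4) + Mark (8) bytes per record.
transcript_record_size = 4 + 4 + 8
blocks_transcript = blocks_needed(transcript_record_size, 51_200)
```

Under these assumptions a Course record takes 261 bytes, so 15 records fit per block and the 32 records need 3 blocks; a Transcript record takes 16 bytes, so 256 records fit per block and the 51,200 records need 200 blocks.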

(ii) Estimate the selection cardinality of joining the relations C and T on the attribute CourseId. Assume that the courses in C are uniformly enrolled in by all the students and appear in the Transcript (T) records. [2]

(iii) Explain how the selectivity helps in the query process. [2]

(b) You are a data engineer at a company that develops personalised music streaming services. The platform needs to recommend songs to users based on their listening history and preferences. Each song is represented by a high-dimensional feature vector that includes acoustic attributes, artist information, and embedded representations from machine learning models.

(i) Which types of database would you choose to store and query the song data? Justify your choice. [6]

(ii) Describe how you would store and index the song data to allow efficient retrieval and recommendation. [4]