CSE 150A. Assignment 3 Summer 2020, Session 1
Out: Thu Jul 16
Due: Thu Jul 23
3.1 Maximum likelihood estimation
(a) Complete data
Consider a complete data set of i.i.d. examples {(a_t, b_t, c_t, d_t)}_{t=1}^T drawn from the joint distribution
of the above belief network. Compute the maximum likelihood estimates of the conditional probability
tables (CPTs) shown below for this data set. Express your answers in terms of equality-testing
functions, such as:
I(a, a_t) = 1 if a = a_t, and 0 if a ≠ a_t.
For example, in terms of this function, the maximum likelihood estimate for the CPT at node A is
given by P(A=a) = (1/T) Σ_{t=1}^T I(a, a_t). Complete the numerators and denominators in the
expressions below.
P(B=b | A=a) =
P(C=c | A=a, B=b) =
P(D=d | A=a, C=c) =
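(For intuition only, here is a minimal Python sketch of the indicator-counting estimate given above for node A. The array name a_data and the integer coding of A's values are assumptions, not part of the assignment.)

```python
import numpy as np

def ml_estimate_A(a_data, num_values):
    """P(A=a) = (1/T) * sum_t I(a, a_t), assuming a_data holds integer codes 0..num_values-1."""
    a_data = np.asarray(a_data)
    T = len(a_data)
    counts = np.array([(a_data == a).sum() for a in range(num_values)])  # sum_t I(a, a_t)
    return counts / T
```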
[Figure: belief network with nodes A, B, C, D and edges A→B, A→C, B→C, A→D, C→D.]
(b) Posterior probability
Consider the belief network shown above, with observed nodes B and D and hidden nodes A and C.
Compute the posterior probability P(a, c|b, d) in terms of the CPTs of the belief network—that is, in
terms of P(a), P(b|a), P(c|a, b) and P(d|a, c).
(c) Posterior probabilities
Compute the posterior probabilities P(a|b, d) and P(c|b, d) in terms of your answer from part (b).
In other words, in this problem, you may assume that P(a, c|b, d) is given, since you showed how to
compute that in part (b).
(d) Log-likelihood
Consider a partially complete data set of i.i.d. examples {(b_t, d_t)}_{t=1}^T drawn from the joint distribution
of the above belief network. The log-likelihood of the data set is given by:

L = Σ_t log P(B=b_t, D=d_t).
Compute this log-likelihood in terms of the CPTs of the belief network. You may re-use work from
earlier parts of the problem.
(e) EM algorithm
The posterior probabilities from parts (b) and (c) can be used by an EM algorithm to estimate CPTs
that maximize the log-likelihood from part (d). Complete the numerator and denominator in the below
expressions for the EM update rules. Simplify your answers as much as possible, expressing them in
terms of the posterior probabilities P(a, c|b_t, d_t), P(a|b_t, d_t), and P(c|b_t, d_t), as well as the functions
I(b, b_t) and I(d, d_t).
P(A=a) ←
P(B=b | A=a) ←
P(C=c | A=a, B=b) ←
P(D=d | A=a, C=c) ←
3.2 EM algorithm for noisy-OR
In this problem, we will see how the EM algorithm can be used to learn the parameters of a noisy-OR model.
Part (a) is related to the derivation of the EM update rules for the noisy-OR model. Parts (b) through (d) will
ask you to write code to learn the parameters of a noisy-OR model based on a particular data set.
(a) Equivalence of models
The following diagrams depict two variations of the noisy-OR model:
[Figure: Model A: X1, X2, ..., Xn are the parents of Y. Model B: each Xi is the parent of a hidden node Zi, and Z1, Z2, ..., Zn are the parents of Y.]
For this part of the problem, we will use the notation PA(· · ·) and PB(· · ·) to distinguish between
probabilities computed using Model A and Model B, respectively.
Model A is a standard noisy-OR model in which the CPT for Y is

PA(Y = 1 | X = x) = 1 − ∏_{i=1}^{n} (1 − p_i)^{x_i}.
Model B is a variation of the noisy-OR model in which we’ve inserted a layer of hidden variables
Z1, . . . , Zn, with the following CPTs:
PB(Z_i = 1 | X_i = x_i) = p_i if x_i = 1, and 0 if x_i = 0.

PB(Y = 1 | Z = z) = 1 if Z_i = 1 for any i, and 0 if Z_i = 0 for all i.
In contrast to Model A, the Y node in Model B is a deterministic OR, since it will be 1 if and only if
any of the Zi’s are 1. We can view the Zi nodes as being a sort of “noisy copy” of the corresponding
Xi nodes: if Xi = 0, then Zi
is guaranteed to be 0 as well, but if Xi = 1, then Zi has probability pi
of being 1.
Note also that both models are defined in terms of parameters pi for i ∈ {1, . . . , n}.
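(As a quick illustration, here is a short Python sketch of Model A's CPT; the function name and argument conventions are assumptions, not part of the assignment.)

```python
import numpy as np

def noisy_or_prob(p, x):
    """Model A: PA(Y=1 | X=x) = 1 - prod_i (1 - p_i)^{x_i}, for binary inputs x."""
    p = np.asarray(p, dtype=float)
    x = np.asarray(x, dtype=float)
    return 1.0 - np.prod((1.0 - p) ** x)
```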
To prove that the Y nodes in both models are equivalent, it is enough to show that
PA(Y = 0 | X = x) = PB(Y = 0 | X = x).
The following equations are an incomplete proof of this fact, which differs slightly from the one seen
in lecture. To complete this proof, provide a brief justification for each of the steps labeled (i) through
(Because of this result, we can simply use the notation P(· · ·) for probabilities in the remaining parts
of this problem, since we have shown that the two models PA(· · ·) and PB(· · ·) agree.)
(b) EM Implementation: Per-iteration statistics
Suppose we are given a data set of T samples {(x_t, y_t)}_{t=1}^T from Model A.
If we try to apply standard maximum-likelihood techniques to estimate the values of the parameters
pi based on Model A, we will find that there is no closed-form solution.
However, since part (a) showed that Model A and Model B are equivalent, we can instead view the
data as an incomplete data set corresponding to Model B. This allows us to estimate the values of pi
using the EM algorithm.
Noting that pi corresponds to P(Zi = 1|Xi = 1), which is one of the CPT entries for Model B, we
can write the EM update rule as:
p_i ← ĉount(Z_i = 1, X_i = 1) / ĉount(X_i = 1)
    = [ Σ_t P(Z_i = 1, X_i = 1 | X = x_t, Y = y_t) ] / [ Σ_t P(X_i = 1 | X = x_t, Y = y_t) ]

where ĉount(·) denotes an expected count over the data set.
Then, as shown in class, the EM update rule for noisy-OR can be simplified to the following, where
T_i = Σ_t x_{it} is the number of observations where X_i = 1:

p_i ← (1/T_i) Σ_t [ y_t x_{it} p_i / (1 − ∏_{j=1}^{n} (1 − p_j)^{x_{jt}}) ]
Next, you will use the EM algorithm for estimating the noisy-OR parameters pi. Download the
data files hw3 noisyOr x.txt and hw3 noisyOr y.txt for this homework assignment. The data set has
T = 267 examples over n = 23 inputs. [1]
For a data set {(x_t, y_t)}_{t=1}^T, the normalized log (conditional) likelihood is given by:

L = (1/T) Σ_{t=1}^T log P(Y = y_t | X = x_t).
In your code, you should initialize all pi = 0.05 and perform 512 iterations of the EM algorithm using
the update rule shown above. An easy-to-miss reason for getting different results on this assignment is
failing to use the initialization specified here! At each iteration, compute the log conditional likelihood
shown above. (If you have implemented the EM algorithm correctly, this log conditional likelihood
will always increase from one iteration to the next.)
Also compute the number of mistakes made by the model at each iteration; a mistake occurs either
• when yt = 0 and P(Y = 1|X~ = ~xt) ≥ 0.5 (indicating a false positive), or
• when yt = 1 and P(Y = 1|X~ = ~xt) < 0.5 (indicating a false negative).
In other words, use the current parameter values of the iteration to try to predict Y given the values
for X, and count how many of those predictions are wrong. The number of mistakes should generally
decrease as the model is trained, though it is not guaranteed to do so at each iteration.
For this part, turn in a completed version of the following table:
iteration    number of mistakes M    log conditional likelihood L
    0                 175                      -0.9581
You should use the already completed entries of this table to check your work. As always you may
program in the language of your choice.
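(For reference, one possible way to organize this loop in Python is sketched below. It is not a reference solution; it assumes the data files are named with underscores (hw3_noisyOr_x.txt, hw3_noisyOr_y.txt), load as a T×n binary matrix and a length-T vector, and that the simplified update rule shown earlier is applied to all p_i simultaneously.)

```python
import numpy as np

X = np.loadtxt("hw3_noisyOr_x.txt")   # assumed shape (T, n), entries in {0, 1}
y = np.loadtxt("hw3_noisyOr_y.txt")   # assumed shape (T,), entries in {0, 1}
T, n = X.shape

p = np.full(n, 0.05)                  # initialization specified above

for iteration in range(513):          # iteration 0 reports the initial parameters
    prob_y1 = 1.0 - np.prod((1.0 - p) ** X, axis=1)       # P(Y=1 | x_t) for every example
    prob_obs = np.where(y == 1, prob_y1, 1.0 - prob_y1)   # P(Y=y_t | x_t)
    L = np.mean(np.log(prob_obs))                          # normalized log conditional likelihood
    mistakes = int(np.sum((prob_y1 >= 0.5) != (y == 1)))   # false positives + false negatives
    print(iteration, mistakes, L)
    if iteration == 512:
        break
    # EM update: each term is x_{it} * y_t * p_i / P(Y=1 | x_t); the denominator is T_i = sum_t x_{it}.
    ratio = np.divide(y, prob_y1, out=np.zeros_like(prob_y1), where=prob_y1 > 0)
    p = (X * ratio[:, None]).sum(axis=0) * p / X.sum(axis=0)
```

Note that the vectorized last line computes every new p_i from the old parameter vector before overwriting, which matches the debugging hint at the end of this problem.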
(c) EM Implementation: Estimated values for pi
Produce a table or bar plot of your final estimated values for pi.
(d) EM Implementation: Source code
Include your source code for this problem as part of your Gradescope submission.
[1] For those interested, more information about this data set is available here:
http://archive.ics.uci.edu/ml/machine-learning-databases/spect/SPECT.names
Debugging hints: If you are having trouble getting your results to match starting at iteration 1, try printing
out the values of p_1 and p_23 after the first few iterations. After iteration 1, you should find that p_1 ≈ 0.11655
and p_23 ≈ 0.15310. If you find that your value of p_1 matches but p_23 does not, this is likely a sign that
you are using the updated values for p_i too early. To solve this, you should save the new values for p_i in
temporary variables until p_1, p_2, . . . , p_23 have all been computed; then, overwrite the old values of p_i all at
once.
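(Restating the hint as code, under the same assumptions as the sketch in part (b): the new parameter vector should be computed entirely from the old one and swapped in all at once.)

```python
import numpy as np

def em_step(p, X, y):
    """One simultaneous EM update: every new p_i is computed from the *old* vector p."""
    prob_y1 = 1.0 - np.prod((1.0 - p) ** X, axis=1)
    ratio = np.divide(y, prob_y1, out=np.zeros_like(prob_y1), where=prob_y1 > 0)
    return (X * ratio[:, None]).sum(axis=0) * p / X.sum(axis=0)

# Usage: p = em_step(p, X, y)   # p is replaced all at once, never element by element
```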
3.3 EM algorithm for binary matrix completion
In this problem you will use the EM algorithm to build a simple movie recommendation system. Download
the files hw3 movieTitles su20.txt and hw3 movieRatings su20.txt. The first file provides the titles of 60
movies. The second file contains a 150×60 matrix of zeros, ones, and missing elements denoted by question
marks. The (i, j)-th element in this matrix contains the i-th user's rating of the j-th movie (corresponding to
the j-th line of the first file), according to the following key:
1 recommended,
0 not recommended,
? not seen.
Note: In past versions of this course, this data came from a survey given to you and your peers. In the interest
of time for the Summer session, I have generated synthetic data to only loosely match the distribution seen
from past students, so as to give them some privacy.
(a) Sanity check
Compute the mean popularity rating of each movie, given by the simple ratio
(number of users who recommended the movie) / (number of users who saw the movie),
and sort the movies by this ratio. Print out the movie titles from most popular (Doctor Strange) to
least popular (The Last Airbender).
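(A possible sketch of this computation in Python; the underscore file names and whitespace-separated file format are assumptions.)

```python
import numpy as np

titles = [line.strip() for line in open("hw3_movieTitles_su20.txt")]
rows = [line.split() for line in open("hw3_movieRatings_su20.txt")]
R = np.array(rows)                                   # assumed shape (150, 60), entries '0', '1', '?'

seen = (R != "?").sum(axis=0)                        # users who saw each movie
recommended = (R == "1").sum(axis=0)                 # users who recommended each movie
popularity = recommended / seen

for j in np.argsort(-popularity):                    # most popular first
    print(f"{popularity[j]:.3f}  {titles[j]}")
```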
(b) Likelihood
Now you will learn a naive Bayes model of these movie ratings, represented by the belief network
shown below, with hidden variable Y ∈ {1, 2, . . . , k} and partially observed binary variables
R1, R2, . . . , R60 (corresponding to movie ratings).
[Figure: naive Bayes belief network with hidden node Y and children R1, R2, R3, ..., R60.]
This model assumes that there are k different types of movie-goers, and that the i-th type of movie-goer
(who represents a fraction P(Y=i) of the overall population) likes the j-th movie with conditional
probability P(R_j = 1 | Y=i).
This model is an interesting extension of the hidden variable models studied in lecture: for this problem,
each user's sampled set of ratings contains its own mix of observed and unobserved variables.
Some users have watched many movies; some users have watched only a few. A hidden variable for
one user might be observed for another user. This is what we mean by "partially observed". Note that
the type of movie-goer (variable Y) is always hidden for all users.
Let Ω_t denote the set of movies seen (and hence rated) by the t-th user, and let r_j^{(t)} denote the t-th
user's rating of the j-th movie. Show that the likelihood of the t-th user's ratings is given by:

P({R_j = r_j^{(t)}}_{j∈Ω_t}) = Σ_{i=1}^{k} P(Y=i) ∏_{j∈Ω_t} P(R_j = r_j^{(t)} | Y=i).
(c) E-step
The E-step of this model is to compute, for each user, the posterior probability that they correspond
to a particular type of movie-goer:

P(Y=i | {R_j = r_j^{(t)}}_{j∈Ω_t}) = P(Y=i) ∏_{j∈Ω_t} P(R_j = r_j^{(t)} | Y=i) / Σ_{i'=1}^{k} P(Y=i') ∏_{j∈Ω_t} P(R_j = r_j^{(t)} | Y=i').
(d) M-step
The M-step of the model is to re-estimate the probabilities P(Y=i) and P(R_j = 1|Y=i) that define
the CPTs of the belief network. As shorthand, let

ρ_{it} = P(Y=i | {R_j = r_j^{(t)}}_{j∈Ω_t})

denote the probabilities computed in the E-step of the algorithm. Also, let T = 150 denote the number
of users. Show that the EM updates are given by

P(Y=i) ← (1/T) Σ_{t=1}^{T} ρ_{it},

P(R_j = 1|Y=i) ← [ Σ_{t: j∈Ω_t} ρ_{it} r_j^{(t)} + Σ_{t: j∉Ω_t} ρ_{it} P(R_j = 1|Y=i) ] / Σ_{t=1}^{T} ρ_{it},

where Σ_{t: j∈Ω_t} denotes a sum over all users t that rated movie j. Similarly, Σ_{t: j∉Ω_t} denotes a sum
over all users t that did not rate movie j.
(e) Implementation
Download the files hw3 probType init su20.txt and hw3 probRatingGivenType init su20.txt, and use
them to initialize the probabilities P(Y = i) and P(Rj = 1|Y = i) for a model with k = 4 types of
movie-goers. Run 128 iterations of the EM algorithm, computing the (normalized) log-likelihood
at each iteration. Your log-likelihood should increase (i.e., become less negative) at each iteration.
Fill in the following table, using the already provided entries to check your work:
iteration log-likelihood L
Note: Choosing k = 4 is somewhat arbitrary, but it is based on assuming that there are a small number
of fundamentally different movie-goer types. Moreover, the initial values for P(Rj = 1|Y =i) in
hw3 probRatingGivenType init su20.txt have been chosen randomly. We provide these to you so that
this HW unfolds in a certain manner and you’ll get the same results we got. If you are interested in
exploring further after completing this assignment, feel free to try out other initializations and other
values of k.
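(One possible shape for this computation, sketched in Python and continuing the file-naming assumptions above. It also assumes the initialization file loads as a k×60 array with one row of P(R_j=1|Y=i) per movie-goer type; transpose it if the file is laid out the other way. This is not the reference implementation.)

```python
import numpy as np

titles = [line.strip() for line in open("hw3_movieTitles_su20.txt")]
rows = [line.split() for line in open("hw3_movieRatings_su20.txt")]
R = np.array(rows)                                    # (T, 60) of '0', '1', '?'
seen = (R != "?")
r = (R == "1").astype(float)
T, M = R.shape

prob_type = np.loadtxt("hw3_probType_init_su20.txt")                  # P(Y=i), length k
prob_rating = np.loadtxt("hw3_probRatingGivenType_init_su20.txt")     # P(R_j=1|Y=i), assumed (k, M)
k = len(prob_type)

for iteration in range(129):
    # E-step: joint[t, i] = P(Y=i) * prod_{j in Omega_t} P(R_j = r_j^{(t)} | Y=i)
    joint = np.empty((T, k))
    for i in range(k):
        per_movie = np.where(r == 1, prob_rating[i], 1.0 - prob_rating[i])
        joint[:, i] = prob_type[i] * np.where(seen, per_movie, 1.0).prod(axis=1)
    L = np.mean(np.log(joint.sum(axis=1)))            # normalized log-likelihood over users
    print(iteration, L)
    if iteration == 128:
        break
    rho = joint / joint.sum(axis=1, keepdims=True)    # posterior over types for each user

    # M-step
    prob_type = rho.mean(axis=0)
    for i in range(k):
        filled = np.where(seen, r, prob_rating[i])    # unseen ratings use the current CPT entry
        prob_rating[i] = (rho[:, i, None] * filled).sum(axis=0) / rho[:, i].sum()
```

Note that each row prob_rating[i] is re-estimated from its own current (old) values for the unseen entries, mirroring the M-step update in part (d).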
(f) User categorization
Using the formula from part (c) along with the parameters you estimated in part (e), evaluate the
probability distribution over movie-goer types (i.e., over Y ) given the following ratings:
"Tron" recommended
"Frozen" recommended
"Star Trek Beyond" NOT recommended
"Jumanji: Welcome to the Jungle" NOT recommended
"It" NOT recommended
"Batman v Superman: Dawn of Justice" NOT recommended
"The Dark Knight" NOT recommended
"The Lord of the Rings: The Fellowship of the Ring" recommended
"Black Panther" recommended
"The Avengers" recommended
"Avengers: Infinity War" recommended
all others not seen
Include either a table or a plot of P(Y =i| those ratings) for i ∈ {1, . . . , k}.
What value of i ∈ {1, . . . , k} maximizes the value of P(Y =i| those ratings)? (In other words, which
of the k types of movie-goers does this user most closely match?)
(g) User movie recommendations
Next, compute this user's expected ratings on the movies they haven't yet seen. You can do this by
applying the following formula for each movie ℓ that they haven't seen:

P(R_ℓ = 1 | their ratings) = Σ_{i=1}^{k} P(Y=i | their ratings) P(R_ℓ = 1 | Y=i)
What are the top ten recommended movies? (Print out the first ten movies in the list of these (unseen)
movies sorted by their expected ratings.) Does this list seem different than the list in part (a)? Hopefully
it does (although our data set is obviously far smaller and more incomplete than the data sets
at actual movie streaming companies, not to mention that this particular data set was randomly
generated).
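(A sketch of parts (f) and (g), reusing the variables prob_type, prob_rating, and titles from the sketch in part (e); it assumes the title strings below exactly match lines of hw3 movieTitles su20.txt.)

```python
# Ratings for the user described in part (f): 1 = recommended, 0 = not recommended.
user = {"Tron": 1, "Frozen": 1, "Star Trek Beyond": 0,
        "Jumanji: Welcome to the Jungle": 0, "It": 0,
        "Batman v Superman: Dawn of Justice": 0, "The Dark Knight": 0,
        "The Lord of the Rings: The Fellowship of the Ring": 1,
        "Black Panther": 1, "The Avengers": 1, "Avengers: Infinity War": 1}

# Part (f): posterior over movie-goer types given only these ratings.
post = prob_type.copy()
for title, rating in user.items():
    j = titles.index(title)
    pj = prob_rating[:, j]                            # P(R_j=1 | Y=i) for each type i
    post *= pj if rating == 1 else (1.0 - pj)
post /= post.sum()
print(post)                                           # P(Y=i | these ratings)

# Part (g): expected ratings for the unseen movies, highest first.
unseen = [j for j, t in enumerate(titles) if t not in user]
expected = {titles[j]: float(post @ prob_rating[:, j]) for j in unseen}
for title in sorted(expected, key=expected.get, reverse=True)[:10]:
    print(f"{expected[title]:.3f}  {title}")
```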
Note: If you are interested in exploring further after completing this assignment, feel free to try out
placing your own ratings in and re-doing parts (f) and (g).
(h) Source code
Include your source code for this problem as part of your Gradescope submission. As usual, you may
program in the language of your choice.
