Assignment 4: Part II – Expectation Maximization and K-Means
Submission deadline:
Final version due 10 April, by 11:59pm (extensions possible as long as draft submitted by then).
You may do this in groups of 2 or 3.
NOTE THE “ALTERNATE” OPTION THAT WAS DESCRIBED IN CLASS TODAY-- I WILL SHORTLY
ADD A BRIEF DESCRIPTION OF THAT!
The goal is to familiarize yourself with, and to help you review EM and K-Means for the exam.
[We are including time estimates associated with each question, as before; note that the total “fast” time
estimate is around 2 hours for a single person working alone quickly, as it involves only a minimal amount
of code; if you are taking more than 3 hours in this case, I would strongly encourage you to be practical
and use the Google Groups discussion as a resource-- everybody will benefit from this! ]
1. [15 marks; ~1.5 hrs ] Expectation Maximization: In this question, you will apply EM to cluster Iris
data (See sklearn.load_iris). [SEE BELOW FOR AN ALTERNATIVE VERSION FOR Q1]. The Iris
dataset is a well-known example, with three types of Irises which can be classified using
4-dimensional measurements (sepal length, sepal width, petal length, petal width). Note that the
sklearn page has links to visualizations of K-Means clustering examples on this dataset, and you
can also see scatterplots of the basic Iris dataset here.
a. [10 marks; ~0.75hr ] 1-dimensional EM.
If you look at the scatterplots linked above, you can see that you can at least partially
cluster the Iris dataset using even just one dimension. You can still use the EM algorithm to
do this, selecting just one of the measurements and using two or more Gaussians to cluster
according to that measurement. (Note that the three classes can potentially be partitioned
into two groups, as is visible in some of the scatterplots).
Implement a 1-D version of the EM algorithm (not using a pre-written scikit-learn EM
function), and use the scatterplots to help you choose reasonable initializations.
Can you choose a dimension that lets you cluster the data into two clear clusters? Which
two flowers get grouped together in that case?
Can you choose a dimension and initial values that let you cluster the data into three
clusters? Which dimension did you use, and which two flowers are still a bit hard to
differentiate this way? Is this consistent with the scatterplots?
Make sure to clearly indicate the dimensions you used and the resulting means and
variances of your clusters after running the EM algorithm.
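For reference only, here is a minimal sketch of what a 1-D, two-component EM implementation might look like; the chosen feature (petal length) and the initial values are illustrative assumptions, and you should pick your own dimension and initializations from the scatterplots:

import numpy as np
from sklearn.datasets import load_iris

x = load_iris().data[:, 2]            # petal length; one possible choice of dimension

mu = np.array([1.5, 5.0])             # initial means (assumed, read off a scatterplot)
var = np.array([0.5, 0.5])            # initial variances (assumed)
weights = np.array([0.5, 0.5])        # initial mixing proportions

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # E-step: responsibilities r[i, k] = P(cluster k | x_i)
    r = weights * normal_pdf(x[:, None], mu, var)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixing weights, means, and variances from the responsibilities
    nk = r.sum(axis=0)
    weights = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print("means:", mu, "variances:", var, "weights:", weights)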
b. [5 marks; ~0.75hr ] Multi-dimensional EM.
Extend your work from the previous part to an EM algorithm that clusters according to the
full 4-dimensional measurements. Let k = 3 and plot the data points in the sepal dimensions
(sepal length vs. sepal width) with a colour coding based on the obtained clusters. More
specifically, you can colour each data point so that its RGB values correspond to the
probability estimates of that point belonging to each class.
Hint: You can use numpy.linalg.pinv to find the inverse of a matrix. Also, numpy.copy can
be used to temporarily save a vector. A rough sketch along these lines follows.
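For reference, the following sketch shows one way to use numpy.linalg.pinv inside a multivariate normal density, and how responsibilities can be used directly as RGB colours; the placeholder resp array is an assumption and should be replaced by the responsibilities produced by your own EM implementation:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

data = load_iris().data                       # shape (150, 4)

def mvn_density(x, mean, cov):
    # N(x | mean, cov), using a pseudo-inverse as suggested in the hint
    d = x - mean
    inv = np.linalg.pinv(cov)
    norm = np.sqrt(((2 * np.pi) ** len(mean)) * np.linalg.det(cov))
    return np.exp(-0.5 * d @ inv @ d) / norm

# Placeholder: replace with the (150, 3) responsibility matrix from your EM
# (rows sum to 1, so they can be passed directly as RGB colours).
resp = np.full((len(data), 3), 1 / 3)
plt.scatter(data[:, 0], data[:, 1], c=resp)
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.title("Clusters coloured by EM responsibilities")
plt.show()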
Optional/Hints: If you wish, you can compare your results with scikit-learn by using the same seed to
initialize the variables. In order to match the results from scikit-learn, the following tips may be useful:
● Begin by setting the random seed using np.random.seed. Example:
np.random.seed(10).
● Begin with the E-step by computing the soft class assignments randomly as:
assignments = np.random.rand(data.shape[0], num_classes)
assignments = assignments / np.sum(assignments, axis=1, keepdims=True)
● Now, perform the M-step and repeat E and M steps alternately till convergence.
● You may use the following code to generate the scikit-learn results:

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(
    n_components=num_classes,
    random_state=10,
    init_params="random",
    covariance_type="diag",
    reg_covar=0,
    verbose=2,
    verbose_interval=5,
)
gmm.fit(data)
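After fitting, scikit-learn's estimates can be read off the fitted object and compared with your own (these are standard GaussianMixture attributes):

print("means:\n", gmm.means_)
print("diagonal covariances:\n", gmm.covariances_)
print("mixing weights:\n", gmm.weights_)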
1. ALTERNATIVE VERSION
As discussed in class, I will only provide a high-level version here, for basic reference.
Instead of implementing EM, you may choose to do the EM algorithm “by hand” for 2 iterations. In
this case, you must demonstrate it for two cases of the same (iris) dataset:
Case 1: 2 clusters, 1 dimensional version (choose a dimension as discussed in class)
Case 2: 3 clusters, 4 dimensional version (i.e. entire dataset), assuming spherical Gaussians
Present your results very clearly.
Again, as discussed in class, “by hand” means that you can write separate bits of code to compute
the means and variances, and the responsibilities, etc.
A total of 12/15 marks (i.e. in the A range) can be obtained by just doing this.
2. [10 marks; ~0.5 hrs] K-Means is derived from EM. However, instead of calculating both means
and variances, it calculates only the means. What is (are) the assumption(s) that K-Means makes?
Using a suitable synthetic dataset, demonstrate how K-Means fails when any one of these
assumptions is violated (a sketch of one such dataset is given after the hint below). Use
appropriate visualizations to support your explanation. (You can use the K-Means function
available in the scikit-learn library.)
Hint: examples of such datasets have been discussed in class on several occasions.
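As one possible starting point (a sketch only; the dataset and parameters are illustrative choices), anisotropically stretched blobs violate the implicit spherical, equal-variance assumption and can lead K-Means to split elongated clusters incorrectly:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate well-separated blobs, then stretch them so the clusters
# are elongated rather than spherical.
X, _ = make_blobs(n_samples=500, centers=3, random_state=10)
X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

labels = KMeans(n_clusters=3, n_init=10, random_state=10).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("K-Means on anisotropic clusters")
plt.show()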