辅导Rmarkdown、讲解R设计、讲解data留学生、R编程语言调试辅导Web开发|讲解数据库SQL

2020.05.08 - 首页 >> Java编程

Homework Set 6
April 28, 2020
NOTICE: The homework is due on May 8 (Friday) 11:59pm. Please provide
the R codes (Rmarkdown is highly recommended) and steps
that you use to get your solutions. You are allowed, and even encouraged,
to discuss the homeworks with your classmates. However, you must
write up the solutions on your own. Plagiarism and other anti-scholarly
behavior will be dealt with severely.
Problem 1. Here we explore the maximal margin classifier on a toy data
set.
(a) We are given n = 7 observations in p = 2 dimensions. For each observation,
there is an associated class label.
(b) Sketch the optimal separating hyperplane, and provide the equation for
this hyperplane.
1
(c) Describe the classification rule for the maximal margin classifier. It
should be something along the lines of “Classify to Red if β0 + β1X1 +
β2X2 > 0, and classify to Blue otherwise.” Provide the values for β0, β1,
and β2.
(d) On your sketch, indicate the margin for the maximal margin hyperplane.
(e) Indicate the support vectors for the maximal margin classifier.
(f) Argue that a slight movement of the seventh observation would not affect
the maximal margin hyperplane.
(g) Sketch a hyperplane that is not the optimal separating hyperplane, and
provide the equation for this hyperplane.
(h) Draw an additional observation on the plot so that the two classes are
no longer separable by a hyperplane.
Problem 2. K-means.
414 10. Unsupervised Learning
(b) Repeat (a), this time using single linkage clustering.
(c) Suppose that we cut the dendogram obtained in (a) such that
two clusters result. Which observations are in each cluster?
(d) Suppose that we cut the dendogram obtained in (b) such that
two clusters result. Which observations are in each cluster?
(e) It is mentioned in the chapter that at each fusion in the dendrogram,
the position of the two clusters being fused can be
swapped without changing the meaning of the dendrogram. Draw
a dendrogram that is equivalent to the dendrogram in (a), for
which two or more of the leaves are repositioned, but for which
the meaning of the dendrogram is the same.
3. In this problem, you will perform K-means clustering manually, with
K = 2, on a small example with n = 6 observations and p = 2
features. The observations are as follows.
Obs. X1 X2
1 1 4
2 1 3
3 0 4
4 5 1
5 6 2
6 4 0
(a) Plot the observations.
(b) Randomly assign a cluster label to each observation. You can
use the sample() command in R to do this. Report the cluster
labels for each observation.
(c) Compute the centroid for each cluster.
(d) Assign each observation to the centroid to which it is closest, in
terms of Euclidean distance. Report the cluster labels for each
observation.
(e) Repeat (c) and (d) until the answers obtained stop changing.
(f) In your plot from (a), color the observations according to the
cluster labels obtained.
4. Suppose that for a particular data set, we perform hierarchical clustering
using single linkage and using complete linkage. We obtain two
dendrograms.
(a) At a certain point on the single linkage dendrogram, the clusters
{1, 2, 3} and {4, 5} fuse. On the complete linkage dendrogram,
the clusters {1, 2, 3} and {4, 5} also fuse at a certain point.
Which fusion will occur higher on the tree, or will they fuse at
the same height, or is there not enough information to tell?
2
414 10. Unsupervised Learning
(b) Repeat (a), this time using single linkage clustering.
(c) Suppose that we cut the dendogram obtained in (a) such that
two clusters result. Which observations are in each cluster?
(d) Suppose that we cut the dendogram obtained in (b) such that
two clusters result. Which observations are in each cluster?
(e) It is mentioned in the chapter that at each fusion in the dendrogram,
the position of the two clusters being fused can be
swapped without changing the meaning of the dendrogram. Draw
a dendrogram that is equivalent to the dendrogram in (a), for
which two or more of the leaves are repositioned, but for which
the meaning of the dendrogram is the same.
3. In this problem, you will perform K-means clustering manually, with
K = 2, on a small example with n = 6 observations and p = 2
features. The observations are as follows.
Obs. X1 X2
1 1 4
2 1 3
3 0 4
4 5 1
5 6 2
6 4 0
(a) Plot the observations.
(b) Randomly assign a cluster label to each observation. You can
use the sample() command in R to do this. Report the cluster
labels for each observation.
(c) Compute the centroid for each cluster.
(d) Assign each observation to the centroid to which it is closest, in
terms of Euclidean distance. Report the cluster labels for each
observation.
(e) Repeat (c) and (d) until the answers obtained stop changing.
(f) In your plot from (a), color the observations according to the
cluster labels obtained.
4. Suppose that for a particular data set, we perform hierarchical clustering
using single linkage and using complete linkage. We obtain two
dendrograms.
(a) At a certain point on the single linkage dendrogram, the clusters
{1, 2, 3} and {4, 5} fuse. On the complete linkage dendrogram,
the clusters {1, 2, 3} and {4, 5} also fuse at a certain point.
Which fusion will occur higher on the tree, or will they fuse at
the same height, or is there not enough information to tell?
Problem 3. Hierarchical clustering.
10.7 Exercises 413
differ: for instance, cluster 4 in K-means clustering contains a portion of
the observations assigned to cluster 1 by hierarchical clustering, as well as
all of the observations assigned to cluster 2 by hierarchical clustering.
Rather than performing hierarchical clustering on the entire data matrix,
we can simply perform hierarchical clustering on the first few principal
component score vectors, as follows:
> hc.out=hclust(dist(pr.out$x [,1:5]) )
> plot(hc.out , labels =nci. labs , main="Hier. Clust. on First
Five Score Vectors ")
> table(cutree (hc.out ,4), nci.labs)
Not surprisingly, these results are different from the ones that we obtained
when we performed hierarchical clustering on the full data set. Sometimes
performing clustering on the first few principal component score vectors
can give better results than performing clustering on the full data. In this
situation, we might view the principal component step as one of denoising
the data. We could also perform K-means clustering on the first few
principal component score vectors rather than the full data set.
10.7 Exercises
Conceptual
1. This problem involves the K-means clustering algorithm.
(a) Prove (10.12).
(b) On the basis of this identity, argue that the K-means clustering
algorithm (Algorithm 10.1) decreases the objective (10.11) at
each iteration.
2. Suppose that we have four observations, for which we compute a
dissimilarity matrix, given by
For instance, the dissimilarity between the first and second observations
is 0.3, and the dissimilarity between the second and fourth
observations is 0.8.
(a) On the basis of this dissimilarity matrix, sketch the dendrogram
that results from hierarchically clustering these four observations
using complete linkage. Be sure to indicate on the plot the
height at which each fusion occurs, as well as the observations
corresponding to each leaf in the dendrogram.
414 10. Unsupervised Learning
(b) Repeat (a), this time using single linkage clustering.
(c) Suppose that we cut the dendogram obtained in (a) such that
two clusters result. Which observations are in each cluster?
(d) Suppose that we cut the dendogram obtained in (b) such that
two clusters result. Which observations are in each cluster?
(e) It is mentioned in the chapter that at each fusion in the dendrogram,
the position of the two clusters being fused can be
swapped without changing the meaning of the dendrogram. Draw
a dendrogram that is equivalent to the dendrogram in (a), for
which two or more of the leaves are repositioned, but for which
the meaning of the dendrogram is the same.
3. In this problem, you will perform K-means clustering manually, with
K = 2, on a small example with n = 6 observations and p = 2
features. The observations are as follows.
Obs. X1 X2
1 1 4
2 1 3
3 0 4
4 5 1
5 6 2
6 4 0
(a) Plot the observations.
(b) Randomly assign a cluster label to each observation. You can
use the sample() command in R to do this. Report the cluster
labels for each observation.
(c) Compute the centroid for each cluster.
(d) Assign each observation to the centroid to which it is closest, in
terms of Euclidean distance. Report the cluster labels for each
observation.
(e) Repeat (c) and (d) until the answers obtained stop changing.
(f) In your plot from (a), color the observations according to the
cluster labels obtained.
4. Suppose that for a particular data set, we perform hierarchical clustering
using single linkage and using complete linkage. We obtain two
dendrograms.
(a) At a certain point on the single linkage dendrogram, the clusters
{1, 2, 3} and {4, 5} fuse. On the complete linkage dendrogram,
the clusters {1, 2, 3} and {4, 5} also fuse at a certain point.
Which fusion will occur higher on the tree, or will they fuse at
the same height, or is there not enough information to tell?
3
414 10. Unsupervised Learning
(b) Repeat (a), this time using single linkage clustering.
(c) Suppose that we cut the dendogram obtained in (a) such that
two clusters result. Which observations are in each cluster?
(d) Suppose that we cut the dendogram obtained in (b) such that
two clusters result. Which observations are in each cluster?
(e) It is mentioned in the chapter that at each fusion in the dendrogram,
the position of the two clusters being fused can be
swapped without changing the meaning of the dendrogram. Draw
a dendrogram that is equivalent to the dendrogram in (a), for
which two or more of the leaves are repositioned, but for which
the meaning of the dendrogram is the same.
3. In this problem, you will perform K-means clustering manually, with
K = 2, on a small example with n = 6 observations and p = 2
features. The observations are as follows.
Obs. X1 X2
1 1 4
2 1 3
3 0 4
4 5 1
5 6 2
6 4 0
(a) Plot the observations.
(b) Randomly assign a cluster label to each observation. You can
use the sample() command in R to do this. Report the cluster
labels for each observation.
(c) Compute the centroid for each cluster.
(d) Assign each observation to the centroid to which it is closest, in
terms of Euclidean distance. Report the cluster labels for each
observation.
(e) Repeat (c) and (d) until the answers obtained stop changing.
(f) In your plot from (a), color the observations according to the
cluster labels obtained.
4. Suppose that for a particular data set, we perform hierarchical clustering
using single linkage and using complete linkage. We obtain two
dendrograms.
(a) At a certain point on the single linkage dendrogram, the clusters
{1, 2, 3} and {4, 5} fuse. On the complete linkage dendrogram,
the clusters {1, 2, 3} and {4, 5} also fuse at a certain point.
Which fusion will occur higher on the tree, or will they fuse at
the same height, or is there not enough information to tell?
Problem 4. Hierarchical clustering.
4