December 4, 2018
0.1 Math 157: Intro to Mathematical Software
0.2 UC San Diego, fall 2018
0.3 Homework 8: Due December 6, 2018
Enter all answers within this notebook. As usual, don’t forget to cite sources and collaborators.
You can use the SageMath kernel or the R kernel for the problem set, or you can switch between
kernels for different problems. Please run all your code before the problem set is collected.
0.3.1 Problem 1: Plotting in R
Grading criteria: correctness and thoroughness of explanations.
In each of the following cells, run the R commands as indicated, then explain in words:
- what is the data being analyzed;
- what the code is doing;
- one conclusion you drew from the data.
An example conclusion (which doesn’t correspond to any of the following datasets) would be
something like "people with blue eyes are more likely to have blond hair than people with brown
eyes".
You might want to consult the documentation for the R datasets package, from which these
examples were taken.
In [0]: %load_ext rpy2.ipython
In [0]: %%R
boxplot(weight ~ feed, data = chickwts, col = "lightgray",
main = "chickwt data",
ylab = "Weight at six weeks (gm)")
In [0]: %%R
pairs(trees, panel = panel.smooth, main = "trees data")
In [0]: %%R
mosaicplot(Titanic, main = "Survival on the Titanic", color = TRUE)
0.3.2 Problem 2: Modeling co2 data
Grading criteria:
In lecture on 11/26, we looked at the R dataset co2. We saw that it could be decomposed into
three pieces: a roughly linear trend, a periodic seasonal trend, and some random-seeming noise.
2a. Fit a linear model to the overall trend in the co2 data.
In [0]:
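To illustrate what fitting a linear trend involves, here is a minimal sketch of an ordinary least-squares line fit in plain Python. The function name `fit_line` and the data are made up for illustration (they are not co2 values), and this is only one possible approach; in R, the built-in `lm` function performs the same kind of fit.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on the line y = 2x + 1.
times = [0.0, 1.0, 2.0, 3.0]
values = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(times, values)
```

On data that lies exactly on a line, the fit recovers that line; on real data, it gives the line minimizing the sum of squared vertical residuals.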
2b. Combine your linear model with the periodic seasonal variation given by the decomposition
to create a time series going from 1959 to 2018 with your predictions for the co2 levels. Plot
this along with the co2 data to see how well your prediction matches the data we were given.
In [0]:
2c. Plot your prediction along with the up-to-date data available at
http://scrippsco2.ucsd.edu/data/atmospheric_co2/primary_mlo_co2_record. How well
does your prediction match the new data? Note that you will need to ignore the first few rows of
the csv file when you read it. (Hint: Both R and pandas have optional arguments in their commands
for reading csv files that will allow you to do this. You can read about this using the ? command
as usual.)
In [0]:
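As a rough illustration of skipping leading rows, the sketch below parses a small made-up CSV string with Python's standard `csv` module, dropping the first lines before the header. The helper name `read_skipping` and the file contents are hypothetical. The same idea is available through the `skiprows` argument of pandas `read_csv` and the `skip` argument of R's `read.csv`.

```python
import csv

# Made-up file contents: two preamble lines, then a header row and data rows.
raw = '''"preamble line one"
"preamble line two"
year,value
2000,1.5
2001,2.5
'''


def read_skipping(text, n_skip):
    """Parse CSV text into a list of dicts, ignoring the first n_skip lines."""
    lines = text.splitlines()[n_skip:]
    return list(csv.DictReader(lines))


rows = read_skipping(raw, 2)
```

Each entry of `rows` is a dictionary keyed by the header names, with values still stored as strings.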
0.3.3 Problem 3: Face completion
Grading criteria: correctness of code and answers
Special note: In this problem, you will be running code in another notebook. However, you
will be only credited for answers entered in this notebook.
Go to the scikit-learn example of face completion; download the script as a Jupyter notebook;
upload it into this folder in your project; and run all cells using the Python 3 (Ubuntu Linux)
kernel. Then answer the following questions.
3a. What fraction of the data was used for testing? How do you know?
3b. Suppose we want to predict the top half of the face from the bottom. Which lines of code
need to be changed, and to what?
In [0]: # List all the lines of code that you are changing here...
In [0]: # ...and their replacements here.
0.3.4 Problem 4: Classifying digits using K-means
Grading criteria: correctness of code and thoroughness of analysis.
4a. Use the K-means classifier demonstrated in lecture on 11/30 to sort the digits dataset used
in lecture on 11/28 into 10 clusters. Note that you should not enter the labels of the digits, just the
8x8 arrays.
In [0]:
4b. Relabel the clusters so that the label of each cluster is the most common digit in that cluster.
In [0]:
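One possible way to relabel clusters is a majority vote within each cluster. The sketch below assumes two parallel lists, one of cluster ids and one of true labels; the function name `majority_label_map` and the data are made up for illustration.

```python
from collections import Counter


def majority_label_map(cluster_ids, true_labels):
    """Map each cluster id to the most common true label among its members."""
    members = {}
    for cid, label in zip(cluster_ids, true_labels):
        members.setdefault(cid, []).append(label)
    return {cid: Counter(labels).most_common(1)[0][0]
            for cid, labels in members.items()}


# Made-up example: cluster 0 holds mostly 7s, cluster 1 holds mostly 3s.
cluster_ids = [0, 0, 0, 1, 1, 1]
true_labels = [7, 7, 1, 3, 3, 8]
mapping = majority_label_map(cluster_ids, true_labels)
relabeled = [mapping[cid] for cid in cluster_ids]
```

Note that nothing prevents two clusters from receiving the same label under this scheme, in which case some digits are never predicted.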
4c. How effective would a model be which takes an array corresponding to a digit, sorts it into
a cluster, and predicts that the digit is the same as the label of that cluster? What percent of the
digits would be correctly labeled? Which digit would be correctly labeled most frequently? Least
frequently?
In [0]:
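A sketch of how overall and per-digit accuracy might be computed, assuming parallel lists of true and predicted labels. The function name `per_digit_accuracy` and the data are made up for illustration.

```python
from collections import defaultdict


def per_digit_accuracy(true_labels, predicted_labels):
    """Return {digit: fraction of that digit's samples predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(true_labels, predicted_labels):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {d: correct[d] / total[d] for d in total}


# Made-up labels for six samples.
true_labels = [7, 7, 1, 3, 3, 8]
predicted = [7, 7, 7, 3, 3, 3]

# Fraction of all samples whose prediction matches the true label.
overall = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)
by_digit = per_digit_accuracy(true_labels, predicted)
```

Taking `max` and `min` over `by_digit` with `key=by_digit.get` would then pick out the digits labeled correctly most and least often.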
4d. How does this model compare to the support vector machine we used in lecture on 11/28?
Why do you think this is?
In [0]:
0.3.5 Problem 5: Visualizing digits with principal component analysis
Grading criteria: correctness of code
5a. Use principal component analysis as demonstrated in lecture on 11/30 to project the points
of the digit dataset into two dimensions.
In [0]:
5b. Make two different plots of this projection, one color-coded using the correct labels of the
dataset and one color-coded using the labels you obtained in problem 4 using K-means classification.
In [0]:
0.3.6 Problem 6: Implementing two-dimensional K-means
Grading criteria: correctness of code
The goal of this problem is to implement K-means clustering for a two-dimensional dataset.
Throughout this problem, we’ll represent a point in the plane as a tuple (x,y) containing two
floats.
6a. Write a Python function that, given two points in the plane, returns the distance between
the two points.
In [0]:
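For reference, the Euclidean distance between (x1, y1) and (x2, y2) is the square root of (x1 - x2)^2 + (y1 - y2)^2. A minimal sketch, with the function name `distance` chosen for illustration:

```python
import math


def distance(p, q):
    """Euclidean distance between two points given as (x, y) tuples."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


# A 3-4-5 right triangle gives a distance of 5.
d = distance((0.0, 0.0), (3.0, 4.0))
```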
6b. Write a Python function that, given a list of K points [p1,...,pk] in the plane and another
point q, returns the point pi in the list that is closest to q.
In [0]:
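One way to sketch the nearest-point search is with `min` and a distance key. The name `closest_point` is hypothetical, and ties are resolved in favor of the earliest point in the list, which is an assumption the problem statement does not specify.

```python
import math


def closest_point(points, q):
    """Return the point in `points` nearest to q (first one wins on ties)."""
    return min(points, key=lambda p: math.hypot(p[0] - q[0], p[1] - q[1]))


nearest = closest_point([(0.0, 0.0), (10.0, 10.0)], (1.0, 2.0))
```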
6c. Write a Python function that takes as input a list of K ’center points’ [p1,...,pk] in the
plane and a list l of points in the plane. It should then return a dictionary with K keys, corresponding
to the points p1,...,pk, such that the value corresponding to the key pi is a list of all
the points in l which are closer to pi than to any of the other center points.
In [0]:
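A possible sketch of the assignment step, assuming the center points are distinct (duplicate centers would collapse into one dictionary key) and that a point equidistant from two centers goes to the one listed first. The name `assign_to_centers` is made up for illustration.

```python
import math


def assign_to_centers(centers, points):
    """Return {center: [points whose nearest center is this one]}."""
    clusters = {c: [] for c in centers}
    for q in points:
        nearest = min(centers,
                      key=lambda c: math.hypot(c[0] - q[0], c[1] - q[1]))
        clusters[nearest].append(q)
    return clusters


centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 1.0), (9.0, 9.0), (0.0, 2.0)]
clusters = assign_to_centers(centers, points)
```

Every center appears as a key even when no point is assigned to it, so the dictionary always has K keys.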
6d. Write a Python function that, given a list l of points in the plane, returns the centroid.
The x-coordinate of the centroid is the average of the x-coordinates of the points in l, and the
y-coordinate of the centroid is the average of the y-coordinates of the points in l.
In [0]:
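The coordinate-wise average described above can be sketched as follows. The name `centroid` is illustrative, and the function assumes a non-empty list (an empty list would raise `ZeroDivisionError`).

```python
def centroid(points):
    """Return the coordinate-wise mean of a non-empty list of (x, y) tuples."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return (mean_x, mean_y)


# The centroid of the corners of a 2-by-2 square is its center.
c = centroid([(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)])
```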
6e. Combining the above functions, write a Python function to implement K-means clustering.
Your function should take as input a list l of points in the plane and a positive integer K. It should
then choose K random points in the plane and sort the points of l into K lists using your function
from 6c. It should use your function from 6d to compute the centroid of each list, and rerun your
function from 6c using these K points. It should repeat this process until the lists stop changing,
then return the K lists.
In [0]:
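The loop described above might be sketched as below. This is one possible reading rather than a definitive implementation, and several choices are assumptions not fixed by the problem statement: the initial centers are sampled from the data itself (so K must not exceed the number of points), an empty cluster keeps its previous center, ties go to the lowest-numbered center, and the `seed` and `max_iter` parameters are added for reproducibility and as a safeguard.

```python
import math
import random


def kmeans(points, k, seed=None, max_iter=100):
    """Cluster (x, y) tuples into k lists by repeated assign-and-recenter."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k data points
    previous = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for q in points:
            i = min(range(k),
                    key=lambda j: math.hypot(centers[j][0] - q[0],
                                             centers[j][1] - q[1]))
            clusters[i].append(q)
        # Stop once the lists no longer change between iterations.
        if clusters == previous:
            break
        previous = clusters
        # Update step: move each center to the centroid of its cluster,
        # keeping the old center if the cluster is empty.
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters


# Made-up data: two well-separated groups of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
result = kmeans(data, 2, seed=0)
```

Because the starting centers are random, the order of the returned lists can vary between runs, and on less clearly separated data the final clusters themselves can depend on the initialization.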
6f. Apply your function to the two-dimensional projection of the digits dataset that you made
in 5a. Make a plot of this projection, colored using the labels generated by your function.
In [0]: