
A12-06-18

December 4, 2018

0.1 Math 157: Intro to Mathematical Software

0.2 UC San Diego, fall 2018

0.3 Homework 8: Due December 6, 2018

Enter all answers within this notebook. As usual, don’t forget to cite sources and collaborators.

You can use the SageMath kernel or the R kernel for the problem set, or you can switch between

kernels for different problems. Please run all your code before the problem set is collected.

0.3.1 Problem 1: Plotting in R

Grading criteria: correctness and thoroughness of explanations.

In each of the following cells, run the R commands as indicated, then explain in words:

- what data is being analyzed;

- what the code is doing;

- one conclusion you drew from the data.

An example conclusion (which doesn’t correspond to any of the following datasets) would be

something like "people with blue eyes are more likely to have blond hair than people with brown

eyes".

You might want to consult the documentation for the R datasets package, from which these

examples were taken.

In [0]: %load_ext rpy2.ipython

In [0]: %%R

boxplot(weight ~ feed, data = chickwts, col = "lightgray",

main = "chickwt data",

ylab = "Weight at six weeks (gm)")

In [0]: %%R

pairs(trees, panel = panel.smooth, main = "trees data")

In [0]: %%R

mosaicplot(Titanic, main = "Survival on the Titanic", color = TRUE)

0.3.2 Problem 2: Modeling co2 data

Grading criteria:

In lecture on 11/26, we looked at the R dataset co2. We saw that it could be decomposed into

three pieces: a roughly linear trend, a periodic seasonal trend, and some random-seeming noise.

2a. Fit a linear model to the overall trend in the co2 data.
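As a rough illustration of what fitting a linear trend means, here is a minimal sketch of ordinary least squares in plain Python. The function name `fit_line` and the sample numbers are made up for illustration and are not the co2 data; in R, the built-in `lm` function performs this kind of fit.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1 (not real co2 measurements).
times = [0.0, 1.0, 2.0, 3.0]
values = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(times, values)
```

For the actual problem, `xs` would be the observation times and `ys` the co2 values.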

In [0]:

2b. Combine your linear model with the periodic seasonal variation given by the decomposition

to create a time series going from 1959 to 2018 with your predictions for the co2 levels. Plot

this along with the co2 data to see how well your prediction matches the data we were given.
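One possible shape for the combined prediction, sketched under the assumption that the decomposition is additive (trend plus a seasonal offset for each month). The function `predict_series` and all numbers below are hypothetical placeholders, not values fitted to the co2 data.

```python
def predict_series(slope, intercept, seasonal, start_year, end_year):
    """Return a list of (decimal_year, predicted_value) pairs, one per month."""
    predictions = []
    for year in range(start_year, end_year + 1):
        for month in range(12):
            t = year + month / 12
            # Linear trend at time t plus that month's seasonal offset.
            predictions.append((t, intercept + slope * t + seasonal[month]))
    return predictions


# Placeholder inputs: these are NOT fitted to the co2 data.
seasonal = [0.0] * 12
seasonal[4] = 3.0    # pretend peak in May
seasonal[9] = -3.0   # pretend trough in October
series = predict_series(slope=1.5, intercept=-2600.0, seasonal=seasonal,
                        start_year=1959, end_year=2018)
```

The resulting list has one entry per month from January 1959 through December 2018 and can be plotted against the original data.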

In [0]:

2c. Plot your prediction along with the up-to-date data here

http://scrippsco2.ucsd.edu/data/atmospheric_co2/primary_mlo_co2_record. How well

does your prediction match the new data? Note that you will need to ignore the first rows of the

csv file when you read it. (Hint: Both R and pandas have optional arguments in their commands

for reading csv files that will allow you to do this. You can read about this using the command

as usual.)
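To illustrate the hint on the pandas side, here is a small sketch using the `skiprows` argument of `pandas.read_csv`. The in-memory text below is a made-up stand-in for a file with leading note lines; the real file's layout and the number of rows to skip need to be checked by opening it.

```python
import io

import pandas as pd

# Stand-in text for a CSV whose first lines are free-form notes, not data.
# The values are made up for illustration.
raw = io.StringIO(
    "Notes about the file\n"
    "More notes\n"
    "year,month,co2\n"
    "1959,1,315.6\n"
    "1959,2,316.4\n"
)

# skiprows=2 drops the two note lines so the third line becomes the header.
df = pd.read_csv(raw, skiprows=2)
```

With a real file, the `io.StringIO` object would be replaced by the file path.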

In [0]:

0.3.3 Problem 3: Face completion

Grading criteria: correctness of code and answers

Special note: In this problem, you will be running code in another notebook. However, you

will be only credited for answers entered in this notebook.

Go to the scikit-learn example of face completion; download the script as a Jupyter notebook;

upload it into this folder in your project; and run all cells using the Python 3 (Ubuntu Linux)

kernel. Then answer the following questions.

3a. What fraction of the data was used for testing? How do you know?

3b. Suppose we want to predict the top half of the face from the bottom. Which lines of code

need to be changed, and to what?

In [0]: # List all the lines of code that you are changing here...

In [0]: # ...and their replacements here.

0.3.4 Problem 4: Classifying digits using K-means

Grading criteria: correctness of code and thoroughness of analysis.

4a. Use the K-means classifier demonstrated in lecture on 11/30 to sort the digits dataset used

in lecture on 11/28 into 10 clusters. Note that you should not enter the labels of the digits, just the

8x8 arrays.

In [0]:

4b. Relabel the clusters so that the label of each cluster is the most common digit in that cluster.
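The relabeling step can be sketched in plain Python with `collections.Counter`. The function name `majority_labels` and the tiny lists are illustrative assumptions; in the actual problem the cluster ids would come from the K-means fit and the true digits from the dataset's labels.

```python
from collections import Counter


def majority_labels(cluster_ids, true_digits):
    """Map each cluster id to the most common true digit among its members."""
    members = {}
    for cluster, digit in zip(cluster_ids, true_digits):
        members.setdefault(cluster, []).append(digit)
    return {cluster: Counter(digits).most_common(1)[0][0]
            for cluster, digits in members.items()}


# Tiny made-up example: six samples spread over three clusters.
cluster_ids = [0, 0, 0, 1, 1, 2]
true_digits = [7, 7, 1, 3, 3, 9]
mapping = majority_labels(cluster_ids, true_digits)
relabeled = [mapping[c] for c in cluster_ids]
```

Note that two clusters can end up with the same majority digit, in which case some digit is never predicted.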

In [0]:

4c. How effective would a model be which takes an array corresponding to a digit, sorts it into

a cluster, and predicts that the digit is the same as the label of that cluster? What percent of the

digits would be correctly labeled? Which digit would be correctly labeled most frequently? Least

frequently?
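A minimal sketch of how the overall and per-digit accuracy could be tallied, assuming the predictions and the true labels are available as two equal-length lists. The name `accuracy_report` and the sample lists are hypothetical.

```python
def accuracy_report(predicted, actual):
    """Return (overall_accuracy, per_digit_accuracy) as fractions between 0 and 1."""
    correct = {}
    total = {}
    for p, a in zip(predicted, actual):
        total[a] = total.get(a, 0) + 1
        correct[a] = correct.get(a, 0) + (1 if p == a else 0)
    overall = sum(correct.values()) / len(actual)
    per_digit = {d: correct[d] / total[d] for d in total}
    return overall, per_digit


# Made-up predictions and true labels, for illustration only.
predicted = [7, 7, 7, 3, 3, 9]
actual = [7, 7, 1, 3, 3, 9]
overall, per_digit = accuracy_report(predicted, actual)
```

The digits with the largest and smallest values in `per_digit` are the ones labeled correctly most and least often.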

In [0]:

4d. How does this model compare to the support vector machine we used in lecture on 11/28?

Why do you think this is?

In [0]:

0.3.5 Problem 5: Visualizing digits with principal component analysis

Grading criteria: correctness of code

5a. Use principal component analysis as demonstrated in lecture on 11/30 to project the points

of the digit dataset into two dimensions.

In [0]:

5b. Make two different plots of this projection, one color-coded using the correct labels of the

dataset and one color-coded using the labels you obtained in problem 4 using K-means classification.

In [0]:

0.3.6 Problem 6: Implementing two-dimensional K-means

Grading criteria: correctness of code

The goal of this problem is to implement K-means clustering for a two-dimensional dataset.

Throughout this problem, we’ll represent a point in the plane as a tuple (x,y) containing two

floats.

6a. Write a Python function that, given two points in the plane, returns the distance between

the two points.
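One way this could look, as a sketch that uses `math.hypot` for the Euclidean distance. The function name `distance` is an assumption.

```python
import math


def distance(p, q):
    """Euclidean distance between two points given as (x, y) tuples."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


d = distance((0.0, 0.0), (3.0, 4.0))
```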

In [0]:

6b. Write a Python function that, given a list of K points [p1,...,pk] in the plane and another

point q, returns the point pi in the list that is closest to q.
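A possible sketch using `min` with a distance key. The name `closest_point` is an assumption, and when two points are equally close this version returns whichever appears first in the list.

```python
import math


def closest_point(points, q):
    """Return the point in `points` nearest to q (first one wins on ties)."""
    return min(points, key=lambda p: math.hypot(p[0] - q[0], p[1] - q[1]))


nearest = closest_point([(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)], (4.0, 4.0))
```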

In [0]:

6c. Write a Python function that takes as input a list of K ’center points’ [p1,...,pk] in the

plane and a list l of points in the plane. It should then return a dictionary with K keys, corresponding

to the points p1,...,pk, such that the value corresponding to the key pi is a list of all

the points in l which are closer to pi than to any of the other center points.
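The assignment step might be sketched as follows, assuming the center points are distinct so that they can serve as dictionary keys. The name `assign_to_centers` is hypothetical, and a point that is equally close to two centers goes to the earlier one in the list.

```python
import math


def assign_to_centers(centers, points):
    """Group each point under its nearest center; returns {center: [points]}."""
    clusters = {c: [] for c in centers}
    for q in points:
        nearest = min(centers,
                      key=lambda c: math.hypot(c[0] - q[0], c[1] - q[1]))
        clusters[nearest].append(q)
    return clusters


groups = assign_to_centers(
    [(0.0, 0.0), (10.0, 10.0)],
    [(1.0, 1.0), (9.0, 9.0), (0.0, 2.0)],
)
```

A center with no nearby points keeps an empty list, so the dictionary always has K keys.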

In [0]:

6d. Write a Python function that, given a list l of points in the plane, returns the centroid.

The x-coordinate of the centroid is the average of the x-coordinates of the points in l, and the

y-coordinate of the centroid is the average of the y-coordinates of the points in l.
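A short sketch of the coordinate-wise average described above. The name `centroid` is an assumption, and this version raises `ZeroDivisionError` on an empty list, which the caller would need to handle.

```python
def centroid(points):
    """Return the (x, y) average of a non-empty list of (x, y) tuples."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return (mean_x, mean_y)


center = centroid([(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)])
```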

In [0]:

6e. Combining the above functions, write a Python function to implement K-means clustering.

Your function should take as input a list l of points in the plane and a positive integer K. It should

then choose K random points in the plane and sort the points of l into K lists using your function

from 6c. It should use your function from 6d to compute the centroid of each list, and rerun your

function from 6c using these K points. It should repeat this process until the lists stop changing,

then return the K lists.
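The loop described above could be sketched as below. Several choices here are assumptions rather than requirements of the problem: the initial centers are drawn uniformly from the bounding box of the data, an empty cluster keeps its previous center, a hypothetical `max_iter` cap guards against looping forever, and an optional `seed` makes runs repeatable. The helper logic is inlined so the block stands alone.

```python
import math
import random


def kmeans(points, k, seed=None, max_iter=100):
    """Cluster (x, y) tuples into k lists by repeated assign/recenter steps."""
    rng = random.Random(seed)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Assumption: random starting centers inside the data's bounding box.
    centers = [(rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
               for _ in range(k)]
    previous = None
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the list of its nearest center.
        clusters = [[] for _ in range(k)]
        for q in points:
            i = min(range(k),
                    key=lambda j: math.hypot(centers[j][0] - q[0],
                                             centers[j][1] - q[1]))
            clusters[i].append(q)
        if clusters == previous:
            break  # the lists stopped changing
        previous = clusters
        # Update step: move each center to its list's centroid;
        # an empty list keeps its old center.
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters


sample = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
result = kmeans(sample, 2, seed=0)
```

Because the starting centers are random, different seeds can give different groupings, and some lists may come back empty.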

In [0]:

6f. Apply your function to the two-dimensional projection of the digits dataset that you made

in 5a. Make a plot of this projection, colored using the labels generated by your function.

In [0]: