STAT 603: Homework 7
Due: Thursday, May 2nd.
Directions:
0. You may discuss ideas in groups, but the programming and writing must be your own
work. Copying others’ work/code or allowing others to copy your own work/code is
considered cheating and plagiarism, and will result in zero points for the whole homework
and an F grade for STAT603. Cheating in any coursework is considered a serious offense against academic
integrity and University rules.
1. Submit a PDF copy of your homework, your R source code, and your label predictions on
Canvas. Name the PDF file “myhomework.pdf”, the R source code “mycode.R”, and the
predictions for the testing data “myprediction.txt” (see Q8). Only the file types “pdf”, “R” and “txt”
will be accepted on Canvas. If
any of these three files is missing online, we won’t grade your homework.
2. Submit a hardcopy of the PDF file “myhomework.pdf” in class. We won’t grade your
homework without a hardcopy.
3. Show all your work! Both source code and key outputs from running your code are required. Simply
giving a final answer or source code without appropriate explanation/key outputs will not receive any
points.
4. Typing answers in RMarkdown or LaTeX is strongly recommended.
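In this homework, we continue to work on the MNIST data sets. Recall from Q10 in HW6 that, using the
training count data set, for a given digit k (k = 0, 1, · · · , 9) we can get the sample points $x_1, x_2, \dots, x_n \in \mathbb{R}^d$
with true digit label k, where $x_i = (x_{i1}, \dots, x_{id})$. Then for digit k, its MLE $\hat{p}_k = (\hat{p}_{k1}, \dots, \hat{p}_{kd})$ with d = 49 can be
obtained by
$$\hat{p}_{kj} = \frac{\sum_{i=1}^{n} x_{ij}}{\sum_{i=1}^{n} \sum_{l=1}^{d} x_{il}}, \qquad j = 1, \dots, d.$$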
Using the training count data set “mnist_train_counts.csv”, complete the following exercises Q1-Q3.
Q1
For digit k = 5, extract the sub-sample of the training data set that corresponds to the true digit label “5”.
Print out the sample size of this sub-sample.
Q2
For digit k = 5, apply the MLE formula to the sub-sample extracted in Q1 to find the MLE $\hat{p}_k = \hat{p}_5$. Print
out your answer.
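As an illustration of Q1 and Q2, here is a minimal sketch on made-up data. It assumes the multinomial model, under which the MLE is the pooled proportion (column totals divided by the grand total). The `label` and `c1`..`c3` column names are hypothetical; the real file may be laid out differently and has d = 49 count columns.

```r
# Toy training data: a hypothetical `label` column plus d = 3 count columns.
# The real file "mnist_train_counts.csv" may be laid out differently.
train <- data.frame(
  label = c(5, 3, 5, 5, 3),
  c1    = c(2, 0, 1, 3, 1),
  c2    = c(0, 4, 1, 1, 2),
  c3    = c(2, 1, 0, 2, 1)
)

# Sub-sample whose true label is 5, keeping only the count columns.
X5 <- as.matrix(train[train$label == 5, c("c1", "c2", "c3")])
n5 <- nrow(X5)
print(n5)

# Pooled-proportion estimate: column totals over the grand total.
p5_hat <- colSums(X5) / sum(X5)
print(p5_hat)
```

The estimate is a probability vector, so `sum(p5_hat)` equal to 1 is a quick sanity check.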
Q3
Repeat Q2 for each digit k = 0, 1, · · · , 9. For the grader to verify your answer, print out a d × 10 matrix that
contains all $\hat{p}_k$ for k = 0, 1, · · · , 9, that is, the jth column of this matrix is $\hat{p}_{j-1}$.
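One way to assemble such a matrix is to loop over the labels with `sapply`, which binds the per-label vectors together as columns. The sketch below uses made-up data with three labels and d = 3; the column names and the object name `P_hat` are assumptions for illustration, and the pooled-proportion estimate assumes the multinomial model.

```r
# Toy data: hypothetical `label` column and d = 3 count columns, labels 0..2.
train <- data.frame(
  label = c(0, 1, 2, 0, 1, 2),
  c1    = c(1, 0, 2, 3, 1, 0),
  c2    = c(1, 2, 0, 1, 3, 2),
  c3    = c(2, 2, 2, 0, 0, 2)
)
labels <- 0:2
X <- as.matrix(train[, c("c1", "c2", "c3")])

# One column per label: pooled proportions within that label's sub-sample.
P_hat <- sapply(labels, function(k) {
  Xk <- X[train$label == k, , drop = FALSE]
  colSums(Xk) / sum(Xk)
})
colnames(P_hat) <- labels

print(dim(P_hat))      # d rows, one column per label
print(colSums(P_hat))  # every column sums to 1
```

Because the labels start at 0 while R indexes from 1, the column for label k sits at position k + 1.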
Next, we will use this “naive” probabilistic model to make predictions for the testing data set
“mnist_test_counts.csv”.
Q4
To warm up, suppose we want to make a prediction for the 100th data point in the testing data set. Extract
this data point’s count vector x. For the grader to check your results, print out x. In addition, find the sample
proportions $\hat{\pi}_k$ (k = 0, 1, · · · , 9), which are from Q6 in HW6.
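To make a prediction for the 100th data point with the count vector x, we can use the Bayes rule:
$$\hat{y} = \arg\max_{k = 0, 1, \dots, 9} \hat{\pi}_k f_k(x \mid \hat{p}_k) = \arg\max_{k = 0, 1, \dots, 9} \{\log \hat{\pi}_k + g(x, \hat{p}_k)\},$$
where the function g(x, p) given x = (x1, · · · , xd) and p = (p1, · · · , pd) is
$$g(x, p) = \log \prod_{j=1}^{d} p_j^{x_j} = \sum_{j=1}^{d} x_j \log p_j.$$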
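The step from the product of prior and likelihood to a sum of logs can be spelled out as follows. This assumes (it is not stated explicitly here) that $f_k$ is the multinomial probability mass function with total count $m = \sum_j x_j$:

```latex
\begin{aligned}
\log\{\hat{\pi}_k f_k(x \mid \hat{p}_k)\}
  &= \log \hat{\pi}_k + \log\frac{m!}{x_1! \cdots x_d!}
     + \sum_{j=1}^{d} x_j \log \hat{p}_{kj} \\
  &= \log \hat{\pi}_k + g(x, \hat{p}_k)
     + \text{(a term that does not depend on } k\text{)}.
\end{aligned}
```

Because the logarithm is increasing and the multinomial coefficient is the same for every k, maximizing either expression over k selects the same label.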
Q5
Write an R function named gfun(x, p) that returns g(x, p). For the grader to verify your answer,
print out the output of gfun(x, p) using the 100th data point’s count vector x and $\hat{p}_k$ for digit k = 5. Note:
when implementing gfun(x, p), how would you handle the possible situation that $p_j = 0$?
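The zero-probability question matters because in R `0 * log(0)` evaluates to `NaN`, which would contaminate the whole sum. The sketch below shows one possible convention, not necessarily the intended answer: coordinates with a zero count are skipped, and a positive count paired with a zero probability yields `-Inf`. The input vectors are made up.

```r
# g(x, p) = sum_j x_j * log(p_j), with one possible convention for p_j = 0:
# coordinates with x_j = 0 contribute nothing (0 * log 0 is treated as 0),
# while x_j > 0 with p_j = 0 gives -Inf (the count vector is impossible under p).
gfun <- function(x, p) {
  stopifnot(length(x) == length(p))
  pos <- x > 0
  sum(x[pos] * log(p[pos]))
}

x <- c(2, 0, 1)
print(gfun(x, c(0.5, 0.25, 0.25)))  # finite value
print(gfun(x, c(0.5, 0.5, 0)))      # -Inf: x_3 > 0 but p_3 = 0
print(gfun(x, c(0.75, 0, 0.25)))    # finite: the zero p_2 pairs with x_2 = 0
```

A different technique would be additive smoothing, which adds a small constant to the counts before estimating p. That avoids `-Inf` altogether, but the estimates are then no longer the plain MLE.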
Q6
We are now ready to make a prediction for the 100th data point. Use the function gfun(x, p) above to calculate
$\log \hat{\pi}_k + g(x, \hat{p}_k)$ for all k = 0, 1, · · · , 9 and find your label prediction $\hat{y}$. For the grader to verify the results, print
out all these outputs.
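The scoring step can be sketched as below with made-up priors and probabilities for three labels. The names `pi_hat`, `P_hat`, and `scores` are hypothetical, and `gfun` uses one possible zero-handling convention (skipping coordinates with zero count).

```r
# One possible gfun: skip x_j = 0 terms so that 0 * log(0) never appears.
gfun <- function(x, p) {
  pos <- x > 0
  sum(x[pos] * log(p[pos]))
}

# Made-up inputs: 3 labels (0, 1, 2), d = 3 coordinates.
labels <- 0:2
pi_hat <- c(0.5, 0.3, 0.2)         # sample proportions of the labels
P_hat  <- cbind(c(0.6, 0.3, 0.1),  # column k + 1 holds p-hat for label k
                c(0.1, 0.8, 0.1),
                c(0.2, 0.2, 0.6))
x <- c(1, 0, 4)                    # count vector to classify

# Score each label by log prior + g, then take the label with the top score.
scores <- sapply(seq_along(labels), function(i) {
  log(pi_hat[i]) + gfun(x, P_hat[, i])
})
names(scores) <- labels
y_hat <- labels[which.max(scores)]

print(scores)
print(y_hat)
```

Note that `which.max` returns a 1-based position, so it is mapped back through `labels` to recover a digit that starts at 0.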
Q7
Now let’s look at the true label for the 100th data point. Print out the true label y and $I(\hat{y} \neq y)$. Does your
prediction give the correct label?
Q8
Repeat the process above to make predictions for all the data points in the testing data set. Calculate the
misclassification error rate for the “naive” model by
$$\text{misclassification rate} = \frac{1}{N} \sum_{i=1}^{N} I(\hat{y}_i \neq y_i),$$
where $\hat{y}_i$ is the predicted label, $y_i$ is the true label, and N is the sample size of the testing data set. In
addition, save your label predictions as a “myprediction.txt” file, with the ith row representing your prediction
$\hat{y}_i$. Specifically, if yhat is the vector object that contains your predictions, you should use the following
code to generate the file “myprediction.txt”:
write.table(yhat,file="myprediction.txt",row.names=FALSE,col.names=FALSE,sep="")
Any other format of your prediction file will NOT be graded.
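As a small check of the error-rate calculation and the output format, the sketch below uses made-up label vectors (`ytrue` is a hypothetical name) and writes to a temporary file instead of “myprediction.txt”, so that running it leaves nothing behind.

```r
# Made-up predicted and true labels for N = 5 test points.
yhat  <- c(7, 2, 1, 0, 4)
ytrue <- c(7, 2, 1, 0, 9)

# Misclassification rate: the fraction of positions where the labels differ.
err <- mean(yhat != ytrue)
print(err)

# Write to a temporary file here so the sketch leaves no file behind;
# the assignment itself asks for file = "myprediction.txt".
out <- tempfile(fileext = ".txt")
write.table(yhat, file = out, row.names = FALSE, col.names = FALSE, sep = "")

# Read the file back: one prediction per row, no header, no row names.
print(readLines(out))
```

Taking `mean` of the logical vector `yhat != ytrue` works because R treats TRUE as 1 and FALSE as 0, which matches the average of the indicator values.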