Group coursework 2
Please submit your coursework on Moodle by midday on 1 March.
Please upload your answers to Question 1 ii) and Question 2 in one pdf file.
Please also upload three R scripts in .R files for Question 1 i), Question 1 ii) and
Question 2.
Make sure that you have included sufficient comments in your code to make it
readable by other people. No error messages should appear when I run
your R scripts. You can assume that I have installed all required packages.
Question 1 [8 marks]
i) Complete the following myLDA function without using any additional packages.
With the feature matrix $X \in \mathbb{R}^{N \times p}$ ($N > p$) and the label vector
$y \in \mathbb{R}^{N \times 1}$ of the training data, the myLDA function outputs the
linear discriminant $w \in \mathbb{R}^{p \times 1}$ for binary classification.
[6 marks]
myLDA <- function(X, y) {
  # This function calculates the linear discriminant for binary classification.
  # Input:  feature matrix, X (N by p), and label vector, y (N by 1)
  # Output: linear discriminant, w (p by 1)

  return(w)
}
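As a reminder of what a "linear discriminant" for two classes usually means, one common formulation is Fisher's. The coursework text does not say which formulation is expected, so treat the following as an assumption; the symbols $\hat{\mu}_k$, $N_k$ and $S_W$ are introduced here and do not appear in the original.

```latex
\hat{\mu}_k = \frac{1}{N_k} \sum_{i:\, y_i = k} x_i,
\qquad
S_W = \sum_{k \in \{1, 2\}} \sum_{i:\, y_i = k}
      (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T,
\qquad
w \propto S_W^{-1} (\hat{\mu}_1 - \hat{\mu}_2)
```

Here $x_i \in \mathbb{R}^{p \times 1}$ is the $i$-th row of $X$ written as a column vector and $N_k$ is the number of training points in class $k$. Having more observations than features ($N > p$) is what makes it plausible for $S_W$ to be invertible.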
ii) Calculate the cosine of the angle between the linear discriminant calculated from
myLDA(X=iris[51:150,-5],y=iris[51:150,5]) and that calculated from
lda(Species~.,data=iris[51:150,]). [You can ignore the warning message from
lda that the setosa class is empty.]
The cosine of the angle between two vectors, $u \in \mathbb{R}^{p \times 1}$ and
$v \in \mathbb{R}^{p \times 1}$, is defined as
$$\cos(u, v) = \frac{u^T v}{\|u\|_2 \, \|v\|_2},$$
where $\|u\|_2 = \sqrt{u^T u}$ and $\|v\|_2 = \sqrt{v^T v}$.
What conclusion can you draw from this result? [2 marks]
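The cosine definition above translates almost directly into base R. The following is a minimal sketch; the helper name `cosine` is hypothetical and is not part of the coursework.

```r
# Cosine of the angle between two numeric vectors of equal length.
# `cosine` is a hypothetical helper name, not something the coursework defines.
cosine <- function(u, v) {
  u <- as.numeric(u)  # also flattens p-by-1 matrices to plain vectors
  v <- as.numeric(v)
  stopifnot(length(u) == length(v))
  sum(u * v) / (sqrt(sum(u * u)) * sqrt(sum(v * v)))
}

# Toy checks on vectors whose geometry is known in advance
cosine(c(1, 0), c(0, 1))    # orthogonal vectors
cosine(c(1, 2), c(2, 4))    # same direction, different length
cosine(c(1, 2), c(-1, -2))  # opposite directions
```

For the comparison asked for in part ii), note that a fitted `lda` object from the MASS package stores its discriminant coefficients in the `scaling` component, which can be passed to such a helper as one of the two vectors.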
Question 2 [12 marks]
Download the newthyroid.txt data from Moodle. These data contain measurements for
normal patients and those with hyperthyroidism. The first variable is class=n if a patient
is normal and class=h if a patient suffers from hyperthyroidism. The remaining variables,
feature1 to feature5, are medical test measurements.
i) Draw a pairs plot for the newthyroid.txt data. What patterns can you see from
this plot? [2 marks]
ii) Apply kNN and LDA to classify the newthyroid.txt data: randomly split the data
into a training set (70%) and a test set (30%), and repeat the random split 50 times.
Record the 50 AUC values of kNN and LDA in two vectors.
For kNN, repeat 5-fold cross-validation five times to choose k from (3, 5, 7, 9). Use
AUC as the metric to choose k, i.e. choose k with the largest AUC. [5 marks]
Hint: Read http://topepo.github.io/caret/model-training-and-tuning.html#model-training-and-parameter-tuning
to see how to use AUC as the metric to choose k.
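The tuning setup described in part ii) can be sketched with caret roughly as follows. This is an illustration for a single random split, not a full solution: it assumes the caret and pROC packages are installed, and it uses the two-class subset `iris[51:150, ]` as a stand-in because the layout of newthyroid.txt is not shown here. The object names (`dat`, `ctrl`, `knn_fit` and so on) are hypothetical.

```r
library(caret)

# Stand-in two-class data (assumption): the coursework uses newthyroid.txt.
dat <- droplevels(iris[51:150, ])

set.seed(1)
# One random 70% / 30% split, stratified by class
idx <- createDataPartition(dat$Species, p = 0.7, list = FALSE)
train_set <- dat[idx, ]
test_set <- dat[-idx, ]

# 5-fold cross-validation repeated five times, summarised by AUC
ctrl <- trainControl(
  method = "repeatedcv", number = 5, repeats = 5,
  classProbs = TRUE,                 # class probabilities are needed for AUC
  summaryFunction = twoClassSummary  # reports ROC (the AUC), Sens and Spec
)

knn_fit <- train(
  Species ~ ., data = train_set,
  method = "knn",
  metric = "ROC",                           # pick the k with the largest AUC
  tuneGrid = data.frame(k = c(3, 5, 7, 9)),
  trControl = ctrl
)
knn_fit$bestTune  # the selected k

# AUC of the tuned model on the held-out test set
knn_prob <- predict(knn_fit, newdata = test_set, type = "prob")[, "virginica"]
knn_roc <- pROC::roc(test_set$Species, knn_prob,
                     levels = c("versicolor", "virginica"),
                     direction = "<", quiet = TRUE)
knn_auc <- as.numeric(pROC::auc(knn_roc))
```

Repeating the split 50 times would amount to wrapping the steps from `createDataPartition` onwards in a loop and storing `knn_auc` in a vector on each pass.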
iii) For the first random split, draw the ROC curves of kNN and LDA on one plot.
[2 marks]
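One way to overlay two ROC curves on a single plot is sketched below with the pROC package. As before, `iris[51:150, ]` stands in for the newthyroid.txt data, k is fixed at 5 purely for brevity rather than tuned, and all object names are hypothetical.

```r
library(MASS)   # lda
library(caret)  # knn3, createDataPartition
library(pROC)   # roc, auc

# Stand-in two-class data (assumption): the coursework uses newthyroid.txt.
dat <- droplevels(iris[51:150, ])

set.seed(1)
idx <- createDataPartition(dat$Species, p = 0.7, list = FALSE)
train_set <- dat[idx, ]
test_set <- dat[-idx, ]

# LDA: posterior probability of the second class on the test set
lda_fit <- lda(Species ~ ., data = train_set)
lda_prob <- predict(lda_fit, newdata = test_set)$posterior[, "virginica"]

# kNN with a fixed k (illustration only; the coursework tunes k by AUC)
knn_fit <- knn3(Species ~ ., data = train_set, k = 5)
knn_prob <- predict(knn_fit, newdata = test_set, type = "prob")[, "virginica"]

# ROC objects; levels are given as (control, case)
roc_lda <- roc(test_set$Species, lda_prob,
               levels = c("versicolor", "virginica"),
               direction = "<", quiet = TRUE)
roc_knn <- roc(test_set$Species, knn_prob,
               levels = c("versicolor", "virginica"),
               direction = "<", quiet = TRUE)

# Draw the first curve, then add the second to the same axes
plot(roc_lda, col = "blue", main = "ROC curves on the test set")
lines(roc_knn, col = "red")
legend("bottomright", legend = c("LDA", "kNN"),
       col = c("blue", "red"), lwd = 2)
```

The two colours and the legend are only there to keep the curves distinguishable; `as.numeric(auc(roc_lda))` and `as.numeric(auc(roc_knn))` give the corresponding AUC values.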
iv) Draw two boxplots based on the 50 AUC values of kNN and LDA. [1 mark]
v) What conclusions can you draw from the classification results of kNN and LDA on
the newthyroid.txt data? [2 marks]