STATS 4014
Advanced Data Science
Assignment 4
CHECKLIST
: Have you shown all of your working, including probability notation where necessary?
: Have you given all numbers to 3 decimal places unless otherwise stated?
: Have you included all R output and plots to support your answers where necessary?
: Have you included all of your R code?
: Have you made sure that all plots and tables each have a caption?
: If before the deadline, have you submitted your assignment via the online submission on MyUni?
: Is your submission a single pdf file - correctly orientated, easy to read? If not, penalties apply.
: Penalties for more than one document - 10% of final mark for each extra document. Note that you
may resubmit and your final version is marked, but the final document should be a single file.
: Penalties for late submission - within 24 hours 40% of final mark. After 24 hours, assignment is not
marked and you get zero.
: Assignments emailed instead of submitted by the online submission on MyUni will not be marked
and will receive zero.
: Have you checked that the assignment submitted is the correct one, as we cannot accept other
submissions after the due date?
Due date: Friday 17th May 2019 (Week 9), 5pm.
Q1. Natural splines
Consider the data
(x1, y1), (x2, y2), ..., (xn, yn).
Suppose that g(x) is a natural cubic spline with knots
Let g̃(x) be any other twice continuously differentiable function such that
1a. If h(x) = g̃(x) - g(x), then use integration by parts to show that if h(x) = 0 for all a < x < b.
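For reference, the standard textbook form of this result can be sketched as follows. The ordering a < x1 < ... < xn < b, the interpolation condition g̃(xi) = g(xi), and the exact identities shown are assumptions based on the usual statement of the exercise, not taken from the assignment text.

```latex
% Assumed setup: a < x_1 < \dots < x_n < b, g is a natural cubic spline with
% knots x_1, \dots, x_n, and \tilde g(x_i) = g(x_i), so that h(x_i) = 0.
\begin{align*}
\int_a^b g''(x)\,h''(x)\,dx
  &= \bigl[g''(x)\,h'(x)\bigr]_a^b - \int_a^b g'''(x)\,h'(x)\,dx \\
  &= -\sum_{j=1}^{n-1} g'''(x_j^+)\,\bigl\{h(x_{j+1}) - h(x_j)\bigr\} = 0, \\
\int_a^b \tilde g''(x)^2\,dx
  &= \int_a^b g''(x)^2\,dx + 2\int_a^b g''(x)\,h''(x)\,dx + \int_a^b h''(x)^2\,dx \\
  &\ge \int_a^b g''(x)^2\,dx .
\end{align*}
```

The boundary term vanishes because a natural cubic spline is linear outside [x1, xn], so g''(a) = g''(b) = 0. The remaining integral reduces to a sum because g''' is constant on each interval between knots and zero outside them. Equality in the last line requires h'' = 0 on [a, b], and a linear h that is zero at two or more knots is zero everywhere.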
1c. Show that the solution to the problem of finding a smoothing spline:
must be a natural cubic spline with knots at
x1, x2, ..., xn.
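A smoothing spline is usually defined as the minimiser of a penalised residual sum of squares. Assuming the assignment uses this standard criterion (the symbol λ and the integration limits a, b are assumptions), it would read:

```latex
\hat g \;=\; \arg\min_{f}\;
  \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2
  \;+\; \lambda \int_a^b f''(x)^2\,dx ,
  \qquad \lambda > 0 .
```

Under that criterion, any candidate f can be compared with the natural cubic spline that interpolates the points (xi, f(xi)): the sum of squares is unchanged, and the roughness penalty cannot be larger.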
Q2. ROC class
a. Create an S3 class that deals with ROC curves. For complete marks, you will need
i. a constructor,
ii. a print function,
iii. a plot function, and
iv. a generic confusion matrix function that takes a ROC object and cutoff and returns the confusion
matrix.
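A minimal sketch of how these pieces fit together in S3 is given here. It is one possible layout, not the marker's implementation: the internal `data` field, the default `cutoff = 0.5`, and the use of `table()` for the result are all assumptions, and the plot method is left out for brevity.

```r
# Constructor: checks the inputs and returns a list carrying the class "ROC".
# The internal layout (a single data frame called `data`) is an assumption.
ROC <- function(pred, obs) {
  stopifnot(is.numeric(pred),
            length(pred) == length(obs),
            all(obs %in% c(0, 1)))
  structure(list(data = data.frame(pred = pred, obs = obs)), class = "ROC")
}

# print method: called automatically when an ROC object is printed.
print.ROC <- function(x, ...) {
  cat("The number of observations is ", nrow(x$data), ".\n", sep = "")
  cat("The number of positives is ", sum(x$data$obs == 1), ".\n", sep = "")
  cat("The number of negatives is ", sum(x$data$obs == 0), ".\n", sep = "")
  invisible(x)
}

# Generic: dispatches on the class of its first argument.
conf_matrix <- function(x, ...) UseMethod("conf_matrix")

# Method for ROC objects: rows are predicted classes, columns observed classes.
# The default cutoff of 0.5 is an assumption.
conf_matrix.ROC <- function(x, cutoff = 0.5, ...) {
  predicted <- factor(as.integer(x$data$pred > cutoff), levels = c(0, 1))
  observed  <- factor(x$data$obs, levels = c(0, 1))
  table(predicted = predicted, observed = observed)
}

# Fallback for any class without its own method.
conf_matrix.default <- function(x, ...) {
  "I do not know how to deal with the class default"
}

# Toy usage with made-up scores and labels
toy_roc <- ROC(pred = c(2, 1, -1, 0.2), obs = c(1, 0, 0, 1))
toy_roc
conf_matrix(toy_roc, cutoff = 0.5)
```

Defining `conf_matrix()` with `UseMethod()` is what lets the same call behave differently for an ROC object and for any other input.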
To give an example, code using my S3 class is given below.
library(dplyr)
data("starwars")
starwars <-
  starwars %>%
  mutate(human = ifelse(species == "Human", 1, 0)) %>%
  na.omit()
starwars_lr <- glm(human ~ height + mass, data = starwars, family = binomial())
starwars_roc <- ROC(
  pred = predict(starwars_lr),
  obs = starwars$human
)
starwars_roc
## The number of observations is 29.
## The number of positives is 18.
## The number of negatives is 11.
##
## First rows of data
## # A tibble: 6 x 2
## pred obs
##
## 1 0.705 1
## 2 2.31 1
## 3 0.184 1
## 4 2.37 1
## 5 0.836 1
## 6 0.665 1
##
## First row of summary data frame:
##   TP FP FN TN     Score        FPR        TPR precision     recall
## 1  0  0 18 11 2.3652725 0.00000000 0.00000000       NaN 0.00000000
## 2  1  0 17 11 2.3093987 0.00000000 0.05555556 1.0000000 0.05555556
## 3  2  0 16 11 1.6933920 0.00000000 0.11111111 1.0000000 0.11111111
## 4  2  1 16 10 0.8576164 0.09090909 0.11111111 0.6666667 0.11111111
## 5  2  2 16  9 0.8357629 0.18181818 0.11111111 0.5000000 0.11111111
## 6  3  2 15  9 0.7668831 0.18181818 0.16666667 0.6000000 0.16666667
plot(starwars_roc, type = "PR")
conf_matrix(starwars_roc)
## # A tibble: 2 x 3
## HC `0` `1`
##
## 1 0 7 6
## 2 1 4 12
conf_matrix(starwars_roc, cutoff = 0.9)
## # A tibble: 2 x 3
## HC `0` `1`
##
## 1 0 10 16
## 2 1 1 2
conf_matrix(1:10, cutoff = 0.9)
## [1] "I do not know how to deal with the class default"
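The columns of the summary data frame printed in the example can be reproduced with a simple threshold sweep. The sketch below is one way to do it in base R. The helper name `roc_summary` is hypothetical, the toy data are made up, and the rule "predicted positive when the score is strictly greater than the threshold" is an assumption inferred from the first printed row having TP = 0 and FP = 0.

```r
# Hypothetical helper: one row per distinct score, highest score first.
# An observation counts as predicted positive when pred > threshold.
roc_summary <- function(pred, obs) {
  thresholds <- sort(unique(pred), decreasing = TRUE)
  rows <- lapply(thresholds, function(s) {
    positive <- pred > s
    TP <- sum(positive & obs == 1)
    FP <- sum(positive & obs == 0)
    FN <- sum(!positive & obs == 1)
    TN <- sum(!positive & obs == 0)
    data.frame(TP = TP, FP = FP, FN = FN, TN = TN, Score = s,
               FPR = FP / (FP + TN),
               TPR = TP / (TP + FN),
               precision = TP / (TP + FP),  # NaN when nothing is predicted positive
               recall = TP / (TP + FN))
  })
  do.call(rbind, rows)
}

# Toy data (made up for illustration, not the starwars fit)
pred <- c(2.0, 1.5, 1.0, 0.5, -0.5)
obs  <- c(1, 1, 0, 1, 0)
roc_summary(pred, obs)
```

Plotting TPR against FPR from this table gives the ROC curve, and plotting precision against recall gives the PR curve.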
Q3. Titanic dataset
The data in titanic.csv contains the details for 712 passengers on the ship Titanic. The following variables
are given:
Variable   Definition                                   Key
survival   Survival                                     0 = No, 1 = Yes
pclass     Ticket class                                 1 = 1st, 2 = 2nd, 3 = 3rd
sex        Sex
Age        Age in years
sibsp      # of siblings / spouses aboard the Titanic
parch      # of parents / children aboard the Titanic
ticket     Ticket number
fare       Passenger fare
cabin      Cabin number
embarked   Port of Embarkation                          C = Cherbourg, Q = Queenstown, S = Southampton
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.
sibsp: The dataset defines family relations in this way:
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way:
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
a. Read in the dataset and clean it.
b. Fit a MARS model.
c. Fit a CART.
d. Using both models, predict which is more likely to survive: a first-class 24-year-old male travelling
alone or a first-class 24-year-old female travelling alone.
e. According to both models, which class and sex are least likely to survive?
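The workflow in parts a to d can be sketched as below. This is an illustration under stated assumptions, not a model answer: the data are simulated stand-ins for titanic.csv (only the column names follow the variable table), the choice of predictors is arbitrary, CART is fitted with `rpart`, and the MARS fit uses the `earth` package only if it happens to be installed.

```r
library(rpart)  # CART implementation

# Simulated stand-in for titanic.csv: values are random, not real passengers.
set.seed(1)
n <- 200
titanic <- data.frame(
  pclass = sample(1:3, n, replace = TRUE),
  sex    = sample(c("male", "female"), n, replace = TRUE),
  age    = round(runif(n, 1, 70)),
  sibsp  = rpois(n, 0.5),
  parch  = rpois(n, 0.4)
)
p <- plogis(1.5 * (titanic$sex == "female") - 0.6 * titanic$pclass + 0.5)
titanic$survival <- rbinom(n, 1, p)

# Cleaning: drop incomplete rows and code categorical variables as factors.
titanic <- na.omit(titanic)
titanic$survival <- factor(titanic$survival, levels = c(0, 1),
                           labels = c("No", "Yes"))
titanic$pclass <- factor(titanic$pclass, levels = 1:3)
titanic$sex    <- factor(titanic$sex)

# CART: classification tree for survival.
cart_fit <- rpart(survival ~ pclass + sex + age + sibsp + parch,
                  data = titanic, method = "class")

# The two passengers from part d, with the same factor levels as the training data.
newdata <- data.frame(
  pclass = factor(c(1, 1), levels = 1:3),
  sex    = factor(c("male", "female"), levels = levels(titanic$sex)),
  age    = c(24, 24),
  sibsp  = c(0, 0),
  parch  = c(0, 0)
)
cart_prob <- predict(cart_fit, newdata = newdata, type = "prob")[, "Yes"]
cart_prob

# MARS: fitted only when the earth package is available (an optional dependency).
if (requireNamespace("earth", quietly = TRUE)) {
  mars_fit <- earth::earth(survival ~ pclass + sex + age + sibsp + parch,
                           data = titanic, glm = list(family = binomial))
  mars_prob <- predict(mars_fit, newdata = newdata, type = "response")
  print(mars_prob)
}
```

Building `newdata` with the same factor levels as the training data avoids level mismatches at prediction time. The predicted probabilities from the two models can then be compared row by row.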
Mark scheme
Part    Marks  Difficulty  Area       Type      Comments
Q1
1a      7      0.29        Splines    proof     7 for proof
1b      7      0.29        Splines    proof     7 for proof
1c      5      0.00        Splines    proof     5 for proof
Total   19
Q2
2ai     5      0.00        S3 OOP     coding    5 for code
2aii    5      0.00        S3 OOP     coding    5 for code
2aiii   6      0.50        S3 OOP     coding    6 for code
2aiv    6      0.50        S3 OOP     coding    6 for code
Total   22
Q3
3ab     4      0.00        MARS/CART  analysis  4 for analysis
3c      2      0.00        MARS/CART  analysis  2 for analysis
3d      4      0.00        MARS/CART  analysis  4 for analysis
3e      3      0.00        MARS/CART  analysis  3 for analysis
Total   13
Assignment total 54