

STAT 4620 Assignment 2

Due: October 1st, 2025 by 5PM

Question 1 [5 points]

This problem has to do with odds.

(a) On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?

(b) Suppose that an individual has a 16 % chance of defaulting on her credit card payment. What are the odds that she will default?
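The two conversions this question relies on can be sketched in R as a sanity check (a generic illustration with made-up values, not the answers to (a) and (b); the function names odds_to_prob and prob_to_odds are ours):

```r
# Odds <-> probability conversions:
#   odds = p / (1 - p)   and conversely   p = odds / (1 + odds)
odds_to_prob <- function(odds) odds / (1 + odds)
prob_to_odds <- function(p) p / (1 - p)

odds_to_prob(3)     # odds of 3 correspond to probability 0.75
prob_to_odds(0.25)  # probability 0.25 corresponds to odds of 1/3
```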

Question 2 [10 points]

Recall that the odds of “success” for the 2-class classification problem (i.e. K = 2) with a univariate predictor variable X (i.e. p = 1) are given by

odds = p(X) / (1 − p(X)),

and in logistic regression, the log-odds is modeled as a linear function of X,

log(odds) = α + βX.

1. Derive an expression for the log-odds of “success” when K = 2 and p = 1 for the LDA model.

2. Is the log-odds of success when modeling the data using LDA also a linear function of the univariate predictor variable X?

Hint: In LDA, the probability p(X) is calculated according to Equation 4.17 in the text.
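For reference, with K = 2 and p = 1, Equation 4.17 (the LDA posterior with Gaussian class densities sharing a common variance σ², class means μ1 and μ2, and priors π1 and π2 — the text's standard notation) gives the starting point for the derivation:

p(X) = π1 f1(x) / (π1 f1(x) + π2 f2(x)),  where  fk(x) = (1 / (√(2π) σ)) exp(−(x − μk)² / (2σ²)).

Forming p(X) / (1 − p(X)) and taking the logarithm is the calculation requested in part 1.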

Question 3 [30 points]

Please use the Weekly dataset available in the package ISLR2. You may find Section 4.6 in your book helpful.

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary() function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
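A minimal sketch of this fit in R, assuming the ISLR2 package is installed (the object name glm_fit is ours):

```r
# Logistic regression on the full Weekly data (sketch; assumes ISLR2 is installed)
library(ISLR2)
glm_fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
               data = Weekly, family = binomial)
summary(glm_fit)  # inspect the z-statistics and p-values for each predictor
```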

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
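One way to sketch the training/held-out split in R (again assuming ISLR2 is installed; the variable names train, glm_fit2, and glm_pred are ours):

```r
# Train on 1990-2008, evaluate on 2009-2010 (sketch; assumes ISLR2 is installed)
library(ISLR2)
train    <- Weekly$Year <= 2008                      # logical index: training period
glm_fit2 <- glm(Direction ~ Lag2, data = Weekly,
                subset = train, family = binomial)
probs    <- predict(glm_fit2, Weekly[!train, ], type = "response")
glm_pred <- ifelse(probs > 0.5, "Up", "Down")
table(glm_pred, Weekly$Direction[!train])            # confusion matrix
mean(glm_pred == Weekly$Direction[!train])           # overall fraction correct
```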

(e) Repeat (d) using LDA.

(f) Repeat (d) using QDA.

(g) Repeat (d) using KNN with K = 1.

(h) Which of these methods appears to provide the best results on this data?
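Parts (e)–(g) follow the same train/test pattern as (d); a sketch assuming the MASS and class packages (which provide lda(), qda(), and knn()) are installed alongside ISLR2:

```r
library(ISLR2); library(MASS); library(class)
train <- Weekly$Year <= 2008                 # same 1990-2008 training period

# (e) LDA with Lag2 as the only predictor
lda_fit <- lda(Direction ~ Lag2, data = Weekly, subset = train)
table(predict(lda_fit, Weekly[!train, ])$class, Weekly$Direction[!train])

# (f) QDA
qda_fit <- qda(Direction ~ Lag2, data = Weekly, subset = train)
table(predict(qda_fit, Weekly[!train, ])$class, Weekly$Direction[!train])

# (g) KNN with K = 1; knn() expects matrix inputs and breaks ties at random,
# so fix the seed for reproducibility
set.seed(1)
knn_pred <- knn(as.matrix(Weekly$Lag2[train]),
                as.matrix(Weekly$Lag2[!train]),
                Weekly$Direction[train], k = 1)
table(knn_pred, Weekly$Direction[!train])
```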

Question 4 [10 points]

Recall the “stairfun” function introduced in class during Lecture 3 that made use of indicator variables to construct a simple yet highly flexible model for univariate regression. The function depended on a single parameter, numcuts, which dictated the flexibility of the resulting estimated function. In class I demonstrated how, when I knew the true underlying function f, I could construct curves summarizing the bias-variance tradeoff as the flexibility of the model varied (i.e., as numcuts varied).

In this question, I have given observations of a true underlying function, f (different from the one used in class), of a single predictor variable x ∈ [0, 1] in the file fun.Rdata (uploaded on Carmen), containing an input vector x and a response vector y.

Using k-fold cross-validation, construct MSE curves that summarize the bias-variance tradeoff of fitting the stairfun function to this dataset as numcuts is varied. You can do this by calculating estimates of the MSE using cross-validation at different settings of numcuts and plotting a curve of CV error versus model complexity. Try k = 10 and k = 5 in your cross-validation. Does changing the value of k have a large effect on the plotted curves? What value of the complexity parameter numcuts does your investigation recommend when using the stairfun function to predict this dataset?

Note 1: You can reuse any R code in the file stairfun.R.

Note 2: When generating folds for cross-validation, sample data indices for each fold randomly (this is different from what I did in stairfun.R). To do this, you can use the command matrix(sample(1:n), nrow=k), where n is the sample size and k is the number of folds.
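The fold-generation step can be sketched as follows (illustrative values of n and k, not the real data; the loop body is left as comments because stairfun itself lives in stairfun.R). Note that matrix() will recycle indices with a warning if n is not divisible by k:

```r
set.seed(1)
n <- 100; k <- 5                          # illustrative sizes only
folds <- matrix(sample(1:n), nrow = k)    # row i holds the test indices of fold i
cv_mse <- numeric(k)
for (i in 1:k) {
  test_idx <- folds[i, ]
  # Fit stairfun (from stairfun.R) at the current numcuts using the data
  # excluding test_idx, predict on test_idx, then store:
  # cv_mse[i] <- mean((y[test_idx] - yhat)^2)
}
# mean(cv_mse) is then the CV estimate of test MSE at this numcuts;
# repeat over a grid of numcuts values to trace out the curve.
```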
