Assignment 1, FIT5201, S1 2023
1 Model Complexity and Model Selection
In this section, you study the effect of model complexity on the training and testing error. You also demonstrate your programming skills by developing a regression algorithm and a cross-validation technique that will be used to select the models with the most effective complexity.
Background. A KNN regressor is similar to a KNN classifier (covered in Activity 1.1) in that it finds the K nearest neighbours of a given test point and estimates the point's value from the values of those neighbours. The main difference is that a KNN classifier returns the label with the majority vote in the neighbourhood, whilst a KNN regressor returns the average of the neighbours’ values. In Activity 1 of Module 1, we used the number of misclassifications to measure the training and testing errors of a KNN classifier. For a KNN regressor, you need to choose another error function (e.g., the sum of the squared errors) to measure training and testing errors.
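To make the averaging rule concrete, here is a minimal sketch of a single KNN regression prediction together with a mean squared error function. It is written in plain Python for readability rather than as the required class; the helper names `knn_predict` and `mean_squared_error`, the Euclidean distance, and the toy data are all assumptions made for illustration.

```python
import math

def knn_predict(train_x, train_y, query, k):
    """Predict the target of `query` as the mean target of its k nearest training points."""
    # Euclidean distance from the query to every training point
    dists = [math.dist(p, query) for p in train_x]
    # Indices of the k smallest distances (sorted is stable, so ties keep index order)
    nearest = sorted(range(len(train_x)), key=lambda i: dists[i])[:k]
    return sum(train_y[i] for i in nearest) / k

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical one-dimensional toy data
train_x = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [0.0, 1.0, 2.0, 10.0]

pred = knn_predict(train_x, train_y, (1.2,), k=3)
print(pred)  # the three nearest targets are 1.0, 2.0 and 0.0, so the mean is 1.0
```

A vectorised NumPy version would compute all distances at once, but the logic is the same: sort by distance, keep the first k, average their targets.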
Question 1 [KNN Regressor, 5+5=10 Marks]
I Implement a KNN regressor using the scikit-learn conventions, i.e., in a class with the following skeleton.
class KnnRegressor:

    def __init__(self):  # ADD PARAMETERS AS REQUIRED
        # YOUR CODE HERE

    def fit(self, x, y):
        # YOUR CODE HERE
        return self

    def predict(self, x):
        # YOUR CODE HERE
Hint: You can closely follow the implementation from Activity 1.1 of the KNN classifier. You cannot use sklearn.neighbors.KNeighborsRegressor to solve this task.
II To test your implementation, load the diabetes and California housing datasets through the functions load_diabetes and fetch_california_housing, both of which are available in the module sklearn.datasets. For both datasets, perform a training/test split (using a fraction of 0.6 of the data as training data), fit your KNN regressor to the training portion (using some guess for a good value of K), and report the training and test errors.
Question 2 [L-fold Cross Validation, 5+5+5=15 Marks]
I Implement an L-Fold Cross Validation (CV) scheme using the scikit-learn convention for data splitters, i.e., using the following skeleton.
class LFold:

    def __init__(self):  # ADD PARAMETERS AS REQUIRED
        # YOUR CODE HERE

    def get_n_splits(self, x=None, y=None, groups=None):
        # YOUR CODE HERE

    def split(self, x, y=None, groups=None):
        # YOUR CODE HERE
Test your implementation for correctness by running a simple example like the following.
for idx_train, idx_test in LFold(5).split(list(range(20))):
    print(idx_train, idx_test)
You cannot use sklearn.model_selection.KFold to solve this task.
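As a rough illustration of what such a splitter has to produce, the sketch below partitions the indices 0..n-1 into L contiguous folds and yields one (train, test) pair per fold. The function name `lfold_indices` is hypothetical, and the choice of contiguous, unshuffled folds is an assumption; a full implementation would wrap this logic in the class skeleton above and might shuffle first.

```python
def lfold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of n_folds contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop

folds = list(lfold_indices(20, 5))
for train, test in folds:
    print(train, test)
```

Every index appears in exactly one test fold, and each training set is the complement of its test fold, which is the property the simple example above lets you check by eye.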
II For both datasets from Question 1, use your L-fold CV implementation to systematically test the effect of the KNN parameter K by testing all options from 1 to 50. For each K, run your L-Fold CV instead of performing only a single training/test split. For each K, compute the mean and standard deviation of the mean squared error (training and test) across the L folds and report the K for which you found the best test performance (for both datasets).
III For both datasets, plot the mean training and test errors against the choice of K with error bars (using the standard error of the mean). You can compute the standard error of the means as
ste = 1.96 · s / √L
where s is the sample standard deviation of the error across the L folds. Based on this plot, comment on
– The effect of the parameter K. For both datasets, identify regions of overfitting and underfitting for the KNN model.
– The effect of the parameter L of the CV procedure. HINT: You might want to repeat the above process with different values for L to get an intuition of its effect.
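The error-bar formula above can be computed per value of K as in the following sketch. The function name `summarise_fold_errors` and the five fold errors are made up for illustration; `statistics.stdev` is used because it returns the sample standard deviation (denominator L - 1), matching the definition of s.

```python
import math
import statistics

def summarise_fold_errors(fold_errors):
    """Return (mean, sample std s, error bar 1.96 * s / sqrt(L)) for per-fold errors."""
    n_folds = len(fold_errors)
    mean = statistics.mean(fold_errors)
    s = statistics.stdev(fold_errors)  # sample standard deviation (L - 1 denominator)
    bar = 1.96 * s / math.sqrt(n_folds)
    return mean, s, bar

# Hypothetical test MSEs from L = 5 folds for one value of K
mean, s, bar = summarise_fold_errors([4.0, 6.0, 5.0, 7.0, 3.0])
print(mean, s, bar)
```

The returned `bar` is the half-length of the error bar to draw around `mean` for that K, for instance via the `yerr` argument of a plotting routine.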
Question 3 [Automatic Model Selection, 5 + 5 = 10 Marks]
I Implement a version of the KNN regressor that automatically chooses an appropriate value of K from a list of options by performing an internal cross-validation on the training set at fitting time. As usual, use the scikit-learn paradigm, i.e., use the following template.
class KnnRegressorCV:

    def __init__(self, ks=list(range(1, 21)), cv=LFold(5)):
        # YOUR CODE HERE

    def fit(self, x, y):
        # YOUR CODE HERE
        return self

    def predict(self, x):
        # YOUR CODE HERE
II For both datasets from the previous questions, test your KNN regressor with internal CV by using either an outer single train/test split or, ideally, an outer cross-validation (resulting in a so-called nested cross-validation scheme). Report the (mean) K value chosen by the KNN regressor with internal cross-validation and whether it corresponds to the best K value with respect to the outer test sets. Comment on what factors determine whether the internal cross-validation procedure is successful in approximately selecting the best model.
2 Probabilistic Machine Learning
In this section, you show your knowledge about the foundation of probabilistic machine learning (i.e. probabilistic inference and modeling) by solving a simple but basic statistical inference problem.
Solve the following problem based on the probability concepts you have learned in Module 1 with the same math conventions.
Question 4 [Bayes Rule, 5+5=10 Marks]
Recall the simple example from Appendix A of Module 1. Suppose we have one red, one blue, and one yellow box with the following content:
In the red box we have 3 apples and 5 oranges,
in the blue box we have 4 apples and 4 oranges, and
in the yellow box we have 1 apple and 1 orange.
Now suppose we selected one of the boxes uniformly at random and then, in a second step, picked a fruit from it, again uniformly at random.
I Implement a Python function that simulates the above experiment (using a suitable method of a numpy random number generator obtained via numpy.random.default_rng).
II Answer the following question by a formal derivation: If the picked fruit is an apple, what is the probability that it was picked from the yellow box?
Hint: Formalise this problem using the notions in the “Random Variable” paragraph in Appendix A of Module 1.
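As a numerical cross-check for the formal derivation (not a replacement for it), Bayes' rule can be evaluated exactly with rational arithmetic. The sketch below uses only the box contents and the uniform box choice stated above; the variable names are illustrative.

```python
from fractions import Fraction

# Box contents from the problem statement: (apples, oranges)
boxes = {"red": (3, 5), "blue": (4, 4), "yellow": (1, 1)}

prior = Fraction(1, len(boxes))  # each box is chosen uniformly at random

# Joint probability P(box, apple) = P(box) * P(apple | box)
joint = {name: prior * Fraction(a, a + o) for name, (a, o) in boxes.items()}

evidence = sum(joint.values())  # P(apple), by the law of total probability
posterior = {name: p / evidence for name, p in joint.items()}

print(posterior["yellow"])  # 4/11
```

A simulation of the two-step experiment with many repetitions should give a relative frequency close to this value among the trials that produced an apple.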
3 Ridge Regression
In this section, you develop Ridge Regression by adding the L2 norm regularization to the linear regression (covered in Activity 2.1 of Module 2) and study the effect of the L2 norm regularization on the training and testing errors. This section assesses your mathematical skills (derivation), programming, and analytical skills.
Question 5 [Ridge Regression, 10+5+5=20 Marks]
I Given the gradient descent algorithms for linear regression (discussed in Chapter 2 of Module 2), derive the weight update steps of stochastic gradient descent (SGD) for linear regression with L2-norm regularisation. Show your work with enough explanation in your PDF report; you should provide the steps of SGD.
Hint: Recall that for linear regression we defined the error function E. For this assignment, you only need to add an L2 regularization term to the error function (error term plus the regularization term). This question is similar to Activity 2.1 of Module 2.
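The shape of such a derivation can be sketched as follows. The exact form of E used in Module 2 is not reproduced here, so the sum-of-squares error with a factor of 1/2, the feature map φ, the learning rate η, and the convention of applying the full penalty at every stochastic step are all assumptions; other scalings of the penalty (for example λ/N per sample) only rescale λ.

```latex
% Assumed regularised error over N training points
E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \bigl( t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \bigr)^2
              + \frac{\lambda}{2} \, \mathbf{w}^\top \mathbf{w}

% Assumed per-sample objective used by SGD
E_n(\mathbf{w}) = \frac{1}{2} \bigl( t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \bigr)^2
                + \frac{\lambda}{2} \, \mathbf{w}^\top \mathbf{w}

% Gradient of the per-sample objective
\nabla_{\mathbf{w}} E_n(\mathbf{w}) = - \bigl( t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \bigr) \boldsymbol{\phi}(\mathbf{x}_n)
                                    + \lambda \mathbf{w}

% SGD update with learning rate \eta at iteration \tau
\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)}
  + \eta \Bigl[ \bigl( t_n - \mathbf{w}^{(\tau)\top} \boldsymbol{\phi}(\mathbf{x}_n) \bigr) \boldsymbol{\phi}(\mathbf{x}_n)
  - \lambda \mathbf{w}^{(\tau)} \Bigr]
```

Setting λ = 0 recovers the unregularised SGD update, which is a quick sanity check on the signs.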
II Using the analytically derived gradient from Step I, implement either a direct or a (stochastic) gradient descent algorithm for Ridge Regression (use again the usual template with __init__, fit, and predict methods). You cannot use any import from sklearn.linear_model for this task.
III Study the effect of the L2-regularization on the training and testing errors, using the synthetic data generator from Activity 2.3, i.e., where data is generated according to
a For each λ in {0, 0.4, 0.8, . . . , 10}, create a pipeline of your implemented ridge regressor with a polynomial feature transformer with degree 5.
b Fit the model ten times (resampling a training dataset of size 20 each time) for all choices of λ.
c Create a plot of mean squared errors (use different colors for the training and testing errors), where the x-axis is log lambda and y-axis is the error. Discuss λ, model complexity, and error rates, corresponding to underfitting and overfitting, by observing your plot.
4 Multiclass Perceptron
In this section, you are asked to demonstrate your understanding of linear models for classification.
You expand the binary-class perceptron algorithm that is covered in Activity 3.1 of Module 3 into a multiclass classifier. Then, you study the effect of the learning rate on the error rate. This section assesses your programming and analytical skills.
Background. Assume we have N training examples {(x1, t1), …, (xN, tN)} where tn is one of K discrete values {1, . . . , K}, i.e. a K-class classification problem. For a prediction function of a model with parameters w, we use, as usual, yn(xn, w) to represent the predicted label of data point xn. In particular, for the K-class classification problem with p-dimensional inputs, we will consider a K × p weight matrix w, or alternatively, a collection of K weight vectors wk, each of which corresponds to one of the classes. At prediction time, a data point x will then be classified as
$$y = \arg\max_{k \in \{1, \dots, K\}} w_k \cdot x .$$
We can fit those weights with the multiclass perceptron algorithm as follows:
1. Initialise the weight vectors w1, . . . , wK randomly to small weights
2. FOR n = 1 to N:
   – y = arg max_{k∈{1,…,K}} wk · xn
   – IF y ≠ tn THEN for all k ∈ {1, . . . , K}
3. IF weights have changed THEN go to Step 2 ELSE terminate
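The sketch below follows the three steps above in plain Python. The update applied on a mistake is an assumption, not taken from the text: it is the common multiclass perceptron rule that adds η·xn to the weights of the true class and subtracts η·xn from the weights of the predicted class. Classes are indexed from 0 rather than 1, and the `max_epochs` cap, the function names, and the toy data are likewise illustrative additions.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def predict(weights, x):
    """Return the class index k with the largest score w_k . x."""
    scores = [dot(w, x) for w in weights]
    return max(range(len(weights)), key=lambda k: scores[k])

def fit_multiclass_perceptron(xs, ts, n_classes, eta=0.1, max_epochs=100, seed=0):
    rng = random.Random(seed)
    dim = len(xs[0])
    # Step 1: small random initial weights, one vector per class
    weights = [[rng.uniform(-0.01, 0.01) for _ in range(dim)]
               for _ in range(n_classes)]
    for _ in range(max_epochs):      # cap added so the loop ends on non-separable data
        changed = False
        for x, t in zip(xs, ts):     # Step 2: one pass over the training data
            y = predict(weights, x)
            if y != t:
                # ASSUMED update: reward the true class, penalise the predicted one
                for j in range(dim):
                    weights[t][j] += eta * x[j]
                    weights[y][j] -= eta * x[j]
                changed = True
        if not changed:              # Step 3: stop after a pass with no updates
            break
    return weights

# Hypothetical, linearly separable three-class toy data
xs = [(2, 0, 0), (3, 0.5, 0), (0, 2, 0), (0.5, 3, 0), (0, 0, 2), (0, 0.5, 3)]
ts = [0, 0, 1, 1, 2, 2]
weights = fit_multiclass_perceptron(xs, ts, n_classes=3)
print([predict(weights, x) for x in xs])
```

Storing a copy of `weights` after each inner-loop iteration would give the weight history needed to plot error against the number of iterations.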
In what follows, we look into the convergence properties of the training algorithm for multiclass perceptron (similar to Activity 3.1 of Module 3).
Question 6 [Multiclass Perceptron, 5+5+10=20 Marks]
I Implement the multiclass perceptron as explained above using the usual template. You cannot use sklearn.linear_model.Perceptron to solve this task.
II Evaluate your algorithm using the digits dataset provided through the function load_digits in sklearn.datasets. This is a classification problem with 10 classes corresponding to the digits 0 to 9 (see the scikit-learn online documentation for more information). Perform an 80/20 train/test split and report your train and test error rates (using η = 0.01).
III Modify your classifier implementation to store the history of the weight vectors (similar to the gradient descent algorithms implemented in Activity 2.1). Then run the model fitting for two different learning rates (η = 0.1 and η = 0.9) and draw a plot of the training and test error as the number of iterations of the inner loop increases (it is enough to only evaluate the errors every 5 iterations). Explain how the testing errors of two models behave differently, as the training data increases, by observing your plot.
5 Logistic Regression versus Bayes Classifier
This task assesses your analytical skills. You need to study the performance of a well-known generative model and a well-known discriminative model, i.e. the Bayes classifier and logistic regression, as the size of the training set increases. Then, you show your understanding of the behavior of learning curves of typical generative and discriminative models.
Question 7 [Discriminative vs Generative Models, 5+5+5+5=20 Marks]
I Load the breast cancer dataset via load_breast_cancer in sklearn.datasets and copy the code from Activities 3.2 and 3.3 for the Bayes classifier (BC) and logistic regression (LR).
Note: for logistic regression you can instead simply import LogisticRegression from sklearn.linear_model and, when using it, set the parameter penalty to ’none’. Perform a training/test split (with train size equal to 0.8) and report which model performs better in terms of train and test performance.
II Implement an experiment where you test the performance for increasing training sizes of N = 5, 10, . . . , 500. For each N sample 10 training sets of the corresponding size, fit both models, and record training and test errors.
Hint: you can use train_test_split from sklearn.model_selection with an integer parameter for train_size.
III Create suitable plots that compare the mean train and test performances of both models as a function of training size. Make sure to also include error bars in the plot, computed similarly to those in Question 5.
IV Formulate answers to the following questions:
a What happens for each classifier when the number of training data points is increased?
b Which classifier is best suited when the training set is small, and which is best suited when the training set is big?
c Justify your observations in the previous questions (IV.a & IV.b) by providing some speculations and possible reasons.
Hint: Think about model complexity and the fundamental concepts of machine learning covered in Module 1.
Submission and Interview
Submission Please submit one zip-file that contains two versions of a single Jupyter Notebook file that contains
your name and student ID in a leading markdown cell followed by
a structure that clearly separates between sections and questions (with markdown headlines and sub-headlines) and then for each question
all required code,
all required figures,
and your answers to all questions that require a free-text answer (in markdown cells).
One version is the actual notebook file (with extension “.ipynb”). The other one is a pdf export (“.pdf”). Note that depending on your system it might be necessary to first generate an html export and then save that with your web browser as a pdf file.
The three files should be named STUD_ID_FIRSTNAME_LASTNAME_assessment_1.SUFFIX where SUFFIX is “zip”, “pdf”, and “ipynb”, respectively. The submission must be received via Moodle by the due date mentioned on the top of this document.
Interview
In addition to the submission, you will be asked to meet (online or on-campus) with your tutor for an interview when your assessment is marked. Not submitting the file or not attending the interview will both result in 0 marks for the assignment.
Notes Please note that,
A delay of one second will be penalized as a delay of one day. So please submit your assignment in advance (allowing for possible Internet delays) and do not wait until the last minute.
We will not accept any resubmitted version. So please double-check your assignment before submission.
Your final grade does not only depend on your submission but also on your ability to explain your solution in an assignment interview to be held after submission. Failure to attend this interview will result in 0 marks.