 Data Mining (CSC 503/SENG 474)

Assignment 1 Due on Friday, June 26th, 11:55pm
Instructions:
• You must complete this assignment on your own; this includes any coding/implementing,
running of experiments, generating plots, analyzing results, writing up results, and working
out problems. Assignments are for developing skills to make you strong. If you do the
assignments well, you’ll almost surely do better on the midterm and final.
• On the other hand, you can certainly have high-level discussions with classmates about course
material. You also are welcome to come to office hours (or otherwise somehow ask me and
the TAs things if you are stuck).
• You must type up your analysis and solutions; I strongly encourage you to use LaTeX to do
this. LaTeX is great both for presenting figures and math.
• Please submit your solutions via conneX by the due date/time indicated above. This is a
hard deadline. Any late assignment will result in a zero (I won’t accept assignments even
one day late; “I forgot to hit submit on conneX”, etc. will not be met with any sympathy).
However, I will make an exception if I am notified prior to the deadline with an acceptable
excuse and if you further can (soon thereafter) provide a signed note related to this excuse.
Assignment 1
This assignment has two parts. The first part involves implementing some of the methods that
you have already seen and analyzing their results on some classification problems; this part is for
all students (both CSC 503 and SENG 474). The second part is actually only for graduate students
(only CSC 503). The first part might seem like a lot of work. However, if you do it well, you’ll
become strong, and also gain valuable skills for your project. This is a good thing.
1 Experiments and Analysis
First, “implement” the following methods:
• Decision trees (with pruning). For simplicity, I suggest going with reduced error pruning
(this is the form of pruning we covered in class). For the split criterion, you can use
information gain or the Gini index or some other good criterion. You might even try a few
different ones as part of your analysis (explained further below).
• Random forests (no pruning). What forest size should you use? Well, you should experiment
with different forest sizes and see what happens. For the size of the random sample (sampled
with replacement) used to learn each tree, I suggest setting the number of samples drawn
with replacement to be equal to the original sample size. For the number of random features
considered when selecting the split feature at each decision node, I suggest starting at √d
(for d features) and experimenting by going up and down from there.
• Neural networks. Any number of layers is fine, as long as there is at least one hidden layer;
I suggest going with 3 layers (i.e. the input layer, 1 hidden layer, and the output layer).
I put “implement” in quotes because I won’t actually require you to implement these methods;
you can instead use machine learning software that you find online. If you do implement anything,
you can use whatever programming language you like. For neural networks in particular, I
shamefully recommend using an existing implementation here, but eventually (after this
assignment), for your own edification, it would be great if you implemented a neural network yourself.
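
If you do go the existing-software route, here is a minimal sketch (one acceptable choice among
many; scikit-learn is assumed but not required) of how the three methods could be instantiated.
Every parameter value shown is only a starting point that you are expected to vary in your
experiments. Note that scikit-learn does not ship reduced error pruning; it offers cost-complexity
pruning (ccp_alpha) instead, so reduced error pruning on a held-out validation set would have to
be implemented separately.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Decision tree: "entropy" (information gain) or "gini" as the split criterion.
# ccp_alpha gives cost-complexity pruning, a stand-in for reduced error pruning.
tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01)

# Random forest: bootstrap samples of the original training-set size,
# sqrt(d) random features per split; vary n_estimators (the forest size).
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True)

# Neural network: one hidden layer, i.e. 3 layers counting input and output.
net = MLPClassifier(hidden_layer_sizes=(32,), learning_rate_init=0.01,
                    max_iter=500)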
What to test on
You’ll be analyzing the performance of these methods on two classification problems.
For the first problem, you’ll be using the heart disease dataset from the UCI repository:
https://archive.ics.uci.edu/ml/datasets/Heart+Disease
In order to make your life a bit easier, I have provided a slightly cleaned and modified dataset,
cleaned_processed.cleveland.data. In case you’re interested, you can find details on how I
prepared this dataset in Appendix A. The last attribute/feature (the label) takes values 0 and 1,
where 0 indicates no heart disease and 1 indicates some level of heart disease. Keep in mind that
the dataset has not been split into a training and test set. You should do this. For any training/test
split, a good rule of thumb is 80% for the training set and 20% for the test set.
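
For concreteness, here is a minimal sketch of such an 80%/20% split using scikit-learn; it assumes
the cleaned file has no header row and the label in the last column (adjust if your copy differs).

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("cleaned_processed.cleveland.data", header=None)
X, y = data.iloc[:, :-1], data.iloc[:, -1]

# 80% training, 20% test; fixing random_state makes the split reproducible,
# and stratify keeps the class proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)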
The second problem will actually be designed by you. It will again be a classification problem,
meaning that it consists of a set of training examples and a set of test examples for a classification
task. For simplicity, I suggest going with a binary classification task. You can either generate
the data yourself or find some data online. The important thing is that the problem should be
interesting: that is to say, the learning task should not be so easy that every method can obtain
zero test error, but it should be easy enough that some method is able to obtain a reasonably
small test error. What does “reasonably small test error” mean? Well, for binary classification,
this would be a test error well below 50%. Why? Because a classifier that just randomly predicts
with equal probability will always get close to 50% test error.
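
If you decide to generate the data yourself, below is a minimal sketch of one way to do it with
scikit-learn's make_classification; the parameter values are purely illustrative, and you should
tune them (or use real data instead) until the task is neither trivial nor hopeless.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# flip_y adds label noise and class_sep controls difficulty; together they
# keep the problem from being solvable with zero test error.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_redundant=4, flip_y=0.05, class_sep=0.8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)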
How to do the analysis
The core of this assignment, meaning what is worth the most marks, is the experiment analysis.
Your analysis will go into a file called report.pdf. This file should:
• contain a description of the second classification problem (the one you came up with or
selected), including an explanation of why the problem is interesting; specifically, explain
why it is non-trivial and why it enables good comparison of the methods.
• present the performance of the methods in terms of the training and test error on each of
the problems. In particular, you should present graphs that show how training error and
test error each vary with training set size. But beyond that, you should also play with the
parameters of the methods to see their effect. For instance, what happens when you change
the learning rate (or the number of nodes in the hidden layer, or the number of iterations of
training) for the neural network? What happens if you change the number of random features
used for the random forest? What happens if you change the pruning rule (or use a different
split criterion) for the decision tree? As much as possible, try to present your results with
plots (a minimal plotting sketch is given after this list).
• contain a detailed analysis of your results, including various design choices you made (like
choice of split criterion for decision trees, choice of nonlinearity for neural networks, etc.).
Try and explain the results that you got for the various experiments you ran, and use these
results to compare the methods. Think about the best parameter settings for the methods
(and maybe think about how the best parameter setting might change as the training sample
size increases). Ideally, use your analysis to come up with ideas on how you might improve
the methods.
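
As a concrete starting point, here is the minimal plotting sketch referred to above, for the
training-error/test-error curves. It assumes X_train, y_train, X_test, y_test come from an 80/20
split like the one sketched earlier, and that clf is one of the classifiers instantiated earlier;
all names here are illustrative.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.base import clone

# Work with plain arrays so the integer slicing below is unambiguous.
X_tr, y_tr = np.asarray(X_train), np.asarray(y_train)
X_te, y_te = np.asarray(X_test), np.asarray(y_test)

sizes, train_err, test_err = [], [], []
for frac in np.linspace(0.1, 1.0, 10):
    n = int(frac * len(X_tr))
    model = clone(clf).fit(X_tr[:n], y_tr[:n])   # refit on a prefix of the training set
    train_err.append(1 - model.score(X_tr[:n], y_tr[:n]))
    test_err.append(1 - model.score(X_te, y_te))
    sizes.append(n)

plt.plot(sizes, train_err, label="training error")
plt.plot(sizes, test_err, label="test error")
plt.xlabel("training set size")
plt.ylabel("error")
plt.legend()
plt.savefig("learning_curve.png")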
Please make your analysis concise (yet thorough). Don’t ramble, and don’t make stuff up. Act
as if you are a respected scientist presenting some work to your colleagues (and you want them to
continue being your colleagues after the presentation).
What to submit
In all, for the analysis portion of the assignment, you should submit (as a zip file):
• the file report.pdf explained above;
• a file called README.txt which contains instructions for running your code (for any code you
wrote) or gives proper attribution to any code/datasets you got from other sources (like the
Internet, for example). If you mostly used some existing code/software but needed to modify
a few files, then give attribution and mention the files you modified.
• a file called code.zip, containing your code. Please organize this file well (embrace the idea
of directories). In case you mostly used some existing code/software but needed to modify a
few files, then just provide those files here, and make sure your README.txt mentions those
files in relation to the existing code/software.
• any additional files that you need; in particular, this should include the training and test sets
for the second classification problem in case you came up with your own problem.
2 Problem-solving part - CSC 503 only
In this problem, we consider the gradient descent algorithm for a two-dimensional regression
problem. So, we have 2 continuous features x1 and x2, and the task is to predict a continuous target y.
Suppose that for a given input xi = (x1,i, x2,i)^T, we predict using hypotheses of the following form:
fw(xi) = w1 x1,i + w2 x2,i + w3 x1,i x2,i + b.
Assume that we have n training examples (x1, y1), . . . , (xn, yn). Suppose that we measure the
error according to the squared error with a further penalty on the squared Euclidean norm of
w = (w1, w2, w3)^T. Then, for a fixed, positive number λ, the training error can be written as

E(w, b) = Σ_{i=1}^n (fw(xi) − yi)^2 + λ‖w‖^2.
In the below, assume the learning rate (also called the step size) is some positive number η.
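(As a reminder of the generic form, which is standard and not specific to this assignment: a single
gradient descent step on any parameter θ of the training error E is θ ← θ − η ∂E/∂θ. The parts
below ask you to work out these partial derivatives for the model fw above.)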
(a) Derive the gradient descent update for b.
(b) Derive the gradient descent update for w1.
(c) Derive the gradient descent update for w3.
(d) Show the full gradient descent update for the vector w.
3 How you’ll be marked
• For each of CSC 503 and SENG 474, the total number of marks is 100, and you receive 10
marks for your code (in case you write code) together with the classification problem that
you design.
• For undergrads (SENG 474), the analysis (in report.pdf) is worth 90 marks.
• For grad students (CSC 503), the analysis (in report.pdf) is worth 80 marks and the
Problem-solving part is worth 10 marks.
A Cleaned version of processed.cleveland.data
In order to obtain the provided cleaned version of the data, I started from the original dataset
processed.cleveland.data, which can be found in the Data Folder from the link above. You
should not use this dataset!
In the original processed.cleveland.data dataset, 6 of the examples have one feature
with the value ‘?’. This is a missing value. There are various ways to deal with missing values,
but an easy way (when not many values are missing) is to simply discard those examples; this is
what I did. Also, the last attribute/feature (the label) takes values {0, 1, 2, 3, 4}. I formed a binary
classification task by:
• keeping the label 0 as is;
• grouping together the labels in the set {1, 2, 3, 4} by changing their label to 1. Note that all
of these non-zero labels indicate some level of heart disease.
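
For reference, the cleaning described above could be reproduced with a few lines of pandas. The
sketch below is not the exact script I used; it assumes the comma-separated format of
processed.cleveland.data, with no header row and the label in the last column.

import pandas as pd

# '?' marks missing values in the original file; read those as NaN.
df = pd.read_csv("processed.cleveland.data", header=None, na_values="?")

# Discard the few examples that have any missing feature value.
df = df.dropna()

# Binarize the label (last column): 0 stays 0, and {1, 2, 3, 4} become 1.
label_col = df.columns[-1]
df[label_col] = (df[label_col] > 0).astype(int)

df.to_csv("cleaned_processed.cleveland.data", header=False, index=False)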