

DATA3888 (2024): Assignment 1

Instructions

1. Your assignment submission needs to be an HTML document that you have compiled using R Markdown or Quarto. Name your file “SIDXXX_Assignment.html”, where XXX is your Student ID.

2. Under author, put your Student ID at the top of the Rmd file (NOT your name).

3. For your assignment, please use set.seed(3888) at the start of each chunk (where required).

4. Do not upload the code file (i.e. the Rmd or qmd file).

5. You must use code folding so that the marker can inspect your code where required.

6. Your assignment should make sense and provide all the relevant information in the text when the code is hidden. Don’t rely on the marker to understand your code.

7. Any output that you include needs to be explained in the text of the document. If your code chunk generates unnecessary output, please suppress it by specifying chunk options like message = FALSE.

8. Start each of the 3 questions in a separate section. The parts of each question should be in the same section.

9. You may be penalised for excessive or poorly formatted output.

Question 1: Reef

Between 2014 and 2017, marine scientists recorded an unprecedented global coral bleaching event. Your friend Farhan is a marine science expert who wants to study the environmental variables that may have triggered this event. To do this, we will use a public dataset, curated by Sally and colleagues. This dataset records coral bleaching events at 3351 locations in 81 countries from 1998 to 2017 with a suite of environmental and temperature metrics. The data is in the file Reef_Check_with_cortad_variables_with_annual_rate_of_SST_change.csv and the full description of the variables can be found in the supplementary table of the study.

Part (a)

Farhan has noticed that, on average, the North of Australia experienced higher levels of coral bleaching than the South during the global bleaching event from 2014-2017. In the paper, the authors find that the following variables are associated with the probability of coral bleaching.

• TSA_Frequency_Standard_Deviation

• Temperature_Mean

• TSA_Frequency

• Temperature_Kelvin_Standard_Deviation

• TSA_DHW_Standard_Deviation

• SSTA_Frequency_Standard_Deviation

Create one informative graphic to visualise how these six variables differ between the North and South of Australia during the 2014-2017 global coral bleaching event. Explain any data filtering or transformation that you perform. Comment on the visualisation and suggest at least one variable that appears to be different between the North and the South and thus may be associated with the higher levels of bleaching observed in the North.

Note: the midpoint of Australia is located at -23 degrees latitude. Observations higher than -23 degrees latitude are considered North Australia. Your graphic can have multiple panels.
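The filtering described in the note can be sketched in base R as follows. This is only an illustration on a toy data frame: the column names Year and Latitude.Degrees are assumptions and should be checked against the header of the real CSV file, and restricting the data to Australian sites is a further step not shown here.

```r
# Toy stand-in for the Reef Check data. The column names Year and
# Latitude.Degrees are ASSUMPTIONS, not confirmed names from the real file.
reef <- data.frame(
  Year = c(2013, 2015, 2016, 2017, 2016),
  Latitude.Degrees = c(-18.2, -16.5, -27.4, -23.0, -33.9),
  Temperature_Mean = c(299.1, 300.4, 296.8, 298.0, 294.2)
)

# Keep only surveys from the 2014-2017 bleaching event.
event <- reef[reef$Year >= 2014 & reef$Year <= 2017, ]

# Latitudes higher than -23 are labelled North; all others South.
event$Region <- ifelse(event$Latitude.Degrees > -23, "North", "South")

table(event$Region)
```

A site at exactly -23 degrees falls in the South under this rule, because the note only assigns observations strictly higher than -23 to the North.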

Part (b)

Farhan is interested in exploring which reefs were the most affected by the 2014-2017 global bleaching event, across the globe. Create an interactive map visualisation to show the average proportion of coral bleaching between 2014-2017, that allows a marine scientist to identify the names of the most affected coral reefs, the region (recorded as State.Province.Island) and the values of the measurements of the associated environmental variables identified in part (a). Justify your choice of visualisation, and comment on the result. List 4 regions that were severely bleached in this time period.
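Before mapping, the surveys need to be averaged per reef over the event years. A minimal base R sketch of that aggregation step is given below on made-up data. The column names Reef.Name and Average_bleaching are assumptions for illustration only, and the map itself is not shown.

```r
# Made-up survey records. Reef.Name and Average_bleaching are ASSUMED
# column names, to be checked against the real data set.
surveys <- data.frame(
  Reef.Name = c("Reef A", "Reef A", "Reef B", "Reef B", "Reef B"),
  Year = c(2015, 2016, 2013, 2016, 2017),
  Average_bleaching = c(40, 60, 5, 10, 30)
)

# Restrict to the 2014-2017 event, then average bleaching per reef.
event <- surveys[surveys$Year >= 2014 & surveys$Year <= 2017, ]
reef_means <- aggregate(Average_bleaching ~ Reef.Name, data = event, FUN = mean)

# Sort so that the most affected reefs come first.
reef_means <- reef_means[order(-reef_means$Average_bleaching), ]
reef_means
```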

Part (c)

Farhan wants to explore the impact of environmental variables on coral bleaching in the most affected regions. For the regions identified in part (b), create one informative visualisation to show how the average bleaching has changed over time (not restricted to 2014-2017), and its relationship with one of the associated environmental variables identified in part (a). Comment on the visualisation.

Note: your graphic can have multiple panels.

Question 2: Kidney

Your friend Harry is a nephrologist (kidney specialist) who is interested in building an accurate classifier to detect graft rejection in his kidney transplant patients. He is also interested in knowing which genes may be affecting graft rejection. In this problem, we will build a classification model using the public data set GSE138043. We will perform feature selection and build a classifier, estimating its accuracy on unseen data.

Part (a)

Harry wants to know the most differentially expressed genes between patients that experience graft rejection and stable patients. Use the topTable function in the limma package to output the gene symbols of the 10 most differentially expressed genes.

Hint: in the GSE138043 dataset, the outcome is found in the characteristics_ch1 column of the phenoData and the gene symbols are found in the gene_assignment column of the featureData, between the first and second // symbols.
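The string handling in the hint can be sketched in base R as below. The example strings are invented to mimic an "accession // symbol // description" layout and are not taken from GSE138043. Treating entries without a // separator as missing is also an assumption.

```r
# Invented annotation strings, NOT real entries from the data set.
gene_assignment <- c(
  "NM_000001 // GENE1 // hypothetical gene one // 1p36 // 101",
  "NR_000002 // GENE2 // hypothetical gene two // 2q11 // 202",
  "---"
)

# Split each string on the literal separator " // ".
parts <- strsplit(gene_assignment, " // ", fixed = TRUE)

# The symbol is the second piece, i.e. the text between the first and
# second separators. Entries with no separator become NA.
symbols <- vapply(
  parts,
  function(p) if (length(p) >= 2) p[2] else NA_character_,
  character(1)
)
symbols
```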

Part (b)

Harry wants to build a random forest classifier to predict whether a patient is stable or experiencing graft rejection and estimate its accuracy on unseen data. To do this, Harry tries to perform repeated cross-validation on the entire data set, but it takes too long to run. To speed up the model training, Harry knows he can implement feature selection in one of 3 parts of the framework given in the Question 2 Part (b) appendix (OPTION A, OPTION B, or OPTION C); however, he is not sure which one.

Explain the difference (if any) between the 3 options and which option(s) would be the most appropriate for Harry’s task.

Part (c)

Harry wants to implement feature selection in the most appropriate option of Part (b), but he’s not sure how many features he should select. Use the framework from part (b) to evaluate the performance of a random forest classifier on unseen data with feature selection taking the top 10, 50 or 100 genes. Visualise your results and comment on them. How many features would you recommend Harry to use?

Hint: if implemented correctly, this code should take no more than a few minutes to run.

Part (d)

Using the optimal number of features found in part (c), build a random forest classifier on the entire training data set that Harry could implement on future data. Harry wants to know which genes are the most important in making the classification. Output the gene symbols of the top 10 genes in terms of importance in the random forest classifier. Comment on the overlap between the top 10 important genes in the classifier and the top 10 differentially expressed genes (if your final model only uses 10 genes, comment on the concordance in ranking of the 10 genes).

Hint: in a random forest model fit, the feature importance can be obtained by fit$importance, where a higher value indicates higher importance in the classifier.

Question 2 Part (b) appendix

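# Packages this template relies on: Biobase (exprs, pData),
# cvTools (cvFolds) and randomForest.
# `gse` is assumed to be the GSE138043 ExpressionSet, loaded beforehand.
library(Biobase)
library(cvTools)
library(randomForest)

set.seed(3888)

X = t(exprs(gse))  # samples in rows, genes in columns
y = ifelse(grepl("non-AR", pData(gse)$characteristics_ch1), "Stable", "Rejection")

cvK = 5     # number of cross-validation folds
n_sim = 50  # number of cross-validation repeats
cv_accuracy_gse1b = numeric(n_sim)

### OPTION A ###

for (i in 1:n_sim) {
  cvSets = cvFolds(nrow(X), cvK)  # new random fold assignment for each repeat
  cv_accuracy_folds = numeric(cvK)

  ### OPTION B ###

  for (j in 1:cvK) {
    test_id = cvSets$subsets[cvSets$which == j]
    X_train = X[-test_id,]
    X_test = X[test_id,]
    y_train = y[-test_id]
    y_test = y[test_id]

    ### OPTION C ###

    rf_fit = randomForest(x = X_train, y = as.factor(y_train))
    predictions = predict(rf_fit, X_test)
    cv_accuracy_folds[j] = mean(y_test == predictions)
  }

  cv_accuracy_gse1b[i] = mean(cv_accuracy_folds)  # mean accuracy of this repeat
}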

Question 3: Brain

Your friend Shila is a physicist who needs your help in building a classifier to detect left and right eye movements from brain EEG signals in real time. She has a data set stored under zoe_spiker.zip that contains brain signal series (each series is a file) which correspond to sequences of eye movements of varying lengths.

The file name corresponds to the true eye movement. For example, the file LRL_z.wav corresponds to left-right-left eye movements; the file LLRLRLRL_z.wav corresponds to left-left-right-left-right-left-right-left eye movements. There are a total of 31 files.
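Since the true sequence is encoded in the file name, it can be recovered with a small string operation. The sketch below uses only the two file names quoted above and assumes that every file follows the same "sequence, underscore, suffix" pattern.

```r
# The two file names mentioned in the text.
files <- c("LRL_z.wav", "LLRLRLRL_z.wav")

# Drop everything from the first underscore onwards to keep the L/R sequence.
# ASSUMPTION: all files follow the <sequence>_<suffix>.wav pattern.
true_labels <- sub("_.*$", "", basename(files))

# Sequence length, useful when comparing accuracy across lengths.
seq_lengths <- nchar(true_labels)

data.frame(file = files, truth = true_labels, length = seq_lengths)
```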

The folder also contains two RDS files which may be used to train an event detection classifier (training_data.rds, training_labels.rds).

Part (a)

The first stage of our classifier is to identify events (eye movements). Shila has provided some training data (training_data.rds) which corresponds to waves, and labels (training_labels.rds) where TRUE represents the presence of an event and FALSE represents no event. Use the tsfeatures package to calculate some autocorrelation features and build a random forest classifier to detect events.

Report and comment on the accuracy of this model.

Hint: use tsfeatures(training_data, c("acf_features")) to compute the autocorrelation features from training_data. In a random forest model fit, the confusion matrix of out-of-bag predictions can be obtained by fit$confusion. In a random forest classifier, the out-of-bag predictions can be treated as the predictions on an independent data set.
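As an illustration of how an accuracy can be read off such a confusion matrix, the sketch below builds a toy matrix by hand in the layout that randomForest uses for fit$confusion (true classes in rows, predicted classes in columns, plus a class.error column). The counts are invented and no model is fitted here.

```r
# Toy matrix in the layout of fit$confusion. The counts are INVENTED.
confusion <- matrix(
  c(40,  5, 5 / 45,
     3, 52, 3 / 55),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("FALSE", "TRUE"), c("FALSE", "TRUE", "class.error"))
)

# Drop the class.error column so only the counts remain.
counts <- confusion[, colnames(confusion) != "class.error"]

# Out-of-bag accuracy: correctly classified cases over all cases.
oob_accuracy <- sum(diag(counts)) / sum(counts)
oob_accuracy
```

The class.error column gives the per-class error rates, which are worth reporting alongside the overall accuracy when the classes are unbalanced.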

Part (b)

Build a classification rule for detecting {L,R} under a streaming condition, using the trained Random Forest model from part (a) in a window to identify events, and using the min-max rule to classify events into “Left” or “Right” (Lab 3 Exercise 2.3). Demonstrate your classifier on a length 3, a length 8 and a long wave file (note that the result should be reasonable, but doesn’t have to be good). You may use the code template in the Question 3 Part (b) appendix.
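One way to write a min-max rule is to compare the positions of the maximum and the minimum within the event window. The sketch below is an assumption about the form of the rule: in particular, mapping "maximum before minimum" to "L" is a guess, and the direction should be checked against Lab 3 Exercise 2.3 and swapped if needed.

```r
# ASSUMPTION: a peak that occurs before the trough is labelled "L",
# otherwise "R". The lab may define the opposite direction.
min_max_rule <- function(window) {
  if (which.max(window) < which.min(window)) "L" else "R"
}

# Two small made-up windows.
min_max_rule(c(0, 5, 1, -4, 0))  # peak first
min_max_rule(c(0, -4, 1, 5, 0))  # trough first
```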

Part (c)

Shila thinks multiple window sizes must be evaluated to find the best Random Forest streaming classifier.

Compare the performance of the Random Forest streaming classifier for detecting {L,R} under a streaming condition, using multiple window sizes. Use the short wave files to evaluate performance. Which window size gives the best performance? Justify your answer with appropriate visualisations.

Hint: you may use the Levenshtein similarity metric to evaluate the accuracy of your predictions. This can be computed via stringdist::stringsim, with method set to "lv".
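To make the metric concrete, the sketch below computes a Levenshtein similarity with base R only, using utils::adist in place of the stringdist package. The helper lv_similarity is hypothetical, and the normalisation (one minus the edit distance divided by the length of the longer string) is assumed to match what stringsim does for method "lv".

```r
# Hypothetical helper: Levenshtein similarity between a predicted and a
# true L/R sequence, scaled to [0, 1] where 1 means identical strings.
lv_similarity <- function(predicted, truth) {
  longest <- max(nchar(predicted), nchar(truth))
  if (longest == 0) return(1)  # two empty strings are treated as identical
  1 - as.numeric(utils::adist(predicted, truth)) / longest
}

lv_similarity("LRL", "LRL")   # identical sequences
lv_similarity("LLL", "LRL")   # one substitution out of three characters
lv_similarity("LRLR", "LRL")  # one extra predicted event
```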

The increment of your window should always be 1/3 of the window size.

increment = window_size/3

Part (d)

Shila’s friend Jean thinks a zero-crossing classification rule will perform just as well as the Random Forest classifier.

Build a classification rule for detecting {L,R} under a streaming condition, using the number of zero-crossings in a window to identify events (from Lab 3 Exercise 1.3), and using the min-max rule to classify events into “Left” or “Right” (Lab 3 Exercise 2.3). You may use any window size that gives reasonable performance.
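Counting zero-crossings amounts to counting sign changes between consecutive samples. The sketch below shows one possible version. The helper names are hypothetical, and flagging an event when the count falls below the threshold is an assumption that should be checked against Lab 3 Exercise 1.3.

```r
# Hypothetical helper: number of sign changes between consecutive samples.
count_zero_crossings <- function(window) {
  sum(window[-length(window)] * window[-1] < 0)
}

# ASSUMPTION: an event is flagged when the window has fewer zero-crossings
# than the threshold (a large, slow deflection rather than noise).
is_event <- function(window, threshold) {
  count_zero_crossings(window) < threshold
}

count_zero_crossings(c(1, -1, 1, -1))          # noisy window
count_zero_crossings(c(1, 2, 3, -1, -2))       # one slow swing
is_event(c(1, 2, 3, -1, -2), threshold = 2)
```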

Jean also thinks multiple thresholds must be evaluated to find the best zero-crossings classification rule.

Compare the performance of the zero-crossings classification rule using multiple thresholds on the short wave files. Which threshold gives the best performance? Justify your answer with appropriate visualisations.

Part (e)

For both of the best models that you found in part (c) and part (d), evaluate their performance on sequences of varying lengths. Does the length of the sequence have an impact on the classification accuracy? Which classifier performs the best on this data set, and why might you choose one over the other? Justify your answer with appropriate visualisations.

Question 3 Part (b) appendix

# Template: wave_file is assumed to be a tuneR Wave object
# (slots @left and @samp.rate). The two <...> placeholders are to be filled in.
ts_features_classifier = function(wave_file,
                                  window_multiplier = 1) {
  window_size = wave_file@samp.rate*window_multiplier
  increment = window_size/3
  Y = wave_file@left
  xtime = seq_len(length(Y))/wave_file@samp.rate
  predicted_labels = c()
  window_lb = 1
  max_time = length(Y)

  while(max_time > window_lb + window_size) {
    window_ub = window_lb + window_size
    window = Y[window_lb:window_ub]
    event = <event detection>
    if (event) {
      predicted = <LR prediction>
      predicted_labels = c(predicted_labels, predicted)
      window_lb = window_lb + window_size  # skip past the detected event
    } else {
      window_lb = window_lb + increment    # slide the window forward
    }
  }

  return(paste(predicted_labels, collapse = ""))
}