代写COMP4139 Machine Learning Assignment 2代做留学生Python程序
- 首页 >> DatabaseCOMP4139 Machine Learning
Assignment 2
Machine Learning for Breast Cancer Treatment
Response Prediction
1. Introduction
This assignment assesses your practical skills in applying machine learning methods to a real-world problem. The implementation will be based on Python and third-party Machine Learning libraries. Same as assignment 1, you must work in the same group and submit your work by 12th December 2025 at 3 pm UK time on Moodle by member 1 of each group. You can split and distribute the work to individual members, but each individual is expected to understand every aspect of the work.
2. Background
Breast cancer is the most common cancer in the UK for women. Chemotherapy is a commonly used treatment strategy to reduce the size of locally advanced tumours before surgery. However, chemotherapy is a toxic process to the human body and it is not always effective for everyone. Complete tumour resolution at surgery, known as pathological complete response (PCR), has a high likelihood of achieving a cure and longer relapse-free survival (RFS) time. RFS is the length of time after primary treatment for cancer ends that the patient survives without any signs or symptoms of that cancer. However, only 25% of patients receiving chemotherapy will achieve a PCR, with the remaining 75% having residual disease and a range of prognosis. Better patient stratification and treatment could be achieved if PCR and RFS could be predicted using information prior to chemotherapy treatment.
3. Aim
You are asked to use advanced machine learning methods to predict PCR (classification) and RFS (regression) using both clinically measured features and features derived from magnetic resonance images (MRI) prior to chemotherapy treatment.
4. Data
Based on the public dataset from The American College of Radiology Imaging Network (I-SPY 2 TRIAL), a simplified dataset is generated for this assignment.
Each patient in this dataset contains 11 clinical features (Age, ER, PgG, HER2, TrippleNegative Status, Chemotherapy Grade, Tumour Proliferation, Histology Type, Lymph node Status, Tumour Stage and Gene) and 107 MRI-based features. The image-based features were extracted from the tumour region of MRIs using a radiomics feature extraction package (known as Pyradiomics: https://pyradiomics.readthedocs.io/en/latest/ ). You do not need to understand the meaning of these clinical features and image-based features to complete this assignment but worth reading background information on the I-SPY 2 Trial website. “999” in the spreadsheet means a missing data value. A training dataset (trainDataset.xls) is provided and available on Moodle that contains 400 patients. A test dataset that contains N patients is reserved (hidden from you) for the final performance evaluation. You can assume that the test set and training set are sampled from the same data distribution, but the ratio of PCR positive and negative could be different.
5. Implementation Requirement
You are asked to build a machine-learning model for each of the PCR (classification) and RFS (regression) predictions. You need to consider and implement methods for data pre-processing (e.g. how to handle missing data, outlier, normalisation, etc, if needed), data imputation, feature selection, machine learning modelling, hyperparameter tuning (if applicable) and method evaluation. There is no restriction or requirement for the selection of methods. However, you will likely need to compare several methods to pick the best one with the best parameter setting. When you perform. feature selection, ER, HER2 and Gene are very important features that must be retained and used in the modelling process.
Your code will be finally tested on a reserved test set after your code is submitted. An example test file is provided (testDatasetExample.xls) that only contains 3 examples. It is your responsibility to ensure your code can run on a test file in a similar format but contains more patients. You must name your final test code “FinalTestPCR.py” or “FinalTestPCR.ipynb” for PCR prediction, and “FinalTestRFS.py” or “FinalTestRFS.ipynb” for RFS prediction so that they can be tested on the test dataset. The code for method development needs to be in a separate file, not in the “FinalTestXXX” file.
The test set will be released on 11th December 2025 at 9 am and you need to run your code to produce the predictions for the test set and submit on Moodle by 12th December 2025 at 3 pm together with other deliverables (section 7). One spreadsheet for PCR and one for RFS must be generated to store the prediction outcome. The output files must be a spreadsheet (.csv) that contains the predicted outcome for each tested patient (i.e. the first column is the patient ID, and the second column is either the predicted PCR or RFS outcome). Name the files: PCRPrediction.csv and RFSPrediction.csv. Balanced classification accuracy will be used to evaluate PCR prediction. Mean Absolute Error will be used to evaluate RFS estimation.
All implementations need to use Python programming language. Any machine learning libraries are allowed (e.g. Scikit-learn, Scipy, Pandas, Tensorflow, Pytorch, etc.). Grid search for automatic hyperparameter tuning is allowed. However, any autoML based package or Large Language Models
(e.g. ChaptGPT or other methods that accept the raw data and automatically select the best ML method and optimise the parameter for you) are NOT allowed.
6. Assessment
Assignment 2 weighs 80% of the coursework mark (i.e. 24% of the whole course mark). The marking will be performed based on the objective performance on the test set, the quality of code and the quality of technical writing. The marking criteria are provided in section 8. A single mark and feedback will be given to each group. The final mark for individual students will be calculated based on the contribution table described in section 7.
7. Deliverables
For the completion of Assignment 2, the following have to be submitted on Moodle. One report (.pdf) and one zipped code file need to be submitted per group.
1. The Python code for implementing the two tasks (PCR and RFS prediction). Besides the code for method development, two files “FinalTestPCR” and “FinalTestRFS” must be included for testing the test set. The two .csv files for PCR and RFS predictions of the test set should also be included in the code folder (note: the test set will be released on 11th December 9 am on Moodle).
2. A report in the format of an IEEE conference paper. Technical paper writing will be introduced in one of the lectures. A template of the required format will be provided in Word and Latex. Based on the given format, a maximum of 4 pages is allowed, excluding references (references can be on the 5th page).
3. At the end of the paper (excluded from the 4 pages), the following contribution table needs to be completed and agreed upon by all members, which will be used to calculate individual student’s final marks.
|
Task and Weighting |
Data pre- processin g (10%) |
Feature Selection (25%) |
ML method development (25%) |
Method Evaluation (10%) |
Report Writing (30%) |
|
Name of member 1 |
30% |
15% |
20% |
20% |
20% |
|
Name of member 2 |
0% |
25% |
30% |
0% |
20% |
|
Name of member 3 |
30% |
20% |
20% |
10% |
20% |
|
Name of member 4 |
0% |
10% |
30% |
30% |
20% |
|
Name of member 5 |
40% |
30% |
0% |
40% |
20% |
The percentage of contribution in the above table is an example, which will be different for each group depending on the true contribution of each member. However, the task names and their weighting highlighted in red in the table should NOT be changed, and the sum of the contributions from all members for each task (i.e. each column) should be 100%. Note that each student can contribute to multiple tasks and each task can involve multiple students.
Besides the report and code required, you also need to submit a recorded video presentation to present your work as a group. The content of the presentation should cover background, a literature review on existing solutions, proposed method, evaluation results and conclusions & discussion. The presentation should be less than 10 minutes and involve all group members (preparing the slides, presenting, or both). Save the video in .mp4 format and submit it on Moodle (file size should be less than 250MB).
8. Marking Criteria
|
Elements |
% mark |
|
Performance on test set (objective) |
25% |
|
Code quality (e.g. comments, easy to read, robustness, etc) |
10% |
|
Description of Method |
25% |
|
Explanation and presentation of the results obtained |
10% |
|
Discussion of the strengths and weaknesses of the chosen method |
10% |
|
Scientific writing and clarity |
10% |
|
Presentation |
10% |
Plagiarism check will apply, meaning that high similarities across different groups are not expected. Late submissions in each assignment will result in a 5% penalty per day (days rounded up to the next integer).
9. Common Q&As
- What is the performance of each task we are expecting to achieve?
It is a real-world dataset for a challenging clinical task, hence I don’t have an estimation of performance. However, a >90% classification accuracy is too good to be true for this task. For the RFS estimation is even more challenging. The performances are expected to vary across groups. You need to consider practical issues, including missing data in both training and testing sets, data imbalance issues, etc. You have the freedom to use any machine learning methods that are not restricted to the methods introduced in the lectures.
- Why don’t we use an anonymised peer-assessment form to score the contribution of each member?
Anonymised peer-assessment form. was used in previous years. Occasionally, members can not settle on an agreed distribution and it may involve several rounds of interviews to decide the final percentage of contribution. Hence, it is changed to a more transparent and quantitative contribution table.
You should split the tasks and agree on the percentage of contributions before starting the assignment, then add/reduce the percentage depending on the final delivery and quality of completion by each member. Therefore, no surprises when you see your individual mark. Remember that each group is a team rather than individual competitors. An ideal case for a group of 5 students is that each member contributes to ~20%, but I don’t expect it to happen for all groups. Please split the tasks depending on your group experience learned from Assignment 1. The highest mark a member can get is the group mark, which is based on the quality of the work. Hence marking down the contributions of other members won’t get the top performer a higher mark. So help each other rather than kill each other.
