COMP5310 Project Stage 1
Explore, clean, summarise and analyse the data
Due: 11:59PM on 5th of September 2024 (Week 6)
This assignment is worth 15% of the final mark of the unit of study.
DATASETS
For this assignment, each member of a group needs to work on a different dataset. We have provided three groups of datasets: Group A, Group B, and Group C.
Group A:
• Laptop Prices
• Player Scores
Group B:
• Credit Score
• Malware Detection
Group C:
• Card Fraud Detection
• Rental Prices
If your group has 2 members, one member must select a dataset from Group A, while the other member must select a dataset from Group B. Alternatively, if your group has 3 members, one member must select a dataset from Group A, another member must select a dataset from Group B, and the third member must select a dataset from Group C.
In Stage 1, each member will work on a different dataset, and then by the end of Stage 1, your group needs to agree on a single dataset to use for Stage 2 of the project.
GROUPS
This assignment is done in groups of 2 or 3. All students in a group must be attending the same lab session.
Note: there is work required from each member separately, but the project is handed in as a combined effort, and it is marked as a whole: there will be individual and group components to the marks, all based on the single submitted document.
Group formation procedure
In the Week 3 lab session there will be an opportunity to meet other students and form a group with help from the tutor. Students must be in project groups with others who are all timetabled in the same lab session.
Once you have formed a group and chosen the datasets you will work on for this assignment, you need to submit your unikeys and dataset choices by completing the following form: https://forms.office.com/r/GgcGsuDpvz
Important Notes:
• You can only submit the form ONCE. Ensure your unikeys and dataset selections are correct before submission.
• You will NOT be allowed to change the chosen datasets after submission, so choose wisely.
In Week 3 lab:
• Exchange names and contact information (e.g., which social media platforms you prefer for coordinating).
• Arrange when to get together: at least one meeting per week (in addition to your scheduled lab session) is vital, but more frequent coordination is even better.
Dispute resolution
If, during the course of the assignment work, there is a dispute among group members that you can’t resolve or that will impact your group’s capacity to complete the task well, you need to inform the unit coordinator (maryam.khaniannajafabadi@sydney.edu.au) or one of the TAs: daniela.rivasromero@sydney.edu.au, ewon6930@uni.sydney.edu.au, or [email protected]. Make sure that your email specifies the lab session and group name and is explicit about the difficulty; also make sure this email is copied to all group members (including anyone you are complaining about) and your lab tutor.
We need to know about problems in time to help fix them, so set early deadlines for group members and deal with non-performance promptly (don’t wait until a few days before the work is due to complain that someone is not delivering on their tasks). If necessary, the coordinator will split a group and leave anyone who didn’t participate effectively in a group by themselves (they will need to achieve all the outcomes on their own). This option is only available until Friday of Week 5, which is the last day that leaves time to resolve the issue before the due date. For any group issues that arise after this time, you will need to try to resolve the problem on your own, and you will continue to be treated as a single group whose members all get the same mark for this stage, based on whatever is submitted (though you should still let the coordinator, TAs and lab tutor know about the issues). In that case, groups may be changed after Stage 1 is finished.
PROJECT
Overview
The objective of Stage 1 of the project is to acquire and meticulously clean the dataset, followed by a comprehensive analysis to derive meaningful insights. Additionally, you will define a research question based on a research or business requirement, which you aim to answer in Stage 2.
DELIVERABLES
Report
Each individual section of the report must be at most 3 pages. The group section must be at most 2 pages for a group of 2, or 3 pages for a group of 3. You must use the high-level headings provided below to indicate the different sections and sub-sections of the report. You must use line spacing of at least 1.15pt, margins of at least 1.8cm, and body font size of at least 10pt. The goal is to convey the problem clearly and concisely.
The report must have a front page that gives: the group number and activity code, and the list of members involved (giving their SIDs AND unikeys, NOT their names).
The body of the report must have a structure as follows:
Individual Component
The report must begin with a section per group member (state the member’s unikey to identify each individual section). Each individual section must include:
1. Topic and research question: Provide a comprehensive and insightful description of the problem, highlighting the business/research need, clearly state your research question, and indicate some groups of stakeholders and how they could be helped by answering the research question.
2. Data description: Provide a description of the data, indicating the number of attributes and instances, and state the relevant metadata about this dataset, including a data dictionary which lists the attributes in your dataset, a description of each attribute, and the data type of each attribute (int, float, string, date, etc.).
Note: The data dictionary can be included as an appendix and will not be counted towards the page limit.
3. Data ingestion and cleaning: Describe the data ingestion and data quality assurance and cleaning process, including:
3.1. Data ingestion: Describe any data ingestion steps, indicating if you used a Pandas data frame or a database in PostgreSQL, and briefly describe the data structure or schema.
3.2. Data quality assurance and cleaning: Describe how you ensured data quality. If there were any quality problems, describe what they were and how you cleaned the data. Remember to justify your decisions: for example, if you decide to remove any rows with missing data, explain why you decided to do this and how your decision might impact data quality. Indicate which tools you used to ingest and clean the data, for example, which Python functions you used to clean your data.
Note: You don’t have to include the code on the report, as you will submit it separately.
4. Exploratory data analysis (EDA): Describe in detail any exploratory data analysis you performed which provided you with relevant information to answer your research question. This analysis must include TWO supporting figures and a detailed discussion of the results obtained, indicating what they tell you about your data and how these results could impact the modelling results in the next stage of the project. Do not include a matrix of figures covering multiple analyses of all attributes; you need to select and highlight the TWO most important results from your analysis.
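As an illustration of items 2 and 3 above, the following is a minimal sketch of ingesting a CSV into a Pandas data frame, drafting a data dictionary skeleton, and cleaning the data while recording what each step removed. The inline data, the column names (brand, ram_gb, price) and the cleaning choices are all invented for this example and are not taken from any of the provided datasets; which rows to drop or impute is a decision each member still has to make and justify.

```python
import io

import pandas as pd

# Hypothetical stand-in for a provided dataset (invented columns and values).
RAW_CSV = """brand,ram_gb,price
Acme,8,1200
Acme,8,1200
Bolt,16,
Core,n/a,2100
Dyne,32,3500
"""

# 3.1 Ingestion: load into a DataFrame, treating the string "n/a" as missing.
raw = pd.read_csv(io.StringIO(RAW_CSV), na_values=["n/a"])

# 2. Data dictionary skeleton: one row per attribute with its inferred dtype
# and missing-value count. Descriptions still have to be written by hand.
data_dictionary = pd.DataFrame({
    "attribute": list(raw.columns),
    "dtype": [str(t) for t in raw.dtypes],
    "missing": raw.isna().sum().to_numpy(),
    "description": "",
})

# 3.2 Cleaning: keep a count for every step so the report can justify it.
deduped = raw.drop_duplicates()
# Assumption for this example: price is the attribute of interest, so rows
# without it are dropped rather than imputed.
with_price = deduped.dropna(subset=["price"])
# Assumption for this example: missing ram_gb is filled with the median.
ram_median = with_price["ram_gb"].median()
clean = with_price.assign(ram_gb=with_price["ram_gb"].fillna(ram_median))
clean = clean.reset_index(drop=True)

cleaning_log = {
    "duplicates_removed": len(raw) - len(deduped),
    "rows_missing_price_removed": int(deduped["price"].isna().sum()),
    "ram_gb_values_imputed": int(with_price["ram_gb"].isna().sum()),
}

# Starting point for EDA: summary statistics of the cleaned numeric columns.
summary = clean.describe()
```

The cleaned frame could then be saved with clean.to_csv(path, index=False), and the counts in cleaning_log give concrete numbers to quote when explaining how each cleaning decision affects data quality.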
Group Component
Finally, you will need to include a group section at the end of the report, including:
1. Discussion: Discuss your thoughts on the strengths and limitations of each dataset, for the purpose of investigating the question of interest. Discuss and critically analyse the exploratory data analysis performed in each individual section, highlighting the strengths and limitations of each approach.
2. Conclusion: Summarise the most important outcomes from the exploratory data analysis performed by all members of the group, and include a recommendation, with reasons, on which dataset to use for the next stage of the project.
Note: The different report sections and sub-sections are aligned with the marking rubric. Therefore, please include only the requested contents and do not mix or merge the sections, as this will interfere with the marking process. For example, don’t include the data cleaning steps under the exploratory data analysis section; they must be included in the data cleaning section. Content placed in the wrong section won’t be considered for the marking.
Code and Dataset
You must also submit the Python code used in this assignment as a single zip or tar.gz folder. This compressed folder must contain one subfolder for each member of the group, named using their unikey. These subfolders must include the following:
1. The Jupyter notebook with the code each member used to perform their work, named “unikey_A1_Code.ipynb”.
2. The final clean dataset in CSV format, named “unikey_A1_CleanDataset.csv”.
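The folder layout described above can be checked and packaged with a short script. The sketch below assumes that "unikey" in the file names is a placeholder to be replaced by each member's actual unikey; the function names expected_files and build_submission are made up for this example.

```python
import zipfile
from pathlib import Path


def expected_files(unikey: str) -> list[str]:
    """Return the two file names requested for one group member."""
    return [f"{unikey}_A1_Code.ipynb", f"{unikey}_A1_CleanDataset.csv"]


def build_submission(root: Path, unikeys: list[str], archive: Path) -> list[str]:
    """Zip one subfolder per unikey, checking first that no file is missing."""
    # Collect every (source path, path inside the archive) pair up front so
    # that nothing is written if any required file is absent.
    entries = []
    for unikey in unikeys:
        for name in expected_files(unikey):
            source = root / unikey / name
            if not source.is_file():
                raise FileNotFoundError(f"missing required file: {source}")
            entries.append((source, f"{unikey}/{name}"))

    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for source, arcname in entries:
            zf.write(source, arcname)
    return [arcname for _, arcname in entries]
```

For a group whose work sits in folders work/abcd1234 and work/efgh5678 (hypothetical unikeys), build_submission(Path("work"), ["abcd1234", "efgh5678"], Path("submission.zip")) would produce an archive with one subfolder per member.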
MARKING
Marking Criteria                      Marks
Individual Component
  Topic and research question         1
  Data description                    1
  Data quality and cleaning           3
  Exploratory data analysis           3
  Supporting figures                  1
  Code quality                        1
Group Component
  Discussion                          3
  Conclusion                          1
  Report format and presentation      1
TOTAL                                 15
Deductions
• 1 mark will be deducted if your section of the report exceeds the maximum number of pages. If the group section exceeds the maximum number of pages, the deduction will apply to all group members.
• 5% of the maximum awardable mark will be deducted per day of late submission. Zero marks will be awarded after 10 calendar days from the due date.
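The late deduction can be worked through as follows. With a maximum of 15 marks, 5% per day corresponds to 0.75 marks per day late. The sketch below rests on several assumptions that the brief does not spell out: any part of a day counts as a full day, the deduction is subtracted from the mark the work would otherwise earn, the result cannot go below zero, and a submission exactly 10 days late is still penalised rather than zeroed. The function name is made up for this example.

```python
import math

MAX_MARK = 15.0     # total marks for Stage 1
DAILY_RATE = 0.05   # 5% of the maximum awardable mark per day late
CUTOFF_DAYS = 10    # zero marks after this many calendar days


def late_adjusted_mark(raw_mark: float, days_late: float) -> float:
    """Apply the late deduction under the assumptions stated above."""
    if days_late <= 0:
        return raw_mark
    # Assumption: a partial day late is charged as a whole day.
    whole_days = math.ceil(days_late)
    if whole_days > CUTOFF_DAYS:
        return 0.0
    deduction = DAILY_RATE * MAX_MARK * whole_days
    # Assumption: the adjusted mark is floored at zero.
    return max(0.0, raw_mark - deduction)
```

Under these assumptions, a report that would have earned 12 marks and is submitted 2 days late loses 2 × 0.75 = 1.5 marks and receives 10.5.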