COMP723 Data Mining and Knowledge Engineering
Assignment 2 – Text Classification (50%)
1 Objective
To develop a broad understanding of text mining by performing a representative task, Text Classification.
2 Collaborative Learning Requirement
As part of the assignment you are required to demonstrate a SUSTAINED collaborative approach to completing the assignment in pairs. This requires you to find a partner to work with in the 7th week of the semester and register in one of the groups created for collaboration on Blackboard. After this you will use the group’s discussion board, email and other tools to communicate, discuss, strategize, distribute work and share documents in order to produce the final assignment. This forum will be used to evaluate your contribution and the research work you carried out for the assignment. Note that your activity is time stamped, and this will be used as evidence of sustained, collaborative learning.
The first page of your report should include a half-page summary of the activities of each partner through the second half of the semester, up to the submission of the assignment, as recorded on the Blackboard Discussion Board. This should have the title “Contributions – group name”.
3 Task Specification
This assignment requires you to extend your data mining skills and knowledge from the structured context to unstructured context, where the items to be classified are “free text” snippets. You are required to use a chosen text mining tool to train two different classification algorithms on the given dataset, analyse the results, and present a report of your findings.
3.1 Due Dates and Submission
This assignment is to be done in pairs. The report should clearly state the name and student ID of both members of the team. Furthermore, the contributions made by each team member must be clearly stated in the section “Contributions – group name” at the beginning of the report.
The written part of your assignment is due on 26 October at midnight.
You are required to submit only an electronic copy of the assignment via the Turnitin assignment Submission tab (on the course homepage) on Blackboard. Only one member from the pair needs to submit the assignment.
3.2 Marking
This assignment will be marked out of 100 marks and is worth 50% of the overall mark for the paper.
To pass this module you must pass each assessment separately, and gain at least 50% in total. The minimum pass mark for this assignment is 40%.
4 Assignment Details
The objective of the assignment is to classify text into two categories. The tasks described are generic to text classification so that you are able to use R, Weka or Python to get the results. The class labs will all be done using Python, so if you want to use either R or Weka, you will have to learn the tools on your own.
4.1 Dataset
The data set to be used for this assignment is available from Blackboard as part of the assignment package. The data set is a large corpus of emails organised into 5 folders named enron1, enron2, enron3, enron4 and enron5. Each of these folders contains two folders named “ham” and “spam” containing emails belonging to each of the two categories. The package also contains two papers which give you background on the dataset and examples of its use for text classification with Naïve Bayes and Support Vector Machines. You will need to acknowledge the use of this dataset appropriately in your report.
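For those working in Python, the folder layout described above could be read with something like the following. This is only a minimal sketch: the function name load_enron, the (folder, label, text) record shape and the latin-1 encoding are assumptions made for illustration, not part of the assignment package.

```python
from pathlib import Path


def load_enron(root, folders=("enron1", "enron2", "enron3", "enron4", "enron5")):
    """Return a list of (folder, label, text) tuples for every email under root.

    Assumes the layout root/<folder>/<ham|spam>/<one file per email>.
    """
    records = []
    for folder in folders:
        for label in ("ham", "spam"):
            label_dir = Path(root) / folder / label
            if not label_dir.is_dir():
                continue  # skip a missing folder rather than fail
            for path in sorted(label_dir.iterdir()):
                if path.is_file():
                    # The encoding is an assumption; latin-1 decodes any byte sequence.
                    text = path.read_text(encoding="latin-1")
                    records.append((folder, label, text))
    return records
```

Keeping the source folder name in each record makes it possible to slice the data by folder later without re-reading the files.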
4.2 Assignment Tasks
1. Find a partner to work with and enrol in a group on Blackboard. This should be done in week 7 of the semester.
2. Download the zipped file containing the dataset from Blackboard under the Assignment 2 folder. Unzip it into a working folder which you will use for this assignment. The zipped file contains a total of 5 folders as described above. The files represent 5 sets of data consisting of emails classified into ham and spam.
3. Choose whether you want to use Python, R or Weka for this project.
4. The objective of this assignment is to determine whether your chosen classifier is able to perform better than a benchmark model based on Naïve Bayes. To do this you can use any combination of the pre-processing tasks to build features for the two machine learning algorithms. The pre-processing does not need to be the same for the two algorithms.
5. In order to produce a valid conclusion for the research question in (4) above, you should test by slicing the data for your experiments in the following 2 ways:
I. Conflate the data from the 5 folders into one dataset. Then split the conflated dataset into a 70% training set and a 30% test set while maintaining the ham:spam ratio. Use these for training and testing both algorithms.
II. Use enron1, enron3 and enron5 for training and enron2 and enron4 for testing. Use these for training and testing both algorithms.
6. To prove or disprove the hypothesis in (4) above you should also experiment with various forms and combinations of features as covered in lectures and in your own online research. Your strategies and decisions should be backed by systematic testing, and hence a rationale, and this should be discussed within the group using the Group Discussion Board tools so that it can be evaluated.
7. You should report all performances in terms of Precision, Recall and F-values.
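The two ways of slicing the data and the required metrics can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not a prescribed implementation: the records are assumed to be (folder, label, text) tuples, the function names are hypothetical, “spam” is treated as the positive class, and the F-value is taken to be F1.

```python
import random
from collections import defaultdict


def stratified_split(records, train_fraction=0.7, seed=0):
    """Split records so each label keeps (approximately) the same train/test ratio."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record[1]].append(record)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


def folder_split(records,
                 train_folders=("enron1", "enron3", "enron5"),
                 test_folders=("enron2", "enron4")):
    """Split records by the folder they came from."""
    train = [r for r in records if r[0] in train_folders]
    test = [r for r in records if r[0] in test_folders]
    return train, test


def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Compute precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Splitting each label separately is what preserves the ham:spam ratio in the 70/30 split. If you use a library instead, check that its split function offers an equivalent stratification option.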
4.2.1 Written Report
You will write a report of between 6 and 12 pages (excluding the references and appendix) describing the results of your experiment.
You are required to write a coherent report describing all aspects of the experiment as an attempt to prove or disprove the hypothesis. Any screenshots or large result outputs that do not directly contribute to your argument should be included in the appendix, rather than as part of the main report.
You are also required to submit well documented code as part of the appendix.
You are not required to have a table of contents or executive summary for this report.
There is no fixed format for the report. You can format it close to an academic paper containing the usual sections such as Abstract, Introduction, Data Description, Results, Discussion, Conclusion and a bibliography.
As a minimum your report should contain a discussion of the following points:
1. A brief introductory discussion of applications of text classification.
2. A description of the dataset and its characteristics.
3. A discussion of the similarities and differences between the two classifiers that you are comparing, as applicable to text classification.
4. The differences in the manner in which classifiers are applied in a structured data scenario (such as what you did for the first assignment) and a non-structured text mining scenario.
5. Presentation and discussion of the results obtained. You should use the correct evaluation metrics in your discussion. This part of your write-up should include:
   - The effect of the variations of the dataset used.
   - Your perception of the possible rationale for doing the tasks.
   - A thorough discussion of the comparison of the results leading to the conclusion that answers the hypothesis.
6. A reflection on what you learnt from this assignment and what you would do differently if you were to do the assignment again.
5 Marking Scheme
The following approximate matrix will be used to grade your assignment.
Written Report
Formatting, Language and Presentation: 10%
Discussion to demonstrate an understanding of the experimental tasks in the context of text mining: 25%
Satisfactory completion of the tasks for the hypothesis: 25%
Discussion and presentation of the results leading to the conclusion: 30%
Use of collaboration to accomplish the task: 10%
**********************End of Assignment Specification**********************