XML 程序辅导辅导XML 、辅导Preprocessing text data
- 首页 >> 其他Assessment 3: Preprocessing text data
Please carefully review all the requirements below to ensure you have a good
understanding of what is required for your assessment.
1. Due Date
2. Instructions & Brief
3. Assessment Resources
4. Assessment Criteria
1. Grading Rubric
2. Penalties
5. How to Submit
1. Due Date
This specific due date and time can be viewed below in the Grading
Summary.
2. Assessment Description
Text documents, such as long recordings and meeting transcripts, are usually
comprised of topically coherent text segments, each of which contains some
number of text passages. Within each topically coherent segment, one would
expect that the word usage demonstrates more consistent lexical
distributions than that across segments. A linear partition of texts into topic
segments can be used for text analysis tasks, such as passage retrieval in IR,
NAVIGATION
Dashboard
Site home
Site pages
Current unit
FIT5196-S1-
2018
Participant
s
FIT5196 Data wrangling S1 2018
Home BTBL @ Monash Library Report a Moodle Fault SETU - Unit Evaluation
All Unit Guides Your Responsibilities Need Help?
Assignment 18518 14:21
https:moodle.vle.monash.edumodassignview.php"id=4764105 ᒫ 2 ᶭҁو 7 ᶭ҂
document summarization, and discourse analysis. In this assessment, you
are required to write Python code to preprocess a set of meeting transcripts
and convert them into numerical representations suitable for input into topic
segmentation algorithms.
This is an individual assignment and worth 30% of your total mark for
FIT5196.
The detailed tasks are as follows:
Task 1: Reconstruct meeting transcripts with topical boundaries. The
original meeting transcripts are stored in three different types of XML
files, which are ending with ".words.xml", ".topic.xml" and
".segments.xml". (The details about the three types of files can be found
in Section 3 below). The task here is to reconstruct the original meeting
transcripts with the corresponding topical and paragraph boundaries
from these files. Please note that
A meeting transcript must be generated for each of the "*.topic.xml"
file. For example, "ES2002a.txt" will be generated for
"ES2002a.topic.xml".
All the generated meeting transcripts with the ".txt" file extension
must be saved in the folder "txt_files".
The topical boundaries must be denoted with "**********"(i.e., 10
asterisks).
All the tokens, including punctuations, must be separated by a white
space. For example, "Alright , okay . Okay ."
Besides the topical boundaries, the paragraph boundaries must also
be reconstructed with the "*.segments.xml" file.
The input files to your notebook "task_1.ipynb" must be the three
types of XML files. The output must be the meeting transcripts saved
in a set of txt files.
A sample meeting transcript is provided in the "txt_file" folder.
Task 2: Generate sparse representations for the meeting transcripts.
The aim of this task is to build sparse representations for the meeting
transcripts generated in task 1, which includes word tokenization,
vocabulary generation, and the generation of sparse representations.
Please note that
The word tokenization must use the following regular expression,
"\w+(?:[-']\w+)?", and all the words must be converted into the lower
case.
The stop words list (i.e, stopwords_en.txt) provided in the zip file
must be used.
The words, whose document frequencies are greater than 132, must
be removed.
Generating multi-word phrases (i.e., collocations) are not needed.
The output of this task must contain the following files:
https:moodle.vle.monash.edumodassignview.php"id=4764105 ᒫ 3 ᶭҁو 7 ᶭ҂
vocab.txt: It contains the unigram vocabulary in the following
format, word_string:integer_index. Words in the vocabulary
must be sorted in alphabetic order. For example, "absolute:22" in
the following figure means that the 23rd word in the vocabulary is
"absolute".
topic_seg.txt: It contains the topic boundaries encoded in
boolean vectors. For example, if a meeting transcript,
"ES2018d.txt" contains 10 paragraphs in total after being
preprocessed, and there are topic boundaries after the 2nd,
5th, and 7th paragraphs, the boolean vector must be
"ES2018d:0,1,0,0,1,0,1,0,0,1". Every line in topic_seg.txt
corresponds to one meeting transcript.
./sparse_files/*.txt : Each txt file in the "sparse_files" folder
corresponds to one of the meeting transcripts in the "txt_files"
folder, and they have the same file name. For example,
"./sparse_files/ES2002a.txt" corresponds to
"./txt_files/ES2002a.txt". Each file in "/sparse_files" contains the
sparse representations for all its paragraphs as
22 Apr)
Assignment 18518 14:21
https:moodle.vle.monash.edumodassignview.php"id=4764105 ᒫ 4 ᶭҁو 7 ᶭ҂
where 1) each line is a paragraph and the order of the lines must
match the paragraph order in the corresponding meeting
transcript. 2) the integer before ":" is the word index in the
vocabulary and the one after is the frequency of the word in the
corresponding paragraph; 3) empty paragraphs after
preprocessing must be excluded.
3. Assessment Resources
Before you start writing your code, you will need to download the file
meeting_transcripts
Unzipping the file, you will find that
There are three types of XML files in the given folder :
1. ./topics/*.topic.xml contains the information about topic segments.
Each topic tag directly linked to the root indicates one topic
segment that is required in text segmentation task. Each topic
segment can contain a number of paragraphs given by different
meeting attendees. It can also contain sub-topics.
2. ./words/*.words.xml contains the word tokens generated with the
force alignment technique. Each word is associated with its start
time and end time in the meeting transcript.
3. ./segments/*.segments.xml contains the paragraph boundaries,
the start and end of which are denoted by the corresponding word
IDs.
./spase_files: the file folder used to store the generated sparse
https:moodle.vle.monash.edumodassignview.php"id=4764105 ᒫ 5 ᶭҁو 7 ᶭ҂
representations for all the meeting transcripts.
./txt_files: the file folder used to save the reconstructed meeting
transcripts.
./stopwords_en.txt: the stopword list used in word tokenization.
./topic_segs.txt: the file used to save the topical boundaries.
./vocab.txt: the file used to save the vocabulary.
./task_1.ipynb: the python code you are going to write for task 1
./task_2.ipynb: the python code you are going to write for task 2
4. Assessment Criteria
The following outlines the criteria which you will be assessed against.
4.1 Mark allocation and general marking criteria
1. The submitted scripts in the notebook should work without any errors
and must give the correct results. If the submitted notebook cannot be
run by the assessor, which will be double-checked by the head tutor
and the lecturer, zero marks will then be given to the corresponding
task.
task 1: 14 out of 30
task 2: 14 out of 30
task 2 will be assessed if and only if task 1 is successfully finished
and receives a full mark (i.e., 14).
2. The code should be well structured and properly commented. (1 out of
30)
3. The notebook should be structured in a logical way so that it clearly
shows how students finish the tasks in the assessment. (1 out of 30)
4. Criteria 2 and 3 will be assessed if and only if the mark for criteria 1 is
greater than and equal to 25.
4.2 Penalties
Late submission: for all assessment items handed in after the official due
date, and without an approved extension, a 10% penalty applies to the
student's mark for each day after the due date (including weekends, and
public holidays) for up to 5 days. Assessment items handed in after 5
days will not be considered!
Submission: please do follow Section 5 How to Submit to submit your
assignment. Otherwise, a 5% penalty will be applied.
Assignment 18518 14:21
https:moodle.vle.monash.edumodassignview.php"id=4764105 ᒫ 6 ᶭҁو 7 ᶭ҂
Add submission
Make changes to your submission
5. How to Submit
Once you have completed your work, take the following steps to submit your
work.
1. Only one zip file needs to be submitted: Once you finished the
tasks, please zip the folder only contains the files specified in
Section 3, including the original XML files. The zip file should be
named as 5196_assessment_3_Surname_StudentID.zip
2. Click the Add Submission button below to submit and upload
your assignment. Please do remember to accept the
submission statement! Only the submitted assignments will
be marked. Those shown as a draft won't be marked!
If you need further guidance on how to submit an assessment item, please
review the Submitting an assessment overview. If you need further
assistance, please go to Help & Support.
Submission status
Attempt number This is attempt 1.
Submission status No attempt
Grading status Not graded
Due date Sunday, 27 May 2018, 11:55 PM
Time remaining 9 days 9 hours
Last modified -
Submission
comments
Comments (0)
Assignment 18518 14:21