Software Technology 2 (7170 & 9073)
Assignment 1
Automatic Language Identification System
Submission date: 23:59 Sunday 31/03/2019 (Week 7)
Type: Individual assignment
Total mark: 15
Proportion of unit assessment: 15%
Late submission: a penalty of 5% of the total mark (i.e., 0.75 marks) per day.
Language: Java (use Google Java Style https://google.github.io/styleguide/javaguide.html)
Note: 5 marks will be deducted if Google Java Style is not applied.
Aims: The aim of this project is to apply class design, algorithms, and Java programming to build an automatic language identification system. The project also aims to provide background knowledge in Natural Language Processing, Pattern Recognition and Machine Learning, which are parts of Artificial Intelligence (AI), one of the top technology trends in 2019.
Submission:
A Word document that contains your student ID, your name, the Java class diagram designed for the system, and anything you want your tutor to know before marking your assignment.
A compressed file that contains all required files for the Language Identification system. Submit this compressed file via the Canvas site of this unit. Email submission is not accepted.
Automatic Language Identification System
Your task is to implement an automatic language identification system that can identify 5 written languages (English, French, German, Italian and Spanish). The system will input a text and output the language identified for this text. Assume that all words in the input text are written in the same language.
The system consists of two stages: Learning and Identification. Details of the system are as follows.
Stage 1: Learning languages from given text files using the n-gram technique (here n = 2, bigram). The following steps are required:
Access the given folder named Learning and verify that it contains the 5 text files English.txt, French.txt, German.txt, Italian.txt, and Spanish.txt. These text files are in UTF-8 format.
Do the following for each of the 5 text files:
o Open the current text file, read its content, change all uppercase letters to lowercase, and remove non-alphabetic characters such as ~ ` ! @ # $ % ^ & * ( ) - _ + = < , > . ? / 1 2 3 4 5 6 7 8 9 0, whitespace, and new lines, as well as 1-letter words. What remains is a sequence of words.
o Convert each word in this sequence into letter-level bigrams (bigram: a pair of consecutive written letters). For example, the word "language" is converted into the bigrams "la an ng gu ua ag ge".
o For each bigram found, calculate its probability (probability = number of occurrences of the bigram / number of occurrences of all bigrams).
o Sort all bigrams in alphabetical order, then output them and their probabilities to a file. This file is called a language model, and its filename is of the form XYZModel.txt, where XYZ is the input filename.
Save all model files to a folder named Models.
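The in-memory part of the steps above (cleaning the text, forming bigrams, computing probabilities) can be sketched as follows. This is a minimal illustration and not the required design: the class and method names are hypothetical, non-letter characters are assumed to act as word separators, and "alphabetic" is assumed to mean any Unicode letter so that characters such as é, ü and ñ are kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch of the Stage 1 computations; all names here are hypothetical. */
final class BigramLearningSketch {

  /** Lowercases the text, splits it on runs of non-letters, and drops 1-letter words. */
  static List<String> extractWords(String text) {
    List<String> words = new ArrayList<>();
    // Assumption: \p{L} (any Unicode letter) is what counts as alphabetic.
    for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
      if (token.length() > 1) { // also skips the empty token before a leading separator
        words.add(token);
      }
    }
    return words;
  }

  /** Returns the letter-level bigrams of one word, in order of appearance. */
  static List<String> toBigrams(String word) {
    List<String> bigrams = new ArrayList<>();
    for (int i = 0; i + 1 < word.length(); i++) {
      bigrams.add(word.substring(i, i + 2));
    }
    return bigrams;
  }

  /** Maps each bigram to (its count / count of all bigrams); the TreeMap keeps keys sorted. */
  static Map<String, Double> bigramProbabilities(String text) {
    Map<String, Integer> counts = new TreeMap<>();
    int total = 0;
    for (String word : extractWords(text)) {
      for (String bigram : toBigrams(word)) {
        counts.merge(bigram, 1, Integer::sum);
        total++;
      }
    }
    Map<String, Double> probabilities = new TreeMap<>();
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      probabilities.put(entry.getKey(), entry.getValue() / (double) total);
    }
    return probabilities;
  }

  public static void main(String[] args) {
    System.out.println(toBigrams("language")); // prints [la, an, ng, gu, ua, ag, ge]
    System.out.println(bigramProbabilities("Banana, a BAN!"));
  }
}
```

A TreeMap orders its keys by character code, so bigrams containing accented letters sort after those with plain a to z; whether that matches the intended "alphabetical order" is an assumption worth checking.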
Below is an example of learning 3 languages which are English, German and French.
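Apart from the bigram computation, Stage 1 also involves some file handling: checking the Learning folder and writing each model into Models. The sketch below shows one possible way to do this with java.nio.file. The class and method names are hypothetical, the model line format ("bigram probability", one pair per line) is an assumption since the assignment does not fix it, and XYZ is assumed to be the input filename without its .txt extension.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch of the Stage 1 file handling; all names here are hypothetical. */
final class ModelFileSketch {

  static final List<String> LANGUAGES =
      Arrays.asList("English", "French", "German", "Italian", "Spanish");

  /** Returns the names of the required learning files that are missing from the folder. */
  static List<String> missingLearningFiles(Path learningDir) {
    List<String> missing = new ArrayList<>();
    for (String language : LANGUAGES) {
      String name = language + ".txt";
      if (!Files.isRegularFile(learningDir.resolve(name))) {
        missing.add(name);
      }
    }
    return missing;
  }

  /** Builds the model filename, assuming XYZ is the input filename without its extension. */
  static String modelFileName(String inputFileName) {
    int dot = inputFileName.lastIndexOf('.');
    String base = dot > 0 ? inputFileName.substring(0, dot) : inputFileName;
    return base + "Model.txt";
  }

  /** Formats one "bigram probability" line per entry, sorted by bigram. */
  static List<String> formatModel(Map<String, Double> probabilities) {
    List<String> lines = new ArrayList<>();
    for (Map.Entry<String, Double> entry : new TreeMap<>(probabilities).entrySet()) {
      lines.add(entry.getKey() + " " + entry.getValue());
    }
    return lines;
  }

  /** Writes the model as UTF-8 text into modelsDir, creating the folder if needed. */
  static Path writeModel(Path modelsDir, String inputFileName, Map<String, Double> probabilities)
      throws IOException {
    Files.createDirectories(modelsDir);
    Path target = modelsDir.resolve(modelFileName(inputFileName));
    return Files.write(target, formatModel(probabilities), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // "Learning" is the folder name given in the assignment text.
    List<String> missing = missingLearningFiles(Paths.get("Learning"));
    System.out.println(missing.isEmpty() ? "All learning files found" : "Missing: " + missing);
    System.out.println(modelFileName("English.txt")); // prints EnglishModel.txt
  }
}
```

Keeping the formatting in formatModel separate from the write makes the output easy to check without touching the file system.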
Stage 2: Identifying language for an input text
Read text from standard input or a file (this file must be in a folder named Testing).
Remove non-alphabetic characters from the input text as described in Stage 1. What remains is a sequence of words.
Convert each word in this sequence into letter-level bigrams as described in Stage 1.
Open all model files in the Models folder and read their contents.
For each model file, do the following: find the input bigrams in the current model and get their probabilities p1, p2, …, pM, where M is the number of input bigrams, then calculate the matching score, which is the product of all of these probabilities: matching score = p1 * p2 * … * pM. You will have 5 matching scores for the 5 model files.
Find the maximum score from the 5 matching scores calculated from the previous step.
Return the language that has the maximum matching score as the identified language.
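The scoring and selection steps above can be sketched as follows, assuming each model has already been loaded into a map from bigram to probability. The class and method names are hypothetical. The assignment text does not say what to do with an input bigram that is absent from a model, so the sketch exposes that choice as a hypothetical missingProbability parameter instead of deciding it.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of the Stage 2 scoring; all names here are hypothetical. */
final class LanguageScoringSketch {

  /**
   * Computes p1 * p2 * ... * pM over the input bigrams for one model.
   * missingProbability (an assumption) is used for bigrams the model does not contain.
   */
  static double matchingScore(
      List<String> inputBigrams, Map<String, Double> model, double missingProbability) {
    double score = 1.0;
    for (String bigram : inputBigrams) {
      score *= model.getOrDefault(bigram, missingProbability);
    }
    return score;
  }

  /** Returns the language with the maximum matching score, or null if there are no models. */
  static String identify(
      List<String> inputBigrams,
      Map<String, Map<String, Double>> models,
      double missingProbability) {
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (Map.Entry<String, Map<String, Double>> entry : models.entrySet()) {
      double score = matchingScore(inputBigrams, entry.getValue(), missingProbability);
      if (score > bestScore) { // on a tie, the first model encountered wins
        bestScore = score;
        best = entry.getKey();
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // Made-up probabilities for two placeholder models, for illustration only.
    Map<String, Double> modelA = new LinkedHashMap<>();
    modelA.put("bl", 0.5);
    modelA.put("le", 0.5);
    modelA.put("eu", 0.5);
    Map<String, Double> modelB = new LinkedHashMap<>();
    modelB.put("bl", 0.5);
    modelB.put("le", 0.5);

    Map<String, Map<String, Double>> models = new LinkedHashMap<>();
    models.put("LangA", modelA);
    models.put("LangB", modelB);

    List<String> bigrams = Arrays.asList("bl", "le", "eu"); // bigrams of "bleu"
    System.out.println(identify(bigrams, models, 0.0)); // prints LangA
  }
}
```

For long input texts, a product of many probabilities below 1 can underflow to 0.0 in a double. Summing the logarithms of the probabilities gives the same ranking, but that departs from the formula stated above, so treat it as an option to discuss rather than part of the specification.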
Below is an example of testing the word bleu in the file Unknown.txt with 3 language models.
------------------------------
Hints will be given in lectures and tutorials.
Plagiarism and Extension: Please review the Unit Outline, which is available on the Canvas site of this unit.