Java programming tutoring - LanguageGuesser.java

2018.04.18

CS209 - LAB2

In this lab, which will extend over several weeks, you will use the Tokenizer
object created in Lab1. You can use your own Tokenizer if it works well; if not,
a correct Tokenizer is available in Sakai.

In a nutshell, the ultimate goal is to write a program that reads a file written
in a European language and "guesses" the language in which the file is written,
by comparing the words found in the file to words that are very common in each
language.

We shall of course proceed in several steps.

The first goal is to find the files containing frequent words for a number of
languages. Your program shall use a configuration file named language.cnf,
expected to be found in the current directory, that you must load and read with
a Properties object. This configuration file will contain a single parameter
called stop_words_dir, whose value is the name of a directory that contains stop
word files. "Stop words" is the name given to words that are extremely frequent
in a language (such as "the", "is" or "and" in English). In the directory, we
expect to find files named <language>.txt, for instance english.txt, french.txt
or spanish.txt, which contain stop words for English, French and Spanish
respectively (those files will be provided, and they are all UTF-8 encoded).
Note that you don't know in advance how many files there are. Any file with an
extension other than .txt can be ignored. Those files only contain words that
are very common in a language.
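
As an illustration of the Properties step, here is a minimal sketch, not a reference solution. The class name StopWordConfig and the method readStopWordsDir are hypothetical, and treating a missing or blank stop_words_dir as an error is an assumption; only language.cnf, stop_words_dir and the use of Properties come from the text above.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical helper: the class and method names are illustrative only.
final class StopWordConfig {

    // Loads the configuration file and returns the value of stop_words_dir.
    static String readStopWordsDir(Path configFile) throws IOException {
        Properties props = new Properties();
        // Going through a Reader lets us state the charset explicitly.
        try (Reader in = Files.newBufferedReader(configFile, StandardCharsets.UTF_8)) {
            props.load(in);
        }
        String dir = props.getProperty("stop_words_dir");
        // Assumption: a missing or blank value is reported as an error.
        if (dir == null || dir.trim().isEmpty()) {
            throw new IOException("stop_words_dir is not set in " + configFile);
        }
        return dir.trim();
    }
}
```

A call could look like StopWordConfig.readStopWordsDir(Paths.get("language.cnf")). Keep in mind that the properties format treats the backslash as an escape character, so a Windows-style path in language.cnf needs doubled backslashes or forward slashes.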

Things to look for: your program should handle the case when the value
associated with stop_words_dir isn't an existing directory (it may have been
mistyped, or renamed). It should also handle the case when no .txt files are
found in the directory.
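
The two failure cases above could be checked while listing the directory. The following is a sketch under assumptions: StopWordFiles and find are hypothetical names, and reporting both problems as an IOException is just one possible choice (printing a message and exiting would do as well).

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: the class and method names are illustrative only.
final class StopWordFiles {

    // Maps a language name (file name without ".txt") to its stop word file.
    static Map<String, Path> find(Path dir) throws IOException {
        // Case 1: the configured value is mistyped or is not a directory.
        if (!Files.isDirectory(dir)) {
            throw new IOException(dir + " is not an existing directory");
        }
        Map<String, Path> byLanguage = new TreeMap<>();
        // The glob keeps only entries whose name ends with .txt.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    String name = entry.getFileName().toString();
                    String language = name.substring(0, name.length() - ".txt".length());
                    byLanguage.put(language, entry);
                }
            }
        }
        // Case 2: the directory exists but holds no stop word file.
        if (byLanguage.isEmpty()) {
            throw new IOException("no .txt files found in " + dir);
        }
        return byLanguage;
    }
}
```

Because the number of files is not known in advance, the result is a map that grows with whatever the directory contains. Whether the glob matches names such as ENGLISH.TXT depends on the platform.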

You may find the following link useful:

The second goal is to load all these words into memory. You have to think about
how best to store words and languages, knowing that you will search the words to
find a language. You must also keep in mind that some related languages share
stop words - for instance, "la" is very common in French, Spanish and Italian;
"de" is also very common (with different meanings) in many languages. A word may
therefore be associated with several languages, and the most probable language
will be the one for which you find the greatest number of words.
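
One possible way to store words and languages is a map from each word to the set of languages that list it, which makes the lookup direction (word to languages) cheap. This is only a sketch: StopWordIndex, addFile and languagesOf are hypothetical names, and it assumes one stop word per line and case-insensitive matching, neither of which is stated in the handout.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical structure: word -> set of languages in which it is a stop word.
final class StopWordIndex {

    private final Map<String, Set<String>> languagesByWord = new HashMap<>();

    // Assumption: the file holds one stop word per line, UTF-8 encoded.
    void addFile(String language, Path file) throws IOException {
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                // A shared word such as "la" accumulates several languages.
                languagesByWord.computeIfAbsent(word, w -> new HashSet<>()).add(language);
            }
        }
    }

    // Returns the languages that list this word, or an empty set.
    Set<String> languagesOf(String word) {
        return languagesByWord.getOrDefault(
                word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

The alternative, a map from each language to its set of words, would force a check against every language for every word read from the analyzed file.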

Finally, the third goal is to analyze a UTF-8 text file whose name is provided
on the command line, for instance

java LanguageGuesser filename.txt

if your program is called LanguageGuesser.java.

Your program will compare the words in this file to the stop words, and simply
display the name (without path or extension) of the stop word file for which you
have found the greatest number of matches, for instance:

java LanguageGuesser filename.txt

swedish

if most of the recognized words were found in swedish.txt.
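
The counting step described above can be sketched as follows. The names LanguageScorer and guess are hypothetical. The sketch assumes the words have already been produced by the Lab1 Tokenizer (whose interface is not shown here, so a plain Iterable<String> stands in for it), that every occurrence of a stop word counts as one match, and that ties go to the alphabetically first language. None of these choices is prescribed by the handout.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper: the class and method names are illustrative only.
final class LanguageScorer {

    // languagesByWord maps a lower-case stop word to the languages listing it.
    static Optional<String> guess(Iterable<String> words,
                                  Map<String, Set<String>> languagesByWord) {
        // TreeMap gives a stable, alphabetical order for tie-breaking.
        Map<String, Integer> matches = new TreeMap<>();
        for (String word : words) {
            Set<String> languages = languagesByWord.get(word.toLowerCase(Locale.ROOT));
            if (languages != null) {
                // A shared word adds one match to each of its languages.
                for (String language : languages) {
                    matches.merge(language, 1, Integer::sum);
                }
            }
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : matches.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        // Empty when no word of the file matched any stop word.
        return Optional.ofNullable(best);
    }
}
```

The caller would print the returned name, for example with guess(words, index).ifPresent(System.out::println), and decide what to display when nothing matched.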