CS209 - LAB2
In this lab, which will extend over several weeks, you will use the Tokenizer
object created in Lab1. You can use your own Tokenizer if it works well; if not,
a correct Tokenizer is available in Sakai.
In a nutshell, the ultimate goal is to write a program that is able to read a file
written in a European language and, by comparing the words found in this file
to words that are very common in a language, to "guess" the language in which
the file is written.
We shall of course proceed in several steps.
The first goal is to find the files containing frequent words for a number
of languages. Your program shall use a configuration file named
language.cnf, expected to be found in the current directory, that you
must load and read with a Properties object. This configuration file will
contain a single parameter called stop_words_dir, which will be
associated with the name of a directory that contains stop word files.
"Stop words" is the name given to words that are extremely
frequent in a language (such as "the", "is" or "and" in English). In the
directory, we expect to find files named <language>.txt, for instance
english.txt, french.txt or spanish.txt, which contain the stop
words for English, French and Spanish respectively (those files will be
provided, and they are all UTF-8 encoded). Note that you don't know in
advance how many files there are. Any file with an extension other
than .txt can be ignored. Those files only contain words that are very
common in a language.
Things to look for: your program should handle the case where the value
associated with stop_words_dir isn't an existing directory (it may have
been mistyped, or the directory renamed). It should also handle the case
where no .txt files are found in the directory.
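The first step could be sketched as follows. This is only one possible approach, not a reference solution: the class name StopWordFiles and the methods find and languageOf are hypothetical names chosen for illustration, and reporting every failure as an IOException is an assumption about how errors should be signalled.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class StopWordFiles {

    // Reads stop_words_dir from the configuration file and returns the
    // .txt files found in that directory. Every failure case is reported
    // as an IOException (an assumption made for this sketch).
    public static List<Path> find(Path configFile) throws IOException {
        Properties config = new Properties();
        try (InputStream in = Files.newInputStream(configFile)) {
            config.load(in);
        }
        String dirName = config.getProperty("stop_words_dir");
        if (dirName == null) {
            throw new IOException("stop_words_dir is not set in " + configFile);
        }
        Path dir = Paths.get(dirName.trim());
        if (!Files.isDirectory(dir)) {
            throw new IOException(dir + " is not an existing directory");
        }
        List<Path> files = new ArrayList<>();
        // The glob keeps only *.txt entries; other extensions are ignored.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    files.add(p);
                }
            }
        }
        if (files.isEmpty()) {
            throw new IOException("no .txt files found in " + dir);
        }
        return files;
    }

    // Language name = file name without path or ".txt" extension.
    public static String languageOf(Path file) {
        String name = file.getFileName().toString();
        return name.substring(0, name.length() - ".txt".length());
    }

    public static void main(String[] args) {
        try {
            for (Path p : find(Paths.get("language.cnf"))) {
                System.out.println(languageOf(p));
            }
        } catch (IOException e) {
            System.err.println("Cannot locate stop word files: " + e.getMessage());
            System.exit(1);
        }
    }
}
```

Run from a directory containing language.cnf, this sketch lists the languages for which a stop word file was found, or prints an error message and exits if the directory or the files are missing.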
You may find the following link useful:
The second goal is to load all these words into memory. You have to think
about how best to store words and languages, knowing that you will
search the words to find a language. You must also keep in mind that
some related languages also share stop words - for instance, "la" is very
common in French, Spanish and Italian; "de" is also very common
(with different meanings) in many languages. One word may therefore be
associated with several languages, and the most probable language will
be the one for which you find the greatest number of words.
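One way to store the words, among others, is a map from each word to the set of languages in which it is a stop word, so that a single lookup returns every candidate language. The sketch below assumes this layout; the class name StopWordIndex and its methods are hypothetical, and lower-casing the words before storing them is an assumption the assignment does not require.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class StopWordIndex {

    // word -> languages in which that word is a stop word
    private final Map<String, Set<String>> languagesByWord = new HashMap<>();

    // Registers every word of one language. Words are trimmed and
    // lower-cased (an assumption of this sketch); blank entries are skipped.
    public void add(String language, List<String> words) {
        for (String w : words) {
            String key = w.trim().toLowerCase(Locale.ROOT);
            if (key.isEmpty()) {
                continue;
            }
            languagesByWord.computeIfAbsent(key, k -> new HashSet<>()).add(language);
        }
    }

    // Returns the languages sharing this word, or an empty set if unknown.
    public Set<String> languagesOf(String word) {
        return languagesByWord.getOrDefault(
                word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

If each stop word file holds one word per line (which the assignment does not state), the list passed to add could come from Files.readAllLines(path, StandardCharsets.UTF_8). With this layout, "la" would map to French, Spanish and Italian at once.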
Finally, the third goal is to analyze a UTF-8 text file whose name
is provided on the command line, for instance
java LanguageGuesser filename.txt
if your program is called LanguageGuesser.java.
Your program will compare the words in this file to the stop words, and
simply display the name (without path or extension) of the stop word
file for which you have found the greatest number of matches, for
instance:
java LanguageGuesser filename.txt
swedish
if most of the recognized words were found in swedish.txt.
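The counting step might look like the sketch below. It assumes the stop words are held in a map from word to set of languages, and that the words of the analyzed file arrive as an Iterable of strings. In the lab they would come from the Lab1 Tokenizer, whose API is not shown here, so none of it is used. MatchCounter and bestLanguage are hypothetical names.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class MatchCounter {

    // Counts, for each language, how many words of the text are stop words
    // of that language, and returns the language with the highest count.
    // Returns an empty Optional if no word matched at all. If two languages
    // tie, which one is returned is unspecified in this sketch.
    public static Optional<String> bestLanguage(
            Map<String, Set<String>> languagesByWord, Iterable<String> words) {
        Map<String, Integer> hits = new HashMap<>();
        for (String word : words) {
            Set<String> languages = languagesByWord.get(word.toLowerCase(Locale.ROOT));
            if (languages == null) {
                continue; // not a stop word in any known language
            }
            for (String language : languages) {
                hits.merge(language, 1, Integer::sum);
            }
        }
        return hits.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

A word shared by several languages adds one to each of them, so a text is attributed to the language whose stop word list covers the most of its words.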