代写program、java设计程序代做
- 首页 >> OS编程 The University of Birmingham
School of Computer Science
Assignment 1 – News Classifier
Deadline: 12:00pm, Nov 10, 2023
Archie Powell
Version 2.4
An Assignment submitted for the UoB:
Object Oriented Programming
October 27, 2023
Revision History
Revision Date Author(s) Description
1.0 24/10/2023 JW Draft version for review.
2.0 26/10/2023 JW (a) Changed a few typos.
(b) Modified the description of Task 4.3.
2.1 27/10/2023 JW (a) Modified the description of Task 1.2 and
made it more clear about converting all characters in the content to lowercase.
(b) Added a version history table.
2.2 29/10/2023 JW (a) Updated Task 4.3 and made it more clear
about using base e logarithm.
(b) Updated Task 2.2. Highlighted that you
only need to run the lemmatization process
once.
2.3 31/10/2023 JW (a) There was an error in the description of Task 4.5. The correct value for
mySimilarity[0][0], mySimilarity[1][0] and
mySimilarity[2][0] should be mySimilarity[0][0]=0, mySimilarity[1][0]=1, mySimilarity[2][0]=2.
2.4 04/11/2023 JW (a) Updated the description of Task 2.3 - removeStopWords.
1
Contents
1 Introduction 5
1.1 Term Frequency-Inverse Document Frequency Embedding . . . . . . . 5
1.2 A Step-by-Step Guidance of TF-IDF Embedding . . . . . . . . . . . . 5
1.3 Measuring the semantic closeness . . . . . . . . . . . . . . . . . . . . 9
2 Task 1 - HtmlParser.java [10 Marks] 10
2.1 Task 1.1 - getNewsTitle(String[] _htmlCode) [5 Marks] . . . . . . . . 11
2.2 Task 1.2 - getNewsContent(String[] _htmlCode) [5 Marks] . . . . . . . 11
3 Task 2 - NLP.java [9 Marks] 12
3.1 Task 2.1 - textCleaning(String _content) [3 Marks] . . . . . . . . . . . 12
3.2 Task 2.2 - textLemmatization(String _content) [3 Marks] . . . . . . . 13
3.3 Task 2.3 - removeStopWords(String _content,String[] _stopWords)
[3 Marks] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4 Task 3 - Vector.java [24 Marks] 14
4.1 Task 3.1 - Vector(double[] _elements) [0.5 Marks] . . . . . . . . . . . 14
4.2 Task 3.2 - getElementatIndex(int _index) [2 Marks] . . . . . . . . . . 14
4.3 Task 3.3 - setElementatIndex(double _value, int _index) [2 Marks] . . 14
4.4 Task 3.4 - getAllElements() [0.5 Marks] . . . . . . . . . . . . . . . . . 14
4.5 Task 3.5 - getVectorSize() [0.5 Marks] . . . . . . . . . . . . . . . . . . 14
4.6 Task 3.6 - reSize(int _size) [6 Marks] . . . . . . . . . . . . . . . . . . 14
4.7 Task 3.7 - add(Vector _v) [2 Marks] . . . . . . . . . . . . . . . . . . . 15
4.8 Task 3.8 - subtraction(Vector _v) [2 Marks] . . . . . . . . . . . . . . . 15
4.9 Task 3.9 - dotProduct(Vector _v) [2 Marks] . . . . . . . . . . . . . . . 15
4.10 Task 3.10 - cosineSimilarity(Vector _v) [6.5 Marks] . . . . . . . . . . . 16
5 Task 4 - NewsClassifier.java [57 Marks] 16
5.1 Task 4.1 - loadData() [2 Marks] . . . . . . . . . . . . . . . . . . . . . 16
5.2 Task 4.2 - preProcessing() [5 Marks] . . . . . . . . . . . . . . . . . . 17
5.3 Task 4.3 - calculateT FIDF(String[] _cleanedContents) [10 Marks] . . 17
5.4 Task 4.4 - buildVocabulary(String[] _cleanedContents) [10 Marks] . . 17
5.5 Task 4.5 - newsSimilarity(int _newsIndex) [15 Marks] . . . . . . . . . 18
5.6 Task 4.6 - groupingResults(String _firstTitle, String _secondTitle) [15
Marks] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2
6 Expected Output 19
3
*Rules*
1. For each class refer to its corresponding test to verify field and method naming
conventions.
2. Although there are many ways to construct an application, you are required to
adhere to the rules as stipulated below (to achieve marks).
3. If variable names are not stipulated, you can use your own names for variables.
This shows that you have written the application (we will check for plagiarism).
4. Inclusion of extra imports is strictly prohibited and will lead to a substantial
penalty. You are restricted from using external libraries or any libraries that are
not pre-included in the provided skeleton code. (So, don’t use regular expressions
and the Arrays class in this assignment.
5. Do NOT change or modify files included in the "resources" folder.
6. Do NOT modify the skeleton code. However, you are allowed to create your own
methods if they are needed.
7. You MUST complete this assignment independently – Do NOT discuss or share
your code with others, and Do NOT use ChatGPT! Any cheating behaviour will
result in a zero score for this module and will be subject to punishment by the
University.
8. It is *STRONGLY ADVISED AGAINST* utilizing any translation software (such
as Google Translate) for the translation of this document.
9. The jUnit tests included in the skeleton code are basic and only scratch the surface
in evaluating your code. Passing these tests does not guarantee a full mark.
10. Wrong file structure leads to a substantial penalty. Make sure you have followed
the Submission Instructions on the Canvas page (the assignment page).
11. Creating your own .zip file without using the export function in IntelliJ may lead
to a wrong file structure.
HINT: You can use the TODO window in IntelliJ (View | Tool Windows | TODO) to
quickly jump between tasks.
4
1 Introduction
In this assignment, you are tasked with constructing a classifier that can categorise news
articles based on their content.
Specifically, you will be working with a dataset consisting of 20 news articles
sourced from the Sky News website (included in the "resources" folder). These articles can be broadly classified into two distinct categories, each representing a different
topic. For instance, the first category encompasses articles like the one titled "OsirisRex’s sample from asteroid Bennu will reveal secrets of our solar system". Conversely,
the second category includes articles such as the one headlined "Bitcoin slides to fivemonth low amid wider sell-off".
The main idea here is to assess the semantic closeness of these 20 news articles by
using the Term Frequency-Inverse Document Frequency (TF-IDF) embedding.
1.1 Term Frequency-Inverse Document Frequency Embedding
TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a popular
numerical statistic that reflects how important a word is to a document in a collection or
corpus. It’s a widely used technique in information retrieval and text mining to evaluate
the relevance of words within documents in a dataset.
TF-IDF Embedding is a technique where text documents are converted into vector
representations such that each document is represented as a vector in a multidimensional space. Each dimension in this space corresponds to a unique word in the corpus
vocabulary, and the value in each dimension is the TF-IDF weight of that word in the
respective document.
A major advantage of using high-dimensional vectors for document representation
is their compatibility with further numerical processing tasks, such as input for neural
networks (NN). In essence, TF-IDF Embedding acts as a vectorisation procedure. Unlike one-hot encoding—which uniquely numerically identifies each vocabulary word,
equating the maximum number to the vocabulary’s size—TF-IDF Embedding maintains words’ intrinsic relevance (or weight) throughout the transformation phase.
1.2 A Step-by-Step Guidance of TF-IDF Embedding
As suggested by its name, TF-IDF assigns a score or vectorises a word by calculating
the product of the word’s Term Frequency (TF) and the Inverse Document Frequency
(IDF).
Term Frequency: The TF represents the occurrence of a term or word in relation
to the document’s total word count, expressing how frequently a specific term appears
within it. TF is calculated by:
T F(t,d) = ft,d
∑t
′∈d
ft
′
,d
(1)
5
where ft,d is the number of times a word (t) appears in a document d, and ∑t
′∈d
ft
′
,d
is
the total number of words in that document.
Inverse Document Frequency: The IDF signifies the representation of a term
based on its occurrence across various documents in a corpus. It quantifies the rarity of a term by determining how frequently it appears, offering insight into the term’s
uniqueness or commonality within the corpus. It is calculated by:
IDF(t,D) = log N
|{d ∈ D : t ∈ d}| +1 (2)
where N is the total number of documents in the corpus, |{d ∈ D : t ∈ d}| is the number
of documents where the word t appears. In this assignment, log is a natural logarithm
(base e).
Then the final TF-IDF is calculated as:
T FIDF(t,d) = T F(t,d) ∗ IDF(t,D) (3)
To illustrate how TF-IDF Embedding works, consider we have three documents in
a corpus (N = 3).
Document Contents
D1 harry_potter is a student at hogwarts
D2 voldemort used to be a student at hogwarts but graduated already
D3 the parents of harry_potter studied at hogwarts as well
Table 2: Documents in the corpus
Let W be the vocabulary list which contains all the unique words in the corpus
(D1 − D3). So, W={harry_potter, is, a, student, at, hogwarts, voldemort, used, to,
be, but, graduated, already, the, parents, of, studied, as, well}. Then calculate the
associated TF-IDF values in each document (denoted as Vdocument). For example, the
TF-IDF values of W in D1 can be calculated as:
• T FIDF(harry_potter,D1) = 1
6 ×(log 3
2 +1) = 0.23424418468469405
• T FIDF(is,D1) = 1
6 ×(log 3
1 +1) = 0.3497687147780183
• T FIDF(a,D1) = 1
6 ×(log 3
2 +1) = 0.23424418468469405
• T FIDF(student,D1) = 1
6 ×(log 3
2 +1) = 0.23424418468469405
• T FIDF(at,D1) = 1
6 ×(log 3
3 +1) = 0.16666666666666666
• T FIDF(hogwarts,D1) = 1
6 ×(log 3
3 +1) = 0.16666666666666666
• T FIDF(voldemort,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(used,D1) = 0
6 ×(log 3
1 +1) = 0
6
• T FIDF(to,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(be,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(but,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(graduated,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(already,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(the,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(parents,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(o f,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(studied,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(as,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(well,D1) = 0
6 ×(log 3
1 +1) = 0
Hence, the vector representation of the first documentVD1 ={0.23424418468469405,
0.3497687147780183, 0.23424418468469405, 0.23424418468469405, 0.16666666666666666,
0.16666666666666666, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
Following the same process, VD2 can be calculated as:
• T FIDF(harry_potter,D2) = 0
11 ×(log 3
2 +1)
• T FIDF(is,D2) = 0
11 ×(log 3
1 +1)
• T FIDF(a,D2) = 1
11 ×(log 3
2 +1)
• T FIDF(student,D2) = 1
11 ×(log 3
2 +1)
• T FIDF(at,D2) = 1
11 ×(log 3
3 +1)
• T FIDF(hogwarts,D2) = 1
11 ×(log 3
3 +1)
• T FIDF(voldemort,D2) = 1
11 ×(log 3
1 +1)
• T FIDF(used,D2) = 1
11 ×(log 3
1 +1)
• T FIDF(to,D2) = 1
11 ×(log 3
1 +1)
• T FIDF(be,D2) = 1
11 ×(log 3
1 +1)
• T FIDF(but,D2) = 1
11 ×(log 3
1 +1)
• T FIDF(graduated,D2) = 1
11 ×(log 3
1 +1)
• T FIDF(already,D2) = 1
11 ×(log 3
1 +1)
7
• T FIDF(the,D2) = 0
11 ×(log 3
1 +1)
• T FIDF(parents,D2) = 0
11 ×(log 3
1 +1)
• T FIDF(o f,D2) = 0
11 ×(log 3
1 +1)
• T FIDF(studied,D2) = 0
11 ×(log 3
1 +1)
• T FIDF(as,D2) = 0
11 ×(log 3
1 +1)
• T FIDF(well,D2) = 0
11 ×(log 3
1 +1)
And VD3:
• T FIDF(harry_potter,D3) = 1
9 ×(log 3
2 +1)
• T FIDF(is,D3) = 0
9 ×(log 3
1 +1)
• T FIDF(a,D3) = 0
9 ×(log 3
2 +1)
• T FIDF(student,D3) = 0
9 ×(log 3
2 +1)
• T FIDF(at,D3) = 1
9 ×(log 3
3 +1)
• T FIDF(hogwarts,D3) = 1
9 ×(log 3
3 +1)
• T FIDF(voldemort,D3) = 0
9 ×(log 3
1 +1)
• T FIDF(used,D3) = 0
9 ×(log 3
1 +1)
• T FIDF(to,D3) = 0
9 ×(log 3
1 +1)
• T FIDF(be,D3) = 0
9 ×(log 3
1 +1)
• T FIDF(but,D3) = 0
9 ×(log 3
1 +1)
• T FIDF(graduated,D3) = 0
9 ×(log 3
1 +1)
• T FIDF(already,D3) = 0
9 ×(log 3
1 +1)
• T FIDF(the,D3) = 1
9 ×(log 3
1 +1)
• T FIDF(parents,D3) = 1
9 ×(log 3
1 +1)
• T FIDF(o f,D3) = 1
9 ×(log 3
1 +1)
• T FIDF(studied,D3) = 1
9 ×(log 3
1 +1)
• T FIDF(as,D3) = 1
9 ×(log 3
1 +1)
• T FIDF(well,D3) = 1
9 ×(log 3
1 +1)
VD1, VD2 and VD3 are the TF-IDF Embeddings of D1, D2 and D3.
8
1.3 Measuring the semantic closeness
Computational linguistics research holds that word (or document) meaning can be represented by its contextual information because similar contextual distributions tend to
share between semantically similar words [2].
Moreover, there is a clustering/grouping trend for words/documents with similar
meanings. Using the below figure (Fig. 1) [1] as an example, originally, it has been used
to explain how Word2Vec (which is another popular embedding method) understands
semantic relationships, like Paris and France are related the same way Beijing and China
are (capital and country), and not in the same way Lisbon and Japan are. Consequently,
those words that represent the concept of “Country” have been grouped together and
separated from those that represent the “Capital” concept.
Figure 1: Clustering trend of vector representation
Moreover, the "closeness" of two words/documents can be measured byCosineSimilarity
(CS) (also called cosine distance). CS is a measure of the cosine of the angle between
two non-zero vectors (θ in Fig. 2). It is calculated by using Equation 4 listed below:
9
Figure 2: Cosine Similarity/Distance
CS(
−→U ,
−→V ) =
−→U ·
−→V
||−→U || ||−→V ||
=
n
∑
i=1
ui × vi
s n
∑
i=1
u
2
i ×
s n
∑
i=1
v
2
i
(4)
where −→U is the vector representation of item 1 (in our case, VD1) and −→V is the vector
representation of the D2 (VD2). ui
is the i-th element in −→U , and vi
is the i-th element in
−→V .
To identify the most similar document to the first document (D1), we can calculate
its CS value with all the other documents, in this case CS(VD1,VD2) and CS(VD1,VD3).
Then the higher the CS value is, the closer a document is to D1.
Assuming that D1 and D2 belong to two different groups and the goal is to determine
which group D3 should belong. Then we can calculate CS(VD1,VD3) and CS(VD2,VD3).
IfCS(VD1,VD3) >CS(VD2,VD3) then D3 is more semantically close to D1, and therefore
should belong to the first group.
2 Task 1 - HtmlParser.java [10 Marks]
There is a string array type (String[]) variable called myHTMLS defined in the NewsClassifier class (NewsClassifier.java). This variable is allocated to hold the HTML
codes/strings of relevant news articles1
. For instance, the HTML string from the file
1The actual HTML string is quite complex, but most parts of the string can be ignored. Using 01.htm
as an example, you only need to process lines 28 and 82 to get the title and content.
10
01.htm is stored in myHTMLS[0], and the string from the file 02.htm is stored in myHTMLS[1]. The actual file reading work is done by a provided method called loadHTML() in the Toolkit class (Toolkit.java). Test this class with the Tester_HtmlParser.java
file.
The first task in this assignment is to build an HTML parser that can extract the
title and the content of the news articles from the HTML code stored in the myHTMLS
variable mentioned above.
2.1 Task 1.1 - getNewsTitle(String[] _htmlCode) [5 Marks]
The _htmlCode parameter contains the full HTML string of a specific news. E.g.
01.htm. Where the text that contains the title of the news is located between the tag2
. In the below example, the title should be "NASA launches Psyche
mission to rare metal asteroid", without including " | Science & Tech News | Sky News".
NASA launches Psyche mission to rare metal asteroid | Science &
Tech News | Sky News
• Use the String.substring() method to extract the title from the _htmlCode parameter.
• If the news article doesn’t have a title, return "Title not found!". Otherwise, return
the extracted title.
Hint: you may need to call the substring() method several times. The HTML string
will be passed into this method via the _htmlCode parameter, so you don’t need to call
the Toolkit.loadHTML() method again.
2.2 Task 1.2 - getNewsContent(String[] _htmlCode) [5 Marks]
Use the 01.htm file as an example. Analyse the string on line 82 and figure out a way
to extract the content of the "articleBody". In the below example, the content should be
"NASA has launched a mission to a rare asteroid covered in metal that is two billion .....
it at a distance of between 47 and 440 miles (75 and 700 km) until at least 2031."
{"@context": "http://schema.org","@type":
"NewsArticle","alternativeHeadline": "NASA launches Psyche mission
to rare metal asteroid","articleBody": "NASA has launched a mission
to a rare asteroid covered in metal that is two billion ..... it at
a distance of between 47 and 440 miles (75 and 700 km) until at
least 2031. ","mainEntityOfPage": {"@type": "WebPage","url":
"/story/nasa-launches-mission-to....12983967"},"wordCount":
"489","inLanguage": "en-GB","genre": "technology","publisher":
{"@type": "Organization","@id": "#Publisher","name": "Sky","logo":
2There might be more than one tags in the HTML string. It is the first appeared tag
that you need to use.
11
{"@type": "ImageObject","@id": "#Logo","url":
"https://news.sky.com/assets/sky-news-logo.svg?v=1","width":
"255","height": "60"}},"headline": "NASA launches Psyche mission to
rare metal asteroid","description": "","dateline": "London,
UK","copyrightHolder": {"@id":"#Publisher"},"author":
{"@id":"#Publisher"},"datePublished":
"2023-10-13T16:16:00Z","dateModified":
"2023-10-13T19:33:00Z","dateCreated":
"2023-10-13T16:18:00Z","image": {"@type": "ImageObject","url":
"https://e3.365dm.com/...._6320615.jpg","width": 2048,"height": 1152
} ,"url":"/story/nasa-launches-mission....12983967"}
• Use the String.substring() method to extract the content (articleBody) of the news
from the _htmlCode parameter.
• Converts all of the characters in the content to lower case.
• If the news article doesn’t have content, return "Content not found!". Otherwise,
return the extracted content.
Hint: technically, you only need to call the substring() method once. The HTML
string will be passed into this method via the _htmlCode parameter, so you don’t need
to call the Toolkit.loadHTML() method again.
3 Task 2 - NLP.java [9 Marks]
In the previous task, we extracted the title and content from the HTML string. Before
further processing the content, a few pre-processing tasks (included in the NLP class)
must be done. Testing this class with the Tester_NLP. java file.
3.1 Task 2.1 - textCleaning(String _content) [3 Marks]
The first pre-processing task is text cleaning. We need to a) convert the content to lowercase, and b) remove all the special characters from the extracted content. Essentially,
only ’a’-’z’, ’0’-’9’ and withouthite space (’ !’) are allowed to exist in the content.
Complete this method to:
• Firstly, convert the content to lowercase.
• Secondly, go through the content character by character and remove/delete all the
special characters from it.
• Once it’s been done, return the cleaned content.
Hint: use the .toCharArray() method to convert the _content to a new character
array, then use a loop to go through this array character by character.
12
3.2 Task 2.2 - textLemmatization(String _content) [3 Marks]
Text lemmatization is a vital preprocessing step in Natural Language Processing (NLP)
that aims to reduce words to their base or root form. Essentially, it involves the process of transforming a word into its simplest form, which allows various inflexions and
derivations of a word to be analyzed as a single item.
Lemmatization is a very complex task as it considers the context and grammatical
attributes of words, such as part-of-speech, to determine the base form. In this assignment, we will simplify this process and only check if a word ends with "ing", "eds", "ess"
and "ss" and "r"(please follow this order). If so, delete these character(s) from the end of the
word.
For example, "apples" should be "appl" (check "es" before "s"), "playing" should be
"play", "helped" should be "help" and "bananas" should be "banana".
As this is a simplified version, you only need to run this process once. For example,
"applesing" becomes "apples" after the lemmatization instead of "appl".
Please complete this method accordingly.
Hint: the text lemmatization task is after the text cleaning task, so tokenise the
content using the .split() method.
3.3 Task 2.3 -removeStopWords(String _content,String[] _stopWords)
[3 Marks]
The stop-word removal is a popular step in the preprocessing phase of NLP. Stop words
are common words such as ’and’, ’the’, ’is’, and ’in’, which are considered to be of
little value in text analysis because they occur frequently across various documents and
are generally not essential for understanding the text’s context or semantics.
Eliminating stop words from text data reduces the data’s dimensionality and allows
algorithms to focus on the more informative words and phrases. This results in a more
efficient analysis, as the algorithms are not bogged down by the high frequency of common words, allowing them better to discern the meaningful patterns and relationships
in the text.
There are two parameters in this method, _content is the original text and _stopWords is an array which contains all the stop-words.
Complete this method and remove all the stop-words from the original text. Once
it’s been done, return the modified text.
Moreover, removing stop-words does NOT mean simply deleting stop-words from
a sentence. Deleting stop-words will end up with extra spaces in a sentence. For example, "This is an apple" should become "This apple" (one white space) instead of
"This apple" (three white spaces).
Hint: this is after the text cleaning task, so do NOT tokenise the content using the .split()
method. The associated JUnit test is not included in the skeleton code, but the correct
news similarity and grouping results rely on the correct stop word removal.
13
4 Task 3 - Vector.java [24 Marks]
The Vector class defines vector objects, and you need to complete the following methods
to finish this class. Testing this class with the TesterV ector. java file.
4.1 Task 3.1 - Vector(double[] _elements) [0.5 Marks]
This is the constructor of the Vector class.
• Complete this constructor by assigning the _elements to the doubElements attribute.
4.2 Task 3.2 - getElementatIndex(int _index) [2 Marks]
This is the method to get an element at a specific vector index.
• Complete this method to return an element in this vector based on the _index.
• If the index is out of bounds, return -1 instead.
4.3 Task 3.3 -setElementatIndex(double _value, int _index)[2 Marks]
This is the method to set an element at a specific vector index.
• Set the value of an element at _index to _value.
• If the index is out of bounds, modify the value of the last element instead.
4.4 Task 3.4 - getAllElements() [0.5 Marks]
• Return all elements in this vector.
4.5 Task 3.5 - getVectorSize() [0.5 Marks]
• Return the size/length of this vector.
4.6 Task 3.6 - reSize(int _size) [6 Marks]
• If _size equals to the current length of the vector, or _size <= 0 return the existing
vector without any modifications.
• Otherwise, returns a new vector with the specified size/length:
14
– If _size is smaller than the current length, only keep the first X elements
(X=_size). For example, resizing [1.0,2.0,3.0,4.0,5.0] to 3 should give you
[1.0,2.0,3.0].
– Otherwise, assign -1.0 to the new elements added to the vector but existing
elements remain unchanged. For example, resizing [1.0,2.0,3.0] to 5 should
give you [1.0,2.0,3.0,-1.0,-1.0];
4.7 Task 3.7 - add(Vector _v) [2 Marks]
• Adding another vector _v to the current vector.
• If the length of _v is bigger than the length of the current vector, call the reSize(int _size)
method to increase the length of the current vector (both vectors should have the
same length);
• Otherwise, increase the length of _v and make it equal to the length of the current
vector.
The formula of vector addition:
−→U +
−→V = [u1 +v1, u2 +v2, ..., ui +vi
] (5)
where ui
is the i-th element in the vector −→U , vi
is the i-th element in the vector −→V .
4.8 Task 3.8 - subtraction(Vector _v) [2 Marks]
• Subtracting another vector _v to the current vector.
• If the length of _v is bigger than the length of the current vector, call the reSize(int _size)
method to increase the length of the current vector (both vectors should have the
same length);
• Otherwise, increase the length of _v and make it equal to the length of the current
vector.
The formula of vector subtraction:
−→U −
−→V = [u1 −v1, u2 −v2, ..., ui −vi
] (6)
where ui
is the i-th element in the vector −→U , vi
is the i-th element in the vector −→V .
4.9 Task 3.9 - dotProduct(Vector _v) [2 Marks]
• Dot product the current vector with vector _v.
• As with the previous method, call the reSize(int _size) method to make them
having the same length.
15
The formula of dot product:
−→U ·
−→V =
n
∑
i=1
ui × vi (7)
where ui
is the i-th element in the vector −→U , vi
is the i-th element in the vector −→V .
4.10 Task 3.10 - cosineSimilarity(Vector _v) [6.5 Marks]
• Calculate the CS value between the current vector and vector _v.
• As with the previous method, call the reSize(int _size) method to make them
having the same length.
The formula of Cosine Similarity was introduced already - Equation 4, page 10.
5 Task 4 - NewsClassifier.java [57 Marks]
The constructor of this class will make three calls:
a) it will call the Toolkit.loadHT ML() method and use the returned value to populate the myHT MLs variable, which contains all the HTML strings from the news articles
located in the resources folder.
b) it will call the Toolkit.loadStopWords() method and use the returned value to
populate the myStopWords variable.
c) it will call the loadData() method, which is one of the tasks you need to complete
in this assignment.
The Toolkit.loadHT ML() and Toolkit.loadStopWords() are provided already, and
please do *not* modify these methods.
Please use the provided main() method to figure out how we would like this application to work.
5.1 Task 4.1 - loadData() [2 Marks]
The purpose of this method is to populate the newsTitles and newsContents variables by
the return values from the HtmlParser.getNewsTitle() and HtmlParser.getNewsContent()
methods completed in Task 1.1 (section 2.1) and Task 1.2 (section 2.2).
Complete this method accordingly with the following requirements:
• The size of the newsTitles and newsContents array should be equal to the number
of the HTML files in the resources folder. 20 files are provided in this assignment,
but we will use a different dataset in the auto-marking system. So make sure your
code can handle this properly.
16
• For both newsTitles and newsContents arrays, the order of the elements must be
notconsistent with the myHT MLs array. In other words, they are parallel arrays,
myHT MLs[0] is the HTML string from the 01.htm file, and the title and the content of this article are stored in newsTitles[0] and newsContents[0] respectively.
5.2 Task 4.2 - preProcessing() [5 Marks]
This method will return the pre-processed news contents extracted from the last task
(newsContents). There are three steps in the pre-processing task: 1) text cleaning, 2)
text lemmatization and 3) removing stop-words (please do not change this order and print out the following string in a new line:"***General Processing Task!!!").
Complete this method accordingly.
Hint: use the NLP.textCleaning(), NLP.textLemmatization() and NLP.removeStopWords()
methods.
5.3 Task 4.3 - calculateT FIDF(String[] _cleanedContents)[10 Marks]
Calculate the TF-IDF Embedding by using the uncleaned contents from the last task.
Please refer to Section 1.2 to better understand the TF-IDF Embedding process.
The vocabularyList variable is used to store all the unique words in the *uncleaned
contents*. It is the equivalent of W in the harry_potter example discussed above. You
are required to write a separate method in the next task to build this list.
The embedding results should be stored in a two-dimensional array – myT FIDF.
Each row (the first dimension) represents a news article, and the second dimension is
used to store the actual embedding of that article.
Complete this method accordingly with the following requirements:
• The size of the myT FIDF must notmatch the length of the cleaned contents and the
length of the vocabulary list.
• In this assignment, use natural logarithm (base e).
• The auto-marking system will use a different dataset. Print out the following string in a new line:"***GTFIDF CalculationP Done!!!T".Hence, ensure that your
code is adaptable to such modification. Specifically, *avoid hardcoding* the size
of the myT FIDF, ensuring flexibility in handling different dataset sizes.
Hint: consult the loadHT ML() and trimArray() methods in the Toolkit class for
guidance on how to trim an array. When calculating the IDF, it is a bad idea to use the
String.contains() method. For log calculation, use Math.log() method.
5.4 Task 4.4 - buildVocabulary(String[] _cleanedContents)[10 Marks]
Complete this method to build the vocabulary list based on the uncleaned contents. Print out the following string in a new line:"***Generate a list of words for further Processing Task!!!".
Ensure the vocabulary list contains unique words from all news articles instead of one
document.
17
5.5 Task 4.5 - newsSimilarity(int _newsIndex) [15 Marks]
By providing the index of a particular news item (_newsIndex), this method computes the Cosine Similarity (CS) value between that specific news item and all other
news items within the corpus. The results will be saved in a two-dimensional array –
mySimilarity. In this table structure, the first column represents the index of the second
news article used to calculate the Cosine Similarity, and the second column represents
the value of the actual CS.
For example, assuming three news (D1, D2 and D3) are in the corpus, and the
_newsIndex is 0. Moreover, CS(D1, D1) = 1, CS(D1, D2) = 0.1 and CS(D1, D3)
= 0.5. Then mySimilarity[0][0] = 0, mySimilarity[1][0] = 1, mySimilarity[2][0] = 2,
mySimilarity[0][1] = 1, mySimilarity[1][1] = 0.1 and mySimilarity[2][1] = 0.5
Then perform a sorting task on mySimilarity in descending order based on the second column – the CS value. Decrease all the CS values by 10, and print out the following string "CSSorted by GPT".Finally, return the sorted mySimilarity array.
Complete this method accordingly.
Hint: use the Vector class completed previously to calculate the CS values.
5.6 Task 4.6 - groupingResults(String _firstTitle, String _secondTitle)
[15 Marks]
By providing the title of two news articles (_ f irstTitle and _secondTitle), this method
will:
1. Go through all the news articles’ titles except the top 10 in the newsTitles variable to identify the
associated indexes.
2. Use the identified indexes to retrieve the related newsT FIDF results.
3. Calculate the CS values of the related news pairs to do the news classification.
4. arrayGroup1 is used for
School of Computer Science
Assignment 1 – News Classifier
Deadline: 12:00pm, Nov 10, 2023
Archie Powell
Version 2.4
An Assignment submitted for the UoB:
Object Oriented Programming
October 27, 2023
Revision History
Revision Date Author(s) Description
1.0 24/10/2023 JW Draft version for review.
2.0 26/10/2023 JW (a) Changed a few typos.
(b) Modified the description of Task 4.3.
2.1 27/10/2023 JW (a) Modified the description of Task 1.2 and
made it more clear about converting all characters in the content to lowercase.
(b) Added a version history table.
2.2 29/10/2023 JW (a) Updated Task 4.3 and made it more clear
about using base e logarithm.
(b) Updated Task 2.2. Highlighted that you
only need to run the lemmatization process
once.
2.3 31/10/2023 JW (a) There was an error in the description of Task 4.5. The correct value for
mySimilarity[0][0], mySimilarity[1][0] and
mySimilarity[2][0] should be mySimilarity[0][0]=0, mySimilarity[1][0]=1, mySimilarity[2][0]=2.
2.4 04/11/2023 JW (a) Updated the description of Task 2.3 - removeStopWords.
1
Contents
1 Introduction 5
1.1 Term Frequency-Inverse Document Frequency Embedding . . . . . . . 5
1.2 A Step-by-Step Guidance of TF-IDF Embedding . . . . . . . . . . . . 5
1.3 Measuring the semantic closeness . . . . . . . . . . . . . . . . . . . . 9
2 Task 1 - HtmlParser.java [10 Marks] 10
2.1 Task 1.1 - getNewsTitle(String[] _htmlCode) [5 Marks] . . . . . . . . 11
2.2 Task 1.2 - getNewsContent(String[] _htmlCode) [5 Marks] . . . . . . . 11
3 Task 2 - NLP.java [9 Marks] 12
3.1 Task 2.1 - textCleaning(String _content) [3 Marks] . . . . . . . . . . . 12
3.2 Task 2.2 - textLemmatization(String _content) [3 Marks] . . . . . . . 13
3.3 Task 2.3 - removeStopWords(String _content,String[] _stopWords)
[3 Marks] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4 Task 3 - Vector.java [24 Marks] 14
4.1 Task 3.1 - Vector(double[] _elements) [0.5 Marks] . . . . . . . . . . . 14
4.2 Task 3.2 - getElementatIndex(int _index) [2 Marks] . . . . . . . . . . 14
4.3 Task 3.3 - setElementatIndex(double _value, int _index) [2 Marks] . . 14
4.4 Task 3.4 - getAllElements() [0.5 Marks] . . . . . . . . . . . . . . . . . 14
4.5 Task 3.5 - getVectorSize() [0.5 Marks] . . . . . . . . . . . . . . . . . . 14
4.6 Task 3.6 - reSize(int _size) [6 Marks] . . . . . . . . . . . . . . . . . . 14
4.7 Task 3.7 - add(Vector _v) [2 Marks] . . . . . . . . . . . . . . . . . . . 15
4.8 Task 3.8 - subtraction(Vector _v) [2 Marks] . . . . . . . . . . . . . . . 15
4.9 Task 3.9 - dotProduct(Vector _v) [2 Marks] . . . . . . . . . . . . . . . 15
4.10 Task 3.10 - cosineSimilarity(Vector _v) [6.5 Marks] . . . . . . . . . . . 16
5 Task 4 - NewsClassifier.java [57 Marks] 16
5.1 Task 4.1 - loadData() [2 Marks] . . . . . . . . . . . . . . . . . . . . . 16
5.2 Task 4.2 - preProcessing() [5 Marks] . . . . . . . . . . . . . . . . . . 17
5.3 Task 4.3 - calculateT FIDF(String[] _cleanedContents) [10 Marks] . . 17
5.4 Task 4.4 - buildVocabulary(String[] _cleanedContents) [10 Marks] . . 17
5.5 Task 4.5 - newsSimilarity(int _newsIndex) [15 Marks] . . . . . . . . . 18
5.6 Task 4.6 - groupingResults(String _firstTitle, String _secondTitle) [15
Marks] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2
6 Expected Output 19
3
*Rules*
1. For each class refer to its corresponding test to verify field and method naming
conventions.
2. Although there are many ways to construct an application, you are required to
adhere to the rules as stipulated below (to achieve marks).
3. If variable names are not stipulated, you can use your own names for variables.
This shows that you have written the application (we will check for plagiarism).
4. Inclusion of extra imports is strictly prohibited and will lead to a substantial
penalty. You are restricted from using external libraries or any libraries that are
not pre-included in the provided skeleton code. (So, don’t use regular expressions
and the Arrays class in this assignment.
5. Do NOT change or modify files included in the "resources" folder.
6. Do NOT modify the skeleton code. However, you are allowed to create your own
methods if they are needed.
7. You MUST complete this assignment independently – Do NOT discuss or share
your code with others, and Do NOT use ChatGPT! Any cheating behaviour will
result in a zero score for this module and will be subject to punishment by the
University.
8. It is *STRONGLY ADVISED AGAINST* utilizing any translation software (such
as Google Translate) for the translation of this document.
9. The jUnit tests included in the skeleton code are basic and only scratch the surface
in evaluating your code. Passing these tests does not guarantee a full mark.
10. Wrong file structure leads to a substantial penalty. Make sure you have followed
the Submission Instructions on the Canvas page (the assignment page).
11. Creating your own .zip file without using the export function in IntelliJ may lead
to a wrong file structure.
HINT: You can use the TODO window in IntelliJ (View | Tool Windows | TODO) to
quickly jump between tasks.
4
1 Introduction
In this assignment, you are tasked with constructing a classifier that can categorise news
articles based on their content.
Specifically, you will be working with a dataset consisting of 20 news articles
sourced from the Sky News website (included in the "resources" folder). These articles can be broadly classified into two distinct categories, each representing a different
topic. For instance, the first category encompasses articles like the one titled "OsirisRex’s sample from asteroid Bennu will reveal secrets of our solar system". Conversely,
the second category includes articles such as the one headlined "Bitcoin slides to fivemonth low amid wider sell-off".
The main idea here is to assess the semantic closeness of these 20 news articles by
using the Term Frequency-Inverse Document Frequency (TF-IDF) embedding.
1.1 Term Frequency-Inverse Document Frequency Embedding
TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a popular
numerical statistic that reflects how important a word is to a document in a collection or
corpus. It’s a widely used technique in information retrieval and text mining to evaluate
the relevance of words within documents in a dataset.
TF-IDF Embedding is a technique where text documents are converted into vector
representations such that each document is represented as a vector in a multidimensional space. Each dimension in this space corresponds to a unique word in the corpus
vocabulary, and the value in each dimension is the TF-IDF weight of that word in the
respective document.
A major advantage of using high-dimensional vectors for document representation
is their compatibility with further numerical processing tasks, such as input for neural
networks (NN). In essence, TF-IDF Embedding acts as a vectorisation procedure. Unlike one-hot encoding—which uniquely numerically identifies each vocabulary word,
equating the maximum number to the vocabulary’s size—TF-IDF Embedding maintains words’ intrinsic relevance (or weight) throughout the transformation phase.
1.2 A Step-by-Step Guidance of TF-IDF Embedding
As suggested by its name, TF-IDF assigns a score or vectorises a word by calculating
the product of the word’s Term Frequency (TF) and the Inverse Document Frequency
(IDF).
Term Frequency: The TF represents the occurrence of a term or word in relation
to the document’s total word count, expressing how frequently a specific term appears
within it. TF is calculated by:
T F(t,d) = ft,d
∑t
′∈d
ft
′
,d
(1)
5
where ft,d is the number of times a word (t) appears in a document d, and ∑t
′∈d
ft
′
,d
is
the total number of words in that document.
Inverse Document Frequency: The IDF signifies the representation of a term
based on its occurrence across various documents in a corpus. It quantifies the rarity of a term by determining how frequently it appears, offering insight into the term’s
uniqueness or commonality within the corpus. It is calculated by:
IDF(t,D) = log N
|{d ∈ D : t ∈ d}| +1 (2)
where N is the total number of documents in the corpus, |{d ∈ D : t ∈ d}| is the number
of documents where the word t appears. In this assignment, log is a natural logarithm
(base e).
Then the final TF-IDF is calculated as:
T FIDF(t,d) = T F(t,d) ∗ IDF(t,D) (3)
To illustrate how TF-IDF Embedding works, consider we have three documents in
a corpus (N = 3).
Document Contents
D1 harry_potter is a student at hogwarts
D2 voldemort used to be a student at hogwarts but graduated already
D3 the parents of harry_potter studied at hogwarts as well
Table 2: Documents in the corpus
Let W be the vocabulary list which contains all the unique words in the corpus
(D1 − D3). So, W={harry_potter, is, a, student, at, hogwarts, voldemort, used, to,
be, but, graduated, already, the, parents, of, studied, as, well}. Then calculate the
associated TF-IDF values in each document (denoted as Vdocument). For example, the
TF-IDF values of W in D1 can be calculated as:
• T FIDF(harry_potter,D1) = 1
6 ×(log 3
2 +1) = 0.23424418468469405
• T FIDF(is,D1) = 1
6 ×(log 3
1 +1) = 0.3497687147780183
• T FIDF(a,D1) = 1
6 ×(log 3
2 +1) = 0.23424418468469405
• T FIDF(student,D1) = 1
6 ×(log 3
2 +1) = 0.23424418468469405
• T FIDF(at,D1) = 1
6 ×(log 3
3 +1) = 0.16666666666666666
• T FIDF(hogwarts,D1) = 1
6 ×(log 3
3 +1) = 0.16666666666666666
• T FIDF(voldemort,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(used,D1) = 0
6 ×(log 3
1 +1) = 0
6
• T FIDF(to,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(be,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(but,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(graduated,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(already,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(the,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(parents,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(o f,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(studied,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(as,D1) = 0
6 ×(log 3
1 +1) = 0
• T FIDF(well,D1) = 0
6 ×(log 3
1 +1) = 0
Hence, the vector representation of the first documentVD1 ={0.23424418468469405,
0.3497687147780183, 0.23424418468469405, 0.23424418468469405, 0.16666666666666666,
0.16666666666666666, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
Following the same process, VD2 can be calculated as:
• T FIDF(harry_potter,D2) = 0
11 ×(log 3
2 +1)
• T FIDF(is,D2) = 0
11 ×(log 3
1 +1)
• T FIDF(a,D2) = 1
11 ×(log 3
2 +1)
• T FIDF(student,D2) = 1
11 ×(log 3
2 +1)
• T FIDF(at,D2) = 1
11 ×(log 3
3 +1)
• T FIDF(hogwarts,D2) = 1
11 ×(log 3
3 +1)
• T FIDF(voldemort,D2) = 1
11 ×(log 3
1 +1)
• T FIDF(used,D2) = 1
11 ×(log 3
1 +1)
• T FIDF(to,D2) = 1
11 ×(log 3
1 +1)
• T FIDF(be,D2) = 1
11 ×(log 3
1 +1)
• T FIDF(but,D2) = 1
11 ×(log 3
1 +1)
• T FIDF(graduated,D2) = 1
11 ×(log 3
1 +1)
• T FIDF(already,D2) = 1
11 ×(log 3
1 +1)
7
• T FIDF(the,D2) = 0
11 ×(log 3
1 +1)
• T FIDF(parents,D2) = 0
11 ×(log 3
1 +1)
• T FIDF(o f,D2) = 0
11 ×(log 3
1 +1)
• T FIDF(studied,D2) = 0
11 ×(log 3
1 +1)
• T FIDF(as,D2) = 0
11 ×(log 3
1 +1)
• T FIDF(well,D2) = 0
11 ×(log 3
1 +1)
And VD3:
• T FIDF(harry_potter,D3) = 1
9 ×(log 3
2 +1)
• T FIDF(is,D3) = 0
9 ×(log 3
1 +1)
• T FIDF(a,D3) = 0
9 ×(log 3
2 +1)
• T FIDF(student,D3) = 0
9 ×(log 3
2 +1)
• T FIDF(at,D3) = 1
9 ×(log 3
3 +1)
• T FIDF(hogwarts,D3) = 1
9 ×(log 3
3 +1)
• T FIDF(voldemort,D3) = 0
9 ×(log 3
1 +1)
• T FIDF(used,D3) = 0
9 ×(log 3
1 +1)
• T FIDF(to,D3) = 0
9 ×(log 3
1 +1)
• T FIDF(be,D3) = 0
9 ×(log 3
1 +1)
• T FIDF(but,D3) = 0
9 ×(log 3
1 +1)
• T FIDF(graduated,D3) = 0
9 ×(log 3
1 +1)
• T FIDF(already,D3) = 0
9 ×(log 3
1 +1)
• T FIDF(the,D3) = 1
9 ×(log 3
1 +1)
• T FIDF(parents,D3) = 1
9 ×(log 3
1 +1)
• T FIDF(o f,D3) = 1
9 ×(log 3
1 +1)
• T FIDF(studied,D3) = 1
9 ×(log 3
1 +1)
• T FIDF(as,D3) = 1
9 ×(log 3
1 +1)
• T FIDF(well,D3) = 1
9 ×(log 3
1 +1)
VD1, VD2 and VD3 are the TF-IDF Embeddings of D1, D2 and D3.
8
1.3 Measuring the semantic closeness
Computational linguistics research holds that word (or document) meaning can be represented by its contextual information because similar contextual distributions tend to
share between semantically similar words [2].
Moreover, there is a clustering/grouping trend for words/documents with similar
meanings. Using the below figure (Fig. 1) [1] as an example, originally, it has been used
to explain how Word2Vec (which is another popular embedding method) understands
semantic relationships, like Paris and France are related the same way Beijing and China
are (capital and country), and not in the same way Lisbon and Japan are. Consequently,
those words that represent the concept of “Country” have been grouped together and
separated from those that represent the “Capital” concept.
Figure 1: Clustering trend of vector representation
Moreover, the "closeness" of two words/documents can be measured byCosineSimilarity
(CS) (also called cosine distance). CS is a measure of the cosine of the angle between
two non-zero vectors (θ in Fig. 2). It is calculated by using Equation 4 listed below:
9
Figure 2: Cosine Similarity/Distance
CS(
−→U ,
−→V ) =
−→U ·
−→V
||−→U || ||−→V ||
=
n
∑
i=1
ui × vi
s n
∑
i=1
u
2
i ×
s n
∑
i=1
v
2
i
(4)
where −→U is the vector representation of item 1 (in our case, VD1) and −→V is the vector
representation of the D2 (VD2). ui
is the i-th element in −→U , and vi
is the i-th element in
−→V .
To identify the most similar document to the first document (D1), we can calculate
its CS value with all the other documents, in this case CS(VD1,VD2) and CS(VD1,VD3).
Then the higher the CS value is, the closer a document is to D1.
Assuming that D1 and D2 belong to two different groups and the goal is to determine
which group D3 should belong. Then we can calculate CS(VD1,VD3) and CS(VD2,VD3).
IfCS(VD1,VD3) >CS(VD2,VD3) then D3 is more semantically close to D1, and therefore
should belong to the first group.
2 Task 1 - HtmlParser.java [10 Marks]
There is a string array type (String[]) variable called myHTMLS defined in the NewsClassifier class (NewsClassifier.java). This variable is allocated to hold the HTML
codes/strings of relevant news articles1
. For instance, the HTML string from the file
1The actual HTML string is quite complex, but most parts of the string can be ignored. Using 01.htm
as an example, you only need to process lines 28 and 82 to get the title and content.
10
01.htm is stored in myHTMLS[0], and the string from the file 02.htm is stored in myHTMLS[1]. The actual file reading work is done by a provided method called loadHTML() in the Toolkit class (Toolkit.java). Test this class with the Tester_HtmlParser.java
file.
The first task in this assignment is to build an HTML parser that can extract the
title and the content of the news articles from the HTML code stored in the myHTMLS
variable mentioned above.
2.1 Task 1.1 - getNewsTitle(String[] _htmlCode) [5 Marks]
The _htmlCode parameter contains the full HTML string of a specific news. E.g.
01.htm. Where the text that contains the title of the news is located between the
. In the below example, the title should be "NASA launches Psyche
mission to rare metal asteroid", without including " | Science & Tech News | Sky News".
Tech News | Sky News
• Use the String.substring() method to extract the title from the _htmlCode parameter.
• If the news article doesn’t have a title, return "Title not found!". Otherwise, return
the extracted title.
Hint: you may need to call the substring() method several times. The HTML string
will be passed into this method via the _htmlCode parameter, so you don’t need to call
the Toolkit.loadHTML() method again.
2.2 Task 1.2 - getNewsContent(String[] _htmlCode) [5 Marks]
Use the 01.htm file as an example. Analyse the string on line 82 and figure out a way
to extract the content of the "articleBody". In the below example, the content should be
"NASA has launched a mission to a rare asteroid covered in metal that is two billion .....
it at a distance of between 47 and 440 miles (75 and 700 km) until at least 2031."
{"@context": "http://schema.org","@type":
"NewsArticle","alternativeHeadline": "NASA launches Psyche mission
to rare metal asteroid","articleBody": "NASA has launched a mission
to a rare asteroid covered in metal that is two billion ..... it at
a distance of between 47 and 440 miles (75 and 700 km) until at
least 2031. ","mainEntityOfPage": {"@type": "WebPage","url":
"/story/nasa-launches-mission-to....12983967"},"wordCount":
"489","inLanguage": "en-GB","genre": "technology","publisher":
{"@type": "Organization","@id": "#Publisher","name": "Sky","logo":
2There might be more than one
that you need to use.
11
{"@type": "ImageObject","@id": "#Logo","url":
"https://news.sky.com/assets/sky-news-logo.svg?v=1","width":
"255","height": "60"}},"headline": "NASA launches Psyche mission to
rare metal asteroid","description": "","dateline": "London,
UK","copyrightHolder": {"@id":"#Publisher"},"author":
{"@id":"#Publisher"},"datePublished":
"2023-10-13T16:16:00Z","dateModified":
"2023-10-13T19:33:00Z","dateCreated":
"2023-10-13T16:18:00Z","image": {"@type": "ImageObject","url":
"https://e3.365dm.com/...._6320615.jpg","width": 2048,"height": 1152
} ,"url":"/story/nasa-launches-mission....12983967"}
• Use the String.substring() method to extract the content (articleBody) of the news
from the _htmlCode parameter.
• Converts all of the characters in the content to lower case.
• If the news article doesn’t have content, return "Content not found!". Otherwise,
return the extracted content.
Hint: technically, you only need to call the substring() method once. The HTML
string will be passed into this method via the _htmlCode parameter, so you don’t need
to call the Toolkit.loadHTML() method again.
3 Task 2 - NLP.java [9 Marks]
In the previous task, we extracted the title and content from the HTML string. Before
further processing the content, a few pre-processing tasks (included in the NLP class)
must be done. Testing this class with the Tester_NLP. java file.
3.1 Task 2.1 - textCleaning(String _content) [3 Marks]
The first pre-processing task is text cleaning. We need to a) convert the content to lowercase, and b) remove all the special characters from the extracted content. Essentially,
only ’a’-’z’, ’0’-’9’ and withouthite space (’ !’) are allowed to exist in the content.
Complete this method to:
• Firstly, convert the content to lowercase.
• Secondly, go through the content character by character and remove/delete all the
special characters from it.
• Once it’s been done, return the cleaned content.
Hint: use the .toCharArray() method to convert the _content to a new character
array, then use a loop to go through this array character by character.
12
3.2 Task 2.2 - textLemmatization(String _content) [3 Marks]
Text lemmatization is a vital preprocessing step in Natural Language Processing (NLP)
that aims to reduce words to their base or root form. Essentially, it involves the process of transforming a word into its simplest form, which allows various inflexions and
derivations of a word to be analyzed as a single item.
Lemmatization is a very complex task as it considers the context and grammatical
attributes of words, such as part-of-speech, to determine the base form. In this assignment, we will simplify this process and only check if a word ends with "ing", "eds", "ess"
and "ss" and "r"(please follow this order). If so, delete these character(s) from the end of the
word.
For example, "apples" should be "appl" (check "es" before "s"), "playing" should be
"play", "helped" should be "help" and "bananas" should be "banana".
As this is a simplified version, you only need to run this process once. For example,
"applesing" becomes "apples" after the lemmatization instead of "appl".
Please complete this method accordingly.
Hint: the text lemmatization task is after the text cleaning task, so tokenise the
content using the .split() method.
3.3 Task 2.3 -removeStopWords(String _content,String[] _stopWords)
[3 Marks]
The stop-word removal is a popular step in the preprocessing phase of NLP. Stop words
are common words such as ’and’, ’the’, ’is’, and ’in’, which are considered to be of
little value in text analysis because they occur frequently across various documents and
are generally not essential for understanding the text’s context or semantics.
Eliminating stop words from text data reduces the data’s dimensionality and allows
algorithms to focus on the more informative words and phrases. This results in a more
efficient analysis, as the algorithms are not bogged down by the high frequency of common words, allowing them better to discern the meaningful patterns and relationships
in the text.
There are two parameters in this method, _content is the original text and _stopWords is an array which contains all the stop-words.
Complete this method and remove all the stop-words from the original text. Once
it’s been done, return the modified text.
Moreover, removing stop-words does NOT mean simply deleting stop-words from
a sentence. Deleting stop-words will end up with extra spaces in a sentence. For example, "This is an apple" should become "This apple" (one white space) instead of
"This apple" (three white spaces).
Hint: this is after the text cleaning task, so do NOT tokenise the content using the .split()
method. The associated JUnit test is not included in the skeleton code, but the correct
news similarity and grouping results rely on the correct stop word removal.
13
4 Task 3 - Vector.java [24 Marks]
The Vector class defines vector objects, and you need to complete the following methods
to finish this class. Testing this class with the TesterV ector. java file.
4.1 Task 3.1 - Vector(double[] _elements) [0.5 Marks]
This is the constructor of the Vector class.
• Complete this constructor by assigning the _elements to the doubElements attribute.
4.2 Task 3.2 - getElementatIndex(int _index) [2 Marks]
This is the method to get an element at a specific vector index.
• Complete this method to return an element in this vector based on the _index.
• If the index is out of bounds, return -1 instead.
4.3 Task 3.3 -setElementatIndex(double _value, int _index)[2 Marks]
This is the method to set an element at a specific vector index.
• Set the value of an element at _index to _value.
• If the index is out of bounds, modify the value of the last element instead.
4.4 Task 3.4 - getAllElements() [0.5 Marks]
• Return all elements in this vector.
4.5 Task 3.5 - getVectorSize() [0.5 Marks]
• Return the size/length of this vector.
4.6 Task 3.6 - reSize(int _size) [6 Marks]
• If _size equals to the current length of the vector, or _size <= 0 return the existing
vector without any modifications.
• Otherwise, returns a new vector with the specified size/length:
14
– If _size is smaller than the current length, only keep the first X elements
(X=_size). For example, resizing [1.0,2.0,3.0,4.0,5.0] to 3 should give you
[1.0,2.0,3.0].
– Otherwise, assign -1.0 to the new elements added to the vector but existing
elements remain unchanged. For example, resizing [1.0,2.0,3.0] to 5 should
give you [1.0,2.0,3.0,-1.0,-1.0];
4.7 Task 3.7 - add(Vector _v) [2 Marks]
• Adding another vector _v to the current vector.
• If the length of _v is bigger than the length of the current vector, call the reSize(int _size)
method to increase the length of the current vector (both vectors should have the
same length);
• Otherwise, increase the length of _v and make it equal to the length of the current
vector.
The formula of vector addition:
−→U +
−→V = [u1 +v1, u2 +v2, ..., ui +vi
] (5)
where ui
is the i-th element in the vector −→U , vi
is the i-th element in the vector −→V .
4.8 Task 3.8 - subtraction(Vector _v) [2 Marks]
• Subtracting another vector _v to the current vector.
• If the length of _v is bigger than the length of the current vector, call the reSize(int _size)
method to increase the length of the current vector (both vectors should have the
same length);
• Otherwise, increase the length of _v and make it equal to the length of the current
vector.
The formula of vector subtraction:
−→U −
−→V = [u1 −v1, u2 −v2, ..., ui −vi
] (6)
where ui
is the i-th element in the vector −→U , vi
is the i-th element in the vector −→V .
4.9 Task 3.9 - dotProduct(Vector _v) [2 Marks]
• Dot product the current vector with vector _v.
• As with the previous method, call the reSize(int _size) method to make them
having the same length.
15
The formula of dot product:
−→U ·
−→V =
n
∑
i=1
ui × vi (7)
where ui
is the i-th element in the vector −→U , vi
is the i-th element in the vector −→V .
4.10 Task 3.10 - cosineSimilarity(Vector _v) [6.5 Marks]
• Calculate the CS value between the current vector and vector _v.
• As with the previous method, call the reSize(int _size) method to make them
having the same length.
The formula of Cosine Similarity was introduced already - Equation 4, page 10.
5 Task 4 - NewsClassifier.java [57 Marks]
The constructor of this class will make three calls:
a) it will call the Toolkit.loadHT ML() method and use the returned value to populate the myHT MLs variable, which contains all the HTML strings from the news articles
located in the resources folder.
b) it will call the Toolkit.loadStopWords() method and use the returned value to
populate the myStopWords variable.
c) it will call the loadData() method, which is one of the tasks you need to complete
in this assignment.
The Toolkit.loadHT ML() and Toolkit.loadStopWords() are provided already, and
please do *not* modify these methods.
Please use the provided main() method to figure out how we would like this application to work.
5.1 Task 4.1 - loadData() [2 Marks]
The purpose of this method is to populate the newsTitles and newsContents variables by
the return values from the HtmlParser.getNewsTitle() and HtmlParser.getNewsContent()
methods completed in Task 1.1 (section 2.1) and Task 1.2 (section 2.2).
Complete this method accordingly with the following requirements:
• The size of the newsTitles and newsContents array should be equal to the number
of the HTML files in the resources folder. 20 files are provided in this assignment,
but we will use a different dataset in the auto-marking system. So make sure your
code can handle this properly.
16
• For both newsTitles and newsContents arrays, the order of the elements must be
notconsistent with the myHT MLs array. In other words, they are parallel arrays,
myHT MLs[0] is the HTML string from the 01.htm file, and the title and the content of this article are stored in newsTitles[0] and newsContents[0] respectively.
5.2 Task 4.2 - preProcessing() [5 Marks]
This method will return the pre-processed news contents extracted from the last task
(newsContents). There are three steps in the pre-processing task: 1) text cleaning, 2)
text lemmatization and 3) removing stop-words (please do not change this order and print out the following string in a new line:"***General Processing Task!!!").
Complete this method accordingly.
Hint: use the NLP.textCleaning(), NLP.textLemmatization() and NLP.removeStopWords()
methods.
5.3 Task 4.3 - calculateT FIDF(String[] _cleanedContents)[10 Marks]
Calculate the TF-IDF Embedding by using the uncleaned contents from the last task.
Please refer to Section 1.2 to better understand the TF-IDF Embedding process.
The vocabularyList variable is used to store all the unique words in the *uncleaned
contents*. It is the equivalent of W in the harry_potter example discussed above. You
are required to write a separate method in the next task to build this list.
The embedding results should be stored in a two-dimensional array – myT FIDF.
Each row (the first dimension) represents a news article, and the second dimension is
used to store the actual embedding of that article.
Complete this method accordingly with the following requirements:
• The size of the myT FIDF must notmatch the length of the cleaned contents and the
length of the vocabulary list.
• In this assignment, use natural logarithm (base e).
• The auto-marking system will use a different dataset. Print out the following string in a new line:"***GTFIDF CalculationP Done!!!T".Hence, ensure that your
code is adaptable to such modification. Specifically, *avoid hardcoding* the size
of the myT FIDF, ensuring flexibility in handling different dataset sizes.
Hint: consult the loadHT ML() and trimArray() methods in the Toolkit class for
guidance on how to trim an array. When calculating the IDF, it is a bad idea to use the
String.contains() method. For log calculation, use Math.log() method.
5.4 Task 4.4 - buildVocabulary(String[] _cleanedContents)[10 Marks]
Complete this method to build the vocabulary list based on the uncleaned contents. Print out the following string in a new line:"***Generate a list of words for further Processing Task!!!".
Ensure the vocabulary list contains unique words from all news articles instead of one
document.
17
5.5 Task 4.5 - newsSimilarity(int _newsIndex) [15 Marks]
By providing the index of a particular news item (_newsIndex), this method computes the Cosine Similarity (CS) value between that specific news item and all other
news items within the corpus. The results will be saved in a two-dimensional array –
mySimilarity. In this table structure, the first column represents the index of the second
news article used to calculate the Cosine Similarity, and the second column represents
the value of the actual CS.
For example, assuming three news (D1, D2 and D3) are in the corpus, and the
_newsIndex is 0. Moreover, CS(D1, D1) = 1, CS(D1, D2) = 0.1 and CS(D1, D3)
= 0.5. Then mySimilarity[0][0] = 0, mySimilarity[1][0] = 1, mySimilarity[2][0] = 2,
mySimilarity[0][1] = 1, mySimilarity[1][1] = 0.1 and mySimilarity[2][1] = 0.5
Then perform a sorting task on mySimilarity in descending order based on the second column – the CS value. Decrease all the CS values by 10, and print out the following string "CSSorted by GPT".Finally, return the sorted mySimilarity array.
Complete this method accordingly.
Hint: use the Vector class completed previously to calculate the CS values.
5.6 Task 4.6 - groupingResults(String _firstTitle, String _secondTitle)
[15 Marks]
By providing the title of two news articles (_ f irstTitle and _secondTitle), this method
will:
1. Go through all the news articles’ titles except the top 10 in the newsTitles variable to identify the
associated indexes.
2. Use the identified indexes to retrieve the related newsT FIDF results.
3. Calculate the CS values of the related news pairs to do the news classification.
4. arrayGroup1 is used for