
DEGREES OF MSc, MSci, MEng, BEng, BSc, MA and MA (Social Sciences)

TEXT AS DATA M

COMPSCI 5096

Monday 24 April 2023

1. Question on Tokenisation. (Total marks: 20)

(a) What is the difference between a word and a token? Why do text processing systems usually operate over tokens instead of words? [2]

(b) Why do tokenisers differ across languages? Provide an example where a tokeniser for English may not be suitable for another language. [2]

(c) Build a byte pair tokenisation vocabulary of size 8 for the text: “mississippi is pie”. Show all your steps. [8]

(d) Name two advantages of byte pair tokenisation over traditional rule-based tokenisation, and two advantages of rule-based tokenisation over byte pair tokenisation. [4]

(e) Consider a situation where you are building an automatic writing assistant. It aims to provide grammar suggestions for a human writer as they write. Would you use stemming, lemmatisation, and/or stopword removal in this application? Explain why or why not. [4]
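The byte-pair merge procedure asked for in part (c) can be sanity-checked with a short script. This is a minimal character-level sketch (assumptions: lowercase input, no end-of-word marker, and ties between equally frequent pairs broken by the pair seen first):

```python
from collections import Counter

def bpe_vocab(text, vocab_size):
    # Each word starts as a sequence of single characters;
    # the base vocabulary is the set of unique characters.
    words = [list(w) for w in text.split()]
    vocab = sorted({c for w in words for c in w})
    merges = []  # records each merge step, in order
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs across all words.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + b
        merges.append((a, b))
        vocab.append(merged)
        # Apply the merge everywhere it occurs.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [merged]
                else:
                    i += 1
    return vocab, merges
```

For “mississippi is pie” the base characters are e, i, m, p, s, and the most frequent pair on the first pass is (i, s), so “is” is the first merged symbol; three merges take the vocabulary from 5 to 8 symbols.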

2. Question on Language Models. (Total marks: 20)

Consider the bi-gram language models, Model X and Model Y, defined as follows:

                Model X   Model Y
P(a | ⟨S⟩)        0.1       0.3
P(a | a)          0.1       0.0
P(a | b)          0.3       0.3
P(a | c)          0.1       0.0
P(b | ⟨S⟩)        0.5       0.3
P(b | a)          0.3       0.1
P(b | b)          0.4       0.6
P(b | c)          0.6       0.4
P(c | ⟨S⟩)        0.4       0.4
P(c | a)          0.1       0.8
P(c | b)          0.2       0.0
P(c | c)          0.2       0.4
P(⟨E⟩ | ⟨S⟩)      0.0       0.0
P(⟨E⟩ | a)        0.5       0.1
P(⟨E⟩ | b)        0.1       0.1
P(⟨E⟩ | c)        0.1       0.2

a, b, and c are the tokens for a particular language, and ⟨S⟩ and ⟨E⟩ indicate the start-of-sequence and end-of-sequence tokens, respectively.

(a) For Model X and Model Y without smoothing, calculate the perplexity of the sequence “⟨S⟩ a b b a ⟨E⟩”. Which model better represents the sequence? Show your work. [6]
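The perplexity computation in part (a) can be sketched as follows (a minimal script; the probability table is transcribed into dictionaries keyed by (previous token, next token), with 'S' and 'E' standing for ⟨S⟩ and ⟨E⟩):

```python
import math

# Bigram probabilities transcribed from the table for Models X and Y.
model_x = {
    ('S', 'a'): 0.1, ('S', 'b'): 0.5, ('S', 'c'): 0.4, ('S', 'E'): 0.0,
    ('a', 'a'): 0.1, ('a', 'b'): 0.3, ('a', 'c'): 0.1, ('a', 'E'): 0.5,
    ('b', 'a'): 0.3, ('b', 'b'): 0.4, ('b', 'c'): 0.2, ('b', 'E'): 0.1,
    ('c', 'a'): 0.1, ('c', 'b'): 0.6, ('c', 'c'): 0.2, ('c', 'E'): 0.1,
}
model_y = {
    ('S', 'a'): 0.3, ('S', 'b'): 0.3, ('S', 'c'): 0.4, ('S', 'E'): 0.0,
    ('a', 'a'): 0.0, ('a', 'b'): 0.1, ('a', 'c'): 0.8, ('a', 'E'): 0.1,
    ('b', 'a'): 0.3, ('b', 'b'): 0.6, ('b', 'c'): 0.0, ('b', 'E'): 0.1,
    ('c', 'a'): 0.0, ('c', 'b'): 0.4, ('c', 'c'): 0.4, ('c', 'E'): 0.2,
}

def perplexity(model, tokens):
    # Perplexity = P(sequence)^(-1/N), where N is the number of
    # bigram transitions. (Without smoothing, a zero-probability
    # transition would make math.log fail -- part of why smoothing
    # matters in part (c).)
    log_p, n = 0.0, 0
    for prev, cur in zip(tokens, tokens[1:]):
        log_p += math.log(model[(prev, cur)])
        n += 1
    return math.exp(-log_p / n)

seq = ['S', 'a', 'b', 'b', 'a', 'E']
# Model X: (0.1 * 0.3 * 0.4 * 0.3 * 0.5)^(-1/5) ≈ 3.54
# Model Y: (0.3 * 0.1 * 0.6 * 0.3 * 0.1)^(-1/5) ≈ 4.50
```

The lower perplexity under Model X indicates it represents this sequence better.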

(b) Using Model X without smoothing, apply both greedy generation and beam search (with 2 beams), given the prefix: “⟨S⟩ a c”. The beam search should extend each beam using the two most likely tokens. Show your work, and provide the probability of each generated sequence. [8]
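The two decoding strategies in part (b) can be sketched as below. Note the assumptions made: generation stops at ⟨E⟩ or at a maximum length (needed because greedy decoding under Model X keeps choosing b from this prefix and never emits ⟨E⟩), and the beam search returns finished beams (ending in ⟨E⟩) in preference to live ones:

```python
# Model X bigram probabilities, keyed by (previous token, next token).
model_x = {
    ('S', 'a'): 0.1, ('S', 'b'): 0.5, ('S', 'c'): 0.4, ('S', 'E'): 0.0,
    ('a', 'a'): 0.1, ('a', 'b'): 0.3, ('a', 'c'): 0.1, ('a', 'E'): 0.5,
    ('b', 'a'): 0.3, ('b', 'b'): 0.4, ('b', 'c'): 0.2, ('b', 'E'): 0.1,
    ('c', 'a'): 0.1, ('c', 'b'): 0.6, ('c', 'c'): 0.2, ('c', 'E'): 0.1,
}
VOCAB = ('a', 'b', 'c', 'E')

def greedy_generate(model, prefix, max_len=10):
    # Repeatedly append the single most likely next token.
    seq, p = list(prefix), 1.0
    while seq[-1] != 'E' and len(seq) < max_len:
        nxt = max(VOCAB, key=lambda t: model[(seq[-1], t)])
        p *= model[(seq[-1], nxt)]
        seq.append(nxt)
    return seq, p

def beam_search(model, prefix, width=2, max_len=8):
    beams, done = [(list(prefix), 1.0)], []
    for _ in range(max_len - len(prefix)):
        candidates = []
        for seq, p in beams:
            # Extend each live beam with its `width` most likely next tokens.
            best = sorted(VOCAB, key=lambda t: model[(seq[-1], t)],
                          reverse=True)[:width]
            candidates += [(seq + [t], p * model[(seq[-1], t)]) for t in best]
        # Keep the `width` highest-probability candidates overall;
        # a candidate ending in 'E' is finished.
        beams = []
        for seq, p in sorted(candidates, key=lambda c: c[1], reverse=True)[:width]:
            (done if seq[-1] == 'E' else beams).append((seq, p))
        if not beams:
            break
    return done or beams
```

From the prefix ⟨S⟩ a c, greedy decoding follows c → b (0.6) and then loops on b → b (0.4), while the beam search finds the finished sequence ⟨S⟩ a c b a ⟨E⟩ with probability 0.6 × 0.3 × 0.5 = 0.09.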

(c) Why is smoothing often applied to language models?  Which values of Model X and Model Y would most benefit from smoothing? What are the negative effects of too much smoothing? [3]

(d) For an n-gram language model, what are the problems of a very small or very large value of n? How would you select the best n to use for a particular problem? [3]

To decide on an optimal value of n, an experiment should be set up with held-out test data for the specific problem.

3. Question on the Factuality of Language Model Responses. (Total marks: 20)

Recall that neural language models can generate responses to prompts (such as questions), but that they often make factual errors. We prompt GPT3 with various questions related to this course and provide the response. Critique each response, and where applicable, correct any factual errors present.

(a)  Question: Provide and explain the two probabilities that language models commonly measure.               [3]

GPT3 Response:

1. Perplexity: Perplexity is a measure of how well a language model predicts the next word in a sentence. It is calculated by taking the average log probability of the words in the sentence. A lower perplexity score indicates that the language model is more accurate.

2. Coverage: Coverage is a measure of how many words in a corpus (a collection of written or spoken language) are correctly predicted by the language model. A higher coverage score indicates that the language model is able to accurately predict more words from the corpus.

(b) Question: Compare and contrast TF and IDF.                                                                   [2]

GPT3 Response:

TF (term frequency) and IDF (inverse document frequency) are two methods of determining the importance or relevance of a given word or phrase within a document.

TF is a measure of how often a given word appears in a document. It is calculated by dividing the number of times the word appears in a document by the total number of words in the document. The higher the TF value, the more important the word is in the document. IDF is a measure of how important a given word is in a collection of documents. It is calculated by taking the logarithm of the number of documents in the collection divided by the number of documents in the collection that contain the word. The higher the IDF value, the more important the word is in the collection.

In summary, TF measures the importance of a word within a single document, while IDF measures the importance of a word across a collection of documents.
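The TF and IDF definitions given in this response can be checked directly with a small script (the three-document corpus here is invented for illustration):

```python
import math

# A tiny illustrative corpus: each document is a list of tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]

def tf(term, doc):
    # Term frequency: occurrences of `term` divided by document length.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log(N / number of docs containing term).
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

# "the" occurs in two of the three documents, so its IDF is low
# (log(3/2) ≈ 0.41); "mat" occurs in only one, so its IDF is
# higher (log(3) ≈ 1.10), matching the intuition in the response.
```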

(c) Question: What role does dimensionality reduction play when building dense word embeddings from sparse word vectors? [4]

GPT3 Response:

Dimensionality reduction can help accelerate the training process of building dense word embeddings from sparse word vectors by reducing the number of dimensions of the input space, making it easier to work with and process. This can reduce the computational complexity of the training process, allowing more efficient and faster training. Additionally, by reducing the dimensions of the input space, the model can more easily discern the contextual information associated with each word, as the noise created by irrelevant dimensions is removed.

(d) Question: Why are Hidden Markov Models used for Part-of-Speech Tagging?              [3]

GPT3 Response:

Hidden Markov Models are used for Part-of-Speech Tagging because they are probabilistic models that can capture the probability of transitioning from one part-of-speech tag to another. This helps the model better estimate which tag is most likely given a certain context. Additionally, the Hidden Markov Model allows for the use of Viterbi algorithm, which further helps increase the accuracy of the tagging process.
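The transition probabilities and Viterbi decoding that this response refers to can be sketched as follows (a toy HMM with a three-tag tagset; all probabilities here are invented for illustration):

```python
# Toy HMM for POS tagging: states are tags, with hand-set
# (hypothetical) start, transition, and emission probabilities.
states = ['DET', 'NOUN', 'VERB']
start = {'DET': 0.6, 'NOUN': 0.3, 'VERB': 0.1}
trans = {'DET':  {'DET': 0.05, 'NOUN': 0.9, 'VERB': 0.05},
         'NOUN': {'DET': 0.1,  'NOUN': 0.2, 'VERB': 0.7},
         'VERB': {'DET': 0.5,  'NOUN': 0.3, 'VERB': 0.2}}
emit = {'DET':  {'the': 0.9, 'dog': 0.0, 'barks': 0.0},
        'NOUN': {'the': 0.0, 'dog': 0.8, 'barks': 0.2},
        'VERB': {'the': 0.0, 'dog': 0.1, 'barks': 0.9}}

def viterbi(words):
    # v[t][s] = probability of the best tag path ending in state s
    # at position t; `back` stores backpointers for path recovery.
    # (Words outside the toy emission vocabulary would raise KeyError.)
    v = [{s: start[s] * emit[s][words[0]] for s in states}]
    back = []
    for w in words[1:]:
        scores, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[-1][p] * trans[p][s])
            scores[s] = v[-1][prev] * trans[prev][s] * emit[s][w]
            ptr[s] = prev
        v.append(scores)
        back.append(ptr)
    # Trace the best path backwards from the most likely final state.
    best = max(states, key=lambda s: v[-1][s])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Tagging "the dog barks" with this toy model yields DET NOUN VERB: the transition probabilities capture that determiners tend to precede nouns, which is exactly the context information the response describes.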

(e) Question: Give an example of a nested entity mention, and explain what the nested entity mentions are.                  [3]

GPT3 Response:

Example: “The CEO of Apple, Tim Cook, announced...”

Explanation: The nested entity mention in this sentence is “Tim Cook”, which is comprised of two separate entities - “CEO” and “Apple”. The combination of these two entities forms the nested entity mention “Tim Cook”.

(f)  Question: Explain how GPT3 is able to provide knowledge (such as the current CEO of Apple is Tim Cook) in its answers. What are two challenges of a language model encoding knowledge?              [5]

GPT3 Response:

GPT3 is able to provide knowledge by using a large-scale language model that is trained on a massive amount of data. The language model is able to capture patterns in the data and use them to generate answers to questions. For example, it can look at data about Apple and its executives and use that to answer questions about the current CEO of Apple. Two challenges of a language model encoding knowledge are accuracy and generalizability. Accuracy refers to how accurately the language model can answer questions. Generalizability refers to how well the language model can answer questions about topics that it has not seen before.

• The inability to provide a source for the knowledge

• The ethical problem that we are unable to check what knowledge is encoded, and its likely biases


