CS1026A Fall 2024
Assignment 3: YouTube Emotions
Important Notes:
• Read the whole assignment document before you begin coding. This is a more
complex specification than in past assignments and the examples and templates
near the end of this document will be important in solving this assignment.
• Assignments are to be completed individually. Use of tools to generate code,
working with another person, or copying from online resources are not allowed and
will result in a zero on this assignment regardless of how much was copied.
• A code template is given in Section 6 for your main.py and
emotions.py files. We highly recommend using these as a starting point for your
assignment. The code is also attached to the assignment on OWL.
Change Log:
• Nov. 4th: The comments.csv file attached to Brightspace had an unexpected
unicode character in one of the comments that changed the outcome of some of the
examples given in this document. comments.csv has now been corrected and the
examples in this document have been updated to match.
• Nov. 13th: A typo was found in the example for make_report() in Section 5. This has
now been corrected. The output shown at the end of the document in Section 7 was
still correct. This change has no impact on the autograder (it was marking correctly).
1. Learning Outcomes
By completing this assignment, you will gain skills relating to
• Functions
• Dictionaries and lists
• Complex data structures
• Text processing
• Working with TSV and CSV files
• File input and output
• Exceptions in Python
• Simple module use
• Writing code that adheres to a given specification
• Working with a real-world problem
2. Background
With the emergence of social media sites such as YouTube, Facebook, Reddit, Twitter (also
known as X), LinkedIn, and WhatsApp, more and more data is being produced and made
accessible online in a textual format. This textual data, such as YouTube comments,
Tweets, or Facebook posts, can be hard to process but is incredibly important for
organizations as it offers a current snapshot of the public’s emotions (affinity) or sentiment
about a topic at a current point in time. Having a live view of your customer’s current affinity
towards your products or the public’s view of your political campaign can be critical for
success.
Much work has been done towards the goal of creating large datasets of word affinity or
sentiment. One such effort is the National Research Council (NRC) Emotion Lexicon which
is a list of English words and their associations with eight basic emotions (anger, fear,
anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and
positive).
Our goal in this assignment is to use a simplified version of the NRC Emotion Lexicon to
classify YouTube comments based on one of the following emotions: anger, joy, fear, trust,
sadness, or anticipation. Based on the emotion contained in each comment for a particular
video we then want to generate a report that details the most common emotions YouTube
users have towards that video based on their comments.
3. Datasets
Your Python program will deal with two datasets: a keywords dataset that contains a
simplified version of the NRC Emotion Lexicon (this dataset will remain the same for all
tests of your program) and a Comma-Separated Values (CSV) file that contains the
comments for a particular YouTube video (this dataset will change for each test of your
program).
3.1 Keywords Dataset (TSV File)
The keywords.tsv file attached to this assignment contains a simplified version of the NRC
Emotion Lexicon. This is a Tab-Separated Values (TSV) file in which each line of the file
contains a single word and its emotional classification based on six emotions (anger, joy,
fear, trust, sadness, and anticipation). Each word in the file may be classified as having one
or more emotions.
The following is an example of the first 10 lines of this file where tab (\t) characters are
represented by arrows (→):
abacus→0→0→0→1→0→0
abandon→0→0→1→0→1→0
abandoned→1→0→1→0→1→0
abandonment→1→0→1→0→1→0
abbot→0→0→0→1→0→0
abduction→0→0→1→0→1→0
abhor→1→0→1→0→0→0
abhorrent→1→0→1→0→0→0
abolish→1→0→0→0→0→0
abominable→0→0→1→0→0→0
Each line starts with a keyword and is followed by a score (0 or 1) for each emotion in this
order: anger, joy, fear, trust, sadness, and anticipation. If a 1 is present it means that
keyword is related to that emotion. If a 0 is present the keyword is unrelated to that
emotion.
For example, according to the above the word “abacus” is related to the emotion of trust
and no other emotions. The word “abandon” is related to the emotions fear and sadness
and no other emotions.
All words in the dataset will be related to at least one emotion. This file’s contents will
remain the same for all tests but may be given a different filename based on the user’s input
(e.g. it could be named keys.tsv or words.tsv rather than keywords.tsv).
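As a minimal sketch of how one line in this format maps onto the six emotions, the snippet below splits a line on tab characters and pairs each score with its emotion name. The helper name parse_keyword_line is hypothetical and is not one of the required functions; the EMOTIONS list simply follows the column order given above.

```python
# Emotion names in the column order described above.
EMOTIONS = ['anger', 'joy', 'fear', 'trust', 'sadness', 'anticipation']

def parse_keyword_line(line):
    """Hypothetical helper: turn one TSV line into (word, scores dict)."""
    # Drop the trailing line break, then split on tab characters.
    fields = line.rstrip("\n").split("\t")
    word = fields[0]
    scores = {}
    # Pair each emotion name with the score in the matching column.
    for emotion, value in zip(EMOTIONS, fields[1:]):
        scores[emotion] = int(value)
    return word, scores

word, scores = parse_keyword_line("abandon\t0\t0\t1\t0\t1\t0\n")
print(word, scores)
```

Converting each score with int() is a choice made here so that the scores can later be added together; the file itself stores them as text.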
3.2 Comments Dataset (CSV File)
The user will provide a Comma-Separated Values (CSV) file that contains a set of YouTube
comments for a particular video. The name of this file will change based on the user’s input
but will always end in .csv and have the same format.
The following is an example of a possible line from this file (the file may contain one or
more lines). Note that this document wraps the line onto multiple lines but in the file this is
one line ended by a line break (\n):
2,PixelPioneer24,brazil,The excavation scenes in the movie were
excellent but the unnecessary derision of the hero's motives seemed
unfair. His eventuality of success was not adequately showcased.
Each line of this file will contain four values separated by a single comma character (,). The
values will always be in the following order:
Comment ID, Username, Country, Comment Text
Comment ID is a unique positive integer identifier for the comment. Username is the
username of the user who posted the comment. Country is the user’s home country, and
Comment Text is the text of the comment posted by the user.
No value will contain a line break or a comma character. The capitalization of country
names could be different for each line even if it is for the same country, but the country will
always be spelled the same.
Space characters will only occur in the comment text or country name.
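Because no value contains a comma, a plain split on the comma character is enough to separate the four values of a line. The sketch below illustrates this under that assumption; the helper names split_comment_line and same_country are hypothetical, and comparing lower-cased country names is only one possible way to cope with the varying capitalization described above.

```python
def split_comment_line(line):
    """Hypothetical helper: split one CSV line into its four values."""
    # Safe only because no value contains a comma or a line break.
    comment_id, username, country, text = line.rstrip("\n").split(",")
    return int(comment_id), username, country, text.strip()

def same_country(a, b):
    """Compare two country names while ignoring capitalization."""
    return a.strip().lower() == b.strip().lower()

sample = "2,PixelPioneer24,BraZil,The excavation scenes were excellent.\n"
comment_id, username, country, text = split_comment_line(sample)
print(comment_id, username, same_country(country, "brazil"))
```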
4. Tasks
In this assignment, you will write two Python files, emotions.py and main.py, that will
attempt to determine the most common emotion expressed in a YouTube video’s
comments. You will create a number of functions (as specified in the Functional
Specification in Section 5) that will perform simple sentiment analysis on the YouTube
comments.
To accomplish this, you will need to do the following:
1. Accept input from the user. The user will specify the file names of the keywords
and comments datasets as well as the name of the report file your program will
create. The user will also input the name of the country they wish to filter the
comments by.
2. Read. Your program will read in the keyword and comments datasets and store
them in the formats described in the functional specification (in Section 5).
3. Clean. The text of the comments will be cleaned to remove any punctuation and
convert them to all lowercase letters.
4. Determine Emotion. You will use the keywords dataset to determine the overall
emotion expressed in each comment.
5. Generate Report. Based on your analysis of each comment, you will create a report
file that contains a summary of the most common emotion expressed as well as
how common each emotion was (as specified in Section 5).
Additionally, you must follow the functional specification presented in Section 5 and the
rules and requirements in Section 8.
5. Functional Specification
5.1 emotions.py
The functions described in this section should be present in your emotions.py file and must
be used in some way in your program to read, clean, process, analyze, or report on the
comments in the given dataset. Each function and its parameters must have the same
name and spelling as specified below:
clean_text(comment)
This function should have one parameter, comment, which is a string that contains the
text of a single comment from the comments dataset. The function should clean this
text by replacing any characters that are not letters (A to Z) with
space characters. It should also convert the comment’s text to all lower case.
This function should return the cleaned text as a string.
Example:
clean_text("This4is-an example. It's a b*t silly.")
will result in this output:
this is an example it s a b t silly
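The replacement rule can be sketched character by character as shown below. This is only an illustration under the rule stated above: the helper name replace_non_letters is hypothetical (it is deliberately not named clean_text), and only the unaccented letters A to Z are treated as letters.

```python
def replace_non_letters(text):
    """Hypothetical helper: keep letters A to Z (lower-cased), turn everything else into a space."""
    cleaned = ""
    for ch in text:
        if "a" <= ch <= "z" or "A" <= ch <= "Z":
            cleaned += ch.lower()
        else:
            # Digits, punctuation, and existing spaces all become a single space each.
            cleaned += " "
    return cleaned

result = replace_non_letters("This4is-an example. It's a b*t silly.")
print(result.split())
```

Because every non-letter becomes a space, the raw result can contain consecutive or trailing spaces (for example where a period is followed by a space); splitting on whitespace yields the same sequence of words as in the example above.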
make_keyword_dict(keyword_file_name)
This function should read the Tab-Separated Values (TSV) keywords file as described in
Section 3.1. keyword_file_name is a string containing the name of the keywords file.
This function can safely assume that this file exists, is in the current working directory,
and is properly formatted. Checks on the file’s existence will be done in the main.py file
described later in this document.
The function should return a dictionary with keys for each word in the file and the values
of this dictionary should be a new dictionary for each keyword that contains a value for
each emotion (anger, joy, fear, trust, sadness, and anticipation).
Example:
Assuming that keywords.tsv contains the following three lines (where → is a tab
character):
abacus→0→0→0→1→0→0
abandon→0→0→1→0→1→0
abandoned→1→0→1→0→1→0
then calling
make_keyword_dict("keywords.tsv")
should result in the following nested dictionary data structure:
{'abacus': {'anger': 0,
'joy': 0,
'fear': 0,
'trust': 1,
'sadness': 0,
'anticipation': 0},
'abandon': {'anger': 0,
'joy': 0,
'fear': 1,
'trust': 0,
'sadness': 1,
'anticipation': 0},
'abandoned': {'anger': 1,
'joy': 0,
'fear': 1,
'trust': 0,
'sadness': 1,
'anticipation': 0}
}
Note that to pass the Gradescope tests this function must return a dictionary and not
another collection such as a list, the keyword keys must be spelled exactly as listed in
keywords.tsv, and the emotions must be spelled correctly and in lower case.
Hint: You may find a number of the Python string methods helpful when creating this
function.
make_comments_list(filter_country, comments_file_name)
This function should read the Comma-Separated Values (CSV) file as described in
Section 3.2. comments_file_name is a string containing the name of the CSV file and
filter_country is a string containing either a country name or the string “all”. This
function should read the CSV file and return a list containing only comments for the
given country listed in filter_country (or all countries if the string “all” is given).
The list should contain one element for each comment in the file that matches the
country in the filter (or all comments if “all” is given). Each element in the list should be a
dictionary that contains a key for the Comment ID, Username, Country and Comment
Text. The keys should be named 'comment_id', 'username', 'country', and 'text'
respectively.
The comment text should be stripped of any leading and trailing whitespace.
Example 1:
Assuming that comments.csv only contains the following two lines (note that the line is
wrapped in this document and in the .csv file this is only two lines):
1,RetroRealm77,united states,I was a bit disappointed with the
film's portrayal of childhood heroism. It felt like the classic
elements were just concealed under layers of unnecessary savagery
and violence.
2,PixelPioneer24,brazil,The excavation scenes in the movie were
excellent but the unnecessary derision of the hero's motives seemed
unfair. His eventuality of success was not adequately showcased.
then calling
make_comments_list("all", "comments.csv")
should result in the following nested list and dictionary data structure:
[ {'comment_id': 1,
'username': 'RetroRealm77',
'country': 'united states',
'text': "I was a bit disappointed with the film's portrayal of
childhood heroism. It felt like the classic elements were just
concealed under layers of unnecessary savagery and violence."},
{'comment_id': 2,
'username': 'PixelPioneer24',
'country': 'brazil',
'text': "The excavation scenes in the movie were excellent but
the unnecessary derision of the hero's motives seemed unfair. His
eventuality of success was not adequately showcased."} ]
Example 2:
Given the same contents of comments.csv as in Example 1, if the following function call
with the country name brazil was made:
make_comments_list("brazil", "comments.csv")
then the only element in the returned list would be:
[ {'comment_id': 2,
'username': 'PixelPioneer24',
'country': 'brazil',
'text': "The excavation scenes in the movie were excellent but
the unnecessary derision of the hero's motives seemed unfair. His
eventuality of success was not adequately showcased."} ]
Example 3:
Given the same contents of comments.csv as in Example 1, if the function was called
with a country name that was not present in the file such as:
make_comments_list("not a real country", "comments.csv")
then the resulting list would be empty:
[]
Note that to pass the Gradescope tests this function must return a list and not another
collection such as a set or dictionary, the values of each list element must be a
dictionary, and the keys used in that dictionary must match the spelling and lowercase
capitalization given in this section.
classify_comment_emotion(comment, keywords)
This function takes the text of a comment and the keywords dictionary created by the
make_keyword_dict function as parameters and classifies the comment as one of the
possible emotions (anger, joy, fear, trust, sadness, and anticipation), returning the
emotion as a string.
A comment is classified by first cleaning the text (using the clean_text function) and
then checking each word in the comment against the keywords dictionary. A total for
each possible emotion should be kept with each word in the comment matching a
keyword adding to the totals (based on the values in the keywords dictionary).
Example:
For the comment:
The excavation scenes in the movie were excellent but the
unnecessary derision of the hero's motives seemed unfair. His
eventuality of success was not adequately showcased.
the text should be first cleaned using clean_text to get:
the excavation scenes in the movie were excellent but the
unnecessary derision of the hero s motives seemed unfair his
eventuality of success was not adequately showcased
then each word should be checked against the keywords dictionary and the totals for
each emotion kept. Words not matching any words in the dictionary do not add to the
scores. For example, using the full keywords.tsv dataset, the words excavation,
excellent, derision, hero, unfair, eventuality, and success have matches in the keyword
dataset and would result in the following totals:
Word         anger  joy  fear  trust  sadness  anticipation
excavation   0      0    0     0      0        1
excellent    0      1    0     1      0        0
derision     1      0    0     0      0        0
hero         0      1    0     1      0        1
unfair       1      0    0     0      1        0
eventuality  0      0    1     0      0        1
success      0      1    0     0      0        1
Total:       2      3    1     2      1        4
Therefore, this comment would be classified as having the emotion of anticipation and
the string “anticipation” should be returned by the function as it has the highest score.
In the event of a tie, the emotions should be given priority in this order: 1) anger, 2) joy, 3)
fear, 4) trust, 5) sadness, and 6) anticipation.
Hint: You may find the string split method useful for looping through words rather than
characters.
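The tie-breaking rule can be illustrated with a short sketch. It assumes the per-emotion totals have already been collected in a dictionary; the helper name pick_top_emotion is hypothetical, and walking the emotions in priority order with a strict comparison is just one way to make the earlier emotion win a tie.

```python
# Emotions listed in the priority order used to break ties.
EMOTIONS = ['anger', 'joy', 'fear', 'trust', 'sadness', 'anticipation']

def pick_top_emotion(totals):
    """Hypothetical helper: return the highest-scoring emotion, earlier emotions winning ties."""
    best = EMOTIONS[0]
    for emotion in EMOTIONS[1:]:
        # Strictly greater, so an emotion earlier in the list keeps its place on a tie.
        if totals[emotion] > totals[best]:
            best = emotion
    return best

# Totals from the worked example above.
totals = {'anger': 2, 'joy': 3, 'fear': 1, 'trust': 2, 'sadness': 1, 'anticipation': 4}
print(pick_top_emotion(totals))

# A three-way tie between joy, fear, and anticipation resolves to joy.
tied = {'anger': 0, 'joy': 2, 'fear': 2, 'trust': 0, 'sadness': 0, 'anticipation': 2}
print(pick_top_emotion(tied))
```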
make_report(comment_list, keywords, report_filename)
This function takes the comment_list (created by the make_comments_list function),
the keywords dictionary (created by the make_keyword_dict function), and a string
containing the file name of the report to generate (report_filename) as parameters.
A new file should be created with the file name in report_filename and it should contain
the name of the most common emotion classification in the comment_list dataset as
well as a count of the number of comments classified as each emotion. In the event of a
tie the emotions should be given priority in this order: 1) anger, 2) joy, 3) fear, 4) trust, 5)
sadness, and 6) anticipation.
The format of the report should match the following example which is based on the
attached comments.csv and keywords.tsv with a country filter of “all”:
Most common emotion: anger
Emotion Totals
anger: 5 (33.33%)
joy: 2 (13.33%)
fear: 1 (6.67%)
trust: 3 (20.0%)
sadness: 3 (20.0%)
anticipation: 1 (6.67%)
The emotion totals should occur in the same order (regardless of the counts) but the
values would be different depending on the comment_list and keywords dictionary
passed to the function.
All percentages should be rounded to two digits and all six emotions should always be
listed even if their count is zero. Important: in your report file each percentage must be
written with one or two decimal places. A value such as 20.000% or 6.6700% would be
wrong even though it is technically rounded as there are too many decimal places. Your
output must be formatted exactly as shown in the example above including the spacing
and line breaks.
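As a hedged illustration of the percentage format, the sketch below builds a single line of the emotion totals. The helper name format_total_line is hypothetical, and it assumes the total number of comments is greater than zero. Rounding with round(value, 2) and letting Python print the resulting float produces one or two decimal places, which is consistent with the values in the example report above.

```python
def format_total_line(emotion, count, total):
    """Hypothetical helper: build one 'emotion: count (percent%)' line."""
    # round(..., 2) returns a float, so 20.0 prints as 20.0 and 33.333... as 33.33.
    percent = round(count * 100 / total, 2)
    return f"{emotion}: {count} ({percent}%)"

print(format_total_line("anger", 5, 15))
print(format_total_line("trust", 3, 15))
print(format_total_line("fear", 1, 15))
```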
Return
The function should return the name of the most common emotion; in this example it
would be “anger”.
Exception
In the event that the comment_list contains no comments (i.e. it is an empty list), the
function should raise a RuntimeError containing the text “No comments in dataset!”.
Reminder: The report should be saved to a file and not output to the screen or returned
by the function. Only the name of the most common emotion should be returned.
5.2 main.py
The program in main.py should ask the user for the file names of the keyword file and
comments file that the data will be read from, as well as the name of the report file that will
be created. It must use the functions defined in the emotions.py file to perform the tasks
described in Section 4 and write the final report.
Your main.py file must contain the following two functions (ask_user_for_input and main)
as specified:
ask_user_for_input()
This function takes no parameters but asks the user to input the file names of the
keywords TSV file, the comments CSV file, the country to filter by, and the file name of the
report to be generated. These three filenames and the country name are returned in a
tuple in this order: 1) keyword filename, 2) comment filename, 3) country name
(converted to lower case), and 4) report filename.
Example (of valid input):
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): Canada
Input the name of the report file (ending in .txt): report.txt
User input is shown in green and input prompts in black. Note that the filenames and
country are based on the user’s input and can not be hardcoded to one set value. This
means that the filenames could be different depending on the values input by the user.
In this case the following tuple would be returned:
('keywords.tsv', 'comments.csv', 'canada', 'report.txt')
Note that the country name was converted to all lowercase.
Exceptions
Your ask_user_for_input() function must complete the following checks on the user input.
If the input does not pass a check, an Exception should be raised causing the function to
exit immediately.
Exceptions should be raised as soon as the invalid input is given. For example, if the
keyword file does not exist, an exception should be raised before asking the user to input
the comments file name.
Check 1: File Extension
For each of the three filenames, if the user inputs a filename ending in the wrong file
extension (.csv, .tsv, or .txt) the function should raise a ValueError exception with a
message stating that the file extension is incorrect such as “Keyword file does not end in
.tsv!”. The text of this message must be exactly the following for each file:
• Keyword File: “Keyword file does not end in .tsv!”
• Comments File: “Comments file does not end in .csv!”
• Report File: “Report file does not end in .txt!”
Check 2: Files Exist
For the keyword and comment files you must check if the file exists using the
os.path.exists function. If it does not, your function must raise an IOError exception with
text explaining that the file does not exist. The message should have the text “<file
name> does not exist!” where <file name> is replaced with the filename such as
“keywords.tsv does not exist!”, where keywords.tsv is the missing file.
For the report file, if the file already exists, an IOError should be raised with text stating
that “<file name> already exists!” where <file name> is the name of the report file. For
example “report.txt already exists!” where the report file is named report.txt. This is to
help prevent accidentally overwriting any files.
Check 3: Valid Country
Lastly you must check that the country input is either “all” or one of the following
countries: 'bangladesh', 'brazil', 'canada', 'china', 'egypt', 'france', 'germany', 'india', 'iran',
'japan', 'mexico', 'nigeria', 'pakistan', 'russia', 'south korea', 'turkey', 'united kingdom', or
'united states'. If any other country or word is input, a ValueError should be raised with
the text “<country> is not a valid country to filter by!” where <country> is the country the
user input.
This subset of countries was chosen as they tend to occur in the datasets we are using
more than others. In a more realistic scenario you would likely want to include all valid
country names in this list, but this assignment limits it to the above-mentioned countries.
Keep in mind that this only limits the countries a user can filter by; it does not limit what
country names can occur in the dataset.
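The first two checks can be sketched for a single input file as follows. This is an illustration only: the helper name check_input_file and its parameters are hypothetical, and running the extension check before the existence check is an assumption based on the order in which the checks are listed above.

```python
import os.path

def check_input_file(filename, extension, label):
    """Hypothetical helper: validate one input filename, raising on the first failed check."""
    # Check 1: the filename must end in the expected extension.
    if not filename.endswith(extension):
        raise ValueError(f"{label} file does not end in {extension}!")
    # Check 2: the file must already exist.
    if not os.path.exists(filename):
        raise IOError(f"{filename} does not exist!")
    return filename

try:
    check_input_file("keywords.txt", ".tsv", "Keyword")
except ValueError as error:
    print(error)
```

Raising inside the helper means the caller stops at the first invalid value, which matches the requirement that an exception is raised as soon as invalid input is given.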
main()
This function handles calling the other functions in main.py and emotions.py to perform
the tasks listed in Section 4. It should check for any exceptions being raised by the
ask_user_for_input function, output the error message contained in the exception (this
can be done by simply printing the exception with print()), and ask the user to input the
values again if any exception is raised.
Once valid input has been received, it should call the functions from emotions.py
required to analyze the comments and generate the report.
Lastly it should output to the screen the most common emotion in the comment dataset.
This should be displayed as “Most common emotion is: <emotion name>” where <emotion
name> is the name of the emotion such as “Most common emotion is: anger” if the
emotion is anger.
If the make_report function raises a RuntimeError exception (e.g. the comment list was
empty), it should output the message contained in that error.
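The "ask again on any exception" behaviour can be sketched as a loop around a call that may raise. The names retry_until_valid and fake_input_step are hypothetical; fake_input_step merely stands in for a function such as ask_user_for_input so that the sketch runs without keyboard input.

```python
def retry_until_valid(get_values):
    """Hypothetical helper: call get_values() until it stops raising an exception."""
    while True:
        try:
            return get_values()
        except Exception as error:
            # Printing the exception outputs the message it carries.
            print(error)

attempts = []

def fake_input_step():
    """Stand-in for an input function that fails on its first two calls."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ValueError("Keyword file does not end in .tsv!")
    return ('keywords.tsv', 'comments.csv', 'canada', 'report.txt')

values = retry_until_valid(fake_input_step)
print(values)
```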
Example 1:
For the values in the attached keywords.tsv and comments.csv files:
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): all
Input the name of the report file (ending in .txt): report.txt
Most common emotion is: anger
User input is shown in green and the contents of the outputted report.txt file is:
Most common emotion: anger
Emotion Totals
anger: 5 (33.33%)
joy: 2 (13.33%)
fear: 1 (6.67%)
trust: 3 (20.0%)
sadness: 3 (20.0%)
anticipation: 1 (6.67%)
Example 2:
For the same values in keywords.tsv and comments.csv but a country of “Canada”:
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): Canada
Input the name of the report file (ending in .txt): report_cad.txt
Most common emotion is: sadness
And the contents of report_cad.txt would be:
Most common emotion: sadness
Emotion Totals
anger: 1 (16.67%)
joy: 0 (0.0%)
fear: 0 (0.0%)
trust: 2 (33.33%)
sadness: 3 (50.0%)
anticipation: 0 (0.0%)
Example 3:
In this example invalid inputs are given, and the user is asked to input them again.
Input keyword file (ending in .tsv): not_a_real_file.tsv
Error: not_a_real_file.tsv does not exist!
Input keyword file (ending in .tsv): real_file_wrong_extension.txt
Error: Keyword file does not end in .tsv!
Input keyword file (ending in .tsv): keys.tsv
Input comment file (ending in .csv): not_a_real_file.csv
Error: not_a_real_file.csv does not exist!
Input keyword file (ending in .tsv): keys.tsv
Input comment file (ending in .csv): bad_file_extension.tsv
Error: Comment file does not end in .csv!
Input keyword file (ending in .tsv): keys.tsv
Input comment file (ending in .csv): c.csv
Input a country to analyze (or "all" for all countries): Duck
Error: duck is not a valid country to filter by!
Input keyword file (ending in .tsv): keys.tsv
Input comment file (ending in .csv): c.csv
Input a country to analyze (or "all" for all countries): Belgium
Error: belgium is not a valid country to filter by!
Input keyword file (ending in .tsv): keys.tsv
Input comment file (ending in .csv): c.csv
Input a country to analyze (or "all" for all countries): FrAnCe
Input the name of the report file (ending in .txt): report.txt
Error: report.txt exists, the report file can not already exist!
Input keyword file (ending in .tsv): keys.tsv
Input comment file (ending in .csv): c.csv
Input a country to analyze (or "all" for all countries): FrAnCe
Input the name of the report file (ending in .txt): report_france.txt
Error: No comments in dataset!
Note that the above is one run of the program. It should keep asking for input again if an
exception occurs in the ask_user_for_input function. Also note that in this example,
keys.tsv and c.csv are valid files that exist and the file report.txt already exists. “Belgium”
is not in the list of valid countries so it is rejected and “FrAnCe” is accepted despite its
odd capitalization as the ask_user_for_input function should convert it to lowercase.
In this example, the c.csv file contained no comments for France, so the exception “No
comments in dataset!” was raised by the make_report function.
6. Templates
This section gives some starter code you should use in your program. You may not alter the
names of any function or the parameters the functions take (this includes adding or
removing parameters). You may not import any libraries or modules not included in the
template code and all code you add should be inside a function (adding code outside of a
function may cause the Gradescope tests to fail). You may add additional helper functions
as needed.
emotions.py
# add a comment here with your name, email, and student number
# you can not add any import lines to this file
EMOTIONS = ['anger', 'joy', 'fear', 'trust', 'sadness', 'anticipation']

def clean_text(comment):
    # add your code here and remove the pass keyword on the next line
    pass

def make_keyword_dict(keyword_file_name):
    # add your code here and remove the pass keyword on the next line
    pass

def classify_comment_emotion(comment, keywords):
    # add your code here and remove the pass keyword on the next line
    pass

def make_comments_list(filter_country, comments_file_name):
    # add your code here and remove the pass keyword on the next line
    pass

def make_report(comment_list, keywords, report_filename):
    # add your code here and remove the pass keyword on the next line
    pass
main.py
# add a comment here with your name, email, and student number.
# do not add any additional import lines to this file.
import os.path
from emotions import *

VALID_COUNTRIES = ['bangladesh', 'brazil', 'canada', 'china', 'egypt',
                   'france', 'germany', 'india', 'iran', 'japan', 'mexico',
                   'nigeria', 'pakistan', 'russia', 'south korea', 'turkey',
                   'united kingdom', 'united states']

def ask_user_for_input():
    # add your code here and remove the pass keyword on the next line
    pass

def main():
    # add your code here and remove the pass keyword on the next line
    pass

if __name__ == "__main__":
    main()
Note About Imports
It is important to import the files in the correct order and from the correct files. Main.py
should import emotions.py as shown in the template above and not the other way around.
7. Extra Example
The files keywords.tsv and comments.csv should be attached to this assignment on
OWL. The result of running them with the following countries is given below:
Example 1: Country of “All”
Input/Output:
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): all
Input the name of the report file (ending in .txt): my_report.txt
Most common emotion is: anger
Contents of my_report.txt:
Most common emotion: anger
Emotion Totals
anger: 5 (33.33%)
joy: 2 (13.33%)
fear: 1 (6.67%)
trust: 3 (20.0%)
sadness: 3 (20.0%)
anticipation: 1 (6.67%)
Example 2: Country of “brazil”
Input/Output:
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): brazil
Input the name of the report file (ending in .txt): report_brazil.txt
Most common emotion is: fear
Contents of report_brazil.txt:
Most common emotion: fear
Emotion Totals
anger: 0 (0.0%)
joy: 0 (0.0%)
fear: 1 (50.0%)
trust: 0 (0.0%)
sadness: 0 (0.0%)
anticipation: 1 (50.0%)
Example 3: Country of “germany” (there are no comments for this country in the data set)
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): germany
Input the name of the report file (ending in .txt): report.txt
Error: No comments in dataset!
Example 4: Invalid Inputs (these files do not exist or have the wrong extension)
Input keyword file (ending in .tsv): badfile.pizza
Error: Keyword file does not end in .tsv!
Input keyword file (ending in .tsv): this_file_does_not_exist.tsv
Error: this_file_does_not_exist.tsv does not exist!
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): badcsvfile.duck
Error: Comment file does not end in .csv!
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): not_a_real_csv_file.csv
Error: not_a_real_csv_file.csv does not exist!
Input keywor
Assignment 3: YouTube Emotions
Important Notes:
• Read the whole assignment document before you begin coding. This is a more
complex speciffcation than in past assignments and the examples and templates
near the end of this document will be important in solving this assignment.
• Assignments are to be completed individually. Use of tools to generate code,
working with another person, or copying from online resources are not allowed and
will result in a zero on this assignment regardless of how much was copied.
• A code template is given in Section 6 (on page 17) for your main.py and
emotions.py ffles. We highly recommend using these as a starting point for your
assignment. The code is also attached to the assignment on OWL.
Change Log:
• Nov. 4
th
: The comments.csv ffle attached to Brightspace had an unexpected
unicode character in one of the comments the changed the outcome of some of the
examples given in this document. comments.csv has now been corrected and the
examples in this document to match.
• Nov. 13
th
: A type-o was found in the example for make_report() in section 5. This has
now been corrected. The output shown at the end of the document in section 7 was
still correct. This change has no impact on the autograder (it was marking correctly).
1. Learning Outcomes
By completing this assignment, you will gain skills relating to
• Functions
• Dictionaries and lists
• Complex data structures
• Text processing
• Working with TSV and CSV ffles
• File input and output
• Exceptions in Python
• Simple module use
• Writing code that adheres to a given speciffcation
• Working with real world problem
2. Background
With the emergence of social media sites such as YouTube, Facebook, Reddit, Twitter (also
known as X), LinkedIn, and WhatsApp, more and more data is being produced and made
accessible online in a textual format. This textual data, such as YouTube comments,
Tweets, or Facebook posts, can be hard to process but is incredibly important for
organizations as it offers a current snapshot of the public’s emotions (affinity) or sentiment
about a topic at a current point in time. Having a live view of your customer’s current affinity
towards your products or the public’s view of your political campaign can be critical for
success.
Much work has been done towards the goal of creating large datasets of word affinity or
sentiment. One such effort is the National Research Council (NRC) Emotion Lexicon which
is a list of English words and their associations with eight basic emotions (anger, fear,
anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and
positive).
Our goal in this assignment is to use a simplified version of the NRC Emotion Lexicon to
classify YouTube comments based on one of the following emotions: anger, joy, fear, trust,
sadness, or anticipation. Based on the emotion contained in each comment for a particular
video we then want to generate a report that details the most common emotions YouTube
users have towards that video based on their comments.
3. Datasets
Your Python program will deal with two datasets, a keywords dataset that contains a
simplified version of the NRC Emotion Lexicon (this dataset will remain the same for all
tests of your program) and a Comma-Separated Values (CSV) file that contains the
comments for a particular YouTube video (this dataset will change for each test of your
program).
3.1 Keywords Dataset (TSV File)
The keywords.tsv file attached to this assignment contains a simplified version of the NRC
Emotion Lexicon. This is a Tab-Separated Values (TSV) file in which each line of the file
contains a single word and its emotional classification based on six emotions (anger, joy,
fear, trust, sadness, and anticipation). Each word in the file may be classified as having one
or more emotions.
The following is an example of the first 10 lines of this file where tab (\t) characters are
represented by arrows (→):
abacus→0→0→0→1→0→0
abandon→0→0→1→0→1→0
abandoned→1→0→1→0→1→0
abandonment→1→0→1→0→1→0
abbot→0→0→0→1→0→0
abduction→0→0→1→0→1→0
abhor→1→0→1→0→0→0
abhorrent→1→0→1→0→0→0
abolish→1→0→0→0→0→0
abominable→0→0→1→0→0→0
Each line starts with a keyword and is followed by a score (0 or 1) for each emotion in this
order: anger, joy, fear, trust, sadness, and anticipation. If a 1 is present it means that
keyword is related to that emotion. If a 0 is present the keyword is unrelated to that
emotion.
For example, according to the above the word “abacus” is related to the emotion of trust
and no other emotions. The word “abandon” is related to the emotions fear and sadness
and no other emotions.
All words in the dataset will be related to at least one emotion. This file’s contents will
remain the same for all tests but may be given a different filename based on the user’s input
(e.g. it could be named keys.tsv or words.tsv rather than keywords.tsv).
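As a rough illustration of the line format described above, the following sketch turns one line of the keywords file into a word and its six scores. The helper name parse_keyword_line is hypothetical and is not part of the required specification; it also assumes the line is well formed.

```python
# Emotion order used on every line of the keywords file (Section 3.1).
EMOTIONS = ['anger', 'joy', 'fear', 'trust', 'sadness', 'anticipation']


def parse_keyword_line(line):
    """Split one tab-separated line into (word, {emotion: score}).

    Hypothetical helper for illustration only.
    """
    fields = line.strip().split('\t')
    word = fields[0]
    scores = {}
    # Pair each score with its emotion name, relying on the fixed column order.
    for emotion, value in zip(EMOTIONS, fields[1:]):
        scores[emotion] = int(value)
    return word, scores


word, scores = parse_keyword_line("abandon\t0\t0\t1\t0\t1\t0\n")
print(word, scores)
```

Converting the scores to integers here means later code can add them up directly without repeated conversions.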
3.2 Comments Dataset (CSV File)
The user will provide a Comma-Separated Values (CSV) file that contains a set of YouTube
comments for a particular video. The name of this file will change based on the user’s input
but will always end in .csv and have the same format.
The following is an example of a possible line from this file (the file may contain one or
more lines). Note that this document wraps the line on to multiple lines but in the file this is
one line ended by a line break (\n):
2,PixelPioneer24,brazil,The excavation scenes in the movie were
excellent but the unnecessary derision of the hero's motives seemed
unfair. His eventuality of success was not adequately showcased.
Each line of this file will contain four values separated by a single comma character (,). The
values will always be in the following order:
Comment ID, Username, Country, Comment Text
Comment ID is a unique positive integer identifier for the comment. Username is the
username of the user who posted the comment. Country is the user’s home country, and
comment text is the text of the comment posted by the user.
No value will contain a line break or a comma character. The capitalization of country
names could be different for each line even if it is for the same country, but the country will
always be spelled the same.
Space characters will only occur in the comment text or country name.
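Because no value may contain a comma or a line break, a single line can be taken apart with an ordinary string split. The sketch below shows one possible way to do this; parse_comment_line is a hypothetical helper name, and limiting the split to three cuts is a defensive assumption rather than something the format requires.

```python
def parse_comment_line(line):
    """Split one CSV line into (comment_id, username, country, text).

    Hypothetical helper for illustration only. The split is limited to
    three cuts so everything after the third comma stays in the text.
    """
    comment_id, username, country, text = line.rstrip('\n').split(',', 3)
    return int(comment_id), username, country, text.strip()


print(parse_comment_line("2,PixelPioneer24,brazil,The excavation scenes were excellent.\n"))
```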
4. Tasks
In this assignment, you will write two Python files, emotions.py and main.py, that will
attempt to determine the most common emotion expressed in a YouTube video’s
comments. You will create a number of functions (as specified in the Functional
Specification in Section 5) that will perform simple sentiment analysis on the YouTube
comments.
To accomplish this, you will need to do the following:
1. Accept input from the user: The user will specify the file names of the keywords
and comments datasets as well as the name of the report file your program will
create. The user will also input the name of the country they wish to filter the
comments by.
2. Read. Your program will read in the keyword and comments datasets and store
them in the formats described in the functional specification (in Section 5).
3. Clean. The text of the comments will be cleaned to remove any punctuation and
convert them to all lowercase letters.
4. Determine Emotion. You will use the keywords dataset to determine the overall
emotion expressed in each comment.
5. Generate Report. Based on your analysis of each comment, you will create a report
file that contains a summary of the most common emotion expressed as well as
how common each emotion was (as specified in Section 5).
Additionally, you must follow the functional specification presented in Section 5 and the
rules and requirements in Section 8.
5. Functional Specification
5.1 emotions.py
The functions described in this section should be present in your emotions.py file and must
be used in some way in your program to read, clean, process, analyze, or report on the
comments in the given dataset. Each function and its parameters must have the same
name and spelling as specified below:
clean_text(comment)
This function should have one parameter, comment, which is a string that contains the
text of a single comment from the comments dataset. The function should clean this
text by replacing any characters that are not letters (A to Z) with
space characters. It should also convert the comment’s text to all lower case.
This function should return the cleaned text as a string.
Example:
clean_text("This4is-an example. It's a b*t silly.")
will result in this output:
this is an example it s a b t silly
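One possible way to implement this behaviour is sketched below. It assumes that "letters (A to Z)" means only the 26 unaccented English letters, so accented or other Unicode letters are also replaced with spaces; that reading is an assumption, not something the specification states outright.

```python
def clean_text(comment):
    """Replace every non-letter with a space and lower-case the result."""
    cleaned = ''
    for ch in comment:
        # Keep only the 26 English letters (assumed meaning of "A to Z");
        # every other character becomes exactly one space.
        if 'a' <= ch <= 'z' or 'A' <= ch <= 'Z':
            cleaned += ch.lower()
        else:
            cleaned += ' '
    return cleaned


print(clean_text("This4is-an example. It's a b*t silly."))
```

Each replaced character becomes its own space in this sketch, so the result has the same length as the input and may contain runs of spaces. Calling split() with no arguments on the result treats any run of spaces as one separator.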
make_keyword_dict(keyword_file_name)
This function should read the Tab-Separated Values (TSV) keywords file as described in
Section 3.1. keyword_file_name is a string containing the name of the keywords file.
This function can safely assume that this file exists, is in the current working directory,
and is properly formatted. Checks on the file’s existence will be done in the main.py file
described later in this document.
The function should return a dictionary with keys for each word in the file and the values
of this dictionary should be a new dictionary for each keyword that contains a value for
each emotion (anger, joy, fear, trust, sadness, and anticipation).
Example:
Assuming that keywords.tsv contains the following three lines (where → is a tab
character):
abacus→0→0→0→1→0→0
abandon→0→0→1→0→1→0
abandoned→1→0→1→0→1→0
then calling
make_keyword_dict("keywords.tsv")
should result in the following nested dictionary data structure:
{'abacus': {'anger': 0,
            'joy': 0,
            'fear': 0,
            'trust': 1,
            'sadness': 0,
            'anticipation': 0},
 'abandon': {'anger': 0,
             'joy': 0,
             'fear': 1,
             'trust': 0,
             'sadness': 1,
             'anticipation': 0},
 'abandoned': {'anger': 1,
               'joy': 0,
               'fear': 1,
               'trust': 0,
               'sadness': 1,
               'anticipation': 0}
}
Note that to pass the Gradescope tests this function must return a dictionary and not
another collection such as a list, the keyword keys must be spelled exactly as listed in
keywords.tsv, and the emotions must be spelled correctly and in lower case.
Hint: You may find a number of the Python string methods helpful when creating this
function.
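A minimal sketch of how such a function might be written is given below. It is one possible approach under the stated file format, not a reference solution; skipping lines that do not have exactly seven fields is a defensive assumption added here.

```python
EMOTIONS = ['anger', 'joy', 'fear', 'trust', 'sadness', 'anticipation']


def make_keyword_dict(keyword_file_name):
    """Read the TSV keywords file into {word: {emotion: 0 or 1}}."""
    keywords = {}
    with open(keyword_file_name, 'r') as keyword_file:
        for line in keyword_file:
            fields = line.strip().split('\t')
            if len(fields) != len(EMOTIONS) + 1:
                continue  # assumption: ignore blank or malformed lines
            word = fields[0]
            keywords[word] = {}
            # Column i + 1 holds the score for EMOTIONS[i].
            for index, emotion in enumerate(EMOTIONS):
                keywords[word][emotion] = int(fields[index + 1])
    return keywords
```

Using the with statement closes the file automatically, even if an error occurs while reading.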
make_comments_list(filter_country, comments_file_name)
This function should read the Comma-Separated Values (CSV) file as described in
Section 3.2. comments_file_name is a string containing the name of the CSV file and
filter_country is a string containing either a country name or the string “all”. This
function should read the CSV file and return a list containing only comments for the
given country listed in filter_country (or all countries if the string “all” is given).
The list should contain one element for each comment in the file that matches the
country in the filter (or all comments if “all” is given). Each element in the list should be a
dictionary that contains a key for the Comment ID, Username, Country and Comment
Text. The keys should be named 'comment_id', 'username', 'country', and 'text'
respectively.
The comment text should be stripped of any leading and trailing whitespace.
Example 1:
Assuming that comments.csv only contains the following two lines (note that the line is
wrapped in this document and in the .csv file this is only two lines):
1,RetroRealm77,united states,I was a bit disappointed with the
film's portrayal of childhood heroism. It felt like the classic
elements were just concealed under layers of unnecessary savagery
and violence.
2,PixelPioneer24,brazil,The excavation scenes in the movie were
excellent but the unnecessary derision of the hero's motives seemed
unfair. His eventuality of success was not adequately showcased.
then calling
make_comments_list("all", "comments.csv")
should result in the following nested list and dictionary data structure:
[ {'comment_id': 1,
'username': 'RetroRealm77',
'country': 'united states',
'text': 'I was a bit disappointed with the film's portrayal of
childhood heroism. It felt like the classic elements were just
concealed under layers of unnecessary savagery and violence.'},
{'comment_id': 2,
'username': 'PixelPioneer24',
'country': 'brazil',
'text': 'The excavation scenes in the movie were excellent but
the unnecessary derision of the hero's motives seemed unfair. His
eventuality of success was not adequately showcased.'} ]
Example 2:
Given the same contents of comments.csv as in Example 1, if the following function call
with the country name brazil was made:
make_comments_list("brazil", "comments.csv")
then the only element in the returned list would be:
[ {'comment_id': 2,
'username': 'PixelPioneer24',
'country': 'brazil',
'text': 'The excavation scenes in the movie were excellent but
the unnecessary derision of the hero's motives seemed unfair. His
eventuality of success was not adequately showcased.'} ]
Example 3:
Given the same contents of comments.csv as in Example 1, if the function was called
with a country name that was not present in the file such as:
make_comments_list("not a real country", "comments.csv")
then the resulting list would be empty:
[]
Note that to pass the Gradescope tests this function must return a list and not another
collection such as a set or dictionary, the values of each list element must be a
dictionary, and the keys used in that dictionary must match the spelling and lowercase
capitalization given in this section.
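The filtering behaviour shown in the three examples could be sketched roughly as follows. The sketch compares countries in lower case because capitalization may differ between lines, and it stores the country exactly as it appears in the file; whether the stored value should be lower-cased is not stated, so that choice is an assumption.

```python
def make_comments_list(filter_country, comments_file_name):
    """Return a list of comment dictionaries matching filter_country."""
    comments = []
    wanted = filter_country.strip().lower()
    with open(comments_file_name, 'r') as comments_file:
        for line in comments_file:
            line = line.rstrip('\n')
            if line == '':
                continue  # assumption: ignore empty lines
            comment_id, username, country, text = line.split(',', 3)
            # Compare in lower case so "Brazil" and "brazil" both match.
            if wanted == 'all' or country.strip().lower() == wanted:
                comments.append({
                    'comment_id': int(comment_id),
                    'username': username,
                    'country': country,  # stored as written in the file
                    'text': text.strip(),
                })
    return comments
```

If no line matches the filter, the list is simply never appended to, which gives the empty list shown in Example 3.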
classify_comment_emotion(comment, keywords)
This function takes the text of a comment and the keywords dictionary created by the
make_keyword_dict function as parameters and classifies the comment as one of the
possible emotions (anger, joy, fear, trust, sadness, and anticipation), returning the
emotion as a string.
A comment is classified by first cleaning the text (using the clean_text function) and
then checking each word in the comment against the keywords dictionary. A total for
each possible emotion should be kept with each word in the comment matching a
keyword adding to the totals (based on the values in the keywords dictionary).
Example:
For the comment:
The excavation scenes in the movie were excellent but the
unnecessary derision of the hero's motives seemed unfair. His
eventuality of success was not adequately showcased.
the text should be first cleaned using clean_text to get:
the excavation scenes in the movie were excellent but the
unnecessary derision of the hero s motives seemed unfair his
eventuality of success was not adequately showcased
then each word should be checked against the keywords dictionary and the totals for
each emotion kept. Words not matching any words in the dictionary (shown in black
above) do not add to the scores. For example, using the full keywords.tsv dataset the
words shown in blue above have matches in the keyword dataset and would result in
the following totals:
Word          anger  joy  fear  trust  sadness  anticipation
excavation    0      0    0     0      0        1
excellent     0      1    0     1      0        0
derision      1      0    0     0      0        0
hero          0      1    0     1      0        1
unfair        1      0    0     0      1        0
eventuality   0      0    1     0      0        1
success       0      1    0     0      0        1
Total:        2      3    1     2      1        4
Therefore, this comment would be classified as having the emotion of anticipation and
the string “anticipation” should be returned by the function as it has the highest score.
In the event of a tie, the emotions should be given priority in this order: 1) anger, 2) joy, 3)
fear, 4) trust, 5) sadness, and 6) anticipation.
Hint: You may find the string split method useful for looping through words rather than
characters.
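The scoring and tie-breaking rules above can be sketched as follows. This is an illustrative approach rather than a reference solution. The block repeats a simple clean_text so that it runs on its own, and it resolves ties by scanning the emotions in priority order and only switching on a strictly larger total.

```python
EMOTIONS = ['anger', 'joy', 'fear', 'trust', 'sadness', 'anticipation']


def clean_text(comment):
    """Replace non-letters with spaces and lower-case (assumes A to Z only)."""
    cleaned = ''
    for ch in comment:
        if 'a' <= ch <= 'z' or 'A' <= ch <= 'Z':
            cleaned += ch.lower()
        else:
            cleaned += ' '
    return cleaned


def classify_comment_emotion(comment, keywords):
    """Return the emotion with the highest total for this comment."""
    totals = {}
    for emotion in EMOTIONS:
        totals[emotion] = 0
    # split() with no arguments skips runs of spaces left by cleaning.
    for word in clean_text(comment).split():
        if word in keywords:
            for emotion in EMOTIONS:
                totals[emotion] += keywords[word][emotion]
    best = EMOTIONS[0]
    for emotion in EMOTIONS[1:]:
        # Strictly greater, so an earlier emotion wins any tie.
        if totals[emotion] > totals[best]:
            best = emotion
    return best
```

One consequence of this tie rule is that a comment with no matching keywords has all totals tied at zero, so this sketch returns "anger" for it.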
make_report(comment_list, keywords, report_filename)
This function takes the comment_list (created by the make_comments_list function),
the keywords dictionary (created by the make_keyword_dict function), and a string
containing the file name of the report to generate (report_filename) as parameters.
A new file should be created with the file name in report_filename and it should contain
the name of the most common emotion classification in the comment_list dataset as
well as a count of the number of comments classified as each emotion. In the event of a
tie the emotions should be given priority in this order: 1) anger, 2) joy, 3) fear, 4) trust, 5)
sadness, and 6) anticipation.
The format of the report should match the following example which is based on the
attached comments.csv and keywords.tsv with a country filter of “all”:
Most common emotion: anger
Emotion Totals
anger: 5 (33.33%)
joy: 2 (13.33%)
fear: 1 (6.67%)
trust: 3 (20.0%)
sadness: 3 (20.0%)
anticipation: 1 (6.67%)
The emotion totals should occur in the same order (regardless of the counts) but the
values would be different depending on the comment_list and keywords dictionary
passed to the function.
All percentages should be rounded to two digits and all six emotions should always be
listed even if their count is zero. Important: in your report file each percentage must be
written with one or two decimal places. A value such as 20.000% or 6.6700% would be
wrong even though it is technically rounded as there are too many decimal places. Your
output must be formatted exactly as shown in the example above including the spacing
and line breaks.
Return
The function should return the name of the most common emotion; in this example it
would be “anger”.
Exception
In the event that the comment_list contains no comments (i.e. it is an empty list), the
function should raise a RuntimeError containing the text “No comments in dataset!”.
Reminder: The report should be saved to a file and not output to the screen or returned
by the function. Only the name of the most common emotion should be returned.
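The counting, rounding, and tie-breaking rules for the report can be illustrated with a helper that builds the report text from per-emotion counts. The name build_report_text is hypothetical and not part of the specification. The exact spacing and blank lines of the required format cannot be recovered from the example as shown here, so the sketch assumes one item per line with no indentation.

```python
EMOTIONS = ['anger', 'joy', 'fear', 'trust', 'sadness', 'anticipation']


def build_report_text(counts):
    """Return (most_common_emotion, report_text) for a dict of counts.

    Hypothetical helper: make_report would first classify each comment
    to obtain the counts, then write the returned text to the report file.
    """
    total = 0
    for emotion in EMOTIONS:
        total += counts[emotion]
    if total == 0:
        raise RuntimeError("No comments in dataset!")
    most_common = EMOTIONS[0]
    for emotion in EMOTIONS[1:]:
        # Strictly greater, so earlier emotions win ties.
        if counts[emotion] > counts[most_common]:
            most_common = emotion
    lines = ["Most common emotion: " + most_common, "Emotion Totals"]
    for emotion in EMOTIONS:
        # round(..., 2) yields values such as 33.33, 6.67, or 20.0.
        percent = round(counts[emotion] * 100 / total, 2)
        lines.append(emotion + ": " + str(counts[emotion]) + " (" + str(percent) + "%)")
    return most_common, "\n".join(lines) + "\n"
```

Converting the rounded float with str() gives one or two decimal places, which matches values such as 20.0% and 6.67% in the example report.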
5.2 main.py
The program in main.py should ask the user for the file names of the keyword file and
comments file that the data will be read from, as well as the name of the report file that will
be created. It must use the functions defined in the emotions.py file to perform the tasks
described in Section 4 and write the final report.
Your main.py file must contain the following two functions (ask_user_for_input and main)
as specified:
ask_user_for_input()
This function takes no parameters but asks the user to input the file names of the
keywords TSV file, the comments CSV file, the country to filter by, and the file name of the
report to be generated. These three filenames and the country name are returned in a
tuple in this order: 1) keyword filename, 2) comment filename, 3) country name
(converted to lower case), and 4) report filename.
Example (of valid input):
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): Canada
Input the name of the report file (ending in .txt): report.txt
User input is shown in green and input prompts in black. Note that the filenames and
country are based on the user’s input and can not be hardcoded to one set value. This
means that the filenames could be different depending on the values input by the user.
In this case the following tuple would be returned:
('keywords.tsv', 'comments.csv', 'canada', 'report.txt')
Note that the country name was converted to all lowercase.
Exceptions
Your ask_user_for_input() function must complete the following checks on the user input.
If the input does not pass a check, an Exception should be raised causing the function to
exit immediately.
Exceptions should be raised as soon as the invalid input is given. For example, if the
keyword file does not exist, an exception should be raised before asking the user to input
the comments file name.
Check 1: File Extension
For each of the three filenames, if the user inputs a filename ending in the wrong file
extension (.csv, .tsv, or .txt) the function should raise a ValueError exception with a
message stating that the file extension is incorrect such as “Keyword file does not end in
.tsv!”. The text of this message must be exactly the following for each file:
• Keyword File: “Keyword file does not end in .tsv!”
• Comments File: “Comments file does not end in .csv!”
• Report File: “Report file does not end in .txt!”
Check 2: Files Exist
For the keyword and comment files you must check if the file exists using the
os.path.exists function. If it does not, your function must raise an IOError exception with
text explaining that the file does not exist. The message should consist of the name of the
missing file followed by “does not exist!”, for example “keywords.tsv does not exist!”,
where keywords.tsv is the missing file.
For the report file, if the file already exists, an IOError should be raised with text stating
that the file already exists, for example “report.txt already exists!” where the report file is
named report.txt. This is to help prevent accidentally overwriting any files.
Check 3: Valid Country
Lastly you must check that the country input is either “all” or one of the following
countries: 'bangladesh', 'brazil', 'canada', 'china', 'egypt', 'france', 'germany', 'india', 'iran',
'japan', 'mexico', 'nigeria', 'pakistan', 'russia', 'south korea', 'turkey', 'united kingdom', or
'united states'. If any other country or word is input, a ValueError should be raised with
text stating that the input is not a valid country to filter by, for example “duck is not a
valid country to filter by!” where duck is the lowercased user input.
This subset of countries was chosen as they tend to occur in the datasets we are using
more than others. In a more realistic scenario you would likely want to include all valid
country names in this list, but this assignment limits the list to the above-mentioned countries.
Keep in mind that this only limits the countries a user can filter by; it does not limit what
country names can occur in the dataset.
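Two of the checks above are sketched below for the keyword file and the country. The helper names check_keyword_file and check_country are hypothetical; in the assignment this logic would sit inside ask_user_for_input. The sketch also assumes the extension check runs before the existence check, which matches the order in which the checks are listed.

```python
import os.path

VALID_COUNTRIES = ['bangladesh', 'brazil', 'canada', 'china', 'egypt',
                   'france', 'germany', 'india', 'iran', 'japan', 'mexico',
                   'nigeria', 'pakistan', 'russia', 'south korea', 'turkey',
                   'united kingdom', 'united states']


def check_keyword_file(filename):
    """Raise if the keyword filename has the wrong extension or is missing."""
    if not filename.endswith('.tsv'):
        raise ValueError("Keyword file does not end in .tsv!")
    if not os.path.exists(filename):
        raise IOError(filename + " does not exist!")
    return filename


def check_country(country):
    """Return the lower-cased country, or raise if it is not allowed."""
    country = country.lower()
    if country != 'all' and country not in VALID_COUNTRIES:
        raise ValueError(country + " is not a valid country to filter by!")
    return country
```

Raising inside each check means the caller stops at the first invalid value, before any later prompt is shown.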
main()
This function handles calling the other functions in main.py and emotions.py to perform
the tasks listed in Section 4. It should check for any exceptions being raised by the
ask_user_for_input function, output the error message contained in the exception (this
can be done by simply printing the exception with print()), and ask the user to input the
values again if any exception is raised.
Once valid input has been received, it should call the functions from emotions.py
required to analyze the comments and generate the report.
Lastly it should output to the screen the most common emotion in the comment data set.
This should be displayed as “Most common emotion is: ” followed by the name of the
emotion, such as “Most common emotion is: anger” if the emotion is anger.
If the make_report function raises a RuntimeError exception (e.g. the comment list was
empty), it should output the message contained in that error.
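The "ask again until the input is valid" behaviour described for main can be sketched as a small retry loop. The helper ask_until_valid and its ask parameter are hypothetical and exist only so the loop can be shown in isolation; in the assignment, main would call ask_user_for_input directly. The "Error: " prefix follows the sample runs in this document.

```python
def ask_until_valid(ask):
    """Call ask() repeatedly until it returns without raising.

    ask is any function that returns the input tuple or raises
    ValueError / IOError, as ask_user_for_input is specified to do.
    """
    while True:
        try:
            return ask()
        except (ValueError, IOError) as error:
            # Printing the exception shows the message it was raised with.
            print("Error:", error)
```

Returning from inside the try block ends the loop as soon as one call succeeds, so the prompts start over from the first question after every failure.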
Example 1:
For the values in the attached keywords.tsv and comments.csv files:
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): all
Input the name of the report file (ending in .txt): report.txt
Most common emotion is: anger
User input is shown in green and the contents of the outputted report.txt file is:
Most common emotion: anger
Emotion Totals
anger: 5 (33.33%)
joy: 2 (13.33%)
fear: 1 (6.67%)
trust: 3 (20.0%)
sadness: 3 (20.0%)
anticipation: 1 (6.67%)
Example 2:
For the same values in keywords.tsv and comments.csv but a country of “Canada”:
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): Canada
Input the name of the report file (ending in .txt): report_cad.txt
Most common emotion is: sadness
And the contents of report_cad.txt would be:
Most common emotion: sadness
Emotion Totals
anger: 1 (16.67%)
joy: 0 (0.0%)
fear: 0 (0.0%)
trust: 2 (33.33%)
sadness: 3 (50.0%)
anticipation: 0 (0.0%)
Example 3:
In this example invalid inputs are given, and the user is asked to input them again.
Input keyword file (ending in .tsv): not_a_real_file.tsv
Error: not_a_real_file.tsv does not exist!
Input keyword file (ending in .tsv): real_file_wrong_extension.txt
Error: Keyword file does not end in .tsv!
Input keyword file (ending in .tsv): keys.tsv
Input comment file (ending in .csv): not_a_real_file.csv
Error: not_a_real_file.csv does not exist!
Input keyword file (ending in .tsv): keys.tsv
Input comment file (ending in .csv): bad_file_extension.tsv
Error: Comment file does not end in .csv!
Input keyword file (ending in .tsv): keys.tsv
Input comment file (ending in .csv): c.csv
Input a country to analyze (or "all" for all countries): Duck
Error: duck is not a valid country to filter by!
Input keyword file (ending in .tsv): keys.tsv
Input comment file (ending in .csv): c.csv
Input a country to analyze (or "all" for all countries): Belgium
Error: belgium is not a valid country to filter by!
Input keyword file (ending in .tsv): keys.tsv
Input comment file (ending in .csv): c.csv
Input a country to analyze (or "all" for all countries): FrAnCe
Input the name of the report file (ending in .txt): report.txt
Error: report.txt exists, the report file can not already exist!
Input keyword file (ending in .tsv): keys.tsv
Input comment file (ending in .csv): c.csv
Input a country to analyze (or "all" for all countries): FrAnCe
Input the name of the report file (ending in .txt): report_france.txt
Error: No comments in dataset!
Note that the above is one run of the program. It should keep asking for input again if an
exception occurs in the ask_user_for_input function. Also note that in this example,
keys.tsv and c.csv are valid files that exist and the file report.txt already exists. “Belgium”
is not in the list of valid countries so it is rejected and “FrAnCe” is accepted despite its
odd capitalization as the ask_user_for_input function should convert it to lowercase.
In this example, the c.csv file contained no comments for France, so the exception “No
comments in dataset!” was raised by the make_report function.
6. Templates
This section gives some starter code you should use in your program. You may not alter the
names of any function or the parameters the functions take (this includes adding or
removing parameters). You may not import any libraries or modules not included in the
template code and all code you add should be inside a function (adding code outside of a
function may cause the Gradescope tests to fail). You may add additional helper functions
as needed.
emotions.py
# add a comment here with your name, email, and student number
# you can not add any import lines to this file
EMOTIONS = ['anger', 'joy', 'fear', 'trust', 'sadness', 'anticipation']
def clean_text(comment):
    # add your code here and remove the pass keyword on the next line
    pass
def make_keyword_dict(keyword_file_name):
    # add your code here and remove the pass keyword on the next line
    pass
def classify_comment_emotion(comment, keywords):
    # add your code here and remove the pass keyword on the next line
    pass
def make_comments_list(filter_country, comments_file_name):
    # add your code here and remove the pass keyword on the next line
    pass
def make_report(comment_list, keywords, report_filename):
    # add your code here and remove the pass keyword on the next line
    pass
main.py
# add a comment here with your name, email, and student number.
# do not add any additional import lines to this file.
import os.path
from emotions import *
VALID_COUNTRIES = ['bangladesh', 'brazil', 'canada', 'china', 'egypt',
                   'france', 'germany', 'india', 'iran', 'japan', 'mexico',
                   'nigeria', 'pakistan', 'russia', 'south korea', 'turkey',
                   'united kingdom', 'united states']
def ask_user_for_input():
    # add your code here and remove the pass keyword on the next line
    pass
def main():
    # add your code here and remove the pass keyword on the next line
    pass
if __name__ == "__main__":
    main()
Note About Imports
It is important to import the files in the correct order and from the correct files. main.py
should import emotions.py as shown in the template above and not the other way around.
7. Extra Example
The files keywords.tsv and comments.csv should be attached to this assignment on
OWL. The result of running them with the following countries is given below:
Example 1: Country of “All”
Input/Output:
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): all
Input the name of the report file (ending in .txt): my_report.txt
Most common emotion is: anger
Contents of my_report.txt:
Most common emotion: anger
Emotion Totals
anger: 5 (33.33%)
joy: 2 (13.33%)
fear: 1 (6.67%)
trust: 3 (20.0%)
sadness: 3 (20.0%)
anticipation: 1 (6.67%)
Example 2: Country of “brazil”
Input/Output:
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): brazil
Input the name of the report file (ending in .txt): report_brazil.txt
Most common emotion is: fear
Contents of report_brazil.txt:
Most common emotion: fear
Emotion Totals
anger: 0 (0.0%)
joy: 0 (0.0%)
fear: 1 (50.0%)
trust: 0 (0.0%)
sadness: 0 (0.0%)
anticipation: 1 (50.0%)
Example 3: Country of “germany” (there are no comments for this country in the data set)
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): germany
Input the name of the report file (ending in .txt): report.txt
Error: No comments in dataset!
Example 4: Invalid Inputs (these files do not exist or have the wrong extension)
Input keyword file (ending in .tsv): badfile.pizza
Error: Keyword file does not end in .tsv!
Input keyword file (ending in .tsv): this_file_does_not_exist.tsv
Error: this_file_does_not_exist.tsv does not exist!
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): badcsvfile.duck
Error: Comment file does not end in .csv!
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): not_a_real_csv_file.csv
Error: not_a_real_csv_file.csv does not exist!
Input keywor