C Assignment 留学生辅导讲解、讲解C Assignment 程序、辅导C/C++程序

2018.08.29 - 首页 >> C/C++编程

Programming Assignment 3 (PA3) - Spam Checker

Due Date: Wednesday, August 29th @ 11:59 pm

Getting Started Detailed Overview Unit Testing Extra Credit

Example Input Functions to be Written README Questions Turnin Summary

Assignment Overview

In this assignment you are writing a two part interactive program meant to check if an email is in a list of spam

addresses or not. The first part will read in email data files and populate hash tables with these email

addresses, then write out these populated tables into an output file. The second part will allow for user

interaction with the data. It will first read in the tables that were written to the output file from the first part and

will then allow the user to enter email addresses and check if they are spam. If the email address is found in all

of the tables then there is a high chance that the email is a spam email. In our program, if the email address is

found in all of the tables, then it is definitely a spam. This program is inspired by Bloom Filtering, which you

may want to read about to fully understand what you are trying to accomplish.

The educational purpose of this programming assignment is to provide you with more practice with dynamic

memory allocation, more complex file I/O, hashtables and separate-chaining hashtables.

We will try not to do too much hand-holding in this PA. You have some experience now and you need to learn

how to look up information yourself. You will not be given nearly as much help in the upper-division classes,

so now is the time to start doing more on your own.

Grading

● README: 10 points - See README Requirements here and questions below

○ http://cseweb.ucsd.edu/~ricko/CSE30READMEGuidelines.pdf

● Compiling: 5 points - Using our Makefile; no warnings. If what you turn in does not compile with the

given Makefile, you will receive 0 points for this assignment. NO EXCEPTIONS!

● Style: 10 points - See Style Requirements here

○ http://cseweb.ucsd.edu/~ricko/CSE30StyleGuidelines.pdf

● Correctness: 75 points

○ Make sure you have all files tracked in Git.

● Extra Credit: 5 points - View Extra Credit section for more information.

● Wrong Language: You will lose 10 points for each module in the wrong language, C vs. Assembly or

vice versa.

NOTE: If what you turn in does not compile with given Makefile, you will receive 0 points for this assignment.

Getting Started

Follow these steps to acquire the starter files and prepare your Git repository.

Gathering Starter Files:

The first step is to gather all the appropriate files for this assignment. Connect to pi-cluster via ssh.

$ ssh cs30yyz@pi-cluster.ucsd.edu

Create and enter the pa3 working directory.

$ mkdir ~/pa3

$ cd ~/pa3

Copy the starter files from the public directory.

$ cp -r ~/../public/pa3StarterFiles/* ~/pa3/

Copy over the following file from PA2. You are responsible for making sure these functions have good style

and work correctly as they are required for this assignment.

$ cp ~/pa2/checkRange.s ~/pa3

Starter files provided:

pa3.h

test.h

Spam emails provided:

/emails/emails_<num>

pa3Strings.h

testrevHash.c

/emails/emails_all

pa3Globals.c

Makefile

*num means the number of spam email addresses in the file

** You can assume no duplicates in the email list

Preparing Git Repository:

You are required to use Git with this and all future programming assignments. Refer to the PA0 writeup for how

to set up your local git repository.

Example Input

Two sample stripped executables provided for you to try and compare your output against are available in the

public directory. Note that you cannot copy them to your own directory; you can only run them using the

following commands (where you will also pass in the command line arguments):

$ ~/../public/pa3testCreate

$ ~/../public/pa3testSearch

NOTE:

1. The output of your program MUST match exactly as it appears in the pa3testCreate and

pa3testSearch output. You need to pay attention to everything in the output, from the order of the

error messages to the small things like extra newlines at the end (or beginning, or middle, or

everywhere)!

2. We are not giving you any sample outputs, instead you are provided some example inputs.

You are responsible for trying out all functionality of the program; the list of example inputs is

not exhaustive or complete. It is important that you fully understand how the program works

and you test your final solution thoroughly against the executable.

Usage for create:

Usage: ./create -i dataFile -o outFile [-s size] [-h]

-i/--infile dataFile -- The file containing the data

-o/--output outFile -- The file to output hash tables to

-s/--size size -- The size of the hash tables.

If not specified defaults to 5000.

-h/--help -- Displays this long usage message

Example input that has error output:

cs30yyz@pi-cluster-001:pa3$ ./create

cs30yyz@pi-cluster-001:pa3$ ./create -o -s

cs30yyz@pi-cluster-001:pa3$ ./create -s 1000000

cs30yyz@pi-cluster-001:pa3$ ./create -i emails/emails_1000 -s 10000

cs30yyz@pi-cluster-001:pa3$ ./search

cs30yyz@pi-cluster-001:pa3$ ./search -i outfile -o

Usage for search:

Usage: ./search -i inFile [-h]

-i inFile -- The file containing the hash tables to search

-h -- Displays this long usage message

Example input that has normal output:

cs30yyz@pi-cluster-001:pa3$ ./create -i emails/emails_10000 -s 10000 -o outfile

cs30yyz@pi-cluster-001:pa3$ ./create -i emails/emails_10000 -o outfile

cs30yyz@pi-cluster-001:pa3$ ./search -i outfile -h

cs30yyz@pi-cluster-001:pa3$ ./search -i outfile

Note: An easy way to check your output against the sample executable is by using diff:

cs30yyz@pi-cluster-001:pa3$ ./create -i emails/emails_10 -o myOutfile

cs30yyz@pi-cluster-001:pa3$ ~/../public/pa3testCreate -i emails/emails_10 -o refOutfile

cs30yyz@pi-cluster-001:pa3$ diff -s myOutfile refOutfile

Detailed Overview

The function prototypes for the various C and Assembly functions are as follows.

C routines:

int main( int argc, char * argv[] ); (in the file create.c)

int main( int argc, char * argv[] ); (in the file search.c)

void launchUserQuery(table_t * htbl, table_t * rtbl, table_t * eotbl);

int checkTables( char * str, table_t * htbl, table_t * rtbl, table_t * eotbl );

int evenOddHash( char * str );

void freeLinkedList( linkedList_t * head );

void * retrieveData( int hashVal, table_t * table );

void populateTables( table_t * htbl, table_t * rtbl, table_t * eotbl,

FILE * dataFile );

void insertData( void ** head, char * str );

void readTables( FILE * inFile, table_t * htbl, table_t * rtbl,

table_t * eotbl );

void writeTables( FILE * outFile, table_t * htbl, table_t * rtbl,

table_t * eotbl );

Assembly routines:

int hash( char * str );

long checkRange( long value, long minRange, long maxRange );

int revHash( char * str );

Process Overview:

The following is an explanation of the main tasks of the assignment, and how the individual functions work

together to form the whole program. There are two diagrams for this program since there are two separate

main functions.

The first portion of this assignment is create, which will be used to read data and produce hash tables that get

written to a file.

Create workflow:

1. Use getopt_long() to parse all the command line arguments passed into the program. Do all the

required error checking.

2. Allocate space for the three different hash tables and begin populating the tables with data inside the

file specified by the user. *Any time during memory allocation, if an error occurs, free all allocated

memory and exit program

3. Write the populated tables into an output file which could be used to run search later.

The second portion of this assignment is search, which allows the user to check if an email is likely to be spam

or not.

Search workflow:

1. Use getopt() to parse all the command line arguments passed into the program. Do all the required

error checking.

2. Read in the different hash tables stored in the specified input file and recreate those tables in memory.

3. Begin the interactive user query mode that allows user to search the tables. If user specified -x for stats

mode, at the end of the query the total stats and collisions found will be printed. [Three collisions means

that there is a higher chance of the email being spam]

4. When the user ends the search with ctrl-d, free all allocated memory and exit the program.

Explanation about the hashtables

There are three hashtables:

● table 1(htbl) - Use hash() as the hash function. The void * at index i should be set to 1 if some

item hashed to index i

● table 2(rtbl) - Use revhash() as the hash function. The void * at index i should be set to 1 if

some item hashed to index i

● table 3(eotbl) - Use evenOddHash() as the hash function. This is a separate-chaining hash

table. When an item is hashed to index i, it should be pushed to the linked list pointed by the pointer

at index i

Memory Diagram of tables 1 and 2:

Memory Diagram of table 3 (linked list):

Functions to be Written

Listed below are all the modules to be written.

hash.s

int hash( char * str );

This function will be used to create the hash key of a string. This function creates an integer hash key from

str with the following hashing algorithm. Your task is to translate this algorithm into assembly. Use the global

variables defined in pa3Globals.c to access these constants in assembly. The return value can be negative

because of overflow.

int hash = HASH_START_VAL;

int strLen = strlen(str);

for( int i = 0; i < strLen; i++ ) {

hash = hash * HASH_PRIME + str[i];

}

return hash;

Return Value: The hash key of str.

revHash.s

int revHash( char * str );

This function is similar to hash.s, the only difference is that to compute the hash key, it iterates through the

string in reverse order. So, instead of starting at the beginning of the string, you will start at the end and work

your way forward.

Return Value: The hash key of str.

evenOddHash.c

int evenOddHash( char * str );

This function is also similar to hash.s, the only difference is that to compute the hash key, it iterates through the

string first by by even and then by odd indices. So, loop through the even indices in the string first (i.e. str[0]

then str[2], etc.), then iterate over all odd indices (i.e. str[1] then str[3], etc.).

Return Value: The hash key of str.

insertData.c

void insertData( void ** head, char * str );

This function takes in the head of a linked list and an element to insert (a string) and then pushes the element

onto the front of the linked list. To implement this function, you need to allocate new memory for the string, so

that the list does not need to rely on the memory already allocated for the string passed in as an argument.

In order to insert a new element in the list first allocate space for the new head of the list (a struct of type

linkedList_t), then allocate space for a copy of the string that was passed in. After that, put the allocated

string with a copy of str into the newly allocated head, and finally attach the new head to the original list.

Reasons for error:

● Error with dynamically allocating memory. Free previously allocated memory, then call perror(),

passing in the macro MEM_ERR defined in pa3.h as the parameter. (perror() will print the correct

error message for you) Then return immediately.

populateTables.c

void populateTables( table_t * htbl, table_t * rtbl, table_t * eotbl,

FILE * dataFile );

This function is in charge of populating 3 hash tables by reading in the dataFile, grabbing each line through

fgets(), and computing the hash value from each of the 3 hash functions. When reading in line by line, use

strchr() to find the newline character in the resulting buffer from fgets(), and replace it with a null

character. Next, in order to allow our strings to be case-insensitive, use tolower() on every character of

each string to fully convert all characters in it to lower-case. Following this, we will generate the three hashes

for each string by calling hash(), revHash(), and evenOddHash(). Note that the hashkey returned by

each of these hash functions will not necessarily be an index within the size of the hash table. In order to

guarantee that the indices are within the size of the hashtable and are positive, you MUST do the following for

each hash key returned by the hashes, and use the result as the index into the corresponding hashtable:

idx = ((hash_key % table_size) + table_size) % table_size.

After calculating all 3 hash indices, index into the hash tables htbl and rtbl and set the entry to 1. The third

hash table contains a linked list at each hash bucket, so when an element hashes to that location, the string is

pushed to the front of that linked list, and we do so by calling insertData().

writeTables.c

void writeTables( FILE * outFile, table_t * htbl, table_t * rtbl,

table_t * eotbl );

This function serializes the three hash tables and writes them to outFile so that they can later be read and

deserialized back into hashtables in their original format (note that the ordering of elements in each linked list

does not matter).

Contents of the tables should be written to outFile in the following order:

1. Write the size of the tables. You only need to write one number here since all three of the them have

the same size.

2. Write all entries of htbl. They are pointed by member tbl in table_t and each entry is a void

*. Use fwrite() (man -s3 fwrite) to write to outFile.

3. Write all entries of rtbl. They are pointed by member tbl in table_t and each entry is a void

*. Use fwrite().

4. Write the strings in each bucket of eotbl. This is slightly more complicated than the three writes

above, which can be accomplished with calls to fwrite. You can use fputs() (man -s3 fputs)

to help you write each string. Notice that fputs writes the string without its terminating null byte (‘\0’), so

you should write a ‘\0’ after each string using fputc()(man -s3 fputc). To represent the end of a

bucket, we want to write an extra null byte (‘\0’).

Example:

The following eotbl should be serialized as:

a@spam.me\0b@spam.me\0\0c@hack.co\0\0\0

How to test:

In testwriteTables.c, you should set up the hashtables for testing and a FILE stream to write to.

To check the correctness of the output, you can use the following commands to dump the content of the file:

od -c filename (octal dump that print printable characters)

xxd filename (hex dump)

Check the man pages for these commands to get more information!

create.c

int main( int argc, char * argv[] );

This is the main driver for the create portion of your program. Its main tasks are to parse the command line

arguments, build the hashtables of email data, and serialize the hashtables to a file.

Parsing command line arguments:

For this assignment, we want the program to support both short and long options, so we will be using

getopt_long() to parse the flags described in the long usage statement. Check the man page (man -s3

getopt_long) for more details and examples of how to use getopt_long(). Make sure to use the

constants for argument parsing defined in pa3.h. The implementation details for the flags are listed below:

Required

/Optional

Short

Flag

Long Flag Required

Argument

How to Handle

O -h --help N/A Print out the long usage to stdout and return indicating

success.

O -s --size table size Convert the table size from a string to a long (using

strtol()) and save the converted value for later use.

R -i --infile infile name Try to open the file of this name

R -o --output output file name Try to open the file of this name

In the case where getopt_long() detects an invalid flag or missing flag argument, it will automatically print

an error message for you. If this happens, you also need to print the short usage and return indicating failure.

Note that infile and output flags are required, so you need to check if either of them is missing. Size flag is

optional. DEFAULT_SIZE (pa3.h) should be used if no size flag entered.

If all flags are parsed without any errors, make sure you also check for extra arguments after getopt_long()

completes.

Build the hashtables and write to file:

Dynamically allocate zero-filled memory for the three hashtables. Remember to check if any error happens

during dynamic memory allocation. If no error occurs, call populateTables(). After the hashtables get

populated, call writeTables() to write to the file.

Free up memory:

Now you are done with the hashtables and are about to exit. Remember to free all the memory you

dynamically allocated. The function freeLinkedList() should be helpful here.

Reasons for error:

In this file, as soon as an error is detected, print the appropriate error message(s), short usage

SHORT_CREATE_USAGE (see pa3Strings.h), and return indicating failure.

The following are error messages (numbers indicate the order of error checking) you need to print for each

case, make sure you don’t forget to print the short usage in each case!

1. Command line argument error

○ Size (make sure the error checking is the following order)

■ Size is not a valid number: print the error message (INVALID_NUM)

■ Size is a number too large to be converted to a long: use snprintf() and perror()

to print out the error message (TOO_LARGE_NUM)

■ Size is not within [MIN_SIZE, MAX_SIZE] (pa3.h) inclusive. (checkRange())

○ Infile

■ Error encountered when opening the file. If this happens, call perror() with the

FILTER_ERR, and immediately return. (Don’t forget to set errno to 0 before calling

fopen())

○ Output

■ Error encountered when opening the file. If this happens, call perror() with the

WTABLE_FILE_ERR, and immediately return. (Don’t forget to set errno to 0 before

calling fopen())

○ Any missing flag argument or invalid flag

■ getopt_long() prints out the error message for you

2. Missing either of infile flag or output flag

○ Print ARG_ERR

3. extra arguments after getopt_long() completes

○ Print EXTRA_ARGS

4. Dynamic memory allocation failed

○ Free whatever you have dynamically allocated in this function

○ Call perror() with the custom MEM_ERR

Return Value: EXIT_SUCCESS on success and EXIT_FAILURE on any error.

search.c

int main( int argc, char * argv[] );

This function performs the functionality for the search program. It starts by parsing the command line

arguments, handling any errors encountered in the process; then it creates the hash tables and fills them by

reading from the input file. Finally it performs an interactive search and prints stats if needed.

Now let’s take a closer look at each step of the program.

1. Parse command line arguments using getopt: we need to handle 2 possible options

a. INFILE_FLAG: use fopen to open the specified input file. If an error occurs opening the file,

print RTABLE_FILE_ERROR using perror, and then print SHORT_SEARCH_USAGE.

b. HELP_FLAG: simply print LONG_SEARCH_USAGE.

c. If we didn't see an input file, print error message MISSING_INFILE, and

SHORT_SEARCH_USAGE, then immediately return.

d. If we saw extra args, print error message EXTRA_ARGS, and SHORT_SEARCH_USAGE, then

immediately return.

2. Create the 3 tables, and then fill them by reading from the specified input file (call readTables())

3. Perform interactive search by calling launchUserQuery()

4. Finally, don't forget to free all allocated memory.

Reasons for error:

● Error calling fopen

● No input file specified

● Extra arguments for the input flags

Return Value: 1 if an error occurs; 0 otherwise.

launchUserQuery.c

void launchUserQuery(table_t * htbl, table_t * rtbl, table_t * eotbl);

This function handles user interaction to search the tables and prints out the result for each search.

First, start by printing out a prompt for the user to input the string. Then, use fgets() in a while loop to read

the user input line by line.

● Once we read in a string, we want to find any newline character also read in along the string, and

replace it with a NUL character.

● If the user did not enter any string (simply pressed ‘enter’), then we want to immediately reprompt, and

skip the search.

● Before performing the search, we also need to pre-process the string by lowercasing every character.

(Hint: you might find tolower() helpful)

● Finally we want to check the string in three tables by calling checkTables().

○ If checkTables() returns the macro IN_TABLE, print FOUND_IN_ALL with the current word

as the argument.

○ If the string is found in at least one table, print FOUND_IN_SOME with corresponding arguments.

○ If the string in not found in any of the tables, print WORD_NOT_FOUND.

● At the end of each fgets() iteration, reprompt the user for the next input.

Lastly, print a newline before exiting the function.

checkTables.c

int checkTables( char * str, table_t * htbl, table_t * rtbl, table_t * eotbl );

This function reports the number of tables the input string was seen in.

Perform the check in each table by calling retrieveData(). Check htbl first, if it is found in htbl then

check rtbl, if it is found in rtbl then check if it is found in eotbl. Increment the count every time the string

is found in a table. If the string is found in eotbl, then you must also look inside the linked list attached to its

eotbl index. If the input string itself is also found in the linked list, return IN_TABLE instead of the count.

Return Value: Number of tables the string was found in, or IN_TABLE if it is found in all three tables and the

linked list attached to the 3rd table

freeLinkedList.c

void freeLinkedList( linkedList_t * head );

This function frees all elements in the input linked list. To free a linked list, iterate through the list starting from

the head, first free the value of the node, then the node itself, until we reach the end of the list.

retrieveData.c

void * retrieveData( int hashVal, table_t * table );

This function takes a hash value and a table pointer and returns the element at the relevant index of the table.

To calculate the index, mod the hash value by the size of the table, add table size to the result, and then mod

the entire result by the table size again (the same formula from populateTables.c). f

Return Value: The element at the relevant index of the table.

readTables.c

void readTables( FILE * inFile, table_t * htbl, table_t * rtbl,

table_t * eotbl );

Read from outFile and populate the three hashtables serialized in writeTables(). You can assume

the outFile is a file of the same format produced by writeTables(). The hashtables should have the

same structure as original ones, but the order of elements within a linked list does not matter.

Refer to writeTables() for how the hashtables are serialized to the file.

You should read in the following order:

● Read the size of the hashtables

● Read the htbl contents

● Read the rtbl contents

● Read eotbl contents

Here is one way to implement reading the third table eotbl from the file:

● Keep a relatively big buffer (size should be greater than max word size) and read the file in blocks of

the size of the buffer with fread()

○ If there is no incomplete string left from last read into the buffer, you can try reading in a whole

buffer size of data

○ If there is an incomplete string from last read, the incomplete string should be move to the front

of the buffer (more details on this in the next bullet point). When you read from file into the buffer

again, you should append to the incomplete string that is currently at the beginning of the buffer

● Process data in the buffer you just read

○ If you encounter a ‘\0’,check if it is the end of a bucket in eotbl or the end of a string

■ If it is the end of a bucket, go to the next bucket

■ If it is the end of a string, push the string to the linked list of the current bucket

(insertData())

○ If there is a string that is at the end of the buffer but not terminated, move the incomplete string

to the front of the buffer so that next time when you read data from the file into the buffer, you

can start after the end of the incomplete string. You can use memmove()(man -s3

memmove) to accomplish this.

Reasons for error:

● Error with dynamically allocating memory.

○ Free previously allocated memory for the hashtables. Call perror(), passing in the macro

MEM_ERR defined in pa3.h as the parameter. (perror() will print the correct error message

for you) And then return.

Return Value: None (The three hashtables should be populated correctly if no error occurred)

Unit Testing

You are provided with only one basic unit test file for testrevHash(. This file only has minimal test cases

and is only meant to give you an idea of how to write your own unit test files. You must write unit test files

for some of the functions (as listed below), as well as add several of your own thorough test cases to

all of the unit test files. You need to add as many unit tests as necessary to cover all possible use

cases of each function. You will lose points if you don’t do this! You are responsible for making sure you

thoroughly test your functions. Make sure you think about boundary cases, special cases, general cases,

extreme limits, error cases, etc. as appropriate for each function. The Makefile includes the rules for compiling

these tests. Keep in mind that your unit tests will not build until all required files for that specific unit test have

been written. These test files will be collected and count as part of the correctness of your program.

Unit tests you need to complete:

testinsertData.c

testhash.c

testrevHash.c

testevenOddHash.c

testpopulateTables.c

testwriteTables.c

To compile:

$ make testrevHash

To run:

$ ./testrevHash

(Replace “testrevHash” with the appropriate file names to

compile and run the other unit tests)

README Questions

1. [AI] Your friend is stressed and asks you for your CSE 30 PA code. How do you convince them to act

with integrity?

2. [Unix] You are trying to open your pa3Strings.h file. You have already typed “vim pa3S”. Which single

key can you press to autocomplete the command to “vim pa3Strings.h”?

3. [Vim] How do you change the color scheme in vim? Give the command you would use to change the

color scheme to “desert”.

Extra Credit

There are 5 points total for extra credit on this assignment.

● [5 Points] Multiple Email Search

Multiple Email Search

Do not begin this extra credit until your non extra credit assignment is completed and you have tested it

thoroughly. You will need to modify your launchUserQuery() to accept multiple emails entered at the same

time. There will be no changes to how the non extra credit program operates, you will be making a separate

searchEC program. To begin, make a copy of your launchUserQuery.c file. Do not make changes to any

other files (also do not worry about the usage message, keep it the same as the normal program).

$ cp ~/pa3/launchUserQuery.c ~/pa3/launchUserQueryEC.c

To make the extra credit:

cs30yyz@pi-cluster-001:pa3$ make searchEC

There is a public executable for you to test with, use your normal create to make the infile as this has not

changed.

cs30yyz@pi-cluster-001:pa3$ ~/../public/pa3testSearchEC -i infile

launchUserQueryEC.c

void launchUserQuery(table_t * htbl, table_t * rtbl, table_t * eotbl);

First note that you should NOT change the name of the function, only the name of the file is different. You will

need to make a few changes to this function in order to properly deal with multiple emails entered at the same

time. As an example, this is what the program output should look like with multiple emails entered:

Enter a word: hijack@spam.me jello grad@hack.co

hijack@spam.me is SPAM!

jello is not SPAM!

grad@hack.co is SPAM!

Enter a word:

Thus, you will need to so some form of looping over the emails entered. Make sure that your new functionality

is displaying the same stats as in the normal program.

Be sure to test well with the public executable to ensure your output is correct. For simplicity, you are

guaranteed that each email will be separated by a single space.

Turnin Summary

and upload the following files. Your file names must match the below *exactly* otherwise our Makefile will not

find your files.

Due Date: Wednesday night, August 29th @ 11:59 pm

Files required for the Final Turn-in:

hash.s

revHash.s

checkRange.s

testinsertData.c

testhash.c

testrevHash.c

testevenOddHash.c

testpopulateTables.c

testwriteTables.c

launchUserQuery.c

checkTables.c

create.c

evenOddHash.c

freeLinkedList.c

retrieveData.c

pa3Globals.c

populateTables.c

insertData.c

readTables.c

search.c

writeTables.c

pa3.h

pa3Strings.h

test.h

Makefile

README

Extra Credit Files:

launchUserQueryEC.c

If there is anything in these procedures which needs clarifying, please feel free to ask any tutor, the instructor,

or post on the Piazza Discussion Board.