CS1210留学生辅导、讲解Python、讲解Character Maps、辅导Python语言程序

2018.11.08 - 首页 >> Python编程

CS1210 Computer Science I: Foundations

Homework 2: Character Maps

Due Friday, November 9 at 11:59PM

Introduction

In this homework, we’re going to do some text analysis and plotting of the results. Our end goal is to

produce character maps, that is, a graphic representation of the activity of individual actors in a book. So,

for example, analyzing the classic children’s book The Wind in the Willows yields plots that look like:

where the x-axis indicates the position within the text of the book and the y-axis represents the frequency

with which that character is mentioned at the corresponding text location. Providing different parameters

alters the definition of a character (really, just a proper noun), and sets parameters for aggregation over the

course of the book (e.g., the graph on the left is significantly finer grained than the one on the right).

I will provide you with the usual template file, file signatures and a description of what each function

should do. Unlike Homework 1, however, you will be given more freedom regarding how to structure

your code; you may, for example, create additional helper functions that are called by the functions I

have established in the template file. Note that full credit will be given to working, well documented, well

structured and elegant Python! Thus you are expected to really focus on writing polished code with

good but not overwhelming comments (if in doubt, look at the solution to Homework 1 for an example.

readFile(file, regexes = RE)

This function should return a list of words taken from the specified file. Like our War and Peace example,

we will want to adjust the text for any special cases using a list of regular expression substitutions. At the

very least, your regular expressions should handle common contractions, expanding word patterns like

"couldn’t" into "could not".

Because we have not yet finished covering this topic in class, the template file contains a few regular

expressions just to get you started:

RE = ( ( "( [ a-z] ) n’ t \ b "," \1 not " ), # ". . . n’ t " => " ... not "

(" \ bma’a ?m \ b" , "madam" ) , # Ab b r evs l i k e Mme . ?

( "W( [ a-z] ) - ( [a-z] ) ", " \1\2 ") , # Me r ge s tut t er s l i ke k -k-k i c k

("- +" , " " )) # Sp l it word s a t hyph e n s .

For now these will do; once you get everything else to work and we have finished our coverage of regular

expressions in class, you should go back and extend this list to add any other regular expression

1 Revised October 26, 2018

substitutions as needed. My advice is that you carefully examine the sample texts provided as you read

them in to see what other adjustments might be needed.

Once the regular expressions have been used to tidy up the text, you can go ahead and split up the words

and remove any extraneous punctuation (n.b., the template file imports the punctuation variable from

the string package, a string containing all of the usual punctuation marks). Unlike our War and

Peace example, however, you’re not to drop any words, but return every word from the original

text, without changing the case of the individual words.

So, using the file wind.txt containing the text of The Wind in the Willows, we obtain:

W= r e a dF i l e ( ’wi nd. t xt ’ , RE )

Re a d 60735 word s

>>> W[ : 11 ]

[ ’THE’ , ’WIND’ , ’ IN’ , ’ THE ’ , ’WI LLOWS ’ , ’BY ’ , ’KE NNE TH’ , ’GRAHAME ’ , \ \

’CON TENTS ’ , ’CHAP TER ’ , ’ I’]]

>>> W[ 1520 : 1530 ]

[’ t he ’ , ’Mo l e’ , ’ a s’ , ’ h e’ , ’ p as se d’ , ’i t’ , ’ down ’ , ’ into ’, ’ t h e’ , ’ boa t ’ ]

Note: the transcripts shown here depend on my own solution. Yours may differ, leading to slightly

different values. Attempting to match my values exactly would not be a good use of your time; test your

own code and ascertain that it performs as the specifications require.

findNouns(W, cmin=1)

This function takes as its input a list of words of the sort just returned by readFile(), and returns as its

output a dictionary of proper nouns (the keys) associated with the number of times each noun appears in

the text (the values). The second argument is used to filter out proper nouns that occur fewer than cmin

times. Key to this process is to recognize what makes a proper noun. In general, a proper noun is always

capitalized, and never appears in lower case (note that, on occasion, proper nouns may appear in ALL

CAPS, as a typographical device at the beginning of a chapter, for example). You will need to build this

logic into your code, perhaps, in part, as a helper function.

Using the value of W obtained from The Wind in the Willows in the previous example:

>>> findNo u n s (W, 50 )

Fo u n d 603 noun s

Re t a i ning3 c ommo n noun s

{’ i ’ : 730 , ’mol e ’: 315 , ’ b a dge r ’: 155 }

>>> findNo u n s (W, 30 )

Fo u n d 603 noun s

Re t a i ning4 c ommo n noun s

{’ i ’ : 730 , ’mol e ’: 315 , ’ b a dge r ’: 155 , ’ r at t y’ : 44 }

Note that the keys in the dictionary returned are the lower-case equivalents to the proper nouns found in

buildIndex(W, N)

This function takes as input a list of words, W, like that produced by readFile(), and a dictionary of proper

nouns, N, like that produced by findNouns(). It returns as its output a dictionary of proper nouns (the

keys) associated with a list of integers corresponding to the locations in W that correspond to the key

noun. So:

>>> bui ld I nd ex (W, N)

2 Revised October 26, 2018

{’ i ’ : [ 10, 64, 1138 , 1163 , ... ] , \ \

’mol e ’: [ 69 , 416 , 691 , 959 , ... ] , \ \

’ b a dge r ’: [ 24 , 2141 , 2167 , 2906 , ... ] , \ \

’ r at t y’ : [ 2793 , 3556 , 4079 , 4562 , ... ] }

where the lists of index locations have been elided to save space.

plotChars(N, I, W, xsteps=100)

This function takes as input a dictionary of proper nouns, N, like that produced by findNouns(), an index

of proper noun locations, I, like that produced by buildIndex(), a list of words, W, like that produced by

readFile(), and a step increment, xstep, to be used in plotting the character map.

The function should open a new window like the ones shown in the figure above using the plotting

commands we will discuss in class. Each line in the plot represents the count of mentions of a character at

that point in the text, where "point in the text" is defined by dividing the word list, W, into xstep equal

sized chunks. Thus, the "first location" in the text when xstep is 100 (the default) is the k words of W,

where k*100 = len(W). Using a higher value of xstep givesafiner grained, more volatile, graph (see the

example on the left, above), while a smaller value of xstep breaks the text into fewer equal size chunks,

and hence produces a coarser grained, less volatile, graph of the same data (see the example on the right,

above).

Resources

Start from the Python template file attached to this assignment. The file contains the skeleton of the

system you will be implementing; your job is to complete each function in the skeleton as instructed in

the template file itself. Remember, unlike Homework 1, you have considerably more freedom regarding

how to structure your code, creating additional helper functions, and so on. Full credit will be given to

working, well documented, well structured and elegant Python! Thus you are expected to really focus on

writing polished code with good but not overwhelming comments (if in doubt, look at the solution to

Homework 1 for an example.

You should use the TA office hours to your advantage. If a funciton you are to complete is unclear to you,

then ask the TAs to help you understand what the input/output behavior of the function should be.

As usual, you should start by completing the hawkid() function so that we may properly credit you for

your work. Test hawkid() to ensure it in fact returns your own hawkid as the only element in a single

element tuple. As you work on each function, test it independently to make sure your code functions as

expected. Feel free to upload versions of your code to ICON as you go; we only grade the last version

uploaded, so this practice allows you to "lock in" working partial solutions prior to the deadline. Finally,

some general guidance:

(1) You will be graded on both the correctness and the quality of your code, including the quality of

your comments!

(2) As usual, respect the function signatures provided.

(3) Be careful with iteration; always choose the most appropriate form of iteration

(comprehension, while, or for) as the function mandates. Poorly selected iterative forms may be

graded down, even if they work!

(4) Finally, spend time polishing your solutions; show us you know how to write good quality code.

3 Revised October 26, 2018

hw2.py Fri Oct 26 09:48:32 2018 1

# CS1210 Homework2

# This file should contain only your own work; there are no partners

# assigned or permitted for Homework assignments.

# I certify that the entirety of this file contains only my own work.

# I have not shared the contents of this file with anyone in any form,

# nor have I obtained or included code from any other source aside

# from the code contained in the original homework template file.

from string import punctuation

import matplotlib.pyplot as plt

import re

######################################################################

# Edit the following function definition so it returns a tuple

# containing a single string, your hawkid.

# THE AUTOGRADER WILL FAIL TO ASSIGN A GRADE IF YOUR HAWKID IS NOT

# PROPERLY INCLUDED IN THIS FUNCTION. CAVEAT EMPTOR.

######################################################################

def hawkid():

’’’Used by CS1210 Autograder to extract student identity from code.’’’

return(("segre",))

######################################################################

# Some sample regular expressions.

RE = ( ("([a-z])n’t\\b","\\1 not"), # "...n’t" => "... not"

("\\bma’a?m\\b", "madam"), # Abbrevs like Mme.?

("\W([a-z])-([a-z])", "\\1\\2"), # Merge stutters like k-k-kick

("-+", " ") ) # Split words at hyphens.

######################################################################

# readFile(filename, regexes) returns a list of words read from the

# specified file. The second argument is a list of regular expressions

# that should be applied to the text before stripping punctuation from

# the words in the text.

def readFile(file, regexes = RE):

’’’DOCSTRING’’’

pass

######################################################################

# findNouns(W, cmin=1) returns a dictionary of proper nouns (as keys)

# and their occurrance counts (as values) provided each key noun

# appears at least cmin times in the text.

def findNouns(W, cmin=1):

’’’DOCSTRING’’’

pass

######################################################################

# buildIndex(W, N) returns a dictionary of proper nouns (as keys)

# taken from N and the index value in W for each occurrance of the key

# noun.

def buildIndex(W, N):

’’’DOCSTRING’’’

pass

######################################################################

# plotChars(N, I, W, xsteps=100) uses matplotlib to plot a character

# plot like the one shown in the handout, where N is a dictionary of

# proper nouns (as returned by findNouns()), I is an index of proper

# nouns and their locations in the text (as returned by buildIndex()),

# W is a list of words in the text (as returned by readFile()) and

# xsteps is the window size within which we count occurrences of each

hw2.py Fri Oct 26 09:48:32 2018 2

# character.

def plotChars(N, I, W, xsteps=100):

’’’DOCSTRING’’’

pass

######################################################################

# plot(file=’wind.txt’, cmin=100, xsteps=100) is a driver that manages

# the entire analysis and plotting process. It is presented to give

# you an idea of how to use the functions you’ve just designed.

def plot(file=’wind.txt’, cmin=100, xsteps=100):

’’’Drive analysis of text contained in file.’’’

W=readFile(file)

N=findNouns(W, cmin)

plotChars(N, buildIndex(W, N), W, xsteps)