STAT 7008 – Assignment 3

Submission deadline: midnight, 19 Nov 2018

All questions in this assignment must be solved or answered by

writing Python programs.

Total Marks 100

Question 1: Reading a PDF (36 marks)

The file 57070_CampbellSoup_Investor_Spread.pdf is a financial report of

the Campbell Soup company.

(a) Write Python code to identify the page on which the Consolidated Statements of Cash Flows is located.
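
One possible approach to (a), as a minimal sketch: once the text of each page has been extracted (for example with a PDF library such as pypdf, which is an assumption and not something the assignment prescribes), a whitespace-tolerant regular expression can locate the heading. The function name find_pages and the sample page texts are illustrative only; they are not taken from the report.

```python
import re

def find_pages(page_texts, title):
    """Return the 1-based numbers of the pages whose text contains `title`."""
    # Allow any run of whitespace between words, because text extracted
    # from a PDF often breaks a heading across lines.
    words = [re.escape(word) for word in title.split()]
    pattern = re.compile(r"\s+".join(words), re.IGNORECASE)
    return [number for number, text in enumerate(page_texts, start=1)
            if pattern.search(text or "")]

# Illustrative stand-in for page texts extracted from the PDF.
sample_pages = [
    "Table of contents",
    "Consolidated Statements of Earnings\n(millions)",
    "Consolidated Statements\nof Cash Flows\n(millions)",
]
pages = find_pages(sample_pages, "Consolidated Statements of Cash Flows")
print(pages)
```

The function returns a list because a table of contents may mention the same heading, in which case the candidates still need to be inspected.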

(b) Write a Python program that uses appropriate regular expressions to convert the Consolidated Statements of Cash Flows into a pandas DataFrame.
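
For (b), a minimal sketch of the regular-expression step is given below. It assumes each statement row is a text label followed by a fixed number of amounts, with negative amounts printed in parentheses. The sample lines, the figures and the column names '2018', '2017', '2016' are made up for illustration and are not taken from the report.

```python
import re
import pandas as pd

# A row is a label followed by one or more amounts such as 1,234 or (567).
ROW = re.compile(
    r"^(?P<label>.*?[A-Za-z)])\s+"
    r"(?P<amounts>(?:\(?\$?\d[\d,]*\)?(?:\s+|$))+)$"
)
TOKEN = re.compile(r"\(?\$?\d[\d,]*\)?")

def to_number(token):
    """Convert '1,234' to 1234 and '(567)' to -567."""
    value = int(re.sub(r"[^\d]", "", token))
    return -value if token.startswith("(") else value

def parse_statement(lines, columns):
    """Build a DataFrame from the lines that look like statement rows."""
    rows = []
    for line in lines:
        match = ROW.match(line.strip())
        if not match:
            continue  # headings and other text without amounts
        values = [to_number(t) for t in TOKEN.findall(match.group("amounts"))]
        if len(values) == len(columns):
            rows.append([match.group("label")] + values)
    return pd.DataFrame(rows, columns=["Item"] + list(columns))

# Made-up lines in the general shape of a cash flow statement.
sample_lines = [
    "Cash flows from operating activities:",
    "Net earnings 100 200 300",
    "Purchases of plant assets (1,250) (980) (75)",
]
statement = parse_statement(sample_lines, ["2018", "2017", "2016"])
print(statement)
```

Real extracted text is usually messier (dashes for zero, footnote markers, labels wrapped over two lines), so the pattern would need to be adapted after looking at the actual page text.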

Question 2: Identifying undervalued stocks (44 marks)

(Use the three sets of Pandas notes to write your code.)

The main objectives of this question are as follows:

1. The website http://finviz.com/ screener provides a comprehensive list of variables for 7,541 listed companies. We are interested in downloading the information provided on the website into a pandas data frame for further analysis. The links from which we are going to download the variables have the form

'https://finviz.com/screener.ashx?v=152&r=<a suitable number>&c=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69'.

Observe how the links can be constructed by supplying the set of suitable numbers to the above link.

Example code to download a table from each link and then combine the tables into one large pandas data frame is given below:

    import pandas as pd
    import numpy as np
    from pandas.io.parsers import TextParser
    from numpy import nan as NA
    from lxml.html import parse
    from urllib.request import urlopen

    def _unpack(row, kind='td'):
        # Return the text of every <td> (or `kind`) cell in a table row
        elts = row.findall('.//%s' % kind)
        return [val.text_content() for val in elts]

    def parse_options_data(table):
        rows = table.findall('.//tr')
        data = [_unpack(r) for r in rows]
        header = data[6]   # the 7th row holds the column names
        data2 = data[7:]   # the remaining rows hold the data
        return TextParser(data2, names=header).get_chunk()

    # tailurl is the set of suitable numbers, given as strings (to be supplied)
    # y is the tail of the link, i.e. the string '0,1,2,...,69' that follows '&c='
    baseurl = 'https://finviz.com/screener.ashx?v=152&r='
    df = pd.DataFrame()
    for x in tailurl:
        parsed = parse(urlopen(baseurl + x + '&c=' + y))
        doc = parsed.getroot()
        tables = doc.findall('.//table')
        pdf = parse_options_data(tables[6])
        df = pd.concat([df, pdf], ignore_index=True)
        print(x + ' is completed')

Similar code can also be found in the notes Data Loading and Storage with Pandas and Pandas Data Wrangling, Aggregation and Group Operations.
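
The link construction described in item 1 can be sketched as follows. The sketch assumes that the screener shows 20 rows per page and that r is the 1-based index of the first row on a page; both are assumptions to be checked against the website, and build_screener_urls is a hypothetical helper name.

```python
def build_screener_urls(n_rows, page_size=20, view=152, n_columns=70):
    """Return one screener link per page of results."""
    # Column list 0,1,...,69 as it appears after '&c=' in the link.
    columns = ",".join(str(c) for c in range(n_columns))
    template = "https://finviz.com/screener.ashx?v={}&r={}&c={}"
    # Assumed paging scheme: r = 1, 21, 41, ... (first row of each page).
    return [template.format(view, start, columns)
            for start in range(1, n_rows + 1, page_size)]

urls = build_screener_urls(n_rows=7541)
print(len(urls))
print(urls[0])
```

Under these assumptions the strings '1', '21', '41', ... would play the role of tailurl in the example code.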

2. You may find that some rows in the df data frame consist largely of NaN values. Remove those rows.
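
A minimal sketch of item 2, assuming that "a lot of NaN" means fewer than half of the columns are filled in. The 0.5 threshold, the helper name drop_sparse_rows and the sample data are illustrative choices.

```python
import numpy as np
import pandas as pd

def drop_sparse_rows(df, min_fraction=0.5):
    """Keep rows in which at least `min_fraction` of the columns are non-missing."""
    min_count = int(np.ceil(min_fraction * df.shape[1]))
    return df.dropna(thresh=min_count).reset_index(drop=True)

# Made-up data: the second row is almost entirely missing.
sample = pd.DataFrame({
    "Ticker": ["AAA", None, "CCC"],
    "Sector": ["Tech", None, "Energy"],
    "Price": [10.0, np.nan, np.nan],
    "Volume": [1000.0, 5.0, 300.0],
})
cleaned = drop_sparse_rows(sample)
print(cleaned)
```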

3. Remove the column 'Earnings'.

4. The 6th to the last columns are all in string format and contain 'B', 'M', 'K', '%', '-' and ','. Write a function to clean the data and convert each column to float or int format, whichever is appropriate.
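
One way the cleaning function in item 4 might look is sketched below. It assumes that 'K', 'M' and 'B' are suffixes for thousand, million and billion, that a lone '-' marks a missing value, and that percentages are stored as fractions (12.3% becomes 0.123); these conventions and the sample data are assumptions.

```python
import numpy as np
import pandas as pd

MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}

def clean_value(text):
    """Convert strings such as '1.5B', '12.3%', '1,234' or '-' to a number."""
    if not isinstance(text, str):
        return text  # already numeric or NaN
    s = text.strip().replace(",", "")
    if s in ("", "-"):
        return np.nan
    if s.endswith("%"):
        return float(s[:-1]) / 100.0
    if s[-1] in MULTIPLIERS:
        return float(s[:-1]) * MULTIPLIERS[s[-1]]
    return float(s)

def clean_columns(df, start=5):
    """Clean every column from position `start` (0-based) to the last."""
    out = df.copy()
    for col in out.columns[start:]:
        converted = out[col].map(clean_value)
        values = converted.dropna()
        # Use int only when nothing is missing and every value is whole.
        if len(values) == len(converted) and (values % 1 == 0).all():
            converted = converted.astype("int64")
        out[col] = converted
    return out

# Made-up sample; start=1 because it has only one leading text column.
sample = pd.DataFrame({
    "Ticker": ["AAA", "BBB"],
    "Market Cap": ["1.5B", "250M"],
    "Dividend": ["2.5%", "-"],
    "Volume": ["1,234", "10K"],
})
cleaned = clean_columns(sample, start=1)
print(cleaned.dtypes)
```

The default start=5 corresponds to the 6th column mentioned in item 4.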

5. Obtain a histogram of stock prices using the code

df. .hist(bins=100,alpha=0.3,color='k',normed=True).

However, the graph consists of essentially one bar, which is not sensible given that we have over 7,000 stocks. So we consider only stock prices less than 150 and reproduce the histogram.
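
The single bar in item 5 arises because a few very expensive stocks stretch the x-axis, so almost every price falls into the first bin. A minimal sketch of the filtering step follows; the column name 'Price' and the sample values are assumptions. Note also that the normed argument has been removed from recent matplotlib versions in favour of density.

```python
import pandas as pd

def prices_below(df, limit=150, price_col="Price"):
    """Return the prices strictly below `limit`, dropping missing values."""
    prices = df[price_col].dropna()
    return prices[prices < limit]

# Made-up prices, including one extreme value that would dominate the axis.
sample = pd.DataFrame({"Price": [5.0, 42.0, 149.99, 150.0, 2900.0, None]})
subset = prices_below(sample)
print(subset.tolist())

# With matplotlib installed, the histogram could then be drawn as:
# subset.hist(bins=100, alpha=0.3, color='k', density=True)
```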

6. Obtain a horizontal bar chart of the average prices per Sector.

7. For each industry, compute the average price of its 20 highest-priced stocks, then obtain a horizontal bar chart of the 30 highest of these industry averages.

8. Obtain a horizontal bar chart of the average prices per industry in the financial sector, ignoring the largest industry.

9. Since the property casualty insurers industry has the highest average price in the financial sector, obtain a horizontal bar chart of the 50 highest stock prices among property casualty insurers, ignoring the largest one.
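
The group averages behind the bar charts in items 6 to 9 can be sketched with groupby. The column names 'Sector', 'Industry' and 'Price', the sample data, and the reading of "ignoring the largest" as dropping the group with the highest average are all assumptions.

```python
import pandas as pd

def average_price_by(df, group_col, price_col="Price", drop_largest=False, top=None):
    """Mean price per group, sorted from highest to lowest."""
    means = df.groupby(group_col)[price_col].mean().sort_values(ascending=False)
    if drop_largest:
        means = means.iloc[1:]  # skip the group with the highest average
    if top is not None:
        means = means.head(top)
    return means

# Made-up sample data.
sample = pd.DataFrame({
    "Sector": ["Financial", "Financial", "Financial", "Technology", "Technology"],
    "Industry": ["Banks", "Insurance", "Insurance", "Software", "Software"],
    "Price": [20.0, 300.0, 100.0, 50.0, 70.0],
})
by_sector = average_price_by(sample, "Sector")
financial = sample[sample["Sector"] == "Financial"]
by_industry = average_price_by(financial, "Industry", drop_largest=True)
print(by_sector)

# With matplotlib installed, a horizontal bar chart could be drawn as:
# by_sector.plot(kind="barh")
```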

10. Create variables to locate stocks which sell below their sector averages on PE, PEG, PS, PB and Price respectively.

11. Create variables to locate stocks which sell below their industry averages on PE, PEG, PS, PB and Price respectively.

12. Items 10 and 11 together define 10 simplified criteria for an undervalued stock. Create an index that counts the number of criteria each stock satisfies. We call this index a relative_value_index.
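
A minimal sketch of the indicator variables and the relative_value_index follows. The column names 'P/E', 'PEG', 'P/S', 'P/B', 'Price', 'Sector' and 'Industry' are assumptions about how the downloaded table is labelled, and the sample uses only two metrics to keep it short.

```python
import pandas as pd

METRICS = ["P/E", "PEG", "P/S", "P/B", "Price"]  # assumed column names

def add_relative_value_index(df, metrics=METRICS, groups=("Sector", "Industry")):
    """Add one below-average flag per (group, metric) pair and their count."""
    out = df.copy()
    flags = []
    for group in groups:
        for metric in metrics:
            name = "{}_below_{}_avg".format(metric, group)
            # transform('mean') aligns each group's mean with its rows.
            group_mean = out.groupby(group)[metric].transform("mean")
            out[name] = out[metric] < group_mean  # NaN compares as False
            flags.append(name)
    out["relative_value_index"] = out[flags].sum(axis=1)
    return out

# Made-up sample data.
sample = pd.DataFrame({
    "Sector": ["A", "A", "A"],
    "Industry": ["x", "x", "y"],
    "P/E": [10.0, 20.0, 30.0],
    "Price": [5.0, 15.0, 10.0],
})
scored = add_relative_value_index(sample, metrics=["P/E", "Price"])
print(scored["relative_value_index"].tolist())
```

With all five metrics and both groupings the index ranges from 0 to 10.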

13. Besides the relative_value_index, suppose that other criteria for

identifying an undervalued stock are as follows:

a) Price per share is between $20 and $100

b) Volume must be greater than 10,000

c) Positive earnings per share and positive projected earnings per share

d) Total debt to equity ratio less than 0.5

e) Beta less than 1.5

f) Institutional ownership less than 30 percent

g) relative_value_index greater than 8

Identify the stocks in the dataset that satisfy the stated criteria.
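
The criteria a) to g) can be combined into a single boolean mask, as sketched below. The column names ('EPS', 'EPS next Y', 'Debt/Eq', 'Beta', 'Inst Own' and so on) are assumptions about the downloaded table, and institutional ownership is assumed to be stored as a fraction, so 30 percent is 0.30.

```python
import pandas as pd

def screen_undervalued(df):
    """Return the rows that satisfy criteria a) to g)."""
    mask = (
        df["Price"].between(20, 100)          # a) price between $20 and $100
        & (df["Volume"] > 10000)              # b) volume above 10,000
        & (df["EPS"] > 0)                     # c) positive earnings per share
        & (df["EPS next Y"] > 0)              # c) positive projected earnings
        & (df["Debt/Eq"] < 0.5)               # d) total debt to equity
        & (df["Beta"] < 1.5)                  # e) beta
        & (df["Inst Own"] < 0.30)             # f) institutional ownership
        & (df["relative_value_index"] > 8)    # g) relative value index
    )
    return df[mask]

# Made-up sample: only the first stock passes every criterion.
sample = pd.DataFrame({
    "Ticker": ["AAA", "BBB"],
    "Price": [50.0, 150.0],
    "Volume": [50000, 50000],
    "EPS": [2.0, 2.0],
    "EPS next Y": [2.5, 2.5],
    "Debt/Eq": [0.2, 0.2],
    "Beta": [1.0, 1.0],
    "Inst Own": [0.25, 0.25],
    "relative_value_index": [9, 9],
})
picked = screen_undervalued(sample)
print(picked["Ticker"].tolist())
```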

Question 3: Understanding and revising alien code (20 marks)

The website https://www.pyimagesearch.com/2017/07/17/credit-card-ocr-with-opencv-and-python/ describes a program which can read the numbers shown on a credit card.

Basically the steps are as follows:

1. Given images of the digits 0,1,2,3,4,5,6,7,8,9, we convert them to grayscale and use a cv2 function to identify the contour of each digit. The contours of these digits act as a set of reference contours.

2. Since the digits appear in groups of 4, each group inside a rectangular box and each digit inside a square box, we specify the dimensions of the rectangular box and of the square box. These dimensions help to identify the positions of the groups of digits.

3. Given a credit card image, the strategy is to convert it to grayscale. Since the digits are in a bright color, cv2 provides a set of functions to process the image so that areas of continuous light color that conform to the given box dimensions can be identified.

4. For each identified area, use the cv2 contour function again to identify the contours of the digits in the area.

5. For each digit, from left to right, template matching against the reference contours produces a score for each reference digit, and the best score gives the matching digit.
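
The scoring idea in step 5 can be illustrated with a small self-contained sketch. The given program relies on cv2 for this step; the sketch below swaps in a plain NumPy correlation coefficient between two equally sized images, and the tiny 5x3 "digit" bitmaps are made up for illustration.

```python
import numpy as np

def match_score(roi, template):
    """Correlation coefficient between two equally sized grayscale images."""
    a = roi.astype(float) - roi.mean()
    b = template.astype(float) - template.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom else 0.0

def classify_digit(roi, references):
    """Score `roi` against every reference digit and return the best match."""
    scores = {digit: match_score(roi, ref) for digit, ref in references.items()}
    return max(scores, key=scores.get), scores

# Made-up 5x3 bitmaps standing in for the reference digits 0 and 1.
zero = 255 * np.array([[1, 1, 1], [1, 0, 1], [1, 0, 1], [1, 0, 1], [1, 1, 1]])
one = 255 * np.array([[0, 1, 0], [1, 1, 0], [0, 1, 0], [0, 1, 0], [1, 1, 1]])
references = {0: zero, 1: one}

# A noisy copy of the digit 1 with one pixel flipped.
noisy_one = one.copy()
noisy_one[0, 0] = 255

digit, scores = classify_digit(noisy_one, references)
print(digit)
```

In the real program each digit region would first be resized to the size of the reference digits before scoring.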

A program ocr_template_match.py, digit images OCR-A_reference.png and

the credit card image card_image.png are given for your study.

Unfortunately, the program cannot read the AMEX card image Hilton-honors1.png. A special set of reference digits, AMEX_reference.png, has been created. The exercise is to modify the existing program so that it can read this AMEX card.
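
One likely place to start, as a hedged sketch: AMEX numbers are usually printed as 15 digits in groups of 4, 6 and 5 rather than four groups of 4, so the box dimensions used to find the digit groups probably need to change. The helper below makes those dimensions parameters; the function name, the threshold values and the sample boxes are hypothetical and would have to be measured on the actual image.

```python
def filter_group_boxes(boxes, min_ar, max_ar, min_w, max_w, min_h, max_h):
    """Keep (x, y, w, h) boxes of plausible shape, sorted from left to right."""
    kept = []
    for (x, y, w, h) in boxes:
        aspect_ratio = w / float(h)
        if (min_ar < aspect_ratio < max_ar
                and min_w < w < max_w
                and min_h < h < max_h):
            kept.append((x, y, w, h))
    return sorted(kept, key=lambda box: box[0])

# Made-up candidate boxes: three digit groups of different widths and one
# large region (for example a logo) that should be rejected.
candidates = [(200, 100, 75, 15), (30, 100, 50, 15), (110, 100, 62, 15), (5, 5, 300, 200)]

# Illustrative thresholds, wide enough to admit groups of 4, 5 and 6 digits.
groups = filter_group_boxes(candidates, min_ar=2.5, max_ar=6.0,
                            min_w=40, max_w=90, min_h=10, max_h=20)
print(groups)
```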