STAT 7008 – Assignment 3
Submission deadline: midnight, 19 Nov 2018
All questions in this assignment must be solved or answered by
writing Python programs.
Total Marks 100
Question 1: Reading pdf (36 marks)
The file 57070_CampbellSoup_Investor_Spread.pdf is a financial report of
the Campbell Soup company.
(a) Write Python code to identify the page on which the Consolidated
Statements of Cash Flows is located.
(b) Write a Python program that uses appropriate regular expressions to
convert the Consolidated Statements of Cash Flows to a pandas
DataFrame.
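One possible starting point for both parts is sketched below. It assumes the text of each page has already been extracted into a list of strings; the extraction step depends on the PDF library you choose and is not shown. The function names, the sample page texts and the sample figures are all invented for illustration and are not taken from the report.

```python
import re

# Hypothetical page texts; in practice these come from a PDF text extractor.
pages = [
    "Letter to shareholders",
    "Consolidated Statements of Earnings\nNet sales 900 800",
    "Consolidated Statements of Cash Flows\n"
    "Net earnings 100 120\n"
    "Net cash provided by operating activities 1,305 1,291\n"
    "Purchases of plant assets (40) (35)",
]

HEADING = re.compile(r"consolidated\s+statements\s+of\s+cash\s+flows", re.IGNORECASE)

# A row is a text label followed by one or more amounts.
# Parentheses mark negative amounts, e.g. "(40)".
ROW = re.compile(
    r"^(?P<label>[A-Za-z][A-Za-z ,.'&/-]*?)\s+"
    r"(?P<values>\(?[\d,]+\)?(?:\s+\(?[\d,]+\)?)*)\s*$"
)

def find_pages(page_texts, pattern=HEADING):
    """Return the 1-based numbers of the pages whose text matches pattern."""
    return [i for i, text in enumerate(page_texts, start=1) if pattern.search(text)]

def parse_amount(token):
    """Convert '1,305' to 1305.0 and '(40)' to -40.0."""
    negative = token.startswith("(") and token.endswith(")")
    value = float(token.strip("()").replace(",", ""))
    return -value if negative else value

def parse_rows(page_text):
    """Return a list of [label, amount, amount, ...] rows found on one page."""
    rows = []
    for line in page_text.splitlines():
        match = ROW.match(line.strip())
        if match:
            amounts = [parse_amount(t) for t in match.group("values").split()]
            rows.append([match.group("label")] + amounts)
    return rows

cash_flow_pages = find_pages(pages)
rows = parse_rows(pages[cash_flow_pages[0] - 1])
```

The resulting list of rows can be passed to pandas.DataFrame. In a real report the heading may also appear on other pages (for example in a table of contents), so the list returned by find_pages may need further filtering, and the row pattern will likely need adjusting to the actual layout.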
Question 2: Identifying undervalued stocks (44 marks)
(Use the three sets of Pandas notes to write your code.)
The main objectives of this question are as follows:
1. The screener on the website http://finviz.com/ provides a comprehensive
list of variables for 7,541 listed companies. We are interested in
downloading the information provided on the website into a pandas
data frame for further analysis. The links from which we download the
variables have the form
'https://finviz.com/screener.ashx?v=152&r=<a suitable number>&c=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69'.
Observe how the links can be constructed by supplying the set of
suitable numbers to the above link.
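As a rough sketch of how such links might be generated, the snippet below assumes that the screener shows 20 rows per page, so that the suitable numbers are 1, 21, 41, and so on. The page size and the function name build_links are assumptions, not something stated by the website or the assignment; check them against what you observe in the browser.

```python
BASE_URL = "https://finviz.com/screener.ashx?v=152&r="

def build_links(n_rows, page_size=20, n_columns=70):
    """Build one screener link per page of results.

    page_size=20 is an assumption about how many rows each page shows.
    """
    columns = ",".join(str(c) for c in range(n_columns))   # '0,1,2,...,69'
    starts = range(1, n_rows + 1, page_size)                # 1, 21, 41, ...
    return [f"{BASE_URL}{start}&c={columns}" for start in starts]

links = build_links(7541)
```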
Example code to download a table from each link and then combine the
tables into one large pandas data frame is given below:
import pandas as pd
import numpy as np
from pandas.io.parsers import TextParser
from numpy import nan as NA
from lxml.html import parse
from urllib.request import urlopen

def _unpack(row, kind='td'):
    # Return the text of every cell of the given kind in a table row.
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]

def parse_options_data(table):
    # Convert an HTML table to a DataFrame; row 6 holds the column names.
    rows = table.findall('.//tr')
    data = [_unpack(r) for r in rows]
    header = data[6]
    data2 = [data[i] for i in range(7, len(data))]
    return TextParser(data2, names=header).get_chunk()

# tailurl is the set of suitable numbers (as strings)
# y is the tail of the link
baseurl = 'https://finviz.com/screener.ashx?v=152&r='
df = pd.DataFrame()
for x in tailurl:
    parsed = parse(urlopen(baseurl+x+'&c='+y))
    doc = parsed.getroot()
    tables = doc.findall('.//table')
    pdf = parse_options_data(tables[6])
    df = pd.concat([df,pdf], ignore_index=True)
    print(x+' is completed')
Similar code can also be found in the notes Data Loading and Storage
with Pandas and Pandas Data Wrangling, Aggregation and Group
Operations.
2. You may find that some rows in the df data frame consist largely of
NaN values. Remove those rows.
3. Remove the column 'Earnings'.
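A small illustration of items 2 and 3 on a toy data frame follows. The values are invented, and the threshold for "a lot of NaN" (fewer than half of the cells filled) is an arbitrary choice made for this sketch, not a rule from the assignment.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the downloaded data frame; all values are invented.
df = pd.DataFrame({
    "Ticker":   ["AAA", np.nan, "BBB", "CCC"],
    "Sector":   ["Financial", np.nan, "Technology", "Financial"],
    "Price":    ["10.50", np.nan, "20.10", np.nan],
    "Earnings": ["Nov 20", np.nan, "-", "Dec 01"],
})

# Keep only rows with at least min_filled non-missing cells.
min_filled = len(df.columns) // 2
cleaned = df.dropna(thresh=min_filled).reset_index(drop=True)

# Drop the 'Earnings' column.
cleaned = cleaned.drop(columns=["Earnings"])
```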
4. The 6th to the last columns are all in char format and contain 'B', 'M',
'K', '%', '-' and ','. Write a function to clean the data and convert each
column to float or int format, whichever is appropriate.
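One way such a cleaning function might look is sketched below. The name to_number and the conventions it uses are assumptions made for illustration: a lone '-' is treated as a missing value, and percentages are kept in percent units rather than divided by 100.

```python
import math

SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9}

def to_number(text):
    """Convert a string such as '1.2B', '3.5%', '1,234' or '-' to a float.

    A lone '-' is returned as NaN. Percentages stay in percent units,
    e.g. '3.5%' becomes 3.5.
    """
    if not isinstance(text, str):
        return text                      # already numeric (or NaN)
    s = text.strip().replace(",", "")
    if s in ("", "-"):
        return math.nan
    if s.endswith("%"):
        return float(s[:-1])
    if s[-1] in SUFFIXES:
        return float(s[:-1]) * SUFFIXES[s[-1]]
    return float(s)
```

The function can be applied to one column at a time, for example with df[col].map(to_number). A column that ends up with whole numbers and no missing values can then be cast with astype(int).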
5. Obtain a histogram of stock prices using the code
df. .hist(bins=100,alpha=0.3,color='k',normed=True).
However, the graph consists of a single bar, which is unexpected given
that we have over 7,000 stocks. So consider only stock prices below
150 and reproduce the histogram.
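A likely reason for the single bar, assuming the data contain a few very highly priced stocks, is that the bins are spread evenly between the smallest and the largest price. The sketch below uses invented prices and numpy.histogram (rather than the plotting call above) to show the effect of one extreme value on 100 equal-width bins.

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented prices: 7,000 ordinary stocks below 150 plus one extreme outlier.
prices = np.append(rng.uniform(1, 150, size=7000), 300_000.0)

counts_all, edges_all = np.histogram(prices, bins=100)
# Each bin is about 3,000 wide, so every ordinary price lands in the first bin.
share_in_first_bin = counts_all[0] / prices.size

# Restricting to prices below 150 spreads the data over all the bins.
below_150 = prices[prices < 150]
counts_cut, edges_cut = np.histogram(below_150, bins=100)
```

On a pandas data frame the same restriction is a boolean filter on the price column before calling hist. Note that in recent matplotlib releases the normed keyword has been replaced by density.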
6. Obtain a horizontal bar chart of the average prices per Sector.
7. Obtain a horizontal bar chart of the top 30 average prices of the top 20
priced stocks per industry.
8. Obtain a horizontal bar chart of the average prices per industry in the
financial sector, ignoring the largest industry.
9. Since the property casualty insurers industry has the highest average
price in the financial sector, obtain a horizontal bar chart of the top 50
highest stock prices of property casualty insurers, ignoring the
largest one.
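The grouping behind charts of this kind can be sketched as follows on a toy data frame. The values are invented, and the column names 'Sector', 'Industry' and 'Price' are assumptions about the cleaned data frame.

```python
import pandas as pd

# Toy data with invented values; the column names are assumptions.
df = pd.DataFrame({
    "Sector":   ["Financial", "Financial", "Financial", "Financial",
                 "Technology", "Technology"],
    "Industry": ["Banks", "Banks", "Insurance", "Insurance",
                 "Software", "Software"],
    "Price":    [20.0, 30.0, 400.0, 600.0, 50.0, 70.0],
})

# Average price per sector, sorted ascending so that the largest
# average is drawn as the top bar of a horizontal bar chart.
avg_by_sector = df.groupby("Sector")["Price"].mean().sort_values()

# Average price per industry within one sector, dropping the largest average.
financial = df[df["Sector"] == "Financial"]
avg_by_industry = financial.groupby("Industry")["Price"].mean().sort_values()
without_largest = avg_by_industry.iloc[:-1]
```

Calling .plot(kind='barh') on any of these Series draws the horizontal bar chart, provided matplotlib is installed. For a "top N" chart, nlargest(N) on the Series selects the entries to plot.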
10. Create variables to locate stocks that sell below their sector averages
on PE, PEG, PS, PB and Price respectively.
11. Create variables to locate stocks that sell below their industry
averages on PE, PEG, PS, PB and Price respectively.
12. Items 10 and 11 together define 10 simplifying criteria for an
undervalued stock. Create an index that counts the number of criteria
each stock satisfies. We call this index the relative_value_index.
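A minimal sketch of the idea, using only two of the five metrics on a toy data frame, is given below. The values are invented and the column names ('Sector', 'Industry', 'P/E', 'Price') and the naming of the flag columns are assumptions. The sketch relies on groupby(...).transform('mean'), which returns the group average aligned with each row.

```python
import pandas as pd

# Toy data; values are invented and the column names are assumptions.
df = pd.DataFrame({
    "Sector":   ["A", "A", "A", "B", "B"],
    "Industry": ["a1", "a1", "a2", "b1", "b1"],
    "P/E":      [10.0, 30.0, 20.0, 5.0, 15.0],
    "Price":    [50.0, 10.0, 30.0, 80.0, 20.0],
})
metrics = ["P/E", "Price"]   # the assignment uses PE, PEG, PS, PB and Price

flag_columns = []
for group in ["Sector", "Industry"]:
    for metric in metrics:
        # Group average of the metric, repeated on every row of the group.
        group_mean = df.groupby(group)[metric].transform("mean")
        name = f"{metric}_below_{group}_avg"
        df[name] = df[metric] < group_mean
        flag_columns.append(name)

# Number of criteria each stock satisfies.
df["relative_value_index"] = df[flag_columns].sum(axis=1)
```

With this approach a stock whose metric is missing compares as False, so it is counted as not satisfying that criterion.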
13. Besides the relative_value_index, suppose that other criteria for
identifying an undervalued stock are as follows:
a) Price per share is between $20 and $100
b) Volume must be greater than 10,000
c) Positive earnings per share and positive projected earnings per share
d) Total debt to equity ratio less than 0.5
e) Beta less than 1.5
f) Institutional ownership less than 30 percent
g) relative_value_index greater than 8
Identify the stocks in the dataset that satisfy the stated criteria.
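The criteria above can be combined into a single boolean mask, as in the sketch below. The data are invented, and the column names (for example 'EPS next Y', 'Debt/Eq' and 'Inst Own') are assumptions; the names in the downloaded data frame should be checked against df.columns.

```python
import pandas as pd

# Toy data with invented values; the column names are assumptions.
df = pd.DataFrame({
    "Ticker":     ["AAA", "BBB", "CCC"],
    "Price":      [45.0, 150.0, 60.0],
    "Volume":     [50_000, 80_000, 5_000],
    "EPS":        [2.1, 3.0, 1.0],
    "EPS next Y": [2.5, 3.2, 1.1],
    "Debt/Eq":    [0.3, 0.2, 0.1],
    "Beta":       [1.1, 0.9, 1.0],
    "Inst Own":   [25.0, 10.0, 20.0],       # in percent
    "relative_value_index": [9, 10, 9],
})

criteria = (
    df["Price"].between(20, 100)             # a) bounds are inclusive
    & (df["Volume"] > 10_000)                # b)
    & (df["EPS"] > 0) & (df["EPS next Y"] > 0)   # c)
    & (df["Debt/Eq"] < 0.5)                  # d)
    & (df["Beta"] < 1.5)                     # e)
    & (df["Inst Own"] < 30)                  # f)
    & (df["relative_value_index"] > 8)       # g)
)
undervalued = df[criteria]
```

In this toy example only AAA passes: BBB fails the price range and CCC fails the volume requirement.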
Question 3: Understanding and revising alien code (20 marks)
The website
https://www.pyimagesearch.com/2017/07/17/credit-card-ocr-with-opencv-and-python/
describes a program that can read the numbers shown on a credit card.
Basically the steps are as follows:
1. Given images of the digits 0,1,2,3,4,5,6,7,8,9, we convert them to gray
scale and use a cv2 function to identify the contour of each digit. The
contours of these digits act as a set of reference contours.
2. Since the digits appear in groups of 4 inside a rectangular box and
each digit is in a square box, we specify the dimensions of the
rectangular box and of the square box. These dimensions help to
identify the positions of the groups of digits.
3. Given a credit card image, the strategy is to convert it to gray scale.
Since the digits are in a bright color, a set of cv2 functions is used to
process the image so that areas of continuous light color that conform
to the given box dimensions can be identified.
4. For each identified area, use the cv2 contour function again to identify
the contours of the digits in the area.
5. For each digit from left to right, template matching against the
reference contours gives a score for each reference digit, and the
reference with the best score is taken as the matching digit.
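The scoring idea in step 5 can be illustrated without cv2. The sketch below is a numpy stand-in, not the calls used in ocr_template_match.py: it scores a digit image against each reference by a mean-removed correlation and reads the digits from left to right by sorting on the x coordinate. The 3x3 "images", the function names and the box format are all invented for illustration.

```python
import numpy as np

def match_score(roi, template):
    """Correlation between two same-sized images after removing their means."""
    a = roi.astype(float) - roi.mean()
    b = template.astype(float) - template.mean()
    return float((a * b).sum())

def read_digits(boxes, references):
    """boxes: list of (x, image) pairs; references: dict of digit -> image."""
    digits = []
    for _, image in sorted(boxes, key=lambda box: box[0]):   # left to right
        scores = {d: match_score(image, ref) for d, ref in references.items()}
        digits.append(max(scores, key=scores.get))           # best score wins
    return digits

# Tiny invented 3x3 "digit images" standing in for the reference digits.
references = {
    "0": np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]]),
    "1": np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    "7": np.array([[1, 1, 1], [0, 0, 1], [0, 0, 1]]),
}

# Digit boxes found on a card, given in arbitrary order with their x positions.
boxes = [(30, references["7"]), (10, references["1"]), (20, references["0"])]
digits = read_digits(boxes, references)
```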
A program ocr_template_match.py, the digit image OCR-A_reference.png and
the credit card image card_image.png are given for your study.
Unfortunately, the program cannot read the AMEX card Hilton-honors1.png.
A special set of reference digits, AMEX_reference.png, has been created.
The exercise is to change the existing program so that it can read this
AMEX card.
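One kind of parameter that may need revisiting is the set of box dimensions used to locate the digit groups. AMEX card numbers have 15 digits, usually printed in groups of 4, 6 and 5, so limits tuned for equal groups of four digits could reject the wider groups. Whether this is what prevents the program from reading Hilton-honors1.png is an assumption, not something the assignment states. The sketch below uses invented boxes and hypothetical limits to show the effect; keep_group_boxes is not a function of the given program.

```python
def keep_group_boxes(boxes, ar_range, w_range, h_range):
    """Filter (x, y, w, h) boxes by aspect ratio, width and height,
    then sort the survivors from left to right."""
    kept = []
    for (x, y, w, h) in boxes:
        ar = w / float(h)
        if (ar_range[0] < ar < ar_range[1]
                and w_range[0] < w < w_range[1]
                and h_range[0] < h < h_range[1]):
            kept.append((x, y, w, h))
    return sorted(kept, key=lambda box: box[0])

# Invented candidate boxes: three digit groups of different widths
# plus two unrelated regions.
candidates = [(150, 100, 60, 15), (20, 100, 48, 15), (80, 100, 68, 15),
              (10, 20, 200, 30), (250, 160, 12, 12)]

# Hypothetical limits suited to equal-width groups of four digits.
four_digit_only = keep_group_boxes(candidates, (2.5, 4.0), (40, 55), (10, 20))

# Looser hypothetical limits that also admit the wider groups.
wider_groups = keep_group_boxes(candidates, (2.5, 5.0), (40, 75), (10, 20))
```

With the tighter limits only one of the three groups survives, while the looser limits keep all three and still reject the unrelated regions.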