python编程辅导、辅导python编程、辅导python、辅导python程序、python程序辅导

- 首页 >> Python编程


Phase 1: Warmup - Python Exercises (20 marks, worth 20% of subject grade)

In this phase, you will practice your Python wrangling skills with a publicly available dataset. The dataset is

obtained through the TMDB (The Movie DB) API. It contains information on movies featured in the Full

MovieLens Dataset and released on or before July 2017. The main features of the Movies Metadata file include

posters, backdrops, budget, revenue, release dates, languages, production countries and companies.

You will be working with the following dataset in this phase:

Movies_tmdb.csv: It has a set of movie records (approx. 45,000), released on or before July 2017. Note

that this dataset is quite large, and you may find it beneficial during development, to first test your code on

a smaller sample of this data.

Libraries to use are Pandas and Matplotlib. You will need to write Python 3 code and work with Series and

DataFrames discussed in workshop week 2 and data cleaning and basic visualisations covered in workshop

weeks 3-4. If you are using other packages, you must provide an explanation in your code about why it is

necessary.

Import Required Python Libraries and Load the Data

Please write here all the Python libraries you will be using! Also load the dataset (.csv) in a dataframe object.

In [ ]:

1 Understanding the Dataset (3 Marks)

1.1 Print the number of movies, number of attributes/columns, column names and datatypes. The output of this

step should look like (2 Marks)

#import ....

import pandas as pd

movies_df = pd.read_csv("Movies_tmdb.csv", low_memory=False)

***

Q1.1

Number of movies: #

Number of attributes/columns: #

Column names: #

Column datatypes: #

***

where # is the values/strings you find.

In [ ]:

1.2 In this assignment, we won't be using all the features (i.e. columns) which are included in the csv file, so

create a new dataframe with the following columns: (1 Marks)

You must keep the order of the columns as provided above. Output of this question should be printing the first

TWO rows (i.e. movies) from the new created dataframe in the following format:

***

Q1.2

The first two rows from the filtered movies dataframe are:

#

#

***

where each # represents one movie row.

In [ ]:

2 Missing Values and Data Types (5 Marks)

2.1 Most of the columns in the movies dataframe have object datatype, let's convert the "popularity" column to

float64 datatype, "title" column to string and "adult" column to boolean. (1 Mark)

The output of this step should print the datatypes of all columns in the movies dataframe after the conversion.

You should follow the following format:

***

Q2.1 Datatypes after conversion:

#

***

### answer Q1.1

### answer Q1.2

where # should be the datatypes of the dataframe columns. Note: You don't have to create a new dataframe for

this question, instead you can use the same dataframe which you created in Q1.2.

In [ ]:

2.2 Now, we will deal with the missing values as a preprocessing step before performing any further analysis.

Let's first print the total number of missing values for each column separately. Following this, you should print

the percentage of movies with incomplete data in any of its attributes (i.e. missing values). Note: A movie is

considered incomplete record if it has a missed value in at least one of its features. (2 Marks)

Note: missing values might be 0, nan, or empty cell.

***

Q2.2 Number of missing values per attribute:

col_1: x

col_2: x

...

col_n: x

***

% of movies with incomplete data: #

***

Replace col_1,col_2 ... col_n with the columns' names, x with the calculated values, and # with the calculated

percentage.

In [ ]:

2.3 Write code that will add a new column called "runtime_non_missing" to the movies dataframe. The values in

the new column should be copied from the "runtime" column and replaces all missing values in this column with

the average of non-missing values for that column. (2 Marks).

The output of this question should print the average calculated value in the following format:

***

Q2.3 Missing values in 'runtime' column are replaced with:

#

***

Where # is the calculated value.

Do you think it will be better to replace the missing values in the "runtime" column with the median instead of

the average? Yes/No - Why?

In [ ]:

### answer Q2.1

### answer Q2.2

### answer Q2.3

3 Cleaning Dataset (8 Marks)

3.1 In this question, you will deal with the datetime datatype. The question has three parts as following: (4

Marks).

Dealing with data formats is an essential step in the data wrangling pipeline. One of the issues is that data entry

might be inconsistent. For example, by looking at the "release_date" column, you will find two different formats

for the date value: '%m/%d/%Y' and '%Y-%m-%d'. Write code which converts the "release_date" into datetime

datatype and consider reading both formats correctly. The final datetime format should be '%m/%d/%Y'.

Another issue is the wrong values for some of these dates, for example some movies have the "release_date"

after July 2017. However, in the description of the dataset, it says the collected movies released on or before

July 2017. To deal with this issue, write code which removes any suspicious records (i.e. any movie which has a

"release_date" after July 2017).

In this assignment, we are not interested in analysing movies released before 1990. Therefore, as a preparation

for our next questions, we only want to keep movies with release date between Jan 1990 and July 2017

(inclusive, i.e. Jan 1990 release_date July 2017) in the movies dataframe. So write code to delete all

movies released outside this interval or has a nan/empty "release_date". You should display the number of the

records (i.e. movies) in the final filtered movies dataframe.

The output of this question should be in the following format:

***

Q3.1

The number of movies with release date between Jan-1990 and July-2017: #

***

Where # is the calculated number.

Note: The resulting dataframe will be used to answer the remaining questions.

In [ ]:

3.2 You might have noticed that the data of the genres column is represented as a list of dictionaries. Let's

change this format into an easier one. Write code to convert the values of genres column into a list of strings

instead of a list of dictionaries, keeping only the value of the "name" attribute. For example, the value [{'id': 16,

'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}] should be

['Animation','Comedey','Family']. The newly converted values should be stored in an extra new column called

"genres_cleaned". (4 Marks).

You should display the first 5 rows of the movies dataframe after adding this new column. The output of this

question should be in the following format:

### answer Q2.3 justification

### answer Q3.1

***

Q3.2

The first 5 rows after adding the "genres_cleaned" column are:

#

***

where # is the first 5 rows in the movies dataframe.

In [ ]:

4 Basic statistics, summary and grouping (10 Marks)

4.1 Write code that calculates the median and average of non-missing values in the budget column for movies

released between 2000 and 2010 (inclusive, i.e. 2000 release_year 2010). (2 Marks).

Your code should print out the results with the following format:

***

Q4.1: Movies budget (2000-2010)

Median: #

Average: #

***

where # is the calculated values rounded to 1 decimal place.

In [ ]:

4.2 Write code that returns a "popular_movies" dataframe with the most popular movie for each year since

2000. This means the dataframe will contain 18 movie, one for each year from 2000 till 2017). The dataframe

should also contain the following columns: "title", "release_date", "runtime" , "original_language", "popularity".

Also, the dataframe should be sorted by the "popularity" values in descending order. (2 Marks).

Your code should print out the popular_movies dataframe in the following format:

***

Q4.2: Most popular movies (2000-2017):

### answer Q3.2

### answer Q4.1

***

where each # represents one row in the popular_movies dataframe.

In [ ]:

4.3 In this question, we will be working with the "vote_average" and "vote_count" columns. Write the code that

returns the "title", "vote_average", "vote_count" of the 10 lowest voted average movies with at least 400 voters

(i.e. "vote_count)". The 10 movies should be displayed in ascending order by the "vote_average" values. (2

Marks).

Your code should print out the lowest voted average movies in the following format:

***

Q4.3: The 10 movies with the lowest vote average are:

***

where each # represent one of the 10 movies.

In [ ]:

4.4 Write code to display the count of movies for the top three movie production countries since Feb-2005 (i.e.

Feb-2005 release_date). (4 Marks).

Your code should print out the result in the following format:

***

Q4.4: Top 3 Movie Production Countries since Feb-2005:

### answer Q4.2

### answer Q4.3

***

where x represents the country name and y is the count of movies produced by this country.

In [ ]:

5 Visualization (13 Marks)

5.1 Boxplots (2 Marks).

Draw a plot consisting of two boxplots. One boxplot to show the distribution of revenue for adult movies. One

boxplot to show the distribution of revenue for other non-adult movies. Note: You should not include movies with

zero-revenue in the box-plot.

In [ ]:

5.2 Histogram (2 Marks)

Draw a bar plot showing month of year (x-axis) versus total number of movies released on that month (y-axis).

Each bar will represent the total number of movies released on a specific month across all years.

Is there any relation between the month of the year and the total number of movies? Yes/No - Explain?

In [ ]:

5.3 Scatter plot (3 Marks)

In this question, we will analyze the relation between few columns in the movies dataset. Draw four plots with

the following four scatter/line plots:

1- Non-zero revenue movies (x-axis) versus number of genres.

2- Non-zero revenue movies (x-axis) versus release year.

3- Non-zero revenue movies (x-axis) versus runtime.

4- Non-zero revenue movies (x-axis) versus vote average.

Pick one of the four plots and justify/explain the relation between the two attributes. You should mention

whether the relation is positive, negative or random. Did you expect this type of relation, Yes/No? Why?

### answer Q4.4

### answer Q5.1

### answer code Q5.2

### answer justification 5.2

In [ ]:

5.4 Pie chart (2 Marks)

Create a pie chart showing the number of movies for each genre. For example, if a movie is classified as both

comedey and action then the count for each of the action and comedey slices should be increased by 1. Each

slice of the pie should have a different colour and contain a percentage number listing its relative size. Also,

each slice of the pie should have a label next to it indicating which genre it corresponds to.

In [ ]:

5.5 Parallel co-ordinates (4 Marks)

In this question, we will the parallel co-ordinates plot to visualize the trend/relation between some of the

features in the movies dataset. You should write the code that implements the following steps:

1. Delete all movies (rows) with missing values in any of the following features: budget, runtime, popularity,

vote_average, and revenue.

2. Then for each of the features, normalise its values to lie within the range [0-1] (0 to 1 inclusive). Use the the

following formula for normalising a feature:

where min is the minimum value for the feature, max is the

maximum value for the feature, newvalue is the normalised value for the feature and oldvalue is the old (unnormalised

value).

3. Using these normalised features, compute the mean budget, runtime, popularity, vote_average, and

revenue for each year.

4. Finally, draw a parallel co-ordinates plot, each line correspondes to a different year. The ordering of the

features for the plot should be budget(leftmost), runtime, popularity, vote_average, and revenue(rightmost).

Colour the movies with in red, in green and

in blue. Provide a legend mapping colours to day type.

From the plot, can you see any relation between the revenue and popularity features? Yes/No- Explain?

In [ ]:

### answer Q5.3

### answer justification 5.3

### answer Q5.4

### answer Q5.5

### answer justification 5.5


站长地图