---

title: "CMPT 459.1-19. Programming Assignment 1"

subtitle: "FIFA 19 Players"

author: "Name - Student ID"

output: html_notebook

---

### Introduction

The data has detailed attributes for every player registered in the latest edition of the FIFA 19 database, obtained by scraping the website “sofifa.com”. Each instance is a different player, and the attributes give basic information about the players and their football skills. Basic pre-processing was done, and goalkeepers were removed for this assignment.

Please look here for the original data overview and the attributes’ descriptions:

- https://www.kaggle.com/karangadiya/fifa19

And here to get a better view of the information:

- https://sofifa.com/

---

### First look

**[Task 1]**: Load the dataset, completing the code below (keep the dataframe name as **fifa**).

```{r}

# Loading

fifa <- read.csv("fifa.csv")

```

**[Checkpoint 1]**: How many rows and columns exist?

```{r}

cat(ifelse(all(dim(fifa) == c(16122, 68)), "Correct results!",

"Wrong results.."))

```

---

**[Task 2]**: Give a very brief overview of the types of each attribute and their values. **HINT**: Functions *str*, *table*, *summary*.

```{r}

# Overview

str(fifa)

```

**[Checkpoint 2]**: Were functions used to display data types and give some idea of the information of the attributes?

---

### Data Cleaning

Suggested functions for this part: *ifelse*, *substr*, *nchar*, *str_split*, *map_dbl*.

Five attributes need to be cleaned.

- **Value**: Remove the euro character, handle the trailing "K" (thousands) and "M" (millions), define missing values, and make it numeric.

- **Wage**: Same as above.

- **Release.Clause**: Same as above.

- **Height**: Convert to "cm" and make it numeric.

- **Weight**: Remove "lbs" and make it numeric.

**[Task 3]**: The first 3 of the 5 attributes listed above are very similar. Create a single function that cleans all of them the same way. The function should take the vector of attribute values as its parameter and return it cleaned; call it three times, once for each column. **Encode zeroes or blanks as NA.**

```{r}

# Function used to clean attributes

library(stringr)

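attr_fix <- function(attribute){
  # Strip the euro sign, then read the trailing "K" or "M" as a multiplier
  cleaned <- gsub("€", "", as.character(attribute), fixed = TRUE)
  suffix <- substr(cleaned, nchar(cleaned), nchar(cleaned))
  multiplier <- ifelse(suffix == "M", 1e6, ifelse(suffix == "K", 1e3, 1))
  cleaned_attribute <- as.numeric(gsub("[KM]$", "", cleaned)) * multiplier
  # Encode zeroes and blanks as NA (blanks already become NA in as.numeric)
  cleaned_attribute[!is.na(cleaned_attribute) & cleaned_attribute == 0] <- NA
  return(cleaned_attribute)
}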

# Cleaning attributes

fifa$Value <- attr_fix(fifa$Value)

fifa$Wage <- attr_fix(fifa$Wage)

fifa$Release.Clause <- attr_fix(fifa$Release.Clause)

```

**[Checkpoint 3]**: How many NA values?

```{r}

cat(ifelse(sum(is.na(fifa))==1779, "Correct results!", "Wrong results.."))

```

---

**[Task 4]**: Clean the other two attributes. **Hint**: To convert to "cm" use http://www.sengpielaudio.com/calculatorbodylength.htm.

```{r}

# Cleaning attribute Weight:

```

```{r}

# Cleaning attribute Height:

```
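As a rough sketch of the two conversions on toy values: it assumes Height is stored as feet and inches in the form `5'7` and Weight as text such as `159lbs` (both formats are assumptions about the raw columns, and `height_raw` and `weight_raw` are hypothetical names, not columns of "fifa").

```r
# Toy inputs; the formats "feet'inches" and "<number>lbs" are assumed
height_raw <- c("5'7", "6'2", "5'11")
weight_raw <- c("159lbs", "183lbs", "150lbs")

# Height: split on the apostrophe, then 1 ft = 30.48 cm and 1 in = 2.54 cm
parts <- strsplit(height_raw, "'", fixed = TRUE)
feet <- as.numeric(sapply(parts, `[`, 1))
inches <- as.numeric(sapply(parts, `[`, 2))
height_cm <- feet * 30.48 + inches * 2.54

# Weight: drop the "lbs" suffix and convert to numeric
weight_lbs <- as.numeric(sub("lbs", "", weight_raw, fixed = TRUE))

print(height_cm)
print(weight_lbs)
```

The same steps could also be written with *str_split* and *map_dbl*, as suggested above; base R is used here only to keep the sketch dependency-free.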

**[Checkpoint 4]**: What are the mean values of these two columns?

```{r}

cat(ifelse(all(c(round(mean(fifa[,8]),4)==164.1339,
                 round(mean(fifa[,7]),4)==180.3887)), "Correct results!", "Wrong results.."))

```

---

### Missing Values

**[Task 5]**: Which columns have missing values? List them below (replace <ANSWER HERE>). Impute (do not remove) the missing values (that is, every NA found) and explain the reasons for the method used. Suggestion: MICE imputation based on random forests. R package mice: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/. Use *set.seed(1)*. **HINT**: Remember not to use "ID" or "International.Reputation" for the imputation if MAR (Missing at Random) is assumed. Also remember to put them back into the "fifa" dataframe afterwards.

Columns with missing values:

- <ANSWER HERE>

- ...

```{r}

# Handling NA values

```

```{r}

# Putting columns not used on imputation back into "fifa" dataframe

```
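A possible shape for this step, sketched on a hypothetical toy data frame rather than on "fifa": one column is held out, the rest is imputed with the "rf" method of the mice package, and the held-out column is bound back. It assumes the mice package plus a random forest backend (ranger or randomForest) is installed; the column names and the choice of a single imputed dataset (`m = 1`) are illustrative only.

```r
library(mice)  # method = "rf" also needs a random forest backend installed

# Hypothetical toy data with a few missing values
set.seed(1)
toy <- data.frame(ID = 1:50,
                  a = rnorm(50),
                  b = runif(50),
                  c = rnorm(50, mean = 10))
toy$a[c(3, 17)] <- NA
toy$b[c(8, 25, 40)] <- NA

# Hold out the column that should not inform the imputation
held_out <- toy["ID"]
to_impute <- toy[, setdiff(names(toy), "ID")]

# Random-forest-based MICE; the seed is fixed for reproducibility
imp <- mice(to_impute, m = 1, method = "rf", seed = 1, printFlag = FALSE)

# Extract the completed data and put the held-out column back
toy_complete <- cbind(held_out, mice::complete(imp, 1))
sum(is.na(toy_complete))
```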

**[Checkpoint 5]**: How many instances have at least one NA? It should be 0 now. How many columns are there? It should be 68 (remember to put back "ID" and "International.Reputation").

```{r}

cat(ifelse(all(sum(is.na(fifa))==0, ncol(fifa)==68), "Correct results!", "Wrong results.."))

```

---

### Feature Engineering

**[Task 6]**: Create a new attribute called "Position.Rating" that holds each player's rating at their own position. For example, if the player has the value "CF" in the attribute "Position", then "Position.Rating" should hold the number in the "CF" attribute. **After that, remove the "Position" attribute from the data**.

```{r}

# Creating the attribute "Position.Rating"

```

```{r}

# Removing the attribute "Position"

```
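One way to look up a different column per row is matrix indexing with a two-column (row, column) index. The sketch below uses a hypothetical three-position toy data frame; the values are made up, and only the column names mirror the ones mentioned in the task.

```r
# Toy data: a "Position" column plus one rating column per position
toy <- data.frame(
  Position = c("CF", "CB", "ST"),
  CF = c(80, 55, 78),
  CB = c(40, 82, 45),
  ST = c(79, 50, 85),
  stringsAsFactors = FALSE
)

# For row i, pick the column whose name equals toy$Position[i]
rating_cols <- as.matrix(toy[, c("CF", "CB", "ST")])
col_idx <- match(toy$Position, colnames(rating_cols))
toy$Position.Rating <- rating_cols[cbind(seq_len(nrow(toy)), col_idx)]

# Drop the original "Position" column
toy$Position <- NULL
print(toy)
```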

**[Checkpoint 6]**: What's the mean of the "Position.Rating" attribute created? How many columns are there in the dataframe? It should be 68 (remember to remove "Position").

```{r}

cat(ifelse(all(c(round(mean(fifa$Position.Rating),5) == 66.87067,
                 ncol(fifa)==68)), "Correct results!", "Wrong results.."))

```

---

### Dimension Reduction

**[Task 7]**: Perform PCA (Principal Component Analysis) on the columns representing ratings of positions (that is, attributes: LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB). Show the summary of the components obtained. **Keep the minimum number of components needed to have at least 98.50% of the variance explained by them.** Remove the columns used for PCA. **HINT**: Function *prcomp*, remember to center and scale.

```{r}

# Perform PCA

# Show Summary

```

```{r}

# Put the components back into "fifa" dataframe

# Remove original columns used for PCA

```
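The component-selection rule can be sketched on toy data: run *prcomp* with centering and scaling, accumulate the proportion of variance, and keep the first components that reach the threshold. The data frame `ratings` below is a hypothetical stand-in for the position-rating columns.

```r
# Toy stand-in for the position-rating columns (three correlated, one independent)
set.seed(1)
base <- rnorm(100)
ratings <- data.frame(
  r1 = base + rnorm(100, sd = 0.1),
  r2 = base + rnorm(100, sd = 0.1),
  r3 = -base + rnorm(100, sd = 0.1),
  r4 = rnorm(100)
)

# PCA on centered and scaled columns
toy_pca <- prcomp(ratings, center = TRUE, scale. = TRUE)
summary(toy_pca)

# Cumulative proportion of variance, and the smallest number of
# components that reaches the 98.5% threshold
cum_var <- cumsum(toy_pca$sdev^2) / sum(toy_pca$sdev^2)
n_keep <- which(cum_var >= 0.985)[1]

# Keep only the scores of those components
components <- toy_pca$x[, seq_len(n_keep), drop = FALSE]
dim(components)
```

The "Cumulative Proportion" row printed by `summary()` carries the same information as `cum_var`, so either can be used to read off the number of components.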

**[Checkpoint 7]**: How many columns exist in the dataset? It should be 45.

```{r}

cat(ifelse(ncol(fifa)==45, "Correct results!", "Wrong results.."))

```

**[Bonus]**: Use the code below to see graphically which columns influenced each component the most. Replace "fifa.pca" with the object returned by the *prcomp* function.

```{r}

library(factoextra)

fviz_pca_var(fifa.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
)

```

---

### Binarization

**[Task 8]**: Perform binarization on the following categorical attributes: "Preferred.Foot" and "Work.Rate". **HINT**: R package "dummies", function *dummy.data.frame*.

```{r}

# Binarize categorical attributes

```
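If the "dummies" package is not available, a comparable result can be obtained with base R. The sketch below is an alternative to the hinted function, not the hinted function itself: it uses *model.matrix* without an intercept so that every level gets its own indicator column. The toy data frame is hypothetical.

```r
# Toy data with one categorical attribute
toy <- data.frame(
  Age = c(21, 30, 25),
  Preferred.Foot = factor(c("Left", "Right", "Right"))
)

# "- 1" removes the intercept, so each level keeps its own 0/1 column
foot_dummies <- model.matrix(~ Preferred.Foot - 1, data = toy)

# Replace the categorical column with its indicator columns
toy_bin <- cbind(toy[, setdiff(names(toy), "Preferred.Foot"), drop = FALSE],
                 foot_dummies)
print(toy_bin)
```

Each categorical attribute is best expanded on its own this way: putting two factors into a single formula would drop one level of the second factor.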

**[Checkpoint 8]**: How many columns exist in the dataset? It should be 54.

```{r}

cat(ifelse(ncol(fifa)==54, "Correct results!", "Wrong results.."))

```

---

### Normalization

**[Task 9]**: Remove the attribute "ID" from the "fifa" dataframe, save the attribute "International.Reputation" in a vector named "IntRep", and then also remove "International.Reputation" from the "fifa" dataframe. Perform z-score normalization on "fifa", except for the columns that came from PCA. Finally, combine the normalized attributes with those from PCA, saving the result in the "fifa" dataframe. **HINT**: Function *scale*.

```{r}

# Normalize with Z-Score

```
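A small sketch of the normalization pattern on a hypothetical toy data frame: the columns to normalize are passed through *scale*, while the column standing in for a PCA component (`PC1`, an assumed name) is left untouched and bound back afterwards.

```r
toy <- data.frame(
  Age = c(20, 25, 30, 35),
  Overall = c(60, 70, 80, 90),
  PC1 = c(-1.2, 0.3, 0.5, 0.4)  # stands in for a PCA component
)

pca_cols <- "PC1"
to_scale <- setdiff(names(toy), pca_cols)

# scale() centers each column to mean 0 and divides by its standard deviation
scaled <- as.data.frame(scale(toy[, to_scale]))
toy_norm <- cbind(scaled, toy[, pca_cols, drop = FALSE])

# Column means of the normalized attributes should be (numerically) zero
colMeans(toy_norm[, to_scale])
```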

**[Checkpoint 9]**: How many columns exist in the dataset? It should be 52. What's the mean of all the means of the attributes? It should be around zero.

```{r}

cat(ifelse(ncol(fifa)==52, "Correct results!", "Wrong results.."))

```

---

### K-Means

**[Task 9]**: Perform K-Means for values of K ranging from 2 to 15. Find the best number of clusters for K-Means clustering, based on the silhouette score. Report the best number of clusters and the silhouette score for the corresponding clustering (replace <ANSWER HERE> below). How strong is the discovered cluster structure? (Replace <ANSWER HERE> below.) Use "set.seed(1)". **HINT**: Functions *kmeans* (make use of the parameters *nstart* and *iter.max*) and *silhouette* (from package "cluster").

```{r}

# K-Means and Silhouette scores

```
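The selection loop might look roughly like this on toy data: for each K, run *kmeans*, compute the silhouette widths against a precomputed distance matrix, and average them. The three-blob data, the shortened K range, and the values chosen for *nstart* and *iter.max* are assumptions made for the sake of a quick example.

```r
library(cluster)  # provides silhouette()

# Toy data: three well-separated 2-D blobs
set.seed(1)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 5, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.3), ncol = 2)
)
d <- dist(toy)  # computed once and reused for every K

ks <- 2:6  # the task uses 2:15; a shorter range keeps the toy example quick
avg_sil <- sapply(ks, function(k) {
  set.seed(1)
  km <- kmeans(toy, centers = k, nstart = 10, iter.max = 50)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

# Average silhouette per K, and the K with the highest value
data.frame(k = ks, avg_sil = avg_sil)
best_k <- ks[which.max(avg_sil)]
best_k
```

Note that `dist()` stores all pairwise distances, so its memory use grows quadratically with the number of rows.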

Results found:

- Best number of clusters: <ANSWER HERE>

- Silhouette score: <ANSWER HERE>

- How strong is the cluster? <ANSWER HERE>

**[Checkpoint 9]**: Are there silhouette scores for K-Means with K ranging from 2 to 15? Were the best K and the corresponding silhouette score reported?

---

**[Task 10]**: Perform K-Means with the K chosen and get the resulting groups. Try out several pairs of attributes and produce scatter plots of the clustering from task 9 for these pairs of attributes. By inspecting these plots, determine a pair of attributes for which the clusters are relatively well separated and submit the corresponding scatter plot.

```{r}

# K-Means for best K and Plot

```

**[Checkpoint 10]**: Is there at least one plot showing two attributes and the groups (colored or circled) reasonably separated?

---

### Hierarchical Clustering

**[Task 11]**: Randomly sample 1% of the data (set.seed(1)). Perform hierarchical cluster analysis on the sample using the complete linkage, average linkage and single linkage algorithms. Plot the dendrograms resulting from the different methods (all three methods should be applied to the same 1% sample). Discuss the commonalities and differences between the three dendrograms and try to explain the reasons leading to the differences (replace the <ANSWER HERE> below).

```{r}

# Sample and calculate distances

```

```{r}

# Complete

```

```{r}

# Average

```

```{r}

# Single

```
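As an illustration of the workflow on hypothetical toy data: draw the 1% sample once, compute one distance matrix, and pass that same matrix to *hclust* with each linkage method so the dendrograms are directly comparable. Euclidean distance is an assumption here; the task does not name a metric.

```r
set.seed(1)
toy <- matrix(rnorm(4000), ncol = 2)  # 2000 hypothetical rows

# Sample 1% of the rows once, so all three methods see the same points
idx <- sample(nrow(toy), size = round(0.01 * nrow(toy)))
d <- dist(toy[idx, ])                 # Euclidean distances

# Same distance matrix, three linkage criteria
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")
hc_single   <- hclust(d, method = "single")

# Side-by-side dendrograms
op <- par(mfrow = c(1, 3))
plot(hc_complete, main = "Complete linkage")
plot(hc_average,  main = "Average linkage")
plot(hc_single,   main = "Single linkage")
par(op)
```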

Discussion:

- <ANSWER HERE>

**[Checkpoint 11]**: Does the discussion show commonalities and differences between the three dendrograms and explain the differences?

---

### Clustering comparison

**[Task 12]**: Now perform hierarchical cluster analysis on the **ENTIRE dataset** using the complete linkage, average linkage and single linkage algorithms. Cut all of the three dendrograms from task 11 to obtain a flat clustering with the number of clusters determined as the best number in task 9.

To perform an external validation of the clustering results, use the vector "IntRep" created earlier. What is the Rand Index for the best K-Means clustering? And what are the values of the Rand Index for the flat clusterings obtained in this task from complete linkage, average linkage and single linkage? Discuss the results (replace <ANSWER HERE> below). **HINT**: Function *cluster_similarity* from package "clusteval".

```{r}

# Hierarchical Clusterings (Complete, Average and Single)

```

```{r}

# Flat Clusterings

```

```{r}

# Cluster Similarities

```
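To make the comparison concrete, the sketch below cuts a dendrogram with *cutree* and scores the flat clustering against external labels. The helper `rand_index` is a hand-written, hypothetical stand-in for the hinted *cluster_similarity* function: it computes the plain Rand Index from a contingency table. The toy data and the `truth` vector (playing the role of "IntRep") are made up.

```r
# Rand Index: share of point pairs on which two labelings agree
# (the pair is together in both, or apart in both)
rand_index <- function(labels_a, labels_b) {
  tab <- table(labels_a, labels_b)
  n_pairs <- choose(length(labels_a), 2)
  both_together <- sum(choose(tab, 2))
  together_a <- sum(choose(rowSums(tab), 2))
  together_b <- sum(choose(colSums(tab), 2))
  # agreements = pairs together in both + pairs apart in both
  (n_pairs + 2 * both_together - together_a - together_b) / n_pairs
}

# Toy data: two blobs with hypothetical external labels
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2))
truth <- rep(c(1, 2), each = 20)

# Cut the dendrogram into a flat clustering with k clusters
hc <- hclust(dist(toy), method = "complete")
flat <- cutree(hc, k = 2)

rand_index(flat, truth)
```

The contingency-table form avoids building pairwise matrices, which matters when the labelings cover the entire dataset.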

Discussion:

- <ANSWER HERE>