---
title: "CMPT 459.1-19. Programming Assignment 1"
subtitle: "FIFA 19 Players"
author: "Name - Student ID"
output: html_notebook
---
### Introduction
The data contains detailed attributes for every player registered in
the latest edition of the FIFA 19 database, obtained by scraping the
website “sofifa.com”. Each instance is a different player, and
the attributes give basic information about the players and
their football skills. Basic pre-processing was done and
goalkeepers were removed for this assignment.
Please look here for the original data overview and attributes’
descriptions:
- https://www.kaggle.com/karangadiya/fifa19
And here to get a better view of the information:
- https://sofifa.com/
---
### First look
**[Task 1]**: Load the dataset, completing the code below (keep
the dataframe name as **fifa**)
```{r}
# Loading
fifa <- read.csv("fifa.csv")
```
**[Checkpoint 1]**: How many rows and columns exist?
```{r}
cat(ifelse(all(dim(fifa) == c(16122, 68)), "Correct results!",
"Wrong results.."))
```
---
**[Task 2]**: Give a very brief overview of the types of each
attribute and their values. **HINT**: Functions *str*, *table*,
*summary*.
```{r}
# Overview
str(fifa)
```
**[Checkpoint 2]**: Were functions used to display data types
and give some idea of the information of the attributes?
---
### Data Cleaning
Suggested functions for this part: *ifelse*, *substr*,
*nchar*, *str_split*, *map_dbl*.
Five attributes need to be cleaned.
- **Value**: Remove euro character, deal with ending
"K" (thousands) and "M" (millions), define missing values and
make it numeric.
- **Wage**: Same as above.
- **Release.Clause**: Same as above.
- **Height**: Convert to "cm" and make it numeric.
- **Weight**: Remove "lbs" and make it numeric.
**[Task 3]**: The first 3 of the 5 attributes listed above that
need to be cleaned are very alike. Create only one function to
clean them the same way. This function should get the vector of
attribute values as parameter and return it cleaned, so use it
three times, each with one of the columns. **Encode zeroes or
blank as NA.**
```{r}
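# Function used to clean attributes
library(stringr)
attr_fix <- function(attribute){
  # Drop the euro sign; as.character() guards against factor input
  cleaned <- str_trim(str_remove(as.character(attribute), fixed("€")))
  # Scale factor implied by a trailing "K" (thousands) or "M" (millions)
  multiplier <- ifelse(str_ends(cleaned, "M"), 1e6,
                       ifelse(str_ends(cleaned, "K"), 1e3, 1))
  # Strip the suffix and convert to numeric; blank strings become NA
  cleaned_attribute <- as.numeric(str_remove(cleaned, "[KM]$")) * multiplier
  # Encode zeroes as NA
  cleaned_attribute[which(cleaned_attribute == 0)] <- NA
  return(cleaned_attribute)
}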
# Cleaning attributes
fifa$Value <- attr_fix(fifa$Value)
fifa$Wage <- attr_fix(fifa$Wage)
fifa$Release.Clause <- attr_fix(fifa$Release.Clause)
```
**[Checkpoint 3]**: How many NA values?
```{r}
cat(ifelse(sum(is.na(fifa))==1779, "Correct results!", "Wrong results.."))
```
---
**[Task 4]**: Clean the other two attributes. **Hint**: To
convert to "cm" use http://www.sengpielaudio.com/calculatorbodylength.htm.
```{r}
# Cleaning attribute Weight:
```
```{r}
# Cleaning attribute Height:
```
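The two conversions can be sketched on toy vectors. This is a minimal illustration, assuming heights are stored as feet and inches (for example `5'7`) and weights carry a trailing `lbs`; both formats and all values below are assumptions, not taken from `fifa.csv`. It uses the exact factors 1 ft = 30.48 cm and 1 in = 2.54 cm, so results may differ slightly if the checkpoint expects rounded conversions.

```r
# Toy vectors; the formats ("5'7", "159lbs") are assumed for illustration
height_raw <- c("5'7", "6'2", "5'11")
weight_raw <- c("159lbs", "183lbs", "150lbs")

# Split "feet'inches" and convert each part to centimetres
parts <- strsplit(height_raw, "'", fixed = TRUE)
feet <- as.numeric(sapply(parts, `[`, 1))
inches <- as.numeric(sapply(parts, `[`, 2))
height_cm <- feet * 30.48 + inches * 2.54

# Drop the "lbs" suffix and convert to numeric
weight_lbs <- as.numeric(sub("lbs$", "", weight_raw))
```

On the real data the same expressions would be applied to `fifa$Height` and `fifa$Weight`, after wrapping them in `as.character()` if they were read as factors.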
**[Checkpoint 4]**: What are the mean values of these two
columns?
```{r}
cat(ifelse(all(c(round(mean(fifa[,8]),4)==164.1339,
round(mean(fifa[,7]),4)==180.3887)), "Correct results!", "Wrong results.."))
```
---
### Missing Values
**[Task 5]**: Which columns have missing values? List them below
(replace <ANSWER HERE>). Impute (do not remove) all missing
values (that is, every NA found) and explain the reasons for the
method used. Suggestion: MICE imputation based on random
forests, using the R package mice:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/. Use
*set.seed(1)*. **HINT**: Remember not to use "ID" or
"International.Reputation" for the imputation if MAR (Missing at
Random) is assumed. Also remember to put them back into the
"fifa" dataframe afterwards.
Columns with missing values:
- <ANSWER HERE>
- <ANSWER HERE>
- ...
```{r}
# Handling NA values
```
```{r}
# Putting columns not used on imputation back into "fifa" dataframe
```
**[Checkpoint 5]**: How many instances have at least one NA? It
should be 0 now. How many columns are there? It should be 68
(remember to put back "ID" and "International.Reputation").
```{r}
cat(ifelse(all(sum(is.na(fifa))==0, ncol(fifa)==68), "Correct results!", "Wrong results.."))
```
---
### Feature Engineering
**[Task 6]**: Create a new attribute called "Position.Rating"
that has the rating value of the position corresponding to the
player. For example, if the player has the value "CF" on the
attribute "Position", then "Position.Rating" should have the
number on the "CF" attribute. **After that, remove the
"Position" attribute from the data**.
```{r}
# Creating the attribute "Position.Rating"
```
```{r}
# Removing the attribute "Position"
```
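One way to look up a different column for each row is matrix indexing with (row, column) pairs. The sketch below uses a made-up three-player data frame with only three rating columns; the name `toy` and all values are hypothetical.

```r
# Toy data frame; column names mirror the task, values are made up
toy <- data.frame(
  Position = c("CF", "ST", "CB"),
  CF = c(80, 70, 50),
  ST = c(78, 85, 45),
  CB = c(40, 38, 82),
  stringsAsFactors = FALSE
)

# Pair each row number with the index of the column named in Position
rating_cols <- as.matrix(toy[, c("CF", "ST", "CB")])
idx <- cbind(seq_len(nrow(toy)), match(toy$Position, colnames(rating_cols)))
toy$Position.Rating <- rating_cols[idx]

# Drop the Position column afterwards
toy$Position <- NULL
```

Indexing a matrix with a two-column matrix picks one cell per row, which avoids an explicit loop over players.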
**[Checkpoint 6]**: What's the mean of the "Position.Rating"
attribute created? How many columns are there in the dataframe?
It should be 68 (remember to remove "Position").
```{r}
cat(ifelse(all(c(round(mean(fifa$Position.Rating),5) == 66.87067,
ncol(fifa)==68)), "Correct results!", "Wrong results.."))
```
---
### Dimension Reduction
**[Task 7]**: Perform PCA (Principal Component Analysis) on the
columns representing ratings of positions (that is, attributes:
LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM,
RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB). Show the
summary of the components obtained. **Keep the minimum number of
components needed to explain at least 98.50% of the variance.**
Remove the columns used for PCA. **HINT**: Function
*prcomp*, remember to center and scale.
```{r}
# Perform PCA
# Show Summary
```
```{r}
# Put the components back into "fifa" dataframe
# Remove original columns used for PCA
```
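The "minimum number of components for a variance threshold" step can be sketched on simulated data. Everything below (the matrix `ratings`, its column names, the two latent factors) is an assumption made for illustration; only *prcomp* and the 98.50% threshold come from the task.

```r
set.seed(1)
# Simulated ratings: 6 correlated columns built from 2 latent factors plus noise
latent <- matrix(rnorm(200 * 2), ncol = 2)
loadings <- matrix(runif(2 * 6), nrow = 2)
ratings <- latent %*% loadings + matrix(rnorm(200 * 6, sd = 0.1), ncol = 6)
colnames(ratings) <- paste0("R", 1:6)

# Centered and scaled PCA
pca <- prcomp(ratings, center = TRUE, scale. = TRUE)
summary(pca)

# Cumulative proportion of variance explained by the first k components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
threshold <- 0.985
n_keep <- which(cum_var >= threshold)[1]

# Scores of the retained components, ready to cbind() onto the other columns
scores <- pca$x[, seq_len(n_keep), drop = FALSE]
```

The `cum_var` vector matches the "Cumulative Proportion" row printed by `summary(pca)`, so the cut-off can also be read off that table by eye.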
**[Checkpoint 7]**: How many columns exist in the dataset? It
should be 45.
```{r}
cat(ifelse(ncol(fifa)==45, "Correct results!", "Wrong results.."))
```
**[Bonus]**: Use the code below to see graphically which columns
influenced each component the most. Replace "fifa.pca" with the
object returned by the *prcomp* function.
```{r}
library(factoextra)
fviz_pca_var(fifa.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
```
---
### Binarization
**[Task 8]**: Perform binarization on the following categorical
attributes: "Preferred.Foot" and "Work.Rate". **HINT**: R
package "dummies", function *dummy.data.frame*.
```{r}
# Binarize categorical attributes
```
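Binarization replaces a categorical column with one 0/1 indicator column per level. The sketch below does this in base R with *model.matrix* rather than the "dummies" package named in the hint, so it is an alternative technique, not the suggested one. The helper `one_hot`, the data frame `toy`, and its values are hypothetical.

```r
# Toy data; the level names are simplified and made up
toy <- data.frame(
  Preferred.Foot = c("Left", "Right", "Right"),
  Work.Rate = c("High", "Medium", "High"),
  Age = c(21, 30, 27),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one indicator column per level, named prefix + level
one_hot <- function(values, prefix) {
  f <- factor(values)
  m <- model.matrix(~ f - 1)  # "- 1" drops the intercept so every level gets a column
  colnames(m) <- paste0(prefix, levels(f))
  m
}

# Replace the two categorical columns with their indicator columns
binarized <- cbind(
  toy["Age"],
  one_hot(toy$Preferred.Foot, "Preferred.Foot"),
  one_hot(toy$Work.Rate, "Work.Rate")
)
```

Note that *model.matrix* drops rows containing NA by default, so this sketch assumes the categorical columns are complete.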
**[Checkpoint 8]**: How many columns exist in the dataset? It
should be 54.
```{r}
cat(ifelse(ncol(fifa)==54, "Correct results!", "Wrong results.."))
```
---
### Normalization
**[Task 9]**: Remove the attribute "ID" from the "fifa" dataframe,
save the attribute "International.Reputation" in a vector named
"IntRep", and then also remove "International.Reputation" from the
"fifa" dataframe. Perform z-score normalization on "fifa", except
for the columns that came from PCA. Finally, combine the normalized
attributes with those from PCA and save the result in the "fifa"
dataframe. **HINT**: Function *scale*.
```{r}
# Normalize with Z-Score
```
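Z-score normalization of a subset of columns can be sketched as follows. The data frame `toy` and the component column name `PC1` are hypothetical; the real names depend on how the PCA scores were added earlier.

```r
set.seed(1)
# Toy data: two ordinary attributes plus one column standing in for a PCA score
toy <- data.frame(Age = rnorm(50, 25, 4), Value = rexp(50, 1e-6), PC1 = rnorm(50))

pca_cols <- "PC1"  # hypothetical name of the component column(s)
to_scale <- setdiff(names(toy), pca_cols)

# scale() subtracts each column mean and divides by the column standard deviation
scaled <- as.data.frame(scale(toy[, to_scale]))

# Recombine the scaled attributes with the untouched PCA columns
normalized <- cbind(scaled, toy[, pca_cols, drop = FALSE])
```

After scaling, `colMeans(scaled)` should be numerically zero and each column's standard deviation should be 1, which is a quick sanity check before clustering.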
**[Checkpoint 9]**: How many columns exist in the dataset? It
should be 52. What's the mean of all the means of the
attributes? Should be around zero.
```{r}
cat(ifelse(ncol(fifa)==52, "Correct results!", "Wrong results.."))
```
---
### K-Means
**[Task 9]**: Perform K-Means for values of K ranging from 2 to
15. Find the best number of clusters for K-means clustering,
based on the silhouette score. Report the best number of
clusters and the silhouette score for the corresponding
clustering (Replace <ANSWER HERE> below). How strong is the
discovered cluster structure? (Replace <ANSWER HERE> below) Use
"set.seed(1)". **HINT**: Function *kmeans* (make use of
parameters *nstart* and *iter.max*) and *silhouette* (from
package "cluster").
```{r}
# K-Means and Silhouette scores
```
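The model-selection loop can be sketched on simulated data. The three blobs, the reduced range of K, and the variable names below are assumptions chosen to keep the example small; *kmeans* and *silhouette* are the functions named in the hint.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Simulated data: three well-separated 2-D blobs of 50 points each
centers <- rbind(c(0, 0), c(10, 10), c(0, 10))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centers[i, 1]), rnorm(50, centers[i, 2]))
}))
d <- dist(x)  # computed once and reused for every K

ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 10, iter.max = 50)
  # Average silhouette width over all points for this clustering
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- ks

best_k <- ks[which.max(avg_sil)]
```

Computing the distance matrix once outside the loop matters on larger data, since `dist()` is quadratic in the number of rows.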
Results found:
- Best number of clusters: <ANSWER HERE>
- Silhouette score: <ANSWER HERE>
- How strong is the cluster? <ANSWER HERE>
**[Checkpoint 9]**: Are there silhouette scores for K-Means with
K ranging from 2 to 15? Were the best K and the corresponding
silhouette score reported?
---
**[Task 10]**: Perform K-means with the K chosen and get the
resulting groups. Try out several pairs of attributes and
produce scatter plots of the clustering from task 9 for these
pairs of attributes. By inspecting these plots, determine a pair
of attributes for which the clusters are relatively well
separated and submit the corresponding scatter plot.
```{r}
# K-Means for best K and Plot
```
**[Checkpoint 10]**: Is there at least one plot showing two
attributes and the groups (colored or circled) reasonably
separated?
---
### Hierarchical Clustering
**[Task 11]**: Sample randomly 1% of the data (set.seed(1)).
Perform hierarchical cluster analysis on the dataset using the
algorithms complete linkage, average linkage and single linkage.
Plot the dendrograms resulting from the different methods (three
methods should be applied on the same 1% sample). Discuss the
commonalities and differences between the three dendrograms and
try to explain the reasons leading to the differences (Replace
the <ANSWER HERE> below).
```{r}
# Sample and calculate distances
```
```{r}
# Complete
```
```{r}
# Average
```
```{r}
# Single
```
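A minimal sketch of running the three linkages on one shared sample, using a simulated matrix in place of the real data. The matrix `x` and its size are assumptions; *hclust* and *dist* are base R.

```r
set.seed(1)
# Simulated stand-in for the normalized data: 1000 rows, 4 attributes
x <- matrix(rnorm(1000 * 4), ncol = 4)

# Draw the 1% row sample once so all three linkages see the same points
sample_idx <- sample(nrow(x), size = round(0.01 * nrow(x)))
d <- dist(x[sample_idx, ])

# Fit one hierarchy per linkage on the same distance matrix
methods <- c(complete = "complete", average = "average", single = "single")
fits <- lapply(methods, function(m) hclust(d, method = m))

# One dendrogram per linkage
for (m in names(fits)) plot(fits[[m]], main = paste(m, "linkage"))
```

Reusing the same `d` for all three fits means any difference between the dendrograms comes from the linkage rule alone, not from sampling.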
Discussion:
- <ANSWER HERE>
**[Checkpoint 11]**: Does the discussion show commonalities and
differences between the three dendrograms and explain the
differences?
---
### Clustering comparison
**[Task 12]**: Now perform hierarchical cluster analysis on the
**ENTIRE dataset** using the algorithms complete linkage,
average linkage and single linkage. Cut all of the three
dendrograms from task 11 to obtain a flat clustering with the
number of clusters determined as the best number in task 9.
To perform an external validation of the clustering results, use
the vector "IntRep" created earlier. What is the Rand Index for the
best K-means clustering? And what are the values of the Rand
Index for the flat clusterings obtained in this task from
complete linkage, average linkage and single linkage? Discuss
the results (Replace <ANSWER HERE> below). **HINT**: Function
*cluster_similarity* from package "clusteval".
```{r}
# Hierarchical Clusterings (Complete, Average and Single)
```
```{r}
# Flat Clusterings
```
```{r}
# Cluster Similarities
```
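The Rand Index is the fraction of point pairs on which two labelings agree, meaning both put the pair in the same group or both put it in different groups. The sketch below computes it by hand in base R rather than with *cluster_similarity* from "clusteval"; the helper `rand_index` and the simulated data are hypothetical. Because it builds full n-by-n comparison matrices, it is only suitable for small inputs.

```r
# Hand-rolled Rand index: share of point pairs treated the same way by both labelings
rand_index <- function(a, b) {
  same_a <- outer(a, a, "==")   # TRUE where a pair shares a label in a
  same_b <- outer(b, b, "==")   # TRUE where a pair shares a label in b
  pairs <- upper.tri(same_a)    # count each unordered pair once
  mean(same_a[pairs] == same_b[pairs])
}

set.seed(1)
# Two simulated groups of 20 points and their known labels
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
truth <- rep(1:2, each = 20)

# Cut a complete-linkage dendrogram into a flat clustering with k groups
hc <- hclust(dist(x), method = "complete")
flat <- cutree(hc, k = 2)

ri_flat <- rand_index(flat, truth)
ri_small <- rand_index(c(1, 1, 2, 2), c(1, 1, 1, 2))  # 3 of 6 pairs agree
```

The index ignores how the groups are named, so swapping label 1 and label 2 in either vector leaves the value unchanged.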
Discussion:
- <ANSWER HERE>