---
title: "Applied Data Science Using R - INF 2309H - Assignment #3"
author: "Insert your name here"
date: "03/03/2020"
output:
  word_document: default
  pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown, see the R Markdown documentation.
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.
## Question 1 (Machine Learning using Linear Regression)
1. Load the abalone dataset from the "AppliedPredictiveModeling" package.
```{r}
```
2. The age of an abalone is its number of rings + 1.5. Use the tidyverse library to create the dependent variable "age". Display the structure of the dataset after creating the age column and removing the Rings column.
```{r}
```
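As a minimal sketch of the column derivation in part 2, assuming the `dplyr` package from the tidyverse is installed. The three-row data frame is a hypothetical stand-in with made-up values; only `Rings` and `age` are names taken from the assignment text.

```r
library(dplyr)

# Hypothetical stand-in for the abalone data (values are made up).
toy <- data.frame(
  LongestShell = c(0.455, 0.350, 0.530),
  Rings        = c(15L, 7L, 9L)
)

toy <- toy %>%
  mutate(age = Rings + 1.5) %>%  # derive the dependent variable
  select(-Rings)                 # drop the column age was derived from

str(toy)
```

Removing `Rings` matters because `age` is an exact function of it, so leaving it in would let any model predict age trivially.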
3. Use linear regression to train a model to predict abalone age from the abalone dataset (the one you updated in the previous question through creating age and removing rings). Create train and test sets in the ratio of 60:40. Set your seed to 123 for this question and the rest of the assignment. Display the summary of the model and interpret the coefficients and p-values of Type and ShuckedWeight variables.
```{r}
```
4. Use the model to predict on the test set, then calculate the errors and the RMSE. Interpret the model prediction.
```{r}
```
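The split, fit, predict and RMSE steps of parts 3 and 4 can be sketched in base R as follows. This is an illustration on synthetic data, not the abalone solution: the columns `weight` and `type` and the data-generating formula are assumptions made up for the example.

```r
# Synthetic data standing in for the abalone data (hypothetical columns).
set.seed(123)
n <- 200
toy <- data.frame(
  weight = runif(n, 0.1, 2.5),
  type   = factor(sample(c("F", "I", "M"), n, replace = TRUE))
)
toy$age <- 5 + 4 * toy$weight + rnorm(n, sd = 1.5)

# 60:40 split by sampling row indices.
train_idx <- sample(seq_len(n), size = floor(0.6 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit on the training rows only, then inspect coefficients and p-values.
fit <- lm(age ~ ., data = train)
summary(fit)

# Predict on the held-out rows, then compute the errors and the RMSE.
pred   <- predict(fit, newdata = test)
errors <- test$age - pred
rmse   <- sqrt(mean(errors^2))
rmse
```

The RMSE is in the same units as `age`, so it can be read as a typical prediction error in years.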
## Question 2 (Feature Selection)
1. Use three methods of feature selection to determine the best features to predict age in the abalone dataset. Use the same abalone dataset for which you created the age variable and removed rings.
Method 1: Name the method
```{r}
```
Method 2: Name the method
```{r}
```
Method 3: Name the method
```{r}
```
2. Determine computationally in two different ways which is the most important feature.
Method 1
```{r}
```
Method 2
```{r}
```
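Two of the many possible approaches are sketched below on synthetic data: backward stepwise selection by AIC with base R's `step()`, and a ranking of predictors by absolute t-statistic from the full linear model. The variables `x1`, `x2` and `noise` are hypothetical, and these two techniques are examples only, not necessarily the methods expected for the assignment.

```r
# Synthetic data: x1 is a strong predictor, x2 a weak one, noise is irrelevant.
set.seed(123)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$age <- 10 + 3 * toy$x1 + 0.5 * toy$x2 + rnorm(n)

# Feature selection: backward elimination, dropping terms while AIC improves.
full    <- lm(age ~ ., data = toy)
reduced <- step(full, direction = "backward", trace = 0)
kept    <- attr(terms(reduced), "term.labels")
kept

# Feature importance: rank predictors by |t| in the full model
# (the intercept row is excluded).
tvals   <- abs(coef(summary(full))[-1, "t value"])
ranking <- sort(tvals, decreasing = TRUE)
ranking
```

The t-statistic does not depend on the scale of each predictor, which makes it a fairer ranking than the raw coefficient size.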
## Question 3 (Machine learning using KNN on a numeric dataset)
For this question, you will also use the abalone dataset for which you created the dependent variable age and removed rings in question 1.
1. Train a KNN model on the abalone dataset using a train/test ratio of 70-30 and a k of 5. Use all the variables in the abalone dataset to create the model.
```{r}
```
2. Calculate the performance measures of the KNN model you created in part 1 and interpret them.
```{r}
```
3. Find the best value of k. Rerun the model and calculate its performance with the new k (if different from the original k = 5).
```{r}
```
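To make the mechanics of KNN on a numeric outcome concrete, here is a hand-rolled base R sketch on synthetic data. The helper `knn_predict` is hypothetical (written for this example, not taken from any package), and the features `a` and `b` are made up. In practice a package implementation would normally be used instead.

```r
# Hypothetical helper: predict a numeric outcome as the mean of the
# k nearest training targets (Euclidean distance).
knn_predict <- function(train_x, train_y, test_x, k = 5) {
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))  # distance to every training row
    mean(train_y[order(d)[seq_len(k)]])
  })
}

set.seed(123)
n <- 200
x <- cbind(a = runif(n), b = runif(n, 0, 100))
y <- 5 + 10 * x[, "a"] + 0.05 * x[, "b"] + rnorm(n, sd = 0.5)

# 70:30 split; scale features using training statistics only.
idx <- sample(seq_len(n), size = floor(0.7 * n))
mu  <- colMeans(x[idx, ])
s   <- apply(x[idx, ], 2, sd)
train_x <- scale(x[idx, ],  center = mu, scale = s)
test_x  <- scale(x[-idx, ], center = mu, scale = s)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Performance with k = 5.
pred5 <- knn_predict(train_x, y[idx], test_x, k = 5)
rmse5 <- rmse(y[-idx], pred5)
mae5  <- mean(abs(y[-idx] - pred5))

# Compare several k values. Picking k on the test set is a shortcut;
# cross-validation on the training set would be cleaner.
ks <- 1:15
rmse_by_k <- sapply(ks, function(k) {
  rmse(y[-idx], knn_predict(train_x, y[idx], test_x, k))
})
best_k <- ks[which.min(rmse_by_k)]
```

Scaling is done before computing distances because otherwise the feature with the largest range (`b` here) would dominate the neighbour search.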
## Question 4 (Correlation)
Remove highly correlated (>=0.6) independent variables from the wine quality dataset. This is a dataset where wine quality is predicted from wine composition.
1. Display the correlation matrix and plot.
2. Interpret the correlation between alcohol and density, and between sulphates and free sulphur dioxide.
3. Remove highly correlated variables. How many independent variables are left in the dataset? Which variables were removed?
Import the dataset from
http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
Note: Use multiple Rmd chunks for the various steps
```{r}
```
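One way to remove highly correlated predictors is a greedy filter: find the most correlated pair, drop the member that is more correlated with everything else, and repeat until no pair reaches the cutoff. The base R sketch below applies this to synthetic data. The helper `drop_correlated` and the columns `a` to `d` are hypothetical, and the rule for choosing which variable of a pair to drop is an assumption of this example.

```r
set.seed(123)
n <- 500
a <- rnorm(n)
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.3),  # built to be strongly correlated with a
  c = rnorm(n),
  d = rnorm(n)
)

# Correlation matrix of the predictors.
round(cor(toy), 2)

# Hypothetical helper: greedily drop variables until no pair has
# absolute correlation >= cutoff.
drop_correlated <- function(df, cutoff = 0.6) {
  keep <- names(df)
  repeat {
    cm <- abs(cor(df[keep]))
    diag(cm) <- 0
    if (max(cm) < cutoff) break
    # Take the most correlated pair and drop the member with the
    # larger mean absolute correlation to the other variables.
    pair  <- which(cm == max(cm), arr.ind = TRUE)[1, ]
    worst <- pair[which.max(rowMeans(cm)[pair])]
    keep  <- keep[-worst]
  }
  keep
}

kept    <- drop_correlated(toy, cutoff = 0.6)
removed <- setdiff(names(toy), kept)
```

The filter should be applied to the independent variables only, so the response column would be set aside before calling it.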
## Question 5 (KNN classifier)
Reduce the wine quality rating to three levels: high, medium and low.
Consider quality >= 7 as high and quality <= 4 as low. Then normalize the wine dataset and build a KNN classifier (75:25 train:test ratio), choosing the best k (avoiding k = 1).
Display all performance measures of the KNN classifier. What can you tell about the prediction of the different categories of wine quality? Explain why you got these results.
Note: Use multiple Rmd chunks for the various steps
```{r}
```
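The recoding, normalization and classification steps can be sketched as below, using `knn()` from the `class` package that ships with standard R distributions. The data are synthetic: the columns `alcohol` and `acidity` and the way `quality` is generated are assumptions made up for the example, and k = 5 is an arbitrary choice rather than a tuned value.

```r
library(class)  # recommended package bundled with R

set.seed(123)
n <- 400
toy <- data.frame(alcohol = runif(n, 8, 14), acidity = runif(n, 3, 10))
# Hypothetical integer quality score between 3 and 9, loosely tied to alcohol.
toy$quality <- pmin(9, pmax(3, round(toy$alcohol - 5 + rnorm(n))))

# Recode: quality <= 4 is low, 5 to 6 is medium, >= 7 is high.
toy$rating <- cut(toy$quality, breaks = c(-Inf, 4, 6, Inf),
                  labels = c("low", "medium", "high"))

# Min-max normalization so every feature lies in [0, 1].
normalize <- function(v) (v - min(v)) / (max(v) - min(v))
feats <- as.data.frame(lapply(toy[c("alcohol", "acidity")], normalize))

# 75:25 split, then classify the held-out rows.
idx  <- sample(seq_len(n), size = floor(0.75 * n))
pred <- knn(train = feats[idx, ], test = feats[-idx, ],
            cl = toy$rating[idx], k = 5)

# Confusion matrix and overall accuracy.
conf_mat <- table(predicted = pred, actual = toy$rating[-idx])
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
conf_mat
```

Reading the confusion matrix row by row shows which classes are confused with each other, which overall accuracy alone hides when the classes are unbalanced.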
## Question 6 (ML classifiers and ensemble models)
1. Use the wine dataset. Consider wine of quality less than or equal to 5 as low and greater than 5 as high.
Read the dataset.
```{r}
```
2. Build individual classifiers using random forest, support vector machines, Naive Bayes and logistic regression. Assume you are interested in whether the model correctly predicts high wine quality. Calculate performance measures and decide which is the best model. Use a 70:30 training:test ratio and a seed of 123. Use 3 repeats of 10-fold cross-validation.
Note: Use multiple Rmd chunks for the various steps
```{r}
```
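Repeated k-fold cross-validation is usually delegated to a modelling framework, but writing the loop out once shows what "3 repeats of 10-fold" means. The base R sketch below does this for logistic regression only, on synthetic data. The predictors `x1` and `x2` and the label rule are assumptions invented for the example.

```r
set.seed(123)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$label <- factor(
  ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(n, sd = 0.5) > 0, "high", "low"),
  levels = c("low", "high")  # "high" is the second level, so glm models P(high)
)

# 70:30 split; cross-validation uses the training part only.
idx   <- sample(seq_len(n), size = floor(0.7 * n))
train <- toy[idx, ]

repeats <- 3
folds   <- 10
acc <- numeric(0)
for (r in seq_len(repeats)) {
  # Fresh random fold assignment for every repeat.
  fold_id <- sample(rep(seq_len(folds), length.out = nrow(train)))
  for (f in seq_len(folds)) {
    fit  <- glm(label ~ x1 + x2, data = train[fold_id != f, ], family = binomial)
    p    <- predict(fit, newdata = train[fold_id == f, ], type = "response")
    pred <- ifelse(p > 0.5, "high", "low")
    acc  <- c(acc, mean(pred == train$label[fold_id == f]))
  }
}

mean(acc)  # average accuracy over the 30 held-out folds
```

Each of the 30 fits is scored on rows it never saw, so the mean is a less optimistic estimate than accuracy measured on the training data itself.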
3. Build an ensemble model of all the previous models and calculate and interpret its performance measures.
```{r}
```
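The simplest ensemble is a majority vote over the individual classifiers' predictions. The sketch below uses small hand-written prediction vectors (hypothetical, not output from real models) to show the vote and the performance measures with "high" as the positive class. Other ensembling methods, such as stacking, may be what the course intends.

```r
lv <- c("low", "high")
actual <- factor(c("high", "high", "low", "low", "high", "low", "high", "low"),
                 levels = lv)

# Hypothetical test-set predictions from three classifiers.
preds <- data.frame(
  m1 = c("high", "high", "low", "low",  "low",  "low", "high", "high"),
  m2 = c("high", "low",  "low", "low",  "high", "low", "high", "low"),
  m3 = c("low",  "high", "low", "high", "high", "low", "high", "low")
)

# Performance measures, treating `positive` as the class of interest.
metrics <- function(pred, actual, positive = "high") {
  tp <- sum(pred == positive & actual == positive)
  tn <- sum(pred != positive & actual != positive)
  fp <- sum(pred == positive & actual != positive)
  fn <- sum(pred != positive & actual == positive)
  c(accuracy    = (tp + tn) / length(actual),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp))
}

# Majority vote across the three models for each observation.
vote <- apply(preds, 1, function(r) names(which.max(table(r))))

sapply(preds, metrics, actual = actual)  # one column per individual model
metrics(vote, actual)                    # the ensemble
```

In this toy case each model makes its mistakes on different observations, so the vote corrects all of them. Real models tend to make correlated errors, so the gain is usually smaller.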
## Question 7
1. Use the USDA dataset. Consider calories less than 200 as low, and 200 and more as high.
```{r}
```
2. Build individual classifiers using random forest, support vector machines, Naive Bayes and logistic regression. Assume you are interested in whether the model correctly predicts high calories. Calculate performance measures and decide which is the best model. Use a 70:30 training:test ratio and a seed of 123. Use 3 repeats of 5-fold cross-validation.
Note: Use multiple Rmd chunks for the various steps
Note: If you get a warning with logistic regression, ignore the warning and continue using the model
```{r}
```
3. Build an ensemble model of all the previous models and calculate and interpret its performance measures.
```{r}
```
## Question 8 (Kmeans clustering)
1. Apply k-means clustering to the geyser dataset "faithful" built into R. What is the best value of k to cluster this dataset? Explain how you determined this value. What is the compactness of the k-means clustering?
```{r}
```
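A common heuristic for choosing k is the elbow method: run k-means for several values of k and look for the point where the total within-cluster sum of squares stops dropping sharply. The sketch below uses synthetic data with three well-separated blobs (made up for this example, so it says nothing about the best k for `faithful`) and reports compactness as the ratio of between-cluster to total sum of squares.

```r
set.seed(123)
# Three synthetic blobs of 50 points each (hypothetical data).
toy <- rbind(
  cbind(rnorm(50, 0, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 3, 0.3), rnorm(50, 3, 0.3)),
  cbind(rnorm(50, 0, 0.3), rnorm(50, 3, 0.3))
)
colnames(toy) <- c("x", "y")

# Total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})
round(wss, 1)

# Fit with the k suggested by the elbow (3 for this synthetic data).
km <- kmeans(toy, centers = 3, nstart = 25)

# Compactness: share of total variance explained by the clustering.
compactness <- km$betweenss / km$totss
compactness
```

`nstart = 25` reruns the algorithm from 25 random starts and keeps the best result, which guards against a poor local optimum.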
2. Plot the original faithful dataset (eruptions on the x-axis vs. waiting on the y-axis). From the plot, can you explain why you got the best value of k that you found in part 1?
```{r}
```
3. Validate your clustering and interpret the results of your validation.
```{r}
```
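Silhouette width is one possible validation measure: it compares, for each point, the distance to its own cluster with the distance to the nearest other cluster. The sketch below assumes the `cluster` package (bundled with standard R distributions) and uses two synthetic blobs invented for the example. Whether silhouette analysis is the validation method expected here is an assumption.

```r
library(cluster)  # recommended package bundled with R

set.seed(123)
# Two synthetic, well-separated blobs (hypothetical data).
toy <- rbind(
  cbind(rnorm(50, 0, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 3, 0.3), rnorm(50, 3, 0.3))
)

km  <- kmeans(toy, centers = 2, nstart = 25)

# One silhouette width per observation, in [-1, 1].
sil <- silhouette(km$cluster, dist(toy))
avg_width <- mean(sil[, "sil_width"])
avg_width
```

Values near 1 indicate points sitting well inside their cluster, values near 0 indicate points on a boundary, and negative values suggest a point may be assigned to the wrong cluster.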
4. Find trends in the dataset. Based on these trends, which variable (eruptions or waiting) is better for dividing the dataset into categories? What are the cutpoints at which the clusters are drawn? Interpret your answer.
```{r}
```
## Question 9 (Text analytics)
For this question, we will do text and sentiment analysis of Martin Luther King's speech "I have a dream". The speech is available on Quercus in the file "Dream_Speech.docx".
1. Plot the 20 most frequent words in Martin Luther King's speech "I have a dream" and show their counts.
```{r}
```
2. What are the 20 most common bigrams? Show their counts.
```{r}
```
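Word and bigram counting comes down to tokenizing the text and tabulating. The base R sketch below does this on a short made-up sentence rather than the speech. Text-mining packages offer more convenient tokenizers and stop-word removal, which this example leaves out.

```r
# Made-up sample text (not taken from the speech).
text <- "The quick brown fox jumps over the lazy dog and the quick brown cat"

# Lowercase, strip anything that is not a letter or space, split on spaces.
words <- strsplit(tolower(gsub("[^[:alpha:] ]", "", text)), " +")[[1]]

# Word frequencies, most frequent first.
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts, 3)

# Bigrams: pair each word with the word that follows it.
bigrams <- paste(head(words, -1), tail(words, -1))
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
head(bigram_counts, 3)
```

Without stop-word removal, common function words such as "the" dominate the counts, so filtering them out is usually the first refinement.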
3. Perform a sentiment analysis of the speech using the NRC lexicon and display the results in a chart.
```{r}
```
4. What are the four most common sentiments? Display them with their counts.
```{r}
```
5. Find the top six words associated with each of the four sentiments you identified in the previous question. Why do you think these four sentiments are the most common?
```{r}
```
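Lexicon-based sentiment analysis joins each word of the text to a table that maps words to sentiments, then counts the matches. The sketch below shows that join in base R with a tiny invented lexicon. The entries are hypothetical and are not real NRC lexicon rows, although they mimic its shape, where one word can carry several sentiments.

```r
# Made-up sample text, already lowercase (not taken from the speech).
words <- strsplit("joy and hope will overcome fear and despair and bring joy", " ")[[1]]

# Hypothetical mini-lexicon: one row per (word, sentiment) pair.
lexicon <- data.frame(
  word      = c("joy", "joy", "hope", "hope", "fear", "fear", "despair", "despair"),
  sentiment = c("joy", "positive", "anticipation", "positive",
                "fear", "negative", "sadness", "negative")
)

# Inner join: words missing from the lexicon drop out, repeated words repeat.
tagged <- merge(data.frame(word = words), lexicon, by = "word")

# Sentiment counts, most common first.
sentiment_counts <- sort(table(tagged$sentiment), decreasing = TRUE)
sentiment_counts

# Words contributing to one sentiment, most frequent first.
by_word <- as.data.frame(table(sentiment = tagged$sentiment, word = tagged$word),
                         stringsAsFactors = FALSE)
by_word <- by_word[by_word$Freq > 0, ]
top_positive <- by_word[by_word$sentiment == "positive", ]
top_positive <- top_positive[order(-top_positive$Freq), ]
top_positive
```

Because a single word can map to several sentiments, the sentiment counts can add up to more than the number of matched words.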