stata辅导、讲解统计stata留学生
- 首页 >> Algorithm 算法The following code reads business reviews which are part of the Yelp Dataset stored in Kaggle. The data are stored in a CSV file. The following code reads the CSV file and prints the contents of the first 5 records:
In [1]:
import pandas as pd
pd_data = pd.read_csv('yelp_review.zip')
pd_data[:5]
Out[1]:
review_iduser_idbusiness_idstarsdatetextusefulfunnycool
0vkVSCC7xljjrAI4UGfnKEQbv2nCi5Qv5vroFiqKGopiwAEx2SYEUJmTxVVB18LlCwA52016-05-28Super simple place but amazing nonetheless. It...000
1n6QzIUObkYshz4dz2QRJTwbv2nCi5Qv5vroFiqKGopiwVR6GpWIda3SfvPC-lg9H3w52016-05-28Small unassuming place that changes their menu...000
2MV3CcKScW05u5LVfF6ok0gbv2nCi5Qv5vroFiqKGopiwCKC0-MOWMqoeWf6s-szl8g52016-05-28Lester's is located in a beautiful neighborhoo...000
3IXvOzsEMYtiJI0CARmj77Qbv2nCi5Qv5vroFiqKGopiwACFtxLv8pGrrxMm6EgjreA42016-05-28Love coming here. Yes the place always needs t...000
4L_9BTb55X0GDtThi6GlZ6wbv2nCi5Qv5vroFiqKGopiws2I_Ni76bjJNK9yG60iD-Q42016-05-28Had their chocolate almond croissant and it wa...000
From the data, we will only use the reviews and the star rating. The following code extracts this information and places it in a list of pairs:
In [2]:
all_data = list(zip(pd_data['text'], pd_data['stars']))
In [3]:
len(all_data)
Out[3]:
5261668
In [4]:
all_data[:5]
Out[4]:
[("Super simple place but amazing nonetheless. It's been around since the 30's and they still serve the same thing they started with: a bologna and salami sandwich with mustard. \n\nStaff was very helpful and friendly.",
5),
("Small unassuming place that changes their menu every so often. Cool decor and vibe inside their 30 seat restaurant. Call for a reservation. \n\nWe had their beef tartar and pork belly to start and a salmon dish and lamb meal for mains. Everything was incredible! I could go on at length about how all the listed ingredients really make their dishes amazing but honestly you just need to go. \n\nA bit outside of downtown montreal but take the metro out and it's less than a 10 minute walk from the station.",
5),
("Lester's is located in a beautiful neighborhood and has been there since 1951. They are known for smoked meat which most deli's have but their brisket sandwich is what I come to montreal for. They've got about 12 seats outside to go along with the inside. \n\nThe smoked meat is up there in quality and taste with Schwartz's and you'll find less tourists at Lester's as well.",
5),
("Love coming here. Yes the place always needs the floor swept but when you give out peanuts in the shell how won't it always be a bit dirty. \n\nThe food speaks for itself, so good. Burgers are made to order and the meat is put on the grill when you order your sandwich. Getting the small burger just means 1 patty, the regular is a 2 patty burger which is twice the deliciousness. \n\nGetting the Cajun fries adds a bit of spice to them and whatever size you order they always throw more fries (a lot more fries) into the bag.",
4),
("Had their chocolate almond croissant and it was amazing! So light and buttery and oh my how chocolaty.\n\nIf you're looking for a light breakfast then head out here. Perfect spot for a coffee\\/latté before heading out to the old port",
4)]
Let's now check the distribution of star ratings:
In [5]:
%matplotlib inline
from matplotlib import pyplot as plt
In [6]:
from collections import Counter
c = Counter([rating for text, rating in all_data])
c
Out[6]:
Counter({1: 731363, 2: 438161, 3: 615481, 4: 1223316, 5: 2253347})
In [7]:
plt.bar(range(1,6), [c[1], c[2], c[3], c[4], c[5]])
Out[7]:
<Container object of 5 artists>
In this assignment you will predict whether a particular review gives 5 stars or not.
The data set is fairly large with more than 5 million samples. To speed up the computations for this assigmnent, we will use 500,000 samples for training, 10,000 for the dev-test set and 10,000 for the test set. To reduce any possible bias while partitioning the data set, we will first shuffle the data and then partition into training data, dev-test data, and test data using the following code:
In [8]:
import random
random.seed(1234)
random.shuffle(all_data)
train_data, devtest_data, test_data = all_data[:500000], all_data[500000:510000], all_data[510000:520000]
Exercise 1 (1 mark)
The data are annotated with a star rating. In this assignment we will attempt to predict whether the review has 5 stars or not. In other words, we will use two categories: "it does not have 5 stars", and "it has 5 stars". According to these categories, check that the training data, devtest data and test data have the same proportions of the categories "it does not have 5 stars", and "it has 5 stars".
In [ ]:
Exercise 2 (2 marks)
Use sklearn to generate the tf.idf matrix of the training set. With this matrix, train an sklearn Naive Bayes classifier using the training set and report the F1 scores of the training set, the devtest set, and the set set.
In [ ]:
Exercise 3 (2 marks)
Logistic regression normally produces better results than Naive Bayes but it takes longer time to train. Look at the documentation of sklearn and train a logistic regression classifier using the same tfidf information as in exercise 2. Report the F1 scores of the training set, the devtest set, and the test set.
In [ ]:
Exercise 4 (4 marks)
Given the results obtained in the previous exercises, answer the following questions. You must justify all answers.
1.(1 mark) How much overfitting did you observe in the classifiers?
2.(1 mark) What would you do to reduce overfitting?
3.(1 mark) Which classifier is better?
4.(1 mark) What can you conclude from the differences in the results between the dev-test set and the test set?
(write your answer here using Markdown formatting)
Exercise 5 (2 marks)
Write code that counts the false positives and false negatives of the training set of each classifier. What can you conclude from such counts?
In [ ]:
Exercise 6 (9 marks) - Improve the System and Final Analysis
This exercise is open ended. Your goal is to perform a more detailed error analysis and identify ways to improve the classification of reviews by adding or changing the features. To obtain top marks in this part, your answer must address all of the following topics:
1.An error analysis of the previous systems.
2.Based on the error analysis, explain what sort of modifications you would want to implement, and justify why these would be useful modifications.
3.Implementation of the improved classifier.
4.Evaluation of the results and comparison with the previous classifiers. What system is best and why?
5.Explain what further changes would possibly improve the classifier and why.
All this information should be inserted in this notebook below this question. The information should be structured in sections and it must be clear and precise. The explanations should be convincing. Below is a possible list of section headings. These sections are just a guideline. Feel free to change them, but make sure that they are informative and relevant.
Note that, even if the new system might not obtain better results than the previous systems, you can obtain top marks if you perform a good error analysis of the initial systems and the final system and you give a sensible justification of the decisions that led you to implement the new system. Similarly, you may not obtain top marks if you present a system that improves on the results but you do not provide a good error analysis or you do not justify your choice of new system.
1. Error Analysis
(write your text here)
2. Explanation of the Proposed New Classifier
(write your text here)
3. Code of the Proposed New Classifier
(write your text here)
4. Evaluation and Comparison
(write your text here)
5. Final Conclusions and Possible Improvements
(write your text here)
Submission of Results
Your submission should consist of this Jupyter notebook with all your code and explanations inserted in the notebook. The notebook should contain the output of the runs so that it can be read by the assessor without needing to run the code.
Examine this notebook so that you can have an idea of how to format text for good visual impact. You can also read this useful guide to the markdown notation, which explains the format of the text.
Late submissions will have a penalty of 4 marks deduction per day late.
Each question specifies a mark. The final mark of the assignment is the sum of all the individual marks, after applying any deductions for late submission.
By submitting this assignment you are acknowledging that this is your own work. Any submissions that break the code of academic honesty will be penalised as per the academic honesty policy.