Project ECON 427
1. Predicting Stock Price Movements
The goal of this project is to predict stock prices by applying machine learning techniques
to data from StockTwits, a social media platform for investors. We extract
features from textual data, and formulate price prediction as both a regression and a
classification problem. We demonstrate the results and analyze them.
(a) Make yourself familiar with the StockTwits platform: https://stocktwits.com/.
(b) The goal is to perform analysis on the component stocks of the Dow Jones Industrial
Average. Data were collected for the period December 2013 to December
2016, totaling 756 trading days. Two main datasets are used:
i. StockTwits Data: The data were collected and downloaded in raw JSON
format, totaling over 540,000 messages. Sentiment polarity was also extracted
from user-generated “bullish”/ “bearish” tags.
A. Calculate the difference between the number of bullish and bearish tags and
divide it by the total number of tagged messages for each stock in each
day to find a polarity for each stock in each day. Then calculate a moving
average of this ratio and call it s_t. Use a 3-point moving average initially,
but you can try changing the window size of the moving average to see
if you get better results when you are training models.
B. Calculate the number of messages for each stock in each day, which we
call message volume.
C. Calculate the percentage 1-day message volume change, which is the difference
between today’s message volume and yesterday’s message volume
divided by yesterday’s message volume, and call it mv_{1,t}.
D. Calculate today’s message volume divided by the average message volume
in the previous 10 days and call it mv_{10,t}.
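The three message-based features can be sketched in plain Python as follows. This is only an illustration: the function names are hypothetical, the moving average is assumed to be a trailing one that includes today, a day with no tagged messages is assumed to have polarity 0, and days without enough history are returned as None.

```python
from statistics import mean

def polarity(bullish, bearish):
    """(bullish - bearish) / (bullish + bearish); 0.0 if nothing was tagged (assumption)."""
    tagged = bullish + bearish
    return (bullish - bearish) / tagged if tagged else 0.0

def moving_average(values, window=3):
    """Trailing moving average; early days average whatever history exists (assumption)."""
    return [mean(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]

def volume_change_1d(volumes):
    """mv_{1,t}: (today - yesterday) / yesterday, or None when it is undefined."""
    out = [None]
    for prev, cur in zip(volumes, volumes[1:]):
        out.append((cur - prev) / prev if prev else None)
    return out

def volume_ratio_10d(volumes, lookback=10):
    """mv_{10,t}: today's volume divided by the mean of the previous `lookback` days."""
    out = []
    for i, cur in enumerate(volumes):
        if i < lookback:
            out.append(None)
            continue
        base = mean(volumes[i - lookback:i])
        out.append(cur / base if base else None)
    return out

# Made-up daily tag counts for a single stock.
bull = [3, 1, 4, 0]
bear = [1, 1, 0, 0]
daily_polarity = [polarity(b, s) for b, s in zip(bull, bear)]
s_t = moving_average(daily_polarity, window=3)
```

Changing `window` is the knob referred to in item A when experimenting with other moving-average sizes.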
ii. Price Data: Daily split-adjusted stock price data were collected via the Yahoo
Finance API. You can focus only on the closing price data for the
purposes of this project, but you are welcome to test your algorithms for
other prices in the data set as well.
iii. Prediction Target: We focus on the forward T-day return, calculated as the
percentage change between today’s trading price and the price T days ahead, i.e.:
\[ r_t(T) = \frac{p_{t+T} - p_t}{p_t} \]
where p_{t+T} is the price at time period t + T, i.e. T days ahead. Calculate
r_t(3) and r_t(5) from the data for each company. Later, we will try to predict
them using various techniques.
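As a small illustration of the prediction target, the sketch below computes forward returns from a list of closing prices. It assumes the return is the fractional change (p_{t+T} - p_t) / p_t and marks the last T days, which have no future price, as None. The function name and the prices are made up.

```python
def forward_returns(prices, horizon):
    """Forward `horizon`-day return for each day; None where no future price exists."""
    out = []
    for t in range(len(prices)):
        if t + horizon < len(prices):
            out.append((prices[t + horizon] - prices[t]) / prices[t])
        else:
            out.append(None)
    return out

# Hypothetical closing prices for one company.
closes = [100.0, 102.0, 101.0, 105.0, 110.0, 99.0]
r3 = forward_returns(closes, 3)
r5 = forward_returns(closes, 5)
```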
(c) Pre-Processing and Exploratory data analysis:
Project ECON 427, Instructor: Mohammad Reza Rajati
i. There is an exceedingly large number of posts about AAPL. You can remove
AAPL from your analysis if the computational burden is too much for your
computer.
ii. Look up what stop words are and remove them from the data.
iii. Remove company names from the data.
iv. Remove posts mentioning/tagging multiple stocks (e.g. “$AAPL $FB $GOOG”).
v. Aggregate posts by date. For each date in the period December 2013 to
December 2016, you should have a set of tweets for each company in that
date.
vi. Use 70% of the data for training and 30% for testing. Remember not to select
training and test data randomly. Use the first 70% of the days for training
and the last 30% for testing (January 2016 to December 2016). Explain why
this is a correct way of splitting the data.
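The filtering and splitting steps in 1(c) might look like the sketch below. The helper names, the cashtag pattern, and the token pattern are all assumptions made for illustration. The stop-word and company-name sets are supplied by the caller rather than fixed here.

```python
import re

# A cashtag is assumed to be a dollar sign followed by letters, e.g. $AAPL.
CASHTAG = re.compile(r"\$[A-Za-z]+")

def mentions_single_stock(text):
    """True if the post tags at most one distinct ticker."""
    tickers = {tag.upper() for tag in CASHTAG.findall(text)}
    return len(tickers) <= 1

def clean_tokens(text, stop_words, company_terms):
    """Lowercase, strip cashtags, then drop stop words and company names."""
    text = CASHTAG.sub(" ", text.lower())
    tokens = re.findall(r"[a-z']+", text)
    return [tok for tok in tokens if tok not in stop_words and tok not in company_terms]

def chronological_split(dates, train_fraction=0.7):
    """Earliest `train_fraction` of the distinct days for training, the rest for testing."""
    ordered = sorted(set(dates))  # ISO-formatted date strings sort chronologically
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]
```

Splitting by date rather than at random keeps every test day strictly later than every training day, so the model is never fitted on information from the period it is evaluated on.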
(d) Bag of Words Features
i. Calculate the frequencies of the words in the data.
ii. Only keep words that occurred at least 25 times in the dataset. This should
give you more than 6800 words.
iii. For each of the words in 1(d)ii, calculate the TF-IDF metric with Laplace
smoothing. Those metrics are used as features in your classification models.
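One way to read 1(d) is sketched below, where each document is the token list for one stock on one day. The smoothing shown adds one to both the document count and the document frequency, idf = ln((1 + N) / (1 + df)) + 1, which is one common convention and is an assumption here. The project text does not say which smoothed variant is intended. Function names are hypothetical.

```python
import math
from collections import Counter

def build_vocabulary(docs, min_count=25):
    """Keep words whose total count over all documents is at least `min_count`."""
    totals = Counter(tok for doc in docs for tok in doc)
    return sorted(word for word, count in totals.items() if count >= min_count)

def tfidf_matrix(docs, vocabulary):
    """Rows of raw term count times smoothed idf, one row per document."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0 for w in vocabulary}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[w] * idf[w] for w in vocabulary])
    return rows

# Tiny made-up corpus; min_count is lowered so that something survives the cut.
docs = [["buy", "buy", "dip"], ["sell", "dip"], ["buy"]]
vocab = build_vocabulary(docs, min_count=2)
features = tfidf_matrix(docs, vocab)
```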
(e) Chi-Squared Statistics
i. Since the number of features is very large, we use a preliminary feature selection
method that measures the dependence between each feature and the class label. Use the
chi-squared test to select the 1000 features with the highest chi-squared scores.
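A minimal sketch of chi-squared scoring for non-negative features such as TF-IDF values follows. For each feature it compares the feature mass observed in each class with the mass expected if the feature were independent of the label. The function names are hypothetical, and using summed feature values as the observed counts is an assumption about how the test is applied to real-valued features.

```python
def chi2_scores(X, y):
    """Chi-squared score of each column of X against the class labels y."""
    n = len(y)
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = total * sum(1 for label in y if label == c) / n
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

def top_k_features(scores, k):
    """Column indices of the k highest scores, returned in column order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Column 0 appears only in class 1; column 1 is spread evenly over both classes.
X = [[2, 1], [2, 1], [0, 1], [0, 1]]
y = [1, 1, 0, 0]
scores = chi2_scores(X, y)
```

To avoid leaking test information, the scores would be computed on the training days only and the selected columns then applied to the test days.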
(f) Classification
i. Explain how prediction of r_t(T) can be converted into a binary classification
problem and convert the responses to binary labels.
ii. Naïve Bayes Binary Classifier
A. Train a Naïve Bayes classifier using bag of words features.
B. Report train and test accuracy for this model.
C. Build a confusion matrix for both training and test data.
D. Report AUC, precision, recall, and F1-scores for both training and testing
data.
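For orientation, a multinomial Naïve Bayes classifier over non-negative bag-of-words features can be sketched in a few lines. This is not the required implementation: the function names are hypothetical, and `alpha` is an assumed add-one (Laplace) smoothing constant for the per-class word likelihoods.

```python
import math

def train_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods for each class."""
    n_features = len(X[0])
    model = {}
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        feature_sums = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(feature_sums) + alpha * n_features
        model[c] = {
            "log_prior": math.log(len(rows) / len(y)),
            "log_likelihood": [math.log((s + alpha) / denom) for s in feature_sums],
        }
    return model

def predict_nb(model, row):
    """Return the class with the highest log posterior for one feature row."""
    def log_posterior(c):
        params = model[c]
        return params["log_prior"] + sum(
            value * weight for value, weight in zip(row, params["log_likelihood"])
        )
    return max(model, key=log_posterior)

# Made-up data: feature 0 is typical of class 1, feature 1 of class 0.
X_train = [[3, 0], [2, 1], [0, 3], [1, 2]]
y_train = [1, 1, 0, 0]
nb_model = train_multinomial_nb(X_train, y_train)
```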
iii. Logistic Regression
A. Apply recursive feature elimination to the chi-squared features to train
a Logistic Regression model for binary classification.
B. Train an L1-penalized Logistic Regression using the chi-squared features
as well as s_t, mv_{1,t}, and mv_{10,t}. Use 5-fold cross validation to find the best
hyper-parameter.
C. Report train and test accuracy for both models.
D. Build a confusion matrix for both training and test data for both models.
E. Report AUC, precision, recall, and F1-scores for both training and testing
data.
iv. Random Forests and Extra Trees
A. Use as many of the 1000 chi-squared features as you can (at least the top
20) along with s_t, mv_{1,t}, and mv_{10,t} to train a random forest model for
binary classification.
B. Repeat 1(f)ivA using Extra Trees.
C. Report train and test accuracy for both models.
D. Build a confusion matrix for both training and test data for both models.
E. Report AUC, precision, recall, and F1-scores for both training and testing
data.
v. Support Vector Machines
A. Train an L1-penalized SVM using the chi-squared features as well as
s_t, mv_{1,t}, and mv_{10,t}. Use 5-fold cross validation to find the best hyperparameter.
B. Report train and test accuracy for both models.
C. Build a confusion matrix for both training and test data for both models.
D. Report AUC, precision, recall, and F1-scores for both training and testing
data.
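The evaluation items that recur throughout 1(f) can be computed with a few small helpers, sketched below. Labelling a return as positive only when it is strictly greater than zero is one possible convention and is an assumption here. AUC is computed as the fraction of positive/negative pairs that the score ranks correctly, with ties counted as half. All names are hypothetical.

```python
def to_binary_labels(returns):
    """1 if the forward return is strictly positive, else 0 (assumed convention)."""
    return [1 if r > 0 else 0 for r in returns]

def confusion_matrix(y_true, y_pred):
    """Counts of true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["fn" if truth == 1 else "tn"] += 1
    return counts

def precision_recall_f1(cm):
    """Precision, recall and F1 for the positive class; 0.0 when undefined."""
    predicted_pos = cm["tp"] + cm["fp"]
    actual_pos = cm["tp"] + cm["fn"]
    precision = cm["tp"] / predicted_pos if predicted_pos else 0.0
    recall = cm["tp"] / actual_pos if actual_pos else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, label in zip(scores, y_true) if label == 1]
    neg = [s for s, label in zip(scores, y_true) if label == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels, hard predictions, and classifier scores.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
cm = confusion_matrix(y_true, y_pred)
```

The pairwise AUC is quadratic in the number of samples, which is fine for a sketch but slow on large test sets.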
(g) Regression
i. KNN Regression
A. Use the chi-squared features along with s_t, mv_{1,t}, and mv_{10,t} to perform
KNN regression on the data. Use 5-fold cross validation to determine the
value of k ∈ {5, 6, . . . , 30}. You are welcome to test the effect of larger
values of k.
B. Map any predicted \hat{r}(T) whose absolute value is bigger than a reasonable
threshold into a positive or negative signal. The suggested threshold is 0.5%,
but you are welcome to try other thresholds as well. Obviously, if the threshold
is set to 0%, there is no no-action signal.
C. Report train and test accuracy for both models.
D. Build a confusion matrix for both training and test data for both models.
E. Report AUC, precision, recall, and F1-scores for both training and testing
data.
F. Note: If you have a no-action signal, the cases that are detected as no
action should not be considered in evaluating classification metrics.
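The mapping from predicted returns to signals, and the exclusion of no-action cases from the metrics, can be sketched as follows. The function names are hypothetical, returns are assumed to be expressed as fractions (so 0.5% is 0.005), and an actual return of exactly zero is counted as a down move, which is an assumption.

```python
def to_signal(predicted_return, threshold=0.005):
    """+1 (long), -1 (short), or 0 (no action) for a predicted forward return."""
    if predicted_return > threshold:
        return 1
    if predicted_return < -threshold:
        return -1
    return 0

def directional_accuracy(predicted_returns, actual_returns, threshold=0.005):
    """Accuracy over the days with a long/short signal; None if there are none."""
    hits = 0
    total = 0
    for pred, actual in zip(predicted_returns, actual_returns):
        signal = to_signal(pred, threshold)
        if signal == 0:
            continue  # no-action days are left out of the metric
        total += 1
        actual_direction = 1 if actual > 0 else -1
        hits += int(signal == actual_direction)
    return hits / total if total else None

# Made-up predictions and realized returns.
predicted = [0.01, -0.02, 0.001, 0.03]
actual = [0.02, 0.01, -0.05, 0.04]
signals = [to_signal(p) for p in predicted]
```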
ii. Support Vector Regression (see https://medium.com/coinmonks/support-vector-regression-or-svr-8eb3acf6d0ff)
A. Use the chi-squared features along with s_t, mv_{1,t}, and mv_{10,t} to train a
Support Vector regression model on the data. Use L2 regularization. Use
5-fold cross validation to determine the hyperparameters of the algorithm.
B. Map any predicted \hat{r}(T) whose absolute value is bigger than a reasonable
threshold into a positive or negative signal. The suggested threshold is 0.5%,
but you are welcome to try other thresholds as well. Obviously, if the threshold
is set to 0%, there is no no-action signal.
C. Report train and test accuracy for both models.
D. Build a confusion matrix for both training and test data for both models.
E. Report AUC, precision, recall, and F1-scores for both training and testing
data.
F. Note: If you have a no-action signal, the cases that are detected as no
action should not be considered in evaluating classification metrics.
iii. Random Forest and Extra Trees Regression
A. Use the chi-squared features along with s_t, mv_{1,t}, and mv_{10,t} to train a
Random Forest regression model and an Extra Trees regression model
on the data.
B. Map any predicted \hat{r}(T) whose absolute value is bigger than a reasonable
threshold into a positive or negative signal. The suggested threshold is 0.5%,
but you are welcome to try other thresholds as well. Obviously, if the threshold
is set to 0%, there is no no-action signal.
C. Report train and test accuracy for both models.
D. Build a confusion matrix for both training and test data for both models.
E. Report AUC, precision, recall, and F1-scores for both training and testing
data.
F. Note: If you have a no-action signal, the cases that are detected as no
action should not be considered in evaluating classification metrics.
(h) Improving The Models: Use any method you know, including ensemble methods,
to yield the best classifier and the best regression model you can. You may
want to reduce the number of features you use using recursive feature elimination.
In that case, use recursive feature elimination inside your cross validation loops.
You are free to use any technique, for example a Recurrent Neural Network or
XGBoost.
(i) Explain why even a test accuracy slightly above 50% is not bad for this problem,
although it is a binary classification problem. Make every effort to have a test
accuracy of at least 60%.
(j) Make a table of the test accuracies for each stock, and identify the best three and
the worst three accuracies. Comment on your results.
2. Trading Scenario
(a) Your capital at the beginning of each day is C_t and is E_t at the end of each day.
Assume that you are considering days {1, 2, . . . , τ} in your test set where τ is the
number of your test days. You make long/short decisions in days {1, 2, . . . , τ − T}.
Because you have to wait T days to see the effect of your decisions on your capital,
you calculate your capital at the end of days {1 + T, 2 + T, . . . , τ}. Repeat all of the
following steps for both T = 3 and T = 5.
(b) Start with an initial capital of C_1 = C_2 = · · · = C_{1+T} = E_1 = E_2 = · · · = E_T =
$90,000. Only 1/3 of your total money at the end of the previous day should
be invested at the beginning of each day. Thus, if C'_t is the amount you invest
on stocks on day t, you would initially invest C'_1 = · · · = C'_{1+T} =
E_1/3 = E_2/3 = · · · = E_T/3 = $30,000. C'_t changes from
$30,000 at day t = 2 + T, because the effect of your decisions in day 1 will
change your capital at the end of day 1 + T (which is E_{1+T}), and 1/3 of E_{1+T} will
be the available capital for investment at day t = 2 + T, i.e. C'_{2+T} = E_{1+T}/3.
(c) Invest equal amounts of money in each company. Therefore, if you are considering,
say, M = 25 companies, invest I_t/M in day t in each company, where I_t is the total
amount invested on day t. This means you initially invest I_t/M = $30,000/25 = $1200 in
company m (if it makes your calculations simpler, you can consider fractional
shares, but small remainders do not seem to significantly affect the results). If
the price of each share of company m in day t is p_t, this means you invest in
(I_t/M)/p_t shares of company m in day t.
(d) Start making decisions in the first day in your training set. Trading is done using
long/short signals. If your predicted trade signal for company m is positive in
day t (i.e. if you predict that its price will go up in day t + T), long its shares,
i.e. calculate your return for the share of company m in day t using the following
formula:
On the other hand, if your predicted trade signal for a company is negative in
day t (i.e. if you predict that its price will go down in day t+T), short its shares,
i.e. calculate your return for the share of company m in day t using the following
formula:
If you predicted no action for a stock using the regression methods, obviously
(e) The effects of decisions made in day t − T on your capital are revealed when you observe
the prices in day t. The total gains and losses on day t resulting from long/short
decisions on day t − T are calculated as:
where q_{t−T} is the number of shares of company m that were traded in day t − T,
and the commission for each trade is considered to be $0.0075, unless \hat{r}(T) was
predicted to be 0 (no action, by a regression model), in which case q_{t−T} = 0. Thus, your
capital at the end of day t is:
Obviously, C_{t+1} = E_t, t ∈ {T, · · · , τ − 1}, but we introduced C_t and E_t for clarity
of the above descriptions.
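The bookkeeping in 2(a) to 2(e) can be sketched as a small simulation. Several details below are assumptions rather than statements from the project text: a long position of q shares opened on day d gains q times (p_t − p_d) when it is realized on day t = d + T, a short position gains the negative of that, the $0.0075 commission is charged per share traded, and the end-of-day capital is the previous capital plus the realized gains. Days are 0-indexed and all names are hypothetical.

```python
def simulate_capital(prices, signals, horizon, initial_capital=90_000.0,
                     invest_fraction=1 / 3, commission_per_share=0.0075):
    """End-of-day capital for each test day.

    prices[m][t] is the closing price of company m on day t, and signals[m][t]
    is +1 (long), -1 (short), or 0 (no action) decided on day t.
    """
    n_companies = len(prices)
    n_days = len(prices[0])
    end_capital = []
    capital = initial_capital
    for t in range(n_days):
        d = t - horizon  # decision day whose outcome is realized today
        if d >= 0:
            # Amount invested on day d: a fraction of the capital at the end of day d - 1.
            base = end_capital[d - 1] if d >= 1 else initial_capital
            per_company = base * invest_fraction / n_companies
            for m in range(n_companies):
                signal = signals[m][d]
                if signal == 0:
                    continue  # no action: no shares, no commission
                shares = per_company / prices[m][d]  # fractional shares allowed
                capital += signal * shares * (prices[m][t] - prices[m][d])
                capital -= commission_per_share * shares  # assumed per-share fee
        end_capital.append(capital)
    return end_capital

# One made-up company whose price steps from 10 to 11 on the fourth day.
toy_prices = [[10.0, 10.0, 10.0, 11.0, 11.0, 11.0, 11.0]]
always_long = [[1] * 7]
curve = simulate_capital(toy_prices, always_long, horizon=3, commission_per_share=0.0)
```

Running the same function once per prediction method, with each method's signals, gives the capital curves to be compared on one plot.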
(f) Plot C_t over the test period, for each of your prediction algorithms on the same
graph and compare them. Which method makes you richer at the end of the
test period? You can include any custom-made algorithm you created to improve
the results in this comparison and argue that it works better than the standard
algorithms offered in the description of the project.
(g) Comparison with Oracle trading Dow Jones Industrial Average (DIA):
repeat the above scenario for the Dow Jones industrial average (DIA) and an
omniscient trader (Oracle), i.e. instead of predicting the movements using any
of your algorithms, use the true movements. In other words, if the actual T-day
ahead return is positive in a day, long the stock, and if it is negative, short the
stock. Compare all of your algorithms with the performance of Oracle on DIA on
the same plot and draw conclusions.
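The Oracle's signals follow directly from the true prices, as in the sketch below. The function name is hypothetical. Days whose T-day-ahead price is not available, and days with exactly zero movement, are given no action here, which is an assumption.

```python
def oracle_signals(prices, horizon):
    """+1 if the price is higher `horizon` days later, -1 if lower, else 0."""
    signals = []
    for t in range(len(prices)):
        if t + horizon >= len(prices):
            signals.append(0)  # no future price to look at
        elif prices[t + horizon] > prices[t]:
            signals.append(1)
        elif prices[t + horizon] < prices[t]:
            signals.append(-1)
        else:
            signals.append(0)
    return signals

# Made-up DIA closing prices.
dia = [10.0, 11.0, 9.0, 12.0, 8.0]
dia_signals = oracle_signals(dia, 2)
```

Feeding these signals into the same trading scenario gives an upper bound against which the learned models can be plotted.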