Statistical Modeling with R – MakeUp Exam: 2020-05-22, Exam ID 00006
Name:
Student ID:
1. Load the data provided in flights.Rdata from Piazza and select all flights that had
been scheduled for departure between 2013-01-21 and 2013-01-24.
(a) How many flights (i.e. cases) are in the resulting data set?
(b) How many variables does the resulting data set comprise?
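A minimal sketch of the selection in exercise 1, under stated assumptions: the real file would be read with load("flights.Rdata"), and the data is assumed to carry year, month and day columns for the scheduled departure (as in nycflights13::flights). The toy data frame and all its values below are hypothetical stand-ins so that the example runs on its own.

```r
# Hypothetical toy data; the real exam data would come from load("flights.Rdata").
flights <- data.frame(
  year      = 2013,
  month     = c(1, 1, 1, 1, 1, 2),
  day       = c(20, 21, 22, 24, 25, 21),
  dep_delay = c(5, -3, 12, 40, 0, 7)
)

# Build a Date column from the (assumed) year/month/day columns.
flights$date <- as.Date(
  paste(flights$year, flights$month, flights$day, sep = "-"),
  format = "%Y-%m-%d"
)

# Keep flights scheduled between 2013-01-21 and 2013-01-24, both days included.
sel <- subset(flights,
              date >= as.Date("2013-01-21") & date <= as.Date("2013-01-24"))

n_flights <- nrow(sel)  # (a) number of cases
n_vars    <- ncol(sel)  # (b) number of variables, including the helper column `date`
```

Note that the helper column date counts as a variable here; drop it before counting if only the original variables are wanted.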
2. Plot a box plot of the departure delays (variable dep_delay) using the departure airport
(variable origin) as grouping variable.
(a) Do the three NYC airports show a similar distribution in departure delays?
(b) Are the median departure delays for each airport close to zero?
(c) Which flight (from where to where) came the earliest (had the smallest departure delay)?
(d) Which flight (from where to where) has the largest departure delay?
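One possible way to approach exercise 2 is sketched below on a hypothetical toy data frame (the values are invented; only the column names dep_delay, origin and dest follow the exam text). It draws the grouped box plot, computes the medians per airport, and looks up the flights with the smallest and largest departure delay.

```r
# Hypothetical toy data standing in for the selected flights.
flights <- data.frame(
  origin    = c("EWR", "EWR", "JFK", "JFK", "LGA", "LGA"),
  dest      = c("ORD", "ATL", "LAX", "SFO", "BOS", "MIA"),
  dep_delay = c(-4, 30, -12, 5, 250, NA)
)

# Box plot of departure delays grouped by departure airport.
boxplot(dep_delay ~ origin, data = flights,
        xlab = "origin", ylab = "dep_delay (minutes)")

# Median departure delay per airport; missing delays are ignored.
medians <- tapply(flights$dep_delay, flights$origin, median, na.rm = TRUE)

# which.min()/which.max() skip NA values and return the row position.
earliest <- flights[which.min(flights$dep_delay), c("origin", "dest", "dep_delay")]
latest   <- flights[which.max(flights$dep_delay), c("origin", "dest", "dep_delay")]
```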
3. Create a categorical variable delay with three categories:
• on time: flights with an arrival delay less than 15 minutes
• delayed: flights with an arrival delay between 15 and 48 minutes
• heavy delay: flights with an arrival delay of more than 48 minutes
Order the categories as on time, delayed, heavy delay.
(a) Compute the number of flights in category on time.
(b) Compute the number of flights in category heavy delay.
(c) Compute the number of flights in category delayed.
(d) Which delay category is the most frequent one?
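The categorisation in exercise 3 could be sketched as follows. The arrival delays are hypothetical toy values, and the boundary handling is an assumption: delays of exactly 15 and exactly 48 minutes are both counted as delayed, which is one reading of "between 15 and 48 minutes".

```r
# Hypothetical arrival delays in minutes (NA = no arrival delay recorded).
arr_delay <- c(-10, 3, 14, 15, 48, 49, NA)

# Assumption: 15 and 48 both fall into "delayed".
delay <- ifelse(arr_delay < 15, "on time",
         ifelse(arr_delay <= 48, "delayed", "heavy delay"))

# Ordered factor with the category order requested in the exercise.
delay <- factor(delay,
                levels  = c("on time", "delayed", "heavy delay"),
                ordered = TRUE)

counts        <- table(delay)            # counts per category (NA excluded)
most_frequent <- names(which.max(counts))
```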
4. Cross-tabulate the variables origin and delay.
(a) What is the most-frequent combination of the two variables origin and delay?
(b) Which share of on time flights departed from LGA?
(c) Which share of flights departing from LGA have been in delay category on time?
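Parts (b) and (c) of exercise 4 differ only in the margin used for the proportions. The sketch below uses hypothetical toy vectors to show this: column proportions answer "which share of on time flights departed from LGA", row proportions answer "which share of LGA flights were on time".

```r
# Hypothetical toy data: 3 EWR, 2 JFK and 5 LGA flights.
origin <- c("EWR", "EWR", "EWR", "JFK", "JFK",
            "LGA", "LGA", "LGA", "LGA", "LGA")
delay <- factor(
  c("on time", "on time", "delayed", "on time", "heavy delay",
    "on time", "on time", "on time", "delayed", "heavy delay"),
  levels = c("on time", "delayed", "heavy delay")
)

tab <- table(origin, delay)

# (a) most frequent combination: flatten the table and pick the largest count.
tab_long <- as.data.frame(tab)
top      <- tab_long[which.max(tab_long$Freq), ]

# (b) column proportions: within each delay category, share per airport.
col_share <- prop.table(tab, margin = 2)
lga_among_on_time <- col_share["LGA", "on time"]

# (c) row proportions: within each airport, share per delay category.
row_share <- prop.table(tab, margin = 1)
on_time_among_lga <- row_share["LGA", "on time"]
```

In this toy table, 3 of the 6 on time flights left from LGA, while 3 of the 5 LGA flights were on time, so the two shares are not the same number.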
5. In the following, you restrict your analysis to flights that have an arrival delay of at least 58
minutes. Create the corresponding data subset.
(a) How many flights are in this subset?
(b) How many flights have been excluded by this procedure?
6. Compute a linear model for arrival delay as dependent variable using the following variables
in the data set as predictors: dep_delay, origin, air_time, carrier, dest, hour,
minute. Call this model flights.lm!
(a) According to the ANOVA table, is the predictor dest statistically significant (at least) at
the 1% significance level?
(b) Looking at the regression coefficients, briefly discuss what the coefficients for originJFK
and originLGA mean?
(c) How good does this model fit? On what do you base your judgment?
(d) Check whether the residuals of this model follow a normal distribution! Do they? Which
tool did you use for checking this?
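A hedged sketch of the workflow in exercise 6, using simulated toy data and only a subset of the predictors (dep_delay, origin, air_time) to stay short. The data, the seed and the simulated coefficients are assumptions for illustration; the real model would use the exam subset and the full predictor list.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the subset of delayed flights (hypothetical values).
toy <- data.frame(
  dep_delay = rexp(n, rate = 1 / 20),
  origin    = factor(sample(c("EWR", "JFK", "LGA"), n, replace = TRUE)),
  air_time  = runif(n, 40, 360)
)
toy$arr_delay <- 60 + 0.9 * toy$dep_delay + 0.01 * toy$air_time +
  rnorm(n, sd = 10)

flights.lm <- lm(arr_delay ~ dep_delay + origin + air_time, data = toy)

# (a) sequential ANOVA table with one F-test per predictor.
aov_tab <- anova(flights.lm)

# (b) originJFK and originLGA are differences to the reference level EWR
#     (the first factor level), holding the other predictors fixed.
origin_coefs <- coef(flights.lm)[c("originJFK", "originLGA")]

# (c) one common basis for judging fit: adjusted R-squared.
adj_r2 <- summary(flights.lm)$adj.r.squared

# (d) normality of residuals: QQ plot plus Shapiro-Wilk test.
res <- residuals(flights.lm)
qqnorm(res)
qqline(res)
sw <- shapiro.test(res)
```

The QQ plot and the Shapiro-Wilk test are only two of several possible tools for part (d); a histogram of the residuals would be another.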
7. Starting with the null model and taking the model flights.lm as upper bound, run a stepwise
model selection procedure to find the best model according to the AIC criterion. Call
the resulting model flights.lm.best!
(a) Which predictors are included in the optimal model?
(b) Report the adjusted R-squared of the final model.
(c) Report the AIC of the final model.
(d) Using an F-test, check whether the final (= best according to automatic variable selection)
model is significantly different from the model flights.lm computed in the previous exercise.
(e) Which predictors are included in the model computed in the previous exercise that are
not included in the final model obtained by the automatic procedure?
(f) According to the final model, by how many more minutes is a flight delayed at arrival if
it departs 10 minutes later (all other things being equal)?
(g) According to the final model, which carrier is the best choice to minimize arrival delays?
(h) Amend the final model by adding an interaction term between origin and air_time.
Is the interaction term significant at the 10% level? According to this model, by how
much will arrival delays differ for two flights that differ in air time by 100 minutes,
one leaving from EWR and the other from JFK (all other things being equal)?
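The stepwise search in exercise 7 might be set up as in the following sketch. It runs on simulated toy data with a reduced predictor set (both are assumptions for illustration), starts from the null model, and uses the formula of flights.lm as the upper bound of the scope.

```r
set.seed(1)
n <- 200

# Simulated stand-in data (hypothetical values); `minute` is pure noise here.
toy <- data.frame(
  dep_delay = rexp(n, rate = 1 / 20),
  origin    = factor(sample(c("EWR", "JFK", "LGA"), n, replace = TRUE)),
  air_time  = runif(n, 40, 360),
  minute    = sample(0:59, n, replace = TRUE)
)
toy$arr_delay <- 60 + 0.9 * toy$dep_delay + rnorm(n, sd = 10)

flights.lm <- lm(arr_delay ~ dep_delay + origin + air_time + minute, data = toy)
null.lm    <- lm(arr_delay ~ 1, data = toy)

# Stepwise selection by AIC, from the null model up to flights.lm at most.
flights.lm.best <- step(null.lm,
                        scope = list(lower = ~ 1, upper = formula(flights.lm)),
                        direction = "both", trace = 0)

kept     <- attr(terms(flights.lm.best), "term.labels")                  # (a)
adj_r2   <- summary(flights.lm.best)$adj.r.squared                       # (b)
aic_best <- AIC(flights.lm.best)                                         # (c)
f_test   <- anova(flights.lm.best, flights.lm)                           # (d)
dropped  <- setdiff(attr(terms(flights.lm), "term.labels"), kept)        # (e)

# (f) effect of departing 10 minutes later, other predictors held fixed.
extra_10min <- 10 * coef(flights.lm.best)[["dep_delay"]]

# (h) add origin, air_time and their interaction, then test the amendment.
flights.lm.int <- update(flights.lm.best, . ~ . + origin * air_time)
int_test       <- anova(flights.lm.best, flights.lm.int)
```

The AIC values printed by step() come from extractAIC() and differ from AIC() by an additive constant for linear models, so the ranking of models is the same but the reported numbers are not.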
8. You now want to generate a classification model that tells you whether a flight is delayed at
arrival more than 103 minutes using the predictors in the model flights.lm.best. Use the
logit link here! Call the model flights.classbin!
(a) Which error distribution have you chosen to create this model?
(b) According to the Wald tests (Table of coefficients): which predictors are significant (at
least at the 1% level)?
(c) According to the Likelihood-Ratio Test (LR-test as given in the Deviance Table): which
predictors are significant (at least at the 1% level)?
(d) Report the residual deviance of your model.
(e) Report the null deviance of your model.
(f) Using a χ²-test, check whether your model is significantly better than the null model.
(g) According to your model, how does the air time of a flight influence the likelihood of it
being more than two hours delayed?
(h) Based on your model’s fitted probabilities for being more than two hours delayed, create
an indicator for delayed/not delayed flights using the probability 0.5 as threshold.
Create a frequency table of the predicted and the observed delay indicator. Calculate
all misclassification rates.
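A minimal sketch of exercise 8 on simulated toy data follows. The data, the seed and the two predictors are assumptions for illustration; the response is coded with the 103-minute threshold stated at the start of the exercise. The model is a binomial GLM with logit link, compared with the null model through the drop in deviance.

```r
set.seed(2)
n <- 500

# Simulated stand-in data (hypothetical values).
toy <- data.frame(
  dep_delay = rexp(n, rate = 1 / 40),
  air_time  = runif(n, 40, 360)
)
toy$arr_delay <- 58 + toy$dep_delay + rnorm(n, sd = 30)
toy$late      <- as.integer(toy$arr_delay > 103)  # 1 = delayed more than 103 min

flights.classbin <- glm(late ~ dep_delay + air_time,
                        family = binomial(link = "logit"), data = toy)

wald_tab <- summary(flights.classbin)$coefficients      # (b) Wald tests
lr_tab   <- anova(flights.classbin, test = "Chisq")     # (c) LR tests

dev_res  <- deviance(flights.classbin)                  # (d)
dev_null <- flights.classbin$null.deviance              # (e)

# (f) chi-squared test of the deviance drop against the null model.
df_diff <- flights.classbin$df.null - flights.classbin$df.residual
p_lr    <- pchisq(dev_null - dev_res, df = df_diff, lower.tail = FALSE)

# (h) classify with threshold 0.5 and tabulate observed vs. predicted.
pred <- factor(as.integer(fitted(flights.classbin) > 0.5), levels = 0:1)
obs  <- factor(toy$late, levels = 0:1)
conf <- table(observed = obs, predicted = pred)

overall_err <- 1 - sum(diag(conf)) / sum(conf)     # share of all flights misclassified
fp_rate     <- conf["0", "1"] / sum(conf["0", ])   # not late, predicted late
fn_rate     <- conf["1", "0"] / sum(conf["1", ])   # late, predicted not late
```

Fixing the factor levels to 0 and 1 keeps the table 2 x 2 even if the model never predicts one of the two classes.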
9. Using the model flights.classbin predict the probability for being more than two hours
delayed at departure using the average scores of numeric predictors in the model for carrier
UA (United Air Lines) and destination ORD (Chicago Ohare International) and origin JFK.
10. Again using the model flights.classbin, you now want to investigate the specific dependency
on the hour of the day. In case hour is not yet included in the model, update the
model by adding this predictor. Generate new data such that you have the hours from 5 to
23 in increments of 1. The other numeric predictors enter again with their mean score into
the prediction in the model for carrier UA (United Air Lines) and destination ORD (Chicago
Ohare International). Compute the predictions and average them.
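The prediction over hours 5 to 23 could look like the sketch below. It uses a simulated toy model with hypothetical carriers, destinations and coefficients; only the names flights.classbin, hour, carrier and dest follow the exam text. A single-row version of the same newdata construction would serve for exercise 9.

```r
set.seed(3)
n <- 600

# Simulated stand-in data (hypothetical values and effect sizes).
toy <- data.frame(
  dep_delay = rexp(n, rate = 1 / 40),
  hour      = sample(5:23, n, replace = TRUE),
  carrier   = factor(sample(c("UA", "AA", "DL"), n, replace = TRUE)),
  dest      = factor(sample(c("ORD", "ATL", "LAX"), n, replace = TRUE))
)
toy$late <- rbinom(n, 1, plogis(-3 + 0.03 * toy$dep_delay + 0.08 * toy$hour))

flights.classbin <- glm(late ~ dep_delay + carrier + dest + hour,
                        family = binomial(link = "logit"), data = toy)

# New data: hours 5..23, numeric predictors at their mean, carrier UA, dest ORD.
# Factor levels are copied from the training data so predict() can match them.
newdat <- data.frame(
  dep_delay = mean(toy$dep_delay),
  hour      = 5:23,
  carrier   = factor("UA",  levels = levels(toy$carrier)),
  dest      = factor("ORD", levels = levels(toy$dest))
)

# type = "response" returns probabilities rather than log-odds.
p_hat  <- predict(flights.classbin, newdata = newdat, type = "response")
mean_p <- mean(p_hat)  # average predicted probability over the 19 hours
```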
11. What is the name of the R function for ordinal logistic regression?