You use a subset (see below) of the dataset in the file “HousePrices.txt”, which consists of 11
columns, with measurements for each of 585 Belgian municipalities. The response variable is the
median price of a regular house in the municipality (in thousands of euros).
Region (x1): The administrative region: Flanders, Walloon, Brussels-capital.
Province (x2): The name of the province. Belgium officially has 10 provinces; the Brussels-capital region is treated here as a separate province, so this variable has 11 categories.
Municipality: The name of the municipality. It identifies the observations and is provided only for the curious.
PriceHouse (y): Median price of a regular house in the municipality (in thousands of euros).
Shops (x3): The number of officially registered shops in the municipality exceeding a certain number of square meters.
Bankruptcies (x4): Number of bankruptcies in the municipality in one year. This includes all types of enterprises (from one-person companies to big firms).
MeanIncome (x5): The average of the taxable incomes of all tax forms of the municipality (in thousands of euros).
TaxForms (x6): The number of tax declarations for the municipality that were submitted to the tax office.
HotelRestaurant (x7): The number of hotels and restaurants (added together) in the municipality.
Industries (x8): Number of industrial firms in the municipality.
HealthSocial (x9): The number of health care and social service facilities in the municipality.
Each of you will study a subset of these data and use the following code to get your sub-dataset.
Note that the provided code serves as a hint; you will need to make changes to it.
Constructing your own dataset:
code = 753031
fulldata = read.csv("HousePrices.txt", sep = " ", header = TRUE)
# Sum of the decimal digits of x
digitsum = function(x) sum(floor(x/10^(0:(nchar(x)-1)))%%10)
set.seed(code)
mysum = digitsum(code)
if ((mysum %% 2) == 0) {  # digit sum is even
  rownumbers = sample(1:327, 150, replace = FALSE)
} else {                  # digit sum is odd
  rownumbers = sample(309:585, 150, replace = FALSE)
}
mydata = fulldata[rownumbers, ]

This way you have taken a sample of 150 municipalities, either from the Flanders region +
Brussels-capital area, or from the Walloon region + Brussels-capital area. Now, based on your own
sub-dataset, answer the following questions one by one.
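As a quick check of the construction above, the following sketch recomputes the digit sum for the example code 753031 and shows which row range it selects. The helper digitsum_chr is a hypothetical string-based variant (not part of the provided code); it is included only because nchar() applied to a number depends on how R formats that number as text.

```r
# Digit sum as in the provided code
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(x) - 1))) %% 10)

# Hypothetical alternative: split the fixed-notation text into single digits
digitsum_chr <- function(x) {
  sum(as.integer(strsplit(format(x, scientific = FALSE), "")[[1]]))
}

code  <- 753031
mysum <- digitsum(code)          # 7 + 5 + 3 + 0 + 3 + 1 = 19

# An odd digit sum selects the second row range (309:585)
rows <- if (mysum %% 2 == 0) 1:327 else 309:585

set.seed(code)
rownumbers <- sample(rows, 150, replace = FALSE)
range(rownumbers)
```

Rows 309 to 327 belong to both ranges, which is consistent with the Brussels-capital municipalities being available in either sub-dataset.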
Questions to be answered:
1) Q1: Use semiparametric flexible modelling to construct a model for the median house price.
Use AIC as a method to select a final model and report on which (type of) models were
included in the search. Only for the components of the selected model that are modeled in a
nonlinear way, provide graphs. The models in this question should not treat covariates as
random effects. Give the model that you have selected in correct notation. It is alright to use
a general notation (e.g. f(x2)) for a smooth function, but you have to state which (spline)
functions you have used, and how the smoothing parameter was selected. If you want to
use the function gam from library(mgcv), the provided AIC value is compatible with
parametric AIC values when using the default option for setting the smoothing parameter.
Notes for Q1:
a) Explore all variables of “mydata” and clearly state the distribution and link function
used in the models.
b) Clearly state how many knots you choose (and why), and explain in detail how you choose
the smoothing parameter.
c) Treat all variables as fixed.
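A minimal sketch of the kind of AIC comparison Q1 asks for, using gam from library(mgcv). Everything here is an assumption made for illustration: the data are simulated stand-ins for mydata, the Gamma distribution with log link is only one possible choice, and the cubic regression spline with k = 10 is an arbitrary example of a spline basis. The smoothing parameter is left at gam's default selection method.

```r
library(mgcv)

set.seed(1)
n <- 150
# Simulated stand-in for mydata (hypothetical values, not the real data)
sim <- data.frame(
  MeanIncome = runif(n, 20, 40),
  TaxForms   = runif(n, 1, 30)
)
mu <- exp(4.5 + 0.03 * sim$MeanIncome + 0.3 * sin(sim$TaxForms / 5))
sim$PriceHouse <- rgamma(n, shape = 25, rate = 25 / mu)

# Candidate 1: fully parametric model (Gamma distribution, log link)
m_lin <- gam(PriceHouse ~ MeanIncome + TaxForms,
             family = Gamma(link = "log"), data = sim)

# Candidate 2: smooth effect of TaxForms via a penalized cubic regression
# spline with basis dimension k = 10; default smoothing parameter selection
m_smooth <- gam(PriceHouse ~ MeanIncome + s(TaxForms, bs = "cr", k = 10),
                family = Gamma(link = "log"), data = sim)

# Compare the candidates by AIC and keep the name of the smallest one
aic_table <- AIC(m_lin, m_smooth)
print(aic_table)
best <- rownames(aic_table)[which.min(aic_table$AIC)]

# Graph of the nonlinear component (uncomment when working interactively)
# plot(m_smooth, select = 1)
```

In your own answer the search would cover more candidates (other covariates, other distributions and links), and the report should list all of them.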
2) For this question you use the response and only the covariates x6 (number of tax forms) and
x9 (number of health care and social service facilities). State the null hypothesis of a
parametric additive model for the median house price with quadratic effects for both
covariates. Test this hypothesis using an order selection test against a nonparametric
alternative hypothesis, report the hypotheses, the construction of the test statistic, its value,
as well as the corresponding p-value and draw the correct conclusion.
Notes for Q2:
a) Test whether an additive model with quadratic effects in those two covariates (x6 and x9)
is adequate.
b) Clearly state how to do a proper test, including all steps of the hypothesis test and how
they lead to the conclusion.
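The following is a sketch of one way an order selection test could be set up for Q2, under several assumptions that you should check against your course notes: the data are simulated, a Gaussian linear model is used, the nested alternatives raise the polynomial degree of both covariates by one at each step (so each step adds k = 2 parameters), and the statistic is the maximum over r of the likelihood ratio statistic divided by the number of added parameters. The function ost_pvalue is a hypothetical helper that evaluates the asymptotic p-value 1 - exp(-sum_r P(chi^2_{rk} > r k t) / r); for k = 1 it gives about 0.05 at t = 4.18.

```r
set.seed(4)
n <- 150
# Simulated stand-in data, generated under the null hypothesis
sim <- data.frame(x6 = runif(n), x9 = runif(n))
sim$y <- 1 + 2 * sim$x6 - sim$x6^2 + 0.5 * sim$x9 + 0.3 * sim$x9^2 +
  rnorm(n, sd = 0.3)

# Null model: additive, quadratic in both covariates
fit0 <- lm(y ~ poly(x6, 2) + poly(x9, 2), data = sim)
ll0  <- as.numeric(logLik(fit0))

# Nested alternatives: step r uses degree 2 + r for both covariates,
# so the r-th alternative has k * r more parameters than the null model
R <- 6
k <- 2
lr <- sapply(1:R, function(r) {
  fit_r <- lm(y ~ poly(x6, 2 + r) + poly(x9, 2 + r), data = sim)
  2 * (as.numeric(logLik(fit_r)) - ll0)   # likelihood ratio statistic
})

# Test statistic: largest LR statistic per added parameter
Tn <- max(lr / (k * (1:R)))

# Asymptotic p-value; the infinite sum is truncated after `terms` terms
ost_pvalue <- function(t, k = 1, terms = 200) {
  r <- 1:terms
  1 - exp(-sum(pchisq(r * k * t, df = r * k, lower.tail = FALSE) / r))
}
p_value <- ost_pvalue(Tn, k = k)
c(Tn = Tn, p_value = p_value)
```

A different nested sequence (for example, adding one term at a time) changes k and therefore the reference distribution, so the sequence has to be stated as part of the test.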
3) In this question a parametric (generalized) linear mixed effect model should be constructed.
(i) Make a graphical presentation that supports why you suggest a certain mixed effect structure
using x2 Province as the grouping variable. Construct the plot illustrating whether there is an
effect of Province when regressing y on x6 the number of tax forms. For the plot you may
ignore all other covariates.
(ii) Construct a parametric (generalized) linear mixed effect model using your suggestion from (i).
You leave out variable x1 for this part; other covariates may be included in the model in a
parametric way. Your model should include x2 and x6. The inclusion of other covariates in
your model may be based on your answer to question 1; no fixed effect model selection
should be done for this question. Provide the model using correct notation, and give a
summary of the output. Briefly discuss whether the output supports your suggestion from (i).
Note: library(hglm) contains both hglm and hglm2, which may be used for fitting; glmmPQL
(from library(MASS)) is also a possibility. If one of these functions gives problems for your
dataset, try one of the other ones.
Notes for Q3:
a) Among Q1-5, only Q3 takes random effects into consideration.
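A minimal sketch of a mixed effect fit with glmmPQL, one of the fitting options mentioned in the note for Q3. The data are simulated stand-ins for mydata, and both the Gamma distribution with log link and the random intercept per province are assumptions chosen for illustration; your plot from (i) should motivate the structure you actually use.

```r
library(MASS)   # glmmPQL()
library(nlme)   # fixef(), ranef()

set.seed(3)
n <- 150
# 11 provinces, each with a simulated random intercept (hypothetical values)
prov <- factor(rep(paste0("Prov", 1:11), length.out = n))
b    <- rnorm(11, sd = 0.2)
tax  <- runif(n, 1, 30)
mu   <- exp(5 + 0.02 * tax + b[as.integer(prov)])
sim  <- data.frame(
  Province   = prov,
  TaxForms   = tax,
  PriceHouse = rgamma(n, shape = 20, rate = 20 / mu)
)

# Fixed effect of TaxForms, random intercept for Province
fit <- glmmPQL(PriceHouse ~ TaxForms, random = ~ 1 | Province,
               family = Gamma(link = "log"), data = sim, verbose = FALSE)

summary(fit)
fixef(fit)   # estimated fixed effects
ranef(fit)   # predicted random intercepts, one per province
```

If the plot from (i) suggests that the slope in x6 also varies by province, the random part would become random = ~ TaxForms | Province.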
4) In this question you start from a large parametric model (no random effects, no interactions)
and you will perform a focused search over all sub-models of the large model and this for
two focuses:
(i) the median price of a regular house for one municipality of your choice from your
dataset where there is a low (though not the lowest) number of industrial firms,
(ii) the median price of a regular house for one municipality of your choice from your
dataset where there is a large (though not the largest) number of hotels and restaurants.
Write the selected model for each focus using correct notation and provide the
estimated values of the focuses for both cases. Briefly discuss.
Notes for Q4:
a) Look through the 150 rows of your dataset and pick one municipality with a low number of
industrial firms and another one with a high number of hotels and restaurants. Then search for
the best models to estimate the house price for those two municipalities.
b) Use correct notation and clearly state the distribution, link function and coefficients.
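The sketch below illustrates only the setup of the search in Q4, not the focused selection criterion itself: it picks a focus municipality with a low (but not the lowest) number of industrial firms, enumerates all sub-models of a large model, and computes the focus estimate under each. The data, the choice of a Gaussian linear model and the four covariates are assumptions for illustration.

```r
set.seed(2)
n <- 150
# Simulated stand-in for mydata (hypothetical values)
sim <- data.frame(
  Shops           = rpois(n, 20),
  MeanIncome      = runif(n, 20, 40),
  Industries      = rpois(n, 30),
  HotelRestaurant = rpois(n, 40)
)
sim$PriceHouse <- 80 + 4 * sim$MeanIncome + 0.5 * sim$HotelRestaurant +
  rnorm(n, sd = 15)

# Focus (i): first row, in increasing order of Industries, whose value
# is strictly above the minimum
ord       <- order(sim$Industries)
focus_row <- ord[sim$Industries[ord] > min(sim$Industries)][1]

# All 2^4 = 16 subsets of the covariates, including the empty subset
covs    <- c("Shops", "MeanIncome", "Industries", "HotelRestaurant")
subsets <- c(
  list(character(0)),
  unlist(lapply(seq_along(covs),
                function(k) combn(covs, k, simplify = FALSE)),
         recursive = FALSE)
)

# Focus estimate (predicted median price at the focus row) per sub-model
focus_est <- sapply(subsets, function(s) {
  rhs <- if (length(s) > 0) s else "1"
  fit <- lm(reformulate(rhs, response = "PriceHouse"), data = sim)
  unname(predict(fit, newdata = sim[focus_row, , drop = FALSE]))
})
range(focus_est)
```

The focused search then scores each of these sub-models with the criterion from the course and selects the best one separately for each focus; focus (ii) is handled the same way with HotelRestaurant.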
5) In this question you use the same large parametric model (no random effects, no interactions)
as you started with in question 4.
(i) Construct a table containing the vector of estimated coefficients of the regression model
using four methods:
(a) maximum likelihood estimation in the large model
(b) Ridge regression
(c) Lasso estimation
(d) An elastic net estimator, different from the ridge and lasso one.
For (b), (c) and (d) you use the software’s default value for the penalty parameter λ.
(ii) Using the four estimation methods from (i), give in a table the predictions for the median
price of a regular house for the same two municipalities as in question 4. Briefly discuss.
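A minimal sketch of the two tables in Q5 using glmnet, on simulated stand-in data with a Gaussian linear model (both are assumptions). One point needs care: glmnet fits a whole path of λ values, so "the default value" is interpreted here, as an assumption, as the value that coef() and predict() use by default on a cv.glmnet fit (s = "lambda.1se"). The elastic net uses alpha = 0.5, an arbitrary value strictly between ridge (0) and lasso (1).

```r
library(glmnet)

set.seed(5)
n <- 150
# Simulated stand-in for mydata (hypothetical values)
sim <- data.frame(
  Shops      = rpois(n, 20),
  MeanIncome = rnorm(n, 30, 4),
  TaxForms   = rpois(n, 8000)
)
sim$PriceHouse <- 100 + 3 * sim$MeanIncome + 0.5 * sim$Shops + rnorm(n, sd = 10)

# Design matrix without the intercept column (glmnet adds its own intercept)
X <- model.matrix(PriceHouse ~ ., data = sim)[, -1]
y <- sim$PriceHouse

# (a) maximum likelihood in the large model
ml <- lm(PriceHouse ~ ., data = sim)

# (b)-(d) ridge, lasso and elastic net, each with cross-validated lambda
alphas <- c(ridge = 0, lasso = 1, enet = 0.5)
fits   <- lapply(alphas, function(a) cv.glmnet(X, y, alpha = a))

# (i) table of estimated coefficients, one column per method
coefs <- cbind(ML = coef(ml),
               sapply(fits, function(f) as.matrix(coef(f))[, 1]))
print(round(coefs, 4))

# (ii) predictions for two rows (here rows 1 and 2 as placeholders)
rows  <- c(1, 2)
preds <- cbind(ML = predict(ml, newdata = sim[rows, ]),
               sapply(fits, function(f)
                 as.numeric(predict(f, newx = X[rows, , drop = FALSE]))))
print(round(preds, 2))
```

For factor covariates such as Province, model.matrix creates the dummy columns, so the coefficient table then has one row per dummy.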
Note: If you would like to use a function other than glmnet for penalized estimation, here is an
alternative with a few more options. Since the syntax is quite a bit different, you might want to
adjust the lines below to your setting if you want to use this.
library(h2o)
h2o.init()
mydat2 = as.h2o(mydata)
mydat2$Region <- as.factor(mydat2$Region)
mydat2$Province <- as.factor(mydat2$Province)
y = "PriceHouse"
X = c("Province", "Shops") # add here the variables that you wish to put in X.
# "something" is a placeholder: fill in the family and link of your own model.
# alpha = 0 corresponds to the ridge penalty.
alpha0 <- h2o.glm(family = "something", link = "something", x = X, y = y, alpha = 0,
                  lambda_search = TRUE, training_frame = mydat2, nfolds = 0)
# indicate the same rows as in question 4:
Xeval = as.h2o(as.data.frame(mydat2[c(1,2),]))
h2o.predict(alpha0, newdata = Xeval)