Hand in electronically via CANVAS
First a bit about handing in your assignment. You need to submit both your R Markdown
document and a pdf file containing the document it generates. To create a pdf you should start
your R Markdown document with the following lines (having made the appropriate changes):
---
title: "STATS 762 Assignment 1"
author: "Your Name, ID 1234567"
date: "Due: 27 March 2017"
output: pdf_document
---
If you are using Windows, you may find that you cannot generate a pdf file directly. In this
case replace output: pdf_document with output: word_document. When you click the **Knit**
button a Word document will be produced which you can then open and save as a pdf file. Submit
the pdf file and not the Word file.
The data for this assignment comes from the UCI Machine Learning Repository:
The original source for this data is: M. Elter, R. Schulz-Wendtland and T. Wittenberg (2007)
“The prediction of breast cancer biopsy outcomes using two CAD approaches that both emphasize
an intelligible decision process.” Medical Physics 34(11), pp. 4164-4172.
In addition to the data file, an information file has been posted on CANVAS which contains the
background for this dataset that was given on the Machine Learning Repository webpage. Read
this file carefully as it contains background information that will help you understand the context
of this data.
1. Create a data frame named birad.df in R. Make sure that the variables have the proper
designations (numeric, factor, ...). Also make sure that there are no obvious mistakes in the
data. In R, missing values are designated by NA, so you may need to modify your data frame
to conform to this protocol.
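As a rough illustration of the steps in item 1, the sketch below reads a small block of text into a data frame, converts a missing-value code to NA, flags an out-of-range value and sets the variable types. Everything specific here is an assumption rather than something stated in this handout: the rows are invented, the column names (BIRADS, Age, Shape, Margin, Density, Severity) are hypothetical, the "?" missing-value code and the 0 to 6 range for valid assessments should be checked against the information file, and the real data file would be read in place of the inline text.

```r
# Hypothetical rows for illustration only; these are NOT values from the real file.
# Assumed column order: BI-RADS, age, shape, margin, density, severity.
raw <- "5,67,3,5,3,1
4,43,1,1,?,1
55,58,4,5,3,1
4,?,2,1,3,0
3,35,1,1,2,0"

# na.strings turns the assumed "?" missing-value code into NA while reading.
birad.df <- read.csv(text = raw, header = FALSE, na.strings = "?",
                     col.names = c("BIRADS", "Age", "Shape", "Margin",
                                   "Density", "Severity"))

# An assessment outside the assumed valid range (the invented 55 here) is an
# obvious mistake. Setting it to NA is one option; correcting it is another
# if the intended value is clear.
birad.df$BIRADS[!birad.df$BIRADS %in% 0:6] <- NA

# Coded categorical variables become factors; Age stays numeric.
birad.df$BIRADS   <- factor(birad.df$BIRADS)
birad.df$Shape    <- factor(birad.df$Shape)
birad.df$Margin   <- factor(birad.df$Margin)
birad.df$Density  <- factor(birad.df$Density)
birad.df$Severity <- factor(birad.df$Severity, levels = c(0, 1),
                            labels = c("benign", "malignant"))

str(birad.df)       # check the type of every column
summary(birad.df)   # counts per level and the number of NAs
```

Running summary() after each change is a cheap way to spot impossible values before any model is fitted.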
2. The BI-RADS (Breast Imaging Reporting and Data System) assessment score evaluates the
severity of a lesion based on its observed characteristics during a mammogram.
(a) Use a mosaic plot to explore the relationship between the BI-RADS assessment and the
probability that a lesion is malignant as opposed to benign. Comment on what your plot
indicates about this relationship.
(b) Fit a logistic regression model that relates the BI-RADS assessment to the probability
that a lesion is malignant. Check for over-dispersion and comment on what you find.
Use this model to get a 95% confidence interval for the probability of malignancy for
each level of BI-RADS assessment.
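The mechanics of item 2 might look like the following sketch, which uses simulated data so that it runs on its own. The data frame sim.df, the variable names and the probabilities in p.true are all invented for illustration; the real birad.df would replace them. The interval is computed on the logit scale and back-transformed, which is one common approach and not necessarily the one expected in the marking scheme.

```r
set.seed(1)

# Simulated stand-in data; names and probabilities are invented.
n <- 400
sim.df <- data.frame(BIRADS = factor(sample(3:5, n, replace = TRUE)))
p.true <- c("3" = 0.1, "4" = 0.4, "5" = 0.9)
sim.df$Malignant <- rbinom(n, 1, p.true[as.character(sim.df$BIRADS)])

# (a) Mosaic plot of outcome by assessment level.
tab <- table(BIRADS = sim.df$BIRADS, Malignant = sim.df$Malignant)
mosaicplot(tab, main = "Simulated data", color = TRUE)

# (b) Logistic regression with the assessment as a factor.
fit <- glm(Malignant ~ BIRADS, family = binomial, data = sim.df)

# Pearson statistic divided by its degrees of freedom; values far above 1
# would suggest over-dispersion. With ungrouped 0/1 responses this ratio has
# limited power to detect it, so treat it with caution.
pearson.ratio <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(pearson.ratio)

# 95% intervals built on the logit scale, then back-transformed with plogis()
# so that the limits stay inside (0, 1).
new.df <- data.frame(BIRADS = factor(levels(sim.df$BIRADS),
                                     levels = levels(sim.df$BIRADS)))
pred <- predict(fit, newdata = new.df, type = "link", se.fit = TRUE)
z <- qnorm(0.975)
ci <- data.frame(BIRADS   = new.df$BIRADS,
                 estimate = plogis(pred$fit),
                 lower    = plogis(pred$fit - z * pred$se.fit),
                 upper    = plogis(pred$fit + z * pred$se.fit))
print(ci)
```

Because the only regressor is a factor, the estimated probabilities equal the observed malignant proportions in each level, which gives a quick sanity check against the mosaic plot.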
3. A patient's age is also believed to be important in predicting whether a lesion is malignant or
not.
(a) Fit the logistic regression model that uses both the BI-RADS assessment and age as
regressors. Does including age in the model improve its ability to predict the probability
that a lesion is malignant? Support your answer.
(b) Medical diagnostic tests are often assessed by their sensitivity (the probability the test is
positive when the condition exists) and specificity (the probability the test is negative
when the condition does not exist). For the two logistic regression models that have been
fitted assume that the diagnosis of a malignant lesion is positive when the estimated
probability is ≥ 0.5. Estimate the sensitivity and specificity for each of the two logistic
regression models and comment on the results.
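The definitions of sensitivity and specificity in 3(b) translate directly into a few lines of R. The helper sens.spec below is a hypothetical function written for this illustration, and the probabilities and outcomes are made up; they are not taken from the mammography data.

```r
# Sensitivity and specificity from estimated probabilities and 0/1 outcomes.
# sens.spec is a hypothetical helper, not part of any package.
sens.spec <- function(prob, truth, cutoff = 0.5) {
  predicted <- prob >= cutoff   # diagnosis is positive when prob >= cutoff
  c(sensitivity = sum(predicted & truth == 1) / sum(truth == 1),
    specificity = sum(!predicted & truth == 0) / sum(truth == 0))
}

# Made-up probabilities and true outcomes (1 = malignant, 0 = benign).
prob  <- c(0.9, 0.7, 0.4, 0.6, 0.2, 0.1, 0.5, 0.3)
truth <- c(1,   1,   1,   0,   0,   0,   1,   0)

result <- sens.spec(prob, truth)
print(result)   # both values are 0.75 for these made-up numbers
```

For a fitted glm object called fit (an assumed name), sens.spec(fitted(fit), fit$y) applies the same calculation to the observations actually used in the fit, so rows dropped because of missing values stay aligned.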
4. Often it is useful to create a categorical variable from a numeric variable such as age. For this
data set, create a categorical variable that divides Age into the following categories: under 30,
30-39, 40-49, and so on.
(a) Fit the logistic regression model that uses both the BI-RADS assessment and the age
group as regressors. Does this new model fit better than the model from part 3?
(b) Estimate the sensitivity and specificity for this model and compare them to your findings in
3(b).
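One way to build the age categories in item 4 is with cut(), as in the sketch below. The ages are made up, and the top category ("80 and over") is an assumption, since the handout leaves the upper categories open; adjust the breaks to the range actually seen in the data.

```r
# Made-up ages for illustration only.
age <- c(23, 30, 39, 40, 57, 64, 71, 88)

# Decade boundaries; the open-ended top group is an assumption.
breaks <- c(-Inf, seq(30, 80, by = 10), Inf)
labels <- c("under 30", "30-39", "40-49", "50-59",
            "60-69", "70-79", "80 and over")

# right = FALSE makes each interval closed on the left, so 30 falls in
# "30-39" rather than "under 30".
age.group <- cut(age, breaks = breaks, labels = labels, right = FALSE)

print(data.frame(age, age.group))
print(table(age.group))
```

A model with numeric age and a model with age groups are generally not nested, so a criterion such as AIC(), which accepts several fitted models at once, is one possible basis for comparing them.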
5. The BI-RADS assessment is based on a number of characteristics including the shape, margin
and density of the lesion. Does the BI-RADS assessment capture all of the useful information
(with respect to predicting the probability a lesion is malignant) from these three characteristics?
Provide evidence to support your answer.
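Questions such as item 5 and 3(a), which ask whether extra regressors add information beyond those already in a model, are often approached by comparing nested models. The sketch below shows the mechanics of a likelihood ratio test with anova() on simulated data. The variable names and probabilities are invented, only one extra factor (a hypothetical Shape) is added to keep it short, and this is one possible form of evidence rather than the required method.

```r
set.seed(2)

# Simulated stand-in data; in this simulation Shape has no effect on the outcome.
n <- 300
sim.df <- data.frame(BIRADS = factor(sample(3:5, n, replace = TRUE)),
                     Shape  = factor(sample(1:4, n, replace = TRUE)))
p.true <- c("3" = 0.1, "4" = 0.4, "5" = 0.9)   # invented probabilities
sim.df$Malignant <- rbinom(n, 1, p.true[as.character(sim.df$BIRADS)])

# Nested models: the larger one adds Shape to the smaller one.
# Both must be fitted to the same rows, so with real data containing NAs,
# restrict to complete cases before fitting either model.
small <- glm(Malignant ~ BIRADS, family = binomial, data = sim.df)
big   <- glm(Malignant ~ BIRADS + Shape, family = binomial, data = sim.df)

# Likelihood ratio (deviance) test of the extra terms.
lrt <- anova(small, big, test = "Chisq")
print(lrt)
```

A small p-value in the second row would indicate that the added terms explain variation the smaller model misses; a large one is consistent with the smaller model already capturing that information.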
