
Fall 2018 IS542 Final

Due Tuesday December 18, 5:00PM US Central Time

Discuss two or more of the following questions, in your own words. You may choose to address any two,

three, four, or even all questions but should target 3-4 pages of text in total (not counting figures, tables,

and references). Upload your answers to the final section of the class Moodle page as a single narrative

document in pdf format. You may, and are encouraged to, illustrate your answers using R, but that's no

substitute for lucid natural language explanations. To preserve the natural flow of the narrative, figures

and tables should be embedded into the document near their first mention. Any supplementary files like

code or data should be referenced in the text and separately uploaded. You may use books, articles, notes,

search engines, or computers, but may not solicit or receive direct assistance from other human beings.

Cite sources if you use them. For the first three questions you may want to illustrate technical detail using

R, discuss practical aspects that are important for applications, and address theoretical aspects of the subject.

Question 1. Construct a dataset with at least 8 observations and 3 variables (y, x1, and x2) such that least

squares linear regression of y versus x1 produces y = -2x1 + e1 and regressing y versus x1 and x2

produces y = 2x1 - x2 + e2. How might you interpret the relationship between y and x1? Show your work

in R.
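
One possible construction for Question 1, offered as a sketch rather than the intended answer: make x2 strongly correlated with x1 and keep every noise term orthogonal to the predictors, so the fitted coefficients come out exact. The vectors r and e2 below are hypothetical values chosen to be orthogonal to the constant, to x1, and to each other; with x2 = 4*x1 + r, the model y = 2*x1 - x2 + e2 is the same as y = -2*x1 - r + e2.

```r
# Hypothetical data: 8 observations built from mutually orthogonal contrasts.
x1 <- c(-7, -5, -3, -1,  1,  3,  5, 7)        # centered predictor
r  <- c( 7,  1, -3, -5, -5, -3,  1, 7)        # orthogonal to the constant and x1
e2 <- c(-7,  5,  7,  3, -3, -7, -5, 7) / 10   # orthogonal to the constant, x1, and r

x2 <- 4 * x1 + r          # x2 is strongly correlated with x1
y  <- 2 * x1 - x2 + e2    # equivalently y = -2*x1 - r + e2

fit1 <- lm(y ~ x1)        # simple regression: slope on x1 is -2
fit2 <- lm(y ~ x1 + x2)   # multiple regression: slopes are 2 and -1

print(round(coef(fit1), 6))
print(round(coef(fit2), 6))
```

The sign of the x1 coefficient flips because the simple regression measures the total association between y and x1, including the part carried by x2, while the multiple regression measures the association with x2 held fixed.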

Question 2. Write a short essay, in your own words, explaining the four assumptions of linear regression

and show how to test them on a dataset of your choice. Show your work in R.
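
The prompt does not name the four assumptions; the sketch below assumes the common list of linearity, independence of errors, constant error variance, and normality of errors. It uses the built-in mtcars data and base R only, so the statistics are computed by hand instead of calling a diagnostics package. The choice of model (mpg on wt and hp) is purely illustrative.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # mtcars ships with R
e <- resid(fit)
n <- length(e)

# Linearity: RESET-style check. Does adding squared fitted values improve the fit?
yhat_sq <- fitted(fit)^2
reset_fit <- lm(mpg ~ wt + hp + yhat_sq, data = mtcars)
reset_p <- anova(fit, reset_fit)[["Pr(>F)"]][2]

# Independence: Durbin-Watson statistic. Values near 2 suggest no first-order
# autocorrelation; it is only meaningful when the row order carries information.
dw <- sum(diff(e)^2) / sum(e^2)

# Constant variance: studentized Breusch-Pagan statistic, n * R^2 from
# regressing the squared residuals on the predictors.
e_sq <- e^2
aux <- lm(e_sq ~ wt + hp, data = mtcars)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

# Normality of errors: Shapiro-Wilk test on the residuals.
sw_p <- shapiro.test(e)$p.value

print(round(c(reset_p = reset_p, durbin_watson = dw,
              breusch_pagan_p = bp_p, shapiro_p = sw_p), 4))
```

Small p-values point to a violated assumption. The usual residual plots from plot(fit) remain a useful companion to these tests.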

Question 3. Write a short essay, in your own words, on the subject of Bayes' theorem and illustrate its use

in an application of your own making.
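
As a minimal illustration of the kind of application Question 3 asks for, consider a diagnostic test. Bayes' theorem gives P(disease | positive) = P(positive | disease) P(disease) / P(positive). The prevalence, sensitivity, and specificity below are made-up numbers chosen only to show the arithmetic.

```r
prior       <- 0.01   # P(disease): assumed prevalence
sensitivity <- 0.95   # P(positive | disease): assumed
specificity <- 0.90   # P(negative | no disease): assumed

# Total probability of a positive result (law of total probability).
evidence <- sensitivity * prior + (1 - specificity) * (1 - prior)

# Posterior probability of disease given a positive result.
posterior <- sensitivity * prior / evidence
print(round(posterior, 4))
```

With these inputs the posterior is under 9 percent even though the test is 95 percent sensitive, because true positives from the rare disease are outnumbered by false positives from the much larger healthy group.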

Question 4. R challenge. During the last class session we worked with the circle.arff dataset, assessing

the cross-validated performance of a wide variety of classification algorithms such as decision trees,

random forest, rules, support vector machine, Naïve Bayes, Bayes Net, logistic regression, neural net, k-nearest

neighbor, and boosting. Replicate some of these experiments using R.
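
A rough sketch of how such an experiment might be set up in R. The circle.arff file is not reproduced here, so the code simulates a hypothetical stand-in (points labeled by whether they fall inside a circle); the real file could presumably be loaded with read.arff from the foreign package instead. Only three of the listed classifiers are shown, using the rpart and class packages that normally ship with R, and the 10-fold cross-validation loop is written by hand.

```r
library(rpart)   # decision trees
library(class)   # k-nearest neighbor

set.seed(542)
n <- 400
# Hypothetical stand-in for circle.arff: class depends on distance from the origin.
dat <- data.frame(x = runif(n, -1, 1), y = runif(n, -1, 1))
dat$class <- factor(ifelse(dat$x^2 + dat$y^2 < 0.5, "in", "out"))

k <- 10
fold <- sample(rep(1:k, length.out = n))   # random fold assignment

# Cross-validated accuracy for any function(train, test) returning predicted labels.
cv_accuracy <- function(predict_fold) {
  correct <- 0
  for (i in 1:k) {
    train <- dat[fold != i, ]
    test  <- dat[fold == i, ]
    pred  <- predict_fold(train, test)
    correct <- correct + sum(as.character(pred) == as.character(test$class))
  }
  correct / n
}

acc <- c(
  tree = cv_accuracy(function(train, test) {
    m <- rpart(class ~ x + y, data = train, method = "class")
    predict(m, test, type = "class")
  }),
  logistic = cv_accuracy(function(train, test) {
    m <- glm(class ~ x + y, data = train, family = binomial)
    p <- predict(m, test, type = "response")   # P(class == "out")
    ifelse(p > 0.5, "out", "in")
  }),
  knn = cv_accuracy(function(train, test) {
    knn(train[, c("x", "y")], test[, c("x", "y")], cl = train$class, k = 5)
  })
)
print(round(acc, 3))
```

On data of this shape, a logistic regression that is linear in x and y cannot represent a circular boundary, so it is expected to trail the tree and nearest-neighbor models.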

Question 5. R challenge: The data directory contains a file with author names and associated Ethnea and

Genni predictions. Use logistic regression to identify character n-grams of first and/or last names that may

help predict the Ethnea categories. It might be helpful to install and use an R package such as tm that is

able to extract character n-grams. Classification performance can be assessed using precision and recall

for each Ethnea ethnicity category, and classes that are the most similar can be identified using the

confusion matrix.
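
The two building blocks of Question 5 can be sketched in base R without committing to a particular package: extracting character n-grams from a name, and computing per-class precision and recall from a confusion matrix. The helper names (char_ngrams, ngram_matrix, precision_recall), the "_" padding convention, and the placeholder class labels "A", "B", "C" are all assumptions made for illustration; the actual data file and its column names are not shown in this document.

```r
# All character n-grams of one name, padded with "_" to mark start and end.
char_ngrams <- function(name, n = 3) {
  s <- paste0("_", tolower(name), "_")
  if (nchar(s) < n) return(character(0))
  starts <- seq_len(nchar(s) - n + 1)
  substring(s, starts, starts + n - 1)
}

# Binary name-by-n-gram indicator matrix, usable as predictors in a model.
ngram_matrix <- function(names, n = 3) {
  grams <- lapply(names, char_ngrams, n = n)
  vocab <- sort(unique(unlist(grams)))
  m <- t(vapply(grams, function(g) as.numeric(vocab %in% g),
                numeric(length(vocab))))
  dimnames(m) <- list(names, vocab)
  m
}

# Per-class precision and recall from actual and predicted labels.
precision_recall <- function(actual, predicted) {
  lev <- sort(unique(c(actual, predicted)))
  cm <- table(actual = factor(actual, lev), predicted = factor(predicted, lev))
  data.frame(class = lev,
             precision = as.numeric(diag(cm) / colSums(cm)),
             recall = as.numeric(diag(cm) / rowSums(cm)))
}

print(char_ngrams("Smith", 3))
X <- ngram_matrix(c("Smith", "Smythe"), 3)
print(dim(X))

# Placeholder labels standing in for Ethnea categories.
actual    <- c("A", "A", "A", "B", "B", "C")
predicted <- c("A", "A", "B", "B", "B", "C")
pr <- precision_recall(actual, predicted)
print(pr)
```

A one-vs-rest logistic regression for a single category could then be fit with glm(..., family = binomial) on the columns of the indicator matrix. Large off-diagonal cells of the confusion matrix indicate which categories are most easily confused.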

Full dataset:

Of which a smaller, random sample is given here:


Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale

bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of

Congress, Washington DC, USA.
