SOC 360 Statistics for Sociologists 1
Data Analysis Project
Part 1
For this project, you’ll use the methods of statistical analysis we’ve learned to analyze real social
relations. The full project will be completed over the course of the term and will be submitted in two
parts. You’ll develop a research question and hypothesis about the univariate and bivariate distributions of
two variables, using appropriate numerical and graphical methods to present your findings. Later, you’ll
use inferential statistics to assess the probability that the observed relationship between the variables is
due only to chance. We’re primarily interested in assessing whether you can properly present and
interpret the results of statistical analysis and write clearly about them.
Essentially, there are four types of Data Analysis Project (DAP) that you might conduct; you are only
required to carry out one of them:
1. a quantitative predictor and quantitative outcome
2. a categorical predictor and categorical outcome
3. a quantitative predictor and categorical outcome
4. a categorical predictor and quantitative outcome.
The first two types are simpler and easier to conduct with the tools we’ve developed, but you are
welcome to try your hand at any of the four. A detailed tutorial for each type is available on
YouTube; each runs roughly an hour (though I speak very slowly, so you may be able to watch it in
40 minutes or less depending on how much you speed it up) and walks through a do-file with
detailed, clear commentary. You are also more than welcome to ask your TA or the lecturer for support!
Please limit the number of pages for the first part of the assignment to eight (8) double-spaced pages
with 12-point font. Contact us if you foresee any problems in sticking to this limit.
INSTRUCTIONS FOR DAP I
Step 1: Exploring the data
For this paper, we’ll use the General Social Survey (GSS) data or Current Population Survey (CPS) data
that you have been using in section, unless you’ve worked with other sets in other courses and want to
use them—if so, reach out to us! Graduate students should generally try to come up with data-sets that
are directly relevant to them; ask Griffin or Sanghyo for help. The CPS data are more limited, but the
economic data are better and more like true quantitative variables than the GSS economic data;
conversely, the GSS data include many more questions. In each data-set, going to the variable window of
Stata and typing the first few letters of a keyword you’re interested in brings up relevant variables.
The GSS codebook, found here (http://gss.norc.org/documents/codebook/gss_codebook.pdf), describes
the many variables that you can find in the GSS. While the codebook is very large, you can use the
CTRL+f function on your computer to search quickly through the initial listing of the variables to find
your variable of interest (it’s best to have a rough idea in mind of what you’re looking for first, though);
then, continue scrolling through the hits until you get to that page of the codebook. Alternatively, the
list of all variables begins on page 1 of the document (page 12 of the file). Note that not all questions
described are asked every year, so you’ll want to stick to questions that were asked in your year.
For CPS data, we are technically using an extract from 2019 put together by the Center for Economic and
Policy Research (CEPR), which is described here. Documentation for this data-set is a bit more
complicated, but on the other hand, the set of variables is smaller, and it is usually clear what they mean.
Select the following…
1. An outcome variable whose variation interests you, such as income or education – for this part,
it is best to select variables that we think of as generally dependent on other sociological variables
(such as class or ascribed race).
2. A variable which you think might have some causal influence on the first sort of variable.
For the variables you have identified, do all of the following:
1. Formulate a research question and hypothesis about a) how the values of the variables are
distributed in the population, and b) how the outcome variable might relate to the predictor
variable.
a. E.g., a) on balance, most people in the US either have a college education or less than a
12th grade education, as do their fathers,[1] and b) their education level (educ)[2] is
positively related to their father’s education level (paeduc).[3]
b. Consider whether any of your variables might need some basic transformations
(especially the creation of a dummy variable); if so, create one.
2. Present basic descriptive statistics of each variable in nice-looking, titled tables, taking care
to show multiple measures of both central tendency and spread. Look for outliers and consider
the reason for them, commenting on whether they are simply extreme values or whether they
might be data errors. Make sure to avoid including measures that are not meaningful for the
type of variables which you have.
3. Present graphical analyses of each variable’s individual distribution. You can present the
distribution of each variable in multiple different ways; just make sure not to present data in a
way that does not make sense (e.g., a histogram of race would be much less informative than a
bar graph of the same, whereas a pie graph of income would be much less useful than a
histogram). Describe the meaning of each figure you use; you need at least one per variable.
4. Present a bivariate numerical analysis of the variables, again making sure to put the relevant
numbers into a nicely-formatted table. For quantitative variables, report all of the relevant
regression output; for qualitative variables, explain the meaning of a two-way table.
5. Present a bivariate graphical analysis of that relationship.
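As a rough illustration of the dummy-variable and descriptive-statistics items above, the following Python sketch shows the underlying logic. The course itself uses Stata, so treat this only as a conceptual aid. The data values and the names `educ_years`, `college`, `describe`, and `flag_outliers` are made up for illustration and do not come from the GSS or CPS. The 1.5 times IQR rule used here is one common convention for flagging outliers, not a requirement of the assignment.

```python
import statistics

# Hypothetical years-of-education values; NOT real GSS or CPS data.
educ_years = [8, 10, 12, 12, 12, 13, 14, 16, 16, 18, 20, 40]


def describe(values):
    """Return several measures of central tendency and spread."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": median,
        "sd": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }


def flag_outliers(values):
    """Flag values more than 1.5 * IQR beyond the quartiles (a rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Dummy variable: 1 if the respondent has 16 or more years of education, else 0.
college = [1 if v >= 16 else 0 for v in educ_years]

summary = describe(educ_years)
outliers = flag_outliers(educ_years)
```

With these made-up values, the value 40 is flagged. Whether such a value is a data error or a genuine extreme is the judgment call that item 2 asks you to discuss.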
[1] This is a hypothesis which we would have evidence for rejecting; it’s just an example.
[2] This typeface indicates the name of a variable in Stata or a command in Stata. The GSS labels the
respondent’s education “educ” and father’s education “paeduc”.
[3] You can, for now, take these survey data as basically representative of the population at large and not worry about the
problem of sampling. The second half of the course, and the second part of the data project, will put this assumption
into question and lead us into statistical inference proper.
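For the bivariate numerical analysis described in Step 1, the two simplest cases are a regression of one quantitative variable on another and a two-way table of two categorical variables. The Python sketch below illustrates both, again only as a conceptual aid, since the course works in Stata. The paired values are invented, and the helper names `ols` and `two_way_table` are hypothetical; nothing here is real GSS output.

```python
from collections import Counter

# Hypothetical paired observations; NOT real GSS data.
paeduc = [8, 10, 12, 12, 14, 16, 16, 18]   # father's years of education
educ = [10, 12, 12, 16, 14, 16, 18, 14]    # respondent's years of education


def ols(x, y):
    """Least-squares slope, intercept, and Pearson correlation for y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


def two_way_table(rows, cols):
    """Count joint occurrences of two categorical variables."""
    return Counter(zip(rows, cols))


slope, intercept, r = ols(paeduc, educ)

# Dummies (16+ years) turn the same data into two categorical variables.
pa_college = ["yes" if v >= 16 else "no" for v in paeduc]
college = ["yes" if v >= 16 else "no" for v in educ]
table = two_way_table(pa_college, college)
```

A positive slope here would be read as "each additional year of father's education is associated with a higher predicted education level for the respondent", while the two-way table cells (for example, `table[("no", "yes")]`) give the counts you would convert to row or column percentages when explaining the table.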
Step 2: Writing the report
After you complete the analysis, you are ready to begin writing your paper. This data analysis project
should be written in narrative form (full sentences and paragraphs) and should include …
1. Introduction
a. Give the research question you are trying to answer in this report (distribution of
variables of interest and their relationship).
b. State your research hypotheses (for both the univariate distributions and the relationship
between one variable and another) and justify them with some brief social analysis.
c. Discuss the relevance or importance of your research question (i.e., why should your
readers care about this) in a paragraph or so.
2. Data & methods
a. Briefly describe your data and its survey design.
b. Discuss the operationalization of your variables. What type of variable is it? How did
the researchers measure it?
i. Make sure to consider some potential weaknesses of this measure of your
variable. (Note: for some variables, the GSS variable will be a relatively
unproblematic measure of your theoretical construct, while in other cases, there
may be significant gaps – that is OK. Just report it.)
c. Discuss briefly the numeric and graphic methods you used for analysis. You can
simply refer to the results later in a table; here, you should focus less on the particular
results and more on why you chose, say, regression or a two-way table.
3. Univariate analysis
a. Present (that is, report but also discuss) the contents of the key tables of univariate
descriptive statistics with appropriate titles, effective rounding, source cited, etc. Do the
same with the relevant graphs. The question of which measures are appropriate to
include will depend on the type of variables you select; be as comprehensive as you can
without being loquacious.
b. Comment on the relationship to your hypothesis: is this what you expected?
4. Bivariate analysis
a. Present (that is, report but also discuss) the contents of the key tables of bivariate
descriptive statistics with appropriate titles, effective rounding, source cited, etc. Do the
same with the relevant graphs. Again, the question of which measures are appropriate to
include will depend on the type of variables you select; be as comprehensive as you can
without being loquacious.
b. Comment on the relationship to your hypothesis: is this what you expected?
5. Conclusion
a. Assess your hypothesis in the light of your findings. Does your hypothesis seem to have
evidence in favor of it or against it? Again, you do not currently need to worry about the
problem of statistical inference (are these survey data generalizable?) – although you
should be aware of this problem. Are there limits on your form of analysis?
b. If your hypothesis is supported, what further questions do you have about the
findings? If your hypothesis is not supported, ruminate on why this is. Is the
hypothesis flawed? Were your measures not up to the task of testing it as rigorously as
you’d like? What direction should additional research on this subject take?
Data Project Part 1 Rubric
DAP I is due on Sunday, October 30th by 11:59pm as a .doc file uploaded to Canvas. The report should
be submitted as a single Word file (with graphs included in the body of the text). The document should
have 1-inch margins and the text should be double spaced and in 12-point font.
Section/Category (points) | Task | Possible Points
Introduction (1) | Research question clearly stated and hypothesis and reasoning for hypothesis and relevance discussed | 1
Data & methods (1.5) | Brief description of the data-set | 0.5
Data & methods (1.5) | Pros and cons of the variable measure are discussed | 0.5
Data & methods (1.5) | Methodological choices are justified | 0.5
Univariate analysis (3) | Graphs are included & properly labeled / interpreted | 1.5
Univariate analysis (3) | Tables are included & properly labeled / interpreted | 1.5
Bivariate analysis (3) | Graphs are included & properly labeled / interpreted | 1.5
Bivariate analysis (3) | Tables are included & properly labeled / interpreted | 1.5
Conclusion (0.5) | Makes a reasonable conclusion referring to the original hypothesis and considers further questions | 0.5
Style/grammar (1) | Report written in complete sentences, full paragraphs, nicely labeled sections, relatively few grammatical errors, 12pt font, double spaced with 1 inch margins | 1