Biostatistics Proficiency Examination: June 9-16, 2024
Instructions
This examination is intended to help you integrate some of the important material from your first-year courses. The exam is open-book and open-notes. You are welcome to discuss the exam with your classmates, with your instructors, and others. You are also welcome to use the Internet, including generative AI tools like ChatGPT or others. In the spirit of using this examination as an opportunity for learning, we have focused on some topics that students often struggle to apply or explain. Although you are allowed to collaborate on the exam, the answers you turn in must be in your own words. Please remember that you are expected to adhere to the Duke Community Standard.
Questions 1 and 2 require you to conduct data analyses. You may use the software package or programming language of your choice to conduct the data analyses. Please upload your program code in a separate file and clearly indicate which question each code block relates to. Your program code will not be graded but it may be used to help understand your written answers if anything is unclear.
Question 3 is likely to take the most time for you to complete as it requires you to do some research on unfamiliar topics and involves difficult interpretation. Question 3 also asks you to record a video. Make sure to include a link to your video in your answer to the question. We find the simplest way to make a video is to use Zoom and select the option to record to the cloud.
There are 2 files posted on Sakai along with this exam that you will need to answer the questions. Please make sure to download these files with the exam: data.csv and PressRelease.pdf. The CSV file is for Question 1 and the PDF file relates to Question 3.
We might choose to schedule follow-up interviews about some or all of the exam content. Reasons for doing so could include providing an opportunity to clarify your answers and to support a critical review of the effectiveness of the exam questions. We plan to follow up the exam with an anonymous survey and would appreciate your assessment of the extent to which the exam was fair, appropriate, and allowed you to perform your best work.
Your answers must be submitted by midnight on June 16, 2024.
Question 1: Genetic Variation in Blood Pressure
The attached data set is from an observational study that was designed to assess differences in systolic blood pressure according to the presence of a germline genetic marker in a cohort of 100 adult patients who are all taking the same blood pressure lowering drug to treat chronic hypertension. The hypothesis of the study was that the genetic marker is associated with improved blood pressure control. The minimum clinically important difference (MCID) that indicates improved blood pressure control is 10 mmHg (millimeters of mercury). The investigators felt it would be important to test their hypothesis about the genetic marker while adjusting for age, since older age is associated with higher blood pressure.
The dataset contains 3 variables:
· BloodPressure: The systolic blood pressure, in units of mmHg (millimeters of mercury)
· Age: The patient's age in years at the time of the blood pressure reading
· Group: A binary variable that groups patients according to whether they have the genetic marker of interest (1) or not (0)
Answer the following questions about this dataset.
a. Describe what you expect to see in terms of the structure of the dataset and the data collected for each of the included variables based on the research question and the study design. Specifically:
· How many observations should be in the dataset?
· Which categories do you expect to see for the Group variable?
· What do you expect to see for the frequency (counts or percentages) for the Group variable?
· What do you expect for the distribution of the Age variable (e.g., minimum, maximum or mean)?
· What do you expect for the distribution of the BloodPressure variable (e.g., minimum, maximum or mean)?
b. Describe the two-way relationships you expect to see among the variables in the dataset if the investigator’s research hypothesis is true. Draw a picture (a simple hand-drawn sketch is fine) and provide a brief 1-sentence description for the following:
· The relationship between age and blood pressure
· The relationship between blood pressure and the genetic marker
· The relationship between age and the genetic marker
c. Conduct a descriptive analysis that includes a visualization and descriptive statistics for each of the following variables. For each analysis, explain what you see and whether it agrees with your expectations or not (refer to the expectations you listed in the 2 previous questions). In other words, present the univariate analyses for each of the 3 variables.
i. Blood pressure
ii. Age
iii. Group
d. Make a list of questions you would ask the investigator based on your observations from the analyses you conducted in the question above. In addition, state what actions you would take right now so that you can proceed with the analysis while waiting to hear back from the investigator. For example, you might choose to delete certain data points, set others to missing, etc.
e. Conduct the following descriptive bivariate analyses. Include a visualization and descriptive statistics for each analysis. Explain what you see and whether it agrees with your expectations or not.
i. Age and blood pressure
ii. Group and blood pressure
iii. Age and group
f. Based on your understanding of the study design and the results of your descriptive analyses, is it possible for age to confound the association between the genetic marker and blood pressure? Why or why not?
g. Fit a model predicting blood pressure based on the presence of the genetic marker and age. Fill in the following table with the results.
Model Parameter | Estimate | 95% Confidence Interval | P-value
Intercept | | |
Age | | |
Group | | |
h. Write an interpretation of the point estimate and 95% confidence interval for the genetic marker. Be sure to discuss the MCID in your interpretation.
i. Write an interpretation of the p-value for the genetic marker without using the p-value for a hypothesis test. In other words, don’t use the words “null hypothesis”, “alternative hypothesis”, “statistical significance”, “reject”, or “fail to reject” in your response, and do not consider alpha levels.
j. What are the assumptions that you’re required to make to draw valid inference from this model? Which can be evaluated by statistical methods? Which have you evaluated already?
k. Produce the following diagnostic plots and describe 3 things for each plot: 1) What the plot shows on the X and Y axes; 2) What you are looking for in the plot; and 3) What your conclusion is about the assumptions you described above.
i. Residuals vs. fitted values
ii. Q-Q plot of residuals
l. Make a plot showing Cook’s distance for each observation and then answer the following questions.
i. What is Cook’s distance telling us?
ii. How many observations in this study have Cook’s distance that is large enough to be concerning? What criterion or definition did you apply to identify these points?
iii. What would you suggest doing about these observations?
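As a rough illustration of what Cook's distance measures, the sketch below computes it for a simple one-predictor regression using the closed-form leverage. The data are made up (they are not the exam's data.csv), the function name `cooks_distance` is hypothetical, and the 4/n cutoff is only one of several rules of thumb in common use.

```python
# Toy data, invented for illustration only: ages and systolic blood pressures.
def cooks_distance(x, y):
    """Cook's distance for each point in a simple linear regression y ~ x."""
    n = len(x)
    p = 2  # number of fitted coefficients: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - p)
    # Leverage (diagonal of the hat matrix) has a closed form with one predictor.
    lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # D_i combines the size of the residual with the leverage of the point.
    return [e ** 2 / (p * mse) * h / (1 - h) ** 2 for e, h in zip(resid, lev)]

x = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
y = [120, 122, 125, 127, 131, 133, 136, 138, 141, 190]  # last value is an outlier

d = cooks_distance(x, y)
threshold = 4 / len(x)  # assumed rule of thumb; D > 1 is another common choice
flagged = [i for i, di in enumerate(d) if di > threshold]
print("Cook's distances:", [round(di, 3) for di in d])
print("Flagged observations (0-based index):", flagged)
```

With two predictors, as in the exam's model, the leverage no longer has this simple form and is normally taken from the hat matrix that regression software provides.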
m. Suppose that you are concerned about assumption violations in your model and you want to explore alternate methods of inference. You’ve heard that bootstrap and permutation testing are potentially useful. Briefly explain how each works and what the difference is between the two approaches.
i. Bootstrap confidence interval
ii. Bootstrap test of the null hypothesis
iii. Permutation test of the null hypothesis
n. Execute the analyses described above (bootstrap CI and hypothesis test, and the permutation hypothesis test) and fill in the table below. Based on the results, do you think the violations of assumptions you observed in your fitted model are influential on your results? Which result would you report to the investigator and why?

Method | 95% CI for Group | P-value
Fitted Model | |
Bootstrap | |
Permutation test | NA |
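The resampling ideas in parts m and n can be sketched as follows. This is a minimal illustration on simulated data, not the exam's dataset: the data-generating values, the helper `group_coef`, and the number of resamples are all assumptions. It shows a percentile bootstrap interval (resampling whole patients) and a simple permutation test that shuffles the group labels; it does not cover the bootstrap hypothesis test, and other permutation schemes for regression (for example, permuting residuals) exist.

```python
import random

def group_coef(age, group, bp):
    """OLS coefficient for group in bp ~ age + group (closed form for 2 predictors)."""
    n = len(bp)
    a_bar, g_bar, y_bar = sum(age) / n, sum(group) / n, sum(bp) / n
    saa = sum((a - a_bar) ** 2 for a in age)
    sgg = sum((g - g_bar) ** 2 for g in group)
    sag = sum((a - a_bar) * (g - g_bar) for a, g in zip(age, group))
    say = sum((a - a_bar) * (y - y_bar) for a, y in zip(age, bp))
    sgy = sum((g - g_bar) * (y - y_bar) for g, y in zip(group, bp))
    return (saa * sgy - sag * say) / (saa * sgg - sag ** 2)

# Simulated cohort (assumed values): 100 patients, true group effect of -10 mmHg.
rng = random.Random(2024)
n = 100
age = [rng.uniform(30, 80) for _ in range(n)]
group = [i % 2 for i in range(n)]
bp = [110 + 0.5 * a - 10 * g + rng.gauss(0, 10) for a, g in zip(age, group)]

observed = group_coef(age, group, bp)
B = 2000  # number of resamples, chosen arbitrarily for the sketch

# Bootstrap: resample patients (rows) with replacement and refit each time.
boot = []
for _ in range(B):
    idx = [rng.randrange(n) for _ in range(n)]
    boot.append(group_coef([age[i] for i in idx],
                           [group[i] for i in idx],
                           [bp[i] for i in idx]))
boot.sort()
ci = (boot[int(0.025 * B)], boot[int(0.975 * B) - 1])  # percentile interval

# Permutation: shuffle group labels to mimic "no association with group".
labels = group[:]
extreme = 0
for _ in range(B):
    rng.shuffle(labels)
    if abs(group_coef(age, labels, bp)) >= abs(observed):
        extreme += 1
p_perm = (extreme + 1) / (B + 1)  # add-one correction keeps the p-value above 0

print(f"estimate {observed:.2f}, bootstrap 95% CI ({ci[0]:.2f}, {ci[1]:.2f}), "
      f"permutation p = {p_perm:.4f}")
```

The key contrast the sketch makes visible: the bootstrap resamples the data as observed to describe the variability of the estimate, while the permutation test rebuilds the data under the null by breaking the link between group and outcome.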
Question 2: Small Sample Inference for Contingency Tables
The origins of Fisher’s exact test are in the story of the ‘lady tasting tea.’ The story is that Dr. Muriel Bristol, one of Fisher’s colleagues at the Rothamsted Experimental Station near London, England, one day claimed that she could tell by taste whether tea had been poured into a cup first followed by milk, or whether milk had been poured into a cup followed by tea. Fisher was skeptical and designed an experiment to test Dr. Bristol’s claim. It is from this experiment that Fisher’s exact test was developed.
Briefly, Fisher designed an experiment in which he poured milk followed by tea into 4 cups. In 4 other cups, Fisher poured the tea first and then the milk. He then randomized the order in which the 8 cups of tea were presented to Dr. Bristol for tasting. Fisher informed Dr. Bristol that 4 of the 8 cups had milk poured first. He then asked Dr. Bristol to taste all 8 cups and report which of the 4 she thought had milk poured into the cup first. The following contingency table cross-classifies the true state of each cup of tea with Dr. Bristol’s guess.

Truth | Dr. Bristol’s Guess: Milk Poured First | Dr. Bristol’s Guess: Tea Poured First | Total
Milk Poured First | 3 | 1 | 4
Tea Poured First | 1 | 3 | 4
Total | 4 | 4 | 8
Answer the following questions about the tea tasting experiment.
a. What is the null hypothesis for Fisher’s experiment? State the hypothesis in plain English without using any statistical terminology.
b. Fisher’s test is based on enumerating all possible 2x2 tables that have marginal totals that are fixed by the experimental design (i.e., 4 cups with tea poured first and 4 cups with milk poured first and Dr. Bristol takes a sip from each of the 8 cups). In this example, there are 5 total tables including the one that was observed. The next few questions will ask you to enumerate the remaining 4 possible tables. Start by drawing the contingency table expected under the null hypothesis. In other words, condition on the marginal totals and fill in the 4 cells of the table with what you’d expect to see if the null hypothesis were true.
c. So far, you’ve seen 2 of the possible tables: the one observed, plus the one you’d expect if the null hypothesis were true. Now consider the top-left cell of the table, i.e., the number of correct guesses that milk was poured first. Since the marginal totals are fixed by the design of the experiment, there is only 1 possible way the table cells could be configured that would have been a “better” result, i.e., that Dr. Bristol guessed all 4 cups correctly. Draw this table. This will be the 3rd possible table.
d. Now draw the remaining possible tables. There should be 2 of them.
e. Focus now on the top-left cell of each of the 5 tables (the one you observed plus the other 4 possibilities you enumerated in the questions above). The number in the top-left cell is the number of correct guesses that milk was poured into the cup first. Use the hypergeometric distribution to find the probabilities that the number of correct guesses is 0, 1, 2, 3, or 4 assuming the null hypothesis is true. Draw each table and write the table probability next to it. Round your probabilities to 3 digits.
HINTS:
Since the margins are fixed by design, the probabilities of observing 0-4 successes in the upper-left cell of the table correspond to the probabilities of observing each of the contingency tables you enumerated above. The following notation and formulae will help you find the table probabilities. For any contingency table, we label the cells and the margins as follows:

Truth | Dr. Bristol’s Guess: Milk Poured First | Dr. Bristol’s Guess: Tea Poured First | Total
Milk Poured First | n_{11} | | R1
Tea Poured First | | | R2
Total | C1 | C2 | N
The number in the top-left cell, n_{11}, is the number of observed successes. The maximum possible value for n_{11} is 4.
If X is the number of correct guesses in the topleft cell of the table, then the probabilities are given by the hypergeometric probability mass function (PMF):
P(X = n_{11}) = (R1 choose n_{11}) * (N - R1 choose C1 - n_{11}) / (N choose C1)
In the context of the tea tasting experiment, C1=C2=R1=R2=4 and N=8 so the table probabilities are given by:
P(X = n_{11}) = (4 choose n_{11}) * (4 choose 4 - n_{11}) / (8 choose 4)
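The PMF in the hints can be evaluated directly with binomial coefficients. The short sketch below is one way to do that; the function name `table_probability` and its default arguments are assumptions chosen to match the tea-tasting margins.

```python
from math import comb

def table_probability(n11, r1=4, c1=4, n=8):
    """Hypergeometric probability of n11 in the top-left cell with fixed margins."""
    return comb(r1, n11) * comb(n - r1, c1 - n11) / comb(n, c1)

# One probability per possible table (n11 = 0, 1, 2, 3, 4).
probs = {k: table_probability(k) for k in range(5)}
for k, p in probs.items():
    print(f"n11 = {k}: {p:.3f}")
```

Because the five tables are the only ones compatible with the fixed margins, the five probabilities should sum to 1, which is a quick check on any hand calculation.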
f. The results show that Dr. Bristol guessed correctly in 3 cases out of the 4 in which milk was poured into the cup first. Based on the experimental design it is natural to want a 1-sided p-value.
1. Explain why a 1-sided p-value makes sense in this experiment.
2. Use the table probabilities above to find a 1-sided p-value. Based on the p-value, do you think the experiment provides convincing evidence against the null hypothesis?
g. Consider again the data from the tea tasting experiment but this time suppose the data come from an experiment in which the researcher has randomized mice to two treatment groups and observed a binary outcome on each mouse.

Randomized Group | Observed Outcome: Success | Observed Outcome: Failure | Total
Treatment | 3 | 1 | 4
Control | 1 | 3 | 4
Total | 4 | 4 | 8
This design differs from the tea tasting experiment in the fact that only one of the margins is fixed: the number of mice assigned to each group (4 per group, indicated in the rows of the table above). The binary outcomes might be modeled using two binomial distributions. The target of inference in this case is the difference in the group proportions. Many researchers will analyze such designs using the chi-square test or 2-sample proportion test, which are equivalent in the 2-sample case shown here. However, these tests cannot be used in this setting since the assumptions related to expected cell counts are violated. In such cases many analysts revert to Fisher’s exact test. But this test considers both margins fixed, which is not the case.
Instead of using Fisher’s exact test you will now design an exact binomial test for the difference in proportions by executing the following steps.
1. What is the observed success probability in each group? What is the estimate of the risk difference obtained from subtracting the success probability in the control group from the success probability in the treatment group?
2. Write the binomial likelihood for the experimental group, letting r1 be the number of observed successes, n1 be the number of trials, and θ_{1} be the success probability.
3. Write the binomial likelihood for the control group.
4. What is the value of the success probability in each group under the null hypothesis of no difference between groups? What is the corresponding estimate of the risk difference under the null?
5. Write the likelihood for the number of successes in each group under the null hypothesis.
6. What is the log of this likelihood function?
7. Find the first derivative of the log-likelihood, set it equal to zero and solve for the success probability. This is the estimate for the success probability under the null hypothesis. Show all of your work and explain what you are doing at each step. You may use online calculators or any other resource you want to help you with the mathematics.
8. Explain why the result you obtained in 7 matches your intuition about the success probability under the null hypothesis as you described in question 4.
9. Now we will obtain a 1-sided p-value for the test of the null hypothesis of no difference between groups vs. the alternative that the treatment has a higher probability of success than the control. Recall that such a test is based on the probability of the observed result or more extreme assuming the null hypothesis is true. The observed result in this case is the observed risk difference that you wrote down in question 1.
There are 25 possible 2x2 tables for the mouse experiment where only 1 margin is fixed (the number assigned to each treatment group). Enumerate all possible tables and identify those tables where the risk difference is as extreme or more extreme than the observed result. Use your likelihood from question 5 to find the probability of each table under the null (use 2 decimal places for the probabilities). Then sum these probabilities to find the 1-sided p-value rounded to 2 decimal places.
10. Verify your calculations in question 9 are correct by writing a program to find the upper 1-sided p-value for the mouse experiment.
11. Using your statistical software, conduct a chi-square test (without continuity correction) or two-sample proportion test and fill in the following table to compare p-values from Fisher’s exact test, the chi-square test, and the exact binomial test. Remember to report the upper 1-sided p-values in all tests.
Test | Upper 1-sided p-value
Fisher’s exact test |
Exact binomial test |
Chi-square test |
12. Write a short explanation of the differences you see among the 3 testing approaches. Limit your answer to no more than 3 sentences.
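The enumeration described in steps 9 and 10 can be sketched in a few lines. This is an illustrative sketch, not a reference solution: the function names are hypothetical, and it assumes the common success probability under the null is replaced by its pooled estimate, as in step 7. The Fisher calculation is included alongside for comparison.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def exact_binomial_upper_p(r1, n1, r2, n2):
    """Upper 1-sided p-value: P(risk difference >= observed) under a pooled null."""
    p0 = (r1 + r2) / (n1 + n2)      # pooled estimate of the common probability
    observed = r1 / n1 - r2 / n2    # observed risk difference
    total = 0.0
    for a in range(n1 + 1):         # successes in the treatment group
        for b in range(n2 + 1):     # successes in the control group
            # Small tolerance guards against floating-point ties.
            if a / n1 - b / n2 >= observed - 1e-12:
                total += binom_pmf(a, n1, p0) * binom_pmf(b, n2, p0)
    return total

def fisher_upper_p(r1, n1, r2, n2):
    """Upper 1-sided Fisher p-value, which conditions on both margins."""
    c1 = r1 + r2
    upper = min(n1, c1)
    tail = sum(comb(n1, k) * comb(n2, c1 - k) for k in range(r1, upper + 1))
    return tail / comb(n1 + n2, c1)

p_binomial = exact_binomial_upper_p(3, 4, 1, 4)
p_fisher = fisher_upper_p(3, 4, 1, 4)
print(f"exact binomial: {p_binomial:.4f}, Fisher: {p_fisher:.4f}")
```

The binomial version sums over all 25 tables with one fixed margin, whereas Fisher's test only considers the 5 tables that share both observed margins, which is why the two p-values can differ on the same data.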
h. Suppose you’ve been approached by an investigator who is writing a grant proposal to obtain funding for a laboratory experiment. The investigator will culture lung cancer cells and treat them with either saline or a new investigational drug to evaluate the cytotoxic potential of the drug. The drug is considered potentially useful as a human anticancer agent if it kills greater than 50% of the cells in culture.
The investigator will prepare a set of cell culture dishes and randomly assign each one to be treated with either saline or the investigational drug. The number of cells in each culture dish will be counted immediately prior to and 1 hour after the assigned treatment is administered. If, after 1 hour, at least 50% of the cells have died in an individual culture dish then this will be counted as a successful outcome for that dish.
The investigator expects, based on the properties of the cell line they are culturing, that 5% of culture dishes in the saline treatment condition will have the outcome of >=50% cell death after 1 hour. The investigator also states that the new drug would be considered promising for future development if at least 40% of the drug-treated cell cultures had >= 50% cell death at 1 hour, representing an absolute increase of 35% in the outcome probability (i.e., 40% on drug minus 5% on saline).
The investigator tells you: “I used an online sample size calculator to figure out that 20 cell cultures per group (or 40 for the entire experiment) is enough to provide ~80% power for the MCID using a 1-sided test that compares 2 independent proportions at the 2.5% alpha level. But the calculator also warned me that the assumptions for the test might not be met and suggested using Fisher’s exact test instead. Will I still be OK to budget for 20 cell cultures per group if I want to use Fisher’s exact test? If not, what else can I do?”
1. Perform a statistical simulation to estimate the power the experiment has for the MCID with 20 per group. Estimate power for Fisher’s exact test and the exact binomial test you derived above. Use 1-sided tests with alpha of .025.
2. What would you recommend to the investigator based on your simulation?
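A power simulation of the kind requested in part h could be organized as in the sketch below. Everything here is an assumption made for illustration: the function names, the seed, the number of replicates, and the use of the pooled-estimate exact binomial test described earlier. P-values are cached per table because only 21 x 21 distinct outcomes are possible with 20 dishes per group.

```python
import random
from functools import lru_cache
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

@lru_cache(maxsize=None)
def fisher_upper_p(x_drug, x_saline, n):
    """Upper 1-sided Fisher p-value with n dishes per group."""
    total = x_drug + x_saline
    upper = min(n, total)
    tail = sum(comb(n, k) * comb(n, total - k) for k in range(x_drug, upper + 1))
    return tail / comb(2 * n, total)

@lru_cache(maxsize=None)
def exact_binomial_upper_p(x_drug, x_saline, n):
    """Upper 1-sided exact binomial p-value using the pooled null estimate."""
    p0 = (x_drug + x_saline) / (2 * n)
    observed = (x_drug - x_saline) / n
    return sum(
        binom_pmf(a, n, p0) * binom_pmf(b, n, p0)
        for a in range(n + 1)
        for b in range(n + 1)
        if (a - b) / n >= observed - 1e-12
    )

def simulate_power(p_drug=0.40, p_saline=0.05, n=20, alpha=0.025,
                   reps=2000, seed=1):
    """Estimate power as the share of simulated experiments with p <= alpha."""
    rng = random.Random(seed)
    hits_fisher = hits_binomial = 0
    for _ in range(reps):
        # Each dish is an independent Bernoulli trial for ">= 50% cell death".
        x_drug = sum(rng.random() < p_drug for _ in range(n))
        x_saline = sum(rng.random() < p_saline for _ in range(n))
        hits_fisher += fisher_upper_p(x_drug, x_saline, n) <= alpha
        hits_binomial += exact_binomial_upper_p(x_drug, x_saline, n) <= alpha
    return hits_fisher / reps, hits_binomial / reps

power_fisher, power_binomial = simulate_power()
print(f"Fisher: {power_fisher:.3f}, exact binomial: {power_binomial:.3f}")
```

Rerunning `simulate_power` with other values of `n` is a simple way to explore how many dishes per group each test would need; the estimates carry Monte Carlo error, so more replicates give steadier answers.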
Question 3: A New Drug to Treat Alzheimer’s Disease
Annovis Bio is developing a new drug called Buntanetap to treat Alzheimer’s disease. The company recently published the following press release describing the results of their recent Phase II/III trial.
https://irpages2.eqs.com/websites/annovis/English/431010/uspressrelease.html?airportNewsID=7f4c17db2a474e9180598d07dae9de11
If you cannot access the press release online a PDF copy has been posted along with this exam. You should also read the description of the study on ClinicalTrials.gov:
https://www.clinicaltrials.gov/study/NCT05686044
Read these materials and then answer the following questions.
a. What were the objectives of this clinical trial?
b. Describe the design of the clinical trial as if you were writing something in the clinical trial protocol or the statistical analysis plan, i.e., imagine the study hasn’t happened yet and use future tense in your writing. Your description should be 5-7 sentences and should include the following details:
· The population studied (key eligibility criteria)
· The sample size
· The treatment assignment plan and use (or not) of blinding
· Details of the intervention (where and how the assigned treatment is administered)
· The length of follow-up for each patient
· The primary endpoint (what was measured and when; no need to address secondary endpoints for this question)
· Where the study was conducted
c. This trial is described as a Phase II/III trial. What are some key differences between Phase II and III trials that are relevant to this example drug trial? Limit your answer to 8-10 sentences and avoid discussing general differences between the phases of drug development. Instead, imagine what a Phase II or III trial would look like for this drug and describe how the example study incorporates design features of both phases.
d. What is the definition of a co-primary endpoint? How is the success of a trial determined based on the use of a co-primary endpoint?
e. The Annovis Bio trial uses 2 continuous co-primary outcomes, the ADAS-Cog 11 and the ADCS-CGIC. What are the null and alternative hypotheses for the co-primary outcomes in this trial? Phrase your hypotheses first in words and then using statistical notation. Please note that neither the press release nor the ClinicalTrials.gov entry for the study explicitly states the hypotheses of the study. We are asking you to come up with the hypotheses based on the information provided in both of these sources. Hint: consider using a global test that compares all 3 dose groups to placebo simultaneously. The table below shows all the parameters that need to be estimated for this test. You may use the information in this table to phrase your hypotheses using statistical notation.
The cells of the following table show the parameters (means, µ) that need to be estimated:
Treatment Group | Co-primary Outcome 1 | Co-primary Outcome 2
T1 | µ1T1 | µ2T1
T2 | µ1T2 | µ2T2
T3 | µ1T3 | µ2T3
P | µ1P | µ2P
f. Should statisticians be concerned about Type I error inflation in trials that use co-primary endpoints? Why or why not?
g. Should statisticians be concerned about Type II error inflation in trials that use co-primary endpoints? Why or why not?
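A small numerical illustration may help frame parts f and g. The sketch below assumes that the trial succeeds only if both endpoint tests reject, and it also assumes the two tests are statistically independent, which real co-primary endpoints usually are not; the alpha and power values are invented for the example.

```python
def joint_rejection_probability(prob_endpoint_1, prob_endpoint_2):
    """P(both endpoint tests reject), assuming the two tests are independent."""
    return prob_endpoint_1 * prob_endpoint_2

alpha = 0.025       # assumed 1-sided level for each endpoint test
power_each = 0.90   # assumed power for each endpoint on its own

# No true effect on either endpoint: both tests must reject by chance.
type1_both_null = joint_rejection_probability(alpha, alpha)
# Worst case: no effect on one endpoint, the other test rejects with certainty.
type1_one_null = joint_rejection_probability(alpha, 1.0)
# True effects on both endpoints: both tests must detect them.
joint_power = joint_rejection_probability(power_each, power_each)

print(f"Type I error, both null: {type1_both_null:.6f}")
print(f"Type I error, one null (worst case): {type1_one_null:.3f}")
print(f"Joint power: {joint_power:.2f}")
```

Under these assumptions the chance of a false success never exceeds the per-endpoint alpha, while the chance of missing a real effect grows because both tests must succeed; positive correlation between the endpoints would soften, but not remove, that loss of power.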
h. What is a clinical trial estimand? Give a brief 1 or 2 sentence definition of an estimand, and then give 1-2 sentences that define each of the following 4 key parts of an estimand: population, variable (also called endpoint), intercurrent events, and population-level summary.
i. Write down what you think the estimand is for this trial based on the information provided in the press release and ClinicalTrials.gov. Make some reasonable suggestions for the parts of the estimand that aren’t well defined based on the information you have. For example, which intercurrent events would you be concerned about and how would you handle them in the analysis? Do this in two steps. First, make a bulleted list that describes in a single sentence each of the 4 parts of the estimand. Then, write a summary paragraph that combines this information.
j. Based on what you know about the study design, write a brief statistical analysis plan for the study (about 1 paragraph, 5-7 sentences) that is in alignment with the study design and the estimand that you described above. You may assume the sample size has been determined appropriately to support your analysis strategy and that there will be no missing data. You should also assume the treatment assignment strategy was based on stratified randomization. Your plan should address the following points.
· Recommend an analysis strategy for the co-primary endpoints, including analytical model and decision criteria.
· Describe the analysis population, i.e., which enrolled patients will be included in the analysis.
· Describe which results you will report and how. Include an example table or figure.
k. Compare your proposed analysis plan with what is reported in the press release. Are you given enough information in the press release to tell whether the investigational treatment was successful? Why or why not?