Lab 5

By the end of this lab you will create formulate and conduct hypothesis tests using randomization.

For all visualizations you create, be sure to include informative titles for the plot, axes, and legend!

Getting started

  1. Go to Posit Cloud and start the project titled lab-5 - Hypothesis testing.

  2. Under the Files tab on the lower right, click on lab-5.qmd to open the lab template.

  3. Complete the exercises in this document.

Part 1: IMS Exercises

The exercises in this section do not require code. Make sure to answer the questions in full sentences.

Exercise 1

IMS - Chapter 11 exercises, #2: Identify the parameter, II.

Exercise 2

IMS - Chapter 11 exercises, #6: Identify hypotheses, II.

Exercise 3

IMS - Chapter 11 exercises, #8: Heart transplants.

Part 2: Heart transplants - outcome

Today, we will be working with data from The Stanford University Heart Transplant Study(Turnbull, Brown, and Hu 1974).

We’ll use the tidyverse and tidymodels packages for our analysis, so let’s load those first.

library(tidyverse)
library(tidymodels)
library(openintro)

And then, let’s read the data and take a peak at it:

heart_transplant <- read_csv("data/heart-transplant.csv")
glimpse(heart_transplant)
Rows: 103
Columns: 8
$ id         <dbl> 15, 43, 61, 75, 6, 42, 54, 38, 85, 2, 103, 12, 48, 102, 35,…
$ outcome    <chr> "deceased", "deceased", "deceased", "deceased", "deceased",…
$ transplant <chr> "control", "control", "control", "control", "control", "con…
$ age        <dbl> 53, 43, 52, 52, 54, 36, 47, 41, 47, 51, 39, 53, 56, 40, 43,…
$ survtime   <dbl> 1, 2, 2, 2, 3, 3, 3, 5, 5, 6, 6, 8, 9, 11, 12, 16, 16, 16, …
$ acceptyear <dbl> 68, 70, 71, 72, 68, 70, 71, 70, 73, 68, 67, 68, 71, 74, 70,…
$ prior      <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no",…
$ wait       <dbl> NA, NA, NA, NA, NA, NA, NA, 5, NA, NA, NA, NA, NA, NA, NA, …

The data dictionary is as follows:

Variable Description
id ID number of the patient
outcome Survival status with levels alive and deceased
transplant Transplant group with levels control (did not receive a transplant) and treatment (received a transplant)
age Age of the patient at the beginning of the study
survtime Number of days patients were alive after the date they were determined to be a candidate for a heart transplant until the termination date of the study
acceptyear Year of acceptance as a heart transplant candidate
prior Whether or not the patient had prior surgery with levels yes and no
wait Waiting time for transplant

Exercise 4

Recreate the stacked bar plot in Exercise 3.

Hint

The colors are #569BBD and #114B5F. Use scale_fill_manual() to indicate these are the colors you want to use.

Pause and render

Now that you’ve completed a few exercises, pause and render your document. If it renders without any issues, great! Move on to the next exercise. If it does not, debug the issue before moving on. Ask for help from your TA if you need. Do not proceed without rendering your document.

Exercise 5

Calculate the point estimate for the difference in proportions of patients who died between the treatment and control groups: \(\hat{p}_{treatment} - \hat{p}_{control}\), where \(\hat{p}\) is the observed probability of dying in each group. Do this in two ways:

  1. Construct a frequency table of survived by transplant and calculate the difference in proportions “by hand”.
  2. Use specify() and calculate() to calculate the difference in proportions as shown below.
obs_diff_outcome <- heart_transplant |>
  specify(response = outcome, explanatory = transplant, success = "deceased") |>
  calculate(stat = "diff in props", order = c("treatment", "control"))

Then, confirm that you obtained the same results with the two methods.

Exercise 6

Using the code below, generate a null distribution for the hypothesis test for the hypothesis you formulated in Exercise 3. Use 100 resamples (reps). However, change the seed to use a different seed.

set.seed(1234)

null_dist_outcome <- heart_transplant |>
  specify(response = outcome, explanatory = transplant, success = "deceased") |>
  hypothesize(null = "independence") |>
  generate(reps = 100, type = "permute") |>
  calculate(stat = "diff in props", order = c("treatment", "control"))

Then, inspect the object null_dist_outcome using either glimpse() or just printing it to screen. How many rows and how many columns does it have? What does each row represent? What does each variable represent?

Exercise 7

Visualize the distribution of simulated differences in proportions of deceased between treatment and control groups. What is the shape of this distribution? What is the center of this distribution? Are these results expected? Explain your reasoning.

Hint

These simulated values are in the stat column of null_dist_outcome.

Exercise 8

Calculate the p-value. Do this in two ways:

  1. Filter null_dist_outcome for stat values that are at least as far away from the null value as the observed value you calculated in Exercise 4. Make sure you’re considering both directions! Note how many such values there are, and calculate the proportion out of the total number of resamples.
  2. Use get_p_value():
p_value_outcome <- null_dist_outcome |>
  get_p_value(obs_stat = obs_diff_outcome, direction = "two-sided")

Then, confirm that you obtained the same results with the two methods. Finally, interpret the p-value in context of the data and the research question and comment on the statistical discernability of this result using a discernability level of 5%.

Pause and render

Now that you’ve completed a few exercises, pause and render your document. If it renders without any issues, great! Move on to the next exercise. If it does not, debug the issue before moving on. Ask for help from your TA if you need. Do not proceed without rendering your document.

Part 3: Heart transplants - survtime

Exercise 9

Recreate the side-by–side box plots in Exercise 3. The variable of interest is survtime, which is the number of days patients were alive after the date they were determined to be a candidate for a heart transplant until the termination date of the study.

Exercise 10

Do these data provide convincing evidence of a difference between average survival times of patient who did and did not get a heart transplant?

Write the hypotheses for answering this question and conduct the hypothesis test using randomization. Use 1,000 resamples. In order to avoid name collision, use object names like obs_diff_survtime, null_dist_survtime, and p_value_survtime. Find and interpret the p-value in context of the data and the research question and comment on the statistical discernability of this result using a discernability level of 5%.

Wrap up

Submitting

Important

Before you proceed, first, make sure that you have updated the document YAML with your name! Then, render your document one last time, for good measure.

To submit your assignment to Gradescope:

  • Go to your Files pane and check the box next to the PDF output of your document (lab-5.pdf).

  • Then, in the Files pane, go to More > Export. This will download the PDF file to your computer. Save it somewhere you can easily locate, e.g., your Downloads folder or your Desktop.

  • Go to the course Canvas page and click on Gradescope and then click on the assignment. You’ll be prompted to submit it.

  • Mark the pages associated with each exercise. All of the papers of your lab should be associated with at least one question (i.e., should be “checked”).

Warning

If you fail to mark the pages associated with an exercise, that exercise won’t be graded. This means, if you fail to mark the pages for all exercises, you will receive a 0 on the assignment. The TAs can’t mark your pages for you, and for them to be able to grade, you must mark them.

Grading

Exercise Points
Exercise 1 5
Exercise 2 4
Exercise 3 6
Exercise 4 4
Exercise 5 4
Exercise 6 6
Exercise 7 5
Exercise 8 4
Exercise 9 5
Exercise 10 7
Total 50

References

Turnbull, Bruce W., Byron Wm. Brown, and Marie Hu. 1974. “Survivorship Analysis of Heart Transplant Data.” Journal of the American Statistical Association 69 (345): 74–80. https://doi.org/10.1080/01621459.1974.10480130.