library(tidyverse)
library(tidymodels)
library(openintro)
Climate change
In this application exercise, we’ll do inference for a comparing two paired means and many means.
Packages
We’ll use the tidyverse, tidymodels, openintro, and ggthemes packages.
Paired means
Throughout this application exercise, we’ll make use of the climate70
dataset supplied by the openintro package.
glimpse(climate70)
Rows: 197
Columns: 7
$ station <fct> USC00203823, USC00276818, USC00186620, USC00331890, USC00235…
$ latitude <dbl> 41.93520, 44.25800, 39.41317, 40.24030, 37.83950, 45.56550, …
$ longitude <dbl> -84.64110, -71.25250, -79.40025, -81.87100, -94.37400, -100.…
$ dx70_1948 <int> 131, 80, 143, 156, 216, 138, 131, 194, 198, 213, 224, 74, 17…
$ dx70_2018 <int> 147, 99, 150, 158, 175, 132, 129, 188, 240, 213, 191, 101, 2…
$ dx90_1948 <int> 11, 1, 4, 18, 59, 39, 23, 17, 46, 67, 49, 0, 34, 35, 71, 100…
$ dx90_2018 <int> 16, 1, 1, 15, 51, 18, 17, 3, 65, 104, 39, 0, 62, 3, 86, 117,…
The question we want to answer is: Do these data provide convincing evidence that the number of days above 90 degrees between 2018 and 1948?
First, let’s set the hypotheses.
Add response here.
To carry out inference on paired data, we pre-compute the difference between paired values at the beginning of the analysis, and use those differences as our values of interest.
<- climate70 |>
climate70 mutate(dx90_diff = dx90_2018 - dx90_1948)
|>
climate70 select(station, latitude, longitude, contains("dx90"))
# A tibble: 197 × 6
station latitude longitude dx90_1948 dx90_2018 dx90_diff
<fct> <dbl> <dbl> <int> <int> <int>
1 USC00203823 41.9 -84.6 11 16 5
2 USC00276818 44.3 -71.3 1 1 0
3 USC00186620 39.4 -79.4 4 1 -3
4 USC00331890 40.2 -81.9 18 15 -3
5 USC00235987 37.8 -94.4 59 51 -8
6 USC00395691 45.6 -100. 39 18 -21
7 USC00219046 43.8 -94.2 23 17 -6
8 USW00003812 35.4 -82.5 17 3 -14
9 USC00047643 38.5 -122. 46 65 19
10 USW00013896 34.7 -87.6 67 104 37
# ℹ 187 more rows
And then visualize these differences:
ggplot(climate70, aes(x = dx90_diff)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We calculate the observed statistic in the paired setting in the same way that we would outside of the paired setting. Using specify()
and calculate()
:
<- climate70 |>
obs_stat specify(response = dx90_diff) |>
calculate(stat = "mean")
obs_stat
Response: dx90_diff (numeric)
# A tibble: 1 × 1
stat
<dbl>
1 2.90
Now, we want to compare this statistic to a null distribution, generated under the assumption that the true difference was actually zero, to get a sense of how likely it would be for us to see this observed difference or something more extreme.
Tests for paired data are carried out via the null = "paired independence"
argument to hypothesize()
.
set.seed(1129)
<- climate70 |>
null_dist specify(response = dx90_diff) |>
hypothesize(null = "paired independence") |>
generate(reps = 1000, type = "permute") |>
calculate(stat = "mean")
visualize(null_dist)

For each replicate, generate()
carries out type = "permute"
with null = "paired independence"
by:
- Randomly sampling a vector of signs (i.e., -1 or 1), probability 0.5 for either, with length equal to the input data, and
- Multiplying the response variable by the vector of signs, “flipping” the observed values for a random subset of value in each replicate
To get a sense for what this distribution looks like, and where our observed statistic falls, we can use visualize()
:
visualize(null_dist) +
shade_p_value(obs_stat, direction = "two-sided")

Then, we calculate the p-value:
<- null_dist |>
p_value get_p_value(obs_stat, direction = "two-sided")
p_value
# A tibble: 1 × 1
p_value
<dbl>
1 0.016