2020 Durham City and County Resident Survey

Application exercise
Suggested answers

These are suggested answers for the application exercise. They’re not necessarily complete or 100% accurate; they’re roughly what we develop in class while going through the exercises.

The main topics we’ll explore today are:


  • Wrapping up discussion on categorical data

  • Visualizing and summarizing numerical data

  • Improving visualizations for visual appeal and better communication




The data for this case study come from the 2020 Durham City and County Resident Survey.

First, let’s load the data:

library(tidyverse)
durham <- read_csv("data/durham-2020.csv")
Rows: 803 Columns: 49
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): please_define_other_32_5, primary_language, please_define_other_34...
dbl (42): id, overall_quality_of_services_3_01, overall_quality_of_services_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

In the last application exercise we manipulated the data a bit.

durham <- durham |>
  rename(
    own_rent = do_you_own_or_rent_your_current_resi_31,
    income = would_you_say_your_total_annual_hous_35,
    # a few new ones
    qos_city = overall_quality_of_services_3_01,
    qos_county = overall_quality_of_services_3_02,
    qol_durham = overall_quality_of_life_in_d_3_06,
    qol_neighborhood = overall_quality_of_life_in_y_3_07
  ) |>
  mutate(
    income = as_factor(income),
    own_rent = as_factor(own_rent)
  )

Visualizing relationships between categorical variables

Exercise 1

Visualize and describe the relationship between income and home ownership of Durham residents.

Stretch goal: Customize the colors using named colors from http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf.

Add your answer here.

durham |>
  select(income, own_rent) |>
  drop_na() |>
  ggplot(aes(y = income, fill = own_rent)) +
  geom_bar(position = "fill") +
  # relabel the coded income levels on the y axis
  scale_y_discrete(
    labels = c(
      "1" = "Under $30,000",
      "2" = "$30,000-$59,999",
      "3" = "$60,000-$99,999",
      "4" = "$100,000 or more"
    )
  ) +
  # named colors and readable legend labels for own/rent
  scale_fill_manual(
    values = c("1" = "cadetblue", "2" = "coral"),
    labels = c("1" = "Own", "2" = "Rent")
  ) +
  labs(
    x = "Proportion",
    y = "Would you say your total\nannual household income is...",
    fill = "Do you own\nor rent\nyour current\nresidence?",
    title = "Income vs. home ownership of Durham residents"
  )

Exercise 2

Calculate the proportions of home owners for each income category of Durham residents. Describe the relationship between these two variables, this time with the actual values from the conditional distribution of home ownership based on income level.

Add your answer here.

durham |>
  select(income, own_rent) |>
  drop_na() |>
  count(income, own_rent) |>
  group_by(income) |>
  mutate(prop = n / sum(n))
# A tibble: 8 × 4
# Groups:   income [4]
  income own_rent     n   prop
  <fct>  <fct>    <int>  <dbl>
1 1      1           51 0.362 
2 1      2           90 0.638 
3 2      1          105 0.565 
4 2      2           81 0.435 
5 3      1          107 0.552 
6 3      2           87 0.448 
7 4      1          160 0.930 
8 4      2           12 0.0698

Exercise 3

Recode the levels of these two variables to be more informatively labeled and calculate the proportions from the previous exercise again.

durham <- durham |>
  mutate(
    income = case_when(
      income == "1" ~ "Under $30,000",
      income == "2" ~ "$30,000-$59,999",
      income == "3" ~ "$60,000-$99,999",
      income == "4" ~ "$100,000 or more"
    ),
    # keep the income levels in their natural order
    income = fct_relevel(income, "Under $30,000", "$30,000-$59,999", "$60,000-$99,999", "$100,000 or more"),
    own_rent = if_else(own_rent == "1", "Own", "Rent")
  )

durham |>
  select(income, own_rent) |>
  drop_na() |>
  count(income, own_rent) |>
  group_by(income) |>
  mutate(prop = n / sum(n))
# A tibble: 8 × 4
# Groups:   income [4]
  income           own_rent     n   prop
  <fct>            <chr>    <int>  <dbl>
1 Under $30,000    Own         51 0.362 
2 Under $30,000    Rent        90 0.638 
3 $30,000-$59,999  Own        105 0.565 
4 $30,000-$59,999  Rent        81 0.435 
5 $60,000-$99,999  Own        107 0.552 
6 $60,000-$99,999  Rent        87 0.448 
7 $100,000 or more Own        160 0.930 
8 $100,000 or more Rent        12 0.0698

Visualizing numerical data

Exercise 4

One of the questions on the survey is “How satisfied are you with the overall quality of services provided by the city?” A response of 1 indicates Very Dissatisfied and a response of 5 indicates Very Satisfied. The responses for this question are in the variable qos_city. This could be considered an ordinal, categorical variable but can also be treated as a numerical variable in an analysis. Let’s do the latter!

Visualize and describe the distribution of qos_city. If you get a warning with your visualization, comment on what it means.

Add your response here.

ggplot(durham, aes(x = qos_city)) +
  geom_histogram(binwidth = 1) +
  labs(
    x = "Quality of services provided by the city\n1 - Very Dissatisfied, 5 - Very Satisfied"
  )
Warning: Removed 75 rows containing non-finite values (`stat_bin()`).
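
The warning says that 75 rows were dropped because qos_city was not a finite number in those rows. In survey data this most likely means missing responses (NA), though that is an assumption about this dataset. A minimal sketch with a made-up vector (the values in ratings are hypothetical, not survey data) shows how missing values can be counted and why the later summaries pass na.rm = TRUE:

```r
# Hypothetical ratings on the 1-5 scale, with two missing responses
ratings <- c(3, 4, NA, 5, NA, 4)

n_missing <- sum(is.na(ratings))         # number of missing responses
mean_all  <- mean(ratings)               # NA, because missing values propagate
mean_obs  <- mean(ratings, na.rm = TRUE) # mean of the observed responses only
```

On the real data, sum(is.na(durham$qos_city)) should match the count in the warning if every removed value is an NA.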

Exercise 5

Calculate the mean and median of the distribution of qos_city. If these values are not exactly the same, can you explain what the difference might be attributed to?

Add your response here.

durham |>
  summarize(
    mean_qos_city = mean(qos_city, na.rm = TRUE),
    median_mean_qos_city = median(qos_city, na.rm = TRUE)
  )
# A tibble: 1 × 2
  mean_qos_city median_mean_qos_city
          <dbl>                <dbl>
1          3.54                    4

Exercise 6

Based on the shape of the distribution of qos_city, which measure of spread (variability) is more appropriate? Calculate that value and interpret it in context of the data.

durham |>
  summarize(
    iqr_qos_city = IQR(qos_city, na.rm = TRUE),
    q1_qos_city = quantile(qos_city, 0.25, na.rm = TRUE),
    q3_qos_city = quantile(qos_city, 0.75, na.rm = TRUE)
  )
# A tibble: 1 × 3
  iqr_qos_city q1_qos_city q3_qos_city
         <dbl>       <dbl>       <dbl>
1            1           3           4

Exercise 7

Make a box plot of qos_city and comment on how the values you calculated map on to the box plot.

ggplot(durham, aes(x = qos_city)) +
  geom_boxplot() +
  labs(
    x = "Quality of services provided by the city\n1 - Very Dissatisfied, 5 - Very Satisfied"
  )
Warning: Removed 75 rows containing non-finite values (`stat_boxplot()`).

Exercise 8

How does the average level of happiness with the quality of services provided by the city vary by income group?

Stretch goal: Visualize mean qos_city by income group.

Add your response here.

durham |>
  group_by(income) |>
  summarize(
    mean_qos_city = mean(qos_city, na.rm = TRUE)
  )
# A tibble: 5 × 2
  income           mean_qos_city
  <fct>                    <dbl>
1 Under $30,000             3.45
2 $30,000-$59,999           3.50
3 $60,000-$99,999           3.56
4 $100,000 or more          3.72
5 <NA>                      3.37

durham |>
  drop_na(income) |>
  group_by(income) |>
  summarize(
    mean_qos_city = mean(qos_city, na.rm = TRUE)
  ) |>
  ggplot(aes(x = income, y = mean_qos_city)) +
  geom_point() +
  geom_line()
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
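
The message above appears because income is a factor, so ggplot2 treats each income level as its own group and has no two points to connect. One common way to get a connected line is to map group = 1, which puts all points in a single group. The sketch below rebuilds the four group means from the summary table above in a small data frame; the object names (mean_qos, p) and the axis labels are illustrative assumptions, not part of the original exercise.

```r
library(ggplot2)

income_levels <- c(
  "Under $30,000", "$30,000-$59,999", "$60,000-$99,999", "$100,000 or more"
)

# Group means copied from the summary table (NA income group dropped)
mean_qos <- data.frame(
  income = factor(income_levels, levels = income_levels),
  mean_qos_city = c(3.45, 3.50, 3.56, 3.72)
)

# group = 1 places every point in one group so geom_line() can connect them
p <- ggplot(mean_qos, aes(x = income, y = mean_qos_city, group = 1)) +
  geom_point() +
  geom_line() +
  labs(
    x = "Annual household income",          # illustrative label
    y = "Mean rating of city services (1-5)" # illustrative label
  )
p
```

In the pipeline above, the same change would be adding group = 1 inside aes().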



Some of the terms we introduced are:

  • Marginal distribution: Distribution of a single variable.

  • Conditional distribution: Distribution of a variable conditioned on the values (or levels, in the context of categorical data) of another.
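
To make the two terms concrete, here is a small sketch that uses the own/rent counts by income from Exercise 3. The counts are copied from the output above; the object names (counts, marginal, conditional) are illustrative.

```r
library(dplyr)

# Counts copied from the Exercise 3 output
counts <- tibble(
  income = rep(
    c("Under $30,000", "$30,000-$59,999", "$60,000-$99,999", "$100,000 or more"),
    each = 2
  ),
  own_rent = rep(c("Own", "Rent"), times = 4),
  n = c(51, 90, 105, 81, 107, 87, 160, 12)
)

# Marginal distribution of own_rent: add up counts across all income levels
marginal <- counts |>
  count(own_rent, wt = n) |>
  mutate(prop = n / sum(n))

# Conditional distribution of own_rent given income:
# proportions are computed within each income level and sum to 1 there
conditional <- counts |>
  group_by(income) |>
  mutate(prop = n / sum(n)) |>
  ungroup()
```

The marginal proportions describe owners and renters overall, while the conditional proportions reproduce the prop column from Exercise 3.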


In this application exercise we:

  • Defined factors – the data type that R uses for categorical variables, i.e., variables that can take on values from a finite set of levels.
  • Reviewed data imports, visualization, and wrangling functions encountered before:
    • Import: read_csv(): Read data from a CSV (comma separated values) file
    • Visualization:
      • ggplot(): Create a plot using the ggplot2 package
      • aes(): Map variables from the data to aesthetic elements of the plot, generally passed as an argument to ggplot() or to geom_*() functions
      • geom_bar(): Represent data with bars, after calculating heights of bars under the hood (define only x or y aesthetic)
      • geom_histogram(): Represent data with a histogram
      • geom_boxplot(): Represent data with a box plot
      • geom_point(): Represent data with points
      • labs(): Label x axis, y axis, legend for color of plot, title of plot, etc.
    • Wrangling:
      • mutate(): Mutate the data frame by creating a new column or overwriting one of the existing columns
      • count(): Count the number of observations for each level of a categorical variable (factor) or each distinct value of any other type of variable
      • group_by(): Perform each subsequent action once per group, where groups can be defined based on the levels of one or more variables
      • summarize(): Calculate summary statistics
  • Introduced new data wrangling functions:
    • rename(): Rename columns in a data frame
    • as_factor(): Convert a variable to a factor
    • drop_na(): Drop rows that have NA in one or more specified variables
    • if_else(): Write logic for what happens if a condition is true and what happens if it’s not
    • case_when(): Write a generalized if_else() logic for more than one condition
    • fct_relevel(): Change the order of levels in a factor
  • Introduced new data visualization functions:
    • geom_col(): Represent data with bars (columns), for heights that have already been calculated (must define x and y aesthetics)
    • scale_fill_viridis_d(): Customize the discrete fill scale, using a color-blind friendly, ordinal discrete color scale
    • scale_y_discrete(): Customize the discrete y scale
    • scale_fill_manual(): Customize the fill scale by manually adjusting values for colors


We also introduced chunk options for managing figure sizes:

  • fig-width: Width of figure
  • fig-asp: Aspect ratio of figure (height / width)
  • fig-height: Height of figure – but I recommend using fig-width and fig-asp, instead of fig-width and fig-height
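
As a rough illustration, these options go at the top of a code chunk in a Quarto document as hash-pipe (#|) comments. The sketch below shows what the inside of such an R chunk might look like; the values 6 and 0.618 and the ratings data are made up for illustration. With these settings the figure height works out to fig-width × fig-asp = 6 × 0.618 ≈ 3.7 inches.

```r
#| fig-width: 6
#| fig-asp: 0.618

# The two lines above are chunk options; everything below is ordinary R code.
library(ggplot2)

# Hypothetical ratings on the 1-5 scale, not the survey data
ratings <- data.frame(qos_city = c(1, 2, 3, 3, 4, 4, 4, 5))

p_sized <- ggplot(ratings, aes(x = qos_city)) +
  geom_histogram(binwidth = 1)
p_sized
```

Setting the width and aspect ratio means the plot keeps its shape if the width is changed later.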


This dataset was cleaned and prepared for analysis by Duke StatSci PhD student Sam Rosen.