Logistic regression

Lecture 10

Dr. Mine Çetinkaya-Rundel

Duke University
STA 101 - Fall 2023

Warm up

Check-in

Daily check-ins for getting you thinking at the beginning of the class and reviewing recent/important concepts
Go to Canvas and open today’s “quiz” titled 2023-10-09 Check-in
Access code: slopes (released in class)
“Graded” for completion

Project

Project workday on Friday in lab – use this time for finalizing and polishing, not getting started
Projects due Friday at 5 pm
Peer evaluations due Friday at 6 pm:
- You’ll receive an email about them later today from TEAMMATES
- You must turn in a peer evaluation to get the points your teammates awarded you
Only one person should submit the project on Gradescope and select all team members’ names. See https://sta101-f23.github.io/project/project-1.html#submission for more information.

Any questions about projects?

Logistic regression

So far in regression

Outcome: Numerical, Predictor: One numerical or one categorical with only two levels \(\rightarrow\) Simple linear regression
Outcome: Numerical, Predictors: Any number of numerical or categorical variables with any number of levels \(\rightarrow\) Multiple linear regression
Outcome: Categorical with only two levels, Predictors: Any number of numerical or categorical variables with any number of levels \(\rightarrow\) Logistic regression
Outcome: Categorical with any number of levels, Predictors: Any number of numerical or categorical variables with any number of levels \(\rightarrow\) Generalized linear models – Not covered in STA 101

Data + packages

library(tidyverse)
library(tidymodels)

hp_spam <- read_csv("data/hp-spam.csv")

4601 emails collected at Hewlett-Packard labs, with 58 variables recorded for each email
Outcome: type
- type = 1 is spam
- type = 0 is non-spam
Predictors of interest:
- capitalTotal: Number of capital letters in email
- Percentages are calculated as (100 * number of times the WORD appears in the e-mail) / total number of words in email
  - george: Percentage of “george”s in email (these were George’s emails)
  - you: Percentage of “you”s in email
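The percentage formula above can be tried on a small example. This is a minimal sketch using a made-up email; the `email` text and the `word_pct()` helper are hypothetical, and the way words are split here (lowercase, letters only) is an assumption that may differ from how the original dataset was prepared.

```r
# Hypothetical toy email, not taken from the HP dataset
email <- "Hi George, can you send the report? Thank you, George"

# Split into lowercase words, treating anything that is not a letter as a separator
words <- strsplit(tolower(email), "[^a-z]+")[[1]]
words <- words[words != ""]

# Percentage of words equal to a given target word
word_pct <- function(words, target) {
  100 * sum(words == target) / length(words)
}

length(words)               # 10 words in total
word_pct(words, "george")   # 2 of 10 words, so 20
word_pct(words, "you")      # 2 of 10 words, so 20
```

Because these are percentages of words rather than raw counts, a short email that mentions "george" once can score higher than a long email that mentions it several times.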

Glimpse at data

What type of data is type? What type should it be in order to use logistic regression?

hp_spam |>
  select(type, george, capitalTotal, you)

# A tibble: 4,601 × 4
    type george capitalTotal   you
   <dbl>  <dbl>        <dbl> <dbl>
 1     1      0          278  1.93
 2     1      0         1028  3.47
 3     1      0         2259  1.36
 4     1      0          191  3.18
 5     1      0          191  3.18
 6     1      0           54  0   
 7     1      0          112  3.85
 8     1      0           49  0   
 9     1      0         1257  1.23
10     1      0          749  1.67
# ℹ 4,591 more rows
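In the printout above, type is stored as a number (`<dbl>`). The classification models in tidymodels expect the outcome to be a factor, so one possible conversion is sketched below. The `toy_spam` tibble and its values are made up for illustration; with the real data the same `mutate()` step would presumably be applied to `hp_spam`.

```r
library(tidyverse)

# Hypothetical stand-in for a few rows of hp_spam (values are made up)
toy_spam <- tibble(
  type = c(1, 1, 0, 0),
  george = c(0, 0, 1.2, 0.5)
)

# Convert the 0/1 outcome from numeric to factor
toy_spam <- toy_spam |>
  mutate(type = as.factor(type))

toy_spam |>
  select(type, george)

# The second level ("1", spam) is the one treated as the "success"
levels(toy_spam$type)
```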

EDA: How much spam?

hp_spam |>
  count(type) |>
  mutate(p = n / sum(n))

# A tibble: 2 × 3
   type     n     p
  <dbl> <int> <dbl>
1     0  2788 0.606
2     1  1813 0.394

EDA: AM I SCREAMING? `capitalTotal`

ggplot(hp_spam, aes(x = capitalTotal)) +
  geom_histogram()

EDA: `george`, is that `you`?

ggplot(hp_spam, aes(x = george)) +
  geom_histogram()
ggplot(hp_spam, aes(x = you)) +
  geom_histogram()

Logistic regression

Logistic regression takes in a number of predictors and outputs the probability of a “success” (an outcome of 1) in a binary outcome variable.
The probability is related to the predictors via the sigmoid function (the inverse of the logit link), \[ p(y_i = 1) = \frac{1}{1+\text{exp}({- \sum \beta_i x_i })}, \] whose output is in \((0,1)\) (a probability).
In this modeling scheme, one typically finds \(\hat{\beta}\) by maximizing the likelihood function, a different objective function from the “least squares” objective we used previously.
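To make "maximizing the likelihood" concrete, here is a minimal sketch on simulated data (not the HP emails). It assumes one predictor plus an intercept, which can be thought of as a coefficient on a predictor that always equals 1. The names `true_beta` and `neg_log_lik` are hypothetical. The sketch minimizes the negative log-likelihood with the general-purpose optimizer `optim()` and compares the result with `glm()`, which maximizes the same likelihood using a different algorithm.

```r
set.seed(101)

# Simulated data (an assumption for illustration, not the HP emails)
n <- 500
x <- rnorm(n)
true_beta <- c(-0.5, 1.5)   # intercept, slope
p <- 1 / (1 + exp(-(true_beta[1] + true_beta[2] * x)))
y <- rbinom(n, size = 1, prob = p)

# Negative log-likelihood of the logistic model for a candidate beta
neg_log_lik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log(1 + exp(eta)))
}

# Minimizing the negative log-likelihood maximizes the likelihood
fit_optim <- optim(par = c(0, 0), fn = neg_log_lik, method = "BFGS")

# glm() with family = binomial fits the same model
fit_glm <- glm(y ~ x, family = binomial)

fit_optim$par   # estimates from direct optimization
coef(fit_glm)   # estimates from glm(); the two should agree closely
```

Unlike least squares, there is no closed-form solution here, which is why the estimates come from an iterative search.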

Logistic regression, visualized
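One way to see the shape of the sigmoid function is to plot it over a grid of values. The sketch below is illustrative rather than a reproduction of any particular figure; `sigmoid_df` and the range from -6 to 6 are arbitrary choices.

```r
library(ggplot2)

# Grid of values for the linear combination of predictors
sigmoid_df <- data.frame(eta = seq(-6, 6, by = 0.1))

# Sigmoid: maps any real number to a value strictly between 0 and 1
sigmoid_df$p <- 1 / (1 + exp(-sigmoid_df$eta))

ggplot(sigmoid_df, aes(x = eta, y = p)) +
  geom_line() +
  labs(
    x = "Linear combination of predictors",
    y = "P(y = 1)"
  )
```

The curve is S-shaped: it rises steadily, passes through 0.5 when the linear combination is 0, and flattens toward 0 and 1 at the extremes.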

Using data to estimate \(\beta_i\)

To proceed with building our email classifier, we will, as usual, use our data (pairs of outcome \(y_i\) and predictor \(x_i\)) to estimate \(\beta\) (find \(\hat{\beta}\)) and obtain the model: \[ p(y_i = 1) = \frac{1}{1+\text{exp}({- \sum \hat{\beta}_i x_i})} \]
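A minimal sketch of this estimation step with tidymodels is shown below. So that it runs on its own, it uses a simulated stand-in called `sim_spam` (hypothetical, with made-up values) rather than `hp_spam`; the coefficients used to simulate the data are arbitrary.

```r
library(tidyverse)
library(tidymodels)

set.seed(101)

# Simulated stand-in for hp_spam (an assumption; values are made up)
sim_spam <- tibble(
  you = runif(300, min = 0, max = 5),
  type = rbinom(300, size = 1, prob = 1 / (1 + exp(-(-1 + 0.5 * you))))
) |>
  mutate(type = as.factor(type))

# Fit the logistic regression model (default engine is glm)
spam_fit <- logistic_reg() |>
  fit(type ~ you, data = sim_spam)

# Estimated coefficients (on the log-odds scale)
tidy(spam_fit)

# Predicted probabilities for new, hypothetical emails
new_emails <- tibble(you = c(0, 2.5, 5))
spam_probs <- predict(spam_fit, new_data = new_emails, type = "prob")
spam_probs
```

The `.pred_1` column holds the estimated probability that each new email is spam, obtained by plugging the estimated coefficients into the sigmoid formula.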

Application exercise

Go to Posit Cloud and continue the project titled ae-09-Spam.

ICYMI

Today’s daily check-in access code: slopes (released in class)
