Lecture 6
Duke University
STA 101 - Fall 2023
We’ll work with data on Apple and Microsoft stock prices and use the tidyverse and tidymodels packages.
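As a rough sketch of the setup (the file name and column names below are placeholders, not necessarily those used in the activity):

```r
# Load the packages used throughout
library(tidyverse)
library(tidymodels)

# Hypothetical file and column names -- adjust to match the project data
stocks <- read_csv("data/stocks.csv")
glimpse(stocks)
```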
\[ y = \beta_0 + \beta_1 x + \epsilon \]
\(y\): the outcome variable. Also called the “response” or “dependent variable”. In prediction problems, this is what we are interested in predicting.
\(x\): the predictor. Also commonly referred to as “regressor”, “independent variable”, “covariate”, “feature”, “the data”.
\(\beta_0\), \(\beta_1\): the coefficients, sometimes called “constants”. They are fixed numbers, i.e. population parameters. \(\beta_0\) has another special name, “the intercept”, and \(\beta_1\) is called “the slope”.
\(\epsilon\): the error. This quantity represents observational error, i.e. the difference between our observation and the true population-level expected value: \(\beta_0 + \beta_1 x\).
Effectively, this model says our outcome \(y\) is linearly related to \(x\) but is not observed perfectly due to some error.
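To make the role of each piece concrete, here is a small simulation sketch of this model; the parameter values, sample size, and error spread are made up for illustration.

```r
# Simulate data from y = beta_0 + beta_1 * x + epsilon
# (all values below are illustrative, not from the lecture data)
set.seed(6)
beta_0  <- -5
beta_1  <- 0.5
x       <- runif(50, min = 150, max = 170)  # predictor values
epsilon <- rnorm(50, mean = 0, sd = 2)      # observational error
y       <- beta_0 + beta_1 * x + epsilon    # observed outcome
```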
Let’s examine January 2020 open prices of Microsoft and Apple stocks to illustrate some ideas.
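A scatterplot of the two open-price series might be built along these lines; the data frame `stocks` and the columns `MSFT` and `AAPL` are assumed names, and which stock plays the role of \(x\) versus \(y\) follows the activity.

```r
# Scatterplot of January 2020 open prices (assumed names: stocks, MSFT, AAPL)
stocks |>
  ggplot(aes(x = MSFT, y = AAPL)) +
  geom_point() +
  labs(
    x = "Microsoft open price ($)",
    y = "Apple open price ($)",
    title = "January 2020 open prices"
  )
```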
Before we get to fitting the best model, let’s fit “some” model, say with intercept = -5 and slope = 0.5.
\[
\begin{aligned}
\hat{y} &= \hat{\beta}_0 + \hat{\beta}_1 ~ x \\
\hat{y} &= -5 + 0.5 ~ x
\end{aligned}
\]
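One way to overlay this hand-picked line on the scatterplot, continuing with the assumed `stocks` / `MSFT` / `AAPL` names from above:

```r
# Add the hand-picked line (intercept -5, slope 0.5) to the scatterplot
stocks |>
  ggplot(aes(x = MSFT, y = AAPL)) +
  geom_point() +
  geom_abline(intercept = -5, slope = 0.5, color = "blue")
```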
Population:
\[ y = \beta_0 + \beta_1 ~ x + \epsilon \]
Sample: \[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 ~ x \]
The difference between what we observe and what we predict, \(y - \hat{y}\), is known as a residual, \(e\).
More concisely,
\[ e = y - \hat{y} \]
Residuals depend on the line we draw. Visually, here is the model \(\hat{y} = -5 + 0.5 ~ x\) plotted over the data, with one of the residuals outlined in red.
There is, in fact, a residual associated with every single point in the plot.
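Continuing the sketch with the assumed names above, the residual for every point under the hand-picked line comes straight from \(e = y - \hat{y}\):

```r
# Residual for every observation under the hand-picked line
stocks |>
  mutate(
    y_hat    = -5 + 0.5 * MSFT,  # prediction from the hand-picked line
    residual = AAPL - y_hat      # e = y - y_hat
  )
```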
We often wish to find a line that fits the data “really well”, but what does this mean? Well, we want small residuals! So we pick an objective function. That is, a function we wish to minimize or maximize.
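A common choice of objective function here is the sum of squared residuals, which the least-squares line minimizes. As a sketch (again with the assumed `stocks` / `MSFT` / `AAPL` names), we can compute it for the hand-picked line and compare with the least-squares fit from tidymodels:

```r
# Sum of squared residuals for the hand-picked line
stocks |>
  summarize(ssr = sum((AAPL - (-5 + 0.5 * MSFT))^2))

# Least-squares fit: the intercept and slope that make this sum as small as possible
linear_reg() |>
  fit(AAPL ~ MSFT, data = stocks) |>
  tidy()
```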
Go to Posit Cloud and start the project titled ae-06-Stocks.