Project 1 - Regression

Due Friday, October 13 at 5 pm

Team assignments are posted at https://canvas.duke.edu/courses/4625/files?preview=338995.

The deliverable for this project (what you will turn in) is a written report. See below for more details.

Description

For this project you will create a linear model (of your choosing) to predict newborn birth weights (or some transformation of the variable) using a sample of observations from 2020 US birth data originally sourced from the Centers for Disease Control (CDC).

Data

You will use the data set us_births.csv to fit your model. You can load it using the code below.

library(tidyverse)
births <- read_csv("data/us_births.csv")

This data set contains 3000 observations of 16 variables. The variables are described in the codebook below.

Codebook

  • newborn_birth_weight: newborn birth weight in grams (response)
  • month: birth month (1: January, …, 12: December)
  • mother_age: age of the mother in years
  • prenatal_care_starting_month: month in which prenatal care began; if 0, there was no prenatal care
  • daily_cigarette_prepregnancy: daily number of cigarettes smoked before the pregnancy
  • daily_cigarette_trimester_1: daily number of cigarettes smoked during the 1st trimester of the pregnancy
  • daily_cigarette_trimester_2: daily number of cigarettes smoked during the 2nd trimester of the pregnancy
  • daily_cigarette_trimester_3: daily number of cigarettes smoked during the 3rd trimester of the pregnancy
  • mother_height: height of the mother in inches
  • mother_bmi: body mass index of the mother
  • mother_weight_prepregnancy: weight of the mother before the pregnancy in pounds
  • mother_weight_delivery: weight of the mother at delivery in pounds
  • mother_diabetes_gestational: whether the mother had diabetes during the pregnancy
  • newborn_sex: sex of the newborn
  • gestation_week: number of gestational weeks
  • mother_risk_factors: whether the mother had any risk factor (diabetes, hypertension, previous preterm birth, previous cesarean, infertility treatment used, etc)

Testing data

You are additionally provided testing data to evaluate the predictive ability of your model, using 1,000 observations contained in the test data set below.

births_test <- read_csv("data/us_births_test.csv")

Report

Your report must be written using Quarto. Your written report should not exceed five pages inclusive of all tables and figures. You can access a template for that report in Posit Cloud in the project titled project-1.

To begin, edit the YAML at the top to specify a project title, your team name (get creative!), and the names of each group member, e.g.,

---
title: "Predicting baby weights"
author: "The Last Rbenders: Aang, Katara, Sokka, Momo"
---

All team members must contribute to the report.

Before you finalize your report, make sure the printing of code chunks is turned off by setting echo to false in your document YAML.

You should include, at a minimum, the following sections in your report. The report will be scored out of 50 points.

Introduction (7 pts)

The introduction provides motivation and context for your research.

To begin, introduce the data set in a few short sentences. Next, create a code book (aka a “data dictionary”) of the variables in the data set. Although a code book is provided above, you should include one in your report as well so that your report is self-contained. Specifically, only include in your report a code book of the variables that you use.

Complete the introduction by providing a concise, clear statement of your research question and hypotheses. Be sure to motivate why the research question is interesting/useful.

Example research question and hypotheses (if we were predicting penguin weights instead of baby weights):

Can we predict body mass with bill depth? We hypothesize that penguins with deeper bills will also have more mass.

Methodology (15 pts)

In this section, you should introduce your regression model. Specifically, do you use regular linear regression or logistic regression? Why? What’s the full model equation?

Did you choose to leave out some predictors but include others? Why or why not? You can also include summary statistics or create plots to explore the data and help inform your model choices.

Do you fit multiple models? If so, why?

Do you compare them using \(R^2\), another metric, why?

Results (15 pts)

Report and interpret your final fitted model equation(s).

Report any relevant metrics you use to select between models.

Place figure(s) here to illustrate the main results from your analysis. One elegant visualization is worth more than several poorly formatted figures. You must have at least one visualization.

Provide only the main results from your analysis. The goal is not to do an exhaustive data analysis (calculate every possible statistic and create every possible model for all variables). Rather, you should demonstrate that you are proficient at asking meaningful questions and answering them using data, that you are skilled in writing about and interpreting results, and that you can accomplish these tasks using R. More is not better.

How well does your model predict new data? Use the testing data to calculate a measure of the predictive performance of your model and interpret that in the context of your data.

Discussion (6 pts)

This section is a conclusion and discussion. You should

  1. Summarize your main finding in a sentence or two.

  2. Discuss your finding and why it is useful (put in the context of your motivation from the introduction).

  3. Critique your own analyses and include a brief paragraph on what you would do differently if you were able to start the project over.

Appendix (2 pts)

List a brief (1 or 2 sentence) summary of the relative contributions of each team member, e.g., “Aang built the models, Katara implemented them in R, and Sokka wrote the introduction and discussion.”

Important

All team members should be comfortable describing all aspects of the project and understanding all code.

Formatting (5 pts)

Your project should be professionally formatted. For example, this means labeling graphs and figures, turning off code chunks, using proper citations and cross-references, and following typical style guidelines.

Submission

  • Select one team member to upload the team’s PDF submission to Gradescope.

  • Be sure to select every team member’s name in Gradescope.

  • Associate all pages with “Full report”.