Lesson 5.A.1 - Describing Two Quantitative Variables
Key Question: Would raising attendance also raise test scores?
Content: Describing Scatterplots | Correlation Coefficient (r) | Correlation vs Causation
Alignment: CED Topics 5.1-5.2
Course Resources
Resources for teaching our AP® Statistics curriculum.
- Lesson Flow - timing and flow of class, using our lesson materials
- Pacing Guide - pacing our units, with daily or block schedules
- CED Alignment Guide - aligning our lessons to the AP® Statistics Course and Exam Description
Teaching Resources
Resources for teaching with Skew The Script.
- Discussion Norms - our model discussion norms for the classroom
- Letter to Parents - letter to share with parents about our nonpartisan approach
- Teaching Math on Civic Topics - tips for teaching math lessons that cover civic topics
Lesson Notes
Lesson-specific insights from the creators of this lesson.
Higher-income students tend to perform better on standardized tests such as the SAT and ACT. This could be due to several factors, including the ability to move to areas with higher-performing schools and to afford test prep tutoring. This raises a question for colleges evaluating applicants: how should they consider and compare student test scores?
- Identify the slope and y-intercept in a linear regression model
- Use a linear regression model to make predictions, differentiating between extrapolation and interpolation
- Calculate and interpret residuals
- Determine the appropriateness of a linear regression model based on a residual plot
Before proceeding: Familiarize yourself with the lesson materials linked above (e.g. handout, handout key, slides, video). Then, for additional background and teaching tips from the lesson creators, check out the sections below.
- After introducing the Key Question (“How should colleges evaluate test scores?”), the lesson shifts to a smaller and more accessible data set involving attendance and test scores. Students use this context to learn about linear regression models, predictions, residuals, and residual plots before returning to the Key Question during the discussion. Providing this framing early can help students see how the concepts connect to the larger question and maintain engagement throughout the lesson.
- This lesson builds directly on the previous lesson’s introduction to scatterplots, correlation, and association. The focus now shifts from describing relationships to using and evaluating linear models. Students learn how to interpret a model, make predictions, and assess whether a linear relationship provides a reasonable representation of the data. The mechanics of calculating the least-squares regression line are intentionally deferred to a later lesson so that the emphasis remains on interpretation and model evaluation.
First, download this lesson's Handout Key and read through its Discussion Question section. Then, check out our model discussion norms and the additional background notes below.
- In 2020, the University of California released a report examining the role of standardized tests in admissions. Since then, the university system has adopted a test-blind admissions policy. More recently, faculty have begun reexamining the role of standardized testing, citing concerns about college-level mathematics readiness. In June, the UC Academic Senate officially created a faculty work group to evaluate the return of standardized testing. This evolving discussion provides an authentic context for exploring how statistical models can be used to evaluate the college admissions process.
- The discussion question creates an opportunity to consider how residuals can be used to evaluate student performance relative to expectations rather than relying solely on raw scores. Because residuals measure the difference between an observed value and the value predicted by a model, they provide a way to compare outcomes after accounting for factors included in the model. In this context, residuals represent how much a student overperformed or underperformed relative to the score predicted by their family income. Encourage students to consider both the advantages and limitations of this approach.
- The lesson intentionally avoids presenting a single “correct” answer to the fairness question. Instead, the goal is to evaluate the strengths and weaknesses of a proposed model and consider how different assumptions can influence conclusions.
- The attendance and test score data set provides a simple context for introducing linear regression concepts. Because the relationship is strong and approximately linear, students can focus on interpreting models, making predictions, and understanding residuals without becoming distracted by more complex patterns in the data.
- The University of California admissions context demonstrates how statistical models are used to make decisions in real-world settings. The lesson provides an opportunity to discuss how models can be informative and useful while still having limitations that should be considered carefully.
- Students may wonder why statisticians typically write linear regression models in the form ŷ = a + bx rather than y = mx + b. This notation stays consistent with more complex models, which begin with an intercept term and then add explanatory components.
- It is critical to distinguish between y and ŷ and to use the notation accurately. For a given x-value, y represents the observed value while ŷ represents the value predicted by the model. This distinction becomes especially important when interpreting residuals.
- When working with statistical models, it is important to recognize that models are not perfect representations of reality. Instead, they provide useful approximations based on available information. The aphorism “All models are wrong, but some are useful” provides a helpful summary of this idea.
- Interpolation and extrapolation are not equally reliable. Predictions made within the range of observed x-values rely on relationships supported by the data, while extrapolated predictions assume that the observed trend continues beyond the available data. The obesity example in the lesson provides a useful illustration of why such assumptions can be problematic.
- Residual plots are often more informative than the original scatterplot when evaluating whether a linear model is appropriate. Systematic patterns in the residual plot indicate that important structure remains unexplained by the model, while residuals that appear randomly scattered around zero provide evidence that a linear model is reasonable.
Student Supports
Lesson-specific resources to support all learners.
- Reinforce that order matters when calculating residuals. Residuals are always calculated as observed minus predicted (y − ŷ), and the sign should be retained. Positive residuals indicate underprediction by the model, while negative residuals indicate overprediction. Marking residuals directly on a scatterplot can help make this interpretation more intuitive.
- Scatterplots often do not begin at the origin, which is an important consideration when interpreting the y-intercept. The y-intercept is defined as the predicted value when x = 0, but if the graph does not display x = 0, the intercept may not appear as the point where the line crosses the visible axis.
- In AP Statistics, residual plots are almost always provided. Therefore, students can focus on understanding how to interpret these graphs rather than learning to construct them.
- A residual plot is a useful tool for evaluating a model and identifying patterns that may not be obvious in the original scatterplot. Other considerations include whether the relationship appears linear, the presence of influential outliers, the magnitude of residuals, and whether predictions are being made through interpolation or extrapolation. These factors should be considered together when evaluating a model. This lesson introduces residual plots as one evaluation tool, while several of these additional considerations will be developed further in later lessons.
- Predictions made through interpolation are generally more reliable than predictions made through extrapolation. Encouraging students to identify the observed range of x-values before making predictions can help reinforce this distinction.
- When interpreting regression models, encourage students to include context throughout their explanations. Slopes, intercepts, predictions, and residuals should all be interpreted in terms of the variables being studied rather than described only in mathematical terms.
- Vocabulary used in this lesson may include words that are unfamiliar or have several meanings. In particular, the following mathematical terms may need to be clarified or defined:
- In addition, the following contextual terms may need to be clarified or defined:
- The parameter discussed in this lesson is the population mean height (μ or “mu”). The statistic discussed is the sample mean height (x̄ or “x-bar”). You can note for students that parameters are often represented by Greek letters. Statistics are often represented by letters with symbols above them. We’ll see more examples as we proceed through the course.
- Parameters are numerical attributes that describe a whole population. Statistics are numerical attributes that describe a measured sample. One nice way to remember these: parameters describe populations, so “p” matches with “p.” And statistics describe samples, so “s” matches with “s.”
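The residual and prediction supports earlier in this list (residuals as observed minus predicted, and checking the observed range of x-values before predicting) can be sketched in a few lines of Python. The attendance values, scores, and model coefficients below are hypothetical, chosen only for illustration; they are not the lesson's data or its fitted model.

```python
# Hypothetical data: attendance rate (%) and test score; not the lesson's data.
attendance = [70, 75, 80, 85, 90, 95]
scores = [64, 65, 71, 73, 79, 80]

# Hypothetical model y-hat = a + b*x (coefficients assumed, not fitted here).
a = 10.0   # y-intercept: predicted score when attendance is 0
b = 0.75   # slope: predicted change in score per 1-point increase in attendance


def predict(x):
    """Return y-hat, the score predicted by the model for attendance x."""
    return a + b * x


def residual(x, y):
    """Return observed minus predicted (y - y-hat), keeping the sign."""
    return y - predict(x)


def is_extrapolation(x, observed_x):
    """Return True when x lies outside the range of observed x-values."""
    return x < min(observed_x) or x > max(observed_x)


def describe(r):
    """Translate the sign of a residual into words."""
    if r > 0:
        return "model underpredicted"
    if r < 0:
        return "model overpredicted"
    return "exact"


for x, y in zip(attendance, scores):
    r = residual(x, y)
    print(f"x = {x}: y = {y}, y-hat = {predict(x)}, residual = {r:+} ({describe(r)})")

for x_new in (82, 40):
    kind = "extrapolation" if is_extrapolation(x_new, attendance) else "interpolation"
    print(f"Prediction at x = {x_new}: {predict(x_new)} ({kind})")
```

Here x = 82 falls inside the observed range of 70 to 95, so that prediction is an interpolation, while x = 40 falls outside it and should be treated with more caution. Note also that the intercept of 10 describes x = 0, far below any observed attendance value, which is why it would not appear where the line meets the edge of a typical scatterplot of these data.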