Lesson 1.A.1 - Populations, Samples, & Data

Key Question: Can data about colleges be misleading?

Content: Populations & Samples | Classifying Variables | Parameters & Statistics

Alignment: CED Topics 1.1 - 1.2

Video

Student Items

Handout: pdf, doc

Mastery Check: link

Teacher Items

Handout Key: pdf, doc

Mastery Check Key: link

Slide Deck: pdf, ppt

Course Resources

Resources for teaching our AP® Statistics curriculum.

  • Lesson Flow - timing and flow of class, using our lesson materials
  • Pacing Guide - pacing our units, with daily or block schedules
  • CED Alignment Guide - aligning our lessons to the AP® Statistics Course and Exam Description

Teaching Resources

Resources for teaching with Skew The Script.

Lesson Notes

Lesson-specific insights from the creators of this lesson.


This lesson welcomes students to the study of statistics by inviting them to see how common sense might not be so common — or so sensical. First, they investigate a famous World War II problem with a compelling story and a surprising solution. Then, they’re challenged to apply the same logic to a context that’s closer to home: the (sometimes misleading) data shared by colleges to recruit new applicants. Ultimately, students discover that statistics isn’t really about calculating means, medians, and modes. It’s about statistical thinking — a way of reasoning about evidence that can show them what’s really happening behind the curtain. After this class, they’ll never look at numbers, data, or the world around them the same way ever again. In the words of the modern philosopher John McClane: “Welcome to the party, pal!”

Learning Targets
  • Distinguish a sample from a population
  • Classify types of variables
  • Describe the relationship between parameters and statistics

Before proceeding: Familiarize yourself with the lesson materials linked above (e.g. handout, handout key, slides, video). Then, for additional background and teaching tips from the lesson creators, check out the sections below.


  • When setting up the World War II problem, invest time into telling Abraham Wald’s story and why solving the bomber problem — like every problem he solved during the war — was personal for him. Telling his story is key to investing students in this historic problem. For a model, see how Mr. Young-Saver shares the story from 2:59 - 4:16 in the lesson video. In addition, the segment from 5:04 - 6:23 in the video also shows a model of the “big reveal” of Wald’s answer and the reasoning behind it.
  • In the lesson video, Mr. Young-Saver also shares how Abraham Wald’s story relates to his own family’s personal history. This isn’t necessary to share when you’re presenting the lesson in your own class. But if you assign students to watch the lesson video, this can provide another powerful moment.
  • Abraham Wald’s key insight was that the sample of planes that returned from fighting was not representative of the full population of planes that embarked on the bombing mission. In statistics, we call this kind of unrepresentative sample a biased sample. We’ll cover the idea of bias thoroughly later in the course. For now, rather than add more vocabulary (e.g. “bias”), it’s best to simply emphasize that samples can sometimes provide distorted pictures of populations. This idea will return in the lesson’s Discussion Question.
  • Make sure to spend time on the most subtle variables to classify in the student data set: shoe size and zip code. The shoe size example helps students see that values with decimals can still be discrete. The zip code example helps students see that values written with digits can still be categorical.
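
For instructors who want to see the classification idea concretely, here is a minimal Python sketch. The shoe sizes and zip codes are made-up values, not the lesson's student data set, and the variable names are our own assumptions. It checks that shoe sizes move in half-size steps (discrete, despite the decimals) and shows that averaging zip codes yields a number with no real-world meaning.

```python
from statistics import mean

# Hypothetical values for illustration; not the lesson's actual student data set.
shoe_sizes = [7.5, 8.0, 9.5, 10.0, 8.5]
zip_codes = ["02139", "10027", "60614", "78224", "94301"]

# Shoe sizes come only in half-size steps, so doubling each one gives a whole number.
# Countable, separated values like these are discrete, even though they have decimals.
sizes_are_half_steps = all((2 * size).is_integer() for size in shoe_sizes)

# Zip codes are labels for places. Python will average them if we convert them
# to integers, but the result does not describe a typical student or location.
zip_average = mean(int(code) for code in zip_codes)

print("Shoe sizes move in half steps:", sizes_are_half_steps)
print("Average of the zip codes (not meaningful):", zip_average)
```

Storing zip codes as text rather than numbers also preserves leading zeros (e.g. 02139), which is another hint that they behave like labels.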

First, download this lesson's Handout Key and read through its Discussion Question section. Then, check out our model discussion norms and the additional background notes below.

  • It’s important to emphasize that Palo Alto College is a community college. One key function of community colleges is to be a stepping stone towards transferring to a four-year college. So, both outcomes described here - obtaining full-time work and proceeding to the next level of schooling - are generally considered positive outcomes.
  • When brainstorming ways that the statement may be misleading, students sometimes discuss whether the “full-time work” consists of graduates’ first-choice careers or merely “fallback” jobs outside their fields of study. Students also sometimes bring up whether the “next level of schooling” consists of graduates’ first-choice schools or “fallback” schools. These are valid questions and could point out additional issues beyond the practice of sampling just the graduates.
  • Graduation rate data can be found for almost every U.S. college through the IPEDS website. If there’s a college near your community that shares job placement rates only among graduates, you can use the IPEDS website to look up its graduation rate. If the graduation rate is low, consider replacing the Palo Alto College example with the more local college, to add extra relevance for students.
  • Abraham Wald’s original analysis utilized far more than 10 planes. In the lesson, we use the hypothetical example of 10 planes that went on a specific bombing mission (6 of which returned) in order to provide a simple example of sample size and population size for students.
  • The nose is one of several possible places where Abraham Wald may have actually told the Allied Forces to put extra armor. The only public reprint of Wald’s original analysis can be found here. In the paper, Wald shares a “hypothetical example” in which the engines are the most vulnerable part of the planes. However, this example is hypothetical - the real data may have been classified at the time. That real data is either still classified or lost to history. So, you’ll find variations of the airplane problem in which different areas are identified as the vulnerable regions on the planes. In our lesson, we chose the nose to honor Mr. Young-Saver’s former professor Joseph Blitzstein, who uses the nose in his retelling of the problem. Ultimately, every retelling has the same conclusion: we should put armor where we observe the fewest shots - not the most.
  • The Abraham Wald example highlights an idea that we’ll formally introduce later in the course: sampling bias. The planes that returned are not representative of all the planes that embarked on the mission. This bias informs Wald’s conclusion to reinforce the areas with the fewest shots – not the most. The specific type of sampling bias on display here is called undercoverage bias – bias that arises when “the sampling method fails to include part of the population or a part of the population is less likely to be selected based on the sampling method” (AP Statistics CED, Topic 1.12). This is the same type of bias on display in the Discussion Question, as the sampling method fails to include the students who don’t graduate. We’ll discuss bias later in the course. It does not need to be explicitly defined in this lesson.
  • The student data set in the lesson is provided in tidy format. Students in AP Statistics do not need to know what tidy format is; however, it’s helpful for instructors interested in preparing their own data sets to be familiar with tidy format. Data provided in tidy format follow three basic rules (shown below). When these rules are not followed, data sets can be difficult to manipulate, visualize, and analyze. Here are the rules of tidy data:
    • Every column is a variable.
    • Every row is an observation.
    • Every cell is a single value.
  • Note that the far left column in the student data set, which shows the student names, is not a variable. Rather, it’s a unique identifier. Variables are descriptive values that vary between observational units (e.g. height, dominant hand, shoe size). It’s possible for two observational units to share the same variable values (e.g. two students share the same height or shoe size). Unique identifiers are non-descriptive names, codes, or numbers whose sole function is to identify a unique observation in a data set. In this case, the names provide a way to uniquely identify each student in the data set.
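
For instructors preparing their own data sets, the sketch below shows one way the tidy rules and the identifier-versus-variable distinction might look in Python. The three students, their values, and the column names (`name`, `height_in`, `dominant_hand`, `shoe_size`) are hypothetical and are not taken from the lesson's data set.

```python
# Hypothetical records for illustration; not the lesson's actual student data set.
# Tidy layout: each dict is one row (one student, the observational unit),
# each key is one column, and each cell holds a single value.
students = [
    {"name": "Ana", "height_in": 64, "dominant_hand": "right", "shoe_size": 7.5},
    {"name": "Ben", "height_in": 70, "dominant_hand": "left", "shoe_size": 10.0},
    {"name": "Chloe", "height_in": 64, "dominant_hand": "right", "shoe_size": 7.5},
]

# The name column identifies rows; every other column is a variable.
identifier = "name"
variables = [column for column in students[0] if column != identifier]

# A unique identifier never repeats across rows.
names = [row[identifier] for row in students]
identifier_is_unique = len(names) == len(set(names))

# Variable values may repeat: Ana and Chloe share the same height.
heights = [row["height_in"] for row in students]
heights_repeat = len(heights) != len(set(heights))

# Every cell is one number or one piece of text, never a list or nested container.
single_valued = all(
    isinstance(value, (int, float, str))
    for row in students
    for value in row.values()
)

print("Variables:", variables)
print("Identifier is unique:", identifier_is_unique)
print("Some heights repeat:", heights_repeat)
print("Every cell holds a single value:", single_valued)
```

A list of dictionaries is only one possible representation; a spreadsheet with one header row and one student per row follows the same three rules.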

Student Supports

Lesson-specific resources to support all learners.

  • When breaking down the table of student data, it’s worth emphasizing the differences between rows (horizontal, left-and-right) and columns (vertical, up-and-down), as this distinction in language will help students have clearer conversations about tables throughout the course. A good visual analogy: To “row” a boat, you extend oars out to the left and right sides of a boat. To support a structure, you build “columns” that extend up and down through the building.
  • For distinguishing between categorical and quantitative data, it can sometimes be helpful to ask students: “If you took these values and found their average, would that average be meaningful?” In the case of zip codes, finding the average value of many zip codes wouldn’t be very meaningful. So, that’s an indicator that zip codes are categorical.
  • When showing the sample mean, the lesson assumes that students will have some prior experience with calculating the average of a set of numbers. If they do not have prior experience or would benefit from a refresher, have students check out 3:12 - 3:40 from the 1.A.4 lesson video.
  • Vocabulary used in the context of the lesson may include words that are unfamiliar or have several meanings. In particular, the following mathematical terms may need clarification or a definition provided:
    • Observational unit
    • Horizontal
    • Row
    • Variable
    • Vertical
    • Column
    • Quantitative
    • Categorical data
    • Average
  • In addition, the following contextual terms may need clarification or a definition provided:
    • Job placement rate
  • The parameter discussed in this lesson is the population mean height (μ or “mu”). The statistic discussed is the sample mean height (x̄ or “x-bar”). You can note for students that parameters are often represented by Greek letters. Statistics are often represented by Latin letters, sometimes with symbols above them (like the bar in x̄). We’ll see more examples as we proceed through the course.
  • Parameters are numerical attributes that describe a whole population. Statistics are numerical attributes that describe a measured sample. One nice way to remember these: parameters describe populations, so “p” matches with “p.” And statistics describe samples, so “s” matches with “s.”
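
To make the parameter/statistic distinction concrete, here is a small Python sketch under stated assumptions: the 12 heights are invented, the sample size of 4 is arbitrary, and the names `mu` and `x_bar` are our own labels for μ and x̄. It computes the mean of the whole population (the parameter) and the mean of one random sample drawn from it (the statistic).

```python
import random
from statistics import mean

# Hypothetical heights (in inches) for a population of 12 students.
population_heights = [60, 62, 63, 64, 65, 66, 66, 67, 68, 70, 71, 72]

# Parameter: a number that describes the whole population.
mu = mean(population_heights)

# Statistic: a number that describes one sample drawn from that population.
# The seed is fixed only so that repeated runs pick the same sample.
rng = random.Random(1)
sample_heights = rng.sample(population_heights, 4)
x_bar = mean(sample_heights)

print(f"Population mean (mu): {mu:.2f}")
print(f"Sample: {sample_heights}")
print(f"Sample mean (x-bar): {x_bar:.2f}")
```

Changing the seed will typically change the sample and therefore x̄, while μ stays fixed. That contrast is the relationship between statistics and parameters that the learning target asks students to describe.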