IB Maths AA HL · Topic 4: Statistics & Probability · Paper 1 & 2

Sampling & Data Collection

Every statistical analysis starts with the data: how it was collected, what kind it is, and how representative it is of the bigger picture. The five sampling methods (simple random, systematic, stratified, quota, convenience) each have a use case; picking the wrong one introduces bias that no amount of clever analysis can fix.

📘 What you need to know

Types of data

Qualitative: words or categories, e.g., eye colour, favourite sport, brand of phone
Quantitative: numbers, e.g., number of siblings, height, time taken

Quantitative data splits further:

Discrete: specific values, counted; e.g., 0, 1, 2, 3 pets, but not 1.7 pets
Continuous: any value in a range, measured; e.g., 1.732 m height, 12.45 seconds

Watch out for “age”: it can be either, depending on context. “How many years old” is discrete (whole years); “how long someone has been alive” is continuous (precise time).

Population vs sample

The population is the full group you’re interested in (e.g., all French bulldogs); the sample is a smaller subset chosen to represent it. Sampling is faster and cheaper than studying the whole population, but it introduces uncertainty and risk of bias.

The five sampling methods

| Method | How it works | Needs sampling frame? | Best for… |
|---|---|---|---|
| Simple random | number every member, pick n at random | yes | small populations, no obvious structure |
| Systematic | pick every k-th member after a random start | yes | natural ordering (e.g., production line) |
| Stratified | split into groups, sample each in proportion | yes (per group) | distinct sub-groups in the population |
| Quota | split into groups, sample a fixed number from each (not random) | no | street surveys, no full list available |
| Convenience | sample whoever is easiest to reach | no | quick rough estimates; high bias risk |

Stratified sample, number from each group: number from group = (n / N) × group size, where n is the size of the sample and N is the size of the population.
Systematic interval: k = N / n, where N is the size of the population and n is the size of the sample.

Reliability of data

A sample is reliable if a different sample from the same population would give similar results. Reliability improves with: a larger sample, a sampling method that minimises bias, accurate recording, and a high response rate. Bias creeps in via leading questions, self-selection (only enthusiasts respond), or excluding parts of the population entirely.
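
As a rough illustration of why a larger sample tends to be more reliable, the sketch below draws many simple random samples from a made-up population and compares how much the sample means vary. The population (10 000 heights generated from a normal distribution with mean 170 cm and standard deviation 10 cm), the seed, and the function name `spread_of_sample_means` are assumptions for illustration only.

```python
import random
import statistics

def spread_of_sample_means(population, sample_size, repeats, rng):
    """Return the standard deviation of the means of `repeats` random samples."""
    means = []
    for _ in range(repeats):
        sample = rng.sample(population, sample_size)  # simple random sample, no repeats
        means.append(statistics.mean(sample))
    return statistics.stdev(means)

rng = random.Random(42)  # fixed seed so the run is repeatable
# Hypothetical population: 10 000 heights in cm (made-up parameters)
population = [rng.gauss(170, 10) for _ in range(10_000)]

small = spread_of_sample_means(population, 10, 200, rng)
large = spread_of_sample_means(population, 1000, 200, rng)
print(f"spread of sample means, n = 10:   {small:.2f}")
print(f"spread of sample means, n = 1000: {large:.2f}")
```

The smaller the spread, the more closely two different samples of that size would be expected to agree, which is the sense of "reliable" used above. The sketch only covers sample size; it does not model bias from leading questions or self-selection.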

🧭 Recipe β€” choosing a sampling method

  1. Is a sampling frame (list of all members) available? If no → quota or convenience.
  2. Are there distinct sub-groups (year groups, departments, age brackets)? If yes → stratified.
  3. Is there a natural order (conveyor belt, alphabetical list)? If yes → systematic is quick.
  4. If the population is small and unstructured → simple random sampling.
  5. Always: prefer methods that give every member a fair chance of selection.
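
The recipe above can be read as a short chain of yes/no checks. The sketch below is one possible encoding; the function name `choose_sampling_method` and its three boolean parameters are hypothetical, and real exam questions usually need a justification in words rather than a lookup.

```python
def choose_sampling_method(has_frame, has_subgroups, has_natural_order):
    """Follow the recipe's questions in order and return a suggested method."""
    if not has_frame:
        return "quota or convenience"   # step 1: no list of all members
    if has_subgroups:
        return "stratified"             # step 2: distinct sub-groups
    if has_natural_order:
        return "systematic"             # step 3: natural ordering
    return "simple random"              # step 4: small, unstructured population

# A school with a full register and distinct year groups
print(choose_sampling_method(has_frame=True, has_subgroups=True, has_natural_order=False))
# prints: stratified
```

The order of the checks matters: the question about a sampling frame comes first because the other three methods all rely on having one.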

Worked examples

WE 1

Classify each variable as qualitative or quantitative; if quantitative, state discrete or continuous

(a) Hair colour. (b) Number of cars in a household. (c) Weight of a parcel in kg. (d) Shoe size.

(a) Hair colour: describes a category → qualitative
(b) Number of cars: a count of distinct values (0, 1, 2, 3, …) → quantitative, discrete
(c) Weight in kg: a measurement (any value in a range) → quantitative, continuous
(d) Shoe size: specific values (5, 5.5, 6, 6.5, …) → quantitative, discrete

Answer: (a) qualitative; (b) quantitative, discrete; (c) quantitative, continuous; (d) quantitative, discrete
Tip: measured = continuous; counted = discrete

WE 2

Stratified sampling across three year groups

A school has 240 students in Year 10, 200 in Year 11, and 160 in Year 12 (total 600). The headteacher wants a stratified sample of 30 students. How many students should be sampled from each year group?

Step 1: Sample fraction
30/600 = 1/20 (so 1 in every 20 students)
Step 2: Apply to each year group
Year 10: (30/600) × 240 = 12
Year 11: (30/600) × 200 = 10
Year 12: (30/600) × 160 = 8
Step 3: Verify the total
12 + 10 + 8 = 30 ✓

Answer: 12 from Y10, 10 from Y11, 8 from Y12
Tip: within each group, choose the actual students randomly

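
The same arithmetic can be checked with a few lines of Python. This is a minimal sketch: the function name `stratified_allocation` is hypothetical, and it assumes the counts come out as whole numbers, as they do for this school.

```python
from fractions import Fraction

def stratified_allocation(group_sizes, sample_size):
    """Number to sample from each group = (n / N) * group size."""
    population_size = sum(group_sizes.values())
    fraction = Fraction(sample_size, population_size)  # exact, avoids float rounding
    return {name: fraction * size for name, size in group_sizes.items()}

year_groups = {"Year 10": 240, "Year 11": 200, "Year 12": 160}
allocation = stratified_allocation(year_groups, 30)
for name, count in allocation.items():
    print(name, count)
print("total:", sum(allocation.values()))
```

Using `Fraction` keeps the results exact, so a group whose share is not a whole number would show up as a fraction and signal that rounding is needed; the sketch does not do that rounding.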
WE 3

Systematic sampling on a production line

A factory produces 2000 toys per day on a single conveyor belt. Quality control wants to sample 80 toys using systematic sampling. Describe the procedure.

Step 1: Compute the interval k = N/n
k = 2000/80 = 25
Step 2: Pick a random start s between 1 and 25
e.g., s = 7 (chosen randomly)
Step 3: Sample every 25th toy after s
Selected: 7th, 32nd, 57th, 82nd, …, 1982nd

Answer: k = 25; pick a random start in [1, 25], then every 25th toy after that
Tip: a conveyor belt has a natural order, so systematic sampling is ideal

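
The procedure can be sketched in Python as follows. The function name `systematic_sample` and its optional `start` parameter are hypothetical, and the sketch assumes N divides exactly by n, as it does here.

```python
import random

def systematic_sample(population_size, sample_size, start=None):
    """Return the 1-based positions chosen by systematic sampling."""
    k = population_size // sample_size   # interval k = N / n (assumed to divide exactly)
    if start is None:
        start = random.randint(1, k)     # random start between 1 and k inclusive
    return list(range(start, population_size + 1, k))

# Reproduce the worked example by fixing the start at 7
positions = systematic_sample(2000, 80, start=7)
print(positions[:4], positions[-1], len(positions))
# prints: [7, 32, 57, 82] 1982 80
```

Leaving `start` unset gives a genuinely random start, which is what the method requires in practice; it is fixed here only so the output matches the example.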
WE 4

Identify the sampling method used in each scenario

(a) A market researcher stands outside a mall and asks the first 50 shoppers who pass by. (b) Every employee is given a unique ID number; 30 IDs are then chosen using a random number generator. (c) A city is divided by district, and each district is sampled in proportion to its population.

(a) Asking whoever passes by (easiest to reach) → convenience sampling
(b) Numbered list + random selection of n IDs → simple random sampling
(c) Districts = strata; proportional sampling from each → stratified sampling

Answer: (a) convenience; (b) simple random; (c) stratified
Tip: key cue words: “first who pass” = convenience; “random number generator” = simple random; “in proportion” = stratified

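
Scenario (b) is the one that maps most directly onto code. The sketch below assumes a hypothetical company of 500 employees with IDs 1 to 500 (the question does not give the size) and uses a seeded generator only so that the run is repeatable.

```python
import random

employee_ids = list(range(1, 501))      # hypothetical: 500 employees, IDs 1..500
rng = random.Random(2024)               # seeded only so the example is repeatable
chosen = rng.sample(employee_ids, 30)   # 30 distinct IDs, each equally likely
print(sorted(chosen))
```

`sample` selects without replacement, so no employee can be picked twice, which is what a simple random sample of people requires.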
WE 5

Identify a source of bias and suggest an improvement

To estimate the average height of adults in a city, a researcher measures everyone leaving a basketball stadium after a game. (a) State why this sampling method is likely to be biased. (b) Suggest one improvement.

(a) The sample is not representative: basketball spectators (and players) are likely to be taller on average than the general population → the sample mean would overestimate the city’s true average height
(b) Improvement: use a stratified random sample by age and sex from a city-wide list (e.g., electoral register)

Answer: (a) basketball fans/players are unrepresentatively tall; (b) use stratified random sampling from a city-wide register
Tip: always ask: “is every section of the population given a fair chance?”

WE 6

Library survey β€” combined sampling problem

A library has 480 fiction books and 320 non-fiction books. The librarian wants a stratified sample of 40 books to inspect for damage. (a) How many of each type should be sampled? (b) The librarian decides instead to sample 40 books by walking through the library and picking up the first 40 books that look worn. State the name of this method and one disadvantage. (c) Suggest one way to make the original stratified sample more reliable.

(a) Apply the stratified formula
Total: 480 + 320 = 800; sample fraction = 40/800 = 1/20
Fiction: (40/800) × 480 = 24
Non-fiction: (40/800) × 320 = 16
Verify: 24 + 16 = 40 ✓
(b) “First 40 worn books” (easiest to reach) → convenience sampling
Disadvantage: heavily biased toward visibly damaged books, so it overestimates the damage rate
(c) Improvement to the stratified method: increase the sample size (e.g., to 80) to reduce variability

Answer: (a) 24 fiction + 16 non-fiction; (b) convenience, biased toward worn books; (c) larger sample size
Tip: three classic improvements: (i) larger sample, (ii) random within strata, (iii) repeat the survey
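
To show what "random within strata" might look like in practice, the sketch below allocates the 40 books in proportion and then picks the actual books at random inside each stratum. The catalogue labels (`F1`, `N1`, …), the seed, and the function name `draw_stratified_sample` are made up for illustration.

```python
import random

def draw_stratified_sample(strata, sample_size, rng):
    """Sample each stratum in proportion to its size, choosing items at random."""
    population_size = sum(len(items) for items in strata.values())
    sample = {}
    for name, items in strata.items():
        # (n / N) * group size; exact for these numbers, no rounding logic included
        count = sample_size * len(items) // population_size
        sample[name] = rng.sample(items, count)
    return sample

# Hypothetical catalogue labels standing in for the real books
strata = {
    "fiction": [f"F{i}" for i in range(1, 481)],
    "non-fiction": [f"N{i}" for i in range(1, 321)],
}
sample = draw_stratified_sample(strata, 40, random.Random(1))
print({name: len(books) for name, books in sample.items()})
# prints: {'fiction': 24, 'non-fiction': 16}
```

Because the books inside each stratum are chosen by the random generator rather than by how worn they look, this avoids the bias described in part (b).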

💡 Top tips

⚠ Common mistakes

Next: Measures of Central Tendency. Once data is collected, you summarise it. The three classical averages (mean, median, mode) each capture a different kind of “centre”, and each is right for different situations. Mean uses every value, median resists outliers, mode finds the most common.
