IB Maths AA HL | Topic 4: Statistics & Probability | Paper 1 & 2

Sampling & Data Collection

Every statistical analysis starts with the data — how it was collected, what kind it is, and how representative it is of the bigger picture. The five sampling methods (simple random, systematic, stratified, quota, convenience) each have a use case; picking the wrong one introduces bias that no amount of clever analysis can fix.

📘 What you need to know

Qualitative data describes (e.g., colour); quantitative data counts or measures.
Discrete data takes specific values (e.g., number of pets); continuous takes any value in a range (e.g., height).
Population = the whole set of interest; sample = a subset taken from it.
Sampling frame = list of all population members (needed for some methods).
Five sampling methods: simple random, systematic, stratified, quota, convenience — each with strengths and use cases.
Stratified sample size: number from a group = (n/N) × group size, where n is the size of the sample and N is the size of the population.
Systematic interval: k = N/n; pick random start in [1, k], then every k-th member.
Bias arises when some members are systematically over- or under-represented — sampling techniques aim to minimise it.

Types of data

Qualitative: words or categories (e.g., eye colour, favourite sport, brand of phone).

Quantitative: numbers (e.g., number of siblings, height, time taken).

Quantitative data splits further:

Discrete: specific values that are counted (e.g., 0, 1, 2, 3 pets, but not 1.7 pets).

Continuous: any value in a range, found by measuring (e.g., a height of 1.732 m, a time of 12.45 seconds).

Watch out for “age”: it can be either, depending on context. “How many years old” is discrete (whole years); “how long someone has been alive” is continuous (precise time).

Population vs sample

The population is the full group you’re interested in (e.g., all French bulldogs); the sample is a smaller subset chosen to represent it. Sampling is faster and cheaper than studying the whole population, but it introduces uncertainty and risk of bias.

The five sampling methods

Method	How it works	Needs sampling frame?	Best for…
Simple random	number every member, pick n at random	yes	small populations, no obvious structure
Systematic	pick every k-th member after a random start	yes	natural ordering (e.g., production line)
Stratified	split into groups, take a random sample from each in proportion to its size	yes (per group)	distinct sub-groups in the population
Quota	split into groups, sample a fixed number from each (not random)	no	street surveys, no full list available
Convenience	sample whoever is easiest to reach	no	quick rough estimates; high bias risk
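
As a rough illustration of the first row of the table, simple random sampling from a sampling frame can be sketched in Python. The frame of 200 ID numbers, the sample size of 10, and the function name are made-up values for illustration only; `random.sample` draws without replacement, so every member is equally likely and nobody is picked twice.

```python
import random

# Hypothetical sampling frame: ID numbers 1..200, one per member of the population.
sampling_frame = list(range(1, 201))

def simple_random_sample(frame, n, seed=None):
    """Pick n distinct members of the frame, each equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(frame, n)

chosen = simple_random_sample(sampling_frame, 10, seed=1)
print(sorted(chosen))
```

A convenience sample, by contrast, would be something like `sampling_frame[:10]`: quick to take, but the members at the end of the list never get a chance.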

Stratified sample, number from each group: number from group = (size of sample (n) / size of population (N)) × group size

Systematic interval: k = size of population (N) / size of sample (n)
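
The stratified formula can be checked with a few lines of Python. This is only a sketch: the department names and sizes are invented, and the function name `stratified_allocation` is an assumption, not part of any syllabus or library. Exact fractions are used so that no rounding happens behind the scenes.

```python
from fractions import Fraction

def stratified_allocation(group_sizes, n):
    """Return the exact proportional share (n / N) * group size for each group."""
    N = sum(group_sizes.values())   # size of population
    fraction = Fraction(n, N)       # sampling fraction n / N
    return {name: fraction * size for name, size in group_sizes.items()}

# Hypothetical population of 1000 split into three departments.
departments = {"Sales": 500, "Support": 300, "Engineering": 200}
shares = stratified_allocation(departments, 50)
print({name: int(share) for name, share in shares.items()})
# → {'Sales': 25, 'Support': 15, 'Engineering': 10}
```

Here the sampling fraction is 50/1000 = 1/20, so each group contributes one twentieth of its members and the shares add up to 50.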

Reliability of data

A sample is reliable if a different sample from the same population would give similar results. Reliability improves with: a larger sample, a sampling method that minimises bias, accurate recording, and a high response rate. Bias creeps in via leading questions, self-selection (only enthusiasts respond), or excluding parts of the population entirely.
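
One way to see why a larger sample tends to be more reliable is a small simulation. The sketch below assumes a made-up population of 5000 heights and repeatedly takes simple random samples of two different sizes; the population, the sample sizes, and the function name are all illustrative assumptions. The spread (standard deviation) of the sample means measures how much results change from one sample to the next.

```python
import random
import statistics

def spread_of_sample_means(population, n, repeats=200, seed=0):
    """Draw `repeats` simple random samples of size n and return the
    standard deviation of their means (smaller = more consistent results)."""
    rng = random.Random(seed)
    means = [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)

# Made-up population: 5000 heights in cm, roughly bell-shaped around 170.
pop_rng = random.Random(42)
heights = [pop_rng.gauss(170, 10) for _ in range(5000)]

spread_small = spread_of_sample_means(heights, 10)
spread_large = spread_of_sample_means(heights, 100)
print(f"n = 10:  spread of sample means = {spread_small:.2f} cm")
print(f"n = 100: spread of sample means = {spread_large:.2f} cm")
```

The second figure should come out noticeably smaller than the first: different samples of size 100 agree with each other more closely than different samples of size 10.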

🧭 Recipe — choosing a sampling method

Is a sampling frame (list of all members) available? If no → quota or convenience.
Are there distinct sub-groups (year groups, departments, age brackets)? If yes → stratified.
Is there a natural order (conveyor belt, alphabetical list)? If yes → systematic is quick.
If population is small and unstructured → simple random sampling.
Always: prefer methods that give every member a fair chance of selection.
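
The recipe reads naturally as a chain of yes/no questions. The function below is a simplified sketch that checks the questions strictly in the order listed; the function and parameter names are hypothetical, and real exam questions may call for more judgement than three flags can capture.

```python
def suggest_sampling_method(has_frame, has_subgroups=False, has_natural_order=False):
    """Follow the recipe's questions in order and return a suggested method."""
    if not has_frame:
        return "quota or convenience"   # no list of members available
    if has_subgroups:
        return "stratified"             # distinct sub-groups in the population
    if has_natural_order:
        return "systematic"             # e.g. conveyor belt, alphabetical list
    return "simple random"              # small, unstructured population

print(suggest_sampling_method(has_frame=False))                          # → quota or convenience
print(suggest_sampling_method(has_frame=True, has_subgroups=True))       # → stratified
print(suggest_sampling_method(has_frame=True, has_natural_order=True))   # → systematic
print(suggest_sampling_method(has_frame=True))                           # → simple random
```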

Worked examples

WE 1

Classify each variable as qualitative or quantitative; if quantitative, state discrete or continuous

(a) Hair colour. (b) Number of cars in a household. (c) Weight of a parcel in kg. (d) Shoe size.

(a) Hair colour describes a category: qualitative.
(b) Number of cars is a count of distinct values (0, 1, 2, 3, …): quantitative, discrete.
(c) Weight in kg is a measurement (any value in a range): quantitative, continuous.
(d) Shoe size takes specific values (5, 5.5, 6, 6.5, …): quantitative, discrete.

Answer: (a) qualitative; (b) quantitative, discrete; (c) quantitative, continuous; (d) quantitative, discrete.

Note: measured = continuous; counted = discrete.

WE 2

Stratified sampling across three year groups

A school has 240 students in Year 10, 200 in Year 11, and 160 in Year 12 (total 600). The headteacher wants a stratified sample of 30 students. How many students should be sampled from each year group?

Step 1: Sample fraction
30/600 = 1/20 (so 1 in every 20 students)
Step 2: Apply to each year group
Year 10: (30/600) × 240 = 12
Year 11: (30/600) × 200 = 10
Year 12: (30/600) × 160 = 8
Step 3: Verify the total
12 + 10 + 8 = 30 ✓

Answer: 12 from Y10, 10 from Y11, 8 from Y12.

Note: within each group, choose the actual students randomly.

WE 3

Systematic sampling on a production line

A factory produces 2000 toys per day on a single conveyor belt. Quality control wants to sample 80 toys using systematic sampling. Describe the procedure.

Step 1: Compute the interval k = N/n
k = 2000/80 = 25
Step 2: Pick a random start s between 1 and 25
e.g., s = 7 (chosen randomly)
Step 3: Sample every 25th toy after s
Selected: 7th, 32nd, 57th, 82nd, …, 1982nd

Answer: k = 25; pick a random start in [1, 25], then take every 25th toy after that.

Note: a conveyor belt has a natural order, so systematic sampling is ideal.
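
The procedure in this worked example can be reproduced with a short Python sketch. The function name `systematic_positions` is hypothetical, and the code assumes N is an exact multiple of n so that the interval k is a whole number, as it is here.

```python
import random

def systematic_positions(N, n, start=None):
    """Return the positions selected by systematic sampling.
    Assumes N is divisible by n, so k = N / n is a whole number."""
    if N % n != 0:
        raise ValueError("this sketch assumes N is divisible by n")
    k = N // n                           # sampling interval
    if start is None:
        start = random.randint(1, k)     # random start in [1, k], both ends included
    return list(range(start, N + 1, k))  # start, start + k, start + 2k, ...

# Reproduce the example with the start fixed at s = 7.
positions = systematic_positions(2000, 80, start=7)
print(positions[:4], positions[-1], len(positions))
# → [7, 32, 57, 82] 1982 80
```

Leaving out `start` picks the starting toy at random, which is what quality control would do in practice.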

WE 4

Identify the sampling method used in each scenario

(a) A market researcher stands outside a mall and asks the first 50 shoppers who pass by. (b) Every employee is given a unique ID number; 30 IDs are then chosen using a random number generator. (c) A city is divided by district, and each district is sampled in proportion to its population.

(a) Asking whoever passes by (easiest to reach) → convenience sampling
(b) Numbered list + random selection of n IDs → simple random sampling
(c) Districts = strata; proportional sampling from each → stratified sampling

Answer: (a) convenience; (b) simple random; (c) stratified.

Note: key cue words are “first who pass” = convenience; “random number generator” = simple random; “in proportion” = stratified.

WE 5

Identify a source of bias and suggest an improvement

To estimate the average height of adults in a city, a researcher measures everyone leaving a basketball stadium after a game. (a) State why this sampling method is likely to be biased. (b) Suggest one improvement.

(a) The sample is not representative: basketball spectators (and players) are likely to be taller on average than the general population, so the sample mean would overestimate the city’s true average height.
(b) Improvement: use a stratified random sample by age and sex from a city-wide list (e.g., electoral register).

Answer: (a) basketball fans/players are unrepresentatively tall; (b) use stratified random sampling from a city-wide register.

Note: always ask, “is every section of the population given a fair chance?”

WE 6

Library survey — combined sampling problem

A library has 480 fiction books and 320 non-fiction books. The librarian wants a stratified sample of 40 books to inspect for damage. (a) How many of each type should be sampled? (b) The librarian decides instead to sample 40 books by walking through the library and picking up the first 40 books that look worn. State the name of this method and one disadvantage. (c) Suggest one way to make the original stratified sample more reliable.

(a) Apply the stratified formula
Total: 480 + 320 = 800; sample fraction = 40/800 = 1/20
Fiction: (40/800) × 480 = 24
Non-fiction: (40/800) × 320 = 16
Verify: 24 + 16 = 40 ✓
(b) “First 40 worn books” means picking whatever is easiest to reach → convenience sampling
Disadvantage: heavily biased toward visibly damaged books, so it overestimates the damage rate
(c) Improvement to the stratified method: increase the sample size (e.g., to 80) to reduce variability

Answer: (a) 24 fiction + 16 non-fiction; (b) convenience, biased toward worn books; (c) larger sample size.

Note: three classic improvements are (i) a larger sample, (ii) random selection within strata, (iii) repeating the survey.

💡 Top tips

Identify the sampling frame first — its presence (or absence) usually decides which method is feasible.
For stratified problems, always verify your group sizes add up to n.
“In proportion to” is the giveaway phrase for stratified sampling.
Round sensibly for stratified sample sizes — half-people don’t exist; round to nearest while keeping the total at n.
Always justify in context: relate bias to the population being studied, not just generic “this is biased”.
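
When the proportional shares are not whole numbers, one possible way to round while keeping the total at n is the largest-remainder approach: round every share down, then hand the leftover places to the groups with the biggest fractional parts. This is one reasonable technique rather than a rule from the syllabus, and the group names and sizes below are invented for illustration.

```python
from fractions import Fraction
from math import floor

def rounded_allocation(group_sizes, n):
    """Whole-number stratified sample sizes that always add up to n."""
    N = sum(group_sizes.values())
    exact = {g: Fraction(n * size, N) for g, size in group_sizes.items()}
    counts = {g: floor(share) for g, share in exact.items()}  # round everything down
    leftover = n - sum(counts.values())                       # places still to fill
    # Groups with the largest fractional part receive the leftover places first.
    by_remainder = sorted(exact, key=lambda g: exact[g] - counts[g], reverse=True)
    for g in by_remainder[:leftover]:
        counts[g] += 1
    return counts

# Hypothetical groups: exact shares are 15, 7.8 and 7.2 for a sample of 30.
groups = {"A": 250, "B": 130, "C": 120}
print(rounded_allocation(groups, 30))
# → {'A': 15, 'B': 8, 'C': 7}
```

Rounding 7.8 up and 7.2 down gives 15 + 8 + 7 = 30, so the verification step still passes.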

⚠ Common mistakes

Confusing quota with stratified: both split the population into groups, but stratified sampling selects randomly within each group, while quota sampling does not. A “split into groups” method without random selection within groups is quota, not stratified.
Mixing up discrete and continuous — “shoe size” looks continuous but is discrete; “age in years” is discrete; “exact age” is continuous.
Saying “biased” without explaining why — exam answers need a specific reason linked to the scenario.
Forgetting that simple random, systematic, and stratified sampling all need a sampling frame; they fail when no list exists.

Next: Measures of Central Tendency. Once data is collected, you summarise it. The three classical averages — mean, median, mode — each capture a different kind of “centre”, and each is right for different situations. Mean uses every value, median resists outliers, mode finds the most common.
