IB Maths AA HL
Topic 4: Statistics & Probability
Paper 1 & 2
Sampling & Data Collection
Every statistical analysis starts with the data: how it was collected, what kind it is, and how representative it is of the bigger picture. The five sampling methods (simple random, systematic, stratified, quota, convenience) each have a use case; picking the wrong one introduces bias that no amount of clever analysis can fix.
What you need to know
- Qualitative data describes (e.g., colour); quantitative data counts or measures.
- Discrete data takes specific values (e.g., number of pets); continuous takes any value in a range (e.g., height).
- Population = the whole set of interest; sample = a subset taken from it.
- Sampling frame = list of all population members (needed for some methods).
- Five sampling methods: simple random, systematic, stratified, quota, convenience, each with strengths and use cases.
- Stratified sample size for a group: (size of sample n / size of population N) × group size.
- Systematic interval: k = N/n; pick random start in [1, k], then every k-th member.
- Bias arises when some members are systematically over- or under-represented; sampling techniques aim to minimise it.
Types of data
Qualitative
words / categories
e.g., eye colour, favourite sport, brand of phone
Quantitative
numbers
e.g., number of siblings, height, time taken
Quantitative data splits further:
Discrete
specific values
counted; e.g., 0, 1, 2, 3 pets, but not 1.7 pets
Continuous
any value in a range
measured; e.g., 1.732 m height, 12.45 seconds
Watch out for “age”: it can be either, depending on context. “How many years old” is discrete (whole years); “how long someone has been alive” is continuous (precise time).
Population vs sample
The population is the full group you’re interested in (e.g., all French bulldogs); the sample is a smaller subset chosen to represent it. Sampling is faster and cheaper than studying the whole population, but it introduces uncertainty and risk of bias.
The five sampling methods
| Method | How it works | Needs sampling frame? | Best for… |
|---|---|---|---|
| Simple random | number every member, pick n at random | yes | small populations, no obvious structure |
| Systematic | pick every k-th member after a random start | yes | natural ordering (e.g., production line) |
| Stratified | split into groups, sample each in proportion | yes (per group) | distinct sub-groups in the population |
| Quota | split into groups, sample a fixed number from each (not random) | no | street surveys, no full list available |
| Convenience | sample whoever is easiest to reach | no | quick rough estimates; high bias risk |
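As a rough illustration of the first row of the table, simple random sampling from a numbered sampling frame can be sketched in Python. The frame of 200 member IDs, the sample size of 10, and the function name `simple_random_sample` are all assumptions made for this example; `random.sample` is the standard-library call that draws without replacement.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct members from the sampling frame, each equally likely."""
    rng = random.Random(seed)      # the seed only makes the demo repeatable
    return rng.sample(frame, n)    # sampling without replacement

# Hypothetical sampling frame: member IDs 1..200 (invented for illustration)
frame = list(range(1, 201))
sample = simple_random_sample(frame, 10, seed=1)
print(sorted(sample))
```

Note that the sketch only works because the full list `frame` exists, which is why the table marks simple random sampling as needing a sampling frame.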
Stratified sample: number from each group
$$\text{number from group} = \frac{\text{size of sample } (n)}{\text{size of population } (N)} \times \text{group size}$$
Systematic interval
$$k = \frac{\text{size of population } (N)}{\text{size of sample } (n)}$$
Reliability of data
A sample is reliable if a different sample from the same population would give similar results. Reliability improves with: a larger sample, a sampling method that minimises bias, accurate recording, and a high response rate. Bias creeps in via leading questions, self-selection (only enthusiasts respond), or excluding parts of the population entirely.
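The claim that a larger sample improves reliability can be explored with a small simulation. This is only a sketch: the population of 5000 heights is invented, and the helper name `spread_of_sample_means` is an assumption for this example. It repeatedly draws simple random samples and measures how much the sample mean varies from one sample to the next.

```python
import random
import statistics

def spread_of_sample_means(population, n, repeats, rng):
    """Standard deviation of the sample mean over repeated simple random samples."""
    means = [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)

rng = random.Random(0)  # fixed seed so the demo is repeatable
# Hypothetical population: 5000 heights in cm (mean 170, s.d. 10), invented data
population = [rng.gauss(170, 10) for _ in range(5000)]

spread_small = spread_of_sample_means(population, 10, 200, rng)
spread_large = spread_of_sample_means(population, 100, 200, rng)
print(f"n = 10:  spread of sample means = {spread_small:.2f}")
print(f"n = 100: spread of sample means = {spread_large:.2f}")
```

A smaller spread means that different samples give more similar results, which is the sense of "reliable" used here. The simulation says nothing about bias: a large but badly chosen sample can still be consistently wrong.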
Recipe: choosing a sampling method
- Is a sampling frame (list of all members) available? If no → quota or convenience.
- Are there distinct sub-groups (year groups, departments, age brackets)? If yes → stratified.
- Is there a natural order (conveyor belt, alphabetical list)? If yes → systematic is quick.
- If the population is small and unstructured → simple random sampling.
- Always: prefer methods that give every member a fair chance of selection.
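The recipe above can be written as a short decision function. This is a sketch of the rule of thumb only, assuming the questions are asked in the order listed; the function name and the three yes/no parameters are made up for this example, and real exam questions may need more context than three flags.

```python
def choose_sampling_method(has_frame, has_subgroups, has_natural_order):
    """Ask the recipe's questions in order and return a suggested method."""
    if not has_frame:
        return "quota or convenience"   # no list of members available
    if has_subgroups:
        return "stratified"             # distinct sub-groups in the population
    if has_natural_order:
        return "systematic"             # e.g. conveyor belt, alphabetical list
    return "simple random"              # small, unstructured population

print(choose_sampling_method(has_frame=False, has_subgroups=True, has_natural_order=False))
print(choose_sampling_method(has_frame=True, has_subgroups=True, has_natural_order=False))
print(choose_sampling_method(has_frame=True, has_subgroups=False, has_natural_order=True))
print(choose_sampling_method(has_frame=True, has_subgroups=False, has_natural_order=False))
```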
Worked examples
WE 1: Classify each variable as qualitative or quantitative; if quantitative, state discrete or continuous
(a) Hair colour. (b) Number of cars in a household. (c) Weight of a parcel in kg. (d) Shoe size.
(a) Hair colour
describes a category → qualitative
(b) Number of cars
a count of distinct values (0, 1, 2, 3, …) → quantitative, discrete
(c) Weight in kg
a measurement (any value in a range) → quantitative, continuous
(d) Shoe size
specific values (5, 5.5, 6, 6.5, …) → quantitative, discrete
(a) qualitative; (b) quant. discrete; (c) quant. continuous; (d) quant. discrete
measured = continuous; counted = discrete
WE 2: Stratified sampling across three year groups
A school has 240 students in Year 10, 200 in Year 11, and 160 in Year 12 (total 600). The headteacher wants a stratified sample of 30 students. How many students should be sampled from each year group?
Step 1: Sample fraction
30/600 = 1/20 (so 1 in every 20 students)
Step 2: Apply to each year group
Year 10: (30/600) × 240 = 12
Year 11: (30/600) × 200 = 10
Year 12: (30/600) × 160 = 8
Step 3: Verify the total
12 + 10 + 8 = 30 ✓
12 from Y10, 10 from Y11, 8 from Y12
within each group, choose the actual students randomly
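The calculation in this worked example, together with the final step of choosing students at random within each year group, might look like this in Python. The student labels and the seed are placeholders invented for the sketch; only the group sizes and the sample size come from the question.

```python
import random
from fractions import Fraction

year_sizes = {"Year 10": 240, "Year 11": 200, "Year 12": 160}
n = 30
N = sum(year_sizes.values())      # 600
fraction = Fraction(n, N)         # 1/20, kept exact to avoid float rounding

# Number from each group = (n / N) x group size (exact whole numbers here)
allocation = {year: round(fraction * size) for year, size in year_sizes.items()}
print(allocation)                 # {'Year 10': 12, 'Year 11': 10, 'Year 12': 8}

# Choose the actual students at random within each year group
rng = random.Random(2024)         # seed only so the demo is repeatable
chosen = {}
for year, size in year_sizes.items():
    students = [f"{year} student {i}" for i in range(1, size + 1)]  # hypothetical list
    chosen[year] = rng.sample(students, allocation[year])
```

The random draw inside each group is what separates stratified sampling from quota sampling; the allocation alone would be the same for both.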
WE 3: Systematic sampling on a production line
A factory produces 2000 toys per day on a single conveyor belt. Quality control wants to sample 80 toys using systematic sampling. Describe the procedure.
Step 1: Compute the interval k = N/n
k = 2000/80 = 25
Step 2: Pick a random start s between 1 and 25
e.g., s = 7 (chosen randomly)
Step 3: Sample every 25th toy after s
Selected: 7th, 32nd, 57th, 82nd, …, 1982nd
k = 25; pick random start in [1, 25], then every 25th after that
conveyor belt has natural order → systematic is ideal
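The positions selected in this worked example can be checked with a few lines of Python. The helper name `systematic_positions` is an assumption for this sketch; the numbers N = 2000, n = 80 and the start s = 7 are taken from the example.

```python
import random

N, n = 2000, 80
k = N // n                       # interval: 2000 / 80 = 25 (divides exactly here)

def systematic_positions(start, k, n):
    """Positions start, start + k, start + 2k, ... (n positions in total)."""
    return [start + i * k for i in range(n)]

positions = systematic_positions(7, k, n)    # start s = 7, as in the worked example
print(positions[:4], "...", positions[-1])   # [7, 32, 57, 82] ... 1982

# In practice the start is drawn at random from 1..k:
random_start = random.randint(1, k)
random_positions = systematic_positions(random_start, k, n)
```

Whatever start is drawn from 1 to 25, the last position is at most 25 + 79 × 25 = 2000, so the sample never runs past the end of the day's production.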
WE 4: Identify the sampling method used in each scenario
(a) A market researcher stands outside a mall and asks the first 50 shoppers who pass by. (b) Every employee is given a unique ID number; 30 IDs are then chosen using a random number generator. (c) A city is divided by district, and each district is sampled in proportion to its population.
(a) Asking whoever passes by: easiest to reach
→ convenience sampling
(b) Numbered list + random selection of n IDs
→ simple random sampling
(c) Districts = strata; proportional sampling from each
→ stratified sampling
(a) convenience; (b) simple random; (c) stratified
key cue words: “first who pass” = convenience; “random number generator” = simple random; “in proportion” = stratified
WE 5: Identify a source of bias and suggest an improvement
To estimate the average height of adults in a city, a researcher measures everyone leaving a basketball stadium after a game. (a) State why this sampling method is likely to be biased. (b) Suggest one improvement.
(a) The sample is not representative
basketball spectators (and players) are likely to be taller on average than the general population
→ the sample mean would overestimate the city’s true average height
(b) Improvement
use a stratified random sample by age and sex from a city-wide list (e.g., electoral register)
(a) basketball fans/players are unrepresentatively tall; (b) use stratified random sampling from a city-wide register
always ask: “is every section of the population given a fair chance?”
WE 6: Library survey, a combined sampling problem
A library has 480 fiction books and 320 non-fiction books. The librarian wants a stratified sample of 40 books to inspect for damage. (a) How many of each type should be sampled? (b) The librarian decides instead to sample 40 books by walking through the library and picking up the first 40 books that look worn. State the name of this method and one disadvantage. (c) Suggest one way to make the original stratified sample more reliable.
(a) Apply stratified formula
Total: 480 + 320 = 800; sample fraction = 40/800 = 1/20
Fiction: (40/800) × 480 = 24
Non-fiction: (40/800) × 320 = 16
Verify: 24 + 16 = 40 ✓
(b) “First 40 worn books”: easiest to reach
→ convenience sampling
disadvantage: heavily biased toward visibly damaged books → overestimates the damage rate
(c) Improvement to stratified method
increase sample size (e.g., to 80) to reduce variability
(a) 24 fiction + 16 non-fiction; (b) convenience, biased toward worn books; (c) larger sample size
three classic improvements: (i) larger sample, (ii) random within strata, (iii) repeat the survey
Top tips
- Identify the sampling frame first: its presence (or absence) usually decides which method is feasible.
- For stratified problems, always verify your group sizes add up to n.
- “In proportion to” is the giveaway phrase for stratified sampling.
- Round sensibly for stratified sample sizes: half-people don’t exist, so round each group to the nearest whole number while keeping the total at n.
- Always justify in context: relate bias to the population being studied, not just generic “this is biased”.
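When the group sizes do not divide evenly, rounding each group separately can leave the total one above or below n. One common way to handle this, shown here as a sketch and not as a method prescribed by the syllabus, is the largest-remainder rule: round every share down, then hand the leftover places to the groups with the largest fractional parts. The function name `allocate_strata` and the strata of 150, 140 and 110 members are invented for illustration.

```python
from fractions import Fraction

def allocate_strata(group_sizes, n):
    """Proportional allocation, rounded so that the counts still sum to n."""
    N = sum(group_sizes)
    shares = [Fraction(n * size, N) for size in group_sizes]    # exact (n/N) x size
    counts = [share.numerator // share.denominator for share in shares]  # round down
    leftover = n - sum(counts)                                  # places still to give out
    # Indices of groups, largest fractional part first
    by_remainder = sorted(range(len(shares)),
                          key=lambda i: shares[i] - counts[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts

# Hypothetical strata: exact shares are 3.75, 3.5 and 2.75 for a sample of 10
print(allocate_strata([150, 140, 110], 10))   # [4, 3, 3]
```

Rounding each share to the nearest whole number would give 4 + 4 + 3 = 11 here, one more than the intended sample of 10, which is why the total should always be checked.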
Common mistakes
- Confusing quota with stratified: both split into groups, but stratified samples randomly within each, while quota does not.
- Calling all “split into groups” methods stratified: without random selection within groups, it’s quota.
- Mixing up discrete and continuous: “shoe size” looks continuous but is discrete; “age in years” is discrete; “exact age” is continuous.
- Saying “biased” without explaining why: exam answers need a specific reason linked to the scenario.
- Forgetting that systematic and simple random need a sampling frame: they fail when no list exists.
Next: Measures of Central Tendency. Once data is collected, you summarise it. The three classical averages (mean, median, mode) each capture a different kind of “centre”, and each is right for different situations. Mean uses every value, median resists outliers, mode finds the most common.