IB Maths AA SL | Topic 4: Statistics Toolkit | Paper 1 & 2 | ~10 min read
Sampling & Data Collection
Before you can analyse anything, you have to collect data, and how you collect it matters more than most students realise. This note covers what data even is, how to grab a fair slice of it, and why a bad sample can make even perfect maths give you the wrong answer.
What you need to know
Data is either qualitative (words/categories) or quantitative (numbers). Quantitative data is then either discrete (counted) or continuous (measured).
The population is everyone you care about. A sample is a small group taken from the population.
The five sampling techniques you need: simple random, systematic, stratified, quota, and convenience.
Stratified sampling uses the formula: sample size from group = (n ÷ N) × group size, where n is the total sample size and N is the population size.
A biased sample gives misleading results. Random sampling is what we use to fight bias.
Larger samples are usually more reliable, but not if the method is biased.
The four types of data
Every piece of data fits into one of four boxes. Knowing which box yours sits in tells you what graphs to draw and what calculations make sense.
Qualitative
Words / categories
Describes something: usually colour, name, type, or label. Not a number.
e.g. eye colour, favourite subject, blood type
Quantitative
Numbers (counted or measured)
A number that you can do maths with: count it or measure it.
e.g. number of pets, height, exam score, time taken
Discrete (a type of quantitative)
Counted: only specific values
Whole or fixed values. Nothing between them.
e.g. number of students (3 or 4, never 3.7), shoe size
Continuous (a type of quantitative)
Measured: any value in a range
Can take any value within a range, limited only by how precise your tool is.
e.g. height (172.3 cm, 172.34 cm, …), mass, time, temperature
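The data type tree
Data splits into qualitative and quantitative. Quantitative data then splits into discrete and continuous.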
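A trick for telling discrete from continuous: ask “could you ever get half of one?” If yes, it’s continuous (1.5 cm is fine). If no, it’s discrete (1.5 children is impossible: you can’t have half a kid!). Treat this as a rule of thumb. Shoe sizes come in halves but are still discrete, because only fixed values are allowed.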
What about age?
Age is a sneaky one: it depends on how the question phrases it.
“How many years old is the person?” → discrete (you’d say “16”, not “16.4”).
“How long has the person been alive?” → continuous (16 years, 4 months, 2 days, 5 hours, …).
Population vs sample
Imagine a vet wants to know the average sleep time of French bulldogs. The population would be every French bulldog in the world. That’s clearly impossible to measure, so instead the vet would take a sample: maybe 50 dogs from a few different cities, used to estimate what the population looks like.
Definitions
Population = every member you care about
Sample = a smaller group taken from the population
Sampling frame = a list of every member of the population
Why use a sample at all?
Pros of sampling
Quicker and cheaper than measuring everyone.
Less data to handle and analyse.
Sometimes the only practical option (you can’t measure every fish in the ocean!).
Cons of sampling
The sample might not fully represent the population.
Bias can creep in if the method isn’t fair.
Different samples can lead to different conclusions.
The five sampling techniques
The IB exam expects you to know five different sampling techniques. Each one suits a different situation. Here’s a tour:
1. Simple random sampling
How: Number every member of the population, then use a random number generator (or pull names from a hat) to pick n of them. Every member has the same chance.
Pros: Truly random and unbiased. Best choice when you have a small population.
Cons: Slow if the population is huge. Impossible if you can’t list every member (e.g. fish in a lake).
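The “number everyone, then pick at random” recipe above can be sketched in a few lines of Python. This is only an illustration: the sampling frame of 200 numbered students and the function name `simple_random_sample` are made up for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Pick n distinct members, each equally likely to be chosen."""
    rng = random.Random(seed)          # the seed only makes a run repeatable
    return rng.sample(population, n)   # sampling without replacement

# Hypothetical sampling frame: students numbered 1 to 200.
frame = list(range(1, 201))
chosen = simple_random_sample(frame, 10, seed=1)
print(sorted(chosen))
```

Note that the code needs the full list `frame` before it can pick anyone, which is exactly why this method fails when no list of the population exists.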
2. Systematic sampling
How: Calculate k = N ÷ n (population size ÷ sample size). Pick a random start between 1 and k, then take every kth member after that.
Pros: Quick and easy. Great when there’s a natural order, such as a list of names or a conveyor belt.
Cons: Can’t use if you can’t list members. Risk of bias if the order has a hidden pattern.
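As a rough sketch of the steps above, the snippet below works out the interval k and lists the positions that would be sampled. The numbers (a list of 100 members, a sample of 10, a start of 4) are invented for illustration, and the function assumes N divides exactly by n.

```python
import random

def systematic_sample(N, n, start=None, seed=None):
    """Return the 1-based positions picked by systematic sampling."""
    k = N // n  # interval; assumes N is an exact multiple of n
    if start is None:
        start = random.Random(seed).randint(1, k)  # random start in 1..k
    return [start + i * k for i in range(n)]

positions = systematic_sample(100, 10, start=4)
print(positions)  # [4, 14, 24, 34, 44, 54, 64, 74, 84, 94]
```

Only the starting point is random. Once it is fixed, every other position follows from adding k, which is why a hidden pattern in the ordering can cause bias.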
3. Stratified sampling
How: Split the population into disjoint groups (called strata), like males/females or different age bands. From each group, take a random sample, sized so the proportions match the population.
Formula: sample from group = (n ÷ N) × number in group.
Pros: Sample reflects the population structure. Good when groups within the population are very different.
Cons: Can’t be used if the population can’t be split into groups, or if groups overlap.
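Here is a minimal sketch of stratified sampling, assuming a school register is available as two lists of IDs. The IDs and the function name `stratified_sample` are hypothetical. It applies (n ÷ N) × group size to each stratum, then draws a random sample of that size from the group.

```python
import random

def stratified_sample(strata, n, seed=None):
    """Randomly sample from each stratum in proportion to its size."""
    rng = random.Random(seed)
    N = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        size = round(n * len(members) / N)  # (n / N) * group size, rounded
        sample[name] = rng.sample(members, size)
    return sample

# Hypothetical register: 480 boys and 320 girls, identified by ID.
strata = {
    "boys": [f"B{i}" for i in range(1, 481)],
    "girls": [f"G{i}" for i in range(1, 321)],
}
sample = stratified_sample(strata, 40, seed=3)
print({name: len(chosen) for name, chosen in sample.items()})
# {'boys': 24, 'girls': 16}
```

With less tidy numbers the rounded group sizes may not add up to n, so it is worth checking the total by hand.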
4. Quota sampling
How: Like stratified, you split into groups and decide how many to pick from each, but you don’t pick randomly. Just keep selecting until each quota is filled.
Pros: Useful when no list of the population exists. Common in street surveys.
Cons: Can be biased, because people who refuse to take part skew the results.
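To see how quota sampling differs from stratified sampling, the sketch below fills each quota in arrival order, with no random selection at all. The stream of passers-by and the age bands are invented for the example.

```python
def quota_sample(stream, quotas):
    """Take members in arrival order until every group's quota is full."""
    remaining = dict(quotas)
    picked = []
    for member, group in stream:
        if remaining.get(group, 0) > 0:
            picked.append((member, group))
            remaining[group] -= 1
        if not any(remaining.values()):  # all quotas filled, stop early
            break
    return picked

# Hypothetical street survey: (person id, age band) in the order people walk past.
passers_by = [(1, "under 30"), (2, "under 30"), (3, "30+"), (4, "under 30"),
              (5, "30+"), (6, "30+"), (7, "under 30")]
picked = quota_sample(passers_by, {"under 30": 2, "30+": 2})
print(picked)  # [(1, 'under 30'), (2, 'under 30'), (3, '30+'), (5, '30+')]
```

Person 4 is skipped because the “under 30” quota is already full, and nobody after person 5 gets a chance. Who ends up in the sample depends entirely on who happens to arrive first.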
5. Convenience sampling
How: Just pick whoever is easiest to reach: friends, classmates, the first 20 people who walk past.
Pros: Fast and free. Used when no list of the population is available.
Cons: Almost always biased. Sample probably won’t represent the wider population.
Memory trick: “Random vs Roughly Right vs Whoever Shows Up”
Think of the methods on a “random scale”. Simple random is properly random, and systematic is random in its starting point. Stratified is random within each group. Quota picks the right numbers from each group but not randomly. Convenience just grabs whoever is around. The further down the list, the higher the bias risk.
Bias and reliability
A biased sample is one that gives a misleading picture of the population. The whole point of using a careful sampling method is to fight bias.
What makes data reliable?
A sample is reliable if you’d get similar results from a different sample of the same population. The sample needs to be representative (the right mix) and big enough. Tiny samples, even random ones, can fluctuate a lot.
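The claim that tiny random samples fluctuate a lot can be checked with a quick simulation. This is a sketch under made-up assumptions: the “population” is 5000 invented heights, and reliability is judged by how much the sample mean varies across 200 repeated samples.

```python
import random
import statistics

def spread_of_sample_means(population, n, repeats, seed=0):
    """Standard deviation of the sample mean over many random samples of size n."""
    rng = random.Random(seed)
    means = [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)

rng = random.Random(42)
population = [rng.gauss(170, 10) for _ in range(5000)]  # made-up heights in cm

small = spread_of_sample_means(population, 5, 200)    # samples of 5
large = spread_of_sample_means(population, 100, 200)  # samples of 100
print(round(small, 2), round(large, 2))
```

The first number should come out noticeably larger than the second: means from samples of 5 jump around far more than means from samples of 100, even though both methods are random.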
What causes data to be unreliable?
Bias: the sample isn’t random, so some members had a higher chance of being picked than others.
Recording errors: numbers written down wrong, duplicated, or missing.
The collector “cherry-picks”: they include or exclude data to push a desired outcome.
Missing data: some members refuse to take part or aren’t reachable, leaving holes.
If a question asks you to “suggest one improvement” to a sampling method, the safest answers are nearly always: increase the sample size, or use a more random method. A larger sample makes the results more reliable, while a more random method is what reduces bias.
Worked examples
WE 1
Identify the type of data
For each of the following, state whether the data is qualitative or quantitative. If quantitative, also state whether it’s discrete or continuous.
(a) Eye colour of students (b) Number of pets owned (c) Time to run 100 m (d) Mass of an apple (e) Shoe size
Qualitative = words. Quantitative = numbers. Discrete = counted, continuous = measured.
(a) Eye colour: words → Qualitative
(b) Number of pets: numbers, counted → Quantitative, discrete
(c) Time to run 100 m: numbers, measured → Quantitative, continuous
(d) Mass of an apple: numbers, measured → Quantitative, continuous
(e) Shoe size: numbers, fixed values → Quantitative, discrete
Answer: (a) qualitative | (b), (e) discrete | (c), (d) continuous
Tip: ask “could I have half of this?” If not, it’s discrete.
WE 2
Stratified sampling: Mike’s mice
Mike is a biologist studying mice in an open enclosure. He has approximately 540 field mice and 260 harvest mice. Mike wants to sample 10 mice and he wants the proportions to match the population.
(a) Calculate how many of each type Mike should include. (b) Given that Mike has no list of the mice, name the sampling method. (c) Suggest one way to improve the method.
Total: 540 + 260 = 800 mice
Part (a)
Field mice: (540 ÷ 800) × 10 = 6.75
Harvest mice: (260 ÷ 800) × 10 = 3.25
Round to whole mice (you can’t sample half a mouse!).
Answer: 7 field mice, 3 harvest mice
Part (b)
No list of the mice, so the method can’t be simple random or stratified. Mike picks until each group quota is filled.
Answer: Quota sampling
Part (c)
The simplest improvement:
Answer: Increase the sample size
Tip: a bigger sample is more representative and more reliable.
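The arithmetic in part (a) can be double-checked with a short script. It is a sketch only: the function name `group_sizes_for_sample` is made up, and it simply raises an error if rounding leaves the total different from n, rather than trying to fix it automatically.

```python
def group_sizes_for_sample(group_sizes, n):
    """Apply (n / N) * group size to each group, then round and check the total."""
    N = sum(group_sizes.values())
    exact = {g: n * size / N for g, size in group_sizes.items()}
    rounded = {g: round(v) for g, v in exact.items()}
    if sum(rounded.values()) != n:
        raise ValueError("rounded sizes do not add up to n; adjust by hand")
    return exact, rounded

exact, rounded = group_sizes_for_sample({"field": 540, "harvest": 260}, 10)
print(exact)    # {'field': 6.75, 'harvest': 3.25}
print(rounded)  # {'field': 7, 'harvest': 3}
```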
WE 3
Systematic sampling on a production line
A factory produces 600 chocolate bars per hour. The quality controller wants to sample 30 bars using systematic sampling.
(a) Calculate the interval k. (b) If the controller starts at bar number 7, list the next 4 bars she would sample.
N = 600, n = 30
Part (a)
Use k = N ÷ n: k = 600 ÷ 30 = 20
Answer: k = 20
Part (b)
Starting bar = 7. Then add 20 each time: 7, 27, 47, 67, 87
Answer: Next 4 bars: 27, 47, 67, 87
Tip: just keep adding k = 20 until you reach the end of the run.
WE 4
Stratified sampling at a school
A school has 480 boys and 320 girls. The headteacher wants to survey 40 students using stratified sampling. Calculate how many boys and girls should be in the sample.
N = 480 + 320 = 800, n = 40
Use the stratified formula for each group separately.
Sample fraction: n ÷ N = 40 ÷ 800 = 1/20
Boys: (1/20) × 480 = 24
Girls: (1/20) × 320 = 16
Check: 24 + 16 = 40 ✓
Answer: 24 boys, 16 girls
Tip: always check your group totals add up to the sample size.
WE 5
Identify and evaluate a sampling method
A market researcher stands outside a supermarket from 10 am to 12 pm and asks the first 100 people who walk past about their shopping preferences.
(a) Identify the sampling method. (b) Give one disadvantage. (c) Suggest a better method.
She picks whoever is easiest to reach: no list, no proportions, no random selection.
Part (a)
Answer: Convenience sampling
Part (b)
Surveying from 10 am to 12 pm misses people who are at work, so the sample is biased toward stay-at-home shoppers.
Answer: Sample is not representative of all shoppers
Part (c)
Survey at varied times across the week, or use stratified sampling on a customer database.
Answer: Use stratified sampling across different times of day
Tip: naming a specific better method scores higher than just saying “use a bigger sample”.
Top tips
Memorise the five sampling methods by name. If you can match a description to “simple random / systematic / stratified / quota / convenience”, you’ve got the easy marks.
The stratified formula is your friend. sample from group = (n ÷ N) × group size. It works every time.
Round stratified answers carefully. If you get 6.75 mice, you can’t sample three-quarters of a mouse, so round to a whole number and check the total still equals n.
Read the question for “no list available”. If the population can’t be listed, simple random, systematic, and stratified are all out (each needs a list to select from), leaving quota or convenience.
“Suggest an improvement” almost always wants you to say either “increase the sample size” or “use a more random method”.
For data type questions, ask: words or numbers? If numbers, can you have a half of one?
Be specific in answers. Saying “the sample is biased” is okay; saying “the sample is biased because it misses working adults” gets full marks.
Common mistakes
Confusing stratified with quota. Stratified takes a random sample from each group. Quota just fills the numbers, with no randomness inside groups.
Forgetting that stratified totals must add to n. If you’ve calculated 7 field mice and 3 harvest mice, check that 7 + 3 = 10. Rounding errors can leave you 1 short or 1 over.
Calling “ask my classmates” simple random sampling. It’s convenience sampling: your friends aren’t a random selection of the school.
Saying “discrete” for time, height, or mass. Anything measured is continuous, not discrete.
Saying “bias” without explaining what kind. Examiners want the specific reason, such as “biased because the sample missed group X”.
Mixing up population and sample. Population = everyone you care about. Sample = the small chunk you actually measured.
Thinking “more data = always better”. A huge biased sample is still wrong. Method matters more than size, so fix the method first.
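The last mistake is easy to demonstrate with a toy simulation. Everything here is invented for illustration: a population of 700 working adults and 300 stay-at-home shoppers with different typical weekly spends, a large sample drawn from only one group, and a small simple random sample.

```python
import random
import statistics

rng = random.Random(7)
# Hypothetical weekly shopping spend for two groups of shoppers.
working = [rng.gauss(50, 5) for _ in range(700)]
at_home = [rng.gauss(80, 5) for _ in range(300)]
population = working + at_home
true_mean = statistics.mean(population)

big_biased = at_home[:250]                 # large, but only one group
small_random = rng.sample(population, 30)  # small, but everyone had a chance

biased_error = abs(statistics.mean(big_biased) - true_mean)
random_error = abs(statistics.mean(small_random) - true_mean)
print(round(biased_error, 1), round(random_error, 1))
```

The sample of 250 should land much further from the true mean than the random sample of 30, because no amount of extra data fixes a sample that leaves out most of the population.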
Welcome to Topic 4! The Statistics Toolkit is mostly about gathering, summarising, and visualising data. The next note covers measures of central tendency (mean, median, and mode), which is where the actual number-crunching kicks in.