IB Maths AI HL Statistics Toolkit Paper 1 & 2 ~7 min read

Reliability & Validity of Data Collection Methods

Picking a sampling technique is only half the battle. The other half is checking whether the data you actually collect is any good. Two key questions: is it reliable (would you get the same results if you ran the process again?) and is it valid (are you really measuring the thing you think you’re measuring?). The two ideas sound similar but they are very different — a measurement can be reliable yet completely invalid, or valid on average but wildly unreliable. The IB tests both with specific check methods: test–retest and parallel forms for reliability, plus content-related and criterion-related for validity.

📘 What you need to know

Reliable vs valid — the key distinction

The two terms feel similar but they describe completely different things. Reliability is about repeatability; validity is about accuracy. The classic target analogy makes the difference vivid.

[Figure: four targets. Reliability = clustering · validity = hitting the centre]
Panel 1: Reliable ✓ Valid ✓ (the goal)
Panel 2: Reliable ✓ Valid ✗ (consistent, but wrong)
Panel 3: Reliable ✗ Valid ✓ (right on average, scattered)
Panel 4: Reliable ✗ Valid ✗ (just noise)
A process can be reliable but wrong (panel 2), or accurate on average but inconsistent (panel 3). You want both: tight clustering on the bullseye (panel 1).
Reliability is how tightly your arrows cluster. Validity is whether they cluster on the centre. The two are independent qualities — you can have either without the other.

🤔 A real example of “reliable but invalid”

A thermometer that always reads 5°C too high is perfectly reliable — give it the same water on different days and it gives the same wrong answer every time. But it’s not valid, because it doesn’t actually measure temperature accurately. The fix isn’t to repeat the readings — it’s to recalibrate the thermometer.
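
To see the thermometer idea in numbers, here is a minimal simulation sketch. Everything in it is hypothetical: the function name `simulate_readings`, the true temperature of 20 °C, and the noise levels are assumptions chosen for illustration, not values from the syllabus. A small spread stands in for reliability and a small offset from the true value stands in for validity.

```python
import random
import statistics

def simulate_readings(true_temp, bias, noise_sd, n, seed=0):
    """Return n simulated readings: true value + fixed bias + random noise."""
    rng = random.Random(seed)
    return [true_temp + bias + rng.gauss(0, noise_sd) for _ in range(n)]

true_temp = 20.0  # assumed true water temperature in °C

# Hypothetical thermometer A: always about 5 °C too high, very little scatter.
biased = simulate_readings(true_temp, bias=5.0, noise_sd=0.1, n=50)
# Hypothetical thermometer B: correct on average, but very scattered.
noisy = simulate_readings(true_temp, bias=0.0, noise_sd=3.0, n=50, seed=1)

for name, readings in [("biased", biased), ("noisy", noisy)]:
    spread = statistics.stdev(readings)             # small spread -> reliable
    offset = statistics.mean(readings) - true_temp  # small offset -> valid
    print(f"{name}: spread = {spread:.2f}, offset = {offset:+.2f}")
```

Thermometer A should show a tiny spread with an offset near +5 (reliable, not valid). Thermometer B should show a large spread. Taking more readings from A does not shrink its offset, which is why recalibrating is the fix.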

Tests for reliability

Two tests. Both compare two sets of results from the same sample. If the two sets correlate strongly (positively), the process is reliable.

Test–Retest
same questions, later
Run the same process with the same sample at a later time. Look for positive correlation. Weakness: participants may remember the first attempt.
Parallel Forms
similar questions, once
Give the same sample a different but equivalent set of questions covering the same variable. Weakness: hard to make the two sets equally difficult.
Reliability check
high positive correlation between the two sets ⇒ reliable
low correlation or no pattern ⇒ not reliable
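
These notes do not say which correlation measure to use. One common choice, assumed here, is Pearson's product-moment correlation coefficient r. For paired results (x_i, y_i) from the two sets, with means x̄ and ȳ:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r \le 1
```

A value of r close to +1 corresponds to the "high positive correlation ⇒ reliable" case. A value near 0 corresponds to "no pattern".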

Tests for validity

Two tests again. Both ask, in different ways, whether the data collection process is actually measuring the right thing.

Content-related
covers all aspects?
Does the process measure all aspects of the variable? Usually needs expert judgement to confirm full coverage.
Criterion-related
predicts another variable?
Does this variable accurately predict the outcome of another (the “criterion”) variable? If so, the process is valid as a predictor.

Content-related validity — examples

Valid: a calculus test that includes differentiation, integration, and applications questions — covers the full variable.
Not valid: assessing a chef’s overall cooking ability by asking them to make 10 apple pies — only one dish is tested.

Criterion-related validity — examples

Valid: using mock exam scores to predict final exam scores — strong link.
Not valid: using meerkat heights to predict squirrel heights — no meaningful relationship.

Designing good surveys & questionnaires

Even a well-chosen sampling technique can produce bad data if the survey questions are loaded, ambiguous, or invasive. The IB expects you to spot these flaws.

Good questionnaire questions

| Avoid | Example to avoid | Why |
|---|---|---|
| Leading questions | “You enjoy school, don’t you?” | Suggests the “right” answer |
| Personal questions | “What is your home address?” | Unnecessary; reduces trust |
| Loaded language | “Do you watch the boring news?” | Embeds an opinion in the question |
| Self-judgement | “How smart are you?” | People struggle to rate themselves |
| Ambiguous phrasing | “Do you study French or Spanish?” | Yes/no answer? Or pick one? |
| Missing options | “Which sport: football or tennis?” | Other sports excluded |

🧠 Structured vs unstructured

Structured questions (multiple-choice, rating, ranking) are quick to analyse but give limited insight. Unstructured (open-ended) questions yield richer answers but take much longer to analyse. Most real questionnaires use a mix.

🧭 Recipe — identifying reliability or validity

  1. Spot the test type: same process repeated → test–retest; new similar questions → parallel forms; covers all aspects → content; predicts another variable → criterion.
  2. For reliability: compare the two sets of results. Are they similar / strongly correlated?
  3. For validity: ask whether the process actually measures what it claims to (content) or whether the predictor is the right one (criterion).
  4. Comment in context: state the conclusion (reliable / valid or not) and give a reason linked to the scenario.
  5. If unreliable or invalid: suggest a larger sample, a different sampling technique, or changes to the process.
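
Step 1 of the recipe is essentially a lookup from a cue in the question to the name of the check. A small sketch of that lookup follows. The dictionary `CHECKS`, the function `identify_check`, and the exact cue strings are hypothetical names made up for this illustration.

```python
# Cue in the scenario -> (name of the check, quality it assesses).
# The cue wording is an illustrative paraphrase of step 1 of the recipe.
CHECKS = {
    "same process repeated later": ("test-retest", "reliability"),
    "new but equivalent questions": ("parallel forms", "reliability"),
    "covers all aspects of the variable": ("content-related", "validity"),
    "predicts another variable": ("criterion-related", "validity"),
}

def identify_check(cue):
    """Return the check that matches a cue, with the quality it assesses."""
    name, quality = CHECKS[cue]
    return f"{name} ({quality})"

print(identify_check("same process repeated later"))
print(identify_check("predicts another variable"))
```

In an exam answer the lookup is only the first mark. Steps 2 to 4 still need a comment in context.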

Worked examples

WE 1

Identify problems in survey questions

A school designs a questionnaire about school meals. State a problem with each question and rewrite it appropriately.
(a) “Don’t you agree that the canteen food is great?”
(b) “Do you prefer pizza or salad: Yes or No?”
(c) “How would you rate your own cooking ability out of 10?”

(a) Problem: leading question; it suggests the answer and introduces bias.
Rewrite: “How would you rate the canteen food on a scale of 1–5?”
(b) Problem: ambiguous phrasing; “Yes/No” doesn’t match a choice question.
Rewrite: “Which do you prefer: pizza or salad?” (pick one)
(c) Problem: self-judgement; people struggle to rate themselves objectively.
Rewrite: “How often do you cook a meal at home per week?”
Takeaway: questions should be neutral, precise, and not require self-judgement.
WE 2

Test–retest reliability with data

A psychologist tests 6 students on a memory task. Two weeks later, the same task is repeated with the same students.

| Student | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Test 1 | 12 | 15 | 8 | 10 | 14 | 9 |
| Test 2 | 13 | 14 | 9 | 11 | 15 | 9 |

(a) State the reliability test being used.
(b) Comment on the reliability of the process.

(a) Same task, same sample, repeated later ⇒ test–retest.
(b) Compare the two sets. Differences (Test 2 − Test 1): +1, −1, +1, +1, +1, 0. All are within ±1, and students who scored high on Test 1 also scored high on Test 2 ⇒ strong positive correlation, so the process is reliable.
Takeaway: justify with the data; small differences ⇒ high reliability.
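
The comment in part (b) can be backed up with a number. The sketch below computes the differences and a correlation coefficient for the six students in WE 2. It assumes Pearson's r is the intended measure of correlation, which the question does not state, and `pearson_r` is a helper written for this example rather than a library function.

```python
from math import sqrt

# Scores from the WE 2 table, students 1 to 6.
test1 = [12, 15, 8, 10, 14, 9]
test2 = [13, 14, 9, 11, 15, 9]

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient for paired data."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

differences = [b - a for a, b in zip(test1, test2)]
r = pearson_r(test1, test2)

print(differences)  # [1, -1, 1, 1, 1, 0]
print(round(r, 3))  # roughly 0.955
```

An r of about 0.955 is close to +1, which supports the conclusion that the process is reliable. On a GDC the same value comes from a two-variable statistics or linear regression calculation on the two lists.
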
WE 3

Parallel forms reliability

A maths teacher creates two different but similar quizzes (Quiz A and Quiz B) testing the same topic. She gives Quiz A to her class on Monday and Quiz B to the same class on Wednesday. The results show a strong positive correlation.
(a) Identify the reliability test used.
(b) State one advantage of this test over test–retest.

(a) Two different but similar tests, same sample ⇒ parallel forms.
(b) Advantage over test–retest: students can’t remember answers from a previous identical test, so it eliminates the memory / familiarity effect.
Takeaway: parallel forms reduces bias from students recognising the same questions.
WE 4

Content-related validity

For each scenario, state whether the process has content-related validity and justify.
(a) An IB Maths AI HL teacher creates an end-of-year exam covering only calculus, ignoring statistics, vectors, and graph theory.
(b) A driving examiner tests new drivers on highway driving, parking, reversing, and emergency stops.

(a) Covers only one syllabus topic; major sections are missing ⇒ NOT valid: it doesn’t cover all aspects of AI HL.
(b) Covers multiple driving skills, representative of real driving requirements ⇒ valid: it covers the full range of the variable.
Takeaway: content validity asks: are all important aspects measured?
WE 5

Criterion-related validity

For each scenario, state whether the process has criterion-related validity and justify.
(a) Using SAT scores to predict university grade point averages.
(b) Using shoe size to predict reading ability in adults.

(a) SAT and GPA both measure academic ability; strong, well-established predictor ⇒ valid: the predictor matches the criterion.
(b) Shoe size and reading are unrelated; no meaningful predictive relationship ⇒ NOT valid: the variables are unrelated.
Takeaway: criterion validity asks: does one variable accurately predict the other?
WE 6

Improving a flawed survey

A health researcher stands outside a gym at 7 a.m. for one hour and surveys exiting customers about the daily exercise levels of all adults in the country. State two flaws in this design and suggest one specific improvement for each.

Flaw 1: convenience sampling at a gym; gym-goers exercise more than average ⇒ biased sample.
Improvement: sample from multiple locations (gym, park, homes).
Flaw 2: one hour only, at one time; this excludes anyone who doesn’t go to that gym at 7 a.m.
Improvement: sample across multiple times and days.
Also: increase the sample size; use random or stratified sampling.
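
A short simulation makes Flaw 1 concrete. All numbers here are invented for illustration: the population of 10,000 adults, the 15% share of gym-goers, and the exercise-minute distributions are assumptions, not real data. The sketch compares the mean from a gym-only sample with the mean from a simple random sample.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: (is_gym_goer, daily exercise in minutes).
population = []
for _ in range(10_000):
    gym_goer = rng.random() < 0.15
    minutes = rng.gauss(60, 15) if gym_goer else rng.gauss(20, 10)
    population.append((gym_goer, max(0.0, minutes)))  # no negative minutes

pop_mean = statistics.mean(m for _, m in population)

# Flawed design: only people leaving the gym can be surveyed.
gym_only = [m for g, m in population if g]
gym_mean = statistics.mean(rng.sample(gym_only, 100))

# Improved design: simple random sample from the whole population.
random_mean = statistics.mean(m for _, m in rng.sample(population, 100))

print(f"population mean:    {pop_mean:.1f} min")
print(f"gym-only sample:    {gym_mean:.1f} min")
print(f"random sample mean: {random_mean:.1f} min")
```

Under these assumptions the gym-only mean sits far above the population mean, while the random sample lands close to it. Increasing the size of the gym-only sample would not fix this, because the bias comes from who is sampled, not how many.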

Next up — Measures of Central Tendency. Once you’ve collected reliable, valid data, the first thing you’ll want is a single number that describes where the centre of the data lies. There are three classic measures — the mean, the median, and the mode — and each has its strengths, weaknesses, and a typical scenario where it’s the right choice.
