IB Maths AI HL · Statistics Toolkit · Paper 1 & 2 · ~7 min read
Reliability & Validity of Data Collection Methods
Picking a sampling technique is only half the battle. The other half is checking whether the data you actually collect is any good. Two key questions: is it reliable (would you get the same results if you ran the process again?) and is it valid (are you really measuring the thing you think you’re measuring?). The two ideas sound similar but they are very different — a measurement can be reliable yet completely invalid, or valid on average but wildly unreliable. The IB tests both with specific check methods: test–retest and parallel forms for reliability, plus content-related and criterion-related for validity.
📘 What you need to know
Reliability = consistency. A reliable process gives the same results when repeated on the same sample under the same conditions.
Validity = accuracy. A valid process accurately measures the variable it claims to measure.
Test–retest reliability: run the same process with the same sample at a later time. Reliable ⇒ strong positive correlation between the two sets of results.
Parallel forms reliability: give the same sample a second, similar set of questions covering the same variable. Reliable ⇒ strong positive correlation between the two sets of results.
Content-related validity — does the process cover all aspects of the variable? Usually needs expert input.
Criterion-related validity — does one variable accurately predict the outcome of another (the criterion variable)?
If unreliable or invalid: change the sampling technique, adjust the data collection process, or use a larger sample.
Good survey: anonymous where possible, no interviewer who could influence the answers, and as little in-person response bias as possible.
Good questionnaire — unbiased questions, no leading wording, options cover all responses, no unnecessary personal info.
Reliable vs valid — the key distinction
The two terms feel similar but they describe completely different things. Reliability is about repeatability; validity is about accuracy. The classic target analogy makes the difference vivid.
reliability = clustering · validity = hitting the centre
Reliability is how tightly your arrows cluster. Validity is whether they cluster on the centre. The two are independent qualities — you can have either without the other.
🤔 A real example of “reliable but invalid”
A thermometer that always reads 5°C too high is perfectly reliable — give it the same water on different days and it gives the same wrong answer every time. But it’s not valid, because it doesn’t actually measure temperature accurately. The fix isn’t to repeat the readings — it’s to recalibrate the thermometer.
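One way to put numbers on the thermometer example is to measure spread (reliability) and bias (validity) separately. The sketch below is only an illustration: the true temperature and the six readings are hypothetical values made up for the example, not data from a real instrument.

```python
from statistics import mean, stdev

# Hypothetical values, for illustration only.
true_temperature = 20.0  # assumed actual water temperature in °C
readings = [25.1, 24.9, 25.0, 25.0, 25.1, 24.9]  # assumed repeated readings in °C

# Spread of the readings: a small spread suggests a reliable (consistent) process.
spread = stdev(readings)

# Average error against the true value: a large bias means the readings are not valid.
bias = mean(readings) - true_temperature

print(f"spread (sample standard deviation): {spread:.2f} °C")
print(f"bias (mean reading minus true value): {bias:.2f} °C")
```

A small spread together with a bias of about 5°C matches the situation in the text: repeating the readings does not remove the bias, but recalibrating would.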
Tests for reliability
Two tests. Both compare two sets of results from the same sample. If the two sets correlate strongly (positively), the process is reliable.
Test–Retest
same questions, later
Run the same process with the same sample at a later time. Look for positive correlation. Weakness: participants may remember the first attempt.
Parallel Forms
similar questions, once
Give the same sample a different but equivalent set of questions covering the same variable. Weakness: hard to make the two sets equally difficult.
Reliability check
high positive correlation between the two sets ⇒ reliable
low correlation or no pattern ⇒ not reliable
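The reliability check can be sketched in code as a correlation followed by a cut-off. This is a minimal sketch, not an IB-prescribed method: the function names are hypothetical, the threshold of 0.8 is an assumed cut-off for "strong", and the two small data sets are made up to show both outcomes.

```python
from math import sqrt

def pearson_r(first, second):
    """Pearson correlation coefficient between two equal-length lists of results."""
    n = len(first)
    mean_first = sum(first) / n
    mean_second = sum(second) / n
    s_xy = sum((x - mean_first) * (y - mean_second) for x, y in zip(first, second))
    s_xx = sum((x - mean_first) ** 2 for x in first)
    s_yy = sum((y - mean_second) ** 2 for y in second)
    return s_xy / sqrt(s_xx * s_yy)

def reliability_verdict(first, second, threshold=0.8):
    """Return 'reliable' when r reaches the threshold (an assumed cut-off, not an IB rule)."""
    return "reliable" if pearson_r(first, second) >= threshold else "not reliable"

# Hypothetical results from the same sample on two occasions.
print(reliability_verdict([1, 2, 3, 4], [2, 4, 6, 8]))  # prints: reliable (r = 1.0)
print(reliability_verdict([1, 2, 3, 4], [2, 1, 4, 3]))  # prints: not reliable (r = 0.6)
```

In an exam answer the verdict still needs a sentence of justification in context; the number alone is not enough.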
Tests for validity
Two tests again. Both ask, in different ways, whether the data collection process is actually measuring the right thing.
Content-related
covers all aspects?
Does the process measure all aspects of the variable? Usually needs expert judgement to confirm full coverage.
Criterion-related
predicts another variable?
Does this variable accurately predict the outcome of another (the “criterion”) variable? If so, the process is valid as a predictor.
Content-related validity — examples
Valid: a calculus test that includes differentiation, integration, and applications questions — covers the full variable. Not valid: assessing a chef’s overall cooking ability by asking them to make 10 apple pies — only one dish is tested.
Criterion-related validity — examples
Valid: using mock exam scores to predict final exam scores — strong link. Not valid: using meerkat heights to predict squirrel heights — no meaningful relationship.
Designing good surveys & questionnaires
Even a well-chosen sampling technique can produce bad data if the survey questions are loaded, ambiguous, or invasive. The IB expects you to spot these flaws.
Good survey design
Decide in-person vs remote. In-person can introduce response bias — people answer to please the interviewer.
Avoid interviewers who can unintentionally influence answers (a headteacher asking students “do you enjoy school?” is likely to get inflated yeses).
Keep the survey anonymous when possible to encourage honest responses.
Good questionnaire questions
Avoid | Example to avoid | Why
Leading questions | “You enjoy school, don’t you?” | Suggests the “right” answer
Personal questions | “What is your home address?” | Unnecessary; reduces trust
Loaded language | “Do you watch the boring news?” | Embeds an opinion in the question
Self-judgement | “How smart are you?” | People struggle to rate themselves
Ambiguous phrasing | “Do you study French or Spanish?” | Yes/no answer? Or pick one?
Missing options | “Which sport: football or tennis?” | Other sports excluded
🧠 Structured vs unstructured
Structured questions (multiple-choice, rating, ranking) are quick to analyse but give limited insight. Unstructured (open-ended) questions yield richer answers but take much longer to analyse. Most real questionnaires use a mix.
🧭 Recipe — identifying reliability or validity
Spot the test type: same process repeated → test–retest; new similar questions → parallel forms; covers all aspects → content; predicts another variable → criterion.
For reliability: compare the two sets of results. Are they similar / strongly correlated?
For validity: ask whether the process actually measures what it claims to (content) or whether the predictor is the right one (criterion).
Comment in context: state the conclusion (reliable / valid or not) and give a reason linked to the scenario.
If unreliable or invalid: suggest a larger sample, a different sampling technique, or changes to the process.
Worked examples
WE 1
Identify problems in survey questions
A school designs a questionnaire about school meals. State a problem with each question and rewrite it appropriately.
(a) “Don’t you agree that the canteen food is great?”
(b) “Do you prefer pizza or salad: Yes or No?”
(c) “How would you rate your own cooking ability out of 10?”
(a) Problem: leading question. It suggests the answer and introduces bias.
(a) Rewrite: “How would you rate the canteen food on a scale of 1–5?”
(b) Problem: ambiguous phrasing. “Yes/No” doesn’t match a choice question.
(b) Rewrite: “Which do you prefer: pizza or salad?” (pick one)
(c) Problem: self-judgement. People struggle to rate themselves objectively.
(c) Rewrite: “How often do you cook a meal at home per week?”
Takeaway: questions should be neutral, precise, and not require self-judgement.
WE 2
Test–retest reliability with data
A psychologist tests 6 students on a memory task. Two weeks later, the same task is repeated with the same students.
Student | 1 | 2 | 3 | 4 | 5 | 6
Test 1 | 12 | 15 | 8 | 10 | 14 | 9
Test 2 | 13 | 14 | 9 | 11 | 15 | 9
(a) State the reliability test being used.
(b) Comment on the reliability of the process.
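(a) Same task, same sample, repeated later.
(a) Answer: test–retest.
(b) Compare the two sets. Differences (Test 2 − Test 1): +1, −1, +1, +1, +1, 0.
All differences are within ±1, and students who scored high on Test 1 also scored high on Test 2 ⇒ strong positive correlation (r ≈ 0.96).
(b) Answer: the process is reliable.
Takeaway: justify with the data. Small differences ⇒ high reliability.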
WE 3
Parallel forms reliability
A maths teacher creates two different but similar quizzes (Quiz A and Quiz B) testing the same topic. She gives Quiz A to her class on Monday and Quiz B to the same class on Wednesday. The results show a strong positive correlation.
(a) Identify the reliability test used.
(b) State one advantage of this test over test–retest.
(a) Two different but similar tests, same sample.
(a) Answer: parallel forms.
(b) Advantage over test–retest: students can’t remember answers from a previous identical test.
(b) Answer: it eliminates the memory / familiarity effect.
Takeaway: parallel forms reduces bias from students recognising the same questions.
WE 4
Content-related validity
For each scenario, state whether the process has content-related validity and justify.
(a) An IB Maths AI HL teacher creates an end-of-year exam covering only calculus, ignoring statistics, vectors, and graph theory.
(b) A driving examiner tests new drivers on highway driving, parking, reversing, and emergency stops.
(a) Covers only one syllabus topic; major sections are missing.
(a) Answer: NOT valid. It doesn’t cover all aspects of AI HL.
(b) Covers multiple driving skills, representative of real driving requirements.
(b) Answer: valid. It covers the full range of the variable.
Takeaway: content validity asks: are all important aspects measured?
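Once someone has decided which aspects the variable should cover (the part that usually needs expert judgement), the coverage check itself is a simple set comparison. The sketch below is a hypothetical helper applied to part (a); the list of required topics contains only the topics named in the scenario and is not the full AI HL syllabus.

```python
def missing_aspects(required, covered):
    """Return the required aspects that the process does not cover, sorted for readability."""
    return sorted(set(required) - set(covered))

# Topics named in part (a); this is not the full AI HL syllabus.
required_topics = ["calculus", "statistics", "vectors", "graph theory"]
exam_topics = ["calculus"]

gaps = missing_aspects(required_topics, exam_topics)

print("missing:", gaps)            # prints: missing: ['graph theory', 'statistics', 'vectors']
print("content-valid:", not gaps)  # prints: content-valid: False
```

An empty list of gaps would only show that the listed aspects are covered; whether the list itself is complete remains a judgement call.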
WE 5
Criterion-related validity
For each scenario, state whether the process has criterion-related validity and justify.
(a) Using SAT scores to predict university grade point averages.
(b) Using shoe size to predict reading ability in adults.
(a) SAT and GPA both measure academic ability; strong, well-established predictor.
(a) Answer: valid. The predictor matches the criterion.
(b) Shoe size and reading are unrelated; no meaningful predictive relationship.
(b) Answer: NOT valid. The variables are unrelated.
Takeaway: criterion validity asks: does one variable accurately predict the other?
WE 6
Improving a flawed survey
A health researcher stands outside a gym at 7 a.m. for one hour and surveys exiting customers in order to estimate the daily exercise levels of all adults in the country. State two flaws in this design and suggest one specific improvement for each.
Flaw 1: convenience sampling at a gym. Gym-goers exercise more than average ⇒ biased sample.
Improvement: sample from multiple locations (gym, park, homes).
Flaw 2: one hour only at one time. This excludes anyone who doesn’t go to that gym at 7 a.m.
Improvement: sample across multiple times and days.
Also: increase sample size; use random or stratified sampling.
💡 Top tips
Reliability vs validity: reliability = repeatability; validity = accuracy. You can be reliable but invalid (panel 2 of the target: a tight cluster away from the centre), or valid but unreliable (panel 3: arrows scattered widely around the centre).
Identify the test: same process repeated → test–retest; similar new questions → parallel forms; covers all aspects → content; predicts another → criterion.
Always comment in context: don’t just say “reliable” — explain why with reference to the data or scenario.
Good questions are neutral: no leading words, no opinion, no ambiguity, no missing options.
⚠ Common mistakes
Confusing reliability with validity. A thermometer that always reads 5°C high is reliable but not valid.
Saying “yes/no” to a reliability question without justifying with the data.
Confusing test–retest with parallel forms. Same questions = test–retest; new similar questions = parallel forms.
Treating “covers the topic” as criterion validity. Covering the topic is content validity; criterion is about prediction.
Suggesting fixes that don’t match the flaw. If the issue is bias, “use a larger sample” alone doesn’t help — fix the bias first.
Next up — Measures of Central Tendency. Once you’ve collected reliable, valid data, the first thing you’ll want is a single number that describes where the centre of the data lies. There are three classic measures — the mean, the median, and the mode — and each has its strengths, weaknesses, and a typical scenario where it’s the right choice.