IB Maths AI HL Statistics Toolkit Paper 1 & 2 ~7 min read

Reliability & Validity of Data Collection Methods

Picking a sampling technique is only half the battle. The other half is checking whether the data you actually collect is any good. Two key questions: is it reliable (would you get the same results if you ran the process again?) and is it valid (are you really measuring the thing you think you’re measuring?). The two ideas sound similar but they are very different — a measurement can be reliable yet completely invalid, or valid on average but wildly unreliable. The IB tests both with specific check methods: test–retest and parallel forms for reliability, plus content-related and criterion-related for validity.

📘 What you need to know

Reliability = consistency. A reliable process gives the same results when repeated on the same sample under the same conditions.
Validity = accuracy. A valid process accurately measures the variable it claims to measure.
Test–retest reliability: run the same process with the same sample at a later time. Reliable ⇒ strong positive correlation between the two sets of results.
Parallel forms reliability: give the same sample a second, similar set of questions covering the same variable. Reliable ⇒ strong positive correlation between the two sets of results.
Content-related validity — does the process cover all aspects of the variable? Usually needs expert input.
Criterion-related validity — does one variable accurately predict the outcome of another (the criterion variable)?
If unreliable or invalid: change the sampling technique, adjust the data collection process, or use a larger sample.
Good survey — anonymous where possible, no leading interviewer, no in-person bias.
Good questionnaire — unbiased questions, no leading wording, options cover all responses, no unnecessary personal info.

Reliable vs valid — the key distinction

The two terms feel similar but they describe completely different things. Reliability is about repeatability; validity is about accuracy. The classic target analogy makes the difference vivid.

reliability = clustering · validity = hitting the centre

Reliability is how tightly your arrows cluster. Validity is whether they cluster on the centre. The two are independent qualities — you can have either without the other.

🤔 A real example of “reliable but invalid”

A thermometer that always reads 5°C too high is perfectly reliable — give it the same water on different days and it gives the same wrong answer every time. But it’s not valid, because it doesn’t actually measure temperature accurately. The fix isn’t to repeat the readings — it’s to recalibrate the thermometer.
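The thermometer idea can be put into numbers. The sketch below is only an illustration: the readings and the "true" temperature of 20.0 °C are invented, and the names `readings`, `spread` and `bias` are not from the syllabus. It measures how tightly the readings cluster (reliability) separately from how far their average sits from the true value (validity).

```python
import statistics

# Hypothetical data: six readings (°C) of water assumed to be exactly 20.0 °C.
TRUE_TEMPERATURE = 20.0
readings = [25.1, 24.9, 25.0, 25.0, 25.1, 24.9]

# Spread of the readings: a small value means consistent results (reliable).
spread = statistics.pstdev(readings)

# Average offset from the true value: far from 0 means inaccurate (not valid).
bias = statistics.mean(readings) - TRUE_TEMPERATURE

print(f"spread = {spread:.2f} °C, bias = {bias:+.2f} °C")
```

With these made-up readings the spread is under 0.1 °C while the bias is about +5 °C: a tight cluster in the wrong place. Taking more readings would not shrink the bias, which is why the fix is recalibration.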

Tests for reliability

Two tests. Both compare two sets of results from the same sample. If the two sets correlate strongly (positively), the process is reliable.

Test–Retest

same questions, later

Run the same process with the same sample at a later time. Look for positive correlation. Weakness: participants may remember the first attempt.

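Parallel Forms

similar questions, same sample

Give the same sample a second, similar set of questions covering the same variable. Look for positive correlation. Strength: participants cannot simply recall their answers from an identical earlier test.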

Tests for validity

Two tests again. Both ask, in different ways, whether the data collection process is actually measuring the right thing.

Content-related

covers all aspects?

Does the process measure all aspects of the variable? Usually needs expert judgement to confirm full coverage.

Criterion-related

predicts another variable?

Does this variable accurately predict the outcome of another (the “criterion”) variable? If so, the process is valid as a predictor.

Content-related validity — examples

Valid: a calculus test that includes differentiation, integration, and applications questions — covers the full variable.
Not valid: assessing a chef’s overall cooking ability by asking them to make 10 apple pies — only one dish is tested.

Criterion-related validity — examples

Valid: using mock exam scores to predict final exam scores — strong link.
Not valid: using meerkat heights to predict squirrel heights — no meaningful relationship.

Designing good surveys & questionnaires

Even a well-chosen sampling technique can produce bad data if the survey questions are loaded, ambiguous, or invasive. The IB expects you to spot these flaws.

Good survey design

Decide in-person vs remote. In-person can introduce response bias — people answer to please the interviewer.
Avoid interviewers who can unintentionally influence answers (a headteacher asking students “do you enjoy school?” is likely to get inflated yeses).
Keep the survey anonymous when possible to encourage honest responses.

Good questionnaire questions

Avoid	Example to avoid	Why
Leading questions	“You enjoy school, don’t you?”	Suggests the “right” answer
Personal questions	“What is your home address?”	Unnecessary; reduces trust
Loaded language	“Do you watch the boring news?”	Embeds an opinion in the question
Self-judgement	“How smart are you?”	People struggle to rate themselves
Ambiguous phrasing	“Do you study French or Spanish?”	Yes/no answer? Or pick one?
Missing options	“Which sport: football or tennis?”	Other sports excluded

🧠 Structured vs unstructured

Structured questions (multiple-choice, rating, ranking) are quick to analyse but give limited insight. Unstructured (open-ended) questions yield richer answers but take much longer to analyse. Most real questionnaires use a mix.

🧭 Recipe — identifying reliability or validity

Spot the test type: same process repeated → test–retest; new similar questions → parallel forms; covers all aspects → content; predicts another variable → criterion.
For reliability: compare the two sets of results. Are they similar / strongly correlated?
For validity: ask whether the process actually measures what it claims to (content) or whether the predictor is the right one (criterion).
Comment in context: state the conclusion (reliable / valid or not) and give a reason linked to the scenario.
If unreliable or invalid: suggest a larger sample, a different sampling technique, or changes to the process.

Worked examples

WE 1

Identify problems in survey questions

A school designs a questionnaire about school meals. State a problem with each question and rewrite it appropriately.
(a) “Don’t you agree that the canteen food is great?”
(b) “Do you prefer pizza or salad: Yes or No?”
(c) “How would you rate your own cooking ability out of 10?”

(a) Problem: leading question; it suggests the answer and introduces bias. Rewrite: “How would you rate the canteen food on a scale of 1–5?”
(b) Problem: ambiguous phrasing; “Yes/No” doesn’t match a choice question. Rewrite: “Which do you prefer: pizza or salad?” (pick one)
(c) Problem: self-judgement; people struggle to rate themselves objectively. Rewrite: “How often do you cook a meal at home per week?”
Takeaway: questions should be neutral, precise, and not require self-judgement.

WE 2

Test–retest reliability with data

A psychologist tests 6 students on a memory task. Two weeks later, the same task is repeated with the same students.

Student	1	2	3	4	5	6
Test 1	12	15	8	10	14	9
Test 2	13	14	9	11	15	9

(a) State the reliability test being used.
(b) Comment on the reliability of the process.

(a) Same task, same sample, repeated later ⇒ test–retest.
(b) Compare the two sets. Differences (Test 2 − Test 1): +1, −1, +1, +1, +1, 0. All are within ±1 ⇒ strong positive correlation, so the process is reliable.
Takeaway: justify with the data; small differences ⇒ high reliability.
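To back up the comment on reliability with a number, you could compute Pearson's correlation coefficient for the two rows of scores in the table above. In the exam you would normally get this from the GDC; the sketch below just spells out the same calculation. The function name `pearson_r` is our own choice for illustration.

```python
import math

# Scores copied from the WE 2 table.
test_1 = [12, 15, 8, 10, 14, 9]
test_2 = [13, 14, 9, 11, 15, 9]

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

r = pearson_r(test_1, test_2)
print(f"r = {r:.3f}")
```

For this data r comes out at roughly 0.96, which supports the conclusion that the two sets of results are strongly positively correlated.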

WE 3

Parallel forms reliability

A maths teacher creates two different but similar quizzes (Quiz A and Quiz B) testing the same topic. She gives Quiz A to her class on Monday and Quiz B to the same class on Wednesday. The results show a strong positive correlation.
(a) Identify the reliability test used.
(b) State one advantage of this test over test–retest.

(a) Two different but similar tests given to the same sample ⇒ parallel forms.
(b) Students can’t remember answers from a previous identical test, so the memory / familiarity effect is eliminated.
Takeaway: parallel forms reduces bias from students recognising the same questions.

WE 4

Content-related validity

For each scenario, state whether the process has content-related validity and justify.
(a) An IB Maths AI HL teacher creates an end-of-year exam covering only calculus, ignoring statistics, vectors, and graph theory.
(b) A driving examiner tests new drivers on highway driving, parking, reversing, and emergency stops.

(a) Not valid: the exam covers only one syllabus topic and major sections are missing, so it doesn’t cover all aspects of AI HL.
(b) Valid: the test covers multiple driving skills that are representative of real driving requirements, so it covers the full range of the variable.
Takeaway: content validity asks whether all important aspects are measured.

WE 5

Criterion-related validity

For each scenario, state whether the process has criterion-related validity and justify.
(a) Using SAT scores to predict university grade point averages.
(b) Using shoe size to predict reading ability in adults.

(a) Valid: SAT and GPA both measure academic ability, and SAT score is a strong, well-established predictor, so the predictor matches the criterion.
(b) Not valid: shoe size and reading ability are unrelated, so there is no meaningful predictive relationship.
Takeaway: criterion validity asks whether one variable accurately predicts the other.

WE 6

Improving a flawed survey

A health researcher stands outside a gym at 7 a.m. for one hour and surveys exiting customers about the daily exercise levels of all adults in the country. State two flaws in this design and suggest one specific improvement for each.

Flaw 1: convenience sampling at a gym; gym-goers exercise more than average ⇒ biased sample. Improvement: sample from multiple locations (gym, park, homes).
Flaw 2: one hour only, at one time of day; this excludes anyone who doesn’t go to that gym at 7 a.m. Improvement: sample across multiple times and days.
Also: increase the sample size; use random or stratified sampling.

💡 Top tips

Reliability vs validity: reliability = repeatability; validity = accuracy. You can be reliable but invalid (panel 2 of the target: a tight cluster away from the centre), or valid but unreliable (panel 3: arrows scattered widely around the centre).
Identify the test: same process repeated → test–retest; similar new questions → parallel forms; covers all aspects → content; predicts another → criterion.
Always comment in context: don’t just say “reliable” — explain why with reference to the data or scenario.
To improve: increase sample size, change sampling technique, refine questions, eliminate interviewer bias.
Good questions are neutral: no leading words, no opinion, no ambiguity, no missing options.

⚠ Common mistakes

Confusing reliability with validity. A thermometer that always reads 5°C high is reliable but not valid.
Saying “yes/no” to a reliability question without justifying with the data.
Confusing test–retest with parallel forms. Same questions = test–retest; new similar questions = parallel forms.
Treating “covers the topic” as criterion validity. Covering the topic is content validity; criterion is about prediction.
Suggesting fixes that don’t match the flaw. If the issue is bias, “use a larger sample” alone doesn’t help — fix the bias first.
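The last point can be checked with a small simulation. Everything in the sketch below is hypothetical: the population sizes, the exercise times and the function names are invented purely to mirror the gym survey scenario. It compares estimates from a gym-only sample with estimates from a simple random sample as the sample size grows.

```python
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers

# Hypothetical population (all figures invented for illustration):
# 2,000 gym-goers averaging about 60 minutes of daily exercise,
# 8,000 other adults averaging about 20 minutes.
gym_goers = [random.gauss(60, 10) for _ in range(2000)]
others = [random.gauss(20, 10) for _ in range(8000)]
population = gym_goers + others
true_mean = statistics.mean(population)

def gym_only_estimate(n):
    """Estimate the population mean from n people surveyed at the gym only."""
    return statistics.mean(random.sample(gym_goers, n))

def random_estimate(n):
    """Estimate the population mean from a simple random sample of size n."""
    return statistics.mean(random.sample(population, n))

print(f"true mean: {true_mean:.1f}")
for n in (10, 100, 1000):
    print(n, round(gym_only_estimate(n), 1), round(random_estimate(n), 1))
```

Under these assumptions the gym-only estimate stays near 60 minutes however large n becomes, while the random-sample estimate settles near the true mean. A larger sample reduces random scatter but leaves the bias untouched.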

Next up — Measures of Central Tendency. Once you’ve collected reliable, valid data, the first thing you’ll want is a single number that describes where the centre of the data lies. There are three classic measures — the mean, the median, and the mode — and each has its strengths, weaknesses, and a typical scenario where it’s the right choice.
