IB Maths AA HL
Topic 4 — Statistics & Probability
Paper 1 & 2
Pearson’s Product-Moment Correlation Coefficient
PMCC turns “describe the correlation” into a single number r between −1 and 1: sign tells direction, magnitude tells strength. You compute r on your GDC (LinReg or Stats menu) — almost never by hand. The critical-value check then decides whether a linear model is appropriate for that sample size.
📘 What you need to know
- Symbol: r — the sample PMCC.
- Range: −1 ≤ r ≤ 1.
- Sign: r > 0 ↔ positive correlation; r < 0 ↔ negative correlation; r = 0 ↔ no linear correlation (could still be a curve).
- Magnitude: |r| close to 1 = strong; close to 0 = weak. r = ±1 = perfect linear (every point on the line).
- How to compute: use the GDC’s “LinReg” or “2-Var Stats” (not “1-Var Stats”, which takes only one list). Enter the two lists, read r directly, and round to 3 s.f.
- Critical value test: if |r| > critical value (given in exam), the correlation is significant → linear model is appropriate.
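- Sensitive to outliers: a single bad point can shift r dramatically, typically toward 0 when it breaks an otherwise linear pattern (an extreme point lying along the trend can also inflate |r|).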
- Only measures linear relationships — r ≈ 0 doesn’t mean “no relationship”, just “no straight-line one”.
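The last bullet can be checked numerically. The sketch below is a minimal Python illustration, not something the exam requires: the helper name `pmcc` and the data are our own, chosen so that every point lies exactly on the curve y = x². The relationship is perfect, yet it is not a straight line.

```python
import math

def pmcc(xs, ys):
    """Sample PMCC from the summary-sum formula r = Sxy / (Sx * Sy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_x = math.sqrt(sum(x * x for x in xs) - n * x_bar ** 2)
    s_y = math.sqrt(sum(y * y for y in ys) - n * y_bar ** 2)
    return s_xy / (s_x * s_y)

# Illustrative data (not from an exam): every point lies on y = x^2.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curve = pmcc(xs, ys)
print(round(r_curve, 3))
```

Here r comes out as 0 because the positive and negative halves of the parabola cancel in Σxy, even though y is completely determined by x.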
Reading the value of r
| Range of \|r\| | Strength | Practical meaning |
|---|---|---|
| 0.9 – 1.0 | very strong | data lies essentially on a line |
| 0.7 – 0.9 | strong | clear linear trend with some scatter |
| 0.5 – 0.7 | moderate | trend visible but plenty of variation |
| 0.3 – 0.5 | weak | slight tendency, lots of noise |
| 0 – 0.3 | very weak / none | no useful linear relationship |
Always quote both sign and magnitude: “r = −0.85, indicating a strong negative linear correlation”.
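As a rough sketch, the table can be turned into a small lookup. The function name `describe_r` is hypothetical, and treating each band's lower edge as inclusive is an assumption, since the table's ranges share their endpoints.

```python
def describe_r(r):
    """Return a phrase such as 'strong negative linear correlation'.

    Band edges follow the table in this guide; each lower edge is
    treated as inclusive (an assumption, as the ranges share endpoints).
    """
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    if size >= 0.9:
        strength = "very strong"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    elif size >= 0.3:
        strength = "weak"
    else:
        return "very weak or no linear correlation"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} linear correlation"

print(describe_r(-0.85))  # strong negative linear correlation
```

The bands are a convention for describing r, not a formal rule, so a value sitting on a boundary can reasonably be described either way.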
Formula (for understanding only)
PMCC — formula booklet
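$$r = \frac{S_{xy}}{S_x \cdot S_y}$$
where $S_{xy} = \sum xy - n\bar{x}\bar{y}$, $S_x = \sqrt{\sum x^2 - n\bar{x}^2}$ and $S_y = \sqrt{\sum y^2 - n\bar{y}^2}$. The GDC computes all three internally, so you don’t need to. The formula simply shows that r is the ratio of the “joint variation” of x and y (proportional to their covariance) to the product of their “individual variations” (proportional to their standard deviations).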
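To see the formula at work, here is a minimal Python sketch that evaluates the three quantities for the practice-hours data used in Worked Example 1. The variable names are our own; a GDC does the same arithmetic internally.

```python
import math

# Data from Worked Example 1 (practice hours vs exam score).
hours = [2, 4, 5, 7, 9, 11, 12, 14]
scores = [28, 32, 45, 48, 63, 65, 78, 82]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n

# Sxy = sum(xy) - n * x_bar * y_bar  (joint variation)
s_xy = sum(x * y for x, y in zip(hours, scores)) - n * x_bar * y_bar
# Sx and Sy are the square roots of the individual variations.
s_x = math.sqrt(sum(x * x for x in hours) - n * x_bar ** 2)
s_y = math.sqrt(sum(y * y for y in scores) - n * y_bar ** 2)

r = s_xy / (s_x * s_y)
print(f"Sxy = {s_xy:.1f}, Sx = {s_x:.3f}, Sy = {s_y:.3f}, r = {r:.3f}")
```

The script gives Sxy = 583 and r = 0.984 to 3 s.f., agreeing with the GDC value quoted in Worked Example 1.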
The critical-value test
An r-value alone doesn’t say whether the linear pattern is statistically significant — small samples can produce large |r| by chance. The exam will give a critical value depending on n:
Decision rule
if |r| > critical value → significant; linear model appropriate
Larger samples have smaller critical values, because more data means less chance of a “fluke” correlation.
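The claim that small samples produce “fluke” correlations more often can be illustrated with a short simulation. This is only a sketch under stated assumptions: x and y are generated independently (so any correlation is pure chance), and the threshold 0.6 and the sample sizes 5 and 50 are arbitrary illustrative choices, not exam critical values.

```python
import math
import random

def pmcc(xs, ys):
    """Sample PMCC from the summary-sum formula r = Sxy / (Sx * Sy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_x = math.sqrt(sum(x * x for x in xs) - n * x_bar ** 2)
    s_y = math.sqrt(sum(y * y for y in ys) - n * y_bar ** 2)
    return s_xy / (s_x * s_y)

def share_exceeding(n, threshold, trials, rng):
    """Fraction of unrelated random samples of size n with |r| > threshold."""
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]  # generated independently of xs
        if abs(pmcc(xs, ys)) > threshold:
            hits += 1
    return hits / trials

rng = random.Random(0)  # fixed seed so the run is repeatable
small = share_exceeding(n=5, threshold=0.6, trials=2000, rng=rng)
large = share_exceeding(n=50, threshold=0.6, trials=2000, rng=rng)
print(f"n = 5:  {small:.1%} of unrelated samples have |r| > 0.6")
print(f"n = 50: {large:.1%} of unrelated samples have |r| > 0.6")
```

With only 5 points a sizeable share of completely unrelated samples clears the threshold, while with 50 points almost none do. That is why the critical value has to be higher when n is small.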
🧭 Recipe — find and interpret r
- Enter the x and y lists into the GDC.
- Run “LinReg(ax+b)” or “Stat → 2-Var Stats” — read r from the output.
- Round to 3 s.f.
- State direction and strength: e.g., “r = 0.84 → strong positive linear correlation”.
- If a critical value is given: compare |r| to it; if larger, conclude linear model is appropriate.
- Comment in context — what does this mean for the variables?
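The recipe can be sketched end to end in Python. The helper names (`pmcc`, `round_sf`, `interpret`) are hypothetical, the strength bands are the ones tabulated in this guide, and the critical value 0.707 for n = 8 is the one quoted in Worked Example 6. The final step, commenting in context, still has to be done by you.

```python
import math

def pmcc(xs, ys):
    """Sample PMCC from the summary-sum formula r = Sxy / (Sx * Sy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_x = math.sqrt(sum(x * x for x in xs) - n * x_bar ** 2)
    s_y = math.sqrt(sum(y * y for y in ys) - n * y_bar ** 2)
    return s_xy / (s_x * s_y)

def round_sf(value, sig=3):
    """Round to `sig` significant figures (3 s.f. is this guide's default)."""
    if value == 0:
        return 0.0
    digits = sig - 1 - math.floor(math.log10(abs(value)))
    return round(value, digits)

def interpret(xs, ys, critical_value=None):
    """Compute r, round it, describe it, and optionally test it."""
    r = round_sf(pmcc(xs, ys))
    # Strength bands from the table in this guide (lower edge inclusive).
    bands = [(0.9, "very strong"), (0.7, "strong"), (0.5, "moderate"),
             (0.3, "weak"), (0.0, "very weak")]
    strength = next(label for edge, label in bands if abs(r) >= edge)
    if r == 0:
        description = "no linear correlation"
    else:
        direction = "positive" if r > 0 else "negative"
        description = f"{strength} {direction} linear correlation"
    summary = f"r = {r} -> {description}"
    if critical_value is not None:
        # The test compares |r|, not r, with the critical value.
        verdict = "appropriate" if abs(r) > critical_value else "not supported"
        summary += f"; linear model {verdict}"
    return summary

hours = [2, 4, 5, 7, 9, 11, 12, 14]
scores = [28, 32, 45, 48, 63, 65, 78, 82]
print(interpret(hours, scores, critical_value=0.707))
```

For the practice-hours data this prints `r = 0.984 -> very strong positive linear correlation; linear model appropriate`.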
Worked examples
WE 1: Compute r from raw data (practice hours vs score)
The hours of practice (x) and exam score (y) for 8 students:
| Hours (x) | 2 | 4 | 5 | 7 | 9 | 11 | 12 | 14 |
|---|---|---|---|---|---|---|---|---|
| Score (y) | 28 | 32 | 45 | 48 | 63 | 65 | 78 | 82 |
Find r to 3 s.f. and interpret.
Step 1: Enter both lists into the GDC
L1 = hours; L2 = scores
Step 2: Run LinReg(ax+b) or 2-Var Stats
GDC output: r = 0.984353…
Step 3: Round and interpret
r ≈ 0.984 (very close to 1)
r = 0.984 (3 s.f.) → very strong positive linear correlation
more practice hours strongly associated with higher scores
WE 2: Compute r (coffee cups vs hours of sleep)
The number of cups of coffee (x) drunk during the day and hours of sleep that night (y) for 7 people:
| Coffee (x) | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Sleep in hours (y) | 8.5 | 8.0 | 7.0 | 6.5 | 5.8 | 5.0 | 4.2 |
Find r to 3 s.f. and describe the correlation.
Step 1: Enter into GDC, run LinReg
GDC: r = −0.99774…
Step 2: Round
r ≈ −0.998
Step 3: Interpret
|r| ≈ 1 → near-perfect linear correlation; sign negative → as coffee ↑, sleep ↓
r ≈ −0.998 → very strong negative linear correlation
in real data this would be unusually strong; expect more scatter in genuine bivariate samples
WE 3: Use the critical-value test
For a sample of size n = 12, the critical value is 0.576. (a) Test the r-value 0.62. (b) Test the r-value 0.45. State in each case whether a linear model is appropriate.
Decision rule: linear model appropriate iff |r| > critical value
(a) r = 0.62
|0.62| = 0.62 > 0.576 ✓
→ correlation is significant; linear model APPROPRIATE
(b) r = 0.45
|0.45| = 0.45 < 0.576 ✗
→ correlation NOT significant; linear model NOT supported
(a) appropriate; (b) not supported
a moderate-looking r can still fail the significance test if the sample is small
WE 4: Effect of an outlier on r
For the data:
| x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| y | 10 | 14 | 19 | 23 | 28 | 32 | 37 | 41 |
the GDC gives r ≈ 0.9997.
Now suppose the last y-value is mis-recorded as 5 instead of 41 (so the data becomes 10, 14, 19, 23, 28, 32, 37, 5). Re-running gives r ≈ 0.326. Comment on the effect.
Compare the two values
Before outlier: r ≈ 0.9997 (essentially perfect)
After outlier: r ≈ 0.326 (very weak)
→ a single rogue point dropped r by about 0.67
Interpret
PMCC is highly sensitive to outliers because deviations are squared
(via the variance terms in the formula)
A single outlier reduced r from 0.9997 to 0.326, so PMCC is NOT robust to outliers
always inspect a scatter plot first; one rogue point can hide a strong real relationship
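The two r-values in this example can be reproduced with a short Python sketch. The helper `pmcc` is our own name for the summary-sum formula; the data are exactly those given in the question.

```python
import math

def pmcc(xs, ys):
    """Sample PMCC from the summary-sum formula r = Sxy / (Sx * Sy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_x = math.sqrt(sum(x * x for x in xs) - n * x_bar ** 2)
    s_y = math.sqrt(sum(y * y for y in ys) - n * y_bar ** 2)
    return s_xy / (s_x * s_y)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys_clean = [10, 14, 19, 23, 28, 32, 37, 41]
ys_error = ys_clean[:-1] + [5]  # last value mis-recorded as 5 instead of 41

r_clean = pmcc(xs, ys_clean)
r_error = pmcc(xs, ys_error)
print(f"clean data:   r = {r_clean:.4f}")
print(f"with outlier: r = {r_error:.3f}")
print(f"drop in r:    {r_clean - r_error:.2f}")
```

Changing one value out of eight takes r from 0.9997 to 0.326, a drop of about 0.67, while the other seven points still lie almost exactly on a line.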
WE 5: Predict the sign of r without calculating
For each pair of variables, state whether r would be positive, negative, or close to zero, with a brief reason.
(a) Outdoor temperature (winter) and monthly heating bill.
(b) Engine size in litres and fuel efficiency in mpg.
(c) Hours of sunshine in a city in June and ice-cream sales.
(d) A person’s shoe size and their phone number.
(a) Cold day → more heating → bill ↑; as temp ↓, bill ↑
→ NEGATIVE r
(b) Larger engines burn more fuel → fewer mpg
→ NEGATIVE r
(c) More sunshine → more demand for ice cream
→ POSITIVE r
(d) Phone numbers are arbitrary identifiers — no relationship to shoe size
→ r ≈ 0 (no linear relationship)
(a) negative; (b) negative; (c) positive; (d) ≈ 0
predicting the sign first is a good sanity check on your GDC output
WE 6: Height vs arm span (full analysis with critical value)
The heights and arm spans (in cm) of 8 adults:
| Height (cm) | 155 | 162 | 168 | 170 | 175 | 178 | 182 | 188 |
|---|---|---|---|---|---|---|---|---|
| Arm span (cm) | 152 | 165 | 167 | 173 | 174 | 180 | 184 | 191 |
(a) Find r to 3 s.f. (b) Given the critical value at n = 8 is 0.707, decide whether a linear model is appropriate. (c) Comment in context.
(a) Enter into GDC, run LinReg
GDC: r = 0.98683…
→ r ≈ 0.987 (3 s.f.)
(b) Critical-value test
|r| = 0.987 > 0.707 ✓
→ linear model is APPROPRIATE
(c) Context
Very strong positive linear correlation
Tall people tend to have long arm spans (consistent with anatomy)
r = 0.987; linear model appropriate; very strong positive linear correlation
always pair a numerical answer with a contextual interpretation
💡 Top tips
- Always use the GDC’s LinReg or 2-Var Stats — by-hand calculation is slow and error-prone.
- Quote sign AND strength, e.g. “r = 0.84 indicates a strong positive linear correlation”.
- Round to 3 s.f. unless the question specifies otherwise.
- Check your scatter plot first — a high r from data with an outlier can still be misleading.
- For critical-value problems, compare |r| (not r) — the test is two-sided.
⚠ Common mistakes
- Saying “r = 0 means no relationship” — it only means no linear relationship; data could still follow a curve.
- Comparing r instead of |r| to the critical value — sign doesn’t matter for significance.
- Concluding causation from a high r — strong correlation only means strong linear association.
- Confusing r with r² — r² (the coefficient of determination) is a separate quantity.
- Using too few significant figures — 3 s.f. is the IB default for r.
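The difference between r and r² is easy to see numerically. The sketch below uses the coffee and sleep data from Worked Example 2; the variable names are our own.

```python
import math

# Data from Worked Example 2 (cups of coffee vs hours of sleep).
coffee = [1, 2, 3, 4, 5, 6, 7]
sleep = [8.5, 8.0, 7.0, 6.5, 5.8, 5.0, 4.2]

n = len(coffee)
x_bar = sum(coffee) / n
y_bar = sum(sleep) / n
s_xy = sum(x * y for x, y in zip(coffee, sleep)) - n * x_bar * y_bar
s_x = math.sqrt(sum(x * x for x in coffee) - n * x_bar ** 2)
s_y = math.sqrt(sum(y * y for y in sleep) - n * y_bar ** 2)

r = s_xy / (s_x * s_y)
r_squared = r ** 2  # squaring discards the sign

print(f"r   = {r:.3f}")
print(f"r^2 = {r_squared:.3f}")
```

Here r ≈ −0.998 while r² ≈ 0.995: the two numbers differ, and r² no longer shows that the correlation is negative. If a question asks for r, report r.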
Next: Linear Regression. Once you know the correlation is linear, you fit the line itself. The least-squares regression line of y on x minimises the sum of squared vertical distances; the line of x on y minimises the horizontal ones (HL extension). Both pass through the mean point — and the choice between them depends on which variable you’re predicting.