IB Maths AA HL Topic 4 — Statistics & Probability Paper 1 & 2 ~6 min read

Pearson’s Product-Moment Correlation Coefficient

PMCC turns “describe the correlation” into a single number r between −1 and 1: sign tells direction, magnitude tells strength. You compute r on your GDC (LinReg or Stats menu) — almost never by hand. The critical-value check then decides whether a linear model is appropriate for that sample size.

📘 What you need to know

Reading the value of r

Range of |r|    Strength            Practical meaning
0.9 – 1.0       very strong         data lies essentially on a line
0.7 – 0.9       strong              clear linear trend with some scatter
0.5 – 0.7       moderate            trend visible but plenty of variation
0.3 – 0.5       weak                slight tendency, lots of noise
0 – 0.3         very weak / none    no useful linear relationship
Always quote both sign and magnitude: “r = −0.85, indicating a strong negative linear correlation”.
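The table lookup can be sketched as a small Python function. This is only an illustration: the name describe_r is made up for this example, and sending a value that sits exactly on a boundary (such as 0.7) to the stronger band is an assumption, since the table's ranges share their endpoints.

```python
def describe_r(r: float) -> str:
    """Turn an r-value into a phrase using the strength bands in the table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    # Assumption: a boundary value (e.g. exactly 0.7) goes in the stronger band.
    if size >= 0.9:
        strength = "very strong"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    elif size >= 0.3:
        strength = "weak"
    else:
        return "very weak or no linear correlation"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} linear correlation"


print(describe_r(-0.85))  # strong negative linear correlation
```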

Formula (for understanding only)

PMCC (formula booklet):   r = Sxy / (Sx · Sy)

where Sxy = Σxy − n x̄ ȳ,   Sx = √(Σx² − n x̄²),   Sy = √(Σy² − n ȳ²). The GDC computes all three sums internally, so you don’t need to. The formula simply shows that r is the ratio of the “joint variation” (covariance) to the product of the “individual variations” (standard deviations); the factors of n cancel, which is why they do not appear.
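To see the formula at work without a GDC, here is a minimal Python sketch that builds r from the three sums, with Sxy = Σxy − n x̄ ȳ, Sx = √(Σx² − n x̄²) and Sy = √(Σy² − n ȳ²). The function name pmcc_from_sums is an assumption made for this example; the data are the practice-hours and score values from WE 1.

```python
from math import sqrt


def pmcc_from_sums(xs, ys):
    """PMCC via r = Sxy / (Sx * Sy), built from the raw sums."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_x = sqrt(sum(x * x for x in xs) - n * x_bar ** 2)
    s_y = sqrt(sum(y * y for y in ys) - n * y_bar ** 2)
    return s_xy / (s_x * s_y)


hours = [2, 4, 5, 7, 9, 11, 12, 14]        # x-values from WE 1
scores = [28, 32, 45, 48, 63, 65, 78, 82]  # y-values from WE 1
r = pmcc_from_sums(hours, scores)
print(round(r, 3))  # 0.984
```

For this data the intermediate values are Sxy = 583, Sx² = 124 and Sy² = 2828.875, which is a useful check if you ever do need to work a small case by hand.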

The critical-value test

An r-value alone doesn’t say whether the linear pattern is statistically significant — small samples can produce large |r| by chance. The exam will give a critical value depending on n:

Decision rule: if |r| > critical value → significant; linear model appropriate. Otherwise → not significant; linear model not supported.

Larger samples have smaller critical values, because more data means less chance of a “fluke” correlation.
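The decision rule is a single comparison, sketched below in Python. The function name is hypothetical; the critical value 0.576 (n = 12) and the r-values 0.62 and 0.45 are the ones used in WE 3, while −0.62 is an extra illustrative value added here to show why the rule uses |r| rather than r.

```python
def linear_model_appropriate(r: float, critical_value: float) -> bool:
    """Apply the rule: significant only if |r| exceeds the critical value."""
    return abs(r) > critical_value


critical = 0.576  # critical value quoted for n = 12 in WE 3
for r in (0.62, 0.45, -0.62):  # -0.62 is an illustrative extra, not from the text
    verdict = "significant" if linear_model_appropriate(r, critical) else "not significant"
    print(f"r = {r}: {verdict}")
```

A negative correlation can be just as significant as a positive one, so the sign is dropped before comparing.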

🧭 Recipe — find and interpret r

  1. Enter the x and y lists into the GDC.
  2. Run “LinReg(ax+b)” or “Stat → 2-Var Stats” — read r from the output.
  3. Round to 3 s.f.
  4. State direction and strength: e.g., “r = 0.84 → strong positive linear correlation”.
  5. If a critical value is given: compare |r| to it; if larger, conclude linear model is appropriate.
  6. Comment in context — what does this mean for the variables?
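Step 3 is where marks are most easily lost: 3 significant figures is not always 3 decimal places. The sketch below is one possible way to round to significant figures in Python; round_sf is a name invented for this example, and 0.0456789 is a made-up value included only to show the case |r| < 0.1.

```python
from math import floor, log10


def round_sf(value: float, sig: int = 3) -> float:
    """Round value to `sig` significant figures."""
    if value == 0:
        return 0.0
    # Position of the leading digit: -1 for 0.98..., -2 for 0.045...
    exponent = floor(log10(abs(value)))
    return round(value, sig - 1 - exponent)


print(round_sf(0.984353))   # 0.984   (GDC output in WE 1)
print(round_sf(-0.99774))   # -0.998  (GDC output in WE 2)
print(round_sf(0.0456789))  # 0.0457  (illustrative value: needs 4 decimal places)
```

For 0.1 ≤ |r| < 1 the two conventions agree, which covers most exam answers, but a very weak correlation needs an extra decimal place.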

Worked examples

WE 1

Compute r from raw data — practice hours vs score

The hours of practice (x) and exam score (y) for 8 students:

Hours (x)    2    4    5    7    9    11   12   14
Score (y)    28   32   45   48   63   65   78   82

Find r to 3 s.f. and interpret.

Step 1: Enter both lists into the GDC
L1 = hours; L2 = scores
Step 2: Run LinReg(ax+b) or 2-Var Stats
GDC output: r = 0.984353…
Step 3: Round and interpret
r ≈ 0.984 (very close to 1)
Answer: r = 0.984 (3 s.f.) → very strong positive linear correlation
Note: more practice hours are strongly associated with higher scores.

WE 2

Compute r — coffee cups vs hours of sleep

The number of cups of coffee (x) drunk during the day and hours of sleep that night (y) for 7 people:

Coffee (cups)    1     2     3     4     5     6     7
Sleep (h)        8.5   8.0   7.0   6.5   5.8   5.0   4.2

Find r to 3 s.f. and describe the correlation.

Step 1: Enter into GDC, run LinReg
GDC: r = −0.99774…
Step 2: Round
r ≈ −0.998
Step 3: Interpret
|r| ≈ 1 → near-perfect linear correlation; sign negative → as coffee ↑, sleep ↓
Answer: r ≈ −0.998 → very strong negative linear correlation
Note: in real data this would be unusually strong; expect more scatter in genuine bivariate samples.

WE 3

Use the critical-value test

For a sample of size n = 12, the critical value is 0.576. (a) Test the r-value 0.62. (b) Test the r-value 0.45. State in each case whether a linear model is appropriate.

Decision rule: linear model appropriate if and only if |r| > critical value
(a) r = 0.62
|0.62| = 0.62 > 0.576 ✓ → correlation is significant; linear model APPROPRIATE
(b) r = 0.45
|0.45| = 0.45 < 0.576 ✗ → correlation NOT significant; linear model NOT supported
Answer: (a) appropriate; (b) not supported
Note: an r of 0.45 can look like a real trend, but with only 12 data points it fails the significance test.

WE 4

Effect of an outlier on r

For the data:
x: 1, 2, 3, 4, 5, 6, 7, 8   y: 10, 14, 19, 23, 28, 32, 37, 41
the GDC gives r ≈ 0.9997.
Now suppose the last y-value is mis-recorded as 5 instead of 41 (so the data becomes 10, 14, 19, 23, 28, 32, 37, 5). Re-running gives r ≈ 0.326. Comment on the effect.

Compare the two values
Before outlier: r ≈ 0.9997 (essentially perfect)
After outlier: r ≈ 0.326 (weak)
→ a single rogue point dropped r by about 0.67
Interpret
PMCC is highly sensitive to outliers because deviations are squared (via the variance terms in the formula), so one point far from the trend carries disproportionate weight.
Answer: a single outlier reduced r from 0.9997 to 0.326; PMCC is NOT robust to outliers
Note: always inspect a scatter plot first; one rogue point can hide a strong real relationship.
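The two r-values in this example can be reproduced with a short Python sketch. The helper pmcc below is written for this illustration (it is not a GDC routine) and uses the deviation form of the formula; the data are exactly the x- and y-values given in the question.

```python
from math import sqrt


def pmcc(xs, ys):
    """PMCC from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


x = [1, 2, 3, 4, 5, 6, 7, 8]
y_clean = [10, 14, 19, 23, 28, 32, 37, 41]
y_rogue = y_clean[:-1] + [5]  # last value mis-recorded as 5 instead of 41

print(round(pmcc(x, y_clean), 4))  # 0.9997
print(round(pmcc(x, y_rogue), 3))  # 0.326
```

Only one of the eight y-values changes, yet Sxy falls from 188 to 62 while Sx and Sy barely move, which is where the collapse in r comes from.
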
WE 5

Predict the sign of r without calculating

For each pair of variables, state whether r would be positive, negative, or close to zero, with a brief reason.
(a) Outdoor temperature (winter) and monthly heating bill.
(b) Engine size in litres and fuel efficiency in mpg.
(c) Hours of sunshine in a city in June and ice-cream sales.
(d) A person’s shoe size and their phone number.

(a) Cold day → more heating → bill ↑; as temp ↓, bill ↑ → NEGATIVE r
(b) Larger engines burn more fuel → fewer mpg → NEGATIVE r
(c) More sunshine → more demand for ice cream → POSITIVE r
(d) Phone numbers are arbitrary identifiers with no relationship to shoe size → r ≈ 0 (no linear relationship)
Answer: (a) negative; (b) negative; (c) positive; (d) ≈ 0
Note: predicting the sign first is a good sanity check on your GDC output.

WE 6

Height vs arm span — full analysis with critical value

The heights and arm spans (in cm) of 8 adults:

Height (cm)      155   162   168   170   175   178   182   188
Arm span (cm)    152   165   167   173   174   180   184   191

(a) Find r to 3 s.f. (b) Given the critical value at n = 8 is 0.707, decide whether a linear model is appropriate. (c) Comment in context.

(a) Enter into GDC, run LinReg
GDC: r = 0.98683… → r ≈ 0.987 (3 s.f.)
(b) Critical-value test
|r| = 0.987 > 0.707 ✓ → linear model is APPROPRIATE
(c) Context
Very strong positive linear correlation: tall people tend to have long arm spans (consistent with anatomy).
Answer: r = 0.987; linear model appropriate; very strong positive linear correlation
Note: always pair a numerical answer with a contextual interpretation.

💡 Top tips

⚠ Common mistakes

Next: Linear Regression. Once you know the correlation is linear, you fit the line itself. The least-squares regression line of y on x minimises the sum of squared vertical distances; the line of x on y minimises the horizontal ones (HL extension). Both pass through the mean point — and the choice between them depends on which variable you’re predicting.
