IB Maths AA HL · Topic 4: Statistics & Probability · Paper 1 & 2 · HL only (x on y line)

Linear Regression

When the correlation is strong and linear, you replace the “by-eye” line of best fit with a least-squares regression line. There are two such lines — one for predicting y from x, one for predicting x from y — and choosing the wrong one gives unreliable predictions.

📘 What you need to know

y on x: y = ax + b — minimises sum of squared vertical distances. Used to predict y from x.
x on y (HL): x = cy + d — minimises sum of squared horizontal distances. Used to predict x from y.
Both lines pass through the mean point (x̄, ȳ), so they intersect there.
Sign of the gradient matches the sign of r: positive ↔ positive correlation, negative ↔ negative.
Compute via GDC: enter both lists, run “LinReg(ax+b)” — for the x on y line, swap the two lists.
Interpolation (predicting inside the data range) is reliable when correlation is strong.
Extrapolation (predicting outside the data range) is unreliable — the pattern may not continue.
Use the y on x line ONLY to predict y; use the x on y line ONLY to predict x.
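
For reference, the coefficients the GDC reports come from the standard least-squares formulas. The block below states them using the shorthand S_xx, S_yy and S_xy, which is notation introduced here for illustration and is not used elsewhere in this note; in an exam you would still read a, b, c and d from the calculator.

```latex
\begin{aligned}
S_{xx} &= \sum (x_i - \bar{x})^2, \qquad S_{yy} = \sum (y_i - \bar{y})^2, \qquad S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) \\
y \text{ on } x:\quad a &= \frac{S_{xy}}{S_{xx}}, \qquad b = \bar{y} - a\bar{x} \\
x \text{ on } y:\quad c &= \frac{S_{xy}}{S_{yy}}, \qquad d = \bar{x} - c\bar{y} \\
r &= \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}, \qquad ac = r^2
\end{aligned}
```

Substituting x = x̄ into y = ax + b gives y = ȳ, and substituting y = ȳ into x = cy + d gives x = x̄, which is why both lines pass through the mean point. Since a, c and r all share the numerator S_xy, they also share the same sign.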

The two regression lines

Regression of y on x

y = ax + b

minimises vertical squared distances; predict y from x

Regression of x on y (HL)

x = cy + d

minimises horizontal squared distances; predict x from y

Why two lines? Each line minimises a different set of distances (vertical for y on x, horizontal for x on y), so the optimal line depends on which variable you are predicting. The two lines coincide only when r = ±1; the closer r is to ±1, the closer together they lie.
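
One way to see how the two lines relate is to compute both from the same data. The sketch below is a minimal illustration in plain Python, not a replacement for the GDC; the function name `regression_lines` is hypothetical, and the data are the six points used in WE 5. It relies on the standard least-squares result that the product of the two gradients equals r².

```python
from math import sqrt

def regression_lines(xs, ys):
    """Return (a, b, c, d, r) for y = a*x + b and x = c*y + d."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx            # gradient of the y on x line
    b = y_bar - a * x_bar    # intercept chosen so the line passes through the mean point
    c = sxy / syy            # gradient of the x on y line
    d = x_bar - c * y_bar
    r = sxy / sqrt(sxx * syy)
    return a, b, c, d, r

# Data from WE 5
xs = [2, 4, 6, 8, 10, 12]
ys = [10, 16, 18, 26, 28, 36]
a, b, c, d, r = regression_lines(xs, ys)

# Drawn on the same axes, the x on y line has slope 1/c, not a.
print(f"y on x slope:            {a:.4f}")
print(f"x on y slope (as dy/dx): {1 / c:.4f}")
print(f"a * c = {a * c:.4f}, r^2 = {r * r:.4f}")
```

The two slopes agree only when a·c = 1, that is, when r² = 1. For any weaker correlation the x on y line is the steeper of the two when both are drawn on the usual axes.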

Using the regression line to predict

Predict y from x: ŷ = ax + b (use the y on x line)

Predict x from y (HL): x̂ = cy + d (use the x on y line)

Using the wrong line, for example rearranging y = ax + b to predict x from y, gives a different and less reliable prediction than the correct x on y line. Unless r = ±1, rearranging one line does NOT give the other.
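
The difference between the two approaches can be checked numerically. The following sketch uses the WE 2 data (distance and calories) and a hypothetical helper `fit`; it compares the rearranged y on x line with the proper x on y line at y = 500.

```python
def fit(us, vs):
    """Least-squares line v = m*u + k for predicting v from u."""
    n = len(us)
    u_bar, v_bar = sum(us) / n, sum(vs) / n
    suu = sum((u - u_bar) ** 2 for u in us)
    suv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    m = suv / suu
    return m, v_bar - m * u_bar

distance = [3, 5, 6, 8, 10, 12]             # x, from WE 2
calories = [240, 380, 460, 600, 770, 920]   # y, from WE 2

a, b = fit(distance, calories)   # y on x
c, d = fit(calories, distance)   # x on y: same routine, lists swapped

y_given = 500
x_wrong = (y_given - b) / a      # rearranged y on x line
x_right = c * y_given + d        # x on y line

print(f"rearranged y on x: {x_wrong:.4f} km")
print(f"x on y:            {x_right:.4f} km")
```

With this data the two answers differ only in the fourth significant figure because r is very close to 1. The gap grows as |r| falls, since the two predicted deviations from x̄ differ by a factor of r².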

Reliability of predictions

Prediction type	Reliability	Why
Interpolation (inside data range)	reliable	linear pattern is supported by data here
Extrapolation (outside data range)	unreliable	pattern may not continue; no data to confirm
Strong correlation (|r| ≈ 1)	more reliable	data tightly fits the line
Weak correlation	less reliable	large variation around the line
Larger sample	more reliable	regression line itself is better estimated

🧭 Recipe — find and use the regression line

Check correlation: only fit a regression line if |r| is reasonably large (or critical-value test passes).
Identify which line you need: are you predicting y or x?
Enter the data into the GDC (x on L1, y on L2). For x on y, swap the lists.
Run LinReg(ax+b) and record a and b (or c and d) at full display precision; round only when stating the final answer.
Substitute the given value into the equation to predict.
Check whether the prediction is interpolation or extrapolation and comment on reliability.
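
The recipe above can be sketched as a single routine. This is an illustration only: the function name `fit_and_predict` and the cut-off `r_threshold=0.7` are assumptions made for the example, not IB-defined values, and the data are taken from WE 1.

```python
from math import sqrt

def fit_and_predict(xs, ys, x_new, r_threshold=0.7):
    """Fit y on x, predict at x_new, and comment on reliability.

    r_threshold is an illustrative cut-off, not an official value.
    """
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    r = sxy / sqrt(sxx * syy)     # check the correlation first
    a = sxy / sxx                 # coefficients of the y on x line
    b = y_bar - a * x_bar
    y_hat = a * x_new + b         # substitute the given value

    inside = min(xs) <= x_new <= max(xs)   # interpolation or extrapolation?
    kind = "interpolation" if inside else "extrapolation"
    reliable = inside and abs(r) >= r_threshold
    return {"r": r, "a": a, "b": b, "prediction": y_hat,
            "kind": kind, "reliable": reliable}

# WE 1 data: hours of practice and test score
hours = [1, 3, 5, 7, 9, 11, 13, 15]
score = [25, 32, 41, 50, 56, 64, 70, 78]

for x_new in (6, 25):
    result = fit_and_predict(hours, score, x_new)
    print(x_new, round(result["prediction"], 1), result["kind"], result["reliable"])
```

To predict x from y instead, the same routine would be called with the two lists swapped, mirroring the list swap on the GDC.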

Worked examples

WE 1

Find the y on x line and use it to predict y

The hours of practice per week (x) and piano test score out of 100 (y) for 8 students:

Hours	1	3	5	7	9	11	13	15
Score	25	32	41	50	56	64	70	78

(a) Find the equation of the regression line of y on x, giving coefficients to 3 s.f. (b) Predict the score for a student who practises 10 hours per week.

(a) Enter into GDC, run LinReg(ax+b)
a = 3.7857… ≈ 3.79
b = 21.7143… ≈ 21.7
y = 3.79x + 21.7

(b) Predict y at x = 10
y = 3.7857(10) + 21.7143 = 59.57… → predicted score ≈ 59.6

Answer: y = 3.79x + 21.7; score ≈ 59.6

Tip: use the full display values from the GDC for predictions, not the values rounded to 3 s.f.

WE 2 · HL

Find the x on y line and use it to predict x

The distance run (x, in km) and calories burned (y) for 6 athletes:

Distance (km)	3	5	6	8	10	12
Calories	240	380	460	600	770	920

(a) Find the equation of the regression line of x on y. (b) Estimate the distance run by an athlete who burned 500 calories.

(a) Enter into GDC with calories as L1 and distance as L2 (swap them for x on y)
Run LinReg(ax+b) on the swapped lists
c = 0.013142… ≈ 0.0131
d = −0.04832… ≈ −0.0483
x = 0.0131y − 0.0483

(b) Predict x at y = 500
x = 0.013142(500) − 0.04832 = 6.5229… → predicted distance ≈ 6.52 km

Answer: x = 0.0131y − 0.0483; distance ≈ 6.52 km

Tip: always use x on y when the given value is y and you want to find x

WE 3

Choose which regression line to use

For each scenario, state which regression line should be used.
(a) A student studies for 6 hours and you want to estimate their test score.
(b) A student wants to score 80 on the test and you want to estimate the hours of study they should put in.

Identify what is given and what is being predicted
(a) Given hours (x), predict score (y) → predict y from x → use y on x line
(b) Given target score (y), predict hours (x) → predict x from y → use x on y line (HL)

Answer: (a) y on x; (b) x on y

Tip: don’t rearrange one line into the other; they are mathematically different lines

WE 4

Interpolation vs extrapolation

Using the regression line from WE 1 (y = 3.7857x + 21.7143, with data x ranging from 1 to 15 hours): (a) Predict the score for a student who practises 6 hours. (b) Predict the score for a student who practises 25 hours. (c) Comment on the reliability of each.

(a) x = 6: INSIDE data range
y = 3.7857(6) + 21.7143 = 44.43 → predicted score ≈ 44.4

(b) x = 25: OUTSIDE data range (max was 15)
y = 3.7857(25) + 21.7143 = 116.36 → predicted score ≈ 116

(c) Comment
(a) interpolation → reliable, supported by data
(b) extrapolation → unreliable; the value also exceeds the maximum score of 100, so it is impossible

Answer: (a) ≈ 44.4 (reliable interpolation); (b) ≈ 116 (unreliable extrapolation, exceeds 100)

Tip: extrapolation often produces values outside any meaningful range

WE 5 · HL

Verify both regression lines intersect at the mean point

For the data
x: 2, 4, 6, 8, 10, 12
y: 10, 16, 18, 26, 28, 36
find both regression lines and verify that they intersect at (x̄, ȳ).

Step 1: Mean point
x̄ = (2+4+6+8+10+12)/6 = 42/6 = 7
ȳ = (10+16+18+26+28+36)/6 = 134/6 ≈ 22.33

Step 2: Run LinReg on GDC for y on x
y = 2.4857x + 4.9333

Step 3: Run LinReg on swapped lists for x on y
x = 0.3925y − 1.7654

Step 4: Check both lines at the mean point
y on x at x = 7: y = 2.4857(7) + 4.9333 = 22.33 ✓
x on y at y = 22.33: x = 0.3925(22.33) − 1.7654 = 7.00 ✓

Answer: both lines pass through (7, 22.33) ✓

Tip: this is a useful sanity check on your calculations
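
The same check can be reproduced away from the GDC. The sketch below is a minimal illustration with a hypothetical helper `fit`; it evaluates each line at the mean point and also solves the two equations simultaneously to locate the intersection.

```python
def fit(us, vs):
    """Least-squares line v = m*u + k for predicting v from u."""
    n = len(us)
    u_bar, v_bar = sum(us) / n, sum(vs) / n
    suu = sum((u - u_bar) ** 2 for u in us)
    suv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    m = suv / suu
    return m, v_bar - m * u_bar

xs = [2, 4, 6, 8, 10, 12]
ys = [10, 16, 18, 26, 28, 36]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

a, b = fit(xs, ys)   # y on x
c, d = fit(ys, xs)   # x on y (lists swapped)

# Evaluate each line at the mean of its own input variable.
y_at_mean = a * x_bar + b
x_at_mean = c * y_bar + d

# Solve y = a*x + b and x = c*y + d together:
# x = c*(a*x + b) + d  =>  x*(1 - a*c) = c*b + d
x_int = (c * b + d) / (1 - a * c)
y_int = a * x_int + b

print(f"mean point:   ({x_bar:.2f}, {y_bar:.2f})")
print(f"intersection: ({x_int:.2f}, {y_int:.2f})")
```

The simultaneous solution divides by 1 − ac, so it only works when r² ≠ 1; when r = ±1 the two lines coincide and there is no single intersection point.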

WE 6

Ice-cream sales — full regression analysis

An ice-cream stall records the daily temperature (x, in °C) and units sold (y) over 8 days:

Temp (°C)	15	18	22	25	28	30	32	35
Sales	120	145	180	210	245	270	290	320

(a) Find r and the equation of the regression line of y on x. (b) Predict sales when the temperature is 26 °C and comment on reliability. (c) Predict sales at 45 °C and comment.

(a) GDC: r and LinReg
r = 0.998 (very strong positive)
a = 10.20, b = −38.96
y = 10.2x − 39.0

(b) Predict sales at x = 26 (inside range 15–35)
y = 10.2035(26) − 38.9641 = 226.33 → ≈ 226 units
Reliable: interpolation + very strong correlation

(c) Predict at x = 45 (outside range)
y = 10.2035(45) − 38.9641 = 420.19 → ≈ 420 units
Unreliable: extrapolation; demand may saturate or staffing may limit sales

Answer: (a) r = 0.998, y = 10.2x − 39.0; (b) ≈ 226 (reliable); (c) ≈ 420 (unreliable)

Tip: always justify reliability with both “inside/outside range” and the strength of correlation

💡 Top tips

Store the regression coefficients in your GDC’s variable memory — avoids rounding errors when predicting.
For x on y, swap the lists in the GDC and run LinReg as usual; don’t try to algebraically rearrange the y on x line.
Always state which line you’re using when answering — examiners check this.
Check whether your prediction is interpolation or extrapolation and mention reliability.
Both lines pass through (x̄, ȳ), which is a good way to verify your work.

⚠ Common mistakes

Rearranging the y on x line to predict x — this gives the wrong answer; you must use the x on y line.
Trusting an extrapolated prediction without comment — examiners want you to flag the unreliability.
Using rounded coefficients when computing predictions — leads to rounding errors that compound.
Forgetting to check that the correlation is strong before fitting: a regression line through weakly correlated data gives unreliable predictions.
Confusing the gradient signs of the two lines — both have the same sign as r, but they are different numerically.
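
The effect of rounding coefficients too early can be quantified with the x on y line from WE 2. The sketch below simply compares the two substitutions; the variable names are chosen for this example, and the "full" values are the truncated display values quoted in that worked example.

```python
# Coefficients as quoted in WE 2 (x on y line)
c_full, d_full = 0.013142, -0.04832
# The same coefficients rounded to 3 s.f.
c_round, d_round = 0.0131, -0.0483

y_given = 500
x_full = c_full * y_given + d_full
x_round = c_round * y_given + d_round

print(f"full display: {x_full:.2f} km")
print(f"rounded:      {x_round:.2f} km")
```

The full-display coefficients give 6.52 km while the rounded ones give 6.50 km, so the final answer is already wrong in the third significant figure.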

That closes the Correlation & Regression sub-section. Three notes, one tight workflow: scatter plot → check r → fit the right regression line → predict carefully. The next sub-section moves to Probability — sample spaces, set notation, Venn diagrams, and the formal language of “and / or / given that”.
