IB Maths AA HL
Topic 4: Statistics & Probability
Paper 1 & 2
HL only (x on y line)
Linear Regression
When the correlation is strong and linear, you replace the “by-eye” line of best fit with a least-squares regression line. There are two such lines, one for predicting y from x and one for predicting x from y, and choosing the wrong one gives unreliable predictions.
What you need to know
- y on x: y = ax + b minimises the sum of squared vertical distances. Used to predict y from x.
- x on y (HL): x = cy + d minimises the sum of squared horizontal distances. Used to predict x from y.
- Both lines pass through the mean point (x̄, ȳ), so they intersect there.
- The sign of the gradient matches the sign of r: positive gradient for positive correlation, negative gradient for negative correlation.
- Compute via GDC: enter both lists and run “LinReg(ax+b)”; for the x on y line, swap the two lists.
- Interpolation (predicting inside the data range) is reliable when correlation is strong.
- Extrapolation (predicting outside the data range) is unreliable because the pattern may not continue.
- Use the y on x line ONLY to predict y; use the x on y line ONLY to predict x.
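The GDC does this arithmetic internally, but for reference the standard least-squares formulas behind both lines can be written as follows. The shorthand Sxx, Syy and Sxy is an assumption introduced here for compactness; it is not notation used elsewhere in this note.

```latex
\begin{align*}
S_{xx} &= \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\[4pt]
\text{y on x:}\quad a &= \frac{S_{xy}}{S_{xx}}, \qquad b = \bar{y} - a\,\bar{x} \\[4pt]
\text{x on y:}\quad c &= \frac{S_{xy}}{S_{yy}}, \qquad d = \bar{x} - c\,\bar{y}
\end{align*}
```

Because b and d are built from the means, each line passes through (x̄, ȳ). Since Sxx and Syy are positive, a and c both take the sign of Sxy, which is also the sign of r.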
The two regression lines
Regression of y on x
y = ax + b
minimises vertical squared distances; predict y from x
Regression of x on y (HL)
x = cy + d
minimises horizontal squared distances; predict x from y
Why two lines? The choice of “which axis to project onto” changes which line is optimal. The two lines coincide only when r = ±1. The closer r is to ±1, the closer the two lines are to each other.
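One way to see why the lines coincide only at r = ±1 is to compare their gradients on the same axes. This is a sketch using the standard results for the two gradients; the symbols s_x and s_y (the standard deviations of x and y) are introduced here and are not taken from this note. It assumes r ≠ 0.

```latex
\begin{align*}
a &= r\,\frac{s_y}{s_x}, \qquad c = r\,\frac{s_x}{s_y}
  \quad\Longrightarrow\quad ac = r^2 \\[4pt]
\text{gradient of } y = ax + b &: \quad a \\
\text{gradient of } x = cy + d \text{ drawn with } y \text{ vertical} &: \quad \frac{1}{c} = \frac{a}{r^2}
\end{align*}
```

The two gradients are equal exactly when r² = 1. Both lines already share the point (x̄, ȳ), so equal gradients make them the same line. For |r| < 1 the x on y line is the steeper of the two when plotted with y on the vertical axis.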
Using the regression line to predict
Predict y from x
ŷ = ax + b (use the y on x line)
Predict x from y (HL)
x̂ = cy + d (use the x on y line)
Using the wrong line (say, rearranging y = ax + b to predict x from y) gives a different, and less reliable, prediction than the correct x on y line. Rearranging one line does NOT give the other, unless r = ±1.
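The difference between the two approaches can be checked numerically. The sketch below uses the distance and calories data from WE 2 in this note; the helper `fit_line` is a hypothetical name written for this illustration, standing in for the GDC's LinReg(ax+b).

```python
def fit_line(inputs, outputs):
    """Least-squares line predicting outputs from inputs: returns (slope, intercept)."""
    n = len(inputs)
    mean_in = sum(inputs) / n
    mean_out = sum(outputs) / n
    s_cross = sum((u - mean_in) * (v - mean_out) for u, v in zip(inputs, outputs))
    s_in = sum((u - mean_in) ** 2 for u in inputs)
    slope = s_cross / s_in
    return slope, mean_out - slope * mean_in


# Data from WE 2: distance run (x, km) and calories burned (y).
distance = [3, 5, 6, 8, 10, 12]
calories = [240, 380, 460, 600, 770, 920]

a, b = fit_line(distance, calories)   # y on x: y = ax + b
c, d = fit_line(calories, distance)   # x on y: x = cy + d (lists swapped)

target_calories = 500
x_from_x_on_y = c * target_calories + d          # correct line for predicting x
x_from_rearranged = (target_calories - b) / a    # y on x line solved for x

print(round(x_from_x_on_y, 4), round(x_from_rearranged, 4))
```

For this data the two answers differ only from the third decimal place, because r is very close to 1. The gap widens as the correlation weakens.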
Reliability of predictions
| Prediction type | Reliability | Why |
|---|---|---|
| Interpolation (inside data range) | reliable | linear pattern is supported by data here |
| Extrapolation (outside data range) | unreliable | pattern may not continue; no data to confirm |
| Strong correlation (r close to ±1) | more reliable | data tightly fits the line |
| Weak correlation | less reliable | large variation around the line |
| Larger sample | more reliable | regression line itself is better estimated |
Recipe: find and use the regression line
- Check correlation: only fit a regression line if |r| is reasonably large (or the critical-value test is passed).
- Identify which line you need: are you predicting y or x?
- Enter the data into the GDC (x in L1, y in L2). For x on y, swap the lists.
- Run LinReg(ax+b) and record a and b (or c and d) to enough decimal places.
- Substitute the given value into the equation to predict.
- Check whether the prediction is interpolation or extrapolation and comment on reliability.
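The recipe can be sketched in code to show how the steps fit together. This is an illustration under stated assumptions, not a replacement for the GDC: the function names `summary` and `predict` are hypothetical, and the cut-off `min_abs_r=0.7` is an arbitrary choice for this sketch rather than an IB rule. The data is the practice-hours table from WE 1.

```python
from math import sqrt


def summary(xs, ys):
    """Return the two means and the sums Sxx, Syy, Sxy for paired data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return mx, my, sxx, syy, sxy


def predict(xs, ys, value, target="y", min_abs_r=0.7):
    """Check r, pick the line, predict, and label the prediction type.

    min_abs_r is a hypothetical threshold chosen for this sketch.
    """
    mx, my, sxx, syy, sxy = summary(xs, ys)
    r = sxy / sqrt(sxx * syy)
    if abs(r) < min_abs_r:
        raise ValueError(f"correlation too weak to fit a line (r = {r:.3f})")
    if target == "y":                       # y on x line: y = ax + b
        slope = sxy / sxx
        estimate = my + slope * (value - mx)
        known = xs
    else:                                   # x on y line: x = cy + d
        slope = sxy / syy
        estimate = mx + slope * (value - my)
        known = ys
    inside = min(known) <= value <= max(known)
    return estimate, r, "interpolation" if inside else "extrapolation"


hours = [1, 3, 5, 7, 9, 11, 13, 15]
scores = [25, 32, 41, 50, 56, 64, 70, 78]

for h in (10, 25):
    estimate, r, kind = predict(hours, scores, h, target="y")
    print(f"{h} hours: score = {estimate:.1f}, r = {r:.3f}, {kind}")
```

Writing the prediction as mean + slope × (value − mean) is equivalent to substituting into y = ax + b or x = cy + d, and it avoids carrying a rounded intercept.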
Worked examples
WE 1: Find the y on x line and use it to predict y
The hours of practice per week (x) and piano test score out of 100 (y) for 8 students:
| Hours | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 15 |
|---|---|---|---|---|---|---|---|---|
| Score | 25 | 32 | 41 | 50 | 56 | 64 | 70 | 78 |
(a) Find the equation of the regression line of y on x, giving coefficients to 3 s.f. (b) Predict the score for a student who practises 10 hours per week.
(a) Enter into GDC, run LinReg(ax+b)
a = 3.7857… ≈ 3.79
b = 21.7143… ≈ 21.7
y = 3.79x + 21.7
(b) Predict y at x = 10
y = 3.7857(10) + 21.7143 = 59.57…
→ predicted score ≈ 59.6
Answer: y = 3.79x + 21.7; score ≈ 59.6
Tip: use the full display values from the GDC for predictions, not the values rounded to 3 s.f.
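The size of the rounding effect can be seen with a quick comparison. This sketch takes the WE 1 coefficients as quoted in this note (the "full" values are themselves cut to 4 decimal places here) and compares predictions from the full and the 3 s.f. versions; the function name `gap` is hypothetical.

```python
# Coefficients for WE 1, as quoted in this note.
a_full, b_full = 3.7857, 21.7143   # GDC display values, cut to 4 d.p. for this sketch
a_3sf, b_3sf = 3.79, 21.7          # rounded to 3 s.f. for the written equation


def gap(x):
    """Prediction from the rounded line minus prediction from the full-value line."""
    return (a_3sf * x + b_3sf) - (a_full * x + b_full)


for x in (10, 25):
    print(f"x = {x}: gap = {gap(x):.2f}")
```

The gap grows with x because the small error in the gradient is multiplied by x, so the further the input is from zero, the more the rounded coefficients can shift the answer.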
WE 2 · HL: Find the x on y line and use it to predict x
The distance run (x, in km) and calories burned (y) for 6 athletes:
| Distance (km) | 3 | 5 | 6 | 8 | 10 | 12 |
|---|---|---|---|---|---|---|
| Calories | 240 | 380 | 460 | 600 | 770 | 920 |
(a) Find the equation of the regression line of x on y. (b) Estimate the distance run by an athlete who burned 500 calories.
(a) Enter into GDC with calories as L1 and distance as L2 (swap them for x on y)
Run LinReg(ax+b) on the swapped lists
c = 0.0131424… ≈ 0.0131
d = −0.04832… ≈ −0.0483
x = 0.0131y − 0.0483
(b) Predict x at y = 500
x = 0.0131424(500) − 0.04832 = 6.5229…
→ predicted distance ≈ 6.52 km
Answer: x = 0.0131y − 0.0483; distance ≈ 6.52 km
Tip: always use x on y when the given value is y and you want to find x.
WE 3: Choose which regression line to use
For each scenario, state which regression line should be used.
(a) A student studies for 6 hours and you want to estimate their test score.
(b) A student wants to score 80 on the test and you want to estimate the hours of study they should put in.
Identify what is given and what is being predicted
(a) Given hours (x), predict score (y)
→ predict y from x → use y on x line
(b) Given target score (y), predict hours (x)
→ predict x from y → use x on y line (HL)
Answer: (a) y on x; (b) x on y
Tip: don’t rearrange one line into the other; they are mathematically different lines.
WE 4: Interpolation vs extrapolation
Using the regression line from WE 1 (y = 3.7857x + 21.7143, with data x ranging from 1 to 15 hours): (a) Predict the score for a student who practises 6 hours. (b) Predict the score for a student who practises 25 hours. (c) Comment on the reliability of each.
(a) x = 6 is INSIDE the data range
y = 3.7857(6) + 21.7143 = 44.43
→ predicted score ≈ 44.4
(b) x = 25 is OUTSIDE the data range (max was 15)
y = 3.7857(25) + 21.7143 = 116.36
→ predicted score ≈ 116
(c) Comment
(a) interpolation: reliable, supported by data
(b) extrapolation: unreliable; it also exceeds the maximum score of 100, so it is impossible
Answer: (a) ≈ 44.4 (reliable interpolation); (b) ≈ 116 (unreliable extrapolation, exceeds 100)
Tip: extrapolation often produces values outside any meaningful range.
WE 5 · HL: Verify both regression lines intersect at the mean point
For the data x: 2, 4, 6, 8, 10, 12 and y: 10, 16, 18, 26, 28, 36, find both regression lines and verify that they intersect at (x̄, ȳ).
Step 1: Mean point
x̄ = (2+4+6+8+10+12)/6 = 42/6 = 7
ȳ = (10+16+18+26+28+36)/6 = 134/6 ≈ 22.33
Step 2: Run LinReg on GDC for y on x
y = 2.4857x + 4.9333
Step 3: Run LinReg on swapped lists for x on y
x = 0.3925y − 1.7654
Step 4: Check both lines at the mean point
y on x at x = 7: y = 2.4857(7) + 4.9333 = 22.33 ✓
x on y at y = 22.33: x = 0.3925(22.33) − 1.7654 = 7.00 ✓
Both lines pass through (7, 22.33) ✓
Tip: this is a useful sanity check on your calculations.
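A complementary check is to solve the two line equations simultaneously and confirm that the intersection is the mean point. The sketch below does this for the WE 5 data; `fit_line` is a hypothetical helper written for this illustration, standing in for LinReg(ax+b).

```python
def fit_line(inputs, outputs):
    """Least-squares line predicting outputs from inputs: returns (slope, intercept)."""
    n = len(inputs)
    mean_in = sum(inputs) / n
    mean_out = sum(outputs) / n
    s_cross = sum((u - mean_in) * (v - mean_out) for u, v in zip(inputs, outputs))
    s_in = sum((u - mean_in) ** 2 for u in inputs)
    slope = s_cross / s_in
    return slope, mean_out - slope * mean_in


x = [2, 4, 6, 8, 10, 12]
y = [10, 16, 18, 26, 28, 36]

a, b = fit_line(x, y)   # y on x: y = ax + b
c, d = fit_line(y, x)   # x on y: x = cy + d (lists swapped)

# Substitute y = ax + b into x = cy + d and solve for x:
# x = c(ax + b) + d  =>  x(1 - ac) = cb + d
x_cross = (c * b + d) / (1 - a * c)
y_cross = a * x_cross + b

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

print(round(x_cross, 4), round(y_cross, 4))   # intersection of the two lines
print(round(mean_x, 4), round(mean_y, 4))     # mean point
```

The division by 1 − ac is safe whenever the two lines are distinct, which is the case for any data that is not perfectly linear.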
WE 6: Ice-cream sales, a full regression analysis
An ice-cream stall records the daily temperature (x, in °C) and units sold (y) over 8 days:
| Temp (°C) | 15 | 18 | 22 | 25 | 28 | 30 | 32 | 35 |
|---|---|---|---|---|---|---|---|---|
| Sales | 120 | 145 | 180 | 210 | 245 | 270 | 290 | 320 |
(a) Find r and the equation of the regression line of y on x. (b) Predict sales when the temperature is 26 °C and comment on reliability. (c) Predict sales at 45 °C and comment.
(a) GDC: r and LinReg
r = 0.998 (very strong positive)
a = 10.20, b = −38.96
y = 10.2x − 39.0
(b) Predict sales at x = 26 (inside the range 15 to 35)
y = 10.2035(26) − 38.9641 = 226.33
→ ≈ 226 units
Reliable: interpolation + very strong correlation
(c) Predict at x = 45 (outside range)
y = 10.2035(45) − 38.9641 = 420.19
→ ≈ 420 units
Unreliable: extrapolation; demand may saturate or staffing may limit sales
Answer: (a) r = 0.998, y = 10.2x − 39.0; (b) ≈ 226 (reliable); (c) ≈ 420 (unreliable)
Tip: always justify reliability with both “inside/outside range” and the strength of correlation.
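For readers who want to see where the GDC values in part (a) come from, the arithmetic can be reproduced by hand. The sums Sxx, Syy and Sxy are shorthand introduced for this sketch, computed from the WE 6 table with x̄ = 205/8 = 25.625 and ȳ = 1780/8 = 222.5.

```latex
\begin{align*}
S_{xx} &= \sum (x - \bar{x})^2 = 337.875, \qquad
S_{yy} = \sum (y - \bar{y})^2 = 35300, \qquad
S_{xy} = \sum (x - \bar{x})(y - \bar{y}) = 3447.5 \\[4pt]
r &= \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}
   = \frac{3447.5}{\sqrt{337.875 \times 35300}} \approx 0.998 \\[4pt]
a &= \frac{S_{xy}}{S_{xx}} = \frac{3447.5}{337.875} \approx 10.2035 \\[4pt]
b &= \bar{y} - a\,\bar{x} \approx 222.5 - 10.2035 \times 25.625 \approx -38.96
\end{align*}
```

These agree with the GDC output quoted in the solution, which is a useful cross-check if a list has been mistyped.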
Top tips
- Store the regression coefficients in your GDC’s variable memory; this avoids rounding errors when predicting.
- For x on y, swap the lists in the GDC and run LinReg as usual; don’t try to algebraically rearrange the y on x line.
- Always state which line you’re using when answering; examiners check this.
- Check whether your prediction is interpolation or extrapolation and mention reliability.
- Both lines pass through (x̄, ȳ), which is a good way to verify your work.
Common mistakes
- Rearranging the y on x line to predict x: this gives the wrong answer; you must use the x on y line.
- Trusting an extrapolated prediction without comment: examiners want you to flag the unreliability.
- Using rounded coefficients when computing predictions: this leads to rounding errors that compound.
- Forgetting to verify that the correlation is strong before fitting: a regression line fitted to weakly correlated data gives unreliable predictions.
- Confusing the gradients of the two lines: both have the same sign as r, but their numerical values differ.
That closes the Correlation & Regression sub-section. Three notes, one tight workflow: scatter plot → check r → fit the right regression line → predict carefully. The next sub-section moves to Probability: sample spaces, set notation, Venn diagrams, and the formal language of “and / or / given that”.