Looking at the same scatter, you could draw many different straight lines through it. Which one is “best”? The least-squares regression line minimises the total of the squared vertical gaps from each data point to the line.
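For reference, the quantity being minimised and the standard closed-form solution can be written as below. The symbols S_xx and S_xy and the bar notation for means are notational choices made here, not taken from these notes; in the exam the GDC does this calculation for you.

```latex
S(a, b) = \sum_{i=1}^{n} \bigl( y_i - (a x_i + b) \bigr)^2
\qquad
a = \frac{S_{xy}}{S_{xx}}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - a\,\bar{x}
```

The formula for b shows that the least-squares line always passes through the mean point (x̄, ȳ).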
The maths gives you two numbers: the gradient a and the y-intercept b. Their meaning in context is what earns marks.
Interpolation: predicting y for an x WITHIN the observed range. Usually reliable.
Extrapolation: predicting y for an x OUTSIDE the range. This assumes the same linear relationship holds beyond what you’ve seen, which often isn’t true (physical limits, saturation, regime changes). Always flag extrapolation as a caveat.
🧠 Recipe: any linear regression question
- Check linearity first: compute (or be given) r. If |r| is not strong (and especially if it fails the critical value), don’t fit a line.
- Enter data on the GDC and run Linear Regression (y = ax + b). Read a and b; store in memory to avoid rounding.
- Write the equation with values plugged in. Round to 3 sf unless told otherwise.
- Interpret a and b in context: “per unit of x, y changes by a”; “b is y at x = 0 (check if meaningful)”.
- To predict: substitute the x-value. Check if x is in or out of range; flag extrapolation if outside.
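The recipe above can be sketched in code as a rough check on GDC output. This is a minimal sketch, not an exam method: the function names `linreg`, `sf3` and `recipe` are made up for illustration, and the `min_abs_r` cut-off of 0.7 is an assumption standing in for "strong enough" (in an exam you would use the critical value you are given).

```python
from math import sqrt


def linreg(xs, ys):
    """Return (a, b, r) for the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    r = sxy / sqrt(sxx * syy)
    return a, b, r


def sf3(value):
    """Round a number to 3 significant figures."""
    return float(f"{value:.3g}")


def recipe(xs, ys, x_new, min_abs_r=0.7):
    """Check r, fit the line, predict at x_new and label the prediction.

    min_abs_r is a hypothetical threshold (an assumption of this sketch).
    """
    a, b, r = linreg(xs, ys)
    if abs(r) < min_abs_r:
        return None  # step 1 failed: do not fit a line
    y_new = a * x_new + b  # uses unrounded a and b, like GDC memory
    inside = min(xs) <= x_new <= max(xs)
    kind = "interpolation" if inside else "extrapolation"
    return sf3(a), sf3(b), sf3(y_new), kind
```

Rounding happens only in the returned values, so the prediction itself is computed from full-precision a and b.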
Worked examples
WE 1. Find the regression equation and interpret it
A gardener measures fertiliser used (x, g) and tomato yield (y, kg) for 7 plants:
x: 40, 80, 120, 160, 200, 240, 280 | y: 2.8, 4.2, 4.9, 6.1, 7.2, 7.9, 9.0
(a) Find the regression line y = ax + b. (b) Interpret a and b in context.
(a) GDC: LinReg(ax+b)
a = 0.0253 (3 sf)
b = 1.97 (3 sf)
y = 0.0253x + 1.97
(b) interpret in context
a: per extra GRAM of fertiliser,
yield increases by 0.0253 kg (~25 g)
b: with no fertiliser, model predicts 1.97 kg
— this is meaningful (natural yield)
(a) y = 0.0253x + 1.97 · (b) +25 g per gram fertiliser; 1.97 kg natural yield
Always state interpretations in the ORIGINAL UNITS. “0.0253 kg per gram” is right; “0.0253 per unit” loses the mark.
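If you want to see where the GDC's a and b come from, the calculation for the fertiliser data can be reproduced by hand. This is a sketch using the standard formulas a = S_xy / S_xx and b = ȳ − a·x̄; the variable names are illustrative only.

```python
xs = [40, 80, 120, 160, 200, 240, 280]        # fertiliser (g)
ys = [2.8, 4.2, 4.9, 6.1, 7.2, 7.9, 9.0]      # yield (kg)

n = len(xs)
mean_x = sum(xs) / n                          # 160 g
mean_y = sum(ys) / n                          # about 6.01 kg

# Sums of squared and cross deviations from the means
sxx = sum((x - mean_x) ** 2 for x in xs)      # 44800
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))  # about 1132

a = sxy / sxx                # gradient: kg of yield per gram of fertiliser
b = mean_y - a * mean_x      # intercept: kg of yield at x = 0

print(f"y = {a:.3g}x + {b:.3g}")   # y = 0.0253x + 1.97
```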
WE 2. Interpolation: predicting within the data range
Using the regression line from WE 1, y = 0.0253x + 1.97, estimate the yield from a plant given 100 g of fertiliser. Comment on the reliability.
substitute x = 100
y = 0.0253(100) + 1.97
y = 2.53 + 1.97
y = 4.50 kg
check reliability
data range: 40 g to 280 g
x = 100 is INSIDE this range ⇒ INTERPOLATION
|r| was very close to 1 (r ≈ 0.997)
y ≈ 4.50 kg · interpolation, highly reliable
A good answer always mentions BOTH: (i) interpolation (x is inside the data), and (ii) the strong correlation supporting the prediction.
WE 3. Extrapolation: flag the warning
Using the same line y = 0.0253x + 1.97, predict the yield for a plant given 500 g of fertiliser. Comment on the validity of the prediction.
substitute x = 500
y = 0.0253(500) + 1.97
y = 12.65 + 1.97 = 14.62
model predicts ~ 14.6 kg
but check the range
data range: 40 to 280 g
x = 500 is FAR outside ⇒ EXTRAPOLATION
assumes linear pattern continues
in reality, too much fertiliser may HARM plants
y ≈ 14.6 kg, but UNRELIABLE — extrapolation
Always give BOTH the numerical prediction AND the warning. The maths answer is 14.6 kg; the marks are for spotting that you shouldn’t trust it.
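The range check in WE 2 and WE 3 is mechanical enough to write down as a small function. The sketch below uses the rounded line y = 0.0253x + 1.97 and the observed range 40 g to 280 g; the name `predict_yield` is hypothetical.

```python
A, B = 0.0253, 1.97          # rounded coefficients quoted in WE 1
X_MIN, X_MAX = 40, 280       # observed fertiliser range (g)


def predict_yield(x):
    """Return (predicted yield in kg, 'interpolation' or 'extrapolation')."""
    y = A * x + B
    kind = "interpolation" if X_MIN <= x <= X_MAX else "extrapolation"
    return round(y, 2), kind


print(predict_yield(100))    # inside the data range, as in WE 2
print(predict_yield(500))    # far outside the data range, as in WE 3
```

The function returns a number either way. The label is what tells you whether to trust it, which is exactly the comment the examiner is looking for.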
WE 4. Negative regression: car depreciation
A used-car dealer records age (x, years) and resale value (y, $1000s) for 8 cars of the same model:
x: 1, 2, 3, 4, 5, 6, 7, 8 | y: 18.5, 15.8, 13.2, 11.6, 9.8, 8.3, 6.9, 5.4
(a) Find the regression line. (b) Interpret the gradient. (c) Estimate the value of a 5.5-year-old car.
(a) GDC: LinReg(ax+b)
a = −1.82, b = 19.4 (3 sf)
y = −1.82x + 19.4
(b) gradient interpretation
a = −1.82 means:
value DROPS by $1820 per year of age
(c) substitute x = 5.5 (inside 1 to 8)
y = −1.82(5.5) + 19.4
y = −10.01 + 19.4 = 9.39
~ $9 390 (interpolation, reliable)
(a) y = −1.82x + 19.4 · (b) loses $1820/year · (c) ~$9 390
b = 19.4 here represents the value of a “0-year-old” car (i.e. brand new): $19 400. That’s meaningful in this context, so worth quoting.
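The working in part (c) substitutes the 3 sf coefficients. As a rough check on how much that matters, the sketch below recomputes a and b from the car data at full precision and compares the two predictions at x = 5.5. Variable names are illustrative.

```python
ages = [1, 2, 3, 4, 5, 6, 7, 8]                              # years
values = [18.5, 15.8, 13.2, 11.6, 9.8, 8.3, 6.9, 5.4]        # $1000s

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(values) / n
sxx = sum((x - mean_x) ** 2 for x in ages)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, values))

a = sxy / sxx                 # about -1.818
b = mean_y - a * mean_x       # about 19.37

full = a * 5.5 + b            # prediction from unrounded coefficients
early = -1.82 * 5.5 + 19.4    # prediction from the 3 sf coefficients

print(f"unrounded: {full:.3g}, rounded early: {early:.3g}")
# unrounded: 9.37, rounded early: 9.39
```

The gap is about $20 on a value of roughly $9 400. It is small here, but it is the reason for storing a and b in GDC memory and rounding only the final answer.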
WE 5. Which prediction is more reliable?
Two researchers fit regression lines. Researcher A has r = 0.98 over a data range 10 < x < 50 and wants to predict at x = 30. Researcher B has r = 0.55 over the same range and also wants x = 30. Compare the reliability of their predictions.
both predictions are INSIDE the data range
both: interpolation, not extrapolation ✓
compare the correlation strengths
A: |r| = 0.98 (very strong)
B: |r| = 0.55 (moderate)
interpret
A’s points hug the line closely ⇒ tight prediction
B’s points scatter widely around the line
B’s prediction could be far from any actual y
A’s prediction is much more reliable (stronger linear fit)
Interpolation alone isn’t enough: you also need a strong linear relationship to trust a prediction. Weak r means a weak prediction, even inside the data range.
WE 6. Full problem: coffee shop sales
A café records daily maximum temperature (x, °C) and iced-coffee sales (y) for 7 summer days:
x: 8, 12, 15, 18, 22, 26, 30 | y: 95, 110, 128, 145, 162, 178, 195
(a) Find r and the regression line. (b) Interpret a. (c) Predict sales on a 20°C day and on a 40°C heatwave day. Comment on both predictions.
(a) GDC
r = 0.998 (very strong, positive, linear)
a = 4.64, b = 57.9 (3 sf)
y = 4.64x + 57.9
(b) interpret a
per 1°C rise in temperature,
~ 4.6 more iced coffees sold per day
(c) predict at x = 20 (inside 8 to 30)
y = 4.64(20) + 57.9 = 92.8 + 57.9 = 150.7
~ 151 sales, INTERPOLATION, reliable
predict at x = 40 (outside)
y = 4.64(40) + 57.9 = 185.6 + 57.9 = 243.5
~ 244 sales, but EXTRAPOLATION
at 40°C demand could saturate (or supply runs out)
y = 4.64x + 57.9 · 20°C: ~151 (reliable) · 40°C: ~244 (unreliable)
Classic exam structure: equation + interpret a + two predictions (one inside, one outside) + reliability comments. Every step is a separate mark.
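The whole café question can be run end to end as a check. This sketch computes r, a and b from the data with the standard formulas, then makes both predictions from the unrounded coefficients; variable names are illustrative.

```python
from math import sqrt

temps = [8, 12, 15, 18, 22, 26, 30]             # daily maximum (°C)
sales = [95, 110, 128, 145, 162, 178, 195]      # iced coffees sold

n = len(temps)
mean_x, mean_y = sum(temps) / n, sum(sales) / n
sxx = sum((x - mean_x) ** 2 for x in temps)
syy = sum((y - mean_y) ** 2 for y in sales)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temps, sales))

r = sxy / sqrt(sxx * syy)     # (a) correlation coefficient
a = sxy / sxx                 # (a) gradient: extra sales per 1 °C
b = mean_y - a * mean_x       # (a) intercept: modelled sales at 0 °C

print(f"r = {r:.3g}, y = {a:.3g}x + {b:.3g}")

# (c) predict with the unrounded a and b, then label each prediction
for x in (20, 40):
    inside = min(temps) <= x <= max(temps)
    label = "interpolation" if inside else "extrapolation"
    print(f"x = {x}: y = {a * x + b:.1f} ({label})")
```

With unrounded coefficients the 40°C prediction comes out near 243.4 rather than the 243.5 obtained from the 3 sf values, so the final whole-number answer can differ by one sale depending on when you round.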
💡 Top tips
- Store a and b in GDC memory after computing them. Use the unrounded values for any prediction; round only the final answer.
- Quote 3 sf for a and b unless told otherwise. Don’t write “y = 0.025268x + 1.97142857”.
- Always interpret in CONTEXT: “5 marks per hour of study” beats “a = 5”. Examiners give a separate mark for the contextual sentence.
- Check the range before predicting: write down (data min, data max) and decide interpolation vs extrapolation explicitly.
- Use “y on x” lines to predict y. Using them backwards (to predict x from y) is less reliable — that needs the “x on y” line.
⚠ Common mistakes
- Fitting a line when r is weak: if |r| is small, the line is meaningless. Check linearity first.
- Forgetting to flag extrapolation: a numerical prediction at an out-of-range x without warning loses a mark every time.
- Interpreting b when x = 0 is meaningless: e.g. “sales at 0°C” might be far outside the data and not physically reasonable. Note this when relevant.
- Confusing a with r: gradient and correlation share a sign, but their magnitudes are unrelated. r measures fit; a measures slope.
- Rounding too early: using a = 0.025 (instead of 0.0253) for predictions can shift the final answer noticeably. Keep full precision in the GDC.
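One way to see that the size of a says nothing about the size of r is to change the units of x. The sketch below refits the fertiliser data from WE 1 with x in kilograms instead of grams; the helper name `slope_and_r` is made up for illustration.

```python
from math import sqrt


def slope_and_r(xs, ys):
    """Return (gradient a, correlation r) for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / sqrt(sxx * syy)


grams = [40, 80, 120, 160, 200, 240, 280]
yield_kg = [2.8, 4.2, 4.9, 6.1, 7.2, 7.9, 9.0]
kilograms = [g / 1000 for g in grams]      # same data, x in kg instead of g

a_g, r_g = slope_and_r(grams, yield_kg)
a_kg, r_kg = slope_and_r(kilograms, yield_kg)

print(f"x in g:  a = {a_g:.3g}, r = {r_g:.3g}")
print(f"x in kg: a = {a_kg:.3g}, r = {r_kg:.3g}")
```

The gradient becomes 1000 times larger while r does not change. The gradient depends on the units; r only measures how tightly the points follow a line.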
That’s the Correlation & Regression chapter complete. You’ve moved from describing relationships in words (scatter diagrams) → quantifying them with r and r_s → choosing between them → fitting a line and predicting with it. Every IB AI SL exam question in this chapter is a combination of these steps. The next chapter (Geometry & Trigonometry) is a different beast, but the workflow (sketch first, compute, interpret in context) carries over.