IB Maths AI SL · Correlation & Regression · Paper 1 & 2 · Regression line

Linear Regression

If the data shows strong linear correlation, the least-squares regression line y = ax + b is the “best” straight line through it — the one that minimises the total of the squared vertical gaps between the line and the data points. The GDC gives you a and b in seconds. The real skill: interpret what they mean in context, and know when predictions are reliable.

📘 What you need to know

The line is y = ax + b: a is the gradient, b is the y-intercept. Same as straight-line work from algebra.
“Least squares”: the line is chosen so the sum of the squared vertical distances from points to the line is as small as possible.
It passes through the mean point (x̄, ȳ). Always. Useful sanity check.
Interpret a: “for each one-unit increase in x, y changes by a units”. Sign matches the correlation sign.
Interpret b: the predicted value of y when x = 0. Sometimes meaningful, sometimes nonsense in context — always check.
Interpolation (predicting inside the data range) is usually reliable when the correlation is strong. Extrapolation (predicting outside it) is risky: always flag it.

What the line is doing — least squares

Looking at the same scatter, you could draw many different straight lines through it. Which one is “best”? The least-squares regression line minimises the total of the squared vertical gaps from each data point to the line.

Each vertical gap between a data point and the line is called a residual. The least-squares regression line is the unique straight line that minimises the sum of the squared residuals. The gradient a shows how much y changes for each one-unit increase in x; the intercept b is where the line meets the y-axis.
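
To see what "minimises" means in numbers, here is a minimal Python sketch (Python is not part of the course; your GDC does all of this for you). It uses the fertiliser data from WE 1 further down this page and compares the least-squares line with two hypothetical hand-drawn lines. The coefficients of those two alternatives are made up purely for comparison.

```python
# Fertiliser (g) and tomato yield (kg) from WE 1 on this page.
xs = [40, 80, 120, 160, 200, 240, 280]
ys = [2.8, 4.2, 4.9, 6.1, 7.2, 7.9, 9.0]


def sum_squared_residuals(a, b):
    """Total of the squared vertical gaps between the points and y = ax + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Least-squares coefficients for this data, kept to more digits than 3 sf.
best = sum_squared_residuals(0.025268, 1.971429)

# Two hypothetical "by eye" lines, invented only for comparison.
steeper = sum_squared_residuals(0.030, 1.2)
flatter = sum_squared_residuals(0.020, 2.8)

print(f"least squares: {best:.3f}")
print(f"steeper guess: {steeper:.3f}")
print(f"flatter guess: {flatter:.3f}")
```

Both guessed lines look plausible on a scatter diagram, yet each gives a larger total than the least-squares line. Any other choice of a and b would too.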

Interpreting a and b

The maths gives you two numbers. Their meaning in context is what earns marks:

a: the gradient. For every one-unit increase in x, y changes by a units. If a > 0, y goes UP; if a < 0, DOWN. In a real-world question always state the change in the units of the variables: “each extra hour of practice raises the score by 5 marks”, not “a = 5”.

b — the y-intercept. The predicted y when x = 0. Sometimes meaningful (e.g. “natural yield with no fertiliser”), sometimes nonsense (e.g. “iced coffee sales at 0°C” — almost certainly outside the data range and beyond reality). Always check before reporting.

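The regression line is $y = ax + b$ with $a = \dfrac{S_{xy}}{S_x^2}$ and $b = \bar{y} - a\bar{x}$, where $S_{xy}$ is the covariance of x and y and $S_x^2$ is the variance of x (your GDC computes both; this is for understanding).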
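
As a rough illustration of those two formulas, the sketch below computes a and b by hand for the car data from WE 4 further down this page, then checks that the line passes through the mean point. The function name `least_squares_line` is just a label chosen for this example.

```python
from statistics import mean


def least_squares_line(xs, ys):
    """Return (a, b) for y = ax + b using a = S_xy / S_x^2 and b = y_bar - a * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n  # covariance
    s_x2 = sum((x - x_bar) ** 2 for x in xs) / n                       # variance of x
    a = s_xy / s_x2
    b = y_bar - a * x_bar
    return a, b


# Age (years) and resale value ($1000s) from WE 4 on this page.
ages = [1, 2, 3, 4, 5, 6, 7, 8]
values = [18.5, 15.8, 13.2, 11.6, 9.8, 8.3, 6.9, 5.4]

a, b = least_squares_line(ages, values)
print(f"a = {a:.3g}, b = {b:.3g}")

# Sanity check: substituting x_bar into the line should give back y_bar.
print(abs((a * mean(ages) + b) - mean(values)) < 1e-9)
```

The 3 sf values agree with the GDC output quoted in WE 4, and the mean-point check holds because b is defined as ȳ − ax̄.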

Predicting — interpolation vs extrapolation

The regression equation lets you predict y for any x by substitution. Reliability depends on TWO things:

1. How strong the correlation is. If |r| is close to 1, predictions are tight. If |r| is moderate (around 0.6), predictions are rough.

2. Whether x is inside or outside the data range:

Interpolation: predicting y for an x WITHIN the observed range — usually reliable. Extrapolation: predicting y for an x OUTSIDE the range — assumes the same linear relationship holds beyond what you’ve seen, which often isn’t true (physical limits, saturation, regime changes). Always flag extrapolation as a caveat.
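
The range check can be written as a small helper. This is only a sketch: `predict_with_flag` is a hypothetical name, and the line and data range are the ones from WE 1 on this page.

```python
def predict_with_flag(a, b, x, x_min, x_max):
    """Predict y = ax + b and say whether x is inside the observed data range."""
    y = a * x + b
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y, kind


# WE 1 line y = 0.0253x + 1.97, fitted on fertiliser amounts from 40 g to 280 g.
inside = predict_with_flag(0.0253, 1.97, 100, 40, 280)
outside = predict_with_flag(0.0253, 1.97, 500, 40, 280)

print(f"{inside[0]:.2f} kg, {inside[1]}")
print(f"{outside[0]:.2f} kg, {outside[1]}")
```

The helper only checks the range. It says nothing about how strong the correlation is, so both conditions still need a comment in an exam answer.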

🧭 Recipe — any linear regression question

Check linearity first: compute (or be given) r. If |r| is not strong (and especially if it fails the critical value), don’t fit a line.
Enter data on the GDC and run Linear Regression (y = ax + b). Read a and b; store in memory to avoid rounding.
Write the equation with values plugged in. Round to 3 sf unless told otherwise.
Interpret a and b in context: “per unit of x, y changes by a”; “b is y at x = 0 (check if meaningful)”.
To predict: substitute the x-value. Check if x is in or out of range; flag extrapolation if outside.

Worked examples

WE 1

Find the regression equation and interpret it

A gardener measures fertiliser used (x, g) and tomato yield (y, kg) for 7 plants:

x (g): 40, 80, 120, 160, 200, 240, 280
y (kg): 2.8, 4.2, 4.9, 6.1, 7.2, 7.9, 9.0

(a) Find the regression line y = ax + b. (b) Interpret a and b in context.

(a) GDC: LinReg(ax+b)
a = 0.0253 (3 sf)
b = 1.97 (3 sf)
y = 0.0253x + 1.97

(b) Interpret in context
a: per extra GRAM of fertiliser, yield increases by 0.0253 kg (~25 g)
b: with no fertiliser, the model predicts 1.97 kg; this is meaningful (natural yield)

Answer: (a) y = 0.0253x + 1.97 · (b) +25 g per gram of fertiliser; 1.97 kg natural yield

Tip: always state interpretations in the ORIGINAL UNITS. “0.0253 kg per gram” is right; “0.0253 per unit” loses the mark.

WE 2

Interpolation — predicting within the data range

Using the regression line from WE 1, y = 0.0253x + 1.97, estimate the yield from a plant given 100 g of fertiliser. Comment on the reliability.

Substitute x = 100
y = 0.0253(100) + 1.97
y = 2.53 + 1.97
y = 4.50 kg

Check reliability
Data range: 40 g to 280 g
x = 100 is INSIDE this range ⇒ INTERPOLATION
|r| is very close to 1 (r ≈ 0.997)

Answer: y ≈ 4.50 kg · interpolation, highly reliable

Tip: a good answer always mentions BOTH: (i) interpolation (x is inside the data), and (ii) the strong correlation supporting the prediction.

WE 3

Extrapolation — flag the warning

Using the same line y = 0.0253x + 1.97, predict the yield for a plant given 500 g of fertiliser. Comment on the validity of the prediction.

Substitute x = 500
y = 0.0253(500) + 1.97
y = 12.65 + 1.97 = 14.62
The model predicts ~14.6 kg

But check the range
Data range: 40 g to 280 g
x = 500 is FAR outside ⇒ EXTRAPOLATION
This assumes the linear pattern continues; in reality, too much fertiliser may HARM plants

Answer: y ≈ 14.6 kg, but UNRELIABLE (extrapolation)

Tip: always give BOTH the numerical prediction AND the warning. The maths answer is 14.6 kg; the marks are for spotting that you shouldn’t trust it.

WE 4

Negative regression — car depreciation

A used-car dealer records age (x, years) and resale value (y, $1000s) for 8 cars of the same model:

x (years): 1, 2, 3, 4, 5, 6, 7, 8
y ($1000s): 18.5, 15.8, 13.2, 11.6, 9.8, 8.3, 6.9, 5.4

(a) Find the regression line. (b) Interpret the gradient. (c) Estimate the value of a 5.5-year-old car.

(a) GDC: LinReg(ax+b)
a = −1.82, b = 19.4 (3 sf)
y = −1.82x + 19.4

(b) Gradient interpretation
a = −1.82 means: value DROPS by $1820 per year of age

(c) Substitute x = 5.5 (inside 1 to 8)
y = −1.82(5.5) + 19.4
y = −10.01 + 19.4 = 9.39 with the rounded coefficients; the stored GDC values give 9.37
~$9 370 (interpolation, reliable)

Answer: (a) y = −1.82x + 19.4 · (b) loses $1820/year · (c) ~$9 370

Tip: b = 19.4 here represents the value of a “0-year-old” car (i.e. brand new): $19 400. That’s meaningful in this context, so worth quoting.

WE 5

Which prediction is more reliable?

Two researchers fit regression lines. Researcher A has r = 0.98 over a data range 10 < x < 50 and wants to predict at x = 30. Researcher B has r = 0.55 over the same range and also wants x = 30. Compare the reliability of their predictions.

Both predictions are INSIDE the data range
Both: interpolation, not extrapolation ✓

Compare the correlation strengths
A: |r| = 0.98 (very strong)
B: |r| = 0.55 (moderate)

Interpret
A’s points hug the line closely ⇒ tight prediction
B’s points scatter widely around the line
B’s prediction could be far from any actual y

Answer: A’s prediction is much more reliable (stronger linear fit)

Tip: interpolation alone isn’t enough. You also need a strong linear relationship to trust a prediction. Weak r = weak prediction, even inside the data range.

WE 6

Full problem — coffee shop sales

A café records daily maximum temperature (x, °C) and iced-coffee sales (y) for 7 summer days:

x (°C): 8, 12, 15, 18, 22, 26, 30
y (sales): 95, 110, 128, 145, 162, 178, 195

(a) Find r and the regression line. (b) Interpret a. (c) Predict sales on a 20°C day and on a 40°C heatwave day. Comment on both predictions.

(a) GDC
r = 0.998 (very strong, positive, linear)
a = 4.64, b = 57.9 (3 sf)
y = 4.64x + 57.9

(b) Interpret a
Per 1°C rise in temperature, about 4.6 more iced coffees are sold per day

(c) Predict at x = 20 (inside 8 to 30)
y = 4.64(20) + 57.9 = 92.8 + 57.9 = 150.7
~151 sales, INTERPOLATION, reliable

Predict at x = 40 (outside 8 to 30)
y = 4.64(40) + 57.9 = 185.6 + 57.9 = 243.5 with the rounded coefficients; the stored GDC values give 243.4
~243 sales, but EXTRAPOLATION
At 40°C demand could saturate (or supply could run out)

Answer: y = 4.64x + 57.9 · 20°C: ~151 (reliable) · 40°C: ~243 (unreliable)

Tip: classic exam structure: equation + interpret a + two predictions (one inside, one outside) + reliability comments. Every step is a separate mark.

💡 Top tips

Store a and b in GDC memory after computing them. Use the unrounded values for any prediction; round only the final answer.
Quote 3 sf for a and b unless told otherwise. Don’t write “y = 0.025268x + 1.97142857”.
Always interpret in CONTEXT: “5 marks per hour of study” beats “a = 5”. Examiners give a separate mark for the contextual sentence.
Check the range before predicting: write down (data min, data max) and decide interpolation vs extrapolation explicitly.
Use “y on x” lines to predict y. Using them backwards (to predict x from y) is less reliable — that needs the “x on y” line.

⚠ Common mistakes

Fitting a line when r is weak: if |r| is small, the line is meaningless. Check linearity first.
Forgetting to flag extrapolation: a numerical prediction at an out-of-range x without warning loses a mark every time.
Interpreting b when x = 0 is meaningless: e.g. “sales at 0°C” might be far outside the data and not physically reasonable. Note this when relevant.
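Confusing a with r: gradient and correlation share a sign, but a large |a| does not mean a strong correlation. The size of a depends on the units and spread of x and y (a = r × s_y ÷ s_x), while r measures how closely the points fit a line.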
Rounding too early: using a = 0.025 (instead of 0.0253) for predictions can shift the final answer noticeably. Keep full precision in the GDC.

That’s the Correlation & Regression chapter complete. You’ve moved from describing relationships in words (scatter diagrams) → quantifying them with r and r_s → choosing between them → fitting a line and predicting with it. Every IB AI SL exam question in this chapter is a combination of these steps. The next chapter (Geometry & Trigonometry) is a different beast, but the workflow — sketch first, compute, interpret in context — carries over.


IB Demystified is a trusted online learning platform led by certified IB examiners and educators.
