
Linear Regression

If the data shows strong linear correlation, the least-squares regression line y = ax + b is the “best” straight line through it — the one that minimises the total of the squared vertical gaps between the line and the data points. The GDC gives you a and b in seconds. The real skill: interpret what they mean in context, and know when predictions are reliable.

📘 What you need to know

What the line is doing — least squares

Looking at the same scatter, you could draw many different straight lines through it. Which one is “best”? The least-squares regression line minimises the total of the squared vertical gaps from each data point to the line.

[Figure: least-squares regression line y = 2x + 1 on x and y axes, marking the y-intercept b, the gradient a = rise/run = 4/2 = 2, and the vertical gaps (residuals) whose squares are summed and minimised.]
The least-squares regression line is the unique straight line that minimises the sum of the squared vertical distances (residuals) from each data point to the line. The gradient a shows how much y rises for each unit increase in x; the intercept b is where the line meets the y-axis.
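Written as a formula, the quantity being minimised can be sketched as follows. The symbols n (number of data points) and (x_i, y_i) (the coordinates of the i-th point) are notation assumed here for illustration; they are not used elsewhere on this page.

```latex
\[
  S(a, b) = \sum_{i=1}^{n} \bigl( y_i - (a x_i + b) \bigr)^2
\]
```

Each bracket is one vertical gap (residual). The regression line is the choice of a and b that makes S(a, b) as small as possible.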

Interpreting a and b

The maths gives you two numbers. Their meaning in context is what earns marks:

a: the gradient. For every one-unit increase in x, y changes by a units. If a > 0, y goes UP; if a < 0, y goes DOWN. In a real-world question always state the change in the units of the variables: “each extra hour of practice raises the score by 5 marks”, not “a = 5”.

b: the y-intercept. The predicted y when x = 0. Sometimes meaningful (e.g. “natural yield with no fertiliser”), sometimes nonsense (e.g. “iced coffee sales at 0°C”, which is almost certainly outside the data range and not physically sensible). Always check before reporting.

The regression line: y = ax + b, with a = Sxy / Sx² and b = ȳ − a·x̄, where Sxy is the covariance of x and y and Sx² is the variance of x. (Your GDC computes both; this is for understanding.)
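As a rough check on the formula, here is a minimal Python sketch that computes a and b by hand. The function name `fit_line` and the five data points are made up for illustration (the points lie exactly on y = 2x + 1, the line in the diagram above). The sums of deviations stand in for Sxy and Sx²: dividing both by n would turn them into the covariance and the variance, but that factor cancels in the ratio.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations (n times the covariance of x and y)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Sum of squared x-deviations (n times the variance of x)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    a = s_xy / s_xx          # gradient
    b = y_bar - a * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b


# Hypothetical points lying exactly on y = 2x + 1
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

a, b = fit_line(xs, ys)
print(a, b)  # 2.0 1.0
```

Because the points sit exactly on a line, the fit recovers that line. With real, scattered data the same two formulas give the values your GDC reports.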

Predicting — interpolation vs extrapolation

The regression equation lets you predict y for any x by substitution. Reliability depends on TWO things:

1. How strong the correlation is. If |r| is close to 1, predictions are tight. If |r| is moderate (around 0.6), predictions are rough.

2. Whether x is inside or outside the data range:

Interpolation: predicting y for an x WITHIN the observed range. Usually reliable.

Extrapolation: predicting y for an x OUTSIDE the range. This assumes the same linear relationship holds beyond what you’ve seen, which often isn’t true (physical limits, saturation, regime changes). Always flag extrapolation as a caveat.
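The range check is mechanical enough to write down as a short sketch. The helper name `prediction_type` is hypothetical; the x-values are the fertiliser amounts from WE 1 below.

```python
def prediction_type(x_new, xs):
    """Label a prediction by whether x_new lies inside the observed x-range."""
    lo, hi = min(xs), max(xs)
    if lo <= x_new <= hi:
        return "interpolation"
    return "extrapolation"


# x-values (grams of fertiliser) from the data in WE 1
fertiliser_g = [40, 80, 120, 160, 200, 240, 280]

for x_new in (100, 500):
    print(x_new, prediction_type(x_new, fertiliser_g))
# 100 interpolation
# 500 extrapolation
```

This only covers the second condition. A prediction inside the range can still be poor if |r| is weak.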

🧭 Recipe — any linear regression question

  1. Check linearity first: compute (or be given) r. If |r| is not strong (and especially if it fails the critical value), don’t fit a line.
  2. Enter data on the GDC and run Linear Regression (y = ax + b). Read a and b; store in memory to avoid rounding.
  3. Write the equation with values plugged in. Round to 3 sf unless told otherwise.
  4. Interpret a and b in context: “per unit of x, y changes by a”; “b is y at x = 0 (check if meaningful)”.
  5. To predict: substitute the x-value. Check if x is in or out of range; flag extrapolation if outside.

Worked examples

WE 1

Find the regression equation and interpret it

A gardener measures fertiliser used (x, g) and tomato yield (y, kg) for 7 plants:

x: 40, 80, 120, 160, 200, 240, 280  |  y: 2.8, 4.2, 4.9, 6.1, 7.2, 7.9, 9.0

(a) Find the regression line y = ax + b. (b) Interpret a and b in context.

(a) GDC: LinReg(ax+b)
a = 0.0253 (3 sf)
b = 1.97 (3 sf)
y = 0.0253x + 1.97

(b) Interpret in context
a: per extra GRAM of fertiliser, yield increases by 0.0253 kg (~25 g).
b: with no fertiliser, the model predicts 1.97 kg. This is meaningful (natural yield).

Answer: (a) y = 0.0253x + 1.97 · (b) +25 g per gram of fertiliser; 1.97 kg natural yield

Tip: always state interpretations in the ORIGINAL UNITS. “0.0253 kg per gram” is right; “0.0253 per unit” loses the mark.

WE 2

Interpolation — predicting within the data range

Using the regression line from WE 1, y = 0.0253x + 1.97, estimate the yield from a plant given 100 g of fertiliser. Comment on the reliability.

Substitute x = 100
y = 0.0253(100) + 1.97
y = 2.53 + 1.97
y = 4.50 kg

Check reliability
Data range: 40 g to 280 g
x = 100 is INSIDE this range ⇒ INTERPOLATION
|r| is very close to 1 (r ≈ 0.997)

Answer: y ≈ 4.50 kg · interpolation, highly reliable

Tip: a good answer always mentions BOTH: (i) interpolation (x is inside the data), and (ii) the strong correlation supporting the prediction.

WE 3

Extrapolation — flag the warning

Using the same line y = 0.0253x + 1.97, predict the yield for a plant given 500 g of fertiliser. Comment on the validity of the prediction.

Substitute x = 500
y = 0.0253(500) + 1.97
y = 12.65 + 1.97 = 14.62
The model predicts ~14.6 kg.

But check the range
Data range: 40 g to 280 g
x = 500 is FAR outside ⇒ EXTRAPOLATION
This assumes the linear pattern continues; in reality, too much fertiliser may HARM plants.

Answer: y ≈ 14.6 kg, but UNRELIABLE (extrapolation)

Tip: always give BOTH the numerical prediction AND the warning. The maths answer is 14.6 kg; the marks are for spotting that you shouldn’t trust it.

WE 4

Negative regression — car depreciation

A used-car dealer records age (x, years) and resale value (y, $1000s) for 8 cars of the same model:

x: 1, 2, 3, 4, 5, 6, 7, 8  |  y: 18.5, 15.8, 13.2, 11.6, 9.8, 8.3, 6.9, 5.4

(a) Find the regression line. (b) Interpret the gradient. (c) Estimate the value of a 5.5-year-old car.

(a) GDC: LinReg(ax+b)
a = −1.82, b = 19.4 (3 sf)
y = −1.82x + 19.4

(b) Gradient interpretation
a = −1.82 means the value DROPS by about $1820 per year of age.

(c) Substitute x = 5.5 (inside 1 to 8)
y = −1.82(5.5) + 19.4
y = −10.01 + 19.4 = 9.39
~$9 390 (interpolation, reliable). With the unrounded a and b stored in the GDC, the result is 9.37, i.e. ~$9 370.

Answer: (a) y = −1.82x + 19.4 · (b) loses $1820/year · (c) ~$9 390

Tip: b = 19.4 here represents the value of a “0-year-old” car (i.e. brand new): $19 400. That’s meaningful in this context, so worth quoting.

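The effect of rounding a and b before substituting can be seen with this data set. The sketch below is illustrative only: `fit_line` is a hypothetical helper that applies the least-squares formulas directly, and the "rounded" prediction uses the 3 sf values quoted in part (a).

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    a = s_xy / s_xx
    return a, y_bar - a * x_bar


age_years = [1, 2, 3, 4, 5, 6, 7, 8]
value_k = [18.5, 15.8, 13.2, 11.6, 9.8, 8.3, 6.9, 5.4]  # resale value in $1000s

a, b = fit_line(age_years, value_k)

full = a * 5.5 + b               # unrounded coefficients, as stored in a GDC
rounded = -1.82 * 5.5 + 19.4     # coefficients rounded to 3 sf first

print(f"unrounded: {full:.2f}   rounded first: {rounded:.2f}")
# unrounded: 9.37   rounded first: 9.39
```

The gap is about $20 on a $9 400 car, small here but enough to change the third significant figure.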
WE 5

Which prediction is more reliable?

Two researchers fit regression lines. Researcher A has r = 0.98 over a data range 10 < x < 50 and wants to predict at x = 30. Researcher B has r = 0.55 over the same range and also wants x = 30. Compare the reliability of their predictions.

Both predictions are INSIDE the data range
Both: interpolation, not extrapolation ✓

Compare the correlation strengths
A: |r| = 0.98 (very strong)
B: |r| = 0.55 (moderate)

Interpret
A’s points hug the line closely ⇒ tight prediction
B’s points scatter widely around the line
B’s prediction could be far from any actual y

Answer: A’s prediction is much more reliable (stronger linear fit)

Tip: interpolation alone isn’t enough. You also need a strong linear relationship to trust a prediction. Weak r = weak prediction even inside the data range.

WE 6

Full problem — coffee shop sales

A café records daily maximum temperature (x, °C) and iced-coffee sales (y) for 7 summer days:

x: 8, 12, 15, 18, 22, 26, 30  |  y: 95, 110, 128, 145, 162, 178, 195

(a) Find r and the regression line. (b) Interpret a. (c) Predict sales on a 20°C day and on a 40°C heatwave day. Comment on both predictions.

(a) GDC
r = 0.998 (very strong, positive, linear)
a = 4.64, b = 57.9 (3 sf)
y = 4.64x + 57.9

(b) Interpret a
Per 1°C rise in temperature, ~4.6 more iced coffees are sold per day.

(c) Predict at x = 20 (inside 8 to 30)
y = 4.64(20) + 57.9 = 92.8 + 57.9 = 150.7
~151 sales, INTERPOLATION, reliable

Predict at x = 40 (outside)
y = 4.64(40) + 57.9 = 185.6 + 57.9 = 243.5
~244 sales (~243 if the unrounded a and b are used), but EXTRAPOLATION
At 40°C demand could saturate (or supply runs out).

Answer: y = 4.64x + 57.9 · 20°C: ~151 (reliable) · 40°C: ~244 (unreliable)

Tip: classic exam structure: equation + interpret a + two predictions (one inside, one outside) + reliability comments. Every step is a separate mark.
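The whole question can be reproduced in a few lines as a cross-check. This is a sketch under assumptions: `regression_summary` and `predict` are hypothetical helper names, and r is computed with the standard product-moment formula rather than anything specific to a particular GDC.

```python
from math import sqrt


def regression_summary(xs, ys):
    """Return (a, b, r) for the least-squares line y = a*x + b."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    a = s_xy / s_xx
    b = y_bar - a * x_bar
    r = s_xy / sqrt(s_xx * s_yy)   # Pearson's correlation coefficient
    return a, b, r


temp_c = [8, 12, 15, 18, 22, 26, 30]
sales = [95, 110, 128, 145, 162, 178, 195]
a, b, r = regression_summary(temp_c, sales)


def predict(x_new):
    """Return the predicted sales and whether x_new is inside the data range."""
    inside = min(temp_c) <= x_new <= max(temp_c)
    label = "interpolation" if inside else "extrapolation"
    return a * x_new + b, label


print(f"r = {r:.3f}, y = {a:.3g}x + {b:.3g}")
for t in (20, 40):
    y_hat, label = predict(t)
    print(f"{t} degrees C: {y_hat:.1f} sales ({label})")
```

Because the code keeps a and b unrounded, the 40°C figure comes out slightly below the 243.5 obtained from the 3 sf coefficients.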

💡 Top tips

  • Store a and b in GDC memory after computing them. Use the unrounded values for any prediction; round only the final answer.
  • Quote 3 sf for a and b unless told otherwise. Don’t write “y = 0.025268x + 1.97142857”.
  • Always interpret in CONTEXT: “5 marks per hour of study” beats “a = 5”. Examiners give a separate mark for the contextual sentence.
  • Check the range before predicting: write down (data min, data max) and decide interpolation vs extrapolation explicitly.
  • Use “y on x” lines to predict y. Using them backwards (to predict x from y) is less reliable — that needs the “x on y” line.
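To see why the last tip matters, here is a small sketch comparing the two lines. The five data points are invented for illustration (they have r = 0.8), and `fit_line` is a hypothetical helper; swapping its arguments fits the “x on y” line, which minimises horizontal gaps instead of vertical ones.

```python
def fit_line(us, vs):
    """Least-squares line v = m*u + c, minimising squared gaps in the v-direction."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    m = s_uv / s_uu
    return m, v_bar - m * u_bar


# Hypothetical data with moderate scatter
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

a, b = fit_line(xs, ys)   # y on x:  y = a*x + b
c, d = fit_line(ys, xs)   # x on y:  x = c*y + d

y_target = 5
x_from_x_on_y = c * y_target + d         # the appropriate line for estimating x
x_from_inverting = (y_target - b) / a    # y-on-x line used backwards

print(f"x on y line: {x_from_x_on_y:.1f}")
print(f"y on x line rearranged: {x_from_inverting:.1f}")
# x on y line: 4.6
# y on x line rearranged: 5.5
```

The two estimates only coincide when the points lie exactly on a line (|r| = 1); the weaker the correlation, the further apart they drift.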

⚠ Common mistakes

  • Fitting a line when r is weak: if |r| is small, the line is meaningless. Check linearity first.
  • Forgetting to flag extrapolation: a numerical prediction at an out-of-range x without warning loses a mark every time.
  • Interpreting b when x = 0 is meaningless: e.g. “sales at 0°C” might be far outside the data and not physically reasonable. Note this when relevant.
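  • Confusing a with r: gradient and correlation share a sign, but their sizes measure different things. r measures how closely the points fit the line (always between −1 and 1); a measures how steep the line is, and its size depends on the units of x and y.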
  • Rounding too early: using a = 0.025 (instead of 0.0253) for predictions can shift the final answer noticeably. Keep full precision in the GDC.
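The difference between a and r shows up clearly when the units change. In this sketch (the helper name `slope_and_r` is hypothetical), the WE 1 yields are converted from kilograms to grams: the gradient is multiplied by 1000, while r is unaffected.

```python
from math import sqrt


def slope_and_r(xs, ys):
    """Return (gradient a, correlation r) for the y-on-x regression."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / s_xx, s_xy / sqrt(s_xx * s_yy)


fertiliser_g = [40, 80, 120, 160, 200, 240, 280]
yield_kg = [2.8, 4.2, 4.9, 6.1, 7.2, 7.9, 9.0]
yield_g = [1000 * y for y in yield_kg]   # same data, different units

a_kg, r_kg = slope_and_r(fertiliser_g, yield_kg)
a_g, r_g = slope_and_r(fertiliser_g, yield_g)

print(f"yield in kg: a = {a_kg:.4f}, r = {r_kg:.3f}")
print(f"yield in g:  a = {a_g:.2f}, r = {r_g:.3f}")
```

A tiny gradient such as 0.0253 therefore says nothing about how weak or strong the relationship is; only r does.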

That’s the Correlation & Regression chapter complete. You’ve moved from describing relationships in words (scatter diagrams) → quantifying them with r and r_s → choosing between them → fitting a line and predicting with it. Every IB AI SL exam question in this chapter is a combination of these steps. The next chapter (Geometry & Trigonometry) is a different beast, but the workflow (sketch first, compute, interpret in context) carries over.
