IB Maths AA SL Topic 4 — Correlation & Regression Paper 1 & 2 ~10 min read

Linear Regression

Once you know two variables are linearly correlated, the natural next question is: “what’s the equation of the best line through the data?” That’s linear regression — and once you have the equation, you can predict new values.

📘 What you need to know

The regression line of y on x has the form y = ax + b. Use it to predict y from a known x.
The regression line of x on y has the form x = cy + d. Use it to predict x from a known y.
Both regression lines pass through the mean point (x̄, ȳ).
Use your GDC to find a, b, c, d — never by hand.
Interpolation = predicting inside the data range. Usually reliable.
Extrapolation = predicting outside the data range. Less reliable.
Predictions are only trustworthy if r shows strong correlation.

What is linear regression?

If your data has strong linear correlation, you can model the relationship with a straight line. Drawing a line of best fit “by eye” is not very precise — different people draw different lines. Linear regression uses maths to find the best possible straight line.

🤔 What does “best possible” mean?

For each data point, there is a gap between the point and the line. For the y on x line the gap is the vertical distance; for the x on y line it is the horizontal distance. The least squares regression line is the line that minimises the sum of the squared gaps. Squaring means positive and negative gaps don’t cancel, and bigger gaps are penalised more.

You don’t need to compute this — your GDC handles it. But the name “least squares” comes from this idea.
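If you are curious what the calculator is doing, here is a minimal Python sketch of the idea for the y on x line. It assumes the standard least-squares formulas (gradient = S_xy / S_xx, intercept = ȳ − a·x̄), which this guide does not ask you to learn. The function names are made up for illustration, and the data is the maths/English table from WE 1.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the y on x line y = a*x + b (standard formulas)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    a = s_xy / s_xx
    b = y_bar - a * x_bar
    return a, b


def sum_squared_gaps(xs, ys, a, b):
    """Add up the squared vertical gaps between each point and y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


maths = [7, 18, 37, 52, 61, 68, 75, 82]    # x-values from WE 1
english = [5, 3, 9, 12, 17, 41, 49, 97]    # y-values from WE 1

a, b = least_squares_line(maths, english)
best = sum_squared_gaps(maths, english, a, b)

# Nudging the gradient or the intercept away from the fitted values
# makes the total squared gap bigger, which is what "least squares" means.
print(sum_squared_gaps(maths, english, a + 0.1, b) > best)
print(sum_squared_gaps(maths, english, a, b - 0.5) > best)
```

Both comparisons print True: any other straight line has a larger total squared gap than the fitted one.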

The two regression lines

There are two different regression lines you can find — and they’re for two different jobs. Picking the right one matters.

y on x line

y = ax + b

Use to predict y from a given x.

e.g. predict exam score from hours studied.

a = gradient (change in y per unit x); b = y-intercept.

x on y line

x = cy + d

Use to predict x from a given y.

e.g. predict hours studied from exam score.

c = gradient (change in x per unit y); d = x-intercept.

📍 Match the line to what you’re predicting

Want to predict y? Use the y on x line. Want to predict x? Use the x on y line. Mixing them up gives unreliable predictions and loses marks.

🧠 Memory trick: “Predict what’s on the LEFT”

“y on x” → equation starts with y = … → predict y. “x on y” → equation starts with x = … → predict x. The variable on the LEFT of the equation is the one you can predict with that line.

How to find the regression line on your GDC

Finding the y on x line

Enter the x-values in List 1 (L1).
Enter the y-values in List 2 (L2).
Run “LinReg(ax+b)” with XList = L1, YList = L2.
Read off a and b from the output. Your line is y = ax + b.
Optional bonus: the same screen also gives you r!

Finding the x on y line

It’s the same process — but you swap the lists:

Put the y-values in L1 (the “input” list).
Put the x-values in L2 (the “output” list).
Run “LinReg(ax+b)” with XList = L1, YList = L2.
The output a is now your c (gradient), and b becomes d. Your line is x = cy + d.

In other words: when finding the x on y line, just imagine you’ve renamed your y’s as the new “x’s” and your x’s as the new “y’s”. Plug them in like that, and the calculator will give you the right line.
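The list swap can be sketched in Python. Here `lin_reg` is a hypothetical stand-in for the GDC’s LinReg(ax+b), written with the standard least-squares formulas; it is not a real calculator or library function. The data is the table from WE 1.

```python
def lin_reg(x_list, y_list):
    """Stand-in for LinReg(ax+b): return (a, b) for the line Y = a*X + b."""
    n = len(x_list)
    x_bar = sum(x_list) / n
    y_bar = sum(y_list) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_list, y_list))
    s_xx = sum((x - x_bar) ** 2 for x in x_list)
    a = s_xy / s_xx
    return a, y_bar - a * x_bar


maths = [7, 18, 37, 52, 61, 68, 75, 82]    # x-values from WE 1
english = [5, 3, 9, 12, 17, 41, 49, 97]    # y-values from WE 1

# y on x: XList = x-values, YList = y-values
a, b = lin_reg(maths, english)

# x on y: swap the lists, so XList = y-values and YList = x-values
c, d = lin_reg(english, maths)

print("y on x:", a, b)
print("x on y:", c, d)
```

Notice that c is about 0.669, while rearranging the y on x line for x would give a gradient of 1/a, about 1.06. The x on y line is a genuinely different line, not the y on x line rearranged.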

📍 Store the constants in your GDC

After you calculate a and b, store the full unrounded values in your calculator’s memory. When making predictions, use the full values — not the rounded ones — to avoid rounding errors. Then round at the very end.
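To see how early rounding can change an answer, here is a small Python sketch using the x on y constants quoted in WE 1. The variable names are made up for illustration, and the "full" values are only as long as the digits shown in WE 1.

```python
# x on y line from WE 1: x = c*y + d (y = English score, x = maths score)
c_full, d_full = 0.668700, 30.52410    # longer values quoted in WE 1
c_3sf, d_3sf = 0.669, 30.5             # the same constants rounded to 3 s.f.

y = 63  # English score we are predicting from

x_full = c_full * y + d_full     # round only at the end
x_early = c_3sf * y + d_3sf      # rounded before substituting

print(round(x_full, 1))    # 72.7
print(round(x_early, 1))   # 72.6
```

The two answers differ in the last digit, and that kind of slip is exactly what storing the unrounded constants prevents.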

Using a regression line to predict

Once you have the equation, predicting is just substitution. But there are two important warnings.

Pick the right line first

Decision table

Given the value of x and want y?	→	Use y = ax + b
Given the value of y and want x?	→	Use x = cy + d

Interpolation vs Extrapolation

Whether your prediction is reliable depends on whether the value is inside or outside the original data range.

✓ Interpolation

Predicting inside the data range — within the smallest and largest values you collected.

Usually reliable

✗ Extrapolation

Predicting outside the data range — beyond the smallest or largest values you collected.

Much less reliable

🤔 Why is extrapolation risky?

Imagine you collected data on hours studied (1 to 8 hours) vs exam score. The line works great in that range. But what about 50 hours of study? The line predicts an outrageous score — but in reality, you’d have plateaued, gotten tired, or hit the maximum mark. The linear pattern only works in the range where you actually have data.
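The range check itself is simple enough to write as a tiny Python sketch. `prediction_type` is a hypothetical helper name, and the 1 to 8 hour range is the one from the example above.

```python
def prediction_type(x, x_min, x_max):
    """Say whether predicting at x is interpolation or extrapolation."""
    if x_min <= x <= x_max:
        return "interpolation"
    return "extrapolation"


# Data collected for 1 to 8 hours of study
print(prediction_type(5, 1, 8))    # interpolation
print(prediction_type(50, 1, 8))   # extrapolation
```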

📍 Three things to check before trusting a prediction

1. Is r strong? (|r| close to 1) — if not, predictions are weak.
2. Is the value within the data range? — extrapolation is risky.
3. Did you use the right line? — match the line to what you’re predicting.

What does the gradient mean?

For a regression line y = ax + b:

If a is positive, y increases by a for every unit increase in x → positive correlation.
If a is negative, y decreases by |a| for every unit increase in x → negative correlation.
The bigger |a|, the steeper the line.

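If you’re given a regression equation but no scatter diagram, the sign of the gradient instantly tells you the direction of the correlation (positive or negative). It says nothing about how strong the correlation is; that is what r is for. It’s a quick shortcut for some exam questions.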

Key fact about both lines

The y on x and x on y regression lines INTERSECT at the mean point (x̄, ȳ)
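A short sketch of why this holds, assuming the standard least-squares result that the intercepts satisfy b = ȳ − a·x̄ and d = x̄ − c·ȳ. These two formulas are not stated in this guide; your GDC applies them for you.

```latex
\begin{aligned}
y &= a\bar{x} + b = a\bar{x} + (\bar{y} - a\bar{x}) = \bar{y}
  && \text{($y$ on $x$ line at } x = \bar{x}\text{)} \\
x &= c\bar{y} + d = c\bar{y} + (\bar{x} - c\bar{y}) = \bar{x}
  && \text{($x$ on $y$ line at } y = \bar{y}\text{)}
\end{aligned}
```

You can check this with the WE 1 data, where x̄ = 50 and ȳ = 29.125: 0.943579 × 50 − 18.05398 ≈ 29.125, and 0.668700 × 29.125 + 30.52410 ≈ 50.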

Worked examples

WE 1

Find the regression line and use it to predict

The table shows the maths (x) and English (y) scores of 8 students:

Maths (x)	7	18	37	52	61	68	75	82
English (y)	5	3	9	12	17	41	49	97

(a) Write down the value of r. (b) Find the equation of the y on x regression line. (c) Find the equation of the x on y regression line. (d) Predict the maths score of a student who got 63 in English.

Use the GDC’s LinReg(ax+b), once for each line.

Part (a): PMCC
Enter Maths in L1 and English in L2, then run LinReg: r = 0.79433…
r = 0.794 (3 s.f.)

Part (b): y on x line
From the same LinReg output: a = 0.943579…, b = −18.05398…
y = 0.944x − 18.1

Part (c): x on y line
Swap the lists: English in L1, Maths in L2. Run LinReg again: a = 0.668700… (= c), b = 30.52410… (= d)
x = 0.669y + 30.5

Part (d): predict maths from English
Given y = 63 (English), want x (Maths) → use the x on y line. Use the full unrounded values:
x = 0.668700 × 63 + 30.52410 = 42.128 + 30.524 = 72.652…
Maths score ≈ 72.7

Use the FULL stored values, not the rounded ones, to avoid rounding errors.
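If you want to check the GDC output for this example, the following Python sketch recomputes r and the part (d) prediction from the table. It assumes the standard formulas r = S_xy / √(S_xx · S_yy) and c = S_xy / S_yy, which are not given in this guide, and the variable names are chosen just for illustration.

```python
from math import sqrt

maths = [7, 18, 37, 52, 61, 68, 75, 82]    # x
english = [5, 3, 9, 12, 17, 41, 49, 97]    # y

n = len(maths)
x_bar = sum(maths) / n
y_bar = sum(english) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(maths, english))
s_xx = sum((x - x_bar) ** 2 for x in maths)
s_yy = sum((y - y_bar) ** 2 for y in english)

# Part (a): correlation coefficient
r = s_xy / sqrt(s_xx * s_yy)

# Part (c): x on y line, x = c*y + d
c = s_xy / s_yy
d = x_bar - c * y_bar

# Part (d): predict maths from English = 63, using the unrounded c and d
x_pred = c * 63 + d

print(round(r, 3))       # 0.794
print(round(x_pred, 1))  # 72.7
```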

WE 2

Predict y from x using the right regression line

The table shows hours of revision (x) and test scores (y) for 6 students.

Hours (x)	2	4	5	7	9	10
Score (y)	45	52	60	68	78	85

(a) Find the y on x regression line. (b) Predict the score of a student who studies 6 hours.

Predicting y from x → use the y on x line.

Part (a)
Enter Hours in L1 and Score in L2, then run LinReg(ax+b): a = 4.98220…, b = 33.94306…
y = 4.98x + 33.9

Part (b)
x = 6 is inside the data range (2 to 10), so this is interpolation and should be reliable. Substitute, using the full stored values:
y = 4.98220 × 6 + 33.94306 = 29.893 + 33.943 = 63.836…
Predicted score ≈ 63.8

WE 3

Identify and explain extrapolation

Using the same data from WE 2 (hours of study from 2 to 10, scores from 45 to 85), a student wants to predict the score after 25 hours of study.

(a) Make the prediction. (b) Comment on its reliability.

25 is outside the original data range (2 to 10), so this is extrapolation.

Part (a)
Substitute x = 25 into the WE 2 line, using the full stored values:
y = 4.98220 × 25 + 33.94306 = 124.555 + 33.943 = 158.498…
Predicted score ≈ 158

Part (b)
158 is well above 100, which is impossible for a typical test out of 100! 25 is way outside the data range (2 to 10), and the linear pattern only fits within the range we measured.
Unreliable: this is extrapolation.

Always check the data range before trusting a prediction!
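Here is a minimal Python sketch that fits the y on x line to the hours and scores table from WE 2, then makes the two predictions (6 hours and 25 hours) and labels each one. `lin_reg` is a hypothetical stand-in for the GDC’s LinReg(ax+b), built from the standard least-squares formulas.

```python
def lin_reg(x_list, y_list):
    """Stand-in for LinReg(ax+b): return (a, b) for the line Y = a*X + b."""
    n = len(x_list)
    x_bar = sum(x_list) / n
    y_bar = sum(y_list) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_list, y_list))
    s_xx = sum((x - x_bar) ** 2 for x in x_list)
    a = s_xy / s_xx
    return a, y_bar - a * x_bar


hours = [2, 4, 5, 7, 9, 10]           # x-values from WE 2
scores = [45, 52, 60, 68, 78, 85]     # y-values from WE 2

a, b = lin_reg(hours, scores)

for x in (6, 25):
    y_pred = a * x + b                # unrounded a and b
    in_range = min(hours) <= x <= max(hours)
    kind = "interpolation" if in_range else "extrapolation"
    print(f"x = {x}: predicted score {y_pred:.3g} ({kind})")
```

On this table the fit comes out as a ≈ 4.98 and b ≈ 33.9, giving about 63.8 for 6 hours (interpolation) and about 158 for 25 hours (extrapolation, and above the maximum of a test out of 100).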

WE 4

Interpret the gradient of a regression line

A study finds a regression line of y on x as y = 1.8x + 12, where x is hours of training per week and y is fitness score (0–100).

(a) Interpret the gradient. (b) Interpret the y-intercept. (c) Is the y-intercept meaningful here?

Gradient = change in y per unit x. Y-intercept = y when x = 0.

Part (a): gradient
a = 1.8 means that each extra hour of training per week increases the predicted fitness score by 1.8 points, on average.
Each extra hour of training → +1.8 fitness points

Part (b): y-intercept
b = 12 means that when x = 0 (no training), y = 12.
Predicted fitness score with no training = 12

Part (c)
If 0 hours wasn’t in the original data, b is an extrapolation. Reading too much into b would be unreliable.
Probably not meaningful: likely extrapolation.

Always interpret intercepts carefully, because they’re often outside the data range!

WE 5

Identify correlation from a regression line

Without seeing a scatter diagram, what kind of correlation does each regression line suggest?

(a) y = 2.5x + 4 (b) y = −0.8x + 50 (c) x = −1.4y + 22

Sign of the gradient = direction of correlation. Positive gradient = positive correlation.

Part (a)
Gradient = 2.5 (positive) → as x ↑, y ↑.
Positive correlation

Part (b)
Gradient = −0.8 (negative) → as x ↑, y ↓.
Negative correlation

Part (c)
Gradient (c) = −1.4 (negative) → as y ↑, x ↓. Same as: as x ↑, y ↓.
Negative correlation

No scatter diagram needed: just look at the sign of the gradient!
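The same rule as a tiny Python sketch, applied to the three gradients in this example. `correlation_direction` is a made-up helper name for illustration.

```python
def correlation_direction(gradient):
    """Read the direction of correlation from the sign of a regression gradient."""
    if gradient > 0:
        return "positive"
    if gradient < 0:
        return "negative"
    return "no linear correlation"


for label, gradient in [("(a)", 2.5), ("(b)", -0.8), ("(c)", -1.4)]:
    print(label, correlation_direction(gradient))
```

It works the same way for an x on y line, because both regression lines for the same data have gradients with the same sign.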

💡 Top tips

Match the line to what you’re predicting. Predicting y → use y on x. Predicting x → use x on y.
For the x on y line, swap the lists on your GDC. Put y’s in L1 and x’s in L2, then read off c and d.
Store the full unrounded constants in your calculator’s memory before predicting. Rounding too early creates errors.
Always check whether you’re interpolating or extrapolating before trusting a prediction.
Reliable predictions need strong correlation. If r is weak, even an in-range prediction is unreliable.
The gradient sign tells you correlation direction — useful when no scatter is shown.
Both lines pass through the mean point (x̄, ȳ). Where they cross is the mean point.
Round your final regression equation coefficients to 3 s.f. unless told otherwise.

⚠ Common mistakes

Using the wrong line. Using y on x to predict x from y (or vice versa) — predictions become unreliable.
Forgetting to swap the lists for the x on y line. Just running LinReg twice with the same lists won’t give you a different answer.
Trusting extrapolated predictions. Once you’re outside the data range, the linear pattern may not hold.
Rounding the coefficients before substituting. Use the full GDC values when computing predictions, then round at the end.
Using regression lines when correlation is weak. If r is close to 0, the line isn’t a good model — don’t predict from it.
Interpreting the y-intercept literally. The intercept is y when x = 0 — but if 0 isn’t in the data range, it might not be meaningful.
Forgetting to write the equation properly. Always write “y = ax + b” with both numbers, not just one.
Confusing r (correlation) with a (gradient). They’re different numbers — both come from the same LinReg output.

🎉 You’ve now finished the entire Correlation & Regression subtopic! You can spot a relationship between two variables, measure how strong it is, find the best line through the data, and use it to make predictions. The next part of Topic 4 moves on to probability, which uses a very different set of ideas.
