IB Maths AI SL · Correlation & Regression · Paper 1 & 2 · Bivariate data · ~7 min read
Scatter Diagrams & Correlation
Bivariate data is data on two variables at once (e.g. hours studied AND test score). A scatter diagram plots one against the other so you can see the relationship at a glance. From the picture you describe type (positive / negative / none) and strength (strong / weak). If the dots cluster near a line you can draw a line of best fit by eye — and it must pass through the mean point (x̄, ȳ).
📘 What you need to know
Independent variable on x-axis, dependent on y-axis: the one you control (or set) goes horizontal; the one you measure (or that responds) goes vertical.
Type of correlation: positive (both increase together), negative (one up ↔ one down), or no linear correlation (no clear trend).
Strength: strong (points lie close to a line) or weak (points are scattered around it).
Line of best fit goes through the mean point (x̄, ȳ). Only draw one if correlation is reasonably strong.
Correlation ≠ causation: two variables can move together without one CAUSING the other. Always ask “is there a lurking third variable?”
Bivariate outliers are points that don’t follow the overall trend — even if each coordinate looks normal on its own.
The five shapes of correlation
Almost every scatter diagram you meet at this level falls into one of five categories: strong positive, weak positive, no linear correlation, weak negative, strong negative. Get the picture in your head and the wording follows automatically.
Five patterns. The tighter the cluster around a line, the stronger the correlation. Upward slope = positive; downward slope = negative; cloud with no slope = no linear correlation. The “r close to…” labels preview the formal correlation coefficient covered in the next note.
Describing what you see
When asked to “describe the correlation”, give three things in this order:
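1. Strength: strong or weak (or “no correlation” if there’s no pattern at all). Strong = points hug the line; weak = points scatter widely.
2. Direction: positive (both variables increase together) or negative (one goes up as the other goes down).
3. Linearity: the word linear (or “non-linear” if the points follow a clear curve rather than a line). At AI SL most questions are linear.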
The reliable answer phrase: “strong / weak + positive / negative + linear correlation”. E.g. “strong negative linear correlation”. Examiners look for those three words.
The line of best fit
If the correlation is strong (and roughly linear), you can draw a line of best fit “by eye”. Two rules govern it:
Line of best fit (drawn by eye)
passes through the mean point (x̄, ȳ)
and follows the overall trend — roughly equal points above and below
To plot the mean point: compute x̄ (mean of the x-values) and ȳ (mean of the y-values), mark that single point on the diagram with a different symbol, then rule a straight line through it that matches the trend.
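Written out as formulas, the mean point for n data pairs looks like this. The subscript notation (x_i, y_i) is introduced here only for illustration and is not used elsewhere in this note:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```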
Correlation does NOT imply causation
Two variables can move together for reasons other than one causing the other:
Lurking variable: ice cream sales and sunburns both go up in summer — not because one causes the other, but because both depend on hot weather.
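Reverse causation: the cause can run the opposite way to the obvious reading. Fire-engine count and damage go up together, but the engines do not cause the damage; a bigger, more damaging fire causes more engines to be sent.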
Pure coincidence: two unrelated trends that happen to point the same way over a small sample. Always ask “could there be a third explanation?”
🧭 Recipe — any scatter-diagram question
Identify independent vs dependent: which variable was controlled or set? That’s x. Which responded? That’s y.
Plot the points with even axes (label both with units!). Use a small cross or dot for each pair.
Describe correlation in three words: strength + direction + “linear”. E.g. “strong positive linear correlation”.
If correlation is strong: compute the mean point (x̄, ȳ), plot it, and rule a line of best fit through it along the trend.
Use the line carefully: predictions inside the data range (interpolation) are reasonable; predictions outside (extrapolation) are risky. And always question causation in your interpretation.
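The interpolation / extrapolation check in the last step can be sketched in Python. This is only an illustration: the function name `prediction_type` is made up for this note, and the x-values are the temperatures from Worked Example 3 below.

```python
def prediction_type(x_new, x_values):
    """Classify a prediction by whether x_new lies inside the data range."""
    lowest, highest = min(x_values), max(x_values)
    if lowest <= x_new <= highest:
        return "interpolation"   # inside the data range: reasonable
    return "extrapolation"       # outside the data range: risky


temps = [15, 18, 22, 25, 28, 30, 32]  # x-values from Worked Example 3

print(prediction_type(26, temps))  # 26 °C lies between 15 and 32
print(prediction_type(40, temps))  # 40 °C is above the largest x-value
```

A value exactly on the edge of the range (15 or 32 here) is treated as interpolation in this sketch.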
Worked examples
WE 1
Identify independent variable and expected correlation
For each pair of variables, state which is the independent (explanatory) variable and predict the type of correlation you’d expect.
(a) Fertiliser amount (g) used on a tomato plant and the plant’s final height (cm). (b) Age of a used car (years) and its resale value ($). (c) Daily hours of sunshine and number of umbrellas sold that day.
(a) You CONTROL the fertiliser; the height RESPONDS.
    Independent = fertiliser (x), dependent = height (y).
    More fertiliser ⇒ taller plants, so expect a positive correlation.
(b) Age comes first; value drops with age.
    Independent = age, dependent = value.
    Expect a negative correlation.
(c) Sunshine is the weather variable; sales respond.
    Independent = sunshine, dependent = umbrella sales.
    Expect a negative correlation.
Answer: (a) ind = fertiliser, +ve · (b) ind = age, −ve · (c) ind = sunshine, −ve
Note: independent = the variable you SET or that comes FIRST in time. Dependent = the one that REACTS. A car’s age affects its resale value, not the other way around.
WE 2
Plot a scatter diagram and describe the correlation
Eight students recorded their weekly maths study time (x hours) and their score (y %) on the next test.
(a) Sketch a scatter diagram. (b) Describe the correlation.
(a) Plot the 8 points.
    x-axis: study time (hours), 0 to 8
    y-axis: test score (%), 0 to 100
    The points climb roughly along a straight line.
(b) Check the trend.
    x up ⇒ y up
    The points lie close to a straight line.
Answer: strong positive linear correlation
Note: three magic words: strong + positive + linear. Always check that the points lie CLOSE to an imaginary straight line before saying “strong”.
WE 3
Find the mean point and use it for the line of best fit
Daily maximum temperature (x, °C) and ice cream sales (y, $) were recorded over 7 days:
x (°C): 15, 18, 22, 25, 28, 30, 32
y ($): 80, 95, 120, 140, 165, 180, 200
(a) Find the mean point (x̄, ȳ). (b) Explain how you’d use this point to draw a line of best fit.
(a) Mean of the x-values:
    sum x = 15 + 18 + 22 + 25 + 28 + 30 + 32 = 170
    x̄ = 170/7 = 24.3 (1 dp)
    Mean of the y-values:
    sum y = 80 + 95 + 120 + 140 + 165 + 180 + 200 = 980
    ȳ = 980/7 = 140
(b) The line MUST pass through (24.3, 140).
    Plot that point with a different symbol.
    Rule a line through it following the upward trend.
    Aim for roughly equal numbers of points above and below.
Answer: (a) mean point = (24.3, 140) · (b) line through this point along the trend
Note: the mean point is the “anchor” for any hand-drawn line of best fit. It’s the one point you can compute precisely; the rest is judgement.
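As a quick check of the arithmetic, the mean point can be recomputed in Python. This is a minimal sketch using the standard-library `statistics.mean`; the variable names are chosen here for illustration.

```python
from statistics import mean

# Data from this worked example, in the order used in the sums above
temps = [15, 18, 22, 25, 28, 30, 32]        # x: daily maximum temperature (°C)
sales = [80, 95, 120, 140, 165, 180, 200]   # y: ice cream sales ($)

x_bar = mean(temps)   # 170 / 7
y_bar = mean(sales)   # 980 / 7

# Round x̄ to 1 decimal place, as in the written answer
print("mean point:", (round(x_bar, 1), y_bar))
```

On a GDC the same two values appear as x̄ and ȳ in the two-variable statistics output.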
WE 4
Correlation vs causation
For each pair, state the likely correlation, and decide whether a causal relationship is plausible. Justify briefly.
(a) Ice cream sales in a coastal town and number of reported sunburn cases. (b) Fuel consumed by a car and distance driven on a road trip. (c) Number of fire engines sent to a building fire and the amount of damage caused.
(a) Summer drives both up: positive correlation.
    NOT causal: lurking variable = hot weather.
(b) More km driven means more litres burnt: positive correlation.
    YES causal: distance directly causes fuel use.
(c) Bigger fires get more engines AND cause more damage: positive correlation.
    NOT causal: fire size drives both, and the arrow from engines to damage is reversed (a bigger fire causes more engines to be sent).
Answer: (a) +ve, not causal · (b) +ve, causal · (c) +ve, not causal
Note: examiners reward the EXPLANATION more than the yes/no. Always say WHY the lurking variable or reverse causation exists. “Hot weather causes both” is a complete answer.
WE 5
Spot a bivariate outlier
Eight students’ heights (h, cm) and shoe sizes (s) are:
(a) Identify the outlier and explain why. (b) Comment on how the outlier affects the apparent correlation.
(a) Scan for a point that breaks the trend.
    7 points show: taller ⇒ bigger shoes.
    But (180, 5): the tallest student has the SMALLEST shoes.
    Outlier = (180, 5)
(b) Effect on the correlation:
    Without (180, 5): strong positive linear correlation.
    With (180, 5): the correlation appears WEAKER.
    The outlier pulls the line of best fit flatter.
Answer: (a) outlier (180, 5) · (b) weakens the apparent correlation
Note: both 180 and 5 are perfectly normal values on their own. 180 cm is a valid height and size 5 is a valid shoe size. The point is an outlier in the BIVARIATE sense: the COMBINATION breaks the pattern.
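The weakening effect in part (b) can be seen numerically. The sketch below is an illustration only: the seven “trend” points are made-up values, not the data from this question, and only the outlier (180, 5) comes from the worked example. It uses the correlation coefficient r (covered in the next note) purely as a check, computed with a hand-written helper named `pearson_r`.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired lists xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# HYPOTHETICAL trend points (taller => bigger shoes), invented for this sketch
heights = [150, 155, 160, 165, 168, 172, 175]
shoes = [6, 6.5, 7, 8, 8.5, 9, 10]

r_without = pearson_r(heights, shoes)
# Add the outlier from the worked example: tallest student, smallest shoes
r_with = pearson_r(heights + [180], shoes + [5])

print("r without outlier:", round(r_without, 2))
print("r with outlier:   ", round(r_with, 2))
```

With these made-up values, r drops noticeably once the outlier is included, which matches the “appears weaker” description.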
WE 6
Read a line of best fit to estimate a value
A scatter diagram shows the time (t, minutes) a customer spends in a café against the amount they spend (y, $). A line of best fit has been drawn passing through (2, 25) and (8, 55).
(a) Find the equation of the line. (b) Estimate the amount spent by a customer who stayed for 5 minutes.
(a) Step 1: gradient
    m = (55 − 25)/(8 − 2) = 30/6 = 5
    Step 2: intercept, using (2, 25)
    25 = 5(2) + c ⇒ c = 15
    y = 5t + 15
(b) Substitute t = 5:
    y = 5(5) + 15 = 25 + 15 = 40
Answer: (a) y = 5t + 15 · (b) about $40
Note: the slope means “for each extra minute, the bill rises by $5”. The intercept ($15) is the predicted spend for someone who stays 0 minutes. That is nonsense in context, which is fine: lines of best fit aren’t real-world physics, they’re statistical summaries.
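The two steps above (gradient, then intercept) can be sketched as a small Python function. The name `line_through` is chosen here for illustration; it assumes the two points have different x-values.

```python
def line_through(p, q):
    """Return (gradient, intercept) of the straight line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    m = (y2 - y1) / (x2 - x1)   # Step 1: gradient = rise / run
    c = y1 - m * x1             # Step 2: intercept from y = m*x + c
    return m, c


# The two points the line of best fit passes through in this example
m, c = line_through((2, 25), (8, 55))

estimate = m * 5 + c            # part (b): substitute t = 5
print(f"y = {m}t + {c}; estimate at t = 5: {estimate}")
```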
💡 Top tips
Always label axes with both variable name AND units (e.g. “Time studied (hours)”). It’s a soft mark on most scatter questions.
Use the three-word formula for describing correlation: strength + direction + “linear”.
Plot the mean point as a different symbol (e.g. a star or square) so the examiner can see you used it.
For “expected correlation” questions, think about the causal story FIRST — if A causes B to go up, that’s positive; if A causes B to go down, that’s negative.
Bivariate outliers can have individually normal coordinates — the COMBINATION is what’s odd. Look at the picture, not the table.
⚠ Common mistakes
Swapping axes: putting the dependent variable on the x-axis. Always check which one is controlled / measured first.
Forgetting “linear” in the description — you can miss a mark just for saying “strong positive correlation” without the word linear.
Drawing a line of best fit by joining first to last point. The line should go through the mean point and balance the spread — not connect any two specific data points.
Claiming causation from correlation: “X is correlated with Y, therefore X causes Y”. Always consider lurking variables and reverse causation.
Ignoring outliers: a single odd point can distort both the apparent correlation and any line of best fit. Always identify and comment on them.
Next up: Pearson’s Product-Moment Correlation Coefficient (PMCC) — the formal numerical version of “strong / weak / positive / negative”. Instead of eyeballing the cloud, you’ll use your GDC to get a number r between −1 and +1 that quantifies the linear correlation exactly. The visual intuition you’ve built here is what makes r easy to interpret.