IB Maths AI SL · Correlation & Regression · Paper 1 & 2 · Bivariate data

Scatter Diagrams & Correlation

Bivariate data is data on two variables at once (e.g. hours studied AND test score). A scatter diagram plots one against the other so you can see the relationship at a glance. From the picture you describe the direction (positive / negative / none) and the strength (strong / weak) of the correlation. If the dots cluster near a line, you can draw a line of best fit by eye, and it must pass through the mean point (x̄, ȳ).

📘 What you need to know

The five shapes of correlation

Every scatter diagram falls into one of these five categories. Get the picture in your head and the wording follows automatically.

[Figure: five correlation patterns. Panels: strong positive (r close to +1), no correlation (r close to 0), strong negative (r close to −1), weak positive (small positive r), weak negative (small negative r).]
Five patterns. The tighter the cluster around a line, the stronger the correlation. Upward slope = positive; downward slope = negative; cloud with no slope = no linear correlation. The “r close to…” labels preview the formal correlation coefficient covered in the next note.

Describing what you see

When asked to “describe the correlation”, give three things in this order:

1. Strength: strong or weak (or “no” if there’s no pattern at all). Strong = points hug the line; weak = points scatter widely.

2. Direction: positive (line slopes up) or negative (line slopes down).

3. Linearity: the word linear (or “non-linear” if the points follow a clear curve rather than a straight line). At AI SL most questions are linear.

The reliable answer phrase: “strong / weak + positive / negative + linear correlation”. E.g. “strong negative linear correlation”. Examiners look for those three words.

The line of best fit

If the correlation is strong (and roughly linear), you can draw a line of best fit “by eye”. Two rules govern it:

1. It passes through the mean point (x̄, ȳ).

2. It follows the overall trend, with roughly equal numbers of points above and below it.

To plot the mean point: compute x̄ (mean of the x-values) and ȳ (mean of the y-values), mark that single point on the diagram with a different symbol, then rule a straight line through it that matches the trend.
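
Written as formulas, the mean point for n data pairs looks like this. The index notation (x_i, y_i and n) is introduced here for illustration and is not part of the original note.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad
\text{mean point} = (\bar{x},\ \bar{y})
```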

Correlation does NOT imply causation

Two variables can move together for reasons other than one causing the other:

Lurking variable: ice cream sales and sunburns both go up in summer — not because one causes the other, but because both depend on hot weather.

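Reverse causation: fire-engine count and damage both go up together, but the engines don’t cause the damage. A bigger, more destructive fire is what gets more engines sent, so the arrow of cause points the other way; it’s the size of the fire that drives both.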

Pure coincidence: two unrelated trends that happen to point the same way over a small sample. Always ask “could there be a third explanation?”

🧭 Recipe — any scatter-diagram question

  1. Identify independent vs dependent: which variable was controlled or set? That’s x. Which responded? That’s y.
  2. Plot the points on evenly scaled axes (label both with units!). Use a small cross or dot for each pair.
  3. Describe correlation in three words: strength + direction + “linear”. E.g. “strong positive linear correlation”.
  4. If correlation is strong: compute the mean point (x̄, ȳ), plot it, and rule a line of best fit through it along the trend.
  5. Use the line carefully: predictions inside the data range (interpolation) are reasonable; predictions outside (extrapolation) are risky. And always question causation in your interpretation.
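
The interpolation / extrapolation check in step 5 can be sketched in a few lines of Python. This is only an illustration: the function name `prediction_kind` is made up for this note, and the x-values are the study times from WE 2 below.

```python
def prediction_kind(x_new, xs):
    """Say whether predicting at x_new is interpolation or extrapolation,
    judged only by the range of the observed x-values."""
    lowest, highest = min(xs), max(xs)
    if lowest <= x_new <= highest:
        return "interpolation"
    return "extrapolation"


# x-values (weekly study hours) taken from WE 2
study_hours = [0.5, 1, 2, 3, 4, 5, 6, 7]

print(prediction_kind(4.5, study_hours))  # interpolation
print(prediction_kind(10, study_hours))   # extrapolation
```

A prediction at 4.5 hours sits inside the data, so it is reasonable; a prediction at 10 hours lies beyond anything observed, so treat it with caution.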

Worked examples

WE 1

Identify independent variable and expected correlation

For each pair of variables, state which is the independent (explanatory) variable and predict the type of correlation you’d expect.

(a) Fertiliser amount (g) used on a tomato plant and the plant’s final height (cm).   (b) Age of a used car (years) and its resale value ($).   (c) Daily hours of sunshine and number of umbrellas sold that day.

(a) You CONTROL fertiliser; height RESPONDS.
    ind = fertiliser (x), dep = height (y)
    more fertiliser ⇒ taller plants
    expect: positive correlation
(b) Age comes first; value drops with age.
    ind = age, dep = value
    expect: negative correlation
(c) Sunshine is the weather variable; sales respond.
    ind = sunshine, dep = umbrella sales
    more sunshine ⇒ fewer umbrellas sold
    expect: negative correlation

Answer: (a) ind=fertiliser, +ve · (b) ind=age, −ve · (c) ind=sunshine, −ve

Tip: independent = the variable you SET or that comes FIRST in time. Dependent = the one that REACTS. Fertiliser changes the plant’s height, not the other way around.

WE 2

Plot a scatter diagram and describe the correlation

Eight students recorded their weekly maths study time (x hours) and their score (y %) on the next test.

(0.5, 32), (1, 45), (2, 55), (3, 60), (4, 70), (5, 76), (6, 82), (7, 88)

(a) Sketch a scatter diagram. (b) Describe the correlation.

(a) Plot the 8 points.
    x-axis: study time (hours), 0 to 8
    y-axis: test score (%), 0 to 100
    points climb roughly along a straight line
(b) Check the trend.
    x up ⇒ y up
    points lie close to a straight line

Answer: strong positive linear correlation

Tip: three magic words: strong + positive + linear. Always check that the points lie CLOSE to an imaginary straight line before saying “strong”.

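
As a cross-check on the by-eye description, here is a minimal Python sketch that computes the correlation coefficient r for the WE 2 data. The coefficient is only previewed in this note (it is covered in the next one), and in an exam you would read it from your GDC; the helper name `pearson_r` is made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient from the sums of squared deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# WE 2 data: weekly study time (hours) and test score (%)
hours = [0.5, 1, 2, 3, 4, 5, 6, 7]
scores = [32, 45, 55, 60, 70, 76, 82, 88]

r = pearson_r(hours, scores)
print(round(r, 2))  # about 0.98
```

A value this close to +1 agrees with the description “strong positive linear correlation”.
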
WE 3

Find the mean point and use it for the line of best fit

Daily maximum temperature (x, °C) and ice cream sales (y, $) were recorded over 7 days:

(15, 80), (18, 95), (22, 120), (25, 140), (28, 165), (30, 180), (32, 200)

(a) Find the mean point (x̄, ȳ). (b) Explain how you’d use this point to draw a line of best fit.

(a) Mean of the x’s:
    sum x = 15 + 18 + 22 + 25 + 28 + 30 + 32 = 170
    x̄ = 170/7 = 24.3 (1 dp)
    Mean of the y’s:
    sum y = 80 + 95 + 120 + 140 + 165 + 180 + 200 = 980
    ȳ = 980/7 = 140
(b) The line MUST pass through (24.3, 140).
    plot that point with a different symbol
    rule a line through it following the upward trend
    aim for roughly equal points above & below

Answer: (a) mean point = (24.3, 140) · (b) line through this point along the trend

Tip: the mean point is the “anchor” for any hand-drawn line of best fit. It’s the one point you can compute precisely; the rest is judgement.

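
The arithmetic in part (a) can be checked with a short Python sketch. The function name `mean_point` is made up for illustration; the data are the seven pairs given in the question.

```python
def mean_point(points):
    """Return (x-bar, y-bar) for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    return x_bar, y_bar


# WE 3 data: daily maximum temperature (°C) and ice cream sales ($)
data = [(15, 80), (18, 95), (22, 120), (25, 140),
        (28, 165), (30, 180), (32, 200)]

x_bar, y_bar = mean_point(data)
print(round(x_bar, 1), y_bar)  # 24.3 140.0
```
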
WE 4

Correlation vs causation

For each pair, state the likely correlation, and decide whether a causal relationship is plausible. Justify briefly.

(a) Ice cream sales in a coastal town and number of reported sunburn cases.   (b) Fuel consumed by a car and distance driven on a road trip.   (c) Number of fire engines sent to a building fire and the amount of damage caused.

(a) Summer drives both up.
    positive correlation
    NOT causal: lurking variable = hot weather
(b) More km driven means more litres burnt.
    positive correlation
    YES causal: distance directly causes fuel use
(c) Bigger fires get more engines AND cause more damage.
    positive correlation
    NOT causal: the engines don’t cause the damage; fire size drives both (the fire brings the engines, not the reverse)

Answer: (a) +ve, not causal · (b) +ve, causal · (c) +ve, not causal

Tip: examiners reward the EXPLANATION more than the yes/no. Always say WHY the lurking variable or reverse causation exists. “Hot weather causes both” is a complete answer.

WE 5

Spot a bivariate outlier

Eight students’ heights (h, cm) and shoe sizes (s) are:

(150, 5), (155, 6), (160, 6), (165, 7), (170, 7), (172, 8), (175, 9), (180, 5)

(a) Identify the outlier and explain why. (b) Comment on how the outlier affects the apparent correlation.

(a) Scan for a point that breaks the trend.
    7 points show: taller ⇒ bigger shoes
    but (180, 5): the tallest student has the joint-smallest shoe size
    outlier = (180, 5)
(b) Effect on correlation:
    without (180, 5): strong positive linear correlation
    with (180, 5): correlation appears WEAKER
    the outlier pulls the line of best fit flatter

Answer: (a) outlier (180, 5) · (b) weakens the apparent correlation

Tip: notice that both 180 and 5 are perfectly normal values on their own: 180 cm is a valid height, size 5 is a valid shoe size. The point is an outlier in the BIVARIATE sense: the COMBINATION breaks the pattern.

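
To see the effect in part (b) numerically, the sketch below compares the data with and without (180, 5). Two assumptions to flag: it uses the correlation coefficient r, which is only previewed in this note, and it uses the least-squares gradient as a stand-in for the gradient of a line drawn by eye. The function name `r_and_gradient` is made up for illustration.

```python
from math import sqrt


def r_and_gradient(points):
    """Return (r, least-squares gradient) for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    s_yy = sum((y - y_bar) ** 2 for _, y in points)
    return s_xy / sqrt(s_xx * s_yy), s_xy / s_xx


# WE 5 data: height (cm) and shoe size
all_points = [(150, 5), (155, 6), (160, 6), (165, 7),
              (170, 7), (172, 8), (175, 9), (180, 5)]
without_outlier = [p for p in all_points if p != (180, 5)]

r_all, m_all = r_and_gradient(all_points)
r_clean, m_clean = r_and_gradient(without_outlier)

print(round(r_clean, 2), round(r_all, 2))  # roughly 0.95 without, 0.44 with
print(m_all < m_clean)                     # True: the line is flatter with the outlier
```
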
WE 6

Read a line of best fit to estimate a value

A scatter diagram shows the time (t, minutes) a customer spends in a café against the amount they spend (y, $). A line of best fit has been drawn passing through (2, 25) and (8, 55).

(a) Find the equation of the line. (b) Estimate the amount spent by a customer who stayed for 5 minutes.

(a) Step 1: gradient
    m = (55 − 25)/(8 − 2) = 30/6 = 5
    Step 2: intercept, using (2, 25)
    25 = 5(2) + c ⇒ c = 15
    y = 5t + 15
(b) Substitute t = 5.
    y = 5(5) + 15 = 25 + 15 = 40

Answer: (a) y = 5t + 15 · (b) about $40

Tip: the slope means “for each extra minute, the bill rises by $5”. The intercept ($15) is the predicted spend for someone who stays 0 minutes, which is nonsense in context, and that is fine: lines of best fit aren’t real-world physics, they’re statistical summaries.
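
The two steps in part (a) and the substitution in part (b) can be sketched in Python as follows. The function name `line_through` is made up for illustration; the two points are the ones given in the question.

```python
def line_through(p, q):
    """Gradient m and intercept c of the straight line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    m = (y2 - y1) / (x2 - x1)  # rise over run
    c = y1 - m * x1            # rearranged from y1 = m*x1 + c
    return m, c


m, c = line_through((2, 25), (8, 55))
estimate = m * 5 + c           # predicted spend for a 5-minute stay

print(m, c, estimate)  # 5.0 15.0 40.0
```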

Next up: Pearson’s Product-Moment Correlation Coefficient (PMCC) — the formal numerical version of “strong / weak / positive / negative”. Instead of eyeballing the cloud, you’ll use your GDC to get a number r between −1 and +1 that quantifies the linear correlation exactly. The visual intuition you’ve built here is what makes r easy to interpret.
