IB Maths AA SL | Topic 4: Correlation & Regression | Paper 1 & 2 | ~10 min read
Scatter Diagrams & Correlation
Sometimes data comes in pairs — like a student’s height and weight, or hours studied vs test score. A scatter diagram lets you spot a pattern between them at a glance, and “correlation” is the maths word for how strongly two variables move together.
📘 What you need to know
Bivariate data = data collected on two variables, paired up. e.g. (height, weight) for each person.
A scatter diagram plots one variable on the x-axis and the other on the y-axis.
The independent variable goes on the x-axis. The dependent variable goes on the y-axis.
Correlation describes how the two variables change together — positive, negative, or none — and how strong (close to a line) it is.
A line of best fit goes through the mean point (x̄, ȳ) and follows the trend of the points.
Correlation does NOT mean causation! Two things moving together doesn’t mean one causes the other.
What is bivariate data?
Bivariate just means “two variables”. You collect two pieces of information from each subject — like a student’s height and their weight, or the temperature outside and the number of ice creams sold that day.
Each pair (x, y) becomes one point on a scatter diagram. Then you can see at a glance whether the two variables are related.
Bivariate data is everywhere — exam scores vs hours studied, age vs reaction time, advertising spend vs sales. Anytime you’re comparing how one thing relates to another, you’re looking at bivariate data.
What is a scatter diagram?
A scatter diagram is a graph of bivariate data. Each pair becomes a single dot — and the pattern of dots shows you the relationship.
Which variable goes on which axis?
Independent (explanatory)
Goes on the x-axis
The variable you control or set. Time, hours studied, temperature.
e.g. “hours of study” — you decide how many hours.
Dependent (response)
Goes on the y-axis
The variable you measure. The one that changes because of the other.
e.g. “exam score” — depends on how much you studied.
🧠 Memory trick: “x is the cause, y is the effect”
The thing you change is on the x-axis (cause). The thing that changes as a result is on the y-axis (effect). Hours studied → score. Temperature → ice creams sold.
Correlation — describing the pattern
Correlation is just a fancy word for “how do the two variables move together?” When you describe correlation, you mention two things: the type and the strength.
The three types of correlation
Positive
x goes up → y goes up. Slopes up from left to right.
Negative
x goes up → y goes down. Slopes down from left to right.
None
No clear pattern. Points scattered randomly.
Five common scatter patterns
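To make the types concrete, you can check whether x and y tend to sit on the same side of their means. The Python sketch below is only an illustration, not an IB method. The helper name trend_direction is made up for this note, and it reports the direction of the trend only, never the strength. The TV-hours data comes from worked example 5 in this note.

```python
from statistics import mean


def trend_direction(xs, ys):
    """Report the direction of the trend from the sign of the summed
    products of deviations from the means (hypothetical helper).

    Real data with no correlation gives a sum that is small but rarely
    exactly zero, so judge "none" from the scatter diagram, not from this.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "flat"


tv_hours = [5, 10, 15, 20, 25]
scores = [92, 78, 70, 62, 48]
print(trend_direction(tv_hours, scores))  # negative
```

A point above both means or below both means adds a positive product, while a point above one mean and below the other adds a negative product, so the sign of the total follows the slope of the cloud of points.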
Describing the strength
Strong linear correlation: the points are close to a straight line.
Weak linear correlation: the points are scattered loosely around a line, so the trend is visible but there is a lot of scatter.
Perfect linear correlation: every single point lies exactly on the line.
📍 Always describe BOTH the type AND the strength
Don’t just say “positive correlation”. Say “strong positive linear correlation” or “weak negative linear correlation”. The full description gets you both marks.
Line of best fit
If the data shows strong linear correlation, you can draw a line of best fit by eye through the points. This line follows the general trend.
Key fact about the line of best fit
The line of best fit passes through the mean point (x̄, ȳ)
How to draw a line of best fit
Calculate the mean of the x values and the mean of the y values (use your GDC).
Plot the mean point (x̄, ȳ) on the diagram.
Draw a straight line through the mean point that follows the trend of the data — try to balance the points above and below the line.
“By eye” doesn’t mean “guess wildly”. The line should pass through the mean point and have roughly the same number of points above as below it.
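The steps above can be sketched in Python as a quick self-check. This is a minimal sketch under stated assumptions: the function names are invented for this note, and the gradient of -2 is just a by-eye guess for the TV-hours data from worked example 5, not a calculated value.

```python
from statistics import mean


def mean_point(xs, ys):
    """Steps 1 and 2: the mean point (x-bar, y-bar)."""
    return mean(xs), mean(ys)


def balance_about_line(xs, ys, gradient):
    """Step 3 check: count the points above, on and below the straight
    line with the given gradient that passes through the mean point."""
    x_bar, y_bar = mean_point(xs, ys)
    above = on = below = 0
    for x, y in zip(xs, ys):
        line_y = y_bar + gradient * (x - x_bar)
        if y > line_y:
            above += 1
        elif y < line_y:
            below += 1
        else:
            on += 1
    return above, on, below


tv_hours = [5, 10, 15, 20, 25]
scores = [92, 78, 70, 62, 48]
print(mean_point(tv_hours, scores))               # (15, 70)
print(balance_about_line(tv_hours, scores, -2))   # (2, 1, 2)
```

Two points above, two below and one on the line is the kind of balance you are aiming for when you draw the line by hand.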
Correlation ≠ Causation
This is the most important warning in statistics: just because two variables are correlated doesn’t mean one is causing the other.
🤔 The classic example
Ice cream sales and shark attacks both go up together every summer. Strong positive correlation! But ice cream doesn’t cause shark attacks (and the reverse seems unlikely too). The real cause is that warmer weather brings more swimmers AND more ice cream buyers — a hidden third factor.
When does correlation suggest causation?
You can usually tell by thinking about the context:
Likely causal: Temperature outside vs ice cream sales at a park. Warm weather makes people want ice cream — direct cause.
NOT causal: Global temperatures vs number of monkeys kept as pets in the UK. Both might be increasing over time, but neither causes the other.
📍 “Causal relationship”: what to say in exams
If asked whether two variables have a causal relationship, look at the real-world context. Is there a sensible mechanism by which one would cause the other? If yes, it’s likely causal. If they’re just two unrelated trends, it’s probably a coincidence — or some hidden third factor.
Worked examples
WE 1
Draw a scatter diagram and describe the correlation
A teacher records the hours her 9 students spent on a phone and on a computer per day:
Phone (hrs)      7.6   7.0   8.9   3.0   3.0   7.5   2.1   1.3   5.8
Computer (hrs)   1.7   1.1   0.7   5.8   5.2   1.7   6.9   7.1   3.3
(a) Draw a scatter diagram. (b) Describe the correlation. (c) Plot the mean point and draw a line of best fit.
Phone hours = independent (x-axis). Computer hours = dependent (y-axis).
Part (a): scatter diagram
Plot each (phone, computer) pair as a single point.
Part (b)
As phone hours increase, computer hours decrease.
Points lie close to a downward straight line.
Answer: strong negative linear correlation
Part (c): line of best fit
Find the means using your GDC: x̄ ≈ 5.13, ȳ ≈ 3.72
Plot the mean point (5.13, 3.72) and draw a line through it sloping down.
Answer: mean point ≈ (5.13, 3.72), with the line drawn through it
Tip: always plot the mean point first. It’s your anchor for the line.
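If you want to check the means in part (c) without a GDC, a few lines of Python will do it. This is just a sketch using the table above. The "opposite sides" count is an extra check added for this note, not something the question asks for.

```python
from statistics import mean

phone = [7.6, 7.0, 8.9, 3.0, 3.0, 7.5, 2.1, 1.3, 5.8]
computer = [1.7, 1.1, 0.7, 5.8, 5.2, 1.7, 6.9, 7.1, 3.3]

x_bar = mean(phone)
y_bar = mean(computer)
print(round(x_bar, 2), round(y_bar, 2))  # 5.13 3.72

# A point with one value above its mean and the other below its mean
# supports a negative trend.
opposite_sides = sum(
    1 for x, y in zip(phone, computer) if (x - x_bar) * (y - y_bar) < 0
)
print(opposite_sides, "of", len(phone))  # 9 of 9
```

Every student is above the mean on one device and below it on the other, which fits the negative correlation described in part (b).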
WE 2
Identify the variables and predict correlation
For each pair, identify (i) the independent variable, (ii) the dependent variable, and (iii) the type of correlation you’d expect.
(a) Hours of revision and final exam mark.
(b) Outdoor temperature and number of hot drinks sold at a café.
(c) Number of cats owned and student’s height.
Independent = the cause / what we control. Dependent = the effect.
Part (a)
Independent: hours of revision (we control study time)
Dependent: exam mark (depends on revision)
More revision → better mark.
Answer: positive correlation expected
Part (b)
Independent: temperature
Dependent: hot drinks sold
Hotter day → fewer hot drinks.
Answer: negative correlation expected
Part (c)
No mechanism linking cats to height!
Answer: no correlation expected
Tip: always think about whether there’s a real-world reason for the link.
WE 3
Describe correlation from a scatter diagram
The scatter diagram below shows the height (cm) and weight (kg) of 12 students.
Describe the correlation between height and weight.
Look at the type (slope) and strength (closeness to a line).
As height increases, weight increases.
Points lie very close to a straight line sloping up.
Answer: strong positive linear correlation
Tip: use all three words: STRONG / POSITIVE / LINEAR
WE 4
Correlation vs causation — comment on a study
A researcher finds a strong positive correlation between the number of swimming pools in a town and the number of pizza shops in that town. Does this mean swimming pools cause pizza shops?
Strong correlation does NOT automatically mean causation. Look for a hidden third factor.
Both numbers likely depend on the size of the town.
Bigger towns → more pools AND more pizza shops.
Town population is the hidden third factor.
Answer: no, correlation does not imply causation
Tip: always check for a hidden cause when two things move together.
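To see how a hidden third factor can create a correlation, here is a toy Python sketch. Every number in it is invented for illustration: the four town populations and the "one pool per 10,000 people, one pizza shop per 5,000 people" rules are assumptions, not real data.

```python
# Hypothetical towns: all values below are made up for illustration.
populations = [10_000, 20_000, 40_000, 80_000]

# Assumed rules: both counts depend on population and nothing else.
pools = [p // 10_000 for p in populations]        # 1 pool per 10,000 people
pizza_shops = [p // 5_000 for p in populations]   # 1 shop per 5,000 people

print(pools)        # [1, 2, 4, 8]
print(pizza_shops)  # [2, 4, 8, 16]

# Dividing by population removes the hidden factor: the rate of pools
# per 10,000 people is the same in every town.
pools_per_10k = [n / (p / 10_000) for n, p in zip(pools, populations)]
print(pools_per_10k)  # [1.0, 1.0, 1.0, 1.0]
```

Pools and pizza shops rise together perfectly in this toy data even though neither appears in the rule for the other. Population drives both.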
WE 5
Find the mean point for a line of best fit
The data shows hours of TV watched per week and exam scores for 5 students:
TV hrs (x)    5    10   15   20   25
Score (y)     92   78   70   62   48
Find the mean point (x̄, ȳ) and state the type of correlation.
n = 5. Find the mean of the x’s and the mean of the y’s, then look at the trend.
Mean of x: (5 + 10 + 15 + 20 + 25) / 5 = 75 / 5 = 15
Mean of y: (92 + 78 + 70 + 62 + 48) / 5 = 350 / 5 = 70
Type: as TV hrs ↑, score ↓ → negative
Strength: the score drops by 8 to 14 marks for every 5 extra hrs, which is fairly consistent → strong
Answer: mean point = (15, 70), strong negative linear correlation
Tip: the line of best fit must pass through (15, 70).
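The arithmetic in this example can be checked with a short Python sketch. It uses only the five data pairs from the table, and the variable names are just labels chosen for this note.

```python
from statistics import mean

tv_hours = [5, 10, 15, 20, 25]
scores = [92, 78, 70, 62, 48]

mean_pt = (mean(tv_hours), mean(scores))
print(mean_pt)  # (15, 70)

# Change in score for each extra 5 hours of TV.
changes = [later - earlier for earlier, later in zip(scores, scores[1:])]
print(changes)  # [-14, -8, -8, -14]
```

Every change is negative and of similar size, which is what you would expect from a strong negative linear trend.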
💡 Top tips
Always describe correlation with all three words: strength (strong/weak), type (positive/negative), and “linear”. One missing word can lose marks.
Independent on x, dependent on y. If the variable is the one you control (the cause), it goes on the horizontal axis.
The line of best fit must pass through the mean point. Plot (x̄, ȳ) first, then draw the line through it.
Use your GDC for the means. Enter the x-values and y-values as two lists and run the two-variable statistics (or 1-Var Stats on each list in turn) to get x̄ and ȳ.
Correlation ≠ Causation. Always question whether one variable is actually causing the other, or if there’s a hidden third factor.
If the question asks “describe the relationship in context”, say something like “as x increases, y tends to decrease” — link to what the variables actually represent.
For “no correlation”, points should look completely random — no upward or downward drift.
Look out for outliers on a scatter diagram — bivariate outliers don’t have to be outliers in either variable on their own.
⚠ Common mistakes
Saying “positive correlation” without “strong” or “weak”. The strength is half the answer.
Forgetting “linear”. The IB wants “strong positive linear correlation” — not just “strong positive correlation”.
Putting the dependent variable on the x-axis. Cause goes on x, effect goes on y.
Drawing a line of best fit that doesn’t pass through the mean point. Always plot (x̄, ȳ) first as your anchor.
Assuming causation from correlation. “Strong correlation” never proves one variable causes the other.
Connecting the data points instead of drawing a line of best fit. The line of best fit is a single straight line through the mean — don’t join the dots.
Using “no correlation” when there’s actually weak correlation. Look carefully: if the points show a faint but consistent upward or downward drift, it’s weak (not none).
Forgetting axis labels and units on your scatter diagram. Marks lost for unlabelled axes.
Scatter diagrams give you a visual feel for correlation, but “strong” or “weak” is subjective. The next note covers Pearson’s PMCC — a single number (between −1 and 1) that measures correlation precisely. No more guessing!