IB Maths AA HL · Topic 4: Statistics & Probability · Paper 1 & 2 · ~6 min read
Scatter Diagrams & Correlation
A scatter diagram plots paired (x, y) data: controlled variable on the x-axis, response on the y-axis. The direction (positive/negative), strength (strong/weak), and presence of outliers are read straight off the plot. The line of best fit by eye must pass through the mean point (x̄, ȳ).
What you need to know
Bivariate data: pairs (x, y), with one independent variable and one dependent.
Independent (x-axis): the controlled or explanatory variable.
Dependent (y-axis): the measured or response variable.
Direction: positive (both ↑), negative (one ↑, other ↓), or none.
Strength: strong (points cluster tightly along a line), weak (more scattered), perfect (all points on a line, r = ±1).
Mean point (x̄, ȳ) lies on every line of best fit.
Correlation ≠ causation: a relationship in the data doesn’t prove one variable causes the other.
Bivariate outlier: a point that doesn’t follow the overall trend, even if its individual x and y values are normal.
Types of correlation: visual
The closer the points cluster around a straight line, the stronger the correlation. Direction is set by the slope of the trend.
The mean point
Mean point: every line of best fit passes through it
\[ (\bar{x}, \bar{y}) = \left( \frac{\sum x_i}{n}, \; \frac{\sum y_i}{n} \right) \]
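As a minimal sketch of the mean-point formula, the helper below averages the x-values and the y-values separately. The function name `mean_point` and the sample data are hypothetical, chosen only for illustration.

```python
def mean_point(xs, ys):
    """Return (x-bar, y-bar) for paired data given as two equal-length lists."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty lists of equal length")
    n = len(xs)
    # Each coordinate is simply the sum of the values divided by n.
    return sum(xs) / n, sum(ys) / n


# Hypothetical data, not taken from the worked examples.
x_bar, y_bar = mean_point([1, 2, 3, 4], [10, 20, 30, 60])
print(x_bar, y_bar)  # prints 2.5 30.0
```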
Correlation ≠ causation
Likely causal
a real mechanism links the two
e.g., training hours → improved fitness → faster race times
Spurious / coincidence
a third “lurking” variable
e.g., ice cream sales and shark attacks both rise in summer
Recipe: analyse a scatter diagram
Identify the variables: which is independent (x), which is dependent (y).
Plot the points with appropriate scale and units on each axis.
Look for direction: are points trending up, down, or no clear pattern?
Judge strength: how tightly do points cluster around the trend?
Spot outliers: any points that don’t follow the trend.
Compute the mean point and draw the line of best fit through it (only if correlation is strong).
Comment on causation: is there a plausible mechanism, or could it be coincidence?
Worked examples
WE 1
Identify variables and predict correlation type
A scientist measures the water temperature (in °C) of a lake at different depths and the dissolved oxygen content (in mg/L). (a) State which is the independent and which is the dependent variable. (b) Predict the type of correlation expected.
(a) Identify variables
Temperature is the controlled / explanatory variable → independent → x-axis
Oxygen is the measured response → dependent → y-axis
(b) Predict
Warmer water holds less dissolved oxygen (physical fact)
→ as temperature ↑, oxygen ↓ → expect NEGATIVE correlation
Answer: x = temperature, y = oxygen; expect negative correlation
Note: the independent variable always goes on the x-axis
WE 2
Find the mean point and describe the correlation
Eight students record the hours of practice per week (x) and their score on a music exam (y):
Practice (h) | 0 | 2 | 3 | 5 | 7 | 8 | 10 | 12
Score | 5 | 7 | 9 | 14 | 17 | 20 | 25 | 30
(a) Find the mean point. (b) Describe the correlation.
(a) Mean point
Σx = 0+2+3+5+7+8+10+12 = 47
x̄ = 47/8 = 5.875
Σy = 5+7+9+14+17+20+25+30 = 127
ȳ = 127/8 = 15.875
(b) Correlation
As x increases (0 → 12), y consistently increases (5 → 30)
Points lie close to a straight line
Answer: mean point (5.875, 15.875); strong positive linear correlation
Note: a line of best fit drawn through (5.875, 15.875) would slope upwards
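The arithmetic in WE 2 can be double-checked with a short script. This is only a sketch: the variable names are made up here, and the "always rising" test is a rough stand-in for reading the direction off the plot, not a formal measure of correlation.

```python
from statistics import fmean

# Data from WE 2, already ordered by increasing x.
practice = [0, 2, 3, 5, 7, 8, 10, 12]
score = [5, 7, 9, 14, 17, 20, 25, 30]

sum_x, sum_y = sum(practice), sum(score)
mean_pt = (fmean(practice), fmean(score))

# True when every step to the right on the plot also moves up.
always_rising = all(later > earlier for earlier, later in zip(score, score[1:]))

print(sum_x, sum_y, mean_pt, always_rising)  # prints 47 127 (5.875, 15.875) True
```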
WE 3
Mean point and correlation: negative trend
The hours of TV watched per evening (x) and quiz score the next day (y) for 7 students:
TV hours | 1 | 2 | 3 | 4 | 5 | 6 | 7
Score | 90 | 82 | 75 | 68 | 60 | 52 | 45
(a) Find the mean point, giving ȳ to 3 s.f. (b) Describe the correlation.
(a) Mean point
Σx = 1+2+3+4+5+6+7 = 28; x̄ = 28/7 = 4
Σy = 90+82+75+68+60+52+45 = 472
ȳ = 472/7 ≈ 67.4
(b) Correlation
As x ↑ by 1, y drops by about 7-8 each time (very consistent)
Answer: mean point (4, 67.4); strong negative linear correlation
Note: very consistent step-down in y → high strength
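The "drops by about 7-8 each time" observation in WE 3 can be checked by listing the step-to-step changes in y. A sketch with illustrative variable names; rounding to 1 decimal place happens to match 3 s.f. for this particular value.

```python
# Data from WE 3, already ordered by increasing x.
tv_hours = [1, 2, 3, 4, 5, 6, 7]
quiz = [90, 82, 75, 68, 60, 52, 45]

# Mean point, with y-bar rounded to 1 d.p. (3 s.f. for a value of this size).
mean_pt = (sum(tv_hours) / len(tv_hours), round(sum(quiz) / len(quiz), 1))

# Drop in score for each extra hour of TV.
drops = [earlier - later for earlier, later in zip(quiz, quiz[1:])]

print(mean_pt, drops)  # prints (4.0, 67.4) [8, 7, 7, 8, 8, 7]
```

Near-constant drops are what a tight straight-line pattern looks like in the raw numbers.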
WE 4
Correlation vs causation: three scenarios
For each pair of variables, state whether a strong positive correlation is likely to indicate causation or just spurious correlation, and justify briefly.
(a) Cigarettes smoked per day vs lung cancer rates.
(b) Sales of sunscreen vs number of shark attacks at beaches.
(c) Hours of sleep vs reaction time.
(a) Cigarettes vs lung cancer
CAUSAL: direct biological mechanism (carcinogens damage lung tissue)
(b) Sunscreen sales vs shark attacks
SPURIOUS: both rise in summer when more people go to the beach
→ the “lurking variable” is warm weather / beach-going
(c) Hours of sleep vs reaction time
CAUSAL: fatigue physically slows neural processing
Answer: (a) causal; (b) spurious, explained by a third variable; (c) causal
Note: always ask “is there a plausible mechanism, or could a third variable explain both?”
WE 5
Identify a bivariate outlier
The following data points were collected: (10, 50), (15, 60), (20, 70), (25, 80), (30, 90), (35, 30), (40, 110). Identify any outlier and explain why.
Step 1: Look at the trend
Most points follow y = 2x + 30:
(10, 50) → 2(10) + 30 = 50 ✓
(15, 60) → 60 ✓; (20, 70) → 70 ✓; (25, 80) → 80 ✓
(30, 90) → 90 ✓; (40, 110) → 110 ✓
Step 2: Check the suspect point
(35, 30): expected y = 2(35) + 30 = 100; actual y = 30
→ deviation of 70 from the trend
Answer: (35, 30) is the outlier, because it does not follow the linear pattern of the other points
Note: x = 35 sits well inside the range of x-values, and y = 30 is only the lowest y-value by 20; what makes the point an outlier is that the PAIR lies 70 below the trend
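The check in WE 5 amounts to measuring how far each point sits from the trend line. Below is a minimal sketch that assumes the trend y = 2x + 30 identified in Step 1; the cut-off of 20 is a hypothetical choice for this data set, not an IB rule.

```python
points = [(10, 50), (15, 60), (20, 70), (25, 80), (30, 90), (35, 30), (40, 110)]


def trend(x):
    """Trend read off the other points in WE 5: y = 2x + 30."""
    return 2 * x + 30


# Vertical distance of each point from the trend (actual y minus expected y).
residuals = {(x, y): y - trend(x) for x, y in points}

THRESHOLD = 20  # hypothetical cut-off for "far from the trend"
outliers = [pt for pt, res in residuals.items() if abs(res) > THRESHOLD]

print(outliers)  # prints [(35, 30)]
```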
WE 6
Sprint training: full analysis
A coach records the number of training sessions completed (x) and the 100m time in seconds (y) for 6 athletes:
Sessions | 5 | 8 | 12 | 15 | 20 | 25
Time (s) | 13.2 | 12.8 | 12.5 | 12.1 | 11.8 | 11.4
(a) State which is the independent variable. (b) Find the mean point. (c) Describe the correlation. (d) Comment on causation.
(a) Independent variable
Sessions are controlled by the coach → x = sessions
(b) Mean point
Σx = 5+8+12+15+20+25 = 85; x̄ = 85/6 ≈ 14.17
Σy = 13.2+12.8+12.5+12.1+11.8+11.4 = 73.8
ȳ = 73.8/6 = 12.30
(c) Correlation
As sessions ↑, time ↓ consistently → strong negative linear correlation
(d) Causation
Plausible: more training → improved fitness → faster times
Answer: (a) sessions; (b) (14.17, 12.30); (c) strong negative linear; (d) likely causal
Note: the line of best fit (drawn through the mean point) would slope downwards
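A line drawn by eye is not unique, so as one concrete stand-in the sketch below fits a least-squares line to the WE 6 data. Using least squares is an assumption made here for illustration; the worked example itself only asks for a by-eye line. The intercept is chosen so that the line passes through the mean point, and the slope comes out negative, in line with the answer to (c).

```python
sessions = [5, 8, 12, 15, 20, 25]
times = [13.2, 12.8, 12.5, 12.1, 11.8, 11.4]

n = len(sessions)
x_bar = sum(sessions) / n
y_bar = sum(times) / n

# Least-squares slope: S_xy / S_xx.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sessions, times))
s_xx = sum((x - x_bar) ** 2 for x in sessions)
slope = s_xy / s_xx

# Intercept chosen so that the line goes through (x_bar, y_bar).
intercept = y_bar - slope * x_bar

print(f"mean point: ({x_bar:.2f}, {y_bar:.2f})")
print(f"line: y = {slope:.4f}x + {intercept:.4f}")
```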
Top tips
Independent variable always on the x-axis, no exceptions.
Compute the mean point first when drawing a line of best fit by eye, then anchor your line through it.
“Strong” / “weak” describes how close points are to a line, not how steep the line is.
For correlation comments, always state direction AND strength (e.g., “strong positive linear”, not just “positive”).
If asked about causation, look for a plausible mechanism, and watch for lurking variables.
Common mistakes
Confusing “strong” with “steep”: a line with a gentle slope can still be strong if all points lie on it.
Assuming causation from correlation: a strong r doesn’t prove cause.
Drawing a line of best fit that doesn’t pass through (x̄, ȳ): it must.
Treating a bivariate outlier as a univariate one: check whether the (x, y) pair fits the trend, not whether x or y are individually extreme.
Forgetting to label axes with units.
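The "strong is not steep" point can be illustrated numerically. The sketch below computes the correlation coefficient r from its standard definition for three hypothetical data sets: a gentle perfect line, a steep perfect line, and a steeper-looking but scattered set. The helper name `pearson_r` and all data values are invented for this illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


xs = [1, 2, 3, 4, 5]
gentle = [0.1 * x for x in xs]    # slope 0.1, every point on the line
steep = [10 * x for x in xs]      # slope 10, every point on the line
scattered = [12, 35, 18, 52, 40]  # rises quickly overall, but points are spread out

r_gentle = pearson_r(xs, gentle)
r_steep = pearson_r(xs, steep)
r_scattered = pearson_r(xs, scattered)

print(round(r_gentle, 3), round(r_steep, 3), round(r_scattered, 3))
```

Both perfect lines give r equal to 1 up to rounding, whatever their slope, while the scattered set gives a noticeably smaller r even though it climbs much faster than the gentle line.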
Next: Pearson’s Product-Moment Correlation Coefficient. The PMCC turns “describe the correlation” into a single number r between −1 and 1. The closer r is to ±1, the stronger the correlation; the sign gives the direction. It is computed using your GDC’s stats mode, and a critical-value check tells you whether a linear model is appropriate.