IB Maths AA HL · Topic 4: Statistics & Probability · Paper 1 & 2 · HL only (x on y line)

Linear Regression

When the correlation is strong and linear, you replace the “by-eye” line of best fit with a least-squares regression line. There are two such lines, one for predicting y from x and one for predicting x from y, and choosing the wrong one gives unreliable predictions.

📘 What you need to know

The two regression lines

Regression of y on x
    y = ax + b
    minimises the sum of squared vertical distances from the points to the line; use it to predict y from x

Regression of x on y (HL)
    x = cy + d
    minimises the sum of squared horizontal distances from the points to the line; use it to predict x from y

Why two lines? The choice of “which axis to project onto” changes which line is optimal. The two lines coincide only when r = ±1. The closer r is to ±1, the closer the two lines are to each other.
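
For reference, the link between the two lines and r can be written with the usual summary sums. The symbols S_xx, S_yy and S_xy are notation introduced here (they are not defined in these notes), and in the exam the coefficients come from the GDC rather than from these formulas.

```latex
\begin{aligned}
S_{xx} &= \sum (x_i - \bar{x})^2, \qquad
S_{yy} = \sum (y_i - \bar{y})^2, \qquad
S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) \\[4pt]
\text{y on x:}\quad a &= \frac{S_{xy}}{S_{xx}}, \qquad b = \bar{y} - a\,\bar{x} \\[4pt]
\text{x on y:}\quad c &= \frac{S_{xy}}{S_{yy}}, \qquad d = \bar{x} - c\,\bar{y} \\[4pt]
a\,c &= \frac{S_{xy}^{2}}{S_{xx}\,S_{yy}} = r^{2}
\end{aligned}
```

Rearranging y = ax + b gives x = (1/a)y − b/a. Its slope 1/a equals c only when ac = 1, that is, when r² = 1, which matches the statement that the lines coincide only when r = ±1.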

Using the regression line to predict

Predict y from x:       ŷ = ax + b    (use the y on x line)
Predict x from y (HL):  x̂ = cy + d    (use the x on y line)

Using the wrong line (say, rearranging y = ax + b to predict x from y) gives a different (and worse) prediction than the correct x on y line. Rearranging one line does NOT give the other unless r = ±1.
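
A minimal Python sketch of this point, using the six data points from WE 5 further down. The helper `fit_line` is a hypothetical name written for this note (it is not a GDC or library function); it applies the standard least-squares formulas, and the value y = 20 is an arbitrary choice inside the data range.

```python
def fit_line(inputs, outputs):
    """Least-squares line predicting `outputs` from `inputs`.

    Hypothetical helper for illustration; returns (slope, intercept).
    """
    n = len(inputs)
    mean_in = sum(inputs) / n
    mean_out = sum(outputs) / n
    s_io = sum((u - mean_in) * (v - mean_out) for u, v in zip(inputs, outputs))
    s_ii = sum((u - mean_in) ** 2 for u in inputs)
    slope = s_io / s_ii
    return slope, mean_out - slope * mean_in


# Data from WE 5
xs = [2, 4, 6, 8, 10, 12]
ys = [10, 16, 18, 26, 28, 36]

a, b = fit_line(xs, ys)  # y on x: y = ax + b
c, d = fit_line(ys, xs)  # x on y: x = cy + d (lists swapped)

y_given = 20  # arbitrary value inside the y data range
x_from_x_on_y = c * y_given + d        # correct line for predicting x
x_from_rearranged = (y_given - b) / a  # y on x rearranged: a different line

print(f"x on y line:        {x_from_x_on_y:.4f}")
print(f"rearranged y on x:  {x_from_rearranged:.4f}")
```

The two estimates come out at roughly 6.08 and 6.06. The gap is small here because r is close to 1 for this data set; it widens as the correlation weakens.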

Reliability of predictions

Prediction type                      Reliability     Why
Interpolation (inside data range)    reliable        linear pattern is supported by data here
Extrapolation (outside data range)   unreliable      pattern may not continue; no data to confirm
Strong correlation (|r| ≈ 1)         more reliable   data tightly fits the line
Weak correlation                     less reliable   large variation around the line
Larger sample                        more reliable   regression line itself is better estimated

🧭 Recipe: find and use the regression line

  1. Check correlation: only fit a regression line if |r| is reasonably large (or the critical-value test is passed).
  2. Identify which line you need: are you predicting y or x?
  3. Enter the data into the GDC (x in L1, y in L2). For x on y, swap the lists.
  4. Run LinReg(ax+b) and record a and b (or c and d) to enough decimal places.
  5. Substitute the given value into the equation to predict.
  6. Check whether the prediction is interpolation or extrapolation and comment on reliability.
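
The recipe can be sketched in Python as a single function. This is an illustration under stated assumptions, not a replacement for the GDC: `fit_line`, `correlation` and `predict` are hypothetical names, and the cut-off `min_abs_r=0.7` is an arbitrary illustrative threshold, not an IB rule. The usage lines take the data from WE 1 below.

```python
import math


def fit_line(inputs, outputs):
    """Least-squares line predicting `outputs` from `inputs`: (slope, intercept)."""
    n = len(inputs)
    mean_in = sum(inputs) / n
    mean_out = sum(outputs) / n
    s_io = sum((u - mean_in) * (v - mean_out) for u, v in zip(inputs, outputs))
    s_ii = sum((u - mean_in) ** 2 for u in inputs)
    slope = s_io / s_ii
    return slope, mean_out - slope * mean_in


def correlation(xs, ys):
    """Pearson's product-moment correlation coefficient r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def predict(xs, ys, given_value, given_is="x", min_abs_r=0.7):
    """Follow the recipe; return (estimate, is_interpolation)."""
    # Step 1: check the correlation (threshold is an assumption).
    r = correlation(xs, ys)
    if abs(r) < min_abs_r:
        raise ValueError(f"|r| = {abs(r):.3f} is too weak to justify a regression line")
    # Steps 2-3: choose the line; swapping the lists gives x on y.
    known, unknown = (xs, ys) if given_is == "x" else (ys, xs)
    # Steps 4-5: fit the line and substitute the given value.
    slope, intercept = fit_line(known, unknown)
    estimate = slope * given_value + intercept
    # Step 6: interpolation means the given value lies inside the data range.
    is_interpolation = min(known) <= given_value <= max(known)
    return estimate, is_interpolation


hours = [1, 3, 5, 7, 9, 11, 13, 15]
scores = [25, 32, 41, 50, 56, 64, 70, 78]
estimate, is_interpolation = predict(hours, scores, 10)
print(f"predicted score: {estimate:.1f}, interpolation: {is_interpolation}")
```

Passing `given_is="y"` swaps the roles of the two lists, which mirrors swapping L1 and L2 on the GDC.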

Worked examples

WE 1

Find the y on x line and use it to predict y

The hours of practice per week (x) and piano test score out of 100 (y) for 8 students:

Hours   1    3    5    7    9    11   13   15
Score   25   32   41   50   56   64   70   78

(a) Find the equation of the regression line of y on x, giving coefficients to 3 s.f. (b) Predict the score for a student who practises 10 hours per week.

(a) Enter the data into the GDC and run LinReg(ax+b):
    a = 3.7857… ≈ 3.79
    b = 21.7143… ≈ 21.7
    y = 3.79x + 21.7
(b) Predict y at x = 10:
    y = 3.7857(10) + 21.7143 = 59.57…
    → predicted score ≈ 59.6

Answer: y = 3.79x + 21.7; score ≈ 59.6
Tip: use the FULL DISPLAY values from the GDC for predictions, not the values rounded to 3 s.f.

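
As a cross-check on WE 1, the coefficients can be reproduced in a few lines of Python. This is a sketch, not the GDC routine: `fit_line` is a hypothetical helper implementing the standard least-squares formulas. It also compares a prediction made with full-precision coefficients against one made with coefficients rounded to 3 s.f. first.

```python
def fit_line(inputs, outputs):
    """Least-squares line predicting `outputs` from `inputs`: (slope, intercept)."""
    n = len(inputs)
    mean_in = sum(inputs) / n
    mean_out = sum(outputs) / n
    s_io = sum((u - mean_in) * (v - mean_out) for u, v in zip(inputs, outputs))
    s_ii = sum((u - mean_in) ** 2 for u in inputs)
    slope = s_io / s_ii
    return slope, mean_out - slope * mean_in


hours = [1, 3, 5, 7, 9, 11, 13, 15]
scores = [25, 32, 41, 50, 56, 64, 70, 78]

a, b = fit_line(hours, scores)  # y on x

# Coefficients rounded to 3 significant figures, as quoted in part (a).
a_3sf = float(f"{a:.3g}")
b_3sf = float(f"{b:.3g}")

full = a * 10 + b             # prediction from full-precision values
rounded = a_3sf * 10 + b_3sf  # prediction from already-rounded values

print(f"a = {a:.4f}, b = {b:.4f}")
print(f"full-precision prediction: {full:.4f}")
print(f"rounded-first prediction:  {rounded:.4f}")
```

For this data both routes give 59.6 to 3 s.f., but the unrounded values differ (about 59.57 against 59.60), which is why the tip recommends carrying the full display values.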
WE 2 · HL

Find the x on y line and use it to predict x

The distance run (x, in km) and calories burned (y) for 6 athletes:

Distance (km)   3     5     6     8     10    12
Calories        240   380   460   600   770   920

(a) Find the equation of the regression line of x on y. (b) Estimate the distance run by an athlete who burned 500 calories.

(a) Enter the data into the GDC with calories in L1 and distance in L2 (the lists are swapped for x on y), then run LinReg(ax+b) on the swapped lists:
    c = 0.013142… ≈ 0.0131
    d = −0.04832… ≈ −0.0483
    x = 0.0131y − 0.0483
(b) Predict x at y = 500, using the full GDC values:
    x = 0.013142…(500) − 0.04832… ≈ 6.5229
    → predicted distance ≈ 6.52 km

Answer: x = 0.0131y − 0.0483; distance ≈ 6.52 km
Tip: always use x on y when the given value is y and you want to find x.

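
The list swap in WE 2 can be mirrored in Python: the x on y line is simply a least-squares fit with calories as the input and distance as the output. A minimal sketch, where `fit_line` is a hypothetical helper rather than a GDC or library function:

```python
def fit_line(inputs, outputs):
    """Least-squares line predicting `outputs` from `inputs`: (slope, intercept)."""
    n = len(inputs)
    mean_in = sum(inputs) / n
    mean_out = sum(outputs) / n
    s_io = sum((u - mean_in) * (v - mean_out) for u, v in zip(inputs, outputs))
    s_ii = sum((u - mean_in) ** 2 for u in inputs)
    slope = s_io / s_ii
    return slope, mean_out - slope * mean_in


distance_km = [3, 5, 6, 8, 10, 12]          # x
calories = [240, 380, 460, 600, 770, 920]   # y

# x on y: calories go in first (the "L1" role), distance second.
c, d = fit_line(calories, distance_km)
x_hat = c * 500 + d  # estimated distance for 500 calories

print(f"c = {c:.6f}, d = {d:.5f}")
print(f"estimated distance: {x_hat:.2f} km")
```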
WE 3

Choose which regression line to use

For each scenario, state which regression line should be used.
(a) A student studies for 6 hours and you want to estimate their test score.
(b) A student wants to score 80 on the test and you want to estimate the hours of study they should put in.

Identify what is given and what is being predicted:
(a) Given hours (x), predict score (y) → predict y from x → use the y on x line
(b) Given target score (y), predict hours (x) → predict x from y → use the x on y line (HL)

Answer: (a) y on x; (b) x on y
Tip: don’t rearrange one line into the other; they are mathematically different.

WE 4

Interpolation vs extrapolation

Using the regression line from WE 1 (y = 3.7857x + 21.7143, with data x ranging from 1 to 15 hours): (a) Predict the score for a student who practises 6 hours. (b) Predict the score for a student who practises 25 hours. (c) Comment on the reliability of each.

(a) x = 6 is INSIDE the data range:
    y = 3.7857(6) + 21.7143 = 44.43
    → predicted score ≈ 44.4
(b) x = 25 is OUTSIDE the data range (the maximum was 15):
    y = 3.7857(25) + 21.7143 = 116.36
    → predicted score ≈ 116
(c) Comment:
    (a) interpolation → reliable, supported by data
    (b) extrapolation → unreliable; it also exceeds the maximum score of 100, so it is impossible

Answer: (a) ≈ 44.4 (reliable interpolation); (b) ≈ 116 (unreliable extrapolation, exceeds 100)
Tip: extrapolation often produces values outside any meaningful range.

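
The reliability check in WE 4 amounts to two comparisons: is the given x inside the observed range, and is the predicted score inside 0 to 100? A short Python sketch, where `fit_line` and `describe` are hypothetical helper names chosen for this note:

```python
def fit_line(inputs, outputs):
    """Least-squares line predicting `outputs` from `inputs`: (slope, intercept)."""
    n = len(inputs)
    mean_in = sum(inputs) / n
    mean_out = sum(outputs) / n
    s_io = sum((u - mean_in) * (v - mean_out) for u, v in zip(inputs, outputs))
    s_ii = sum((u - mean_in) ** 2 for u in inputs)
    slope = s_io / s_ii
    return slope, mean_out - slope * mean_in


hours = [1, 3, 5, 7, 9, 11, 13, 15]
scores = [25, 32, 41, 50, 56, 64, 70, 78]
a, b = fit_line(hours, scores)  # y on x line from WE 1


def describe(x):
    """Return (predicted score, prediction type, whether the score is possible)."""
    y_hat = a * x + b
    kind = "interpolation" if min(hours) <= x <= max(hours) else "extrapolation"
    possible = 0 <= y_hat <= 100  # the test is marked out of 100
    return y_hat, kind, possible


results = {x: describe(x) for x in (6, 25)}
for x, (y_hat, kind, possible) in results.items():
    print(f"x = {x}: score ≈ {y_hat:.1f}, {kind}, possible score: {possible}")
```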
WE 5 · HL

Verify both regression lines intersect at the mean point

For the data x: 2, 4, 6, 8, 10, 12   y: 10, 16, 18, 26, 28, 36, find both regression lines and verify that they intersect at (x̄, ȳ).

Step 1: Mean point
    x̄ = (2 + 4 + 6 + 8 + 10 + 12)/6 = 42/6 = 7
    ȳ = (10 + 16 + 18 + 26 + 28 + 36)/6 = 134/6 ≈ 22.33
Step 2: Run LinReg on the GDC for y on x
    y = 2.4857x + 4.9333
Step 3: Run LinReg on the swapped lists for x on y
    x = 0.3925y − 1.7654
Step 4: Check both lines at the mean point
    y on x at x = 7: y = 2.4857(7) + 4.9333 = 22.33 ✓
    x on y at y = 22.33: x = 0.3925(22.33) − 1.7654 = 7.00 ✓

Both lines pass through (7, 22.33) ✓
Tip: this is a useful sanity check on your calculations.

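
The same check can be run the other way round: solve the two line equations simultaneously and confirm that the intersection is the mean point. A sketch in Python, assuming a hypothetical `fit_line` helper; the algebra in the comments is the substitution of y = ax + b into x = cy + d.

```python
def fit_line(inputs, outputs):
    """Least-squares line predicting `outputs` from `inputs`: (slope, intercept)."""
    n = len(inputs)
    mean_in = sum(inputs) / n
    mean_out = sum(outputs) / n
    s_io = sum((u - mean_in) * (v - mean_out) for u, v in zip(inputs, outputs))
    s_ii = sum((u - mean_in) ** 2 for u in inputs)
    slope = s_io / s_ii
    return slope, mean_out - slope * mean_in


xs = [2, 4, 6, 8, 10, 12]
ys = [10, 16, 18, 26, 28, 36]

a, b = fit_line(xs, ys)  # y on x: y = ax + b
c, d = fit_line(ys, xs)  # x on y: x = cy + d

# Substitute y = ax + b into x = cy + d:
#   x = c(ax + b) + d  =>  x(1 - ac) = cb + d
# This needs ac != 1, i.e. r != ±1 (otherwise the lines coincide).
x_cross = (c * b + d) / (1 - a * c)
y_cross = a * x_cross + b

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

print(f"intersection: ({x_cross:.4f}, {y_cross:.4f})")
print(f"mean point:   ({mean_x:.4f}, {mean_y:.4f})")
```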
WE 6

Ice-cream sales: full regression analysis

An ice-cream stall records the daily temperature (x, in °C) and units sold (y) over 8 days:

Temp (°C)   15    18    22    25    28    30    32    35
Sales       120   145   180   210   245   270   290   320

(a) Find r and the equation of the regression line of y on x. (b) Predict sales when the temperature is 26 °C and comment on reliability. (c) Predict sales at 45 °C and comment.

(a) GDC: r and LinReg
    r = 0.998 (very strong positive)
    a = 10.20, b = −38.96
    y = 10.2x − 39.0
(b) Predict sales at x = 26 (inside the data range 15 to 35):
    y = 10.2035(26) − 38.9641 = 226.33 → ≈ 226 units
    Reliable: interpolation + very strong correlation
(c) Predict sales at x = 45 (outside the data range):
    y = 10.2035(45) − 38.9641 = 420.19 → ≈ 420 units
    Unreliable: extrapolation; demand may saturate or staffing may limit sales

Answer: (a) r = 0.998, y = 10.2x − 39.0; (b) ≈ 226 (reliable); (c) ≈ 420 (unreliable)
Tip: always justify reliability with both “inside/outside range” and the strength of correlation.
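
The whole of WE 6 can be reproduced as a short script: r, the y on x line, and both predictions with an interpolation flag. This is a sketch under the same assumptions as before; `summary_sums` is a hypothetical helper, and the reliability comments still have to be written by you.

```python
import math


def summary_sums(xs, ys):
    """Return (mean_x, mean_y, s_xx, s_yy, s_xy) for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_x, mean_y, s_xx, s_yy, s_xy


temps = [15, 18, 22, 25, 28, 30, 32, 35]          # x, in °C
sales = [120, 145, 180, 210, 245, 270, 290, 320]  # y, units sold

mean_x, mean_y, s_xx, s_yy, s_xy = summary_sums(temps, sales)

r = s_xy / math.sqrt(s_xx * s_yy)  # correlation coefficient
a = s_xy / s_xx                    # slope of y on x
b = mean_y - a * mean_x            # intercept of y on x

predictions = {}
for temp in (26, 45):
    inside = min(temps) <= temp <= max(temps)
    predictions[temp] = (a * temp + b, inside)

print(f"r = {r:.3f}, y = {a:.4f}x + ({b:.4f})")
for temp, (units, inside) in predictions.items():
    label = "interpolation" if inside else "extrapolation"
    print(f"{temp} °C: ≈ {units:.0f} units ({label})")
```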

That closes the Correlation & Regression sub-section. Three notes, one tight workflow: scatter plot → check r → fit the right regression line → predict carefully. The next sub-section moves to Probability: sample spaces, set notation, Venn diagrams, and the formal language of “and / or / given that”.
