# Applying the Pythagorean Expectation to Football: Part Two

In my previous article, I discussed how to apply the baseball Pythagorean expectation to football and how to measure the error of the predictions using RMSE. This second article will demonstrate how to optimize the equation further to improve its accuracy.

One of the major sources of error in the predictions is the occurrence of draws in football. The Pythagorean expectation only looks at wins and losses, and presumes that a team that scores zero goals will achieve zero points. This is of course incorrect: it is perfectly feasible for a team to fail to score but still gain a point through a nil-nil draw, so we need to take this into account.
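To make the problem concrete, here is the general Pythagorean form written with a generic exponent x. The exponent is a placeholder, since the value used in the previous article is not restated here, and GF and GA stand for goals for and goals against:

```latex
\[
\text{Win fraction} = \frac{GF^{x}}{GF^{x} + GA^{x}}
\qquad\text{so}\qquad
GF = 0,\; GA > 0 \;\Longrightarrow\; \text{Win fraction} = \frac{0}{0 + GA^{x}} = 0
\]
```

In other words, a team with no goals scored is predicted zero points regardless of how many of its matches finished 0-0.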

Howard Hamilton of Soccermetrics has published an updated Soccer Pythagorean equation that does just that, and it does a good job of it. For the 2011–2012 season, he reports an RMSE of 3.81, compared with the RMSE of 5.65 I reported for my previous version of the Pythagorean equation. The downside of his equation, though, is that it is rather complicated. While the original Pythagorean equation is simple enough to be used by any football fan, Hamilton’s equation requires a decent understanding of mathematics.

Because of that, I thought I would tweak the original Pythagorean formula a bit further to try to improve its accuracy without adding too much extra complexity. One easy way to do this is to scale the points scored per match to take the occurrence of draws into account. Applying least squares to this reduces the RMSE for the 2011–2012 season to 4.04 points, just 6% higher than Howard Hamilton’s equation. This is based on only one season’s data, though, so to get a true idea of how well my enhanced Pythagorean expectation (abbreviated to MPE) works, I optimized the equation on a much larger data set and applied it to the last 10 English Premier League (EPL) seasons (Figure 1).
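As a rough illustration of the scaling idea, the sketch below fits a single scale factor `k` by least squares, assuming predicted points take the form `k * games * GF^x / (GF^x + GA^x)`. This form, the function names, the placeholder exponent and the example rows are all assumptions made for illustration; they are not the fitted MPE equation or real league data.

```python
def pythag_ratio(goals_for, goals_against, exponent):
    """Classic Pythagorean ratio GF^x / (GF^x + GA^x)."""
    gf = goals_for ** exponent
    ga = goals_against ** exponent
    return gf / (gf + ga)


def fit_scale(rows, exponent):
    """Closed-form least-squares scale k for: points ~ k * games * ratio.

    Minimising sum((points - k * u)^2) with u = games * ratio gives
    k = sum(points * u) / sum(u * u).
    """
    numerator = 0.0
    denominator = 0.0
    for goals_for, goals_against, games, points in rows:
        u = games * pythag_ratio(goals_for, goals_against, exponent)
        numerator += points * u
        denominator += u * u
    return numerator / denominator


# Made-up illustrative rows: (goals_for, goals_against, games, points).
# These are NOT real league data.
example_rows = [
    (90, 30, 38, 88),
    (60, 50, 38, 58),
    (45, 45, 38, 50),
    (35, 70, 38, 32),
]

EXPONENT = 2.0  # placeholder exponent, not the value fitted in the article
k = fit_scale(example_rows, EXPONENT)
print(f"least-squares scale k = {k:.3f}")
```

With three points for a win and no draws at all, a perfect win-fraction estimate would imply a scale of exactly 3; letting the data choose `k` is one simple way to let draws pull that figure away from 3.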

The MPE works well, with an average residual (the difference between predicted points and actual points) of 4.08 points. This compares nicely with Howard Hamilton’s published value of 3.81 and is less than half of the error the original Pythagorean Expectation equation gave. It is also worth noting that Howard Hamilton’s RMSE of 3.81 is for just one season, and of the ten seasons analysed here using the MPE, two actually have an RMSE lower than 3.81.
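For anyone wanting to reproduce these error figures on their own data, here is a minimal sketch of the two measures mentioned above: RMSE and the average size of the residuals. The function names and the five-team example are made up for illustration; only the 4.04 and 3.81 figures come from the text.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def mean_abs_residual(predicted, actual):
    """Average size of the residuals, ignoring their sign."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Made-up predicted and actual points for five teams (not real data).
predicted = [80.0, 65.0, 52.0, 44.0, 33.0]
actual = [86.0, 61.0, 52.0, 47.0, 30.0]

print(f"RMSE: {rmse(predicted, actual):.2f}")
print(f"mean absolute residual: {mean_abs_residual(predicted, actual):.2f}")

# Relative gap between the two single-season RMSE figures quoted in the article.
print(f"relative gap: {(4.04 - 3.81) / 3.81:.1%}")
```

RMSE is never smaller than the mean absolute residual on the same data, because squaring gives large misses more weight, so the two numbers are not directly interchangeable.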

Plotting the MPE predicted points against actual points for the last ten EPL seasons shows visually how well the MPE equation works (Figure 2). The correlation between the predicted and actual points is excellent, with an r^{2} value of 0.938.

Based on the work so far, I am pleased that the MPE version of the Pythagorean expectation gives results comparable to Howard Hamilton’s more detailed and advanced derivation, but without quite as much added complexity. The final equation, for anybody who wants to give it a try, is shown below in Figure 3.

Hi Martin,

Nice, clear explanation.

Have you looked at using pythag to predict future games rather than explain previous points totals? I’ve always thought that the drive to reduce RMSE for predicted vs actual points can lead to overfitting of the non-repeatable, luck-driven component of the actual games.

I’ve looked at pythag match-ups as predictors of future games, but only in the NFL, not soccer, and at using pythag league points from one year to predict total points in the next season for soccer here:

http://thepowerofgoals.blogspot.co.uk/2012/11/a-predictive-pythagorean-for-football.html

Game by game pythag is also a novel twist.

Interesting subject, but a bit confused as to where it’s going at the mo. Posts like yours will go a long way towards clarifying things. Once again, nice read.

Mark

Hi Mark,

Thanks for the comments, they are really interesting points. I agree about the RMSE: football is so variable and luck-driven that an error of zero is unrealistic, and probably not particularly useful for making predictions from, as it will not reflect the future accurately.

I am interested in looking at the predictive power of the Pythagorean and will be investigating that further. I am also interested though in what else the Pythagorean shows, such as how much Everton would need to improve in terms of goals scored / conceded to really challenge for a top four place or for a relegation-threatened team to stay up etc. Hopefully over the next few weeks I will be able to find out how useful the Pythagorean can be for this sort of thing in football.

Thanks again!

Martin

Thanks for an interesting blog. I have just one question. I see that you have fitted two different exponents for “GoalsFor”. Have you tried fitting the formula with the same parameter for “GoalsFor” in both the numerator and denominator? This would seem more intuitive (not that I find the Pythagorean very intuitive), but then again you would probably get a larger RMSE.

Yes, you could simplify it by using one exponent, but the RMSE will go up.

Applying your equation to the current Premiership table on a team-by-team basis, the total points would equal 450, as opposed to the real points total of 451 (correct as of 17/12/12), which is amazing accuracy. However, it is out by 2.7 points per team on average, and out by 11 points for Manchester United, which would be great in real life as I detest them! The league table would look as follows:

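| Team | Predicted points |
| --- | --- |
| Manchester City | 34 |
| Manchester United | 31 |
| Chelsea | 28 |
| Arsenal | 28 |
| Everton | 27 |
| Tottenham Hotspur | 25 |
| Swansea City | 25 |
| West Bromwich Albion | 25 |
| Stoke City | 25 |
| West Ham United | 23 |
| Liverpool | 23 |
| Fulham | 22 |
| Norwich City | 19 |
| Sunderland | 19 |
| Newcastle United | 18 |
| Southampton | 17 |
| Aston Villa | 16 |
| Reading | 15 |
| Wigan Athletic | 15 |
| Queens Park Rangers | 14 |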

Hi Martin,

I am slightly confused by the last graph. Surely it is more helpful to measure deviation from the line y=x rather than from the line of best fit? I don’t really see what a high correlation coefficient tells you if, for example, the best-fit line was vertical. To highlight this, realise that you could achieve r^2 = 1 if you just predicted every team to have the same points total.

Using the line y=x would have the added benefit of allowing you to see where your model is performing better/worse (i.e. low point scorers underestimated etc.) with systematic deviations from the line y=x.

Thanks,

Max
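Max’s point about y=x can be illustrated with a short sketch. The numbers below are made up, and the predictions are deliberately constructed to lie on a straight line that is not y=x, so r^2 is perfect while the predictions are still well off for the top and bottom teams. The function names are hypothetical.

```python
import math


def r_squared(predicted, actual):
    """Squared Pearson correlation; NaN if either series has zero variance."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        return float("nan")  # correlation is undefined without variation
    return cov * cov / (var_p * var_a)


def rmse_from_identity(predicted, actual):
    """RMSE measured against the line y = x (predicted equals actual)."""
    total = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(total / len(predicted))


# Made-up actual points, and predictions squeezed towards the middle.
actual = [30.0, 45.0, 60.0, 75.0, 90.0]
predicted = [0.5 * a + 30.0 for a in actual]

print(f"r^2 = {r_squared(predicted, actual):.3f}")
print(f"RMSE from y = x: {rmse_from_identity(predicted, actual):.2f}")

# Signed residuals show the systematic pattern: weak teams are
# over-predicted and strong teams under-predicted.
for p, a in zip(predicted, actual):
    print(f"actual {a:5.1f}  predicted {p:5.1f}  residual {p - a:+6.1f}")
```

Note that if every team is given exactly the same prediction, the correlation is undefined because the predictions have no variance, which is why the function returns NaN in that case.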

Why does the formula have GoalsAway instead of GoalsAgainst?

Yes, perhaps GoalsAgainst would have been a better name. In case it isn’t clear to anybody else, I am referring to goals conceded.