Applying the Pythagorean Expectation to Football: Part Two

Posted by Martin Eastwood, December 3, 2012

In my previous article, I discussed how to apply the baseball Pythagorean expectation to football and how to measure the error of the predictions using RMSE. This second article will demonstrate how to optimize the equation further to improve its accuracy.

One of the major reasons for the error in the predictions is the occurrence of draws in football. The Pythagorean expectation only looks at wins and losses and presumes that if a team scores zero goals then it will achieve zero points. This is of course incorrect: it is perfectly feasible for a team to fail to score but still gain a point through a nil-nil draw, so we need to take this into account.
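To make the problem concrete, here is a minimal sketch of the unmodified Pythagorean expectation. The function name, the exponent of 2 (the classic baseball value) and the conversion to points at 3 per win are illustrative assumptions, not values taken from this article or the previous one.

```python
def pythagorean_points(goals_for, goals_against, games, exponent=2.0, points_per_win=3):
    """Expected points under the plain Pythagorean expectation.

    exponent=2.0 is the classic baseball value and points_per_win=3 is the
    usual league scoring; both are assumptions made for illustration only.
    """
    numerator = goals_for ** exponent
    denominator = goals_for ** exponent + goals_against ** exponent
    if denominator == 0:
        # No goals at either end: the ratio is undefined, so return 0 here.
        return 0.0
    win_fraction = numerator / denominator
    return win_fraction * points_per_win * games


# A hypothetical team that never scores is predicted zero points,
# even though every nil-nil draw would actually earn it a point.
print(pythagorean_points(goals_for=0, goals_against=30, games=38))  # 0.0
```

However many goalless draws such a team ground out, the plain formula would still predict zero points for it.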

Howard Hamilton of Soccermetrics has published an updated Soccer Pythagorean equation that does just that, and it does a good job of it. For the 2011–2012 season, Howard Hamilton reports an RMSE of 3.81 compared with the RMSE of 5.65 I reported for my previous version of the Pythagorean equation. The downside of Howard Hamilton’s equation, though, is that it is rather complicated. While the original Pythagorean equation is simple enough to be used by any football fan, Howard Hamilton’s equation requires a decent understanding of mathematics to use.

Because of that, I thought I would tweak the original Pythagorean formula a bit further to try and improve its accuracy without adding too much extra complexity to it. One easy way to do this is to scale the points scored per match to take into account the occurrence of draws. Applying least squares to this reduces the RMSE for the 2011–2012 season to 4.04 points, just 6% higher than that of Howard Hamilton’s equation. This is based on only one season’s data, though, so to get a true idea of how well my enhanced Pythagorean expectation (abbreviated to MPE) works, I optimized the equation on a much larger data set and applied it to the last 10 English Premier League (EPL) seasons (Figure 1).
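The scaling-plus-least-squares idea can be sketched roughly as follows. This is not the MPE equation itself: it assumes a simplified model with a single exponent and one points-per-game scale factor, the function names are hypothetical, and the team totals are made up for illustration rather than taken from real EPL data.

```python
def win_fraction(gf, ga, k):
    """Pythagorean share of goals, with both totals raised to exponent k."""
    return gf ** k / (gf ** k + ga ** k)


def fit_scaled_pythagorean(teams, exponents):
    """Least-squares fit of points ~ scale * games * win_fraction.

    teams: list of (goals_for, goals_against, games, actual_points).
    exponents: candidate exponents to try (a simple grid search).
    Returns (best_exponent, best_scale, best_sum_of_squared_errors).
    """
    best = None
    ys = [pts for _, _, _, pts in teams]
    for k in exponents:
        xs = [games * win_fraction(gf, ga, k) for gf, ga, games, _ in teams]
        # For a fixed exponent the model is linear in `scale`, so the
        # least-squares scale has a closed form (regression through the origin).
        scale = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        sse = sum((scale * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[2]:
            best = (k, scale, sse)
    return best


# Made-up season totals, for illustration only (not real EPL data).
toy_teams = [
    (90, 30, 38, 88),
    (70, 45, 38, 68),
    (50, 50, 38, 52),
    (40, 60, 38, 40),
    (30, 75, 38, 28),
]
candidate_exponents = [1.0 + 0.05 * i for i in range(21)]  # 1.00 to 2.00
k, scale, sse = fit_scaled_pythagorean(toy_teams, candidate_exponents)
print(f"exponent={k:.2f}, scale={scale:.3f} points per game, SSE={sse:.2f}")
```

A general-purpose optimizer could replace the grid search; the closed-form scale is used here only to keep the sketch dependency-free.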

Figure 1: MPE Prediction by Season in the EPL

The MPE works well, with an average residual (the difference between predicted points and actual points) of 4.08 points. This compares nicely with Howard Hamilton’s published value of 3.81 and is less than half of the error the original Pythagorean expectation equation gave. It is also worth noting that Howard Hamilton’s RMSE of 3.81 is for just one season, and of the ten seasons analysed here using the MPE, two actually have an RMSE lower than 3.81.
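For anyone wanting to reproduce this kind of error figure, here is a minimal sketch of the two measures mentioned above: RMSE and the mean absolute residual. The function names and the sample numbers are hypothetical and are not taken from the data behind Figure 1.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual points."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def mean_absolute_residual(predicted, actual):
    """Average size of the residuals, ignoring their sign."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Made-up points totals for four teams, for illustration only.
predicted_points = [60, 50, 40, 30]
actual_points = [63, 46, 40, 30]

print(rmse(predicted_points, actual_points))                    # 2.5
print(mean_absolute_residual(predicted_points, actual_points))  # 1.75
```

RMSE penalises large misses more heavily than the mean absolute residual, so the two figures generally differ for the same set of predictions.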

Plotting the MPE predicted points against actual points for the last ten EPL seasons shows visually how well the MPE equation works (Figure 2). The correlation between the predicted and actual points scored is excellent, with an r² value of 0.938.

Figure 2: MPE Predicted Points Versus Actual Points in the EPL
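The article does not say how the r² value was calculated, so the sketch below assumes it is the squared Pearson correlation between predicted and actual points. The function name and the sample data are hypothetical.

```python
def r_squared(predicted, actual):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(predicted) != len(actual) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov ** 2 / (var_p * var_a)


# Made-up points totals, for illustration only (not the data in Figure 2).
toy_predicted = [85, 70, 55, 45, 30]
toy_actual = [88, 66, 57, 41, 33]
print(f"r^2 = {r_squared(toy_predicted, toy_actual):.3f}")
```

Under this definition r² measures how closely the points follow some straight line, so it is best read alongside an error measure such as RMSE.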

So, based on the initial work so far, I am pleased that the MPE version of the Pythagorean expectation gives results comparable to Howard Hamilton’s more detailed and advanced derivation but without quite as much added complexity. The final equation, for anybody who wants to give it a try, is shown below in Figure 3.

Figure 3: MPE Equation

About Martin Eastwood

Martin is a football fan and data scientist. In his spare time he likes to combine the two and write about the mathematical analysis of football.

There are 13 Comments

  1. - December 3, 2012

    Hi Martin,
    nice, clear explanation.

    Have you looked at using Pythag to predict future games rather than explain previous points totals? I’ve always thought that the drive to reduce RMSE for predicted vs actual points can lead to overfitting of the non-repeatable, luck-driven component of the actual games.

    I’ve looked at Pythag match-ups as predictors of future games, but only in the NFL, not soccer, and at Pythag league points from one year to predict total points in the next season for soccer here:
    http://thepowerofgoals.blogspot.co.uk/2012/11/a-predictive-pythagorean-for-football.html

    Game by game pythag is also a novel twist.

    Interesting subject, but a bit confused as to where it’s going at the mo. Posts like yours will go a long way towards clarifying things. Once again, nice read.

    Mark

    • admin
      - December 3, 2012

      Hi Mark,
      Thanks for the comments, they are really interesting points. I agree about the RMSE: football is so variable and luck-driven that an error of zero is unrealistic, and a model fitted that tightly would probably not be particularly useful for making predictions as it would not reflect the future accurately.
      I am interested in looking at the predictive power of the Pythagorean and will be investigating that further. I am also interested though in what else the Pythagorean shows, such as how much Everton would need to improve in terms of goals scored / conceded to really challenge for a top four place or for a relegation-threatened team to stay up etc. Hopefully over the next few weeks I will be able to find out how useful the Pythagorean can be for this sort of thing in football.
      Thanks again!
      Martin

  2. - December 12, 2012

    Thanks for an interesting blog. I have just one question. I see that you have fitted two different exponents for “GoalsFor”. Have you tried fitting the formula with the same parameter for “GoalsFor” in both the numerator and denominator? This would seem more intuitive (not that I find the Pythagorean very intuitive), but then again you would probably get a larger RMSE.

    • admin
      - December 12, 2012

      Yes, you could simplify it by using one exponent, but the RMSE will go up.

  3. Andrew Ferris
    - December 17, 2012

    Applying your equation to the current league table for the premiership on a team by team basis, the total points would equal 450, as opposed to the real points total currently 451 (correct 17/12/12), which is amazing accuracy. However, it is out by 2.7 points per team on average, and out by 11 points for Manchester United, which would be great in real life as I detest them! The league table would look as follows:

    Manchester City 34
    Manchester United 31
    Chelsea 28
    Arsenal 28
    Everton 27
    Tottenham Hotspur 25
    Swansea City 25
    West Bromwich Albion 25
    Stoke City 25
    West Ham United 23
    Liverpool 23
    Fulham 22
    Norwich City 19
    Sunderland 19
    Newcastle United 18
    Southampton 17
    Aston Villa 16
    Reading 15
    Wigan Athletic 15
    Queens Park Rangers 14

  4. Pingback What’s Ailing Arsenal and Can They Finish 4th? – A Statistical Analysis | Mixed kNuts

  5. Pingback Introduction to Soccer Analytics – The Guys I Follow | Mixed kNuts

  6. Max Steele
    - October 4, 2013

    Hi Martin,

    I am slightly confused by the last graph. Surely it is more helpful to measure deviation from the line y=x rather than the line of best fit? I don’t really see what success a high correlation coefficient has if for example the best fit line was vertical. To highlight this, realise that you could achieve this with r^2 = 1 if you just predicted every team to have the same points total.

    Using the line y=x would have the added benefit of allowing you to see where your model is performing better/worse (i.e. low point scorers underestimated etc.) with systematic deviations from the line y=x.

    Thanks,
    Max

  7. M
    - December 5, 2013

    Why does the formula have GoalsAway instead of GoalsAgainst?

    • Martin Eastwood
      - December 29, 2013

      Yes, perhaps GoalsAgainst would have been a better name. In case it isn’t clear to anybody else, I am referring to goals conceded.

  8. Pingback Soccer’s Biggest Overachievers and Underachievers | Nelli Analytics

  9. Rick Tee
    - September 6, 2014

    I have been working on this for a month or so and what I’ve found is that accuracy drops from about 60% in the BPL to 40% in the Championship and lower. I also noticed a strange phenomenon where the outcome was the reverse of the test result, i.e. if team B were deemed the winner then team A would actually be the winner.
    Current result %
    BPL 70%
    Cha 60%
    LG1 60%
    LG2 55%
    CNF 70%
    I should add I am also counting draws in the accuracy quoted above. I will continue to test, but as the season is still ‘finding its feet’ I’m sure things will change.

  10. Rick Tee
    - September 6, 2014

    Thought I would add, for the draws I am using variables x and y, which are set differently for each league, e.g. BPL x=75 y=84, CNF x=57.25 y=55.

    Hw = 70, Aw = 75, D = 80
    If the points fall between these two numbers then a draw is the predicted outcome.

    I know the numbers are widely different but the system is essentially the same. I just thought I would share some information on how I’ve been calculating a draw.
    If anyone finds it useful I will try to explain my system in more detail.
