pena.lt/y/Thu, 04 Dec 2014 19:30:00 +0000Massey Ratings For Football Part Two/2014/12/04<h2>Introduction</h2> <p>In <a href="http://pena.lt/y/2014/11/27/english-premier-league">Part One</a> I introduced Massey Ratings and how they can be used to rank football teams in a way that accounts for their strength of schedule. Next, we’ll take a look at how Massey Ratings can be extended further to look at each team’s attack and defence strengths separately.</p> <h2>Massey Ratings</h2> <p>The idea behind Massey Ratings is that they rate teams such that the difference between any two teams’ ratings is equal to the expected margin of victory between them. For example, if a team rated -1.0 played a team rated +1.0 then we’d expect the average goal difference between them to be two goals.</p> <p>Since Massey Ratings look at goal difference rather than goals scored or conceded, they account for a team’s overall strength and combine both its attack and defence strengths into a single value. This means that with a bit of mathematics we should be able to decompose a Massey Rating to split out these two constituent parts.</p> <h2>Attack And Defence</h2> <p>In Part One we originally defined the Massey Rating as shown below in Equation One:</p> <p><span class="math">\(y = r_a - r_b\)</span> </p> <p>where <em>y</em> is the margin of victory for a fixture, <em>r<sub>a</sub></em> is the rating of team <em>a</em> and <em>r<sub>b</sub></em> is the rating of team <em>b</em>. 
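</p> <p>As a quick refresher on where the overall ratings in Equation One come from, here is a minimal sketch in R rather than the code used to produce the charts in this article. The three teams, their results and all of the variable names are made up for illustration. It builds the Massey matrix described in Part One, replaces the last row with ones so that the ratings sum to zero, and solves the resulting linear system.</p>

```r
# Hypothetical results: one row per match (teams and scores are made up)
results = data.frame(
  home       = c("A", "B", "C", "A"),
  away       = c("B", "C", "A", "C"),
  home_goals = c(2, 1, 0, 3),
  away_goals = c(0, 1, 1, 1),
  stringsAsFactors = FALSE
)

teams = sort(unique(c(results$home, results$away)))
n = length(teams)

# Massey matrix M: games played on the diagonal and minus the number
# of meetings between each pair of teams off the diagonal
M = matrix(0, n, n, dimnames = list(teams, teams))

# p: each team's total goal difference
p = setNames(rep(0, n), teams)

for (i in seq_len(nrow(results))) {
  h = results$home[i]
  a = results$away[i]
  margin = results$home_goals[i] - results$away_goals[i]
  M[h, h] = M[h, h] + 1
  M[a, a] = M[a, a] + 1
  M[h, a] = M[h, a] - 1
  M[a, h] = M[a, h] - 1
  p[h] = p[h] + margin
  p[a] = p[a] - margin
}

# M is singular as it stands, so replace the last row with ones and the
# last element of p with zero, which forces the ratings to sum to zero
M[n, ] = 1
p[n] = 0

ratings = setNames(as.vector(solve(M, p)), teams)
print(round(ratings, 2))
```

<p>With these made-up results the ratings come out at roughly 1.13, -0.67 and -0.47 for teams A, B and C, so A would be expected to beat B by around 1.8 goals.</p> <p>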
Let’s take this a step further and define the total goals a team should score in a match as Equation Two below:</p> <p><span class="math">\(y_a = o_a - d_b\)</span></p> <p>where <em>y<sub>a</sub></em> is the number of goals team <em>a</em> is expected to score, <em>o<sub>a</sub></em> is team <em>a</em>’s attack strength and <em>d<sub>b</sub></em> is team <em>b</em>’s defence strength.</p> <p>Extending this further we can say the total goals a given team should score over the course of a season is therefore equal to its attack strength multiplied by the number of matches played minus the sum of the defence strengths of all its opponents. Since we know each team’s overall rating, how many matches they’ve played, how many goals they scored and who their opponents were, we’re getting pretty close to having what we need.</p> <h2>Decompose The Massey Matrix</h2> <p>Next we need to decompose the Massey Matrix we created in Part One into its diagonal and off-diagonal elements to give us two new matrices, G and P, which we use in Equation Three below:</p> <p><span class="math">\((G - P)r = p\)</span></p> <p>where <em>G</em> is a diagonal matrix of the total games each team has played, <em>P</em> is a matrix of the number of times each pair of teams has played each other, <em>r</em> is the vector of the teams’ Massey Ratings and <em>p</em> is a vector of the teams’ goal differentials.</p> <p>From here, Ken Massey uses some clever algebra, based on each team’s overall rating being the sum of its attack and defence ratings, to derive the equivalent of Equation Four below:</p> <p><span class="math">\((G + P)d = Gr - f\)</span></p> <p>where <em>G</em> and <em>P</em> are the same matrices as in Equation Three, <em>d</em> is the vector of defensive ratings and <em>f</em> is a vector of the number of goals each team has scored.</p> <p>If you are interested in finding out more about the mathematics behind this then I heartily recommend taking a look through Ken Massey’s thesis where he explains it in much more detail than I’ve gone into here.</p> <h2>Calculating The Ratings</h2> <p>Finally, we can now solve this linear system to get the defence ratings for each team, with the attack ratings then given by the overall ratings minus the defence ratings.</p> <p><img 
alt="Pelican" src="../../../../images/20141204_def_massey.png" /> <strong>Figure One: Defensive Massey Ratings</strong> </p> <p><img alt="Pelican" src="../../../../images/20141204_off_massey.png" /> <strong>Figure Two: Offensive Massey Ratings</strong> </p> <p>It’s no surprise that Manchester City and Chelsea rate highly for offensive strength but Everton are somewhat surprisingly rated the third best offensive team even though they only rank mid-table in the league. Everton may only have a goal difference of +2 at the moment but they are actually joint third highest goal scorers in the Premier League. They are performing well offensively; it’s their defence that is letting them down and is actually ranked worse than relegation-threatened Burnley’s.</p> <p>QPR also rate pretty highly in terms of attacking strength for a team in the relegation zone. Looking at their results for this season, though, they managed to score two against Manchester City, scored against Chelsea and are one of the few teams to actually get a goal against Southampton, so they are performing well offensively against the league’s stronger teams. Like Everton though, their defence is performing poorly and dragging down their overall performance.</p> <p>What’s that at the bottom of the offensive chart in red? Why, it’s Aston Villa, whose attack is so poor it actually gets a negative rating! I’ve mentioned in my last two articles how Aston Villa’s Pythagorean and Massey Ratings show them to be seriously over-placed in the league and once again here’s another metric showing how poor they are. Bizarrely, Villa are somehow in twelfth place having managed a pitiful eight goals from fourteen matches. 
Although they are mid-table in the league and their defensive rating is pretty good, from an offensive point of view Aston Villa’s numbers suggest they are perhaps rather fortuitous to be so far away from the relegation zone…</p> <h2>Further Improvements</h2> <p>So far the Massey Ratings have considered each match a team plays equally but Ken Massey suggests they can be improved further by weighting matches based on their importance. For example, playing a cup match against a team from a lower division is probably less relevant to calculating the ratings than say a league match against a close rival. By weighting matches appropriately we can reduce the influence less relevant matches have on a team’s ratings and potentially improve their accuracy.</p> <h2>Example Code</h2> <p>If you are interested in having a go with Massey Ratings then I’ve put some example R code on <a href="https://github.com/martineastwood/penalty/tree/master/massey">GitHub</a>. You’ll need to add your own data though as I’ve stripped out the section where it connects to my database for security reasons.</p> <p><br/></p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Peter - December 4, 2014</strong></p> <p>Great read as always.</p> <p>Currently in the process of teaching myself R. Just wondering if you could give me a pointer as I’m really interested in giving this a go! What headers should the data be ordered in? 
Is this all taken from a league table, or from the results csv on the football-data website?</p> <p>Cheers,</p> <p>Peter</p> <div class="hline"></div> <p><strong> Martin Eastwood - December 5, 2014 </strong></p> <p>It was all taken from my PostgreSQL database so you’ll need to make sure your data matches the naming conventions used in the code or change the code to match your data.</p> <div class="hline"></div> <p><strong>Kevin - December 5, 2014</strong></p> <p>Have you thought about improving the ratings by using expected goals, rather than goals, in your matrices?</p> <div class="hline"></div> <p><strong>Martin Eastwood - December 5, 2014</strong></p> <p>Not tried, but it’s an interesting idea!</p> <div class="hline"></div> <p><strong>Peter - December 8, 2014</strong></p> <p>Hi Martin,</p> <p>I have given it a go (through Excel, not R) and while I have taken a different approach, things seem to look fairly consistent regarding the overall ratings. I’ll cautiously refer to it as an Adjusted Massey… I’m thinking decomposing these attack/defense ratings may prove a challenge however. I’m using it in conjunction with Pythagorean Expectation to gauge overall performance, and will have a blog post up fairly soon (with due reference to pena.lt/y/ for lighting the way of course)!</p> <p>Cheers,</p> <p>Peter</p> <div class="hline"></div> <p><strong>Martin Eastwood - December 9, 2014</strong></p> <p>Cool, look forward to reading it Peter!</p> <script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? 
"innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: 'center'," + " displayIndent: '0em'," + " showMathMenu: true," + " tex2jax: { " + " inlineMath: [ ['\\\\(','\\\\)'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" + " } " + "}); "; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); } </script>Martin EastwoodThu, 04 Dec 2014 19:30:00 +0000tag:,2014-12-04:2014/12/04EPLMassey RatingsMassey Ratings For Football Part One/2014/11/27<h2>Introduction</h2> <p>We all know the league table can lie and one of the common causes of this is strength of schedule. Take Southampton, at the time of writing they are currently second in the Premier League twelve matches in yet still haven’t played Chelsea, Manchester City, Manchester United or Arsenal. Without wishing to be dismissive of Southampton, who undoubtedly are a very talented team, there’s a pretty decent chance that they’d currently be lower down the league table had these fixtures come up earlier in the season instead of Leicester, Hull or Aston Villa.</p> <h2>Massey Ratings</h2> <p>So if we can’t rely on the league table to tell us which teams are performing best what do we do? One alternative is to use Massey Ratings. This is a method devised by Ken Massey back in 1997 for his honours thesis that rates teams based on what opposition they’ve played. 
The system was originally designed for American Football but it can be adapted to football fairly trivially.</p> <p>The idea behind Massey Ratings is that they rate teams such that the difference between any two teams’ ratings is equal to the expected margin of victory between them, as shown in Equation One below:</p> <p><span class="math">\(y = r_a - r_b\)</span></p> <p>where <em>y</em> is the margin of victory for a fixture, <em>r<sub>a</sub></em> is the rating of team <em>a</em> and <em>r<sub>b</sub></em> is the rating of team <em>b</em>.</p> <h2>Error</h2> <p>In an ideal world we’d have enough data that we could calculate true ratings for each team but with players moving from one team to another and with football seasons typically lasting just 38 matches we never have sufficient data for that, so we have to settle for approximating ratings based on previous match results. This means we need to modify Equation One to add in an error term to allow us to account for any unexplained variation in the outcome of games (Equation Two below).</p> <p><span class="math">\(y = r_a - r_b + e\)</span></p> <p>where <em>y</em> is the margin of victory, <em>r<sub>a</sub></em> is the rating of team <em>a</em>, <em>r<sub>b</sub></em> is the rating of team <em>b</em> and <em>e</em> is the remaining error in the model.</p> <p>So far so good, but how do we know what <em>r<sub>a</sub></em> and <em>r<sub>b</sub></em> should equal? Well, to start with we want that error term we added into Equation Two to be as small as possible, so we use a technique called least squares, which finds the set of ratings that minimises the sum of the squared errors across the past data we have.</p> <h2>The Matrix</h2> <p>Things get slightly trickier here but let’s say our past data comprises m matches involving n teams. 
We know what the margin of victory was for each match and who won but not the ratings for each team so we have m equations we need to solve to find the n unknown rating values, which we can write as Equation Three below:</p> <p><span class="math">\(y = Xr + e\)</span></p> <p>where <em>y</em> is the vector of margins of victory, <em>r</em> is the vector of ratings we are trying to find, <em>e</em> is the remaining error and <em>X</em> is an m x n sized matrix of coefficients where each row represents a matchup containing a 1 for the winning team and -1 for the losing team. Unfortunately though, this gives us a very <a href="http://en.wikipedia.org/wiki/Sparse_matrix">sparse matrix</a> that is likely to be highly <a href="http://en.wikipedia.org/wiki/Overdetermined_system">over-determined</a>, making it difficult to find a unique solution to the system.</p> <h2>The Massey Matrix</h2> <p>Thankfully Massey discovered that you can modify the matrix such that the diagonal elements equal the number of games each team has played and the off-diagonal elements equal the negation of the number of matchups teams have played against each other, giving Equation Four below:</p> <p><span class="math">\(p = Mr\)</span></p> <p>where <em>M</em> is the modified Massey Matrix, <em>p</em> is a vector of the score differentials and <em>r</em> is the vector of unknown ratings.</p> <p>We are getting closer now but the matrix still doesn’t have a unique set of ratings so Massey modifies it further, setting every element of the bottom row to one and the corresponding element of p to zero. 
This constraint creates a <a href="http://en.wikipedia.org/wiki/Rank_%28linear_algebra%29">full rank matrix</a> for us and forces the ratings to sum to zero.</p> <h2>Massey Ratings For The English Premier League</h2> <p>Finally, using some linear algebra we can solve the system and get the ratings for each team, shown below in Figure One.</p> <p><img alt="Pelican" src="../../../../images/2014_11_27_massey.png" /> <strong>Figure One: EPL Massey Ratings</strong></p> <p>It’s no surprise that Chelsea are ranked far ahead of anybody else in first place but Southampton do actually get ranked in second place, showing that even accounting for their easier schedule to date they deserve to be second in the league at the moment.</p> <p>Interestingly, Swansea get ranked fourth rather than their current position of seventh in the league. Swansea have already played five of the six teams above them, so their Massey Rating shows they are performing better than their raw points tally would suggest.</p> <p>At the bottom of the table it’s not looking good for Aston Villa. I showed in my <a href="http://pena.lt/y/2014/11/04/english-premier-league-pythagorean/">last article</a> how their Pythagorean suggested they were over-performing to be even as high as they are, and this is now backed up by their Massey Rating ranking them in one of the relegation spots.</p> <h2>Next Steps</h2> <p>In my next article I’ll show how we can take Massey Ratings a step further and decompose teams’ overall ratings into separate ratings for both attack and defence. 
I’ll also add some example code too so you can have a go calculating them yourself.</p> <p>In the meantime, if you are interested in finding out more about the maths behind Massey Ratings then take a look at <a href="http://pena.lt/y/2014/11/04/english-premier-league-pythagorean/">Ken Massey’s honours thesis</a> which goes into the theory in much more depth than my brief overview here.</p> <script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? "innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: 'center'," + " displayIndent: '0em'," + " showMathMenu: true," + " tex2jax: { " + " inlineMath: [ ['\\\\(','\\\\)'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! 
important'} }" + " } " + "}); "; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); } </script>Martin EastwoodThu, 27 Nov 2014 19:30:00 +0000tag:,2014-11-27:2014/11/27EPLMassey RatingsEnglish Premier League Pythagorean/2014/11/04<p>I’ve not posted this for a while so here is the latest Pythagorean for the English Premier League.</p> <h2>Football Pythagorean</h2> <p>If you haven’t seen this before, it’s an adaptation of the baseball Pythagorean that allows you to estimate how many points a team would be expected to achieve on average based on the number of goals they have scored and conceded. It’s a simple equation but it is surprisingly accurate.</p> <p>Take a look at my <a href="http://pena.lt/y/pythagorean.html">previous blog posts</a> if you want to find out more about the theory behind it, how it was tested and what the equation itself actually looks like.</p> <h2>The Season So Far</h2> <p>Figure One below shows the difference between how many points teams have achieved in the English Premier League and how many points their Pythagorean record predicts they should have.</p> <p><img alt="Pelican" src="../../../../images/2014_11_04_pythagorean.png" /> <strong>Figure One: EPL Pythagorean Results So Far</strong></p> <p>Worryingly for Aston Villa they’ve currently got five points more than would be expected based on their goal record. These points were all from their crazy start to the season in which they were undefeated in their first four matches, somehow coming out with ten points even though they scored just four goals. However, they appear to have regressed somewhat since then, managing just a single goal and zero points from their last six matches. It’s not looking good…</p> <p>Chelsea are also up five points more than expected but in contrast things could not be looking better. They are playing well and gaining more points than their goal scoring record suggests. 
All the signs of potential champions – a good team that is exceeding its expected points. If you want to win the league you have to be good and lucky!</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Anton Bashtavy - December 22, 2014</strong></p> <p>I’m not sure about Aston Villa – they appear to be unlucky with their shots/goals ratio. They score from every 15th shot on average (with 10 being the league average), and it’s not because they shoot a lot.</p> <p>If both “clutch luck” and shots/goals ratio revert to the mean, it’ll be more or less the same in terms of points per game.</p>Martin EastwoodTue, 04 Nov 2014 19:30:00 +0000tag:,2014-11-04:2014/11/04EPLPythagoreanPredicting Football Using R/2014/11/02<p>I recently gave a presentation to the Manchester R Users' Group discussing how to predict football results using R. My presentation gave a brief overview of how to create a Poisson model in R and apply the Dixon and Coles adjustment to it to account for dependence in the scores.</p> <p>The slides are below for anybody interested and contain enough example R code to get you started. 
Unfortunately there are no slide notes, but hopefully the slides should be descriptive enough to get you going!</p> <p>Example code from the presentation can be found at my <strong><a href="https://github.com/martineastwood/penalty/tree/master/poisson_example">GitHub account</a></strong></p> <iframe src="//www.slideshare.net/slideshow/embed_code/41024430" width="800" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen> </iframe> <p><div style="margin-bottom:5px"> <strong> <a href="//www.slideshare.net/MartinEastwood/predicting-football-using-r" title="Predicting Football Using R" target="_blank">Predicting Football Using R</a> </strong> from <strong><a href="//www.slideshare.net/MartinEastwood" target="_blank">Martin Eastwood</a></strong> </div></p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Anonymous - November 3, 2014</strong></p> <p>Can you explain the +/- Dixon/Coles adjustment?</p> <div class="hline"></div> <p><strong>Martin Eastwood - November 3, 2014</strong></p> <p>Sure, if you are interested in understanding the theory behind it better then I recommend reading Dixon and Coles’ paper where they propose their adjustment to account for dependency between the scores – http://www.math.ku.dk/~rolf/teaching/thesis/DixonColes.pdf</p> <div class="hline"></div> <p><strong>Peter - November 4, 2014</strong></p> <p>Is it possible for you to share the R code, how you implemented this adjustment in the model?</p> <div class="hline"></div> <p><strong>Martin Eastwood - November 4, 2014</strong></p> <p>Hi Peter – I’m not planning on adding the Dixon and Coles adjustment to the code as it was intended just as a simple demonstration rather than a full model. 
The adjustment requires carrying out an optimisation to estimate rho, which in turn requires a cost function etc so it increases the complexity of the example considerably.</p> <div class="hline"></div> <p><strong>Jonas - November 3, 2014</strong></p> <p>Do you apply the Dixon &amp; Coles adjustment to the probabilities you got from the independent goals model? Do you estimate the rho parameter independently of the other parameters then?</p> <div class="hline"></div> <p><strong>Martin Eastwood - November 3, 2014</strong></p> <p>Hi Jonas, yes that’s right. You’ll need to run an optimisation to get rho and then use that to modify the probabilities from the Poisson model.</p> <div class="hline"></div> <p><strong>Jonas - November 3, 2014</strong></p> <p>That’s a neat trick, probably much easier than to fit the complete Dixon &amp; Coles model :)</p> <div class="hline"></div> <p><strong>Seth Dobson - November 4, 2014</strong></p> <p>Hi Martin! Thanks for posting this. Looking forward to trying it out on the SPFL.</p> <p>Have you ever tried the fbRanks package in R?</p> <div class="hline"></div> <p><strong>Martin Eastwood - November 4, 2014</strong></p> <p>I didn’t even know it existed, will take a look!</p>Martin EastwoodSun, 02 Nov 2014 19:30:00 +0000tag:,2014-11-02:2014/11/02PoissonPredictionRExpected Goals: Foot Shots Versus Headers/2014/08/28<h2>Introduction</h2> <p>My last article on expected goals introduced the concept of using exponential decay to estimate the probability of scoring based on the shooter’s distance from the goal. The article received lots of feedback (thanks everyone!!), with a couple of common comments standing out that I wanted to address.</p> <h2>Simplifying The Model</h2> <p>One common theme was whether the model was at risk of over-fitting and this is certainly something I was concerned about myself. 
In fact, I have since simplified the model to the equation below to help minimise this risk:</p> <p><span class="math">\(\text{expG} = \exp(-\text{distance}/a)\)</span></p> <p><strong>Figure One: Simplified Expected Goals Equation</strong></p> <p>As well as reducing the complexity of the model and making it easier to calculate the expected goals, the new equation has fewer parameters (just the decay constant <em>a</em>) so the potential for overfitting is lower. The correlation between actual and expected goals has fallen slightly from 0.98 to 0.97 but the advantages of the simpler equation far outweigh such a minimal change.</p> <h2>Headers Versus Foot Shots</h2> <p>Another common question was whether it was important to split out headers and foot shots into separate models, as the previous articles have so far ignored headers due to lack of data.</p> <p>To investigate this I have been busy all summer collecting more shot data. I’m up to 45,000 shots in total now, including around 7,500 headers, so I’m at the point where I’m happy to start the preliminary work comparing foot and headed shots although I certainly want more headers before drawing any definite conclusions.</p> <p>I’ve run through all the curve fitting again for both headers and foot shots and plotted the resulting probability curves in Figure Two below.</p> <p><img alt="Pelican" src="../../../../images/20140828_headers_versus_shots_expg.png" /></p> <p><strong>Figure Two: Expected Goals: Shots Versus Headers</strong></p> <p>As you can see, headers have a noticeably lower chance of leading to a goal. The gap between head and foot shots appears largest around the ten metre mark, where foot shots have pretty much twice the probability of scoring. 
By 22 metres the chance of scoring from a header is virtually zero, while foot shots don’t reach this level until around 40 metres out.</p> <h2>Conclusions</h2> <p>But is this difference significant and do we actually need to bother creating separate expected goals models for headers and foot shots?</p> <p>Well, if we compare the two probability curves against each other then the p value comes out at 0.064. Typically we take p values of 0.05 or lower to signify significance so by that measure the difference between the two is not statistically significant.</p> <p>However, p values should never be about some absolute cut off where &lt;= 0.05 equals significance and everything else can just be ignored.</p> <p>Having a value close to significance is suggestive that there may be a real difference there, especially when there is still a limited sample size for headers, so it’s certainly possible that headers and foot shots will warrant separate models. Luckily with the current equation this is really simple to do as we just need to alter the value of <em>a</em> as shown below in the appendix. 
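</p> <p>To make that concrete, here’s a minimal sketch in R of how the two models differ only in the value of <em>a</em>. The function and variable names are made up for illustration, and the decay constants are the ones listed in the appendix.</p>

```r
# Decay constants from the appendix: one per shot type
decay = c(header = 4.4, foot = 7.1)

# x: distance from the goal in metres along the x axis
# y: distance from the centre of the goal in metres along the y axis
# type: either "header" or "foot"
expected_goals = function(x, y, type = "foot") {
  distance = sqrt(x^2 + y^2)
  exp(-distance / decay[[type]])
}

# Same location (the penalty spot), different shot type
expected_goals(11, 0, "header")  # roughly 0.08
expected_goals(11, 0, "foot")    # roughly 0.21
```

<p>Switching between the two models is just a case of looking up a different constant, so nothing else in the calculation needs to change.</p> <p>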
This is an area I’ll be exploring in more detail as I add more headers to my database.</p> <h2>Appendix: Using the Expected Goals Model</h2> <p>To use the expected goals model you just need two numbers:</p> <p>x = distance from goal in metres along x axis</p> <p>y = distance from centre of goal in metres along y axis</p> <p>These can then be used to calculate the total distance the shot is taken from:</p> <p><span class="math">\(\text{distance} = \sqrt{x^2 + y^2}\)</span></p> <p>The expected goals for the shot is then just:</p> <p><span class="math">\(\text{expected goals} = \exp(-\text{distance}/a)\)</span></p> <p>where a = 4.4 for headers and 7.1 for foot shots</p> <h2>Example</h2> <p>Here’s an example for a player taking a header from the penalty spot.</p> <p>x = 11 as penalty spots are roughly 11 metres from the goal (equal to 12 yards)</p> <p>y = 0 as penalty spots should be level with the centre of the goal</p> <p><span class="math">\(\text{distance} = \sqrt{11^2 + 0^2} = 11\)</span></p> <p><span class="math">\(\text{expected goals} = \exp(-11/4.4) \approx 0.08\)</span></p> <p>So on average, a header from the penalty spot would be worth around 0.08 goals.</p> <p>Easy, just don’t forget you need to use negative distance inside the exponential!</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Antony Lee - September 1, 2014</strong></p> <p>think the model fitted is much better now, as the previous one with a constant implied that there was a non-zero probability of scoring from 100m!</p> <p>did you manage yet, as you now have 45000 data points, to check the stationarity in the underlying process by comparing season-on-season fitting parameters?</p> <div class="hline"></div> <p><strong>Martin Eastwood - September 2, 2014</strong></p> <p>Thanks, I’ll be taking a look at that soon!</p> <div class="hline"></div> <p><strong>Matthew Langston - September 4, 2014</strong></p> <p>Do you use any particular software or program to calculate the xy co-ordinates from the Squawka stats page?</p> <div 
class="hline"></div> <p><strong>Martin Eastwood - September 4, 2014</strong></p> <p>All the processing of the data and model fitting etc was done using R and SQL</p> <div class="hline"></div> <p><strong>OI - September 5, 2014</strong></p> <p>Firstly, thank you very much for continuously sharing your model to the readers. This is particularly valuable for other bloggers like me, and I’ll probably publish some Expg results on my German blog linked above (if you permit).</p> <p>Secondly, I think that the separation of headers is a very large step forward. I have to repeat my thanks. As you’ve examined yourself, there is a (nearly) significant difference between headers and foot shots on the long term, and certainly the difference is even more significant in smaller sample sizes (for single chances). Having tried out the older version, I had the feeling that the ExpG values are generally too low. The results from the new formula fit my subjective impressions much better.</p> <p>Thirdly, I still see a possibility to improve your model (although this might sound a bit ridiculous with R2=0.97). In my opinion, the angle is a bit underrepresented. I know you include “dy”, but imagine a foot shot from dx=1 and dy=6.5. The total distance is 6.58 and the ExpG value 0.396. The angle to the middle of the goal of 9° is very sharp. I can’t imagine that players really convert this chance in 39.6 of 100 tries. As R2=0.97 for distance alone proves, the overall difference might not be so big, but similar to head/feet there can be a big difference for a single chance.</p> <p>What about the angle of view (see here: http://blog.kickdex.com/post/52303980749/angle-of-view)? The angle of view for the example shot is 14°, whereas it is 36,7° for a shot from the penalty spot (distance: 6,58m vs. 11m, angle of view 14° vs. 36,7°!). I deduced a formula to compute the angle of view from dy and dx. 
Unfortunately, it is clearly more complicated than the simple Pythagoras, and I don’t know how to paste a screenshot of it in the comment section. The mathematical text by itself would be unreadable. Are you interested in the angle of view? If yes, we should find a possibility to share the formula, if not, I’d completely understand that you prefer simplicity, especially with the simple version being very accurate (I mainly ask because I myself want to know if the angle of view is more accurate than distance alone ;).)</p> <div class="hline"></div> <p><strong>Martin Eastwood - September 5, 2014</strong></p> <p>Thanks for the message, yes you are welcome to use the ExpG results on your blog but I would appreciate it if you acknowledge me and provide a link back to my site :)</p> <p>Also, thanks for the link about the angle of view, I have not seen that before and it certainly looks interesting. I’ll add it to my todo list to investigate further when I get some free time and will let you know how I get on!</p> <div class="hline"></div> <p><strong>Jamie - September 7, 2014</strong></p> <p>I don’t know how you collected the data (from squawka?) but it can see how it might be possible to extract the location &amp; result of each shot from squawka. I can’t see how to distinguish between shots &amp; headers though.</p> <p>Did you have to collect them separately or where you able to filter the data later and do you have any suggestions for collecting such data?</p> <p>Also, I don’t know how much you use/keep track of your fixture predictions but there seems to be some ‘errors’. For example: Metz v Nantes has Predicted Goals = 0.001 &amp; 0.000 respectively.</p> <p>It is a shame you aren’t able to post more often.</p> <div class="hline"></div> <p><strong>Gareth Owen-Smith - November 12, 2014</strong></p> <p>Hi – really interesting methodology, thanks for sharing! 
I have had my own go at scraping shot data off squawka using selenium webdriver in python and trying to get a model based on both x-distance and y-distance, based on your approach (using R, which I’m happy to share, if you want?). I haven’t separated by foot shots or headers, but that should be easy enough to do later. My expected goals model based on x, y coordinates (in yards), is: xG = exp(-x/9.67)*exp(-y/11.0)</p> <div class="hline"></div> <p><strong>Martin Eastwood - November 12, 2014</strong></p> <p>Looks interesting Gareth! Definitely take a look at splitting out the headers / foot shots though as I expect you’ll see a difference in the model coefficients between the two.</p> <script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? "innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: 'center'," + " displayIndent: '0em'," + " showMathMenu: true," + " tex2jax: { " + " inlineMath: [ ['\\\\(','\\\\)'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! 
important'} }" + " } " + "}); "; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); } </script>Martin EastwoodThu, 28 Aug 2014 19:30:00 +0100tag:,2014-08-28:2014/08/28Expected GoalsMathematically Optimising Your Fantasy Football Team/2014/07/24<h2>Introduction</h2> <p>The <a href="http://fantasy.premierleague.com/">Premier League Fantasy Football</a> is back ready for the new season so I thought I’d run through an example of how <a href="http://en.wikipedia.org/wiki/Linear_programming">linear programming</a> can help you select your team. If you haven’t come across linear programming before, it’s a mathematical optimisation technique that can be used to maximise the total number of points your team is worth within a set of constraints, e.g. staying within budget and not signing too many players from the same team.</p> <h2>Collecting The Data</h2> <p>The first thing we are going to need to do is scrape some data to optimise our team with so let’s fire up <a href="http://www.r-project.org/">R</a>. We are going to need the names of all the players that are available, what team they play for, how much they cost to sign and most importantly how many points they are worth. Conveniently, we can exploit the structure of the Premier League’s website to get the data and use it as a pseudo <a href="http://en.wikipedia.org/wiki/Application_programming_interface">API</a>.</p> <p><strong>DISCLAIMER:</strong> there is a fine line between scraping someone’s web site and creating a <a href="http://en.wikipedia.org/wiki/Denial-of-service_attack">denial-of-service attack</a> so make sure you spread out your calls to the website. Trying to scrape all the data in quick succession can put unnecessary strain on the site’s servers. 
If you scrape somebody’s data please ensure you do it in a way that does not impact the service they are providing!</p> <div class="highlight"><pre><span class="err">#</span><span class="nx">load</span> <span class="nx">libraries</span>
<span class="nx">library</span><span class="p">(</span><span class="nx">lpSolve</span><span class="p">)</span>
<span class="nx">library</span><span class="p">(</span><span class="nx">stringr</span><span class="p">)</span>
<span class="nx">library</span><span class="p">(</span><span class="nx">RCurl</span><span class="p">)</span>
<span class="nx">library</span><span class="p">(</span><span class="nx">jsonlite</span><span class="p">)</span>
<span class="nx">library</span><span class="p">(</span><span class="nx">plyr</span><span class="p">)</span>

<span class="err">#</span> <span class="nx">scrape</span> <span class="nx">the</span> <span class="nx">data</span>
<span class="nx">df</span> <span class="o">=</span> <span class="nx">ldply</span><span class="p">(</span><span class="mi">1</span><span class="o">:</span><span class="mi">521</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">x</span><span class="p">){</span>
    <span class="err">#</span> <span class="nx">Scrape</span> <span class="nx">responsibly</span> <span class="nx">kids</span><span class="p">,</span> <span class="nx">we</span> <span class="nx">don</span><span class="s1">&#39;t want to ddos</span>
    <span class="s1"># the Fantasy Premier League&#39;</span><span class="nx">s</span> <span class="nx">website</span>
    <span class="nx">Sys</span><span class="p">.</span><span class="nx">sleep</span><span class="p">(</span><span class="mf">2.5</span><span class="p">)</span>
    <span class="nx">url</span> <span class="o">=</span> <span class="nx">sprintf</span><span class="p">(</span><span class="s2">&quot;http://fantasy.premierleague.com/web/api/elements/%s/?format=json&quot;</span><span class="p">,</span> <span class="nx">x</span><span class="p">)</span>
    <span class="nx">json</span> <span class="o">=</span> <span class="nx">fromJSON</span><span class="p">(</span><span class="nx">getURL</span><span class="p">(</span><span class="nx">url</span><span class="p">))</span>
    <span class="nx">json$now_cost</span> <span class="o">=</span> <span class="nx">json$now_cost</span> <span class="o">/</span> <span class="mi">10</span>
    <span class="nx">data</span><span class="p">.</span><span class="nx">frame</span><span class="p">(</span><span class="nx">json</span><span class="cp">[</span><span class="nx">names</span><span class="p">(</span><span class="nx">json</span><span class="p">)</span> <span class="o">%</span><span class="k">in</span><span class="o">%</span> <span class="nx">c</span><span class="p">(</span><span class="s1">&#39;web_name&#39;</span><span class="p">,</span> <span class="s1">&#39;team_name&#39;</span><span class="p">,</span> <span class="s1">&#39;type_name&#39;</span><span class="p">,</span> <span class="s1">&#39;now_cost&#39;</span><span class="p">,</span> <span class="s1">&#39;total_points&#39;</span><span class="p">)</span><span class="cp">]</span><span class="p">)</span>
<span class="p">})</span>
</pre></div> <h2>Constraints</h2> <p>Now we have the data we need to think about the constraints we will have to build into the linear system. 
For example, we can only spend a maximum of £100 million, we cannot have more than three players from the same team and are restricted to two goalkeepers, five defenders, five midfielders and three forwards.</p> <div class="highlight"><pre><span class="c1">#Create the constraints</span>
num_gk <span class="o">=</span> <span class="m">2</span>
num_def <span class="o">=</span> <span class="m">5</span>
num_mid <span class="o">=</span> <span class="m">5</span>
num_fwd <span class="o">=</span> <span class="m">3</span>
max_cost <span class="o">=</span> <span class="m">100</span>

<span class="c1"># Create vectors to constrain by position</span>
df<span class="o">$</span>Goalkeeper <span class="o">=</span> ifelse<span class="p">(</span>df<span class="o">$</span>type_name <span class="o">==</span> <span class="s">&quot;Goalkeeper&quot;</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">0</span><span class="p">)</span>
df<span class="o">$</span>Defender <span class="o">=</span> ifelse<span class="p">(</span>df<span class="o">$</span>type_name <span class="o">==</span> <span class="s">&quot;Defender&quot;</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">0</span><span class="p">)</span>
df<span class="o">$</span>Midfielder <span class="o">=</span> ifelse<span class="p">(</span>df<span class="o">$</span>type_name <span class="o">==</span> <span class="s">&quot;Midfielder&quot;</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">0</span><span class="p">)</span>
df<span class="o">$</span>Forward <span class="o">=</span> ifelse<span class="p">(</span>df<span class="o">$</span>type_name <span class="o">==</span> <span class="s">&quot;Forward&quot;</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">0</span><span class="p">)</span>

<span class="c1"># Create vector to constrain by max number of players allowed per team</span>
team_constraint <span class="o">=</span> unlist<span class="p">(</span>lapply<span class="p">(</span>unique<span class="p">(</span>df<span class="o">$</span>team_name<span class="p">),</span> <span class="kr">function</span><span class="p">(</span>x<span class="p">,</span> df<span class="p">){</span>
    ifelse<span class="p">(</span>df<span class="o">$</span>team_name<span class="o">==</span>x<span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">0</span><span class="p">)</span>
<span class="p">},</span> df<span class="o">=</span>df<span class="p">))</span>

<span class="c1"># next we need the constraint directions</span>
const_dir <span class="o">&lt;-</span> c<span class="p">(</span><span class="s">&quot;=&quot;</span><span class="p">,</span> <span class="s">&quot;=&quot;</span><span class="p">,</span> <span class="s">&quot;=&quot;</span><span class="p">,</span> <span class="s">&quot;=&quot;</span><span class="p">,</span> rep<span class="p">(</span><span class="s">&quot;&lt;=&quot;</span><span class="p">,</span> <span class="m">21</span><span class="p">))</span>
</pre></div> <h2>The Objective</h2> <p>We also need to create the vector defining our objective, which is to maximise the number of points the team is worth within the constraints we are setting.</p> <div class="highlight"><pre><span class="cp"># The vector to optimize against</span>
<span class="n">objective</span> <span class="o">=</span> <span class="n">df</span><span class="err">$</span><span class="n">total_points</span>
</pre></div> <h2>Solving The Matrix</h2> <p>Finally, we put all the constraints into a matrix and let R solve the linear system to create our mathematically optimised team selection.</p> <div class="highlight"><pre><span class="cp"># Put the complete matrix together</span>
<span class="n">const_mat</span> <span class="o">=</span> <span class="n">matrix</span><span class="p">(</span><span class="n">c</span><span class="p">(</span><span class="n">df</span><span class="err">$</span><span class="n">Goalkeeper</span><span class="p">,</span> <span class="n">df</span><span class="err">$</span><span class="n">Defender</span><span class="p">,</span> <span class="n">df</span><span class="err">$</span><span class="n">Midfielder</span><span class="p">,</span> <span class="n">df</span><span class="err">$</span><span class="n">Forward</span><span class="p">,</span> <span class="n">df</span><span class="err">$</span><span class="n">now_cost</span><span class="p">,</span> <span class="n">team_constraint</span><span class="p">),</span> <span class="n">nrow</span><span class="o">=</span><span class="p">(</span><span class="mi">5</span> <span class="o">+</span> <span class="n">length</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="n">df</span><span class="err">$</span><span class="n">team_name</span><span class="p">))),</span> <span class="n">byrow</span><span class="o">=</span><span class="n">TRUE</span><span class="p">)</span>
<span class="n">const_rhs</span> <span class="o">=</span> <span class="n">c</span><span class="p">(</span><span class="n">num_gk</span><span class="p">,</span> <span class="n">num_def</span><span class="p">,</span> <span class="n">num_mid</span><span class="p">,</span> <span class="n">num_fwd</span><span class="p">,</span> <span class="n">max_cost</span><span class="p">,</span> <span class="n">rep</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">20</span><span class="p">))</span>

<span class="cp"># And solve the linear system</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">lp</span> <span class="p">(</span><span class="s">&quot;max&quot;</span><span class="p">,</span> <span class="n">objective</span><span class="p">,</span> <span class="n">const_mat</span><span class="p">,</span> <span class="n">const_dir</span><span class="p">,</span> <span class="n">const_rhs</span><span class="p">,</span> <span class="n">all</span><span class="p">.</span><span class="n">bin</span><span class="o">=</span><span class="n">TRUE</span><span class="p">,</span> <span class="n">all</span><span class="p">.</span><span class="kt">int</span><span class="o">=</span><span class="n">TRUE</span><span class="p">)</span>
<span class="n">print</span><span class="p">(</span><span class="n">arrange</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="n">x</span><span class="err">$</span><span class="n">solution</span><span class="o">==</span><span class="mi">1</span><span class="p">),],</span> <span class="n">desc</span><span class="p">(</span><span class="n">Goalkeeper</span><span class="p">),</span> <span class="n">desc</span><span class="p">(</span><span class="n">Defender</span><span class="p">),</span> <span class="n">desc</span><span class="p">(</span><span class="n">Midfielder</span><span class="p">),</span> <span class="n">desc</span><span class="p">(</span><span class="n">Forward</span><span class="p">),</span> <span class="n">desc</span><span class="p">(</span><span class="n">total_points</span><span class="p">)))</span>
</pre></div> <h2>The Results</h2> <p>The team the linear solver selected is shown in the table below – this is the team with the highest possible number of points that can be achieved using the constraints we are working within.</p> <table class="table"> <tbody> <tr> <td><strong>Position</strong></td> <td><strong>Team</strong></td> <td><strong>Points</strong></td> <td><strong>Name</strong></td> <td><strong>Cost (£m)</strong></td> </tr> <tr> <td>Goalkeeper</td> <td>Everton</td> <td>160</td> <td>Howard</td> <td>5.5</td> </tr> <tr> <td>Goalkeeper</td> <td>Crystal Palace</td> <td>144</td> <td>Speroni</td> <td>5</td> </tr> <tr> <td>Defender</td> <td>Everton</td> <td>180</td> <td>Coleman</td> <td>7</td> </tr> <tr> 
<td>Defender</td> <td>Chelsea</td> <td>172</td> <td>Terry</td> <td>6.5</td> </tr> <tr> <td>Defender</td> <td>Arsenal</td> <td>157</td> <td>Mertesacker</td> <td>6</td> </tr> <tr> <td>Defender</td> <td>Arsenal</td> <td>155</td> <td>Koscielny</td> <td>6</td> </tr> <tr> <td>Defender</td> <td>Southampton</td> <td>149</td> <td>Fonte</td> <td>5.5</td> </tr> <tr> <td>Midfielder</td> <td>Man City</td> <td>241</td> <td>Yaya Touré</td> <td>11</td> </tr> <tr> <td>Midfielder</td> <td>Liverpool</td> <td>205</td> <td>Gerrard</td> <td>9</td> </tr> <tr> <td>Midfielder</td> <td>Crystal Palace</td> <td>131</td> <td>Puncheon</td> <td>6</td> </tr> <tr> <td>Midfielder</td> <td>Stoke</td> <td>126</td> <td>Sidwell</td> <td>5.5</td> </tr> <tr> <td>Midfielder</td> <td>West Ham</td> <td>125</td> <td>Noble</td> <td>5.5</td> </tr> <tr> <td>Forward</td> <td>Arsenal</td> <td>187</td> <td>Giroud</td> <td>8.5</td> </tr> <tr> <td>Forward</td> <td>Liverpool</td> <td>179</td> <td>Lambert</td> <td>7.5</td> </tr> <tr> <td>Forward</td> <td>Aston Villa</td> <td>106</td> <td>Weimann</td> <td>5.5</td> </tr> <tr> <td></td> </tr> </tbody> </table> <h2>Limitations</h2> <p>Now, before the internet gets grumpy and starts trolling me (whenever I’ve mentioned using mathematics for fantasy football people seem to get very irate) there are a few obvious limitations worth pointing out. First of all the new football season hasn’t started so I’m using the points totals from last season. This means all the players at the promoted teams and any new signings to the Premier League will have zero points and so will not get selected. I’m planning on running this script regularly throughout the coming season though to help guide my transfers, so as these players gain points they will start to get selected by the linear solver if they perform well enough.</p> <p>Also, we’ve set the constraints to optimise for the best squad. 
You may want to spend all your money on the best possible first eleven and go for budget substitutes instead. For example, the table below shows what happens if you optimise for eleven players playing 1-3-4-3 at a total price of £82 million (this leaves enough to buy four substitutes at £4.5 million each).</p> <table class="table"> <tbody> <tr> <td><strong>Position</strong></td> <td><strong>Team</strong></td> <td><strong>Points</strong></td> <td><strong>Name</strong></td> <td><strong>Cost (£m)</strong></td> </tr> <tr> <td>Goalkeeper</td> <td>Everton</td> <td>160</td> <td>Howard</td> <td>5.5</td> </tr> <tr> <td>Defender</td> <td>Everton</td> <td>180</td> <td>Coleman</td> <td>7</td> </tr> <tr> <td>Defender</td> <td>Chelsea</td> <td>172</td> <td>Terry</td> <td>6.5</td> </tr> <tr> <td>Defender</td> <td>Southampton</td> <td>149</td> <td>Fonte</td> <td>5.5</td> </tr> <tr> <td>Midfielder</td> <td>Man City</td> <td>241</td> <td>Yaya Touré</td> <td>11</td> </tr> <tr> <td>Midfielder</td> <td>Liverpool</td> <td>205</td> <td>Gerrard</td> <td>9</td> </tr> <tr> <td>Midfielder</td> <td>Liverpool</td> <td>178</td> <td>Lallana</td> <td>8.5</td> </tr> <tr> <td>Midfielder</td> <td>Stoke</td> <td>126</td> <td>Sidwell</td> <td>5.5</td> </tr> <tr> <td>Forward</td> <td>Arsenal</td> <td>187</td> <td>Giroud</td> <td>8.5</td> </tr> <tr> <td>Forward</td> <td>Liverpool</td> <td>179</td> <td>Lambert</td> <td>7.5</td> </tr> <tr> <td>Forward</td> <td>Southampton</td> <td>152</td> <td>Rodriguez</td> <td>7.5</td> </tr> <tr> <td></td> </tr> </tbody> </table> <p>Interestingly (for me at least), the cost of the players is fairly evenly spread across the team. Typically, when I select my fantasy football teams I tend to splash the cash on the big name strikers and then go for cheap defenders. However, based on these results that’s looking like a bad decision, so this season I’m going to follow the data and actually sign some decent defenders. 
Wish me luck…</p> <h2>Appendix</h2> <p>All code is available on <a href="https://github.com/martineastwood/penalty/tree/master/fantasy_footballl_optimiser">GitHub</a></p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong> Peer - July 25, 2014 </strong></p> <p>Hi. Interesting article.</p> <p>I would be interested to know if the code can be utilised to pick the best available player with the added variables of points achieved per minutes played?</p> <div class="hline"></div> <p><strong> Martin Eastwood - July 25, 2014 </strong></p> <p>Sure, it’s just a question of constructing the necessary constraints for the solver to optimise against</p> <div class="hline"></div> <p><strong>Neal Thurman - July 26, 2014 </strong></p> <p>The theory is fine but for what you’re recommending to have any utility it needs to account for changes in situation from last season. Lambert isn’t likely to start now that he’s at Liverpool, Rodriguez is currently recovering from an injury, Sidwell isn’t likely to be the focal point at Stoke that he was at Fulham. What would be interesting is to see what the top 20 or 50 configurations look like and to see if the “spread the money around” strategy is dominant to buying a few very expensive players and filling in with bargains or if it just happens that the best outcome last season was spreading it around but there was a “galactico” strategy that was almost as good. Regardless, interesting food for thought. Cheers – Neal</p> <div class="hline"></div> <p><strong>EV - July 29, 2014</strong></p> <p>Good work. Will follow this. Possibly try different objective functions, like points/min played? 
and maybe adjust for this season’s schedule?</p> <p>This approach has real potential for greatness.</p> <div class="hline"></div> <p><strong>jester112358 - July 30, 2014</strong></p> <p>Loved the post.</p> <p>Didn’t read the code, but in the best 11 scenario: have you set the team to play 3-4-3 or why the code leaves out Sagna who has more points with the same price than Sidwell?</p> <div class="hline"></div> <p><strong>Martin Eastwood - July 30, 2014</strong></p> <p>Yes, I set the constraints to use a 3-4-3 formation.</p> <div class="hline"></div> <p><strong>Luis Pacheco - August 7, 2014</strong></p> <p>Thank you so much for the script! I was doing this with Excel and it is not as easy of just clicking enter.</p> <p>I found that with 85 budget the best starting eleven for points was the 5-3-2. Second 4-4-2. I always played the 3-4-3. Now, I’m going to change it!</p> <div class="hline"></div> <p><strong>Martin Eastwood - August 8, 2014</strong></p> <p>That’s really interesting, looks like my trusty 3-4-3 I’ve been using for the past few years may not be the optimal formation!</p> <div class="hline"></div> <p><strong>Shalin - August 8, 2014</strong></p> <p>Hi Martin,</p> <p>As someone who has a beginner knowledge of analytics and related tools, I had a few queries related to obtaining the data required for this solver. It seems you directly scrap the data into R from the FPL API. Are there any other reliable data sources available that you would recommend?</p> <p>I believe it would also be possible to run such a solver in Excel. Your thoughts?</p> <div class="hline"></div> <p><strong>Martin Eastwood - August 8, 2014</strong></p> <p>If you want player data then Squawka and WhoScored are probably your best places to look. 
Yes, Excel and Open Office both have solvers built in so I expect it’s possible to do something similar with them.</p> <div class="hline"></div> <p><strong>Pete - August 15, 2014</strong></p> <p>Interesting – I always wondered what would the perfect optimisation of value would look like – was thinking of trying it out but rodriguez, lallana are out and coleman might be. Give it another go but without injured players – quick! Haha</p> <div class="hline"></div> <p><strong>Brendan - August 25, 2014</strong></p> <p>This is very cool</p> <p>Is it possible to make the objective function take into account the presence of a captain? I can’t think of a way of doing this that keeps it a linear constraint</p> <div class="hline"></div> <p><strong>Martin Eastwood - August 25, 2014</strong></p> <p>Thanks, it’s something I’d like to include but I’ve not come up with a suitable way to add it in yet.</p> <div class="hline"></div> <p><strong>marko - December 17, 2014</strong></p> <p>teams of 12 players with requirement that one player is played twice? maybe…</p> <p>any chance of running the algorythm with points so far this season and current values?</p> <div class="hline"></div> <p><strong>Martin Eastwood - December 17, 2014</strong></p> <p>Good idea, will post a follow up when I get chance with the updated team recommendation. 
Thanks!</p>Martin EastwoodThu, 24 Jul 2014 19:30:00 +0100tag:,2014-07-24:2014/07/24Fantasy FootballExpected Goals And Exponential Decay/2014/04/22<h2>Introduction</h2> <p>In my <a href="http://pena.lt/y/2014/04/16/expected-goals-the-y-axis/">last article</a> on expected goals I showed how to incorporate the distance from goal along the Y axis into the expected goal model using <a href="http://en.wikipedia.org/wiki/Pythagorean_theorem">Pythagoras' Theorem</a>.</p> <p>This all worked pretty well, giving us an r squared value of 0.95. However, while the r squared value was good there was still a flaw in the model that we need to fix.</p> <h2>Better than Ronaldo</h2> <p>Eagle-eyed readers will have noticed that the fit of the curve broke down for very short distances, meaning the probability of scoring from zero metres was actually slightly above one. And as reader Benjamin Lindqvist commented, not even Ronaldo will score more than 100% of the time, not even from the goal line. Benjamin also had a good suggestion to improve this, adding an exponential decay function into the model to make it behave better around zero.</p> <h2>Exponential Decay</h2> <p>If you aren’t familiar with exponential decay it basically means that a value decreases at a rate proportional to its current value. It’s a phenomenon that crops up fairly frequently in science and the natural world. For example, air pressure decays exponentially as you go higher up into the Earth’s atmosphere and radioactivity decreases exponentially over time.</p> <p>A general equation for exponential decay is shown in Figure 1, where y(t) is the value at time t, a is the starting value, k is the (positive) decay constant and t is time.</p> <p><span class="math">\(y(t)=ae^{-kt}\)</span></p> <p><strong>Figure 1: Exponential Decay</strong></p> <p>So how do we apply this to football? 
Well, the first thing to do is replace time with metres and assume that the probability of scoring a goal decreases exponentially based upon the distance from goal the shot is taken from.</p> <p>Next we need to find the correct value for the decay constant as this controls the shape of the curve. Rather than doing this manually through trial and error, we can use something such as R’s <a href="http://stat.ethz.ch/R-manual/R-devel/library/stats/html/optim.html">optim</a> function to find it for us. We can also tweak the equation to add in a multiplier for the independent variable and an intercept as found in a traditional regression model giving us the fit shown in Figure 2.</p> <p><img alt="Pelican" src="../../../../images/20140422_exgp_exp_decay.png" /> <strong>Figure 2: Shots Versus Distance From Goal</strong></p> <p>Notice how the orange line now hits the Y axis just below 1.0? This fixes the problem we had before where it was possible to score more than one goal from a single shot. In fact, if you’re standing on the goal line the model now predicts around 0.96 expected goals, so very likely to score but with a small chance of screwing up (yes Edin Džeko I’m looking at you).</p> <p>The new curve fit also pushes the r squared value up to 0.9883, meaning 98.83% of the variance for the probability of scoring from a shot can be accounted for using just distance from goal along the X and Y axes.</p> <p>The final equation (Figure 3) is slightly more complicated now but it’s still pretty simple to use.</p> <p><span class="math">\(\mathrm{expg}=e^{-d/4.79}\times 0.921985+0.036212\)</span></p> <p><strong>Figure 3: Expected Goals Equation Incorporating Exponential Decay</strong></p> <p>where:</p> <p><span class="math">\(d=\sqrt{dx^2+dy^2}\)</span></p> <p><strong>Figure 4: Equation for d</strong></p> <p>and dx, dy are the difference between the x coordinates and y coordinates in metres for the shot location and the goal location.</p> <p>As ever, let me know what you think!</p> 
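<p>If you want to try the equation out yourself, the snippet below is a minimal sketch of Figures 3 and 4 in R. The function name <code>expected_goals</code> and its arguments are just names I have made up for illustration, and it assumes the shot and goal coordinates are already in metres, with the goal placed at (0, 0) unless you say otherwise.</p>

```r
# Minimal sketch of the expected goals equation (Figure 3), with d taken from Figure 4.
# The function and argument names are hypothetical; coordinates are assumed to be in metres.
expected_goals <- function(shot_x, shot_y, goal_x = 0, goal_y = 0) {
  dx <- shot_x - goal_x            # difference between the x coordinates
  dy <- shot_y - goal_y            # difference between the y coordinates
  d  <- sqrt(dx^2 + dy^2)          # total distance from goal (Pythagoras)
  exp(-d / 4.79) * 0.921985 + 0.036212
}

# Standing on the goal line (d = 0) gives roughly 0.96 expected goals
print(expected_goals(0, 0))

# The function is vectorised, so a whole column of shots can be scored at once
shots <- data.frame(x = c(5, 11, 20), y = c(0, 3, 8))
shots$expg <- expected_goals(shots$x, shots$y)
print(shots)
```

<p>Because the intercept is added on after the decay term, the predicted value never drops below about 0.036 however far out the shot is taken from, which is worth bearing in mind for very long range efforts.</p>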
<h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>OI - April 24, 2014</strong></p> <p>Two Questions:</p> <p>-Did you use two different samples for this article and the last one about the y-axis? Some points seem to be relocated from Figure 3 in the last piece to Figure 2 here. For example, there is hardly a difference in scoring probability between 5 and 6 metres distance in the other diagram, whereas in this diagram the difference is approximately 5%!</p> <p>-Are blocked shots included in your calculations?</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 24, 2014</strong></p> <p>Yes, in between those articles I increased the number of shots in my database by nearly 25% so hopefully some of the noise for distances where I didn’t have many shots should be smoothed out. Everything categorised as a shot by Squawka is included in the calculation except for penalties and own goals.</p> <div class="hline"></div> <p><strong>Benjamin Lindqvist - April 24, 2014</strong></p> <p>Hi Martin,</p> <p>Glad to have been of help. If you’re interested in hearing more negativity, I think your function is now probably overfitted :)</p> <p>If you have Skype, feel free to add me (benjaminlindqvist). Not all topics regarding football and numbers are suited to the public!</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 25, 2014</strong></p> <p>Ah the perennial conflict between optimising and over-fitting :)</p> <p>It’s certainly a risk considering the number of data points I have but I’m not too concerned at the moment as it’s a fairly simple curve rather than some high-order polynomial weaving between the data points. Plus, even though the exponential decay certainly improved the fit the actual expected goal values predicted haven’t really changed too much, so in the grand scheme of things any over-fitting probably isn’t having that much of an impact at the moment. 
It’s certainly something to bear in mind though!</p> <p>I’ve added you to Skype, would be good to have a chat sometime if you’re free :)</p> <div class="hline"></div> <p><strong>PeP - April 28, 2014</strong></p> <p>Hi Martin,</p> <p>I’m very intrigued by your expected goals model and I’m very impressed with the accuracy. Would it be possible to include the Z-axis into your model or is the lack of data holding you back on this.</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 28, 2014</strong></p> <p>By z axis do you mean location in the goal? If so, then there is no reason it couldn’t be incorporated into the model I just don’t have the data available yet.</p> <div class="hline"></div> <p><strong>PeP - April 28, 2014</strong></p> <p>I meant at what height the ball is struck from off the pitch. For example if the xy coordinates were kept the same then was the ball struck from off the ground , was the ball struck on the volley or was it an overhead kick.</p> <div class="hline"></div> <p><strong>Benjamin Lindqvist - May 5, 2014</strong></p> <p>I highly doubt that would be a convex realtionship so that would be hard to fit into this particular model.</p> <div class="hline"></div> <p><strong>Max - May 20, 2014</strong></p> <p>This is amazing… How did you the power curve? Did you find it by trial and error?</p> <div class="hline"></div> <p><strong>Martin Eastwood - May 20, 2014</strong></p> <p>No, it was created using mathematical optimisation techniques rather than trial and error as they can do a better job than me!</p> <div class="hline"></div> <p><strong>EV - July 21, 2014</strong></p> <p>This is an amazing piece of work Martin. And thank you very much for sharing it with us. The effort of gathering the data must have been enormous. 
I have some questions that will probably be interesting for you:</p> <p>I assume you group the shot distances into 1m intervals, and got the probability of a goal inside each interval as number of goals / number of shots inside that interval? Then you used some max likehood to fit the curve? However, the number of shots inside each distance interval is different, I’m gussing there were far more shots around 12m than shots around 3m, and probably no shots at 0m, but you would fit a point at (0m,P(goal)=1) for common sense anyway. Does this introduce a bias where some shots are given more weight than others in fitting the curve? So when predicting the total number of goals for a season, this model will be predicting significantly under the actual number of goals? (Because decay constant is too fast due to the heavier weight given to shorter distances)</p> <p>Is it possible to fit a curve that gives equal weight to each shot? I imagine such a curve is likely to significantly under estimate the probability of scoring from short distances, but will predict season totals more accurately. But the real question is, which curve would predict individual teams’ or even individual matches’ goals more accurately?</p> <div class="hline"></div> <p><strong>Martin Eastwood - July 21, 2014</strong></p> <p>Hi EV,</p> <p>Yes the shots are binned by distance so there will be different numbers of shots per game. One way to investigate whether this causes any biases could be to bin by percentiles instead to normalize the bin sizes. 
Hopefully the fit of the curve will be stable enough that it wouldn’t really affect the results too much but there is only one way to find out…</p> <script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? "innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: 'center'," + " displayIndent: '0em'," + " showMathMenu: true," + " tex2jax: { " + " inlineMath: [ ['\\\\(','\\\\)'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" + " } " + "}); "; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); } </script>Martin EastwoodTue, 22 Apr 2014 19:30:00 +0100tag:,2014-04-22:2014/04/22Expected GoalsExpected Goals: The Y Axis/2014/04/16<h2>Introduction</h2> <p>Expected goals are one of the hot topics in the football analytics community at the moment and I’ve previously written a <a href="http://localhost:8000/category/expected-goals.html">number of articles</a> discussing how to calculate them. 
If you haven’t read those pieces yet it’s probably worth taking a quick look to set the context for the rest of this article.</p> <h2>The Story So Far</h2> <p>A few weeks back I published a simple equation for calculating expected goals that received a lot of positive feedback from readers as it was easy to use and was pretty accurate based on its r squared value of 0.86. This effectively means the equation is capable of explaining 86% of the variance in the shots data I have collected from <a href="http://www.squawka.com/">Squawka</a>.</p> <p>For such a basic equation this is a really good result. I’d purposely tried to keep things simple so that the equation was easy enough for non-mathematicians to use in order to try and encourage its adoption by other people. Rather than keep these sorts of things to myself I’d much rather share them around and see them get used elsewhere.</p> <p>One of the restrictions I’d set myself for this was to only use the distance the player shooting was from the goal along the X axis so that the equation only needed data along one dimension. However, I received a lot of messages through <a href="https://twitter.com/penaltyblog">Twitter</a> and on the blog asking about the Y axis so let’s take a look…</p> <h2>The Y Axis</h2> <p>So the first question to ask was whether the Y axis was even worth bothering with. After all, the r squared value when just using distance along the X axis was already 0.86, which only left around 14% of the variance in the data to account for.</p> <p>Well, it turns out that how far away you are from the goal along the Y axis does have an impact (Figure 1). Unsurprisingly, the further away you are the less likely you are to score. 
Before you ask, the r squared value is 0.88 (I have learnt now to include r squared values for pretty much all charts otherwise I get bombarded by requests for them :-)).</p> <p><img alt="Pelican" src="../../../../images/20140416_y_axis.png" /></p> <p><strong>Figure 1: Shots Versus Distance From Goal Along Y Axis</strong></p> <h2>Adding The Y Axis Into The Equation</h2> <p>Okay, we know the Y axis has an effect on expected goals but how do we factor this into my previous equation? There are a number of mathematical techniques we can use to solve for multiple dimensions. However, I am keen to try and make this as simple as possible so that the layperson can use it so let’s keep it basic and go with Pythagoras’ Theorem, a topic most people have touched on at High School at some point.</p> <p>If we know the xy coordinates of the player taking the shot and the xy coordinates of the goal then using Pythagoras’ Theorem we can calculate the total distance between the two points. Figure 2 shows the equation for this where dx is the distance between the two x coordinates, dy is the distance between the two y coordinates and AB is the total distance the player is from the goal.</p> <p><span class="math">\(AB=\sqrt{dx^2+dy^2}\)</span></p> <p><strong>Figure 2: Calculating the distance between two points</strong></p> <p>I did this for all 17,000 shots I have collected so far from Squawka (excluding penalties) to get their total distances from goal and calculated the probability of scoring from different distances based on the number of shots taken versus goals scored (Figure 3).</p> <p><img alt="Pelican" src="../../../../images/20140416_exp_goals_y_axis.png" /></p> <p><strong>Figure 3: Shots Versus Total Distance From Goal</strong></p> <p>As before, I’m using a power curve to fit the line through the data and as you can see it’s a pretty good fit. So what is the effect of adding in the Y axis? 
Well the r squared value has changed from 0.86 to…</p> <p><em>drumroll</em></p> <p>0.95</p> <p>Yep, including both the X and Y axes in the expected goals model accounts for 95% of the variance in the data. This barely leaves any room for the shooting player’s talent to have any effect or even for defensive pressure to play a part.</p> <p>At first I thought this seemed a bit odd but thinking about it in more detail it actually seems logical. It doesn’t make much difference whether you are shooting from five metres out against a strong defence or a weak one, you still have the same chance of scoring from that particular position.</p> <p>However, playing against a strong defence will likely mean you will get into that good position less often so your overall expected goals will be lower. Conversely, better players will be able to get into those good positions more often than weaker players so their overall expected goals will be higher.</p> <p>In other words, at the individual shot level expected goals seems to be all about a player’s position in respect to the goal when they shoot. Other factors, such as player talent, defensive pressure etc. are probably not visible until you start looking at larger samples, such as expected goals per fixture or even per season.</p> <p>Anyway, here’s the final equation:</p> <p><span class="math">\(ExpG=Distance^{-1.33796}\times 10^{0.4720605}\)</span></p> <p>Let me know what you think!</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Lorenzo - April 17, 2014</strong></p> <p>How to collect data from Squawka?</p> <p>Did you write a simple scraper or is there some public API?</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 25, 2014</strong></p> <p>There is no API available so I collected the data myself from their site</p> <div class="hline"></div> <p><strong>Jonas - April 17, 2014</strong></p> <p>I am not sure if I agree with your interpretation that there is little room for player talent. 
If I have understood what you have done correctly, you are aggregating the shot data to get frequencies (or probabilities if you want) for each 1 metre interval. You are in other words looking across all teams, both the good ones and the bad ones. In your previous post you can clearly see that some of the good teams have rather large positive residuals. These residuals I think are interesting, as they show how good a team is after controlling for where they shoot from.</p> <p>My guess is that if you look at the instances where a shot from 20 metres or more yielded a goal, they are not going to be evenly distributed across all teams, but the good teams are going to be over-represented.</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 17, 2014</strong></p> <p>It’s a really good point and one that I agree with – I’ll take a look at this in more detail in a future post.</p> <p>My point though is that at the point a shot is taken the most overwhelming factor in whether it results in a goal seems to be the position the shot is taken from. If talent was a major factor at this stage then I would expect more variability in the data and a much poorer fit in the graph.</p> <p>Where I think talent will play a major role is the positions players get into to take those shots. 
I would expect better players to be shooting more frequently from better positions and good defences to concede fewer shots from those good positions.</p> <p>I’ll hopefully take a look at all this in the coming weeks to see whether my hypotheses hold up.</p> <div class="hline"></div> <p><strong>Max - April 17, 2014</strong></p> <p>What happens when you take out headers?</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 25, 2014</strong></p> <p>Will be looking into that in more detail soon</p> <div class="hline"></div> <p><strong>John - April 18, 2014</strong></p> <p>Well, I have the equation, but how can I use it?</p> <p>I can know the probability of a goal from a certain distance, but how can I get the total expectation?</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 18, 2014</strong></p> <p>If, for example, a particular shot had a probability of 0.5 then the shot would go in once every two shots so is worth half a goal</p> <div class="hline"></div> <p><strong>John - April 18, 2014</strong></p> <p>Thanks for the quick reply.</p> <p>Ok, but during a football game I can shoot from everywhere, so how can I calculate the total probability during that game and how can this be calculated in relation to the teams?</p> <p>How can I find values like those in your app?</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 18, 2014</strong></p> <p>Just sum up the expectancies from the shots to get the total per fixture.</p> <p>These values aren’t in the app yet though, maybe next season…</p> <div class="hline"></div> <p><strong>John - April 18, 2014</strong></p> <p>Oh right, and how can you calculate the fixtures at the moment?</p> <div class="hline"></div> <p><strong>OI - April 18, 2014</strong></p> <p>What I like best about your metric is that it is a continuous one. The expected goal algorithms I have known up to now use zonation, which is discrete. 
I’ve always thought that this is where improvement is needed, and if I had had the data, I might have examined the concrete influence of shooting distance on my own. But now you did it, laudably.</p> <p>If I were you, having the necessary data, I would be totally curious about “x-/y-axis distance”. According to your last posts, the y-axis distance explains a higher percentage of shot conversion than the x-axis distance (0.88 &gt; 0.86). How does it affect angles? E.g., imagine the triangle with dx and dy as the legs and AB as the hypotenuse.</p> <p>The angle next to the goal could be a quite good measurement for a shot’s “centrality”, or concretely its sine: If a shot is taken straight in front of the goal, this angle is 90° (and the triangle doesn’t exist anymore). If you walk some metres to the left or to the right, the angle declines. This fits the sine, which also peaks at 90°.</p> <p>I can imagine dividing the distance AB by the sine of this angle: If a shot is taken from a central position, “AB adjusted” doesn’t increase a lot (sin90° = 1). The effects aren’t too big as long as the angle isn’t too small. That’s because the sine curve doesn’t slope steeply just before and after 90° (e.g. sin60° = 0.87). I think this suits the probability of scoring well: It is not a big disadvantage to shoot from slightly lateral positions, but it becomes one if you shoot from too far from centre.</p> <p>It’s just one proposal to investigate out of thousands that could be done, but that one I find most interesting. Another one would be to consider the defensive side of your ExpG-metric as well. But I’m sure you have developed some good plans on your own!</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 18, 2014</strong></p> <p>Thanks, making it continuous was important to me as I also dislike the discrete approach of using zones. The angle is something I’ve been thinking about and I think it should be one of the next steps to factor into the model. 
Distance is important but I agree that the angle of the shot must play a key role too. It’s definitely high up the todo list!</p> <div class="hline"></div> <p><strong>Tom Green - July 10, 2014</strong></p> <p>Hi Martin</p> <p>Really interesting stuff. I’m still new to expected goals models and the X,Y stuff. If a shot is taken from 20 metres out, in the centre of the pitch, what would its co-ordinate be?</p> <p>Thanks</p> <p>Tom</p> <script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? "innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: 'center'," + " displayIndent: '0em'," + " showMathMenu: true," + " tex2jax: { " + " inlineMath: [ ['\\\\(','\\\\)'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! 
important'} }" + " } " + "}); "; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); } </script>Martin EastwoodWed, 16 Apr 2014 19:30:00 +0100tag:,2014-04-16:2014/04/16Expected GoalsEnglish Premier League Pythagorean Update/2014/04/04<h2>Introduction</h2> <p>I’ve not posted an update on the Pythagorean for the English Premier League (EPL) for a while so the latest figures are below.</p> <h2>Football Pythagorean</h2> <p>In case you haven’t seen it before, my football Pythagorean is an adaptation of the <a href="http://en.wikipedia.org/wiki/Pythagorean_expectation">baseball Pythagorean</a> that allows you to quickly estimate how many points a team would be expected to achieve on average based on the number of goals they have scored and conceded. It’s a pretty simple little equation but it is surprisingly accurate.</p> <p>Take a look at my previous blog posts <a href="http://www.pena.lt/y/category/pythagorean.html">here</a> about it if you want to find out more about the theory behind it, how it was tested and what the equation itself actually looks like.</p> <h2>The Season So Far</h2> <p>Figure One below shows the difference between the actual points each Premier League team has achieved this season and how many points my Pythagorean predicts they should have on average. For teams in green the difference is positive, meaning they have more points than expected, while those teams in red have fewer points than expected based on the number of goals they have scored and conceded.</p> <p><img alt="Pelican" src="../../../../images/20140404_English_Premier_League_pythag.png" /></p> <p><strong>Figure One: EPL Pythagorean Results So Far</strong></p> <p>Once again Tottenham are way ahead of where they would be expected to be, with an astonishing 15 points extra. Either Spurs are doing something fantastically efficient this season or they are extremely lucky to be where they are in the league. 
Take those 15 points away and they drop down to 10th place just ahead of Stoke. This season has been a bit of a write off for Spurs compared with pre-season expectations but it could / should have been so much worse based on their Pythagorean.</p> <p>Down at the other end of the table Hull should probably be feeling quite pleased with themselves as they are looking a pretty safe bet to avoid relegation even with their Pythagorean of -5.</p> <p>Poor Swansea though have the lowest Pythagorean in the league. On average teams with their goal record would expect to have achieved roughly nine more points than their current total. In fact if Swansea and Tottenham both had the average points their goals suggest then the Swans would actually be the higher placed of the two teams!</p> <p>Let’s see what happens if / when <a href="http://en.wikipedia.org/wiki/Regression_toward_the_mean">regression towards the mean</a> starts to kick in…</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Nick - April 16, 2014</strong></p> <p>Great blog and the EI index is very interesting. It seems that teams that overachieve according to their EI index are those that have suffered heavy defeats with a big goal margin (Tottenham losing 4-0, 4-0 and 6-0 to the top 3, Arsenal losing to them 6-3, 6-0 and 5-1; Norwich 7-0 and 5-1). Obviously, their expected points are negatively affected by these few heavy defeats. 
Wouldn’t it make sense to cap the margin of victory/defeat to 3 goals and re-calculate the EI?</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 16, 2014</strong></p> <p>Perhaps. I did reduce the value of each goal in the original equation but there may well be a better way of doing it than the polynomial function I used</p>Martin EastwoodFri, 04 Apr 2014 19:30:00 +0100tag:,2014-04-04:2014/04/04PythagoreanEPLExpected Goals Updated/2014/03/01<h2>Introduction</h2> <p>When I introduced my Expected Goals model a few weeks back a number of people commented on the bump in the curve where I had included penalty shots in the data set used to fit the model. The reason I’d originally left penalties in was that I felt their number was too few to have an impact on the fit of the model and at the time I hadn’t actually tracked which shots were and were not from penalties.</p> <p>Since that decision seemed to cause quite a kerfuffle I have now gone back to the raw data, removed all the penalties and refitted the curve. 
While I was at it I also added in more shots I had collected and rescaled all the co-ordinates to use a larger pitch (105 x 68m) as Claus Moeller had suggested my estimate of Premier League pitch size was too small.</p> <p>As expected, the difference in the fit of the curve is very small (Figure 1) but it has pushed the r squared value up to 0.86 from 0.84, meaning that 86% of the variance in goal scoring is due to the distance from the goal the shot is taken from and just 14% is due to other reasons, such as player talent, defensive pressure, goalkeeper etc.</p> <p><img alt="Pelican" src="../../../../images/20140301_expected_goals_update.png" /> <strong>Figure 1: Shots Versus Distance From Goal</strong></p> <p>The equation for expected goals is now updated to -1.014718 for the coefficient and 0.05082859 for the intercept so for my previous example a shot from 8 metres gives:</p> <p><span class="math">\(8^{-1.014718}*10^{0.05082859}=0.1362846\)</span> expected goals</p> <h2>Comments</h2> <p><div class="hline"></div></p> <script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? 
"innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: 'center'," + " displayIndent: '0em'," + " showMathMenu: true," + " tex2jax: { " + " inlineMath: [ ['\\\\(','\\\\)'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" + " } " + "}); "; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); } </script>Martin EastwoodSat, 01 Mar 2014 19:30:00 +0000tag:,2014-03-01:2014/03/01Expected GoalsActual Goals Versus Expected Goals/2014/02/15<h2>Introduction</h2> <p>Since my <a href="http://pena.lt/y/2014/02/12/expected-goals-for-all/">last article</a> about how to calculate expected goals one question has come up more than any other and that is about the correlation between expected goals and actual goals so here you go:</p> <p><img alt="Pelican" src="../../../../images/20140215_goals_for.png" /></p> <p><strong>Figure 1: Shots Versus Distance From Goal</strong></p> <p><img alt="Pelican" src="../../../../images/20140215_goals_away.png" /></p> <p><strong>Figure 2: Expected Goals Away Versus Actual Goals Away</strong></p> <p>The correlations look pretty good, 0.86 for goals for and 0.72 for goals away. 
I’m not sure yet why the correlations differ slightly for home / away and whether it means anything or is just down to noise in the data but I’ll keep an eye on that as I collect more shots over the course of the season.</p> <p>Another question that popped up a few times was whether my expected goals correlated with actual goals better than Total Shot Ratio (TSR) does and the answer is yes it does.</p> <p>This is to be expected really as expected goals account for shot location while TSR considers all shots to be equal when clearly they are not – a shot from one metre out is vastly more likely to lead to a goal than a shot from 20 metres out.</p> <p><img alt="Pelican" src="../../../../images/20140215_tsr_for.png" /></p> <p><strong>Figure 3: TSR Versus Actual Goals For</strong></p> <p><img alt="Pelican" src="../../../../images/20140215_tsr_away.png" /></p> <p><strong>Figure 4: TSR Versus Actual Goals Away</strong></p> <p>There is still a heap of work to do to improve / optimise / characterise the expected goals model further but it is a promising start for it so far. 
I’ll post more updates as I progress with the model’s development over the coming weeks.</p> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodSat, 15 Feb 2014 19:30:00 +0000tag:,2014-02-15:2014/02/15Expected GoalsExpected Goals For All/2014/02/12<h2>Introduction</h2> <p>It seems that everybody has their own expected goals models for football nowadays but they all seem to be top secret and all appear to give different results so I thought I’d post a quick example of one technique here to try and stimulate a bit of chat about the best way to model them.</p> <h2>The Data</h2> <p>Over the past few weeks I have tediously collected several thousand xy co-ordinates for shot locations from <a href="http://www.squawka.com">Squawka</a> and converted them into approximate distances from goal in metres, assuming that an average football pitch is 100m x 65m.</p> <h2>Goals Versus Distance</h2> <p>Figure 1 below shows the relationship between the probability of scoring a goal and how far from the goal line the shot is taken.</p> <p><img alt="Pelican" src="../../../../images/20140212_exp_scatter.png" /></p> <p><strong>Figure 1: Shots Versus Distance From Goal</strong></p> <p>There seems to be a little bit of noise in the data, particularly around the 12-13m mark but overall I was pleasantly surprised how neat the data looks – there seems to be a pretty clear non-linear relationship between the likelihood of scoring and how far from the goal the shot is taken.</p> <p>So how do we model this relationship? 
Obviously we cannot just stick a linear regression through the graph as the relationship is clearly not linear so one possibility is to use a polynomial instead of a straight line (Figure 2).</p> <p><img alt="Pelican" src="../../../../images/20140212_exp_poly.png" /></p> <p><strong>Figure 2: Fitting a Polynomial</strong></p> <p>Unfortunately, this does not give particularly good results as low-order polynomials (the orange line) do not fit tightly enough to the non-linearity in the relationship while higher-order polynomials (the red line) start to fit to the noise in the data leading to problems with over-fitting.</p> <p>So what do we do now? Well, looking closer the shape of the curve appears exponential so one option is to fit a Power function to it. We can do this pretty easily by taking the base-10 log of both the distances and the scoring probabilities, fitting a linear regression to the logged values and plotting the resulting curve against our non-logged data (Figure 3).</p> <p><img alt="Pelican" src="../../../../images/20140212_exp_power.png" /></p> <p><strong>Figure 3: Power Curve</strong></p> <p>This gives an extremely good fit with the data and seems a plausible choice. We know goal scoring is Poisson distributed so it would seem natural to fit expected goals using an exponential shaped curve since Poisson and exponential distributions are inherently linked – the exponential distribution in fact describes the time taken between individual events occurring in a Poisson process.</p> <p>If we calculate the r squared value for the fit of the Power curve then we get a value of 0.84, meaning 84% of the variance in goal scoring can be attributed to how far away the player taking the shot is from the goal. 
This is pretty impressive as it leaves just 16% attributed to other reasons, such as the angle of the shot, goalkeeper positioning, defensive pressure, the shooting player’s talent etc.</p> <p>Before you ask, over the coming weeks I’ll be looking at whether adding these additional factors into the model can improve it or whether the added complexity is not worth it for the sake of the remaining 16%.</p> <h2>Using the Expected Goals Model</h2> <p>But how do we use the model? Although everybody else’s models seem to be top secret I’m going to give mine away. The coefficient for the regression is <span class="math">\(-1.036884\)</span> and the intercept is <span class="math">\(0.05950286\)</span>.</p> <p>To put this into action all you need to do is raise the distance away from the goal in metres to the power of the coefficient and multiply by 10 to the power of the intercept. For example, a shot from 8 metres gives:</p> <p><span class="math">\(8^{-1.036884} \times 10^{0.05950286} = 0.132771\)</span> expected goals</p> <p>So how about we give it a proper test and try it out on this season’s English Premier League to date? 
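</p> <p>Before looking at the results, here is a minimal Python sketch of the recipe above: distance raised to the power of the coefficient, multiplied by 10 to the power of the intercept. The function names and the list of shot distances are hypothetical and purely for illustration, and adding up the per-shot values to get a team total is an assumption about how the totals are produced rather than a description of the exact code used.</p>

```python
# Coefficient and intercept quoted above for the distance-only model.
COEFFICIENT = -1.036884
INTERCEPT = 0.05950286


def expected_goals(distance_m: float) -> float:
    """Expected goals for a single shot taken distance_m metres from goal."""
    if distance_m <= 0:
        # The power curve blows up as the distance approaches zero.
        raise ValueError("distance must be greater than 0 metres")
    return distance_m ** COEFFICIENT * 10 ** INTERCEPT


def total_expected_goals(distances_m) -> float:
    """Sum the per-shot expected goals over a collection of shot distances."""
    return sum(expected_goals(d) for d in distances_m)


# Worked example from the text: one shot from 8 metres.
print(round(expected_goals(8), 6))  # prints 0.132771

# Hypothetical shot distances in metres, made up for illustration only.
example_shots = [6.0, 11.5, 18.0, 24.0]
print(total_expected_goals(example_shots))
```

<p>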
The results are shown in Table 1 and overall give a root mean square error of 8.2 goals, which seems a pretty reasonable starting point for developing the model further from.</p> <table class="table"> <tbody> <tr> <th></th> <th><strong>Team</strong></th> <th><strong>Goals</strong></th> <th><strong>expG</strong></th> <th><strong>Residual</strong></th> </tr> <tr> <td >1</td> <td>Man City</td> <td >68.00</td> <td >46.90</td> <td >21.10</td> </tr> <tr> <td >2</td> <td>Liverpool</td> <td >63.00</td> <td >42.56</td> <td >20.44</td> </tr> <tr> <td >3</td> <td>Arsenal</td> <td >48.00</td> <td >35.26</td> <td >12.74</td> </tr> <tr> <td >4</td> <td>Chelsea</td> <td >48.00</td> <td >42.56</td> <td >5.44</td> </tr> <tr> <td >5</td> <td>Man Utd</td> <td >41.00</td> <td >35.26</td> <td >5.74</td> </tr> <tr> <td >6</td> <td>Southampton</td> <td >37.00</td> <td >29.05</td> <td >7.95</td> </tr> <tr> <td >7</td> <td>Everton</td> <td >37.00</td> <td >33.31</td> <td >3.69</td> </tr> <tr> <td >8</td> <td>Newcastle</td> <td >32.00</td> <td >30.32</td> <td >1.68</td> </tr> <tr> <td >9</td> <td>Swansea</td> <td >32.00</td> <td >26.89</td> <td >5.11</td> </tr> <tr> <td >10</td> <td>Tottenham</td> <td >32.00</td> <td >31.21</td> <td >0.79</td> </tr> <tr> <td >11</td> <td>WBA</td> <td >30.00</td> <td >31.17</td> <td >-1.17</td> </tr> <tr> <td >12</td> <td>West Ham</td> <td >28.00</td> <td >26.69</td> <td >1.31</td> </tr> <tr> <td >13</td> <td>Aston Villa</td> <td >27.00</td> <td >24.20</td> <td >2.80</td> </tr> <tr> <td >14</td> <td>Stoke</td> <td >26.00</td> <td >25.29</td> <td >0.71</td> </tr> <tr> <td >15</td> <td>Sunderland</td> <td >25.00</td> <td >25.63</td> <td >-0.63</td> </tr> <tr> <td >16</td> <td>Hull</td> <td >25.00</td> <td >23.95</td> <td >1.05</td> </tr> <tr> <td >17</td> <td>Fulham</td> <td >24.00</td> <td >24.39</td> <td >-0.39</td> </tr> <tr> <td >18</td> <td>Cardiff</td> <td >19.00</td> <td >24.67</td> <td >-5.67</td> </tr> <tr> <td >19</td> <td>Norwich</td> <td 
>19.00</td> <td >27.61</td> <td >-8.61</td> </tr> <tr> <td >20</td> <td>Crystal Palace</td> <td >18.00</td> <td >25.03</td> <td >-7.03</td> </tr> </tbody> </table> <p><strong>Table 1: Expected Goals For The English Premier League To Date</strong></p> <p>You can also see a pretty clear pattern in that the teams at the top of the league have generally over-performed the goal expectancy while those towards the bottom end have under-performed it. This would seem reasonable as we are predicting average goal expectancy and the top teams are obviously above average so should perhaps do better with their chances, while the lower teams are below average so would be expected to perform worse?</p> <h2>What Next?</h2> <p>I’m not claiming this to be the only way of calculating expected goals, or even the best way but hopefully it will encourage more discussion of how to calculate expected goals rather than a lot of secret black boxes all giving different results.</p> <p>I hope to write more about expected goals over the coming weeks in order to test this equation to see how well it really works, to hopefully improve it further and to try and understand what the metric can and cannot tell us.</p> <p>In the meantime, feel free to use my equation to calculate expected goals, all I ask is that you don’t try and pass the equation off as your own (you know who you are!!) and that if you use it then please acknowledge me and link back to my site.</p> <p>Be warned though it’s a work in progress so is subject to change as and when I improve things…</p> <p>Enjoy!</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Christopher Hoeger - February 13, 2014</strong></p> <p>Hey, I’m a chemical engineer, so while my knowledge of statistics is limited to it’s application in my field, I would really like to get into football analytics, and I’d love to contribute, if possible, to your expected goals model. 
Could you tell me where the best publicly available data is, and, more importantly, the best way to access it? Thank you, Chris</p> <div class="hline"></div> <p><strong>Martin Eastwood - February 13, 2014</strong></p> <p>Hi Chris,</p> <p>I took all the data for constructing the model from Squawka. It’s a tedious process but with a bit of patience you can transcribe approximate xy coordinates for shots from their site.</p> <div class="hline"></div> <p><strong>Ali - March 19, 2014</strong></p> <p>Would love to know the process behind transcribing the shots into x,y coordinates if you have the time..</p> <div class="hline"></div> <p><strong>Claus Moeller - February 16, 2014</strong></p> <p>Interesting model.</p> <p>I think the average pitch in the Premier League is way bigger than 100m x 65m.</p> <p>Most of the pitches in the Premier League are 105m x 68m – http://www.openplay.co.uk/blog/premiership-football-pitch-sizes-2013-2014/</p> <p>Just for my curiosity, do you rely on all the data from Squawka?</p> <div class="hline"></div> <p><strong>Martin Eastwood - February 16, 2014</strong></p> <p>Thanks for the link, not seen that one before. Should be fairly trivial to rescale should people wish to. Yes I used Squawka to get all the xy coordinates.</p> <div class="hline"></div> <p><strong>Hugo Varandas - February 19, 2014</strong></p> <p>Hi, nice work. I find this conclusion very interesting “This is pretty impressive as it leaves just 16% attributed to other reasons”. Maybe this is the reason why the better teams outperform the prediction.</p> <p>I just do not understand how this helps to predict the expected goals in a particular future game.</p> <div class="hline"></div> <p><strong>Martin Eastwood - February 19, 2014</strong></p> <p>Yes, presumably somewhere in that 16% is player talent, defensive pressure, goalkeeper skill etc. So far my equation is more explanatory than predictive. 
More work would be needed to produce a predictive model from shots.</p> <div class="hline"></div> <p><strong>Justin - February 20, 2014</strong></p> <p>I’m sure the spike in probability at the 12m mark is the result of penalties. Analyzing only goals from open play would increase the R^2 I would suspect.</p> <div class="hline"></div> <p><strong>Martin Eastwood - February 20, 2014</strong></p> <p>Yes, when I next get some free time I’ll look at removing them, although the fit of the curve is so good it will probably have minimal effect.</p> <div class="hline"></div> <p><strong>Antony - March 15, 2014</strong></p> <p>Do you know whether the decay curve exponent is replicable across seasons?</p> <p>Also do you know how much difference the angle of the shot makes and if other “black box”-guys additionally use this? The fit by averaging across angles is good, but wide players may have a detriment when cast against a benchmark without it – or maybe the difference is small.</p> <p>Well done for compiling the data from Squawka – when I had a quick search the only way to get x-y’s seemed to be by eye and a ruler from their pitch plots!!</p> <div class="hline"></div> <p><strong>Martin Eastwood - March 17, 2014</strong></p> <p>I’ve not looked season-by-season yet but the decay curve is created from multiple seasons’ worth of data aggregated together.</p> <p>The angle presumably makes some difference but considering how high the r-squared is it must only account for a relatively small proportion of expected goals, although I will look into this in more detail as soon as I get some free time.</p> <p>Thanks!</p> <div class="hline"></div> <p><strong>Benjamin Lindqvist - April 18, 2014</strong></p> <p>Hey, just thought I’d cryptically point out there’s a better function to fit to the data than that.</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 18, 2014</strong></p> <p>Ooh you can’t just leave it at that, tell me more :-)</p> <div class="hline"></div> <p><strong>Benjamin 
Lindqvist - April 18, 2014</strong></p> <p>Well if I’m not mistaken, the probability of scoring from 0 yards is, according to your model, more than 100%. Not even Ronaldo will score more than 100% of the time, not even from the goal line :D</p> <p>Why don’t you try a decaying exponential instead? That has all the properties you’re looking for, but it will also be well behaved at x=0.</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 18, 2014</strong></p> <p>Yes the 0 yards issue is a concern. I thought about forcing the curve through 1 but dislike taking that sort of brute force approach. The decaying exponential sounds a really good idea. I’ll look into that,thanks!</p> <div class="hline"></div> <p><strong>Benjamin Lindqvist - April 18, 2014</strong></p> <p>I.e.: http://fooplot.com/#W3sidHlwZSI6MCwiZXEiOiIxL3giLCJjb2xvciI6IiMwNEZGMDAifSx7InR5cGUiOjAsImVxIjoiZV4oLXgpIiwiY29sb3IiOiIjRkYwMDAwIn0seyJ0eXBlIjoxMDAwfV0-</p> <div class="hline"></div> <p><strong>Anonymous - April 22, 2014</strong></p> <p>Hey Martin,</p> <p>I had a very noobish doubt. Please don’t mind it. Can you explain how you calculated the probability of scoring (Y axis) to arrive at the scatter plot in Figure 1?</p> <p>Thanks in advance. :)</p> <div class="hline"></div> <p><strong>Anonymous - April 22, 2014</strong></p> <p>Poisson distribution of course. My bad!</p> <div class="hline"></div> <p><strong>Anonymous - April 22, 2014</strong></p> <p>If you could run me through the process or maybe send me some useful links, it would be greatly appreciated!</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 22, 2014</strong></p> <p>I split the shots into different bins by distance and then calculated the probability per bin. I then used this set of aggregated probabilities to construct the model and fit the curve.</p> <div class="hline"></div> <p><strong>Anonymous - April 22, 2014</strong></p> <p>Thanks a lot! 
:)</p> <p>I need to brush up on my stats concepts.</p> <div class="hline"></div> <p><strong>abhinav - July 26, 2014</strong></p> <p>how are you gathering xy coordinates from squawka? or are you using some other metric to approximate?</p> <div class="hline"></div> <p><strong>Martin Eastwood - July 26, 2014</strong></p> <p>It took quite a bit of effort!</p> <script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? "innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: 'center'," + " displayIndent: '0em'," + " showMathMenu: true," + " tex2jax: { " + " inlineMath: [ ['\\\\(','\\\\)'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! 
important'} }" + " } " + "}); "; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); } </script>Martin EastwoodWed, 12 Feb 2014 19:30:00 +0000tag:,2014-02-12:2014/02/12Expected GoalsComparing Players Using Cluster Analysis/2014/02/10<h2>Introduction</h2> <p>As there were a couple of presentations at the recent Opta Pro Forum talking about identifying player similarities I thought I’d give a quick example of how to do something similar using k-means cluster analysis.</p> <h2>The Data</h2> <p>All the data used in the analysis was taken from public websites, such as <a href="http://www.whoscored.com">whoscored</a>, <a href="http://www.squawka.com">squawka</a>, <a href="http://www.transfermarkt.co.uk/">transfermarkt</a> etc and painstakingly matched together to try and get as much information on each player as possible.</p> <p>The first stage of analysis was to normalize the data so it was all in the same range to avoid biasing the clustering. If you think about how many goals a typical player scores per match compared with how many passes they play then the scale is quite different. Since k-means clustering uses <a href="http://en.wikipedia.org/wiki/Euclidean_distance">Euclidean Distance</a> the clusters formed are influenced strongly by the magnitudes of the variables, especially by outliers. By normalizing all data into the same range this bias can be avoided.</p> <h2>Principal Component Analysis</h2> <p>While normalizing the data, I also performed <a href="http://en.wikipedia.org/wiki/Principal_component_analysis">Principal Component Analysis</a> (PCA) on it too. 
This step isn’t essential but it is a handy way of reducing the dimensions in the data down to a more manageable size by squashing all the data together into new variables known as principal components.</p> <p>These principal components are created in such a way that the first one accounts for as much of the variance in the data as possible, the second one then accounts for as much of the remaining variance as possible, and so on.</p> <p>As you can see in Figure 1 below, the first component represents roughly 70% of all the variance in the data, with each additional component accounting for less and less. This means we can represent nearly all the information in the data using just five components, and around 80% using just two components.</p> <p><img alt="Pelican" src="../../../../images/20140210_scree.png" /></p> <p><strong>Figure 1: PCA scree plot showing amount of variance accounted for by each principal component</strong></p> <h2>Clustering The Players</h2> <p>The next step was to run the k-means clustering algorithm on the data. As shown in Figure 2, the players split relatively neatly into five distinct coloured clusters when plotted by the first two principal components.</p> <p><img alt="Pelican" src="../../../../images/20140210_clusters.png" /></p> <p><strong>Figure 2: Players split into different clusters by colour</strong></p> <h2>Goalkeepers</h2> <p>As a quick test we can look at the grey cluster located at the bottom of the image in more detail to see which players are contained within it (Figure 3). If you click the image to zoom in on it you can see it’s done a pretty good job of pulling out the goalkeepers from the rest of the players. 
This is to be expected since goalkeepers’ stats should be pretty distinct from outfield players’ but it’s reassuring to check the technique passes this first simple test before we move on.</p> <p><img alt="Pelican" src="../../../../images/20140210_keepers.png" /></p> <p><strong>Figure 3: The grey cluster up close</strong></p> <h2>Vincent Kompany</h2> <p>Now that we have separated out the goalkeepers we can take a look at how well the technique copes with outfield players, starting with Manchester City’s central defender Vincent Kompany located at the centre of Figure 4. The results are pretty good, with Kompany surrounded by players predominantly considered to be defenders. As you move up the image the players start to get a bit more attacking, with people like David Luiz, Phil Jones and Fabian Delph starting to appear.</p> <p><img alt="Pelican" src="../../../../images/20140210_kompany.png" /></p> <p><strong>Figure 4: Clustering of Vincent Kompany</strong></p> <h2>Adnan Januzaj</h2> <p>Next up is Adnan Januzaj, one of the few Manchester United players to be having anything resembling a decent season this year. Again the results look pretty plausible (Figure 5), with Januzaj surrounded by predominantly attacking midfielders. There are a couple of slightly surprising results in there though, such as Manchester City’s strikers Álvaro Negredo and Edin Džeko.</p> <p><img alt="Pelican" src="../../../../images/20140210_januzaj.png" /></p> <p><strong>Figure 5: Clustering of Adnan Januzaj</strong></p> <h2>Mikel Arteta</h2> <p>Finally, I added in Arsenal’s midfielder Mikel Arteta (Figure 6). 
This one was probably the most surprising of all the players I’ve looked at as there seems to be quite a mix of players around Arteta, including both offensive and defensive players, although perhaps this is actually representative of Arteta’s role at Arsenal?</p> <p><img alt="Pelican" src="../../../../images/20140210_arteta.png" /></p> <p><strong>Figure 6: Clustering of Mikel Arteta</strong></p> <h2>Next Steps</h2> <p>For a first go the results are pretty promising but there are plenty of ways the technique could be improved. At the moment I have used all the data I had available for each player but I suspect more specific results could be obtained by filtering the data.</p> <p>For example, there may be specific attributes of a player you want to match on, e.g. looking for attackers by just their creative output may be more useful than including their tackles, interceptions etc, which may be of minor importance to their role.</p> <p>Finally, all the data used here are aggregated. A really interesting next step would be to include xy co-ordinates for shot locations, interceptions, passes etc to cluster players based on the locations of their actions on the pitch (donations of xy data will be gratefully accepted :)).</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Filipe Rodrigues - February 20, 2014</strong></p> <p>Hello Martin, great job once again.</p> <p>Can you tell me what data was used to complete this post?</p> <div class="hline"></div> <p><strong>Martin Eastwood - February 20, 2014</strong></p> <p>It is collected from all over the Internet and then algorithmically matched together. Sadly it’s not an easy job to acquire.</p> <div class="hline"></div> <p><strong>Dank - June 29, 2014</strong></p> <p>Hello Martin, I want to know what the values of the axes are, can you explain it to me?</p> <div class="hline"></div> <p><strong>Martin Eastwood - June 29, 2014</strong></p> <p>Hi Dank, the axes represent the first and second principal components. 
I don’t have time to explain it all here at the moment so in the meantime it’s probably worth taking a look at the Wikipedia entry as a starting point – http://en.wikipedia.org/wiki/Principal_component_analysis</p> <div class="hline"></div> <p><strong>Antony - March 12, 2014</strong></p> <p>Hi Martin – really interesting analysis which shows what you can do with a pure top-down, data-driven approach.</p> <p>I did a similar analysis with the MCFC/Opta data last year but instead of using PCA I first just considered midfielders and then qualitatively chose the attacking and defensive attributes I wanted to determine out/underperformance in versus the sample (by scaling per 90 and normalising by mean and variance). By adding up defensive and offensive (rather than projecting onto eigenvectors) a 2D plot revealed many qualitative knowns and some surprises. The population nicely clustered into the enforcers, schemers, creators and luxuries respectively and then PCA helped to characterise the variance within each cluster. Would be interesting to compile these types of systems for many seasons and see what the principal dynamical modes are, like improving within a cluster or moving between them, e.g. the trajectory of Giggs and now Gerrard.</p> <p>Interesting the different approaches we used in the ordering and application of qualitative intuition vs. quantitative rigour, which I think is the marriage that has to be progressed and its limitations understood for analytics to really take off and be adopted.</p> <p>Also like the piece on ExpG….amazing fit to the decay curve!</p> <div class="hline"></div> <p><strong>Martin Eastwood - March 13, 2014</strong></p> <p>Hi Antony,</p> <p>That sounds really cool! At some point I want to go back and try something similar with the clustering using subsets of the players’ attributes. 
At the moment I use all the players’ stats but it would be interesting to try just clustering players based on passing stats or defensive stats etc.</p> <p>I think splitting analyses out over seasons will be a really important thing to do to assess trajectories of players’ careers and how they develop / change with age or move between clusters. Just need the data to do it :)</p>Martin EastwoodMon, 10 Feb 2014 19:30:00 +0000tag:,2014-02-10:2014/02/10Player AnalyticsCluster AnalysisEPL 2013/2014: Football Pythagorean So Far/2014/01/20<h2>Introduction</h2> <p>Welcome back! Now that I'm no longer part of Onside Analysis I'm free to start blogging again so let's start off by taking a look at how my football Pythagorean is doing for the English Premier League so far this season.</p> <h2>Football Pythagorean</h2> <p>In case you haven’t seen it before, my <a href="http://pena.lt/y/pythagorean.html">football pythagorean</a> is an adaptation of the baseball Pythagorean that allows you to quickly estimate how many points a team would be expected to achieve on average based on the number of goals they have scored and conceded. It’s a pretty simple little equation but it is surprisingly accurate!</p> <h2>The Season So Far</h2> <p>Figure One below shows the difference between the actual points each Premier League team has achieved this season and how many my Pythagorean predicts they should have on average. For teams in green the difference is positive, so they actually have more points than expected, while those in red have gained fewer points than would be expected based on the number of goals they have scored and conceded.</p> <p><img alt="Pelican" src="../../../../images/20140120_epl_pythag.png" /></p> <p><strong>Figure One: EPL Pythagorean Results So Far</strong></p> <p>The standout team here is obviously Tottenham, who have somehow managed to end up with eleven points more than would be expected based on their goals. 
Spurs’ Pythagorean has looked pretty big for a while now so I suppose you could look at this two ways – either they have developed an extremely effective and efficient system or they have been lucky to get as many points as they have. I’ll let you decide on the answer to that one…</p> <p>Interestingly, Manchester City are pretty close to their expected points total despite their enormous goal difference. One reason for this is that my football Pythagorean is not linear so as you score more goals they become less valuable to help account for high scoring matches, such as most of City’s home games this season! This helps prevent over-prediction of expected points for teams scoring heavily – having a good goal difference is obviously helpful but whether you win by one goal or five goals you still only get three points from the match.</p> <p>As it stands though Manchester City are in second place behind Arsenal, who have acquired six points more than expected, meaning that typically we would not expect Arsenal to be top based on their results so far this season.</p> <h2>How Will The Season End?</h2> <p>As well as looking at how teams are doing so far, we can also extrapolate the results and predict how the teams will end up at the end of the season (Table One). 
This is a very simplistic prediction, for example it does not take into account strength of schedules, but it is fairly accurate – the r squared value for Pythagorean predicted points versus actual points across multiple leagues worldwide was 0.938 with an average error of less than four points – so it should give a reasonable estimate of how the Premier League will finish next May.</p> <table class="table"> <tbody> <tr> <th></th> <th>Team</th> <th>Points</th> </tr> <tr> <td>1</td> <td>Manchester City</td> <td>84.50</td> </tr> <tr> <td>2</td> <td>Arsenal</td> <td>83.59</td> </tr> <tr> <td>3</td> <td>Chelsea</td> <td>80.99</td> </tr> <tr> <td>4</td> <td>Liverpool</td> <td>73.75</td> </tr> <tr> <td>5</td> <td>Everton</td> <td>72.20</td> </tr> <tr> <td>6</td> <td>Tottenham Hotspur</td> <td>65.89</td> </tr> <tr> <td>7</td> <td>Manchester United</td> <td>62.56</td> </tr> <tr> <td>8</td> <td>Newcastle United</td> <td>59.32</td> </tr> <tr> <td>9</td> <td>Southampton</td> <td>54.42</td> </tr> <tr> <td>10</td> <td>Aston Villa</td> <td>41.52</td> </tr> <tr> <td>11</td> <td>Hull City</td> <td>40.97</td> </tr> <tr> <td>12</td> <td>West Bromwich Albion</td> <td>40.74</td> </tr> <tr> <td>13</td> <td>Swansea City</td> <td>39.66</td> </tr> <tr> <td>14</td> <td>Stoke City</td> <td>36.85</td> </tr> <tr> <td>15</td> <td>Norwich City</td> <td>35.78</td> </tr> <tr> <td>16</td> <td>West Ham United</td> <td>33.91</td> </tr> <tr> <td>17</td> <td>Sunderland</td> <td>32.28</td> </tr> <tr> <td>18</td> <td>Crystal Palace</td> <td>31.30</td> </tr> <tr> <td>19</td> <td>Fulham</td> <td>30.65</td> </tr> <tr> <td>20</td> <td>Cardiff City</td> <td>29.29</td> </tr> </tbody> </table> <p><strong>Table One: Pythagorean Predicting Final Standings For the English Premier League 2013/2014</strong></p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Rasmus Dam - January 20, 2014</strong></p> <p>Hi Martin</p> <p>Great to see you posting again. It’s an interesting and easy to understand post. 
Furthermore your model is easy to use for other league forecasts. If interested I’ve done so for the Danish Superliga in these two posts:</p> <p>http://super-analyse.blogspot.dk/2014/01/forecast-superliga-1314.html</p> <p>http://super-analyse.blogspot.dk/2013/06/pythagorean-expectation-in-football.html</p> <p>As a side comment: Isn’t the R^2 between EP and actual final points less than 0.938 after 22 rounds? (I have a value of 0.7 after 22 rounds calculated from last season in Denmark)</p> <p>Sorry for linking, but just wanted to let you know that you inspired me to do the same to danish football so you could see it if interested :)</p> <div class="hline"></div> <p><strong>Martin Eastwood - January 21, 2014</strong></p> <p>Hi Rasmus</p> <p>Thanks for the links, it’s great to see my Pythagorean equation getting used elsewhere :-)</p> <p>Yes you are correct the r2 value I quoted was for the end of the season. Perhaps I should have made that clearer in my last post? It’s good to see that you found the same results as me and that the predicted points stabilises pretty fast in Danish football too.</p>Martin EastwoodMon, 20 Jan 2014 19:30:00 +0000tag:,2014-01-20:2014/01/20EPLPythagoreanUEFA Champions League – Route To The Final/2013/09/30<h2>Introduction</h2> <p>With the UEFA Champions League group stage now underway I took a quick look at what it typically takes for teams to reach the final.</p> <p>I started off by looking at how well teams from the major six domestic leagues (England, France, Italy, Spain, Germany and Portugal) performed in the UEFA Champions League based on what position they qualified in domestically (Figure 1) as this affects at what point they enter the competition…</p> <p>Find out more by reading the full article on the Onside Analysis blog <a href="http://www.onsideanalysis.com/uefa-champions-league-route-final/">here</a>.</p>Martin EastwoodMon, 30 Sep 2013 19:30:00 +0100tag:,2013-09-30:2013/09/30UCLAnalysing Football Teams Using Cluster Analysis 
and Principal Component Analysis/2013/08/30<h2>Introduction</h2> <p>The amount of football data available is growing rapidly – with every passing week of the season more matches are played and even more data gets collected. This is great as it allows us to increase our understanding of the game but it also means we quickly end up with more information than could ever be analysed manually.</p> <p>Instead, we can use techniques such as cluster analysis and principal component analysis (PCA) to critically analyse these large sets of football data to identify important patterns and relationships that can help explain a team’s performances.</p> <p>Find out more by reading the full article on the Onside Analysis blog <a href="http://www.onsideanalysis.com/analysing-football-teams-using-cluster-analysis-principal-component-analysis/">here</a>.</p>Martin EastwoodFri, 30 Aug 2013 19:30:00 +0100tag:,2013-08-30:2013/08/30Cluster AnalysisAnnouncement/2013/06/19<h2>Introduction</h2> <p>You may have noticed that my blogging has slowed down over the past few weeks and the reason is that I have joined Onside Analysis as a computational statistician.</p> <p>I am really excited about my new role as it means that I will be working on football analysis full time instead of trying to squeeze it in around my day job, family, sleep etc. 
I am not sure exactly what this means for my blog here though but the plan is that I will be contributing to the Onside Analysis blog so keep an eye out on that if you’re interested in what I have been writing about so far.</p> <p>I’ll also still be around on <a href="https://twitter.com/penaltyblog">Twitter</a> so please keep in touch :)</p>Martin EastwoodWed, 19 Jun 2013 19:30:00 +0100tag:,2013-06-19:2013/06/19Betting With The Eastwood Index And Kelly Criterion/2013/05/23<h2>Introduction</h2> <p>I demonstrated in my last post that the odds calculated using the Eastwood Index were slightly more accurate than the bookmakers over the course of the football season. My next goal is to work out the optimal way of using this edge to make a profit, starting off with the Kelly Criterion.</p> <h2>The Kelly Criterion</h2> <p>The first port of call for any staking plan is the Kelly Criterion, a method developed by John Larry Kelly Jr to determine the optimal bet size based on how far the odds are perceived to be in your favour.</p> <p>The equation used to calculate the Kelly Criterion is shown in Figure 1 where <span class="math">\(p\)</span> is your expected probability of winning, <span class="math">\(b\)</span> is the decimal odds offered and <span class="math">\(f\)</span> is the Kelly Criterion or recommended fraction of your bankroll to bet.</p> <p><span class="math">\(f=(pb-1)/(b-1)\)</span></p> <p><strong>Figure 1: Kelly Criterion</strong></p> <p>Let’s run through a quick example using Fulham’s last match of the season against Swansea City. 
Bet365 offered Fulham to win at odds of 4.75, which is equivalent to an implied win probability of around 21%, but let’s say you think the probability of Fulham winning is actually closer to 24%.</p> <p><span class="math">\(f = (0.24 \times 4.75 - 1) / (4.75 - 1)\)</span></p> <p><span class="math">\(f = 0.14 / 3.75\)</span></p> <p><span class="math">\(f = 0.0373\)</span></p> <p><span class="math">\(f = 3.73\%\)</span></p> <p>So according to the Kelly Criterion we should be willing to risk 3.73% of our bankroll on this bet.</p> <h2>Applying the Kelly Criterion to the Eastwood Index</h2> <p>So what is the best way of applying the Kelly Criterion to the Eastwood Index? There are numerous different strategies that could be used but to start off with I’ve gone purely with value bets.</p> <p>For each match I calculated the Kelly Criterion based on the Home, Draw and Away odds from Bet365 and looked for the outcome where the recommended bet was the largest. The reason for this was that the larger the recommended bet, the greater the difference between my probabilities and the bookmaker’s odds, and so the greater the potential value of the bet.</p> <p>Figure 2 shows the results over the course of the season. Starting off with a bankroll of £100 there was a slight loss over the first half of the season followed by pretty steady growth to finally finish with £114 in the bank. This gave a return on investment (ROI) of 14% for the Eastwood Index based on the 2012–2013 Premier League season.</p> <p><img alt="Pelican" src="../../../../images/130523-Fraction-Kelly.png" /> <strong>Figure 2: Bankroll over 2012-2013 Season</strong></p> <h2>Fractional Kelly</h2> <p>Being relatively risk averse, I’ve found that using a fractional Kelly Criterion is preferable for me. 
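</p> <p>As a rough illustration of the staking arithmetic, here is a minimal Python sketch of the Kelly formula from Figure 1 along with a fractional version. The function names and the <code>fraction</code> parameter are hypothetical labels chosen for this example rather than code from the Eastwood Index, the odds are assumed to be decimal odds, and clamping negative stakes to zero is my own assumption about how a no-value bet would be handled.</p>

```python
def kelly_fraction(p, b):
    """Kelly stake as a fraction of bankroll: f = (p*b - 1) / (b - 1).

    p is your estimated probability of winning, b is the decimal odds offered.
    """
    if b <= 1:
        raise ValueError("decimal odds must be greater than 1")
    return (p * b - 1) / (b - 1)


def fractional_kelly(p, b, fraction=0.5):
    """Scale the Kelly stake down (half Kelly by default).

    Assumption for this sketch: a negative Kelly value means no value, so stake nothing.
    """
    return max(0.0, kelly_fraction(p, b)) * fraction


# Fulham example from the text: estimated probability 0.24, decimal odds 4.75
full = kelly_fraction(0.24, 4.75)
half = fractional_kelly(0.24, 4.75, fraction=0.5)
print(f"Full Kelly: {full:.2%}")  # 3.73%
print(f"Half Kelly: {half:.2%}")  # 1.87%
```

<p>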
Although using the full Kelly Criterion is optimal for maximizing growth of the bankroll long term, there is more risk of being caught out by variance and an unlucky streak wiping out your bank balance.</p> <p>Betting a fractional value, such as half the recommended amount, slows down growth but helps protect you from volatility. As a comparison, take a look at Figure 3 where I bet a full Kelly Criterion on each match and you can see that at its peak the ROI reaches 73%, yet at the end of the season variance has pulled the bankroll down below its starting value, causing a loss overall.</p> <p><img alt="Pelican" src="../../../../images/130523-Full-Kelly.png" /> <strong>Figure 3: Volatility Betting The Full Kelly Criterion</strong></p> <h2>Early Season Dip</h2> <p>One intriguing aspect of Figure 2 is why the bankroll grew so much more in the second half of the season than in the first. Partly this may be due to random variance but another possibility is betting on recently promoted teams.</p> <p>At the moment promoted teams take over the EI ratings of the equivalent relegated team, so the team promoted as champions takes the EI rating of the team relegated third from bottom, the team promoted second takes over the rating of the team relegated second from bottom and the team promoted in the playoffs gets the rating of the team finishing bottom.</p> <p>These ratings will not be exactly correct for the promoted teams but should over time move towards the right levels to reflect the teams’ performances. Looking through the history of the bets made, though, those involving a promoted team during the first half of the season lost money overall while those not involving promoted teams made a profit.</p> <p>By the end of the season I had made a profit out of the promoted teams, suggesting that the teams’ ratings had sufficiently corrected themselves. 
This means though that I could improve the ROI even further by avoiding bets on the promoted teams early on in the season or improving the way the promoted teams’ ratings are handled, such as correcting their EI ratings faster.</p> <h2>Conclusions</h2> <p>This is only a quick look at a very simple strategy for placing bets using the Eastwood Index; there are still plenty of changes that could improve results further. Yet even with this relatively naive approach the ROI is 14%, which is much more than I would have made sticking my money into a bank account.</p> <p>Applying the Eastwood Index to betting is also a great way to identify the strengths and weaknesses of the model as the ROI gives a clear indicator of what works, what doesn’t work and what can be optimized further.</p> <p><em>Addendum</em></p> <p><em>I was asked on Twitter what the ROI works out at per bet for the Eastwood Index without betting the Kelly Criterion. Using a fixed stake of £1 per bet gave an overall profit for the season of £17 over 380 matches, which works out at an ROI of around 4.5% per bet.</em></p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>James - June 1, 2013</strong></p> <p>Hey Martin.</p> <p>I used your EI for betting purposes for the last few weeks of the season and obtained 21% ROI, so I must have been lucky in picking a good week to start. Have you any plans to work out the ROI for specific situations? For example just looking at when the ‘value’ selection was a home win, draw, etc. 
so that you can see where your model is better than the bookies and where it isn’t.</p> <p>Or, do a graph with each teams ROI to see if anything sticks out.</p> <p>Thanks.</p> <div class="hline"></div> <p><strong>Martin Eastwood - June 1, 2013</strong></p> <p>Excellent work Jamie, always good to see someone take money off the bookies :)</p> <p>Good idea, I’ll try and take a look at those over the summer!</p> <div class="hline"></div> <p><strong>Amir - June 4, 2013</strong></p> <p>Just be careful with jumping into conclusions over a small sample…</p> <div class="hline"></div> <p><strong>Martin Eastwood - June 4, 2013</strong></p> <p>Agreed, one year’s worth of data is certainly not definitive for football.</p> <script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? "innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: 'center'," + " displayIndent: '0em'," + " showMathMenu: true," + " tex2jax: { " + " inlineMath: [ ['\\\\(','\\\\)'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! 
important'} }" + " } " + "}); "; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); } </script>Martin EastwoodThu, 23 May 2013 19:30:00 +0100tag:,2013-05-23:2013/05/23EIEPLDid The Eastwood Index Beat The Bookmakers?/2013/05/21<h2>Introduction</h2> <p>It’s the end of the season so it’s time to review how the Eastwood Index performed over the year and how it compared with the bookmakers.</p> <h2>Ranked Probability Scores</h2> <p>One of the most important aspects to me is how accurate the forecasts were and I’ve assessed this using <a href="http://pena.lt/y/2013/03/21/how-accurate-are-the-ei-football-predictions/">Ranked Probability Scores</a>, as recommended by <a href="http://www.eecs.qmul.ac.uk/~norman/papers/assessing_probabilistic_football_forecast_models.pdf">Constantinou and Fenton</a>. I’ve discussed Ranked Probability Scores on the blog before but for people new to them they measure the difference between the forecasts and what really happened. Scores range from 0 to 1 and represent the amount of error in the predictions, so lower Ranked Probability Scores are better and signify greater accuracy.</p> <h2>Comparison With Bookmakers</h2> <p>Looking at Figure 1 you can see that the Eastwood Index has consistently outperformed the bookmakers all season – and this isn’t just one bookmaker that the Eastwood Index has beaten but the combined knowledge of the industry as I’ve aggregated multiple bookmakers’ odds together and stripped out the overround to make the comparison as tough as possible.</p> <p><img alt="Pelican" src="../../../../images/130521-EI-vs-BM1.png" /> <strong>Figure 1: Eastwood Index Vs Aggregated Bookmakers</strong></p> <p>Interestingly, the difference in accuracy seems to be greatest at both ends of the season. 
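</p> <p>To make the comparison a little more concrete, here is a minimal Python sketch of the two calculations mentioned above: stripping the overround from a set of bookmaker odds, and scoring a single forecast with the Ranked Probability Score. The function names and the example odds are hypothetical, and simple proportional normalisation is only one way of removing the overround; treat it as an assumption rather than necessarily the method behind Figure 1.</p>

```python
def implied_probabilities(odds):
    """Convert decimal odds into probabilities that sum to 1.

    Assumption for this sketch: the overround is removed by proportional normalisation.
    """
    raw = [1.0 / o for o in odds]
    total = sum(raw)  # exceeds 1 when the odds include an overround
    return [r / total for r in raw]


def ranked_probability_score(probs, outcome_index):
    """Ranked Probability Score for one match.

    probs must be in a fixed order (e.g. home, draw, away) and
    outcome_index is the position of the result that actually happened.
    """
    observed = [1.0 if i == outcome_index else 0.0 for i in range(len(probs))]
    cum_forecast = cum_observed = 0.0
    score = 0.0
    # Sum squared differences of the cumulative distributions over the first r-1 outcomes
    for p, o in zip(probs[:-1], observed[:-1]):
        cum_forecast += p
        cum_observed += o
        score += (cum_forecast - cum_observed) ** 2
    return score / (len(probs) - 1)


# Hypothetical home/draw/away odds, with the home side winning (index 0)
bookmaker = implied_probabilities([2.10, 3.40, 3.60])
print(ranked_probability_score(bookmaker, 0))
print(ranked_probability_score([1.0, 0.0, 0.0], 0))  # a perfect forecast scores 0.0
```

<p>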
I expected the start of the season to be difficult to forecast as new teams have been promoted, players have been bought and sold, and managers may have changed clubs but the Eastwood Index seems to have coped with these variables better than the bookmakers’ odds have.</p> <p>Over the course of the season the bookmakers’ forecasts improved until there was very little difference between them and the Eastwood Index but I was somewhat surprised to see how far out their accuracy drifted over the final few weeks of the season.</p> <p>In theory these should be the easiest matches to forecast as we have the most information but in reality they can be tricky as teams’ motivations change. For example, Manchester United have been playing their reserve goalkeeper so he gets enough appearances to earn his winners medal while Swansea’s players may as well have been on holiday since they won the league cup.</p> <p>These changes seem to have thrown the bookmakers’ odds out quite noticeably while the Eastwood Index’s accuracy has remained constant. In fact, it suggests that bookmakers may be over-compensating for these apparent end-of-season effects as the Eastwood Index does not currently take them into account and has not struggled because of it.</p> <h2>Conclusions</h2> <p>Overall, I am pleased with the Eastwood Index’s debut season. I was slightly reluctant to publish the forecasts at first in case the model did not hold up but it has remained accurate throughout the year. The next stage of its development is to identify any patterns in where its forecasts differ from the bookmakers’ and how that could be combined with various staking strategies, as well as looking at expanding to cover other leagues.</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>George - May 21, 2013</strong></p> <p>Cool outcome. From my cursory looks into this, it is definitely possible, it’s just a case of trying to maximise any returns by only taking picks of a certain ratio etc. 
I’ve looked at other sports as well and sometimes find you can get a good number on a particular team for a period of weeks etc. Out of interest how are you translating %’s into home, away or draw (or are you assuming where the EI percentage is greater than the bookmakers % you are taking that result)? Things to look at going forward could be things like home field advantage (e.g. like the Clarke, Norman paper from 1995) or the effect of travel or fatigue. Good luck.</p> <div class="hline"></div> <p><strong>Martin Eastwood - May 21, 2013</strong></p> <p>Thanks George, that’s the tricky bit – how to convert from my %’s to a winning staking strategy. My plan for the summer is to work on that further as I’ve got some ideas ready to test out for next season :)</p> <div class="hline"></div> <p><strong>Alex - May 21, 2013</strong></p> <p>Okay – perhaps this is naive, but if you’ve demonstrated that your index has outperformed the bookies all season, why can’t you invest, say, $100/matchday and split the bets according to the outcomes predicted by the EI over the course of the season? You should win the majority of the bets and show some return for your money, no?</p> <div class="hline"></div> <p><strong>Martin Eastwood - May 22, 2013</strong></p> <p>Yes and I would make a profit over the season doing that. What I want to work out though is what would be the best way of splitting that $100/matchday between the matches to maximize the profit made, e.g. something like the Kelly Criterion.</p> <div class="hline"></div> <p><strong>Alex - May 23, 2013</strong></p> <p>Ahh, I see. This is <em>extremely</em> interesting stuff. I’ve been tinkering around with a Monte Carlo approach in Matlab (using data from Football Manager 2013) for predicting game results, but I haven’t had great – or even remotely convincing – success. 
Your stuff here is brilliant.</p> <div class="hline"></div> <p><strong>Martin Eastwood - May 23, 2013</strong> Thanks Alex :)</p> <p>Is the data easy to extract from Football Manager 13? I’ve never touched it since the old Championship Manager days because it is too addictive but it could be an interesting data source…</p> <div class="hline"></div> <p><strong>Alex - June 6, 2013</strong></p> <p>Yep! There’s a few editors available (just google for ‘em), it becomes pretty straightforward to pull data.</p> <p>Man, if you can show that you can outperform the bookies by even 4%, I think this would make a great alternative to savings accounts. Are you going to be publishing predictions for matchdays over the next season? I think putting in, say, $20 a matchday might be fun.</p> <div class="hline"></div> <p><strong>George - May 23, 2013</strong></p> <p>Don’t know which source you use for data but I use football-data.co.uk, which is great for any of the European Leagues (including the Premiership). What’s really nice for doing this kind of thing, is they usually put stuff in an xls or csv, so you can just do your thing straight away. 
They also usually have a range of bookmaker odds (as I am expecting that certain bookmakers deal with “sharp”er customers than others (if you know what I mean) so I expect that you could probably find a bookmaker that the EI is more prone to exploiting (or perhaps that the EI for a particular team is more prone to exploring).</p> <div class="hline"></div> <p><strong>Martin Eastwood - May 23, 2013</strong></p> <p>Something else to add to my to-do list, which is the best bookmaker for using the EI with!</p> <p><strong>amir - May 23, 2013</strong></p> <p>Please read this article for optimal allocation of bets: http://www.academia.edu/1027427/Algorithms_for_optimal_allocation_of_bets_on_many_simultaneous_events</p> <div class="hline"></div> <p><strong>amir - May 23, 2013</strong></p> <p>I see you already wrote another article stating Kelly Criterion…</p> <div class="hline"></div> <p><strong>Martin Eastwood - May 23, 2013</strong></p> <p>Thanks Amir that link looks really interesting!</p> <p><strong>Lars - October 18, 2013</strong></p> <p>If I see this correctly, you introduced the EI in February 2013.</p> <p>Now you are comparing its predictions with the whole of the 2012/13 season.</p> <p>That is not really “beating” the bookmakers. To beat them, you need to make your predictions BEFORE the matches not after them.</p> <p>If you beat them again next year without changing your methods I’ll give you all the credit you deserve.</p> <div class="hline"></div> <p><strong>Martin Eastwood - October 18, 2013</strong></p> <p>Lars – I actually went back and recreated predictions for every match of the season based on just the data that would have been available at the point in time when the match was originally played.</p> <div class="hline"></div> <p><strong>Nic - November 24, 2013</strong></p> <p>Thanks for the good read. It’s really interesting.</p> <p>When you compare against the bookmakers are you taking their opening odds? 
Because after the odds are published the fluctuations are due to general populace betting patterns.</p> <p>This could account for your sudden strong finish vis-a-vis the bookmakers as the public was betting more against academic results etc which then the bookmakers have to tweak the odds for.</p> <p>Another question. What’s the average spread a bookmaker keeps. To beat them do you mean you overcome the spread as well?</p> <div class="hline"></div> <p><strong>Martin Eastwood - November 25, 2013</strong></p> <p>Thanks Nic,</p> <p>The odds were taken from the football-data website. These were then normalised to account for the overround and aggregated IIRC. Not sure how much the average bookmaker keeps as the overround, charges etc vary so much between traditional high street bookmakers and online betting.</p> <p>Cheers,</p> <p>Martin</p> <div class="hline"></div> <p><strong>Patrick - December 10, 2013</strong></p> <p>You state that you stripped out the bookmakers over-round in your comparison. Should you not be comparing your predictions against the bookmakers prices WITH over-round included?</p> <p>Otherwise is it not a little bit pointless (in betting terms) as you won’t be able to take prices with the over-round removed? ie the edge you’ve found only exists when betting against 100% books…</p> <p>Cheers</p> <div class="hline"></div> <p><strong>Martin Eastwood - December 13, 2013</strong></p> <p>Yes, I agree. I was undecided as to the best way of looking at it so ended up removing the over round as that gave me the worst results, so I went with the worst case scenario. 
If I leave the over round in, which is probably more realistic, then I actually get even better results.</p>Martin EastwoodTue, 21 May 2013 19:30:00 +0100tag:,2013-05-21:2013/05/21EIEPLEI Match Probabilities for the English Premier League/2013/05/17<p>We have finally reached the end of the season so for the last time in 2012-2013 here are the Eastwood Index’s (EI) probabilities for the English Premier League.</p> <p>Once the season is over and done with I’ll be looking back at how the EI has performed and how well its predictions compare with the bookmakers so look out for that next week!</p> <table class="table"> <tbody> <tr> <td><strong>Home Team</strong></td> <td><strong>Away Team</strong></td> <td><strong>Home (%)</strong></td> <td><strong>Draw (%)</strong></td> <td><strong>Away (%)</strong></td> </tr> <tr> <td>Chelsea</td> <td>Everton</td> <td>52</td> <td>28</td> <td>20</td> </tr> <tr> <td>Liverpool</td> <td>QPR</td> <td>71</td> <td>18</td> <td>11</td> </tr> <tr> <td>Man City</td> <td>Norwich</td> <td>77</td> <td>14</td> <td>9</td> </tr> <tr> <td>Newcastle</td> <td>Arsenal</td> <td>23</td> <td>32</td> <td>45</td> </tr> <tr> <td>Southampton</td> <td>Stoke</td> <td>42</td> <td>31</td> <td>27</td> </tr> <tr> <td>Swansea</td> <td>Fulham</td> <td>46</td> <td>30</td> <td>24</td> </tr> <tr> <td>Tottenham</td> <td>Sunderland</td> <td>67</td> <td>21</td> <td>13</td> </tr> <tr> <td>West Brom</td> <td>Man United</td> <td>13</td> <td>30</td> <td>57</td> </tr> <tr> <td>West Ham</td> <td>Reading</td> <td>49</td> <td>29</td> <td>23</td> </tr> <tr> <td>Wigan</td> <td>Aston Villa</td> <td>43</td> <td>31</td> <td>27</td> </tr> </tbody> </table> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>George - May 17, 2013</strong></p> <p>Cool website. I really enjoy making projections and football seems to be the more interesting sport to do this because of the variety of approaches (given the low scoring nature of it which makes it less predictable). 
I go the least squares way and assume error is normally distributed around predictions from a rating system (and it’s the only one I could do easily in Excel given my limited maths). I think the guys at DTech go the ordered probit/logistical regression route (which seems to be the way to go I think) and I haven’t figured out your method yet but it looks interesting. They all seem to have similar numbers though:</p> <p>e.g.: for Chelsea/Everton (Home/Draw/Away)</p> <p>Your way: 52/28/20</p> <p>DTech: 55/23/22</p> <p>Least Squares: 59/22/18</p> <p>All of which are around the bookmaker odds for that game 60/24/19 (random bookmaker picked)</p> <p>Out of interest, have you noticed any kind of preferred ratio of your projections to bookmakers odds along the lines of the Dixon/Coles paper into this kind of thing?</p> <p>Keep up the good work.</p> <div class="hline"></div> <p><strong>Martin Eastwood - May 18, 2013</strong></p> <p>Thanks George! Once the season finishes I’m planning on looking back at how I’ve done over the year compared with the bookmakers to see if there are any patterns to my projections versus theirs.</p> <div class="hline"></div> <p><strong>George - May 18, 2013</strong></p> <p>Thanks. Re: the bookmakers that’s what I’ve found, just because you can generate a number similar to theirs – what does it actually tell you (as we don’t know what their perspective is)? What biases is it accounting for? Don’t know if you saw the Steven Levitt paper in 2004 on this kind of thing (was on the NFL though), and the various papers done on football (e.g. Graham and Stott from DTech in 2008) and bookmaker efficiency. I know DTech have worked out they would make something like 10% when their number differed from bookmaker numbers over however many years they have been doing it. The Dixon/Coles paper also found that when the ratio of their probability against the bookmakers probability was about 1.2 it was optimum for generating profit. 
What this tells us though – I don’t know.</p> <p>Good luck with everything and I look forward to reading it.</p> <div class="hline"></div> <p><strong>Martin Eastwood - May 21, 2013</strong></p> <p>Thanks George, I’ve not seen the Levitt paper before.</p> <div class="hline"></div> <p><strong>Jimmy - August 23, 2013</strong></p> <p>Very interesting, been trying myself to see how accurate the bookies have been with their odds. Scraping my data off Oddsportal, using a Java app to analyse it. And then generating my own rating based on form etc. Only point of note I have noticed so far is that home teams with odds between 1.999 and 2.5 are the best to bet on. Consistently generating a % profit across all leagues. I assume these are games where bookies are loosest with their odds.</p> <div class="hline"></div> <p><strong>Martin Eastwood - August 23, 2013</strong></p> <p>Yes, the bookies are pretty good but they are certainly not perfect and there are situations where they can be beaten – the tricky bit is being able to do it consistently over a long period of time :)</p> <div class="hline"></div> <p><strong>Jimmy - August 23, 2013</strong></p> <p>I’d like to see you try your system against the Iranian pro league. Very obscure league to be betting on I know but I have found my own rating system giving me accuracy of 70% with average odds of over 2.2. I was raking it in last season. 
Not so much this time round as the bookies seem to have tightened their odds on that particular pony.</p> <div class="hline"></div> <p><strong>Martin Eastwood - August 23, 2013</strong></p> <p>cool, well done it’s always great to hear about people taking money off the bookies :)</p> <p>I haven’t tried my system outside of Europe or MLS yet, now you have got me really interested in trying it out on more leagues!</p>Martin EastwoodFri, 17 May 2013 19:30:00 +0100tag:,2013-05-17:2013/05/17EIEPLEI Match Probabilities for the English Premier League/2013/05/10<p>Here are the latest match probabilities for the English Premier League calculated using the Eastwood Index (EI).</p> <p>Somewhat surprisingly, Liverpool are only just favorites to beat Fulham with the odds so close that a draw would seem the likely outcome.</p> <p>Down at the bottom of the table Newcastle versus QPR and Norwich versus West Brom look likely to finish tied, while Sunderland are slight favorites against Southampton meaning Wigan desperately need to take points off Arsenal to stand any chance of avoiding relegation.</p> <table class="table"> <tbody> <tr> <td><strong>Home Team</strong></td> <td><strong>Away Team</strong></td> <td><strong>Home (%)</strong></td> <td><strong>Draw (%)</strong></td> <td><strong>Away (%)</strong></td> </tr> <tr> <td>Aston Villa</td> <td>Chelsea</td> <td>21</td> <td>32</td> <td>47</td> </tr> <tr> <td>Stoke</td> <td>Tottenham</td> <td>24</td> <td>32</td> <td>44</td> </tr> <tr> <td>Everton</td> <td>West Ham</td> <td>67</td> <td>20</td> <td>12</td> </tr> <tr> <td>Fulham</td> <td>Liverpool</td> <td>31</td> <td>33</td> <td>36</td> </tr> <tr> <td>Norwich</td> <td>West Brom</td> <td>37</td> <td>32</td> <td>31</td> </tr> <tr> <td>QPR</td> <td>Newcastle</td> <td>32</td> <td>32</td> <td>35</td> </tr> <tr> <td>Sunderland</td> <td>Southampton</td> <td>49</td> <td>29</td> <td>22</td> </tr> <tr> <td>Man United</td> <td>Swansea</td> <td>76</td> <td>15</td> <td>9</td> </tr> <tr> 
<td>Arsenal</td> <td>Wigan</td> <td>71</td> <td>18</td> <td>11</td> </tr> <tr> <td>Reading</td> <td>Man City</td> <td>8</td> <td>28</td> <td>64</td> </tr> </tbody> </table> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Acka Bilk - May 10, 2013</strong></p> <p>Interesting to see that you make Sunderland such strong favourites against Southampton. On what basis can that be? Are you just looking purely at very recent results and the league table?</p> <div class="hline"></div> <p><strong>Martin Eastwood - May 10, 2013</strong></p> <p>It is based on results over the past few seasons, with greater weighting applied the more recent the result is. The two teams are rated reasonably similarly at the moment so part of Sunderland’s advantage is likely due to playing at home.</p> <div class="hline"></div> <p><strong>Saze - May 14, 2013</strong></p> <p>If the elo ratings are based on results over the past few seasons then how do they apply to newly promoted teams that have never played in the premier league before? Also when you say “over the past few seasons” how many seasons are you exactly talking about.</p> <p>And have you ever considered crafting elo ratings for individual players and then using this to create an elo rating for a whole team as team lineups do usually vary from match to match and can sometimes significantly affect the outcome of a match. The downside to this however is that team lineups are only announced 30-45 minutes before the match starts so you won’t really be able to make a prediction until the match has practically started although it can still be useful to look at which players are performing at the very top level.</p> <div class="hline"></div> <p><strong>Martin Eastwood - May 14, 2013</strong></p> <p>My EI ratings are currently based on three previous seasons’ data plus this season. Over the summer I hope to look at whether extending this offers any advantages.</p> <p>Newly promoted teams are tricky. 
For the EPL teams are assigned the relevant relegated team’s rating so the team promoted in first place get the team relegated third from bottom’s rating. It’s not perfect but over time the rating will correct itself and move towards the correct value. For the MLS there is no relegation to worry about so I can avoid this.</p> <p>I am interested in trying to model teams based on their players but this still requires more research into what stats to use per player – passes, tackles, shots, etc etc??</p> <div class="hline"></div> <p><strong>Saze - May 14, 2013</strong></p> <p>I understand that a promoted teams elo rating will correct itself over time but wouldn’t just creating elo ratings for the Championship, league one and league 2 overcome this issue.</p> <p>Moreover if you are interested in individual player elo ratings then castrol rankings is a good player ranking site; “http://www.castrolfootball.com/rankings/rankings/?team=&amp;comp=&amp;nation=&amp;position=&amp;search=&amp;offset=0&amp;jump=1 ” You can find out more on how they create the rankings in their FAQ section.</p> <div class="hline"></div> <p><strong>Martin Eastwood - May 14, 2013</strong></p> <p>I’m not sure it would really work. For example Cardiff’s EI would be based on how they have performed against Championship quality teams so would not reflect how they would perform against Premier League teams.</p> <div class="hline"></div> <p><strong>Goalimpact - May 15, 2013</strong></p> <p>‘Individual player ELOs’</p> <p>This is what I do. It works quite well, however it is far from being easily implemented. If you are interested please drop by my blog Goalimpact.com</p> <div class="hline"></div> <p><strong>Saze - May 16, 2013</strong></p> <p>Very interesting. 
I like the “top-down” approach that you have taken.</p>Martin EastwoodFri, 10 May 2013 19:30:00 +0100tag:,2013-05-10:2013/05/10EIEPLMLS Player Salaries: 2013/2013/05/10<h2>Introduction</h2> <p>The latest Major League Soccer (MLS) salaries were released recently by the MLS Players’ Union so I thought I would post a quick summary of the data.</p> <h2>Average Salary By Team</h2> <p>The first thing I was interested in was average salaries per team and whether there had been any changes compared with previous seasons (Figure 1).</p> <p>The trend over the past few years has been pretty constant, with LA Galaxy and New York Red Bulls having the highest outgoings on wages, which again continues for 2013.</p> <p>Toronto have typically followed in a distant third place but this season sees them overtaken by Seattle following the addition of Obafemi Martins to their roster.</p> <p><em>click the legend headers to show / hide each season’s data and hover the data points for more information</em></p> <iframe src="http://pena.lt/y/highcharts/mls_average_by_club.html" frameborder="0" scrolling="no" width="100%" height="750"></iframe> <h2>Number of Players</h2> <p>Next I looked at the number of players currently playing in each position. The results are pretty much the same as 2012, with a marginal gain in the number of forwards and defenders registered for this season (Figure 2).</p> <p><em>click the legend headers to show / hide each season’s data and hover the data points for more information</em></p> <iframe src="http://pena.lt/y/highcharts/mls_players_pos.html" frameborder="0" scrolling="no" width="100%" height="500"></iframe> <h2>Average Salary By Position</h2> <p>Next I looked at average salary by position and it is probably no surprise that forwards receive the most remuneration (Figure 3). 
In fact, the higher up the field you are, the more money you earn, with goalkeepers earning the least followed by defenders, midfielders, attacking midfielders and then forwards.</p> <p>The only players outside this trend are defensive midfielders, who earn even less than goalkeepers. In terms of salary this appears to be the least appreciated position by quite a large margin. If you are out to make money then you are much better off specializing as either a clear-cut defender or midfielder rather than something perhaps between the two. Or even better, learn to score goals…</p> <p><em>click the legend headers to show / hide each season’s data and hover over the data points for more information</em></p> <iframe src="http://pena.lt/y/highcharts/mls_salary_pos.html" frameborder="0" scrolling="no" width="100%" height="500"></iframe> <h2>The Big Earners</h2> <p>Although the average salaries show that forwards earn noticeably more than any other position, the actual value is skewed by a few high-profile players earning big bucks. Table 1 shows the top ten earners in the MLS compared with the overall league average. Of the ten players, eight are forwards and two are midfielders. 
The highest paid defender is Toronto’s Darren O’Dea, ranked 18th overall and the highest paid goalkeeper is Portland’s Donovan Ricketts, ranked just 41st overall.</p> <table class="table"> <tbody> <tr> <td><strong>Club</strong></td> <td><strong>Last Name</strong></td> <td><strong>First Name</strong></td> <td><strong>Pos</strong></td> <td><strong>Base Salary</strong></td> <td><strong>Compensation</strong></td> </tr> <tr> <td>NY</td> <td>Henry</td> <td>Thierry</td> <td>F</td> <td>$3,750,000</td> <td>$4,350,000</td> </tr> <tr> <td>LA</td> <td>Keane</td> <td>Robbie</td> <td>F</td> <td>$4,000,000</td> <td>$4,333,333</td> </tr> <tr> <td>NY</td> <td>Cahill</td> <td>Tim</td> <td>M</td> <td>$3,500,000</td> <td>$3,625,000</td> </tr> <tr> <td>LA</td> <td>Donovan</td> <td>Landon</td> <td>F</td> <td>$2,500,000</td> <td>$2,500,000</td> </tr> <tr> <td>MTL</td> <td>Di Vaio</td> <td>Marco</td> <td>F</td> <td>$1,000,008</td> <td>$1,937,508</td> </tr> <tr> <td>SEA</td> <td>Martins</td> <td>Obafemi</td> <td>F</td> <td>$1,600,000</td> <td>$1,725,000</td> </tr> <tr> <td>TOR</td> <td>Koevermans</td> <td>Danny</td> <td>F</td> <td>$1,250,000</td> <td>$1,663,323</td> </tr> <tr> <td>VAN</td> <td>Miller</td> <td>Kenny</td> <td>F</td> <td>$1,114,992</td> <td>$1,132,492</td> </tr> <tr> <td>SEA</td> <td>Montero</td> <td>Fredy</td> <td>F</td> <td>$700,000</td> <td>$856,000</td> </tr> <tr> <td>DAL</td> <td>Ferreira</td> <td>David</td> <td>M-F</td> <td>$625,000</td> <td>$730,000</td> </tr> <tr> <td></td> <td></td> <td></td> <td>League Average</td> <td>$141,903</td> <td>$159,849</td> </tr> </tbody> </table> <p>Since these star players are skewing the averages, we can analyse the <a href="http://en.wikipedia.org/wiki/Median">median</a> salary instead using box and whiskers plots (Figure 4). 
These show the distribution of the different salaries for each position where the thick line across the center of the box is the median salary, the top and bottom of the box are the 75th and 25th <a href="http://en.wikipedia.org/wiki/Percentile">percentiles</a> and the whiskers extend up to 1.5x the <a href="http://en.wikipedia.org/wiki/Interquartile_range">interquartile range</a> beyond the box. Outliers outside of this range are then plotted as dots.</p> <p><img alt="Pelican" src="../../../../images/Rplot01.png" /></p> <p>Looking at the median salaries there is actually very little difference between the outfield players. The average MLS player’s salary is also clearly nothing like that of the league’s star players. In fact, if we remove the top twenty earners then the overall league average falls from $159,849 to $113,516 with a median of $83,000 and a <a href="http://en.wikipedia.org/wiki/Mode_(statistics)">mode</a> of $46,500, which is the league minimum for first teamers (roster positions 1-24).</p> <h2>Conclusions</h2> <p>This is only a quick overview of the data and there is still a lot more to explore so feel free to get in touch if there is anything in particular you want to have a look at.</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Dennis - May 10, 2013</strong></p> <p>Could you just add in the position-specific medians and averages, as well as both of those without the top-paid players? The last chart does a lot of work but is a little short on the details.</p> <p>Also, it looks like the median salary for defenders is higher than midfielders. 
Why is that?</p> <p>Thanks!</p> <div class="hline"></div> <p><strong>Martin Eastwood - May 10, 2013</strong></p> <p>Good idea, I’ll take a look!</p>Martin EastwoodFri, 10 May 2013 19:30:00 +0100tag:,2013-05-10:2013/05/10ChanceMLSThe Eastwood Index, MLS and Parity/2013/05/07<h2>Introduction</h2> <p>I showed in my <a href="http://pena.lt/y/2013/05/02/how-much-does-luck-affect-mls/">last post</a> how Major League Soccer (MLS) is a much more closely matched league than the English Premier League (EPL), with the wage cap and draft system increasing the parity between teams.</p> <h2>The Eastwood Index</h2> <p>This high level of parity can also be seen using the <a href="http://pena.lt/y/2013/02/21/rating-teams-and-predicting-football-matches-using-the-ei-index/">Eastwood Index</a> (EI), a rating system designed to calculate odds of match outcomes when different teams play each other.</p> <p>The Eastwood Index rates teams so that the average rating is 2000 and the higher the rating the better a team is compared with the rest of the league.</p> <p>EI ratings increase when teams win matches or draw against superior opposition and decrease when teams lose matches or draw against weaker opposition. The size of the gain or loss in ratings is linked to the quality of the opposition so that beating a superior team is worth more than winning against a lower ranked team.</p> <p>The change in EI rating is also weighted by the goal difference in the match so that the greater the difference in goals scored or conceded then the greater the change in ratings. Home advantage is also included in the calculations so that a team is expected to perform better when playing at home than when playing away.</p> <h2>Major League Soccer EI Ratings</h2> <p>Currently, the highest rated team in MLS is LA Galaxy, with an EI of 2506 (Table 2) while the lowest is Toronto FC, with an EI of just 1303 (Table 1). 
Outside of this, the majority of teams are fairly evenly matched in MLS and are rated around 1880 – 2300 demonstrating the parity in the league.</p> <table class="table"> <tbody> <tr> <td><strong>Position</strong></td> <td><strong>Club</strong></td> <td><strong>EI Rating</strong></td> </tr> <tr> <td>1</td> <td>New York Red Bulls</td> <td>2225</td> </tr> <tr> <td>2</td> <td>Sporting Kansas City</td> <td>2374</td> </tr> <tr> <td>3</td> <td>Houston Dynamo</td> <td>2210</td> </tr> <tr> <td>4</td> <td>Montreal Impact</td> <td>2052</td> </tr> <tr> <td>5</td> <td>Columbus Crew</td> <td>2082</td> </tr> <tr> <td>6</td> <td>Philadelphia Union</td> <td>1809</td> </tr> <tr> <td>7</td> <td>New England Revolution</td> <td>1610</td> </tr> <tr> <td>8</td> <td>Chicago Fire</td> <td>2063</td> </tr> <tr> <td>9</td> <td>Toronto FC</td> <td>1303</td> </tr> </tbody> </table> <p><strong>Table 1: MLS Eastern Conference EI Ratings</strong></p> <table class="table"> <tbody> <tr> <td><strong>Position</strong></td> <td><strong>Club</strong></td> <td><strong>EI Rating</strong></td> </tr> <tr> <td>1</td> <td>FC Dallas</td> <td>2150</td> </tr> <tr> <td>2</td> <td>LA Galaxy</td> <td>2506</td> </tr> <tr> <td>3</td> <td>Real Salt Lake</td> <td>2271</td> </tr> <tr> <td>4</td> <td>Portland Timbers</td> <td>1804</td> </tr> <tr> <td>5</td> <td>Colorado Rapids</td> <td>1871</td> </tr> <tr> <td>6</td> <td>Chivas USA</td> <td>1433</td> </tr> <tr> <td>7</td> <td>San Jose Earthquakes</td> <td>2267</td> </tr> <tr> <td>8</td> <td>Vancouver Whitecaps</td> <td>1690</td> </tr> <tr> <td>9</td> <td>Seattle Sounders FC</td> <td>2392</td> </tr> </tbody> </table> <p><strong>Table 2: MLS Western Conference EI Ratings</strong></p> <p>It is still a bit early in the season to draw too many conclusions but if we combine the two MLS conferences together then Columbus currently come out as mid-table, with an EI rating of 2082. 
This is just 424 lower than the top team (LA Galaxy) and 779 higher than the bottom of the league (Toronto FC), and close to the theoretical league average EI of 2000.</p> <h2>MLS Compared with EPL</h2> <p>Compare this with the EPL (Table 3) and you can see an immediate difference in the level of parity. Taking the average of West Ham and Stoke to be the middle of the table then a mid-placed EPL team’s EI is below the theoretical average at 1634, just 264 better than QPR at the bottom of the table and a gigantic 1431 away from Manchester United. The top of the EPL has been very much a league-within-a-league for a while now, with average teams vastly more likely to be relegated than they are of ever winning anything or even reaching the European qualification spots.</p> <table class="table"> <tbody> <tr> <td><strong>Position</strong></td> <td><strong>Club</strong></td> <td><strong>EI Rating</strong></td> </tr> <tr> <td>1</td> <td>Manchester United</td> <td>3064</td> </tr> <tr> <td>2</td> <td>Manchester City</td> <td>2909</td> </tr> <tr> <td>3</td> <td>Chelsea</td> <td>2598</td> </tr> <tr> <td>4</td> <td>Arsenal</td> <td>2627</td> </tr> <tr> <td>5</td> <td>Tottenham Hotspur</td> <td>2514</td> </tr> <tr> <td>6</td> <td>Everton</td> <td>2351</td> </tr> <tr> <td>7</td> <td>Liverpool</td> <td>2291</td> </tr> <tr> <td>8</td> <td>West Bromwich Albion</td> <td>1883</td> </tr> <tr> <td>9</td> <td>Swansea City</td> <td>1797</td> </tr> <tr> <td>10</td> <td>West Ham United</td> <td>1520</td> </tr> <tr> <td>11</td> <td>Stoke City</td> <td>1747</td> </tr> <tr> <td>12</td> <td>Fulham</td> <td>1806</td> </tr> <tr> <td>13</td> <td>Aston Villa</td> <td>1704</td> </tr> <tr> <td>14</td> <td>Southampton</td> <td>1611</td> </tr> <tr> <td>15</td> <td>Sunderland</td> <td>1741</td> </tr> <tr> <td>16</td> <td>Norwich City</td> <td>1607</td> </tr> <tr> <td>17</td> <td>Newcastle United</td> <td>1814</td> </tr> <tr> <td>18</td> <td>Wigan Athletic</td> <td>1650</td> </tr> <tr> <td>19</td> 
<td>Reading</td> <td>1397</td> </tr> <tr> <td>20</td> <td>Queens Park Rangers</td> <td>1369</td> </tr> </tbody> </table> <p><strong>Table 3: EPL EI Ratings</strong></p> <h2>Parity</h2> <p>Compared with the EPL, MLS is a very evenly matched league where the margins between the top and bottom of the conferences are small, making it a really exciting league to follow as virtually any team is in with a chance of reaching the playoffs at the start of the season.</p> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodTue, 07 May 2013 19:30:00 +0100tag:,2013-05-07:2013/05/07ChanceMLSEI Match Probabilities for the English Premier League/2013/05/03<p>Here are the latest match probabilities for the English Premier League calculated using the Eastwood Index (EI).</p> <p>Please note that I have not included next week’s mid-week matches yet as the odds for those will change depending on how this weekend’s matches finish. I’ll try and add those on Tuesday once I have all the data available.</p> <p><em>Edit – Table 1 is now updated with the odds for this week’s mid week matches.</em></p> <table class="table"> <tbody> <tr> <td><strong>Home Team</strong></td> <td><strong>Away Team</strong></td> <td><strong>Home (%)</strong></td> <td><strong>Draw (%)</strong></td> <td><strong>Away (%)</strong></td> </tr> <tr> <td>Fulham</td> <td>Reading</td> <td>59</td> <td>25</td> <td>16</td> </tr> <tr> <td>Norwich</td> <td>Aston Villa</td> <td>44</td> <td>30</td> <td>26</td> </tr> <tr> <td>Swansea</td> <td>Man City</td> <td>15</td> <td>31</td> <td>54</td> </tr> <tr> <td>Tottenham</td> <td>Southampton</td> <td>68</td> <td>20</td> <td>12</td> </tr> <tr> <td>West Brom</td> <td>Wigan</td> <td>54</td> <td>27</td> <td>19</td> </tr> <tr> <td>West Ham</td> <td>Newcastle</td> <td>37</td> <td>32</td> <td>32</td> </tr> <tr> <td>QPR</td> <td>Arsenal</td> <td>13</td> <td>30</td> <td>57</td> </tr> <tr> <td>Liverpool</td> <td>Everton</td> <td>44</td> <td>30</td> <td>26</td> </tr> <tr> <td>Man 
United</td> <td>Chelsea</td> <td>60</td> <td>24</td> <td>16</td> </tr> <tr> <td>Sunderland</td> <td>Stoke</td> <td>45</td> <td>30</td> <td>25</td> </tr> <tr> <td>Man City</td> <td>West Brom</td> <td>72</td> <td>18</td> <td>10</td> </tr> <tr> <td>Wigan</td> <td>Swansea</td> <td>41</td> <td>31</td> <td>28</td> </tr> <tr> <td>Chelsea</td> <td>Tottenham</td> <td>47</td> <td>30</td> <td>23</td> </tr> </tbody> </table> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodFri, 03 May 2013 19:30:00 +0100tag:,2013-05-03:2013/05/03EIEPLHow Much Does Luck Affect MLS?/2013/05/02<h2>Introduction</h2> <p>Following my recent article for <a href="http://www.bettingexpert.com/blog/football-luck">Betting Expert</a> quantifying how large a role luck plays in the English Premier League (EPL) I thought it would be interesting to look at Major League Soccer (MLS) too.</p> <h2>MLS Structure</h2> <p>MLS is structured differently to the EPL as it has followed other North American sports in implementing wage caps and player drafts. Unlike the current salary free-for-all in the EPL, MLS clubs are currently limited to spending a maximum of $2.95 million in wages over their first 20 roster spots, with up to three additional designated players paid (partially) outside of this salary cap.</p> <p>MLS also has a draft system that takes place each January during which teams can sign players graduating from college or otherwise signed by the league. The draft is split into three rounds and is designed to give priority to the league’s weaker teams allowing them first choice of players ahead of the more successful teams.</p> <p>MLS is also a shorter season than the EPL with teams playing just 34 matches compared with the EPL’s 38. 
This is important as the more matches that are played, the more opportunity talent has to overcome luck.</p> <p>Overall, this all works towards increasing parity in the MLS and making it a more evenly balanced league, which in turn should enhance the role luck plays.</p> <h2>Results</h2> <p>Using <a href="http://www.bettingexpert.com/blog/football-luck">my adaptation</a> of <a href="http://www.tangotiger.net/">Tom ‘Tango’ Tiger’s</a> <a href="http://blog.philbirnbaum.com/2006/08/on-correlation-r-and-r-squared.html">baseball equation</a> I calculated the average win rate in MLS going back to 2004 and the variance of the win rate. I then calculated the variance expected due to luck and subtracted one from the other to get the amount of variance attributed to talent.</p> <p>Luck accounts for around 35% of a team’s win rate in the EPL and I was expecting MLS to be higher, but it initially came out at a staggering 82% for MLS. Instinctively this seems too high and I suspect it is inaccurate due to the changes in MLS’s structure over the years. For example, back in 2004 there were only ten teams and one conference while there are currently 19 teams and two conferences. There have also been changes to the level of the salary cap and the number of designated players allowed over this time period.</p> <p>So I went back and reprocessed the results using just the 2010–2012 data. Although this reduces the sample size considerably it leaves us with data more representative of the current state of MLS. And the results this time? Luck accounted for around 57% of a team’s win percentage compared with just 43% for talent.</p> <p>So compared with the EPL, the structure of MLS does appear to increase parity and enhance the influence luck has in deciding the league champions. 
In fact, being lucky is probably the more important of the two, although luck on its own is not enough – you need to be a talented team with luck on its side to win the MLS Cup.</p> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodThu, 02 May 2013 19:30:00 +0100tag:,2013-05-02:2013/05/02ChanceMLSHow Much Does Luck Affect Football?/2013/04/30<h2>Introduction</h2> <p>I’ve written a new article for Betting Expert quantifying how much luck affects football. Take a look <a href="http://www.bettingexpert.com/blog/football-luck">here</a> as it is probably more than you are expecting!</p> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodTue, 30 Apr 2013 19:30:00 +0100tag:,2013-04-30:2013/04/30ChanceWhat Is A Meaningful Sample Size?/2013/04/28<h2>Introduction</h2> <p>I had an article published at Betting Expert last week looking at how to determine statistically how much data you need to make accurate predictions.</p> <p>Find out more by reading the rest of this article <a href="http://www.bettingexpert.com/blog/how-much-data">here</a>.</p> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodSun, 28 Apr 2013 19:30:00 +0100tag:,2013-04-28:2013/04/28EI Match Probabilities for the English Premier League/2013/04/26<p>It’s Friday again so here are this weekend’s match probabilities for the English Premier League.</p> <p>I was a little surprised to see Manchester United come out as favourites against Arsenal, even though they are away from home, but the odds are so close that it looks like a potential draw. However, it all depends on Sir Alex Ferguson’s squad selection: with the league won, will he rest the bigger stars and let some of the second-string players reach enough appearances to be eligible for a winner’s medal? 
Anders Lindegaard, for example, still needs to play another two matches this season to claim his medal.</p> <p>Other possible draws include Southampton Vs West Brom and Newcastle Vs Liverpool, while you’d hope Reading Vs QPR will not be a draw as a single point is useless for either team.</p> <table class="table"> <tbody> <tr> <td><strong>Home Team</strong></td> <td><strong>Away Team</strong></td> <td><strong>Home (%)</strong></td> <td><strong>Draw (%)</strong></td> <td><strong>Away (%)</strong></td> </tr> <tr> <td>Man City</td> <td>West Ham</td> <td>78</td> <td>13</td> <td>9</td> </tr> <tr> <td>Everton</td> <td>Fulham</td> <td>58</td> <td>25</td> <td>17</td> </tr> <tr> <td>Southampton</td> <td>West Brom</td> <td>39</td> <td>31</td> <td>29</td> </tr> <tr> <td>Stoke</td> <td>Norwich</td> <td>47</td> <td>29</td> <td>24</td> </tr> <tr> <td>Wigan</td> <td>Tottenham</td> <td>20</td> <td>32</td> <td>48</td> </tr> <tr> <td>Newcastle</td> <td>Liverpool</td> <td>35</td> <td>32</td> <td>33</td> </tr> <tr> <td>Reading</td> <td>QPR</td> <td>44</td> <td>30</td> <td>26</td> </tr> <tr> <td>Chelsea</td> <td>Swansea</td> <td>65</td> <td>22</td> <td>13</td> </tr> <tr> <td>Arsenal</td> <td>Man United</td> <td>31</td> <td>32</td> <td>37</td> </tr> <tr> <td>Aston Villa</td> <td>Sunderland</td> <td>40</td> <td>31</td> <td>29</td> </tr> </tbody> </table> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodFri, 26 Apr 2013 19:30:00 +0100tag:,2013-04-26:2013/04/26EIEPLEI Match Probabilities for the English Premier League/2013/04/19<p>It’s been a busy day but I’ve finally got the probabilities for this weekend’s matches completed.</p> <p>There are some pretty close games, with QPR Vs Stoke, Tottenham Vs Man City and Liverpool Vs Chelsea all looking like potential draws. 
Plus you could maybe throw Sunderland Vs Everton and even West Ham Vs Wigan into that group too.</p> <p>The only clear favourites are Manchester United and Norwich so it’s going to be a tricky week to call.</p> <table class="table"> <tbody> <tr> <td>Home Team</td> <td>Away Team</td> <td>Home (%)</td> <td>Draw (%)</td> <td>Away (%)</td> </tr> <tr> <td>Fulham</td> <td>Arsenal</td> <td>26</td> <td>32</td> <td>42</td> </tr> <tr> <td>Norwich</td> <td>Reading</td> <td>53</td> <td>27</td> <td>20</td> </tr> <tr> <td>QPR</td> <td>Stoke</td> <td>37</td> <td>32</td> <td>31</td> </tr> <tr> <td>Sunderland</td> <td>Everton</td> <td>28</td> <td>33</td> <td>39</td> </tr> <tr> <td>Swansea</td> <td>Southampton</td> <td>49</td> <td>29</td> <td>22</td> </tr> <tr> <td>West Brom</td> <td>Newcastle</td> <td>45</td> <td>30</td> <td>25</td> </tr> <tr> <td>West Ham</td> <td>Wigan</td> <td>41</td> <td>31</td> <td>28</td> </tr> <tr> <td>Tottenham</td> <td>Man City</td> <td>31</td> <td>33</td> <td>36</td> </tr> <tr> <td>Liverpool</td> <td>Chelsea</td> <td>36</td> <td>32</td> <td>32</td> </tr> <tr> <td>Man United</td> <td>Aston Villa</td> <td>79</td> <td>12</td> <td>9</td> </tr> </tbody> </table> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodFri, 19 Apr 2013 19:30:00 +0100tag:,2013-04-19:2013/04/19EIEPLEI Match Predictions for the English Premier League/2013/04/12<p>Here we go with this week’s predictions from the Eastwood Index (EI)!</p> <p>The EI doesn’t hold out much chance for Wigan or West Ham getting three points off the two Manchester clubs this week, with the lowest odds I think the EI has ever produced. 
Interestingly, both of Fulham’s matches look like possible draws.</p> <table class="table"> <tbody> <tr> <td>Home Team</td> <td>Away Team</td> <td>Home (%)</td> <td>Draw (%)</td> <td>Away (%)</td> </tr> <tr> <td>Arsenal</td> <td>Norwich</td> <td>70</td> <td>19</td> <td>11</td> </tr> <tr> <td>Aston Villa</td> <td>Fulham</td> <td>35</td> <td>32</td> <td>33</td> </tr> <tr> <td>Everton</td> <td>QPR</td> <td>70</td> <td>18</td> <td>11</td> </tr> <tr> <td>Reading</td> <td>Liverpool</td> <td>22</td> <td>32</td> <td>46</td> </tr> <tr> <td>Southampton</td> <td>West Ham</td> <td>50</td> <td>29</td> <td>21</td> </tr> <tr> <td>Newcastle</td> <td>Sunderland</td> <td>48</td> <td>29</td> <td>23</td> </tr> <tr> <td>Stoke</td> <td>Man United</td> <td>10</td> <td>29</td> <td>61</td> </tr> <tr> <td>Arsenal</td> <td>Everton</td> <td>51</td> <td>28</td> <td>21</td> </tr> <tr> <td>Man City</td> <td>Wigan</td> <td>77</td> <td>14</td> <td>9</td> </tr> <tr> <td>West Ham</td> <td>Man United</td> <td>7</td> <td>27</td> <td>66</td> </tr> <tr> <td>Fulham</td> <td>Chelsea</td> <td>32</td> <td>32</td> <td>36</td> </tr> </tbody> </table> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Andrew Beasley - April 13, 2013</strong></p> <p>Hi Martin – is that Villa v Fulham prediction the closest you’ve had (as in a range of just three between largest and smallest)?</p> <p>Ever had a 33/33/34 for instance?</p> <p>Cheers.</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 13, 2013</strong></p> <p>I think it is the closest yet. 
I wonder if equal odds of home and away win suggest a draw is likely?</p>Martin EastwoodFri, 12 Apr 2013 19:30:00 +0100tag:,2013-04-12:2013/04/12EIEPLEI Match Predictions for the English Premier League/2013/04/05<p>A couple of weeks back I demonstrated how the EI is more accurate than the bookies based on ranked probability scores, but a few people have asked if I can do something a bit simpler, so Figure 1 shows how often the EI picked the winner as being the favourite compared with aggregated bookmakers’ odds. It’s pretty close but the EI seems to have a small but reasonably constant margin over the bookmakers so far this season.</p> <p><img alt="Pelican" src="../../../../images/EI_BM_2012_2013.png" /> <strong>Figure 1: EI Versus Bookmakers</strong></p> <p>Last week turned out to be a pretty good week with the EI managing to correctly predict the winner in eight out of the ten matches played. I’d made a few minor tweaks before posting last week’s odds to try and enhance the way draws and away wins are calculated so hopefully the EI will be able to maintain its edge over the bookmakers.</p> <p>Anyway, here are this week’s odds:</p> <table class="table"> <tbody> <tr> <td><strong>Home Team</strong></td> <td><strong>Away Team</strong></td> <td><strong>Home (%)</strong></td> <td><strong>Draw (%)</strong></td> <td><strong>Away (%)</strong></td> </tr> <tr> <td>Reading</td> <td>Southampton</td> <td>38</td> <td>32</td> <td>30</td> </tr> <tr> <td>Norwich</td> <td>Swansea</td> <td>41</td> <td>31</td> <td>28</td> </tr> <tr> <td>Stoke</td> <td>Aston Villa</td> <td>50</td> <td>29</td> <td>21</td> </tr> <tr> <td>West Brom</td> <td>Arsenal</td> <td>26</td> <td>32</td> <td>42</td> </tr> <tr> <td>Liverpool</td> <td>West Ham</td> <td>66</td> <td>22</td> <td>12</td> </tr> <tr> <td>Tottenham</td> <td>Everton</td> <td>49</td> <td>29</td> <td>22</td> </tr> <tr> <td>Chelsea</td> <td>Sunderland</td> <td>66</td> <td>21</td> <td>13</td> </tr> <tr> <td>Newcastle</td> <td>Fulham</td> 
<td>44</td> <td>30</td> <td>26</td> </tr> <tr> <td>QPR</td> <td>Wigan</td> <td>39</td> <td>31</td> <td>30</td> </tr> <tr> <td>Man United</td> <td>Man City</td> <td>51</td> <td>28</td> <td>21</td> </tr> </tbody> </table> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Dan - April 11, 2013</strong></p> <p>Hi Martin,</p> <p>Enjoy reading your articles.</p> <p>Just wondered if you considered your Liverpool-West Ham prediction as a success versus the bookmakers.</p> <p>I ask as although you suggest Liverpool are the most likely winners (66%) the odds makers priced them up as 1/3 (75%) chances. So in that case would you say your data would trigger a lay of Liverpool at 1/3 rather than a back?</p> <p>Cheers,</p> <p>Dan.</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 11, 2013</strong></p> <p>Hi Dan</p> <p>That’s an interesting point. I am still working on what is the best strategy to use with the model. Ideally I will be able to identify a pattern between where my odds diverge from the bookmakers and from there work out where the real value lies.</p> <p>Liverpool’s odds from the bookmakers are always a little strange though as they supposedly get bet on very heavily from Asia, which leads to the odds moving towards Liverpool so I suspect they were rated more likely to win than they actually were by the bookmakers.</p> <p>Cheers,</p> <p>Martin</p>Martin EastwoodFri, 05 Apr 2013 19:30:00 +0100tag:,2013-04-05:2013/04/05EIEPLUnderstanding Total Shot Ratio in Football/2013/04/02<h2>Introduction</h2> <p>The use of Total Shot Ratio (or TSR) seems to have slowly been gaining ground so I thought it would be worth analyzing the statistic in more detail to see what it can and cannot do.</p> <h2>What is Total Shot Ratio?</h2> <p>Put simply Total Shot Ratio is the proportion of shots taken by one team compared with another. 
It can be calculated by dividing the number of shots taken by a team by the total shots overall (Figure 1).</p> <p><span class="math">\(TSR=ShotsFor/(ShotsFor+ShotsAgainst)\)</span></p> <p><strong>Figure 1: Total Shot Ratio</strong></p> <p>It is often used as a surrogate for dominance as the presumption is that the team taking the majority of the shots will be controlling the match and possibly limiting the opposition’s ability to shoot at goal.</p> <h2>Total Shot Ratio Data</h2> <p>Using data taken from the football-data.co.uk website I calculated the Total Shot Ratios for all matches from the English Premier League going back to the 2001-2002 season, giving a total of 8360 data points, which are approximately normally distributed (Figure 2).</p> <p><img alt="Pelican" src="../../../../images/130402-TSR-Distribution.png" /> <strong>Figure 2: Distribution of Total Shot Ratios</strong></p> <p>The average Total Shot Ratio is always 0.5, because for every value above 0.5 you always have an equivalent value below it for the opposition. For example, if the home team’s Total Shot Ratio is 0.75 then the away team’s ratio must be 0.25.</p> <p><span class="math">\((0.75 + 0.25) / 2 = 0.5\)</span></p> <p>The standard deviation, which is a measure of the dispersion of the data around the average value, was 0.166.</p> <h2>Total Shot Ratio: Correlation With Goals Scored</h2> <p>Since Total Shot Ratios are being used to show dominance in a match it makes sense to assess the correlation with the number of goals scored. The higher a team’s Total Shot Ratio, the more shots it is having compared with the opposition, so the expectation would be that it would score more goals. 
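</p> <p>As a rough sketch of the calculations used in this section, the Python below computes a Total Shot Ratio for each side of a match and the r-squared between two lists of values. The function names, the <code>shots_against</code> argument and the shot counts are my own illustrative choices rather than the code or data behind the figures in this article.</p>

```python
def total_shot_ratio(shots_for, shots_against):
    """Proportion of all shots in a match that were taken by one team."""
    total = shots_for + shots_against
    if total == 0:
        # Assumption: treat a match with no shots at all as an even split.
        return 0.5
    return shots_for / total


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)


# Hypothetical match: the home team takes 15 shots and the away team takes 5.
home_tsr = total_shot_ratio(15, 5)
away_tsr = total_shot_ratio(5, 15)
```

<p>The two ratios for a match always sum to one, which is why the league-wide average has to be 0.5. Passing a list of ratios and a list of goals scored to <code>r_squared</code> gives the kind of match-level correlation discussed here.</p> <p>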
However, this does not seem to be the case as the relationship between the two is extremely weak (Figure 3; r<sup>2</sup>=0.079).</p> <p><img alt="Pelican" src="../../../../images/130402-TSR-Goals.png" /> <strong>Figure 3: Correlation Between Total Shot Ratio and Goals Scored</strong></p> <h2>Total Shot Ratio: Correlation With Goal Difference</h2> <p>So how about the relationship between Total Shot Ratio and goal difference instead? Since teams with higher Total Shot Ratios are thought to be dominating matches, perhaps they are more likely to have a higher goal difference in the match as they may also be less likely to concede goals? Again though the correlation is weak (Figure 4; r<sup>2</sup>=0.11).</p> <p><img alt="Pelican" src="../../../../images/130402-TSR-Goal-Diff.png" /> <strong>Figure 4: Correlation Between Total Shot Ratio and Goal Difference</strong></p> <h2>Total Shot Ratio: Correlation With Match Outcomes</h2> <p>The relationship between Total Shot Ratio and match outcome is also poor (r<sup>2</sup>=0.066), suggesting that Total Shot Ratio has very little influence on the likelihood of a team winning a particular match. Just because you are taking a greater proportion of the shots does not mean you are any more likely to win.</p> <h2>Total Shot Ratio: Correlation With Points Per Season</h2> <p>Although the match-by-match correlations above are weak there is the suggestion of a trend so it may be that Total Shot Ratio is heavily luck driven in the short term and that we need more matches before we can see the overall effects of a higher ratio. For example, looking at the correlation between Total Shot Ratio and points over an entire season shows a pretty decent relationship between the two (Figure 5). 
This suggests that, over the long term, possessing a higher Total Shot Ratio is in fact associated with fewer matches being lost per season.</p> <p><img alt="Pelican" src="../../../../images/130402-TSR-Points.png" /> <strong>Figure 5: Correlation Between Total Shot Ratio and Points</strong></p> <h2>Total Shot Ratio: How Much Data is Enough?</h2> <p>So if Total Shot Ratio is only becoming meaningful over longer periods of time then how much data do we actually need before it becomes a useful metric? To look at this I calculated the overall Total Shot Ratio per season by team and then randomized the order of the matches within that season. I then looked at how the deviation changed over the course of a season compared with the overall ratio, e.g. after five matches, ten matches etc. (Figure 6).</p> <p><img alt="Pelican" src="../../../../images/130402-TSR-Deviation.png" /> <strong>Figure 6: Deviation in Total Shot Ratio by Sample Size</strong></p> <p>As more data is used to calculate the Total Shot Ratio it moves closer towards its true value and the deviation decreases as any outlier matches become less influential. With fewer matches being used to calculate the Total Shot Ratio there is more dispersion and variability in the calculated value. Interestingly, there is still a reduction in the deviation moving from 30 matches to 38 matches, suggesting that we may need at least a full season’s worth of data to get an accurate measure of a team’s Total Shot Ratio.</p> <h2>Total Shot Ratio: Calculated Sample Size</h2> <p>Another option to find out how much data we need is to calculate the sample size required to identify specific differences in Total Shot Ratio. 
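</p> <p>One way to sketch this kind of calculation in Python, using only the standard library, is the normal approximation to a two-sample t-test. The function name <code>matches_needed</code>, the two-sided 5% significance level and the 80% power are all assumptions made for illustration, so treat the output as a ballpark figure rather than a reproduction of any particular software’s result.</p>

```python
from math import ceil
from statistics import NormalDist


def matches_needed(difference, std_dev, alpha=0.05, power=0.8):
    """Approximate matches per team needed to detect a difference in mean TSR.

    Uses the normal approximation to a two-sided, two-sample t-test with
    equal variances. alpha and power are assumed values for illustration.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * std_dev / difference) ** 2
    return ceil(n)
```

<p>Calling <code>matches_needed(0.1, 0.166)</code> plugs in the standard deviation of 0.166 reported earlier. Because this is a normal approximation, an exact t-based calculation will come out slightly higher, and halving the difference roughly quadruples the number of matches required.</p> <p>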
There are a number of different methods for this, but the commonly used t-test sample size estimation suggests that it takes 45 matches to be 95% certain that two teams with a difference in Total Shot Ratio of 0.1 are actually different from each other.</p> <p>So, to be statistically confident that a team with a Total Shot Ratio of 0.6 actually has a higher ratio than a team with a Total Shot Ratio of 0.5, rather than the difference just being down to random variability, requires over a season’s worth of matches to be played.</p> <p>As the differences become smaller the number of matches required increases even further – to identify a difference in Total Shot Ratio of 0.05 takes nearly five seasons’ worth of matches!</p> <h2>Conclusions</h2> <p>In the short term, Total Shot Ratio appears to be virtually meaningless in terms of goals scored or match outcomes as its variability is so high.</p> <p>Over the long term though, skill outweighs luck and Total Shot Ratio becomes increasingly correlated with outcomes. However, it may take a long time for this to occur and Total Shot Ratio may be less accurate than other statistics available if you are interested in predicting performance.</p> <p>Finally, this article is not intended to say “do not use Total Shot Ratio” as it is still an interesting metric. Rather, make sure that you are aware of its abilities and limitations if you are planning on using it for analysis.</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Bob - April 2, 2013</strong></p> <p>You’re getting closer to the holy grail. Quality of shot data is important. TSR can be refined into something slightly less random (though shots alone are still less random than goals).</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 2, 2013</strong></p> <p>I totally agree, the quality of shots is important. You can improve your TSR by taking lots and lots of shots but they are not necessarily going to improve your chances of scoring. 
It is the quality of shots that are important not the amount of them.</p> <div class="hline"></div> <p><strong>sidereal - April 2, 2013</strong></p> <p>A good compromise might be looking at shots in the box rather than overall shots. It’s a little bit more recordkeeping (though Opta has them for leagues it tracks) without having to subjectively evaluate shot quality. And I suspect it’d correlate better over a shorter sample size. I can run the correlation between TBSR and results in MLS when I get some time.</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 2, 2013</strong></p> <p>Sounds good, would be interesting to see how the correlation looks</p> <div class="hline"></div> <p><strong>sidereal - April 4, 2013</strong></p> <p>Had time to run this today. With two years of MLS data I found substantially lower correlations than your EPL data. Possibly because of the smaller sample size. More likely because MLS shot quality is lower and more random.</p> <p>R squared for TSR to goals is 0.0245 and to GD is 0.05. Switching to box shots improves those marginally to 0.042 and 0.087.</p> <p>But at the season level the improvement mostly goes away. TSR to seasonal PPG is 0.324 and TBSR to seasonal PPG is 0.349.</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 4, 2013</strong></p> <p>Thanks, that is really interesting to see!</p> <p>There is also much more parity in the MLS compared with other leagues too, which may be having an effect as the more closely matched the teams are then the more impact luck has on determining outcomes.</p> <div class="hline"></div> <p><strong>Bob - April 2, 2013</strong></p> <p>It isn’t actually that difficult to measure quality of shots, as long as you’re prepared to put in 10-15 minutes work per week (per league). 
For the top five leagues in Europe anyway, plus a few others (including the npower leagues from next season).</p> <p>I agree shots alone has it’s limitations but over a 20-25 game sample, I do think TSR is an extremely valuable measure and one that has called a few regressions this season (Sunderland the most obvious) that most observers did not see coming.</p> <div class="hline"></div> <p><strong>shuddertothink - April 3, 2013</strong></p> <p>What would be the best way to quantify ‘shot quality’?</p> <p>The best we have on ‘shot quality’ is Shots on target to points has an r2 of .0685 in 2012/13.</p> <p>There may be issues with the sample size of just 600 data points in comparison to 7600 or so in Martin’s 10 year sample.</p> <p>As was stated skill outweighs luck given a bigger sample</p> <div class="hline"></div> <p><strong>Turkish - November 29, 2013</strong></p> <p>Do you teach any courses at the moment on football based prediction?</p> <p>I am sure a lot of people would be keen to see you present your information – would you be interested on doing that?</p> <div class="hline"></div> <p><strong>Martin Eastwood - December 4, 2013</strong></p> <p>Hi Matthew,</p> <p>I don’t have any courses planned but it would be something really interesting to do if there was enough interest from people!</p> <p>Cheers,</p> <p>Martin</p> <div class="hline"></div> <p><strong>Dzof - June 14, 2014</strong></p> <p>Thanks for the article, very interesting.</p> <p>Do you have the numerical values for figure 4 published anywhere (Correlation between shot ratio and goal difference)? Or raw data?</p> <p>I basically was looking for the observed probability a team wins/draws/loses a game given a certain shot ratio, e.g. 
When a team has 60-70% shot ratio, what % of games do they win?</p> <p>Extending this, the other question that comes to mind is if this probability is consistent across seasons / teams / leagues.</p> <p>Keep up the good work!</p> <div class="hline"></div> <p><strong>Martin Eastwood - June 14, 2014</strong></p> <p>Not to hand but I’m planning an update to the site over the summer to provide access to that sort of thing so keep an eye out for that!</p> <div class="hline"></div> <p><strong>Michael - October 31, 2014</strong></p> <p>Thanks for the article. I was looking for statistics like that! You said there are other statistics for predicting performance which are more accurate than the TSR. Which ones were you talking about?</p> <div class="hline"></div> <p><strong>Martin Eastwood - October 31, 2014</strong></p> <p>Hi Michael, thanks for the message. Take a look at expected goals to start with :)</p> <script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? "innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: 'center'," + " displayIndent: '0em'," + " showMathMenu: true," + " tex2jax: { " + " inlineMath: [ ['\\\\(','\\\\)'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! 
important'} }" + " } " + "}); "; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); } </script>Martin EastwoodTue, 02 Apr 2013 19:30:00 +0100tag:,2013-04-02:2013/04/02TSREI Match Predictions for the English Premier League/2013/03/28<p>After last week’s international matches, domestic football is finally back so here are this weekend’s match predictions using my EI predictive model. Let’s see if it can keep up its good form and continue to beat the bookmakers!</p> <table class="table"> <tbody> <tr> <td><strong>Home Team</strong></td> <td><strong>Away Team</strong></td> <td><strong>Home (%)</strong></td> <td><strong>Draw (%)</strong></td> <td><strong>Away (%)</strong></td> </tr> <tr> <td>Sunderland</td> <td>Man Utd</td> <td>10</td> <td>29</td> <td>61</td> </tr> <tr> <td>Arsenal</td> <td>Reading</td> <td>74</td> <td>16</td> <td>10</td> </tr> <tr> <td>Man City</td> <td>Newcastle</td> <td>70</td> <td>18</td> <td>11</td> </tr> <tr> <td>Southampton</td> <td>Chelsea</td> <td>20</td> <td>32</td> <td>48</td> </tr> <tr> <td>Swansea</td> <td>Tottenham</td> <td>27</td> <td>32</td> <td>40</td> </tr> <tr> <td>West Ham</td> <td>West Brom</td> <td>30</td> <td>32</td> <td>37</td> </tr> <tr> <td>Wigan</td> <td>Norwich</td> <td>43</td> <td>30</td> <td>26</td> </tr> <tr> <td>Everton</td> <td>Stoke City</td> <td>60</td> <td>24</td> <td>16</td> </tr> <tr> <td>Aston Villa</td> <td>Liverpool</td> <td>28</td> <td>30</td> <td>42</td> </tr> <tr> <td>Fulham</td> <td>QPR</td> <td>59</td> <td>25</td> <td>16</td> </tr> </tbody> </table> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodThu, 28 Mar 2013 19:30:00 +0000tag:,2013-03-28:2013/03/28EIEPLHow Accurate Are The EI Football Predictions?/2013/03/21<h2>Introduction</h2> <p>Unfortunately time caught up with me last week and I was unable to post any predictions from my Eastwood Index. 
However, since then I have been busy validating the results to see how accurate the predictions really are using the 296 matches played in the English Premier League so far this season.</p> <h2>Ranked Probability Scores</h2> <p>I have previously discussed the problems of trying to determine the accuracy of probability-based models and Jonas posted a suggestion in the comments section recommending the use of ranked probability scores, which turned out to be a really interesting idea.</p> <p>Ranked probability scores were originally proposed by <a href="http://www.inmet.gov.br/documentos/cursoI_INMET_IRI/Climate_Information_Course/References/Epstein_1969.pdf">Epstein</a> back in 1969 as a way to compare probabilistic forecasts against categorical data. Their main advantage over other techniques is that as well as looking at accuracy, they also account for distance in the predictions, i.e. how far out inaccurate predictions are from what actually happened.</p> <p>They are also easy to calculate. The equation for ranked probability scores is shown in Figure 1 for those of a mathematical disposition, where <span class="math">\(K\)</span> is the number of possible outcomes, and <span class="math">\(CDF_{fc}\)</span> and <span class="math">\(CDF_{obs}\)</span> are the cumulative forecast and observed probabilities up to outcome <span class="math">\(k\)</span>.</p> <p><img alt="Pelican" src="../../../../images/rpseqn.png" /></p> <h2>Interpreting Ranked Probability Scores</h2> <p>Ranked probability scores range from 0 to 1 and are negatively orientated, meaning that the lower the result the better. 
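</p> <p>To make the calculation concrete, here is a minimal Python sketch of the ranked probability score for a single match. It assumes the usual form of the score, with the outcomes ordered home win, draw, away win and both the forecast and the observation turned into cumulative distributions before they are compared. The function name <code>ranked_probability_score</code> and the example probabilities are illustrative only and are not taken from the Eastwood Index.</p>

```python
def ranked_probability_score(forecast, observed_index):
    """Return the ranked probability score for one match.

    forecast: probabilities for the ordered outcomes (home, draw, away).
    observed_index: position in that ordering of the actual result.
    """
    k = len(forecast)
    cumulative_forecast = 0.0
    cumulative_observed = 0.0
    total = 0.0
    # The final cumulative term is always 1 - 1 = 0, so K - 1 terms suffice.
    for i in range(k - 1):
        cumulative_forecast += forecast[i]
        if i == observed_index:
            cumulative_observed = 1.0
        total += (cumulative_forecast - cumulative_observed) ** 2
    return total / (k - 1)


# Hypothetical forecast of 50% home win, 30% draw and 20% away win.
example_forecast = [0.5, 0.3, 0.2]
score_if_home_win = ranked_probability_score(example_forecast, 0)
score_if_away_win = ranked_probability_score(example_forecast, 2)
```

<p>With this made-up forecast the score is much lower if the home team wins than if the away team wins, because the away win is two ordered steps away from the favoured outcome. A score for a whole season would then be the average of the per-match scores.</p> <p>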
For simplicity, you can think of them as representing the amount of error in the predictions, where a score of zero means your predictions are perfect.</p> <h2>The Results</h2> <p>I started off looking at how well I would have done if I had just guessed at random for each match in the English Premier League this season rather than using the Eastwood Index and obtained a ranked probability score of 0.231.</p> <p>Next, I looked at how well the bookmakers’ odds predicted matches. To do this I aggregated the odds from multiple bookmakers, partly to reduce the comparisons needed and partly because aggregating data often improves predictions and I wanted to give the Eastwood Index the toughest test possible. This gave a ranked probability score of 0.193 for the bookmakers.</p> <p>Finally I calculated the score for the Eastwood Index and got…</p> <p><em>drum roll please</em></p> <p>a ranked probability score of 0.191. Okay, it is not much lower than the bookmakers but it does mean that so far this season the Eastwood Index has been more accurate at predicting football matches than the combined odds of the gaming industry, which is really pleasing for me.</p> <h2>Conclusions</h2> <p>Most importantly though this suggests that the Eastwood Index works. I had originally set myself the target of being able to compete with the bookmakers as I consider them to be the gold standard for football prediction. These are large companies employing professional odds compilers to generate their odds so for me to be able to beat them, even by a small amount, using a bunch of equations is a big success for the Eastwood Index.</p> <p>It is still early days and it is still a relatively small number of predictions (n=296) so I will be continuing to monitor the results to check the accuracy doesn’t change over time. 
It is a fantastic start though and great inspiration to continue developing and improving the Eastwood Index further!</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>amir - March 23, 2013</strong></p> <p>Have you calculated the RPS for the same 296 matches for the bookmakers too? Otherwise, you put them at disadvantage to begin with, as early matches are harder to predict. Also, have you used the data from this year’s EPL to improve the EI. If you did, you probably did over-fitting…</p> <div class="hline"></div> <p><strong>Martin Eastwood - March 23, 2013</strong></p> <p>Hi Amir</p> <p>EI and bookmaker’s RPS was tested using exactly same set of matches.</p> <p>The EI was developed using historical data from the EPL rather than relying on this season’s data to prevent overfitting model.</p> <p>Thanks for leaving your comment.</p> <div class="hline"></div> <p><strong>amir - March 23, 2013</strong></p> <p>The results are very impressive then.</p> <p>Looking forward to read more details about the EI methodology!</p> <p><strong>Martin Eastwood - March 23, 2013</strong></p> <p>Thanks!</p> <div class="hline"></div> <p><strong>Lars - January 20, 2014</strong></p> <p>The Ranked Probability Scores method looks straightforward and really meaningful.</p> <p>Thanks for bringing it up, I plan to use it myself, too.</p> <div class="hline"></div> <p><strong>Martin Eastwood - January 20, 2014</strong></p> <p>Yes, it’s a great way to measure accuracy. I’m using it more and more for assessing football models now.</p> <div class="hline"></div> <p><strong>Lars - January 28, 2014</strong></p> <p>I have worked my way through Epstein’s paper now, a few comments:</p> <p>Contrary to what is written here, a high RPS is good, not bad. 
Even without understanding the equations, you can see in the table 2 on page 987 of the paper that the score is 1 when the prediction is correct and is &lt;1 the worse the prediction is.</p> <p>Secondly, I have tried to work out where you get the simplified equation from that you show above.</p> <p>In my eyes, for football (3-way result) it should rather be:</p> <p>RPS = S – 0.5 * (P_d+2*P_a) in case of home win</p> <p>RPS = S – 0.5 * (P_h+P_a) in case of draw</p> <p>RPS = S – 0.5 * (2*P_h+P_d) in case of away win</p> <p>where:</p> <p>S = 1.5 – 0.25*(P²_h +(P_h+P_d)²+ (P_d+P_a)²+P²_a )</p> <p>and:</p> <p>P_h is the probability for a home win, P_d for a draw and P_a for an away win.</p> <p>Maybe this can be corrected above or let me know where I am wrong.</p> <div class="hline"></div> <p><strong>Martin Eastwood - January 29, 2014</strong></p> <p>Perhaps the Epstein paper doesn’t make it particularly clear but the RPS is the sum of the squared differences between the forecast and observation. Therefore the more accurate the forecast then the smaller the difference is and the lower the RPS.</p> <p>If you are interested in digging deeper into RPS then this book has quite a good chapter on it IIRC – http://www.amazon.co.uk/Statistical-Atmospheric-Sciences-International-Geophysics/dp/0123850223/ref=tmm_hrd_title_0?ie=UTF8&amp;qid=1390988812&amp;sr=8-2-fkmr0</p> <div class="hline"></div> <p><strong>Lars - January 29, 2014</strong></p> <p>I disagree. The “sum of the squared differences between the forecast and observation” does not take into account the ranking characteristic. And your formula from above does not either.</p> <p>There would really be no reason to call it RPS if it was just a sum of squared differences.</p> <p>What Epstein does makes sense and is clearly different. 
And I did not find that the paper leaves open questions or is unclear.</p> <div class="hline"></div> <p><strong>Martin Eastwood - January 29, 2014</strong></p> <p>Have you checked whether Epstein has an additional 1- term in his paper? RPS typically ranges from 0-1, with 0 considered the perfect score. Or perhaps you’re looking at Ranked Probability Skill Scores where higher values are better?</p> <div class="hline"></div> <p><strong>Lars - January 29, 2014</strong></p> <p>Whether 0 or 1 is defined as perfect is a minor issue. It is more where the “ranked” comes in.</p> <p>For that I have found on Google a site that uses the same formula as you:</p> <p>http://www.eumetcal.org/resources/ukmeteocal/verification/www/english/msg/ver_prob_forec/uos2b/uos2b_ko1.htm</p> <p>And they add to their description that CDF is the CUMULATIVE distribution.</p> <p>Now that makes more sense and I think you probably do the same; it is just not mentioned above. I did not get this the whole time.</p> <p>I see now why you do it and why it is called RPS. That’s fine.</p> <p>Epstein still does something different though (see my comment above where I apply Epstein to football).</p> <div class="hline"></div> <p><strong>Martin Eastwood - January 29, 2014</strong></p> <p>Thanks Lars</p> <div class="hline"></div> <p><strong>Adam - March 17, 2014</strong></p> <p>Hi Martin,</p> <p>Have you ever used the Ranked Probability Skill Score to evaluate your model rather than just the RPS? I’ve been using the RPS. However, on sites discussing probabilistic forecast verification I’ve seen a mention of the RPSS, but I’m unsure as to how to apply it to football predictions. It compares the forecast to a ‘reference forecast’, and I’m wondering what the equivalent is in football (a sample climatology is the example given for weather forecasting). 
It’s defined here: http://www.cawcr.gov.au/projects/EPSverif/scores/scores.html</p> <p>Cheers,</p> <p>Adam.</p> <div class="hline"></div> <p><strong>Martin Eastwood - March 17, 2014</strong></p> <p>Hi Adam,</p> <p>I’ve never tried the RPSS. Perhaps you could use aggregated bookmaker odds as the ‘reference’ and compare against that?</p> <p>Thanks</p> <p>Martin</p> <div class="hline"></div> <p><strong>Ian - March 27, 2014</strong></p> <p>Hi Martin,</p> <p>Just wanted to make sure I understand: is CDFfc effectively your estimated probability of said outcome? As such, in a match where a bookmaker offered 1/2 on a team to win, this would be 0.67? Then the CDFobs would be either 0 or 1?</p> <p>Am I right in thinking that therefore, in theory, a coin toss prediction, where the prediction of 50% is in fact the perfect probability, would have an RPS of approximately 0.25?</p> <p>Also, any idea why it divides by (K-1) rather than just K?</p> <p>Cheers,</p> <p>Ian</p> <div class="hline"></div> <p><strong>Ian - July 11, 2014</strong></p> <p>Hi Martin,</p> <p>Did you have any thoughts on what I asked above?</p> <p>Sorry to persist,</p> <p>Ian</p> <div class="hline"></div> <p><strong>Martin Eastwood - July 12, 2014</strong></p> <p>Apologies Ian, I completely missed your comment.</p> <p>In terms of the coin toss you wouldn’t use RPS as it is intended for when there are more than two possible outcomes. Instead, you would use Brier Scores, which are effectively the same thing but for situations with two outcomes, giving the mean squared error of the forecasts. And yes, for the coin toss example you would expect a Brier Score of 0.25.</p> <p>Not sure about the k-1 part, I’d have to go back and look at Epstein’s paper. I’ve not read it for a while.</p> <div class="hline"></div> <p><strong>Thomas - July 27, 2014</strong></p> <p>Hi Martin,</p> <p>Why are your values so constant after the first matches? 
I would expect to see a spike for those matches where an underdog wins?</p> <p>Bests and thanks for the great work,</p> <p>Thomas</p> <div class="hline"></div> <p><strong>Thomas - July 27, 2014</strong></p> <p>Sorry the comment was indented for post: http://pena.lt/y/2013/05/21/did-the-eastwood-index-beat-the-bookmakers/</p> <div class="hline"></div> <p><strong>Martin Eastwood - July 27, 2014</strong></p> <p>It’s the average rps of all the forecasts so individual matches tend not to cause spikes due to the smoothing from the aggregation.</p> <script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? "innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: 'center'," + " displayIndent: '0em'," + " showMathMenu: true," + " tex2jax: { " + " inlineMath: [ ['\\\\(','\\\\)'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! 
important'} }" + " } " + "}); "; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); } </script>Martin EastwoodThu, 21 Mar 2013 19:30:00 +0000tag:,2013-03-21:2013/03/21EIIs Brendan Rodgers Improving Liverpool?/2013/03/13<h2>Introduction</h2> <p>As well as predicting future matches, my <a href="http://pena.lt/y/2013/02/21/rating-teams-and-predicting-football-matches-using-the-ei-index/">EI Index</a> can also be used to look back at how teams’ performances have changed over time. An interesting example is Liverpool, who sacked Kenny Dalglish at the end of the 2011–2012 season to bring in Brendan Rodgers from Swansea City.</p> <p>The green line in Figure One shows the weekly EI rating for Kenny Dalglish’s Liverpool team over the course of the 2011–2012 season, with the black line showing the moving three-match average. Up until around Christmas time Liverpool were making decent progress in terms of EI, improving from a rating of 2127 up to a peak of 2247 following their 3-1 victory against Newcastle United.</p> <p><img alt="Pelican" src="../../../../images/130313-LFC-EI.png" /></p> <p>However, Liverpool’s form plummeted soon after that, with 11 losses out of their remaining 19 matches dragging Liverpool’s EI back down rapidly. Their worst performances in terms of EI were losses against Bolton Wanderers and Wigan Athletic, both of which Liverpool’s EI ratings suggested they should have had a good chance of winning. Despite a small flurry at the end of the season, Liverpool still finished with an EI lower than they started with.</p> <p>In contrast, the red line in Figure One shows Liverpool’s weekly EI ratings under Brendan Rodgers, with the black line again showing the moving three-match average.</p> <p>The first few matches of the season did not go particularly well for Rodgers and Liverpool’s EI dropped even lower than under Dalglish. 
The obvious narrative here is that Liverpool may have needed time to adjust to Rodgers’ tactical changes but it’s also worth noting they had a tough start to the season, with fixtures against Manchester City, Arsenal and Manchester United all within the opening few weeks.</p> <p>Since then, Liverpool has shown a pretty steady increase in EI over the rest of the season. There have been a few drops along the way due to unexpected losses against teams such as Aston Villa and Stoke but their EI rating is currently on course to exceed Dalglish’s peak by the end of the season.</p> <p>To put these numbers into context, Chelsea are currently in fourth position with an EI of 2464 while Tottenham Hotspur finished fourth last season with an EI of 2329. While Liverpool’s EI isn’t quite that high yet, if they can maintain their current rate of improvement then their EI rating suggests they have a decent chance of challenging for a Champions League place next season.</p> <p><strong>Addendum</strong></p> <p>In case anyone wonders why Brendan Rodgers’ starting EI is lower than Kenny Dalglish’s final EI – Brendan Rodgers lost his first match against West Bromwich Albion so the difference between the two is the loss in EI caused by that particular match.</p> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodWed, 13 Mar 2013 19:30:00 +0000tag:,2013-03-13:2013/03/13EIHow Much Risk Should a Football Manager Take?/2013/03/11<h2>Introduction</h2> <p>How much risk should a football manager take if their team is the underdog in a match? Should they take on the opposing team to try and win the game or sit back and just try not to lose? 
The decision made is inherently linked to how risk averse the club’s manager is, but is there actually an optimal strategy to use when in this position?</p> <h2>Exploring tactics using the EI Index</h2> <p>My <a href="http://pena.lt/y/2013/02/21/rating-teams-and-predicting-football-matches-using-the-ei-index/">EI Index</a> considers a team’s performance to be normally distributed around their true skill level so for any given match we can predict the probabilities that a team will perform above or below their average rating. By looking at how different tactics change these performance curves we can see how they affect each team’s chance of winning.</p> <p>The average EI rating is 2000, with better teams having higher ratings. The larger the difference in EI between two teams, the greater the chance the higher rated team will outscore the lower rated team. However, since teams’ performances vary from match to match it is possible for the lower rated team to out-perform the higher rated team and win the match.</p> <h2>The Underdog</h2> <p>Looking at Figure 1 as an example, the underdog (orange) on average plays with an EI rating of 1800 and the favourite (blue) plays with an EI rating of 2000. Overall the favourite is expected to out-perform the underdog and win the match, yet in around 15% of matches the underdog will actually play above the favourite’s EI rating of 2000.</p> <p><img alt="Pelican" src="../../../../images/130311-Figure-1.png" /></p> <h2>All Out Attack</h2> <p>So what happens if the underdog decides to play the more high risk strategy of attacking the match and going all out for the win? We would expect that the more risk a team takes, the more variance we will see in their performances, as they have more chance of scoring and yet more chance of conceding too.</p> <p>Figure 2 shows what happens when the underdog’s variance doubles. Notice how there is now much more of the orange curve exceeding the favourite’s average EI rating of 2000. 
In fact, in this example the underdog now has around a 31% chance of playing above the favourite’s average and so is much more likely to win the match than before.</p> <p>There is a downside though, as there is also more orange distributed towards lower EI ratings, meaning that although they have a greater chance of winning, the underdog has also seriously increased their chances of a humiliatingly large defeat.</p> <p><img alt="Pelican" src="../../../../images/130311-Figure-2.png" /></p> <h2>Playing Safe</h2> <p>Let’s compare this to Figure 3 where the underdog sits back and plays conservatively, hoping they will not get beaten. This low risk strategy reduces their performance variance, meaning they are much less likely to out-perform their opponents and win the match. In fact, reducing their variance by a half reduces their chance of playing above the favourite’s average down to just 2%, with the benefit of possibly grabbing a draw or minimising the risk of an embarrassing defeat.</p> <p><img alt="Pelican" src="../../../../images/130311-Figure-3.png" /></p> <h2>The Favourite</h2> <p>What about the favourite, how should they respond to a change in risk by their opponents? The optimal choice would appear to be to use a low risk approach to reduce the variance in their performance and minimize the chance of playing at a level below the underdog’s expected performance (Figure 4). This means there is less chance of a glamorous, high-scoring win for them but, importantly, less chance of making a mistake and throwing an easy victory away.</p> <p><img alt="Pelican" src="../../../../images/130311-Figure-4.png" /></p> <h2>The Real World</h2> <p>So what tactics should a manager choose? In the case of the underdog it surely makes sense to take the high risk strategy of attacking the match to increase their chances of winning all three points. The downside is of course that they are at more risk of losing by a heavier score line. 
But whether a team loses by one goal or by four goals, the net outcome in terms of points is the same – zero.</p> <p>Over the course of a season it is much more beneficial to gain additional points at the risk of worse goal difference. One extra victory is worth more in league placement than having a superior goal difference to the teams around you. Plus, you need to hold on and scrape three draws to equal the benefit of getting that one extra win.</p> <p>To counter this high risk approach, the favourite should then play safe to reduce their risk of a poor performance and try to maintain the relative difference in expected performances.</p> <p>Overall this means we would expect the lower rated team to attack the game and take risks while the higher rated team plays safe and waits for the underdog to make an error.</p> <h2>Conclusions</h2> <p>Is this what actually happens in football though?</p> <p>Personally, I suspect not. It is difficult to quantify the actual risk teams are taking in matches so it is impossible to say for sure, but from personal observations it seems much more likely that the underdog will play safe to try and avoid defeat and hopefully grab a lucky draw, even though they then have a much lower chance of winning the match.</p> <p>There are certainly times when this approach has worked, such as this season’s Champions League match when Celtic beat Barcelona against the odds. But it is perhaps not the best strategy long term over the course of a season to maximize a team’s outcomes.</p> <p>So why would a team not play to an optimal strategy? There are likely many competing reasons, one of which is that football managers are not statisticians and cannot necessarily be expected to view matches from a probability or risk-based viewpoint.</p> <p>Another explanation also seems to be the public and media’s perception. 
One victory and three heavy losses are worth more in points and league placements than two draws and two narrow 1-0 defeats, yet the manager presiding over the three heavy losses would come under much more criticism even though he had achieved more. A manager’s career is short and unstable – it doesn’t take much for a trigger-happy chairman to wield the managerial axe so rightly or wrongly many managers seem to be focussed on the goal of retaining their job ahead of anything else.</p> <p>Luck and variation will always play a big role in a team’s results, that’s why football is so exciting, but playing the least risky strategy available may not be the best approach for the smaller teams. Sometimes being brave is best.</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>GoalImpact - March 11, 2013</strong></p> <p>I came to a similar conclusion here (sorry, it is in German)</p> <p>http://www.goalimpact.com/2013/02/der-sturm-gewinnt-spiele-die-abwehr.html</p> <p>Jupp Heynckes once said: Attack wins games, defense championships. The opposite is true for the underdog.</p> <div class="hline"></div> <p><strong>Martin Eastwood - March 12, 2013</strong></p> <p>Thanks for the link, I put it through Google translate and it looks like we came to similar conclusions that the underdog needs to take risks and attack the match rather than sit back and play safe.</p> <p>Looks like a really interesting site, I look forward to reading more :)</p> <div class="hline"></div> <p><strong>Miguel - March 13, 2013</strong></p> <p>I disagree with your premise and think there is actually an optimized strategy for the underdog.</p> <p>There is an assumption that a team that plays it safe does so in order to not get scored on. While this is true, they also play defensively because it increases their chance of scoring. 
The more defensively they play, the more numbers the favorite must send up to attack, the fewer numbers the favorite has defending, and the greater the probability of scoring.</p> <p>So when the underdog plays defensively, it not only decreases the chance to receive a goal but it also increases their chance to score one.</p> <div class="hline"></div> <p><strong>GoalImpact - March 13, 2013</strong></p> <p>I agree with your assessment. However, I’m not sure it necessarily contradicts Martin’s statement. <em>IF</em> playing defensively increases the goal difference, i.e. increasing the mean of the team’s own distribution and/or decreasing the opponent’s, it may be worthwhile going for it. Otherwise, seeking their chances may be a better way to optimize the team’s number of points.</p> <div class="hline"></div> <p><strong>Martin Eastwood - March 13, 2013</strong></p> <p>Yes, it is more about managers being prepared to take the risks needed to win the match rather than being negative and playing to not lose. There may well be cases where that risk lies in playing defensively.</p> <div class="hline"></div> <p><strong>2ndMan - March 13, 2013</strong></p> <p>Good article, in terms of getting more points I certainly think it’s worth a risk, but agree that when you consider squad harmony and morale then playing it safe may be better for the long term. Allardyce’s “respect the point” comes to mind.</p> <p>I disagree completely though, Miguel. You say defending means the opposing team commit more men to the attack, but that also requires you to commit more men to defend. Generally the team in possession is gonna have at least 1 more defender back than attackers you have forward, and the more men you bring back to defend the greater the opposing team’s advantage (having 3 attackers v 4 defenders, down to 2 v 3 and 1 v 2).</p> <div class="hline"></div> <p><strong>Miguel - March 15, 2013</strong></p> <p>2ndMan, the arithmetic that you are using cannot be applied to how the game is realistically played. 
In other words, 1 defender does not cancel out 1 attacker, nor do 2 defenders even cancel out 1 attacker. The game is played at a very fluid pace and when a team is playing very defensively and they are able to get a turnover in the middle of the field, they can exploit the space that is available. And the space, which is the key, is the difference maker.</p> <p>http://www.youtube.com/watch?v=P2jq2NP2osM</p> <p>Look at this video, it is a video of a series of counterattack goals by Real Madrid. I know they are one of the best teams in the world, but in almost every one of those plays, the defenders have the numerical advantage. The huge disadvantage that the defense has is that they are running back to cover the space and it is that space, which is usually not present even when you are attacking with your whole team, that creates the offensive advantage. I would love to see a statistical analysis of the success rate of a counterattack, but I can guarantee you a much bigger percentage of goals are scored when the defense is running back to cover the space in front of the box than when they are positioned there to begin with, regardless of the numbers each team has in attack or defense. This is the offensive advantage that a team has when playing it safe and counterattacking, that they don’t have when attacking with numbers.</p> <div class="hline"></div> <p><strong>Nick - March 20, 2013</strong></p> <p>There is also a mode of thought that the more “possessions” there are in a game, the more likely the better team is to take advantage of those possessions. 
I believe this came from the NBA.</p> <p>Therefore an underdog who limits the changes of possession, either by keeping the ball, making the opposition “over-pass”, time wasting, slowing the game down, etc is actually shortening the length of the game and is increasing the chances of an upset.</p> <p>Does anyone know if there are figures for the average number of team possessions in EPL, for example?</p> <div class="hline"></div> <p><strong>Martin Eastwood - March 21, 2013</strong></p> <p>Good points Nick.</p> <p>I don’t think that data for average number of team possessions is currently available. As far as I am aware Opta calculate possession percentages from completed passes rather than the number of actual possessions each team has had during the match.</p>Martin EastwoodMon, 11 Mar 2013 19:30:00 +0000tag:,2013-03-11:2013/03/11EIRiskEI Match Predictions for the English Premier League/2013/03/08<p>Here we go again!</p> <p><a href="http://pena.lt/y/2013/03/08/ei-match-predictions-for-the-english-premier-league-2/">Last Week</a> was another success for the EI, with seven out of the ten predicted favourites winning their matches. 
It is still a bit early to be drawing too many conclusions but so far that is 14 out of 19 for the EI, which seems a pretty good start to me!</p> <p><a href="http://pena.lt/y/2013/02/28/how-did-the-ei-predictions-do/">Jonas</a> commented on my recent post discussing the difficulties of assessing probability-based models to suggest trying <a href="http://en.wikipedia.org/wiki/Statistical_distance">Ranked Probability Scores</a>, which looks like a really good idea, so look out for that once I have a bit more data to play with.</p> <p>It is only a small gameweek this week due to the FA Cup but here are the predictions anyway.</p> <table class="table"> <tbody> <tr> <td><strong>Home Team</strong></td> <td><strong>Away Team</strong></td> <td><strong>Home (%)</strong></td> <td><strong>Draw (%)</strong></td> <td><strong>Away (%)</strong></td> </tr> <tr> <td>Norwich</td> <td>Southampton</td> <td>48</td> <td>24</td> <td>28</td> </tr> <tr> <td>QPR</td> <td>Sunderland</td> <td>36</td> <td>27</td> <td>37</td> </tr> <tr> <td>Reading</td> <td>Aston Villa</td> <td>42</td> <td>26</td> <td>32</td> </tr> <tr> <td>West Brom</td> <td>Swansea</td> <td>46</td> <td>25</td> <td>29</td> </tr> <tr> <td>Newcastle</td> <td>Stoke</td> <td>49</td> <td>24</td> <td>27</td> </tr> <tr> <td>Liverpool</td> <td>Tottenham</td> <td>38</td> <td>27</td> <td>35</td> </tr> </tbody> </table> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Chris Pope - March 8, 2013</strong></p> <p>I am only a curious amateur stat freak, but love how close this week’s predictions are. 
Love the blog and am telling everyone about it.</p> <div class="hline"></div> <p><strong>Martin Eastwood - March 9, 2013</strong></p> <p>Thanks Chris :)</p> <p>It is a really tough week to predict so I expect the accuracy of the EI to drop a bit, but over enough data it should all cancel itself out as there will be some easier weeks too.</p>Martin EastwoodFri, 08 Mar 2013 19:30:00 +0000tag:,2013-03-08:2013/03/08EIEPLEI Match Predictions for the English Premier League/2013/03/01<p>Since last week’s predictions turned out to be so popular I thought I would continue testing the EI index in public so here are this week’s predictions. Fingers crossed they turn out well again!</p> <table class="table"> <tbody> <tr> <td><strong>Home Team</strong></td> <td><strong>Away Team</strong></td> <td><strong>Home (%)</strong></td> <td><strong>Draw (%)</strong></td> <td><strong>Away (%)</strong></td> </tr> <tr> <td>Chelsea</td> <td>West Brom</td> <td>60</td> <td>20</td> <td>21</td> </tr> <tr> <td>Everton</td> <td>Reading</td> <td>64</td> <td>17</td> <td>19</td> </tr> <tr> <td>Man United</td> <td>Norwich</td> <td>76</td> <td>10</td> <td>14</td> </tr> <tr> <td>Southampton</td> <td>QPR</td> <td>49</td> <td>24</td> <td>27</td> </tr> <tr> <td>Stoke</td> <td>West Ham</td> <td>56</td> <td>21</td> <td>23</td> </tr> <tr> <td>Sunderland</td> <td>Fulham</td> <td>42</td> <td>26</td> <td>32</td> </tr> <tr> <td>Swansea</td> <td>Newcastle</td> <td>42</td> <td>26</td> <td>32</td> </tr> <tr> <td>Wigan</td> <td>Liverpool</td> <td>31</td> <td>27</td> <td>42</td> </tr> <tr> <td>Tottenham</td> <td>Arsenal</td> <td>44</td> <td>25</td> <td>31</td> </tr> <tr> <td>Aston Villa</td> <td>Man City</td> <td>14</td> <td>25</td> <td>61</td> </tr> </tbody> </table> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodFri, 01 Mar 2013 19:30:00 +0000tag:,2013-03-01:2013/03/01EIEPLEI Match Predictions for the English Premier League/2013/02/28<h2>The EI</h2> <p>Last week was a big test for my new <a
href="http://pena.lt/y/2013/02/21/rating-teams-and-predicting-football-matches-using-the-ei-index/">EI Index</a> – it had finally reached the point where I was confident it was working well enough to post its predictions in public.</p> <p>For those people who haven’t come across it before, the EI index is a mathematical system I have been developing for ranking football teams and predicting the outcomes of matches. Using the EI it is possible to predict the odds for each team winning, drawing or losing its match.</p> <h2>Results</h2> <p>So how did the EI do?</p> <p>Well, this is the tricky bit as the EI is a probability model. For linear models it is relatively simple to assess accuracy as you get an R-squared value showing you how well your predictions match the observed result. The higher the R-squared, the better you did.</p> <p>For a probability-based model though you cannot do this. An obvious alternative is to just look at whether the model’s predicted favourites won their matches. And on this measure the EI performed fantastically well by correctly matching seven of its nine predictions, giving it an overall success rate of 78%.</p> <p>But we need to be careful here as this can be a misleading way of looking at accuracy. Just because Manchester City had a 54% probability of beating Chelsea doesn’t mean they will win the match purely because it is the most probable result. Instead, it means that if this match were played 100 times then Manchester City would be expected to win 54 of them and not win 46 of them.</p> <p>Rather than looking at the accuracy of the predicted favourite winning we really need to look at the accuracy of the predicted probabilities. Are teams ranked with a 50% chance of winning actually winning 50% of the time? 
Are teams ranked with a 25% chance of winning actually winning 25% of the time?</p> <p>This can only be done by making lots and lots of predictions so over the coming weeks I will keep making them until I have enough predictions to get an estimate of how good they are.</p> <p>Overall though it is a very exciting start for the EI, bring on next week!</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Jonas - March 1, 2013</strong></p> <p>Have you heard about the rank probability score (RPS)? It seems, to me at least, to be a reasonable measure of how well a probabilistic model fares.</p> <p>You should check out the paper “Solving the Problem of Inadequate Scoring Rules for Assessing Probabilistic Football Forecast Models” by Anthony Costa Constantinou and Norman Elliott Fenton.</p> <div class="hline"></div> <p><strong>Martin Eastwood - March 1, 2013</strong></p> <p>Thanks Jonas, that looks really interesting!</p> <div class="hline"></div> <p><strong>GoalImpact - March 12, 2013</strong></p> <p>Hi Martin,</p> <p>Thanks for sharing your approach. This appears to be more sound to me than most of the rankings around. One easy way to check your prediction quality would be the power stat. Just sort the predicted outcomes by probability and make an xy chart with the sorted predictions on the x axis and the cumulative real outcome on the y axis. Then calculate the Gini on that graph.</p> <div class="hline"></div> <p><strong>Martin Eastwood - March 12, 2013</strong></p> <p>Good idea, I like the sound of that. Will be giving that a try :)</p>Martin EastwoodThu, 28 Feb 2013 19:30:00 +0000tag:,2013-02-28:2013/02/28EIEPLEI Match Predictions for the English Premier League/2013/02/21<p>For a bit of fun, here is a trial run at predicting this weekend’s EPL matches using my EI ratings. 
I haven’t compared these with anyone else’s odds yet but they generally look about what I would have expected.</p> <p>Poor QPR don’t seem to have much chance of holding out against Manchester United; even playing at home they are only rated at having a 9% chance of winning.</p> <p>It looks like it could be a good weekend for Arsenal to bounce back from their Champions League defeat as they have a massive 67% chance of beating Aston Villa.</p> <p>Personally, I am surprised Newcastle are rated quite so highly against Southampton. I wonder if this may be due to Newcastle being so strong last season while Southampton only have this season’s data for generating EI ratings from? If so, it may be that I need to go back and tweak the equation weightings slightly to account for situations like this.</p> <table class="table"> <tbody> <tr> <td>Match</td> <td>Home (%)</td> <td>Draw (%)</td> <td>Away (%)</td> </tr> <tr> <td>Fulham Vs Stoke</td> <td>46</td> <td>25</td> <td>29</td> </tr> <tr> <td>Arsenal Vs Aston Villa</td> <td>67</td> <td>15</td> <td>17</td> </tr> <tr> <td>Norwich Vs Everton</td> <td>29</td> <td>27</td> <td>43</td> </tr> <tr> <td>QPR Vs Man United</td> <td>9</td> <td>23</td> <td>68</td> </tr> <tr> <td>Reading Vs Wigan</td> <td>43</td> <td>25</td> <td>31</td> </tr> <tr> <td>West Brom Vs Sunderland</td> <td>47</td> <td>24</td> <td>28</td> </tr> <tr> <td>Man City Vs Chelsea</td> <td>54</td> <td>22</td> <td>24</td> </tr> <tr> <td>Newcastle Vs Southampton</td> <td>53</td> <td>22</td> <td>25</td> </tr> <tr> <td>West Ham Vs Tottenham</td> <td>19</td> <td>27</td> <td>54</td> </tr> </tbody> </table> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>JVent - February 25, 2013</strong></p> <p>Awesome EL predictions 
you made. You only missed two predictions, but nobody would have known that Reading’s Pavel Pogrebnyak was going to receive a red card. So that would only make one wrong prediction. Hope you share the formula later on once you have perfected it.</p> <p>Also, you have got one of the most interesting blogs about football.</p> <div class="hline"></div> <p><strong>Martin Eastwood - February 25, 2013</strong></p> <p>Thanks! Was definitely a good start, let’s see if it continues :)</p> <div class="hline"></div> <p><strong>Casey - February 27, 2013</strong></p> <p>Oh goodness I would love to apply this to MLS. New season starts Saturday. No new teams.</p>Martin EastwoodThu, 21 Feb 2013 19:30:00 +0000tag:,2013-02-21:2013/02/21EIEPLIntroducing the Eastwood Index/2013/02/21<h2>Introduction</h2> <p>My past couple of articles have focused on Elo ratings and how they can be applied to football teams to rank them against each other and to estimate win probabilities.</p> <h2>Problems with Elo Ratings</h2> <p>On the whole the Elo system works okay but it was not designed with football in mind and so there are some issues with it. For example, it can only handle two distinct outcomes – winning and losing.</p> <p>Elo ratings try to get around this problem by considering each draw to be half a win and half a loss. However, this means that the win probabilities calculated using the Elo equation are actually the probability of winning or drawing versus the probability of losing or drawing, which isn’t particularly useful.</p> <p>For a game such as chess, which Elo ratings were originally developed for, this may not be too much of an issue as tied matches are comparatively rare, but in football draws are a common occurrence so we really need to be able to model three outcomes – win, loss and draw.</p> <h2>The Eastwood Index</h2> <p>So instead of combining draws with wins and losses, we need to be able to calculate their probabilities individually. 
To do this I have been developing my own ranking system, which for want of a better name I am currently calling the Eastwood Index, or EI for short (it feels rather pretentious to be naming it after myself so if anyone has any better names for it then feel free to let me know!).</p> <p>The Eastwood Index allows football teams to be ranked using a mathematical rating system that evaluates relative strength based on previous performances, weighted so that more recent matches have a greater impact on a team’s ranking.</p> <h2>Methodology</h2> <p>Teams’ EI ratings are scaled so that the average rating is 2000. The higher the rating the better a team is compared with the rest of the league.</p> <p>EI ratings increase when a team wins a match or draws against superior opposition. Conversely, EI ratings decrease when teams lose matches or draw against weaker opposition. The size of this increase or decrease in ratings is linked to the quality of the opposition. For example, beating a superior team is worth more than winning against a lower ranked team.</p> <p>The change in EI rating is also weighted by the score line so that the greater the goal difference, the greater the change in ratings. Home advantage is also included in the calculations so that a team is expected to perform better at home than away.</p> <p>So far this all sounds similar to an Elo rating. However, the Eastwood Index has a major advantage over Elo in that it is multinomial, meaning it can handle more than two outcomes. This makes it possible to calculate separate probabilities for a team winning, drawing or losing a match.</p> <p>A further advantage of the Eastwood Index is that it does not rely on the logistic distribution in the way Elo ratings do. The use of the logistic distribution in Elo ratings originates from chess where it was considered to predict chess outcomes reasonably well. 
Football and chess are different games with different outcomes so instead the Eastwood Index uses custom curves developed using football data. This means that predictions for football should be more accurate using the EI compared with Elo ratings.</p> <h2>Example</h2> <p>The underlying mathematics for the EI is completely different to how an Elo rating is calculated but rather than wade through a list of equations it is simpler to show how it works using the recent Liverpool versus Swansea match.</p> <p>Prior to the game Liverpool had an EI rating of 2151 compared with Swansea’s rating of 1891. Team performances are considered to be normally distributed around their rating so on any given day a team may play above or below their true skill level. By comparing the distribution curves for the two teams we can then calculate the probabilities of each outcome of the match before it is played.</p> <p>Although both teams have similar ratings Liverpool has the home advantage giving them overall a 52% chance of a win compared with a 25% chance of Swansea winning and a 23% chance of a draw (Figure 1).</p> <p><img alt="Pelican" src="../../../../images/LvsS.png" /></p> <p><strong>Figure 1: Predicting Liverpool Versus Swansea City</strong></p> <p>We can also use these probabilities to calculate the expected points from the match. If these two teams were to play the same match repeatedly then on average Liverpool would be expected to earn (0.52 * 3) + (0.23 * 1) = 1.79 points while Swansea would be expected to earn (0.25 * 3) + (0.23 * 1) = 0.98 points.</p> <p>Once we know the actual result we can then update the EI for each team based on their current ratings and the score line, which was Liverpool 5 – 0 Swansea. 
Since Liverpool already had a higher EI rating and had beaten somewhat lesser opposition they would expect only a small rise in their EI, but taking into account their high score in the match Liverpool’s rating moves up to 2183 while Swansea’s falls to 1859.</p> <h2>Conclusions</h2> <p>The EI offers a potentially superior way of rating football teams compared with other ranking systems, with the advantage that it can predict wins, losses and draws, and uses mathematics specifically designed to model football data.</p> <p>I will be discussing the EI in more detail in future posts and showing how it can be used to analyse and predict football matches.</p> <p>As ever, get in touch if you have any comments or questions!</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Lars - February 21, 2013</strong></p> <p>Hello Martin,</p> <p>I admire the courage to come up with a completely new ranking system. There are a lot of things to be taken into consideration if you want to set up a solid theoretic basis for such a complex problem. That is why I shy away from developing something completely new and rather stick to Elo, certainly not perfect but still very good in my eyes.</p> <p>Please tell us more about the maths behind it so that substantiated comments are possible. Until then, let me give you my first thoughts I had when reading this:</p> <p>1) It seems that by using the 3-point rule for the ranking, you leave the ground of zero-sum-games. This would imply that two teams can increase (or decrease) their average ranking just by playing each other. I wonder if that is intended?</p> <p>2) If you do not want to be pretentious, name it after what it does or its unique feature (multinomial or whatever).</p> <div class="hline"></div> <p><strong>Martin Eastwood - February 21, 2013</strong></p> <p>Hi Lars,</p> <p>Yes Elo is certainly adequate and I am not trying to criticize it. 
Rather I am just trying to improve things further although I am sure there is still more to do as this is just the first version of the system. In answer to your questions:</p> <p>1) The mathematics is designed to ensure the system is zero-sum so the average rating for the league will always be 2000</p> <p>2) perhaps the Multinomial Football Index? I may just stick with EI and then just avoid referring to what the E stands for :)</p> <div class="hline"></div> <p><strong>Rob - February 21, 2013</strong></p> <p>Hi Martin</p> <p>Enjoy your blogs and find them very interesting .</p> <p>Just trying to get my head around your example of Liverpool v Swansea. If I am correct then you award extra points for a 5-0 win (goals) but Swansea had rested the majority of the team if memory serves me correctly so could you take that into account when awarding points or is Subjectivity a dangerous state to avoid when putting together ratings ?</p> <div class="hline"></div> <p><strong>Martin Eastwood - February 21, 2013</strong></p> <p>Thanks Rob. At the moment it is based purely on the actual match result, so far I haven’t found a good way to quantify whether a team has put out a weakened squad for a match.</p> <div class="hline"></div> <p><strong>Baloo - February 21, 2013</strong></p> <p>I use a similar rating system (purely to assess opposition strength) and if I had a team rated 250 points higher, it would mean they are around 1.25 goal favourites. Add on home advantage and you get Liverpool in at 1.65 goal jollies (ie roughly 73.5% to win the game).</p> <p>My pricing method is a lot more complicated but I also had Liverpool around 1.65 goal favourites and bet accordingly.</p> <p>How did you get to 52%?</p> <div class="hline"></div> <p><strong>Martin Eastwood - February 22, 2013</strong></p> <p>The 52% is based on the difference between the two team’s performance distribution curves. 
For such closely matched teams though 73.5% sounds slightly high?</p> <div class="hline"></div> <p><strong>Baloo - February 22, 2013</strong></p> <p>What do you mean by performance distribution?</p> <p>I would strongly disagree that they are closely matched on performance (or even just shot) data, which is essentially what drives the betting markets. They are closely matched only in pure results terms.</p> <p>Liverpool continually divide opinion however, and I have to admit I’m in the minority when it comes to rating them.</p> <p>They are an outlier, just as Man Utd are also an outlier but at the opposite end of the spectrum.</p> <div class="hline"></div> <p><strong>Martin Eastwood - February 22, 2013</strong></p> <p>The model considers individual performances to be distributed around the team’s current EI rating. Yes Liverpool are quite a controversial team to rate, personally I never think the odds for them look how I would expect them to. I agree about Man Utd they are a huge outlier this season and based on many of their individual stats they would not be expected to be where they are.</p> <div class="hline"></div> <p><strong>Vasilis - August 30, 2013</strong></p> <p>Hi, I have a question. At the beginning of a season, so you make and a subjective evaluation of all teams, do you start all teams with 2000 points, do you rely on last years ranking to handicap better teams? And once the start up points have been awarded, does your system takes into account purely results, or do you feed it with subjective criteria as well?</p> <div class="hline"></div> <p><strong>Martin Eastwood - August 30, 2013</strong></p> <p>All teams initially started off equal on 2000 points. Promoted teams them take over the equivalent relegated team’s rating and other team’s ratings carry over from one season to the next. 
At the moment it is based purely on results but at some point I plan to investigate whether subjective criteria can help account for changes e.g.manager changes / transfers / summer breaks etc.</p> <div class="hline"></div> <p><strong>Vasilis - August 30, 2013</strong></p> <p>Nevertheless, dont you think that not all teams have the same probability of winning the trophy? I mean, if you figured out a way to evaluate odds for each team winning the 1st place, and then compiled them around 2000, but with weights in order to have a more realistic initial point, wouldn’t that be more accurate? Also, does your model take into account the total number of teams participating in the league. I mean, do you use the 2000 initial points for Scottish Premier league (12 teams), and for England Premier (20 teams)too? I guess that a team loosing all matched would have its points limit close to zero, but not negative, correct? Hence the 2000 points shouldn’t be adjustable to each league?</p> <div class="hline"></div> <p><strong>Martin Eastwood - August 30, 2013</strong></p> <p>When I initially set up the model I ran it over the three previous season’s data so that all teams ratings had time to move from the 2000 to the correct value to represent the team’s values so no need to weight them.</p> <p>The effect of league sizes is something that intrigues me as ideally I would like to run the model over multiple leagues and seasons and have ratings comparable between them all so some form of weighting will be required, although I am not sure of the best solution though. Still trying to find an answer that I am happy with.</p> <div class="hline"></div> <p><strong>Nick - April 16, 2014</strong></p> <p>EI is great but your statement that draws in chess are relatively rare is not correct. 
“For a game such as chess, which Elo ratings were originally developed for, this may not be too much of an issue as tied matches are comparatively rare but in football draws are a common occurrence” Actually, quite the opposite – no other game has as high ratio of draws as chess, over 50% of chess games played at high (professional) level are drawn. https://en.wikipedia.org/wiki/Draw_(chess)#Frequency_of_draws</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 16, 2014</strong></p> <p>I stand corrected Nick :-)</p> <div class="hline"></div> <p><strong>Johan - April 23, 2014</strong></p> <p>Hi Martin.</p> <p>I’m relatively new to your blog but reading a new article every morning on my way to work. Interesting content (and comments).</p> <p>Your system the EI index sounds somewhat similar to Paul Steele’s power ratings. Would you mind sharing the formula (like Steele) so it’s possible to compare their performance on the same data sets?</p> <p>I found Mr. Steele’s system to work quite well especially for Home wins (roughly 66% correct) but it would be good fun if there was a system that could beat it especially on away wins (roughly 40%).</p> <p>I can send you his formula if you like (pls advise me of your email).</p> <p>Best regards,</p> <p>Johan</p> <div class="hline"></div> <p><strong>http://google.com/ - June 14, 2014</strong></p> <p>whoah this weblog is fantastic i like studying your posts.</p> <p>Keep up the great work! You already know, a lot of individuals are hunting around for this information, you can help them greatly.</p> <div class="hline"></div> <p><strong>Empecinator - July 7, 2014</strong></p> <p>Hi Martin,</p> <p>Thanks for the blog as it is very interesting and easy to read. However and same as Baloo I don’t see how you get to the 52% chance of victory for L’pool, i.e. 
how to you infer the multinominal probabilities from the Win-Lose calculated with the Ea logistic formula (explained in a previous article)?</p> <p>I understand it can be calculated from the overlay distributions but will you (or have you) publish it?</p> <p>Cheers</p> <div class="hline"></div> <p><strong>Martin Eastwood - July 12, 2014</strong></p> <p>Hi Juan,</p> <p>I’ve moved away from the EI as I couldn’t find a good way of forecasting scores from it so while it worked well for 1X2 probabilities it was not much use for Asian handicaps. It’s unlikely I’ll bother publishing it now since I no longer use it or keep it up to date.</p>Martin EastwoodThu, 21 Feb 2013 19:30:00 +0000tag:,2013-02-21:2013/02/21EIElo RatingsUnderstanding Elo Ratings Part Two/2013/02/07<h2>Introduction</h2> <p>Now that we understand the theory behind <a href="http://pena.lt/y/2013/01/31/understanding-elo-ratings/">Elo Ratings</a>, let’s take a look at how to calculate them and how to make them more relevant to football.</p> <h2>Calculating Elo Ratings</h2> <p>The equation for calculating a team’s Elo rating is shown below in Figure 1, where <span class="math">\(Ra_{new}\)</span> is the team’s new Elo rating after a match, <span class="math">\(Ra_{old}\)</span> is the team’s previous Elo rating before the match and <span class="math">\(k\)</span> is a weighting factor. 
<span class="math">\(Sa\)</span> is the outcome of the match normalised to the range 0–1 so that 0 is a loss, 0.5 is a draw and 1 is a win.</p> <p><span class="math">\(Ra_{new}=Ra_{old}+k(Sa-Ea)\)</span></p> <p><strong>Figure 1: Elo Rating equation</strong></p> <p><span class="math">\(Ea\)</span> is the expected probability of the team winning the match and is calculated using the equation in Figure 2 where <span class="math">\(Rb-Ra\)</span> is the difference in Elo ratings between the two teams.</p> <p><span class="math">\(Ea=\frac{1}{1+10^{(Rb-Ra)/400}}\)</span></p> <p><strong>Figure 2: Expected win probability equation</strong></p> <h2>Win Expectancy</h2> <p>The calculation for <span class="math">\(Ea\)</span> is actually slightly different from the original Elo equation as it uses a logistic distribution for player performances rather than a normal distribution. The use of the logistic distribution stems from the chess community, who suggested that it fit player performances better than the normal distribution did. In effect, the differences between the two are relatively minor, with the logistic curve placing more weight in the tails of the distribution, meaning players are slightly more likely to over- or under-perform (Figure 3).</p> <p><img alt="Pelican" src="../../../../images/130702-Logistic-Vs-Normal.png" /></p> <p><strong>Figure 3: Comparison of logistic and normal distributions</strong></p> <h2>Weighting Factor</h2> <p>The constant <span class="math">\(k\)</span> in the equation controls how many points are gained or lost each match. Increasing <span class="math">\(k\)</span> will apply more weight to recent matches while lowering it will allow historic matches to have more of an effect on a team’s Elo rating. Therefore, using an inappropriate value for <span class="math">\(k\)</span> may lead to inaccurate Elo ratings being calculated.</p> <p><a href="http://www.eloratings.net/system.html">Eloratings.net</a> is a website that applies Elo ratings to international football. 
They use a weighting of 60 for a World Cup final, 50 for continental championship finals and major intercontinental tournaments, 40 for World Cup and continental qualifiers, 30 for all other tournament matches and 20 for international friendly matches. However, since none of these weightings applies directly to domestic football and since <a href="http://www.eloratings.net/system.html">Eloratings.net</a> does not explain how they were determined I decided to calculate my own.</p> <p>Using <a href="http://en.wikipedia.org/wiki/Least_squares">Least Squares</a> I optimized the value of <span class="math">\(k\)</span> to minimize the error of the predicted outcomes versus the actual match results using data from the English Premier League. Overall, the most accurate predictions were obtained using a value of 15 for <span class="math">\(k\)</span>.</p> <p><img alt="Pelican" src="../../../../images/130702-K-Weighting.png" /></p> <p><strong>Figure 4: Effect of k on error of Elo prediction</strong></p> <h2>Goal Difference</h2> <p>Another modification we can make so that the Elo ratings are more applicable to football is to take into account the number of goals scored, so that beating the opposition by two goals, for example, is better than winning by just one.</p> <p>We can do this by scaling <span class="math">\(k\)</span> by the goal difference so that the larger the difference the more points are gained by the victor and the more lost by the loser. There are a number of ways this can be done but in my method each additional goal a team scores becomes increasingly less important. For example, going from 1–0 to 2–0 is much more critical in terms of winning a game than going from 8–0 to 9–0.</p> <p>Eloratings.net uses a similar approach where their scaling reduces the weightings for goal differences of two and three. 
However, for goal differences of four upwards their scale (intentionally or unintentionally) becomes linear and from then on applies equal weightings to each additional goal scored. Instead, I have used a sigmoid function to smoothly reduce the weightings of each goal scored to create the curve shown in Figure 5, which is then used to produce the scaling factors shown in Table 1.</p> <p><img alt="Pelican" src="../../../../images/130702-Goal-Differential.png" /></p> <p><strong>Figure 5: Goal difference scaling factor smoothed using a sigmoid</strong></p> <table class="table"> <tbody> <tr> <td><strong>Goal Difference</strong></td> <td><strong>Scaling Factor</strong></td> </tr> <tr> <td>10</td> <td>2.99</td> </tr> <tr> <td>9</td> <td>2.88</td> </tr> <tr> <td>8</td> <td>2.77</td> </tr> <tr> <td>7</td> <td>2.64</td> </tr> <tr> <td>6</td> <td>2.49</td> </tr> <tr> <td>5</td> <td>2.32</td> </tr> <tr> <td>4</td> <td>2.11</td> </tr> <tr> <td>3</td> <td>1.85</td> </tr> <tr> <td>2</td> <td>1.51</td> </tr> <tr> <td>1</td> <td>1.00</td> </tr> </tbody> </table> <p><strong>Table 1: Goal difference scaling factors</strong></p> <h2>Home Advantage</h2> <p>If two teams with equal Elo ratings play each other then in theory they should both have an identical chance of winning the match; however, in football the home team typically has a noticeable advantage.</p> <p>Looking back at the 2011–2012 English Premier League season, home wins accounted for 47% of results compared with just 24% for away wins. The remainder of the results are draws, which Elo ratings consider to be half a win, so including these gives us a final win expectancy of 61% for the home team and 39% for the away team.</p> <p>To account for this we can give the home team’s Elo a temporary boost of 75 points. 
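</p> <p>As a quick check on that boost, here is a minimal Python sketch that plugs it into the expectancy equation from Figure 2. The function name <code>expected_score</code> and its <code>home_boost</code> parameter are illustrative assumptions rather than part of any library, and the two ratings of 1500 simply stand in for a pair of level teams.</p>

```python
def expected_score(rating_a, rating_b, home_boost=0.0):
    """Logistic Elo expectancy for team a (Figure 2).

    home_boost is a hypothetical parameter: points temporarily added
    to team a's rating when it plays at home.
    """
    diff = rating_b - (rating_a + home_boost)
    return 1.0 / (1.0 + 10.0 ** (diff / 400.0))


# Two level teams, first without and then with the 75-point home boost.
level = expected_score(1500, 1500)
boosted = expected_score(1500, 1500, home_boost=75)
print(round(level, 2), round(boosted, 2))  # prints: 0.5 0.61
```

<p>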
For two equally matched teams this then raises the win expectancy for the home team from 50% to 61%, matching what we see in the English Premier League.</p> <h2>Relegations and Promotions</h2> <p>Another issue to consider is how to deal with relegations and promotions. We could calculate Elo ratings for each tier of the league so that a team already has a rating when it gets promoted or alternatively we could award each promoted team the average Elo rating of 1500. A nice feature of Elo ratings is that they are self-correcting so although these arbitrary ratings may not be accurate they would gradually alter to the correct level.</p> <p>This does have the unfortunate side effect of skewing the other teams’ Elo values though. The gain and loss of Elo points is zero sum, meaning that for every Elo point a team gains another team has to lose one. So adding in teams with different Elo ratings would distort the other teams’ ratings by altering the overall number of points available in the league.</p> <p>The simplest way to deal with this problem is to give the promoted teams the equivalent relegated team’s Elo rating. So the best promoted team takes the Elo rating of the best relegated team, the second best promoted team takes the Elo rating of the second best relegated team, and so on. This then keeps the correct number of Elo points in the league and maintains the parity in points between teams.</p> <h2>Conclusions</h2> <p>Elo ratings are a really quick and easy way to compare teams directly and calculate win expectancies. 
While techniques like the Pythagorean Expectation look at how teams perform over a long period of time, Elo ratings can be used to look at teams on a match-by-match basis.</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Lars - February 7, 2013</strong></p> <p>Thanks for this article.</p> <p>I invite you to have a look at my website, where I am doing Elo ratings for European club football.</p> <p>A lot of conclusions I have come to are similar to yours.</p> <p>My least-squares curve for the weighting factor however looks a bit different, with a minimum at k=20 and not as symmetrical.</p> <p>Glad to see that the Elo system becomes more and more popular.</p> <div class="hline"></div> <p><strong>Martin Eastwood - February 7, 2013</strong></p> <p>That is a really nice website Lars :)</p> <p>It’s also good to see we have come to similar conclusions with regards to k factor etc.</p> <p>Your use of the Poisson looks interesting too. I have played around with various Poisson models before but I have not tried combining it with the Elo before, an intriguing idea!</p> <div class="hline"></div> <p><strong>Stefan - February 7, 2013</strong></p> <p>Very interesting article, thanks!</p> <p>I was wondering about two things:</p> <p>1) In Figure 3, what are the parameters of the logistic distribution? Is it plotted for the same mean and variance as the normal distribution? Maybe this should be stated in the text.</p> <p>2) What would happen if you used the actual match result, in particular the fraction of total goals scored, for “Ea” instead of just 0, 0.5, 1 for loss, draw, or win, respectively? For example, if the match ended 2-1, team 1 would have scored 66% of the goals, so the actual outcome would be 0.66 (and 0.33 for team 2). This would also naturally account for the weighting of goal differences, since the fraction of goals scored is much more similar when comparing 5-1 and 6-1 wins than for 2-1 and 3-1 wins. 
However, then one obviously has to deal with results where only one team scores and the outcome is always 100%/0% regardless of the goal difference.</p> <p>I remember that the European eSports Leage ESL uses a similar ranking system for online games, see: http://www.esl.eu/eu/faq/rankmodules/ (interestingly also for the FIFA games^^) As far as I know, they added an offset of 1 to the match result, such that 1-0 is actually interpreted as 2-1, 2-0 as 3-1, and so on.</p> <div class="hline"></div> <p><strong>Martin Eastwood - February 7, 2013</strong></p> <p>1) Yes, the two distributions should be comparable with each other</p> <p>2) That is a really interesting idea, I will have a play around with that and see if it works.</p> <p>Thanks for the comments Stefan!</p> <div class="hline"></div> <p><strong>Stefan - February 7, 2013</strong></p> <p>If I had to guess, it could be that this underestimates the value of a close win (and I assume the majority of wins are with one goal difference). If this is the case you could play around with a sigmoid that pushes the actual match outcome away from 0.5 towards 0/1, but still allows for some gradual changes. Intuitively speaking, it should make sense to include the full information of the match score, which is more than just a ternary event (win, loss, or draw).</p> <p>On a related note, it would be interesting to analyze whether the absolute goal difference or the goal ratio carries more predictive information.</p> <div class="hline"></div> <p><strong>Ian - September 2, 2013</strong></p> <p>Hi Martin, Can you please explain in layman’s terms how you calculated the optimal K value?</p> <p>Same thing for the scaling factor – if it’s not too difficult.</p> <div class="hline"></div> <p><strong>Martin Eastwood - September 15, 2013</strong></p> <p>Most stats packages will have some sort of optimisation routines built in e.g optim in R or solver in Excel. 
These will iteratively work through a range if numbers looking to minimise or maximise some value for you. So for example you would look to minimise error or maximise likelihood to get the optimal value.</p> <div class="hline"></div> <p><strong>Ian - October 19, 2013</strong></p> <p>Thanks Martin, Just saw your reply now.</p> <p>Can you (or Mick) run me through how to use solver to get the optimal K value? I’d really appreciate it.</p> <p>Currently I’m regressing every team’s rating to the mean after each season. Would you guys recommend that I continue to do that after finding the optimal K value? Or would that become unnecessary?</p> <div class="hline"></div> <p><strong>Michael Podger - October 16, 2013</strong></p> <p>Hello Martin, great article</p> <p>I analysed 4 years of A-League football (about 550 matches), using Solver to minimise the average error between the predicted and actual margin. The optimum K value was 75! Can you think of any reason why our K values would be so different ?</p> <p>Thanks Mick</p> <div class="hline"></div> <p><strong>Martin Eastwood - October 16, 2013</strong></p> <p>Some leagues do seem to optimise to different k values. I found MLS to be quite different too due to its high level of parity so perhaps something similar with A league?</p> <div class="hline"></div> <p><strong>Mick Podger - October 18, 2013</strong></p> <p>Thanks Martin, thats probably it. A-league has a lot of equalisation measures which create more variability from year to year than you’d probably get in Premier League.</p> <div class="hline"></div> <p><strong>Nick - January 14, 2014</strong></p> <p>Hello. I don’t understand your probability formula. 
EA + EB = 1, but we have three possible results.</p> <div class="hline"></div> <p><strong>Martin Eastwood - January 20, 2014</strong></p> <p>Elo ratings only have two outcome – win and loss – so the draw probability gets merged between the two</p> <div class="hline"></div> <p><strong>Adam - April 12, 2014</strong></p> <p>I noticed that the Ea formula here is different from that in Wikipedia:</p> <p>http://en.wikipedia.org/wiki/World_Football_Elo_Ratings</p> <p>Any suggestions?</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 12, 2014</strong></p> <p>If I remember correctly, my version is a modification to the original elo to use logistic regression to improve its accuracy a little bit more.</p> <div class="hline"></div> <p><strong>Adam - April 12, 2014</strong></p> <p>Thank you for your kind reply Martin! I’ve got another question if you do not mind.</p> <p>Does Rb stands for the Away team and Ra for the Home team?</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 12, 2014</strong></p> <p>No problem Adam, yes Rb should be the away team</p> <div class="hline"></div> <p><strong>Adam - April 12, 2014</strong></p> <p>Many thinks Martin!</p> <div class="hline"></div> <p><strong>Nathan Hause - August 12, 2014</strong></p> <p>First off I’d like to say great series of articles. I found them extremely helpful and surprisingly easy to understand.</p> <p>I am commenting to inquire as to your success predicting wins/losses/draws with this model. After some minor tinkering with your Ea formula, some of the major statistics you outlined (home win %, away win %) and with some help from another site that had Elo ratings I was able to fashion together a prediction model. It was able to accurately predict the outcome for 11 of the final 14 matches in the EPL season including draws (I didn’t have enough data to go back further and test it). 
I am wondering if this is a relatively high percentage or if you have been able to make a model that has more success.</p> <p>Your input would be greatly appreciated, thanks.</p> <div class="hline"></div> <p><strong>Martin Eastwood - August 13, 2014</strong></p> <p>Hi Nathan – yes that looks a very high success rate! It’s a small sample size though so really you need to test it over more matches to see whether that level of accuracy is sustainable. Let me know how you get on with it :-)</p> <script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? "innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: 'center'," + " displayIndent: '0em'," + " showMathMenu: true," + " tex2jax: { " + " inlineMath: [ ['\\\\(','\\\\)'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" + " } " + "}); "; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); } </script>Martin EastwoodThu, 07 Feb 2013 19:30:00 +0000tag:,2013-02-07:2013/02/07Elo RatingsUnderstanding Elo Ratings/2013/01/31<h2>What are Elo Ratings?</h2> <p>The Elo rating system was originally devised by Arpad Elo as a way to calculate the relative skill levels of chess players. 
Although the system was created specifically for chess it has also been adapted to many other games and sports, including <a href="http://www.eloratings.net/">international football</a>.</p> <h2>How Do They Work?</h2> <p>The fundamental principle behind Elo ratings is that the performance of a team in each match can be considered a random variable sampled from a normally distributed population centred on the team’s true skill level. Although performances will vary from match-to-match, the true skill level of the team is likely to only change slowly over time so can be considered to be the mean value of all their performance values.</p> <p>For example, Figure one shows a team with an Elo rating of 1500. On any given day their actual performance could vary from anywhere below 1000 to above 2000. But over a reasonable period of time their performances will average out to 1500.</p> <p><img alt="Pelican" src="../../../../images/130131-Elo-Distribution.png" /></p> <p><strong>Figure 1: Possible performances for a team with Elo of 1500</strong></p> <h2>Why are Elo Ratings Useful?</h2> <p>Elo ratings have no units and taken in isolation their specific values are of little interest. However, they become useful when comparing teams together as they can be used to determine the expected outcome between two teams based on the difference between their Elo ratings.</p> <p>The range used for Elo ratings is somewhat arbitrary with Elo himself suggesting they should be scaled so that a difference of two hundred points equates to the higher ranked team having a win probability of 75%. In addition, Elo ratings are generally scaled so that an average team has a rating of 1500.</p> <h2>Predicting Match Results Using Elo</h2> <p>Plotting two team’s Elo distributions together gives a nice way of visualizing their expected performances. Figure 2 shows Team 1 with an Elo rating of 1100 compared with Team 2 with an Elo of 1500. 
The most likely outcome is that both teams will play to their average ratings and so Team 2 will win overall as they have the higher rating. However, both teams’ performance distributions overlap each other, so it is possible for Team 1 to outperform Team 2 and win the match.</p> <p><img alt="Pelican" src="../../../../images/130131-Elo-Comparison.png" /></p> <p><strong>Figure 2: Comparison of two teams’ Elo performance probabilities</strong></p> <p>The more these performance distributions overlap, the greater the chance of the lower rated team winning the match. The actual probability of victory can then be calculated from these two distributions by subtracting one from the other to get the normal difference distribution between the two (Figure 3).</p> <p><img alt="Pelican" src="../../../../images/130131-ELO-Difference-Distribution.png" /></p> <p><strong>Figure 3: Probability of Elo differential occurring</strong></p> <p>The centre of this new distribution is equal to the difference between the two ratings (1500 – 1100), meaning the most likely outcome is that Team 2 play like a team with an Elo rating four hundred points higher than Team 1. As we move further to the left the difference between the two teams decreases until we reach a negative differential at which point Team 1 actually start to play better than Team 2, albeit with a low probability of occurrence.</p> <p>The actual probability of this occurring can be plotted using cumulative frequency to show the overall chance of winning based on the Elo differential (Figure 4). 
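</p> <p>As a rough sketch of how a rating differential maps to a win probability, the Python below evaluates two curves. Neither is taken from this post: <code>win_probability_logistic</code> assumes the logistic curve with a scale of 400 points that is widely used for Elo ratings, while <code>win_probability_normal</code> assumes each team’s performance is normally distributed with a standard deviation of 200 points. Both function names and both parameter values are illustrative assumptions.</p>

```python
import math


def win_probability_logistic(rating_diff, scale=400.0):
    """Win probability for a team rated `rating_diff` points above its opponent.

    Assumes the common logistic Elo curve; `scale` is an assumed parameter.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / scale))


def win_probability_normal(rating_diff, sd=200.0):
    """Win probability when each team's performance is normally distributed.

    `sd` is an assumed per-team standard deviation. The difference between
    two independent normal performances has standard deviation sd * sqrt(2).
    """
    z = rating_diff / (sd * math.sqrt(2.0))
    # Standard normal cumulative distribution written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    for diff in (-400, 0, 200, 400):
        print(diff,
              round(win_probability_logistic(diff), 3),
              round(win_probability_normal(diff), 3))
```

<p>Under these assumptions both curves give roughly 76% for a 200 point advantage, in line with the 75% scaling described earlier. For a differential of -400 the logistic curve gives about 9% and the normal curve about 8%.</p> <p>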
So for our example above, Team 1 with its differential of -400 actually has around a 9% chance of winning the match while Team 2 has a 91% chance of winning.</p> <p><img alt="Pelican" src="../../../../images/130131-Elo-Frequency1.png" /></p> <p><strong>Figure 4: Probability of winning based on Elo differential</strong></p> <p>So now we understand the theory behind Elo ratings, my next post will look at how they can be calculated and applied to football teams.</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Lars - February 1, 2013</strong></p> <p>“two hundred points equates to the higher ranked team having a win probability of 75%. Following this advice means that an average team should have an Elo rating of 1500.”</p> <p>That conclusion is false. The win probability would be exactly the same if the average Elo rating was 2500 or 9500. The Elo scale is not absolute, only the difference between teams’ Elo values is relevant.</p> <p>Apart from that very good introduction, congratulations. I am really looking forward to the sequel.</p> <div class="hline"></div> <p><strong>Martin Eastwood - February 2, 2013</strong></p> <p>Good point Lars, I have re-phrased that sentence to avoid any confusion.</p> <p>Thanks for the feedback!</p>Martin EastwoodThu, 31 Jan 2013 19:30:00 +0000tag:,2013-01-31:2013/01/31Elo RatingsPredicting Football Matches Using Shot Data Part Two/2013/01/25<h2>Introduction</h2> <p>Having found that the correlation between goals scored and shots on target was the strongest of the various shooting variables I had available to me, I decided to see how well they could predict the outcome of a football match.</p> <h2>Creating The Model</h2> <p>The obvious approach would have been to just do a linear regression for goals scored against number of shots on target and then predict the average number of goals each team would be expected to score. This doesn’t provide much insight though. 
The average score line might be of interest if each team was going to play each other 20 or 30 times a season but for a single game it is pretty much irrelevant.</p> <p>What is of more use is to predict the actual odds for each possible outcome between the teams. In other words, what is the probability of each team winning, drawing or losing?</p> <p>To do this I looked at how many shots on target each team achieved and conceded each match compared with the league’s average to estimate how many they would be expected to have against each other. This was then mapped to the distribution of their shots on target over the season so far, and their shot conversion rate was used to calculate the probabilities of the different numbers of goals they could score. Each match was then played one million times as part of a Monte Carlo simulation to see what the likely outcomes were.</p> <h2>Are the Predictions Accurate?</h2> <p>One difficulty with a model like this is assessing its accuracy. With a traditional linear model you can just look at the <span class="math">\(r^2\)</span> value to see how well your predictions match the actual results. The higher the <span class="math">\(r^2\)</span> value, the better your model is.</p> <p>But with a probability model this doesn’t work. For example, take the situation where the probability model predicts Team A have a 75% chance of beating Team B. Even if the model has calculated these odds perfectly, Team A will still fail to win 25% of the time, making it look like the prediction was incorrect.</p> <p>One alternative is to identify what the most probable outcome for each match was – win, draw or loss – and compare that with what actually happened to see if they match. 
To do this I applied the model retrospectively to all the matches from the 2011–2012 English Premier League season and overall the proportions of outcomes predicted did match closely what actually happened (Figure 1).</p> <p><img alt="Pelican" src="../../../../images/130121-SOT-Proportions.png" /></p> <p><strong>Figure 1: Proportion of outcomes predicted compared with actual results for 2011–2012 English Premier League season</strong></p> <p>Another test we can do is to compare the Shot on Target model with other models to see how well they compare. Again I picked the most probable outcome from my odds and this time compared it with those from Bet365 for the entire 2011–2012 English Premier League season. I also randomly guessed the outcome for each match by chance to see how the model compared with pure luck too.</p> <h2>Prediction Results</h2> <p>Overall, the Shot on Target model’s most probable outcome correctly matched what actually happened for 43% of the matches tested compared with 52% for Bet365 and 33% from randomly guessing.</p> <p>Interestingly, even the bookies only managed to get the odds correct for around half the matches so the Shot on Target model is doing pretty well at 43% and isn’t that far behind the professional odds compilers. 
Also, this is only the first stage of the model, there are still plenty of ways it can be tweaked to try and improve its accuracy further.</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Laurie - January 31, 2013</strong></p> <p>Reading your blog it seems that shots on goal is a good thing to study to determine a win/draw/lose prediction of a game.I have an ELO system on excel and also power ratings.However i want to add shot:shots on target:goals to my excel system,but am unsure how to go about doing this.Im hoping you can help.Maybe send me an email with some advice.I’d be very grateful for your help.</p> <p>Cheers</p> <p>Laurie</p> <div class="hline"></div> <p><strong>BernieW - February 4, 2013</strong></p> <p>I did notice that Barca did not perform as well when their shots &amp; shots on target were below their average PLUS the opposition had their shots &amp; shots on target above the average. Do you use clear chances as a metric? I cannot find this statistic recorded by any site and I get the “feeling” that this would be more accurate a predictor. It would be great to have this for all the 5/6 major leagues for a few seasons.</p> <div class="hline"></div> <p><strong>Martin Eastwood - February 4, 2013</strong></p> <p>It is an interesting idea. I am looking at ways to improve the model by adding in extra metrics so I’ll take a look at it if i can find any data available</p> <div class="hline"></div> <p><strong>David - February 3, 2014</strong></p> <p>Great post..I have always wanted to use something other than linear regression. 
Could you make a short tutorial or send me an email on how you did the part where you create the model?</p> <p>How do you estimate how many shots on target they are expected to have against each other and how do you end up with the score probabilities?</p> <p>I would love to be able to recreate this model you have made.</p> <p>Looking forward to your reply and keep up the good work.</p> <div class="hline"></div> <p><strong>Nick - April 16, 2014</strong></p> <p>I came across a blog post (http://www.soccerstatistically.com/blog/2011/11/9/how-to-succeed-in-the-epl-chances-created-and-chance-convers.html) that analyzes chances created and goals scored. There is a correlation (adding in the chance conversion rate). The author uses data from Opta.</p> <div class="hline"></div> <p><strong>Martin Eastwood - April 16, 2014</strong></p> <p>Thanks for the link!</p> <div class="hline"></div> <p><strong>Nick - April 16, 2014</strong></p> <p>I have thought about the same – goal chances although a little bit subjective and more difficult to define than shots on target (is a shot from 30 yards that blazed over the bar a goal chance?) 
should be a much better predictor than shots on target because shots on target do not tell you anything about the quality of those shots and exclude great goal chances that resulted in shots off target or no shot at all.</p> <div class="hline"></div> <p><strong>Nick - June 20, 2014</strong></p> <p>This may be interesting for you: http://www.pinnaclesports.com/online-betting-articles/05-2014/world-cup-total-shots-ratio.aspx</p> <div class="hline"></div> <p><strong>Martin Eastwood - June 21, 2014</strong></p> <p>Thanks for the link Nick!</p> <script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? "innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: 'center'," + " displayIndent: '0em'," + " showMathMenu: true," + " tex2jax: { " + " inlineMath: [ ['\\\\(','\\\\)'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! 
important'} }" + " } " + "}); "; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); } </script>Martin EastwoodFri, 25 Jan 2013 19:30:00 +0000tag:,2013-01-25:2013/01/25ShotsWhat Is The Chance of Bradford City reaching Wembley?/2013/01/22<p>With League Two’s Bradford City only one match away from playing at Wembley in the League Cup final I thought it would be interesting to see what the chances were of them getting this far.</p> <p>It has been an unbelievable cup run for Bradford City as they have had to play against teams in higher divisions nearly every step of the way. The first round of the cup pitted them against League One team Notts County, whom they managed to defeat in extra time. This was then followed a couple of weeks later with a 2–1 victory away at Championship side Watford.</p> <p>The third round was somewhat kinder to them as they played at home against Burton Albion, a team from their own division. However, having seen off Burton, the fourth round then took them to Premier League side Wigan Athletic, whom they managed to defeat on penalties.</p> <p>The quarter final again put Bradford against another Premier League team, with Arsenal this time Bradford’s victims. Next up was Aston Villa, who lost 3–1 to Bradford at Valley Parade although Villa did come away with an away goal, which could prove to be critical for them.</p> <p>To work out the probability of Bradford’s cup run I collected the odds for each match from www.oddsportal.com and removed the overround to get the true odds. The overround is the bookies’ profit margin created by offering odds lower than the actual true odds of the event occurring. 
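</p> <p>As a rough sketch of the calculation, the Python below strips the overround from one match’s odds by proportional scaling and then multiplies per-match probabilities into a cumulative figure. It assumes decimal odds and treats the matches as independent events. The function names and the example odds are hypothetical and are not the figures collected for this post.</p>

```python
def remove_overround(decimal_odds):
    """Turn one match's decimal odds into probabilities that sum to 1.

    Assumes proportional scaling: each implied probability (1 / odds) is
    divided by the total, which exceeds 1 by the bookmaker's overround.
    """
    implied = [1.0 / odds for odds in decimal_odds]
    total = sum(implied)
    return [p / total for p in implied]


def cumulative_probability(match_probabilities):
    """Multiply per-match probabilities, treating the matches as independent."""
    result = 1.0
    for p in match_probabilities:
        result *= p
    return result


if __name__ == "__main__":
    # Hypothetical home / draw / away odds for a single match.
    example_odds = [2.10, 3.30, 3.40]
    fair = remove_overround(example_odds)
    print("overround:", sum(1.0 / o for o in example_odds) - 1.0)
    print("fair probabilities:", fair)
    # Hypothetical per-round probabilities for the results a team needed.
    print("run probability:", cumulative_probability([fair[1], 0.2, 0.5]))
```

<p>With real data, the list passed to <code>cumulative_probability</code> would hold the overround-free probability of the result that actually occurred in each round.</p> <p>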
To remove it we just need to scale the odds by the excess so that they add up to exactly 100%.</p> <p>Once we have the true odds we can then work out the cumulative probability of the cup run by multiplying the odds together (note that bookies’ odds generally refer to what happens over the first ninety minutes of the match so Bradford City beating Notts County in extra time is actually classed as a draw rather than an away win).</p> <p><span class="math">\(\text{Prob(cup run)} = \text{Prob(draw with Notts County)} \times \text{Prob(beating Watford)} \times \dots\)</span></p> <p>Overall, the probability of Bradford City’s current cup run so far is 0.008%. If we take into account tonight’s match then the chances of Bradford’s cup run taking them all the way to Wembley are around 0.001% or 1 in 100,000. It’s not quite a lottery win, which is around 100 times less likely again, but it is a fantastic achievement for Bradford City and is likely to be a once in a lifetime experience for their fans.</p> <h2>Comments</h2> <p><div class="hline"></div></p> <script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? 
"innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: 'center'," + " displayIndent: '0em'," + " showMathMenu: true," + " tex2jax: { " + " inlineMath: [ ['\\\\(','\\\\)'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" + " } " + "}); "; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); } </script>Martin EastwoodTue, 22 Jan 2013 19:30:00 +0000tag:,2013-01-22:2013/01/22ChancePredicting Football Matches Using Shot Data/2013/01/21<h2>Introduction</h2> <p>Previously on this blog I have discussed my attempts at using the Poisson distribution to predict the number of goals scored in football matches. So far, the results have been disappointing as the mathematical model I constructed under-predicted the number of draws that occurred. This is something I intend to go back and address at some point by adding in the Dixon and Coles adjustment but in the meantime I thought I would try predicting the outcome of matches using shots instead.</p> <h2>Shot Data</h2> <p>There were a number of reasons for working with shots instead of using goals directly. First of all, shots and goals are inherently linked together. For every goal scored there has to be a shot taken. Secondly, not every shot taken leads to a goal, giving us a much larger data set to work with compared with just goals alone. Thirdly, the number of shots taken in a match is pretty much normally distributed (Figure 1) whereas the number of goals scored is closer to a Poisson distribution. 
This is useful as many statistical tests rely on a normal distribution of data.</p> <p><img alt="Pelican" src="../../../../images/130121-Total-Shot-Histogram.png" /></p> <p><strong>Figure 1: Frequency of total shots in English Premier League matches 2009–2012</strong></p> <p>The first stage of developing the model was to determine what variables to use for it. Looking at data over a whole season showed a decent correlation between goals scored and total shots taken (<span class="math">\(r^2\)</span>=0.62), shots on target (<span class="math">\(r^2\)</span>=0.76), shots blocked (<span class="math">\(r^2\)</span>=0.59) and shots wide (<span class="math">\(r^2\)</span>=0.32; Figure 2).</p> <p><img alt="Pelican" src="../../../../images/130121-Shot-Correlations.png" /></p> <p><strong>Figure 2: Correlation between goals scored and various shooting parameters for the 2011–2012 English Premier League Season</strong></p> <h2>Variability</h2> <p>Unfortunately when you start looking at the data match-by-match the correlations become much weaker. Over the course of an entire season a lot of the variability in the data starts to even out but over a single match it is not the case and variables such as luck can play a much bigger role. For example it is likely that the teams with the most shots on target will score the most goals overall per season as skill would start to dominate over luck. However, this isn’t always the case for an individual game – we have all seen matches where a team has scored a lucky goal and then managed to hold on for the win even though the opposition has showered their goal with shots for ninety minutes.</p> <p>Because of this I decided to exclude many of the variables as they have little value over a single match. Instead, I focussed on using just shots on target data as this had the highest correlation with goals match-by-match. 
As with the total number of shots taken, the data is also roughly normally distributed although it is skewed towards zero (Figure 3) as obviously no matter how bad a team is it cannot achieve less than zero shots on target in a match (although Blackburn Rovers came close by managing to go the entire match against Tottenham Hotspur in 2012 without taking even a single shot, let alone managing to get one on target!)</p> <p><img alt="Pelican" src="../../../../images/130121-SOT-Histogram.png" /></p> <p><strong>Figure 3: Frequency of Shots on Target in English Premier League matches 2011–2012</strong></p> <p>In my next post I will explain more about how the Shot on Target model works and discuss its accuracy.</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Ilia - July 30, 2013</strong></p> <p>Hi Martin,</p> <p>I’m curious how you got such a high <span class="math">\(r^2\)</span> value for shots on goal. When I try to do the same calculations on the EPL I can’t get more than 0.17. Any idea on what I might be doing wrong?</p> <div class="hline"></div> <p><strong>Martin Eastwood - July 30, 2013</strong> I aggregated shots on goal with goals scored over a full season. Maybe you are looking at individual matches in which case the <span class="math">\(r^2\)</span> will likely be much lower?</p> <script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? 
"innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: 'center'," + " displayIndent: '0em'," + " showMathMenu: true," + " tex2jax: { " + " inlineMath: [ ['\\\\(','\\\\)'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" + " } " + "}); "; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); } </script>Martin EastwoodMon, 21 Jan 2013 19:30:00 +0000tag:,2013-01-21:2013/01/21ShotsPredicting The Premier League Using The Refined Pythagorean Equation/2013/01/18<h2>Introduction</h2> <p>New article for <a href="http://www.bettingexpert.com/blog/pythagorean-expectation-and-football">Betting Expert</a> looking at the current Premier League standings compared with the predictions from my refined version of the Pythagorean Expectation.</p> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodFri, 18 Jan 2013 19:30:00 +0000tag:,2013-01-18:2013/01/18How Early In The Season Can Pythagorean Predictions Be Made?/2013/01/02<h2>Introduction</h2> <p>The next stage for developing my refined version of the Pythagorean equation (known as the MPE) is to characterise how much data it actually needs to make accurate football predictions.</p> <h2>Methodology</h2> <p>To investigate this I selected Manchester City, Swansea City and Wolverhampton Wanderers from the English Premier League’s 2011–2012 season. 
The reason for choosing these teams was that they represented the top, middle and bottom of the league so I could test the MPE equation across teams of varying quality and league position.</p> <p>I then used the MPE equation to predict the total points at the end of the season for each team week-by-week to see how the prediction changed throughout the year. Figure 1 shows the difference between the predicted points and the actual points achieved at the end of the season for each of the three teams.</p> <p><img alt="Pelican" src="../../../../images/130102_pythag_by_week.png" /></p> <p><strong>Figure 1: Difference Between Predicted and Actual Points</strong></p> <h2>Results</h2> <p>The prediction settled down very quickly for Manchester City and from match three onwards the root mean square error (RMSE) was just 1.96 points. This means that after just three games the MPE equation was accurately predicting how many points Manchester City would have at the end of the season to within two points.</p> <p>For Swansea City the prediction was slightly more problematic as they didn’t score during their first four matches and the MPE equation needs goals to have been scored before a valid prediction can be made. Swansea City finally scored in their fifth match in a 3–0 victory over West Bromwich Albion and from then on the prediction steadily improved and was within three points of their actual total after their next six matches.</p> <p>Wolverhampton Wanderers’ season was an interesting one to predict as they had a very misleading start with two wins and a draw in their first three matches giving a predicted point total of 83. At this point though it all went disastrously wrong for them and they lost their next five matches in a row, by which time their predicted points had dropped all the way down to 30. 
Wolverhampton Wanderers eventually finished bottom of the league with 25 points.</p> <p>Overall, the MPE equation appears to give stable results and the only real requirement is that goals have been scored. Based on the data in Figure 1 accurate predictions can be made early in the season as there is very little change in the predictions from week ten of the season onwards.</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>amir - March 23, 2013</strong></p> <p>That is very interesting.</p>Martin EastwoodWed, 02 Jan 2013 19:30:00 +0000tag:,2013-01-02:2013/01/02PythagoreanWhat Has Caused Dimitar Berbatov’s Recent Lack of Goals?/2012/12/15<h2>Introduction</h2> <p>Up until week 12 of the season, Dimitar Berbatov was one of the English Premier League’s top goal scorers and goal creators. However, since then he has gone 450 minutes without registering either a goal or an assist, coinciding with Bryan Ruiz’s injury. Check out my <a href="http://www.bettingexpert.com/blog/what-has-caused-berbatovs-goal-drought">guest article</a> in which I analyse the effect the absence of Ruiz has had on Berbatov’s performances.</p> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodSat, 15 Dec 2012 19:30:00 +0000tag:,2012-12-15:2012/12/15Using the Pythagorean Expectation Across Leagues Worldwide/2012/12/10<h2>Introduction</h2> <p>I showed in my <a href="http://pena.lt/y/2012/12/03/applying-the-pythagorean-expectation-to-football-part-two/">last post</a> that my initial version of the Pythagorean Expectation (MPE) predicted total points for the English Premier League (EPL) pretty well, with an RMSE of approximately four points over the course of a whole season (see <a href="http://en.wikipedia.org/wiki/Root-mean-square_deviation">here</a> for an explanation of using RMSE to measure the error of the predictions). The next stage for the equation’s development is to see whether it can be applied to other leagues too. 
Having one MPE equation that could be used globally across leagues is preferable to having to create specific equations for each league.</p> <h2>The Eredivisie</h2> <p>At the recommendation of <a href="http://scoreboardjournalism.wordpress.com/">Scoreboard Journalism's</a> Simon Gleave I started with the Eredivisie, the top flight division in Holland. The reason for choosing the Eredivisie is that it is a unique league, with high rates of goal scoring and a number of results in recent years that appear as potential outliers. For example, in the 2009–2010 season Ajax scored 43 goals more than Twente and conceded three fewer yet still finished second to them in the league. At the other end of the table Willem II finished 15th in 2007–2008 with a goal difference of -9 while the two teams immediately above them had goal differences of -30 and -24, respectively. These sorts of results make the Eredivisie difficult to predict and so provide a good stress test for the MPE equation.</p> <p>Applying the MPE to the final Eredivisie standings from 1999–2000 to 2011–2012 worked surprisingly well, with an overall RMSE of 4.35 points. It is slightly higher than the 4.08 previously obtained for the EPL but this is perhaps to be expected since the original MPE equation was generated using just data from the EPL.</p> <p>To see whether the Dutch league needed its own version of MPE I recreated the equation based on just Eredivisie data and the overall error dropped to 4.21, a decrease of around 3%. Such a minor improvement suggests that the equation may be stable across leagues and so we will not need league-specific versions.</p> <p>To test this hypothesis further I collected 223 league tables from around the world and optimised the MPE against this larger data set. The reason for this was three-fold. 
Firstly, the original equation I published was created just from EPL data so any peculiarities of the EPL could bias results for other leagues.</p> <p>Secondly, the previous data set was smaller so any outliers in the data could have a large effect on the finalised results. By using a larger data set the influence of any outliers will be minimised.</p> <p>Thirdly, and perhaps most importantly, this gave enough data to <a href="http://en.wikipedia.org/wiki/Cross-validation_(statistics)">cross-validate</a> the equation by randomly splitting the league tables up into training and validation sets. Initially, the MPE had been trained and tested using the same data. Now it has been tested on data different from that against which it was optimised, reducing the risk of <a href="http://en.wikipedia.org/wiki/Testing_hypotheses_suggested_by_the_data">Type III errors</a> occurring.</p> <p>Figure 1 shows the RMSE for the predictions for fifteen leagues randomly selected as a validation set. The overall RMSE across the entire validation set is 3.88 points and is plotted as the vertical dotted line. The overall RMSE is now reduced to below four points and this new version of the MPE equation appears suitable for use globally across different leagues.</p> <p><img alt="Pelican" src="../../../../images/121210_MPE_Training_Results.png" /></p> <p><strong>Figure 1: Results For Validation of MPE Equation</strong></p> <p>The finalised MPE Pythagorean Expectation is shown in Figure 2. 
Based on the data shown here this new version of the MPE equation is suitable for use across multiple leagues worldwide, with an average error of less than 4 points per season.</p> <p><span class="math">\(\text{predicted points} = (\text{goalsfor}^{1.2299}/(\text{goalsfor}^{1.16793} + \text{goalsaway}^{1.20053})) \times 2.29761 \times \text{numberofgamesplayed}\)</span></p> <p><strong>Figure 2: MPE Equation</strong></p> <h2>Comments</h2> <p><div class="hline"></div></p> <script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) { var mathjaxscript = document.createElement('script'); mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#'; mathjaxscript.type = 'text/javascript'; mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; mathjaxscript[(window.opera ? "innerHTML" : "text")] = "MathJax.Hub.Config({" + " config: ['MMLorHTML.js']," + " TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," + " jax: ['input/TeX','input/MathML','output/HTML-CSS']," + " extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," + " displayAlign: 'center'," + " displayIndent: '0em'," + " showMathMenu: true," + " tex2jax: { " + " inlineMath: [ ['\\\\(','\\\\)'] ], " + " displayMath: [ ['$$','$$'] ]," + " processEscapes: true," + " preview: 'TeX'," + " }, " + " 'HTML-CSS': { " + " styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! 
important'} }" + " } " + "}); "; (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript); } </script>Martin EastwoodMon, 10 Dec 2012 19:30:00 +0000tag:,2012-12-10:2012/12/10PythagoreanApplying the Pythagorean Expectation to Football: Part Two/2012/12/03<h2>Introduction</h2> <p>In my <a href="http://pena.lt/y/2012/11/26/applying-the-pythagorean-expectation-to-football-part-one/">previous article</a>, I discussed how to apply the baseball Pythagorean expectation to football and how to measure the error of the predictions using <a href="http://en.wikipedia.org/wiki/Root-mean-square_deviation">RMSE</a>. This second article will demonstrate how to optimize the equation further to improve its accuracy.</p> <h2>Accuracy</h2> <p>One of the major reasons for the error in the predictions is the occurrence of draws in football. The Pythagorean expectation only looks at wins and losses and presumes that if a team scores zero goals then it will achieve zero points. This is of course incorrect, it is perfectly feasible for a team to fail to score but still gain a point through a nil-nil draw so we need to take this into account.</p> <p>Howard Hamilton of <a href="http://www.soccermetrics.net/">Soccermetrics</a> has published an updated <a href="http://hhamilton.typepad.com/files/pythag_mit_sa_2010.pdf">Soccer Pythagorean</a> equation that does just that, and it does a good job of it. For the 2011–2012 season, Howard Hamilton reports an RMSE of 3.81 compared with the RMSE of 5.65 I reported for my previous version of the Pythagorean equation. The downside to Howard Hamilton’s equation though is that it is rather complicated. 
While the original Pythagorean equation is simple enough to be used by any football fan, Howard Hamilton’s equation requires a decent understanding of mathematics to use it.</p> <p>Because of that, I thought I would tweak the original Pythagorean formula a bit further to try to improve its accuracy without adding too much extra complexity to it. One easy way to do this is to scale the points scored per match to take into account the occurrence of draws. Applying least squares to this reduces the RMSE for the 2011–2012 season to 4.04 points, just 6% higher than Howard Hamilton’s equation. This is based on only one season’s data though, so to get a true idea of how well my enhanced Pythagorean expectation (abbreviated to MPE) works I optimized the equation based on a much larger data set and applied it to the last 10 English Premier League (EPL) seasons (Figure 1).</p> <p><img alt="Pelican" src="../../../../images/121203_pthagorean_seasons.png" /></p> <p><strong>Figure 1: MPE Prediction by Season in the EPL</strong></p> <p>The MPE works well, with an average residual (the difference between predicted points and actual points) of 4.08 points. This compares nicely with Howard Hamilton’s published value of 3.81 and is less than half of the error the original Pythagorean Expectation equation gave. It is also worth noting that Howard Hamilton’s RMSE of 3.81 is for just one season, and of the ten seasons analysed here using the MPE, two actually have an RMSE lower than 3.81.</p> <p>Plotting the MPE predicted points versus actual points for the last ten EPL seasons shows visually how well the MPE equation works (Figure 2). 
The correlation between the predicted and actual points scored is excellent, with an <span class="math">\(r^2\)</span> value of 0.938.</p> <p><img alt="Pelican" src="../../../../images/121203_mpe_scatter.png" /></p> <p><strong>Figure 2: MPE Predicted Points Versus Actual Points in the EPL</strong></p> <p>So based on the initial work so far I am pleased that the MPE version of the Pythagorean expectation gives results comparable to Howard Hamilton’s more detailed and advanced derivation but without quite as much added complexity. The final equation for anybody who wants to give it a try is shown below in Figure 3.</p> <p><span class="math">\(\text{predicted points} = \frac{\text{goalsfor}^{1.22777}}{\text{goalsfor}^{1.072388} + \text{goalsaway}^{1.127248}} \times 2.499973 \times \text{numberofgamesplayed}\)</span></p> <p><strong>Figure 3: MPE Equation</strong></p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>mark - December 3, 2012</strong></p> <p>Hi Martin,</p> <p>nice, clear explanation.</p> <p>Have you looked at using pythag to predict future games rather than explain previous points totals? I’ve always thought that the drive to reduce rmse for predicted vs actual points can lead to overfitting of the non repeatable luck driven component of the actual games.</p> <p>I’ve looked at pythag match ups as predictors of future games, but only in the NFL, not soccer and pythag league points from one year to predict total points in the next season for soccer here http://thepowerofgoals.blogspot.co.uk/2012/11/a-predictive-pythagorean-for-football.html</p> <p>Game by game pythag is also a novel twist.</p> <p>Interesting subject, but a bit confused as to where it’s going at the mo. Posts like yours will go a long way towards clarifying things. Once again, nice read.</p> <p>Mark</p> <div class="hline"></div> <p><strong>admin - December 3, 2012</strong></p> <p>Hi Mark,</p> <p>Thanks for the comments, they are really interesting points. 
I agree about the RMSE: football is so variable and luck-driven that an error of zero is unrealistic and probably not particularly useful for making predictions from, as it will not reflect the future accurately.</p> <p>I am interested in looking at the predictive power of the Pythagorean and will be investigating that further. I am also interested though in what else the Pythagorean shows, such as how much Everton would need to improve in terms of goals scored / conceded to really challenge for a top four place or for a relegation-threatened team to stay up etc. Hopefully over the next few weeks I will be able to find out how useful the Pythagorean can be for this sort of thing in football.</p> <p>Thanks again!</p> <p>Martin</p> <div class="hline"></div> <p><strong>Jonas - December 12, 2012</strong> Thanks for an interesting blog. I have just one question. I see that you have fitted two different exponents for “GoalsFor”; have you tried fitting the formula with the same parameter for “GoalsFor” in both the numerator and denominator? This would seem more intuitive (not that I find the Pythagorean very intuitive), but then again you would probably get a larger RMSE.</p> <div class="hline"></div> <p><strong>admin - December 12, 2012</strong></p> <p>Yes you could simplify it by using one exponent but the RMSE will go up.</p> <div class="hline"></div> <p><strong>Andrew Ferris - December 17, 2012</strong></p> <p>Applying your equation to the current league table for the premiership on a team by team basis, the total points would equal 450, as opposed to the real points total currently 451 (correct 17/12/12), which is amazing accuracy. However, it is out by 2.7 points per team on average, and out by 11 points for Manchester United, which would be great in real life as I detest them! 
The league table would look as follows:</p> <table> <thead> <tr> <th>Team</th> <th>Points</th> </tr> </thead> <tbody> <tr> <td>Manchester City</td> <td>34</td> </tr> <tr> <td>Manchester United</td> <td>31</td> </tr> <tr> <td>Chelsea</td> <td>28</td> </tr> <tr> <td>Arsenal</td> <td>28</td> </tr> <tr> <td>Everton</td> <td>27</td> </tr> <tr> <td>Tottenham Hotspur</td> <td>25</td> </tr> <tr> <td>Swansea City</td> <td>25</td> </tr> <tr> <td>West Bromwich Albion</td> <td>25</td> </tr> <tr> <td>Stoke City</td> <td>25</td> </tr> <tr> <td>West Ham United</td> <td>23</td> </tr> <tr> <td>Liverpool</td> <td>23</td> </tr> <tr> <td>Fulham</td> <td>22</td> </tr> <tr> <td>Norwich City</td> <td>19</td> </tr> <tr> <td>Sunderland</td> <td>19</td> </tr> <tr> <td>Newcastle United</td> <td>18</td> </tr> <tr> <td>Southampton</td> <td>17</td> </tr> <tr> <td>Aston Villa</td> <td>16</td> </tr> <tr> <td>Reading</td> <td>15</td> </tr> <tr> <td>Wigan Athletic</td> <td>15</td> </tr> <tr> <td>Queens Park Rangers</td> <td>14</td> </tr> </tbody> </table> <div class="hline"></div> <p><strong>Max Steele - October 4, 2013</strong></p> <p>Hi Martin,</p> <p>I am slightly confused by the last graph. Surely it is more helpful to measure deviation from the line y=x rather than the line of best fit? I don’t really see what success a high correlation coefficient has if for example the best fit line was vertical. To highlight this, realise that you could achieve this with r^2 = 1 if you just predicted every team to have the same points total.</p> <p>Using the line y=x would have the added benefit of allowing you to see where your model is performing better/worse (i.e. low point scorers underestimated etc.) 
with systematic deviations from the line y=x.</p> <p>Thanks, Max</p> <div class="hline"></div> <p><strong>M - December 5, 2013</strong></p> <p>Why the formula has GoalsAway instead of GoalsAgainst?</p> <div class="hline"></div> <p><strong>Martin Eastwood - December 29, 2013</strong></p> <p>Yes, perhaps GoalsAgainst would have been a better name so in case it isn’t clear to anybody else, I am referring to goals conceded</p> <div class="hline"></div> <p><strong>Rick Tee - September 6, 2014</strong></p> <p>I have been working on this for a month or so and what i’ve found is accuracy drops from about 60% in the BPL to 40% in the Championship and lower, I also noticed a strange phenomenon where the outcome was the reverse of the test result i.e if team B were deemed the winner then team A would actually be the winner.</p> <table> <thead> <tr> <th>Current result</th> <th>%</th> </tr> </thead> <tbody> <tr> <td>BPL</td> <td>70%</td> </tr> <tr> <td>Cha</td> <td>60%</td> </tr> <tr> <td>LG1</td> <td>60%</td> </tr> <tr> <td>LG2</td> <td>55%</td> </tr> <tr> <td>CNF</td> <td>70%</td> </tr> </tbody> </table> <p>I should add I am also counting draws in the accuracy quoted above. I will continue to test but as the season is still ‘finding its feet’ i’m sure thing will change.</p> <div class="hline"></div> <p><strong>Rick Tee - September 6, 2014</strong></p> <p>Thought I would add, for the draws i am using a variable x and y, these are set differently for each league, eg. BPL x=75 y=84, CNF x=57.25 y=55.</p> <p>Hw = 70, Aw = 75, D = 80 If the points fall between these two numbers then a draw is the predicted outcome.</p> <p>I know the numbers are widely different but the system is essentially the same. I just thought i would share some information on how i’ve been calculating a draw. 
If anyone finds it useful I will try to explain my system in more detail.</p> Martin EastwoodMon, 03 Dec 2012 19:30:00 +0000tag:,2012-12-03:2012/12/03PythagoreanApplying the Pythagorean Expectation to Football: Part One/2012/11/26<h2>Introduction</h2> <p>The <a href="http://en.wikipedia.org/wiki/Pythagorean_expectation">Baseball Pythagorean Expectation</a> is a formula originally derived by Bill James to estimate how many games a baseball team could be expected to win over a season based on the number of runs they score and concede (Figure 1). 
Teams winning fewer games than their Pythagorean prediction are considered to have been unlucky while those outperforming the prediction are thought to have had luck on their side.</p> <p><span class="math">\(\text{win ratio} = \frac{\text{runs scored}^2}{\text{runs scored}^2 + \text{runs allowed}^2}\)</span></p> <p><strong>Figure 1: The Baseball Pythagorean Expectation</strong></p> <p>The formula works well for baseball, giving predictions generally within three games of what actually happens. The Pythagorean expectation has also been applied successfully to other sports, including American football and basketball. However, so far the equation has not worked particularly well for predicting football matches.</p> <p>Table 1 shows goals scored and conceded in the English Premier League (EPL) during the 2011–2012 season, along with the actual points and Pythagorean predicted points. Looking at the difference between predicted and actual points it is clear that the Pythagorean expectation is over-predicting at the top of the table and under-predicting at the bottom.</p> <table class="table"> <tbody> <tr> <td>Team</td> <td>GF</td> <td>GA</td> <td>Pts</td> <td>Pythag Pts</td> </tr> <tr> <td>Manchester City</td> <td>93</td> <td>29</td> <td>89</td> <td>104</td> </tr> <tr> <td>Manchester United</td> <td>89</td> <td>33</td> <td>89</td> <td>100</td> </tr> <tr> <td>Arsenal</td> <td>74</td> <td>49</td> <td>70</td> <td>79</td> </tr> <tr> <td>Tottenham Hotspur</td> <td>66</td> <td>41</td> <td>69</td> <td>82</td> </tr> <tr> <td>Newcastle United</td> <td>56</td> <td>51</td> <td>65</td> <td>62</td> </tr> <tr> <td>Chelsea</td> <td>65</td> <td>46</td> <td>64</td> <td>76</td> </tr> <tr> <td>Everton</td> <td>50</td> <td>40</td> <td>56</td> <td>70</td> </tr> <tr> <td>Liverpool</td> <td>47</td> <td>40</td> <td>52</td> <td>66</td> </tr> <tr> <td>Fulham</td> <td>48</td> <td>51</td> <td>52</td> <td>54</td> </tr> <tr> <td>West Bromwich Albion</td> <td>45</td> <td>52</td> <td>47</td> <td>49</td> </tr> <tr> <td>Swansea City</td> 
<td>44</td> <td>51</td> <td>47</td> <td>49</td> </tr> <tr> <td>Norwich City</td> <td>52</td> <td>66</td> <td>47</td> <td>44</td> </tr> <tr> <td>Sunderland</td> <td>45</td> <td>46</td> <td>45</td> <td>56</td> </tr> <tr> <td>Stoke City</td> <td>36</td> <td>53</td> <td>45</td> <td>36</td> </tr> <tr> <td>Wigan Athletic</td> <td>42</td> <td>62</td> <td>43</td> <td>36</td> </tr> <tr> <td>Aston Villa</td> <td>37</td> <td>53</td> <td>38</td> <td>37</td> </tr> <tr> <td>Queens Park Rangers</td> <td>43</td> <td>66</td> <td>37</td> <td>34</td> </tr> <tr> <td>Bolton Wanderers</td> <td>46</td> <td>77</td> <td>36</td> <td>30</td> </tr> <tr> <td>Blackburn Rovers</td> <td>48</td> <td>78</td> <td>31</td> <td>31</td> </tr> <tr> <td>Wolverhampton Wanderers</td> <td>40</td> <td>82</td> <td>25</td> <td>22</td> </tr> <tr> <td></td> <td></td> <td></td> <td>RMSE</td> <td>8.4</td> </tr> </tbody> </table> <p><strong>Table 1: Pythagorean Expectation for the EPL 2011–2012</strong></p> <p>We can quantify this error by calculating the root-mean-square error (RMSE). This technique basically squares the difference between the predicted and actual points and then takes the square root of the average. It sounds complicated but all the squares and square roots do is make all the numbers positive. Imagine if we predicted just two values and were -10 points out for the first and +10 points out on the second. If we just averaged these two numbers then the average error would be zero, making it look like our prediction was perfect when it obviously was not. Instead, if we square the numbers first and then take the square root of the average we get the correct error of ten points. Doing this calculation for Table 1 gives us a RMSE of 8.4 points meaning that on average the Pythagorean expectation was eight points out for the 2011–2012 season.</p> <p>The more accurate the predictions are then the lower the RMSE will be. One way to improve the prediction is to alter the exponent used in the equation. 
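</p> <p>As a rough illustration of the RMSE calculation described above, here is a minimal Python sketch. It assumes the Pythagorean ratio is multiplied by the 114 points available over a 38-match season (3 points per win); that scaling is an assumption rather than something stated in the article, although it does reproduce the first two Pythag Pts values in Table 1. The function names are illustrative only.</p>

```python
import math


def pythagorean_points(goals_for, goals_against, exponent=2.0, max_points=114):
    """Points implied by the Pythagorean ratio.

    max_points=114 (38 matches x 3 points) is an assumption, not a figure from the article.
    """
    scored = goals_for ** exponent
    conceded = goals_against ** exponent
    return max_points * scored / (scored + conceded)


def rmse(predicted, actual):
    """Square each error, average the squares, then take the square root."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def best_exponent(rows, candidates):
    """Return the candidate exponent with the lowest RMSE for (GF, GA, Pts) rows."""
    actual = [points for _, _, points in rows]

    def error_for(exponent):
        predicted = [pythagorean_points(gf, ga, exponent) for gf, ga, _ in rows]
        return rmse(predicted, actual)

    return min(candidates, key=error_for)


# The two-value example from the text: errors of -10 and +10 give an RMSE of 10.
two_value_rmse = rmse([-10, 10], [0, 0])

# First two rows of Table 1 as (goals for, goals against, actual points).
table_rows = [(93, 29, 89), (89, 33, 89)]
predicted_points = [pythagorean_points(gf, ga) for gf, ga, _ in table_rows]
```

<p>Passing a range of candidate values to <code>best_exponent</code> is one simple way to search for the exponent that gives the lowest RMSE.</p> <p>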
In other words, instead of raising goals scored and conceded to the power of two we use different values. Figure 2 shows what happens to the RMSE as the exponent is changed from 0.1–3. Looking at the chart, the RMSE is lowest using an exponent of 1.35, giving an average error of 5.75, nearly three points lower than before.</p> <p><img alt="Pelican" src="../../../../images/121126_pthagorean_rmse.png" /></p> <p><strong>Figure 2: Effect of Altering Exponent on RMSE</strong></p> <p>The next logical step to improve the prediction further is to try using a different exponent for each part of the equation. This makes the formula harder to optimize but by applying a technique called least squares to it we come up with optimal exponents of 1.39, 1.43 and 0.98. Unfortunately this has little effect on the RMSE though, reducing it just 0.1 to 5.65 points.</p> <p>So far the predictions are still nearly six points out but in part two of this article I will discuss why the error is high and show how to improve it further to increase the accuracy of the predictions.</p> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodMon, 26 Nov 2012 19:30:00 +0000tag:,2012-11-26:2012/11/26PythagoreanDisparity in European Football Leagues/2012/11/20<h2>Introduction</h2> <p>Having mentioned the effect disparity plays on determining the league champions in previous posts I thought it would be interesting to look at the actual levels of disparity currently present in football.</p> <h2>English Premier League</h2> <p>I started off looking at the English Premier League (EPL) over the past decade and plotted the points achieved each season as a Tukey Box-and-Whiskers plot (Figure 1). Looking at Figure 1, the spread of points across the league each season is broadly consistent. 
There have been a few years where individual teams have done particularly well, such as Chelsea in 2004–2005, or particularly badly, such as Derby County in 2007–2008, but there are no obvious changes over time.</p> <p><img alt="Pelican" src="../../../../images/121120_epl_disparity_box_whiskers.png" /></p> <p><strong>Figure 1: Points Scored in EPL Per Season</strong></p> <p>One noticeable feature, however, is that the median value for every season (the thick black line in the middle of each box) is lower than the overall average (plotted as the horizontal dotted line), suggesting the data is skewed. Looking at the 2010–2011 season as an example, half the teams scored less than 47 points while half scored 47 or more. In comparison, the average points scored that season was 51.5. This means that an average mid-table EPL team is closer to relegation than it is to winning the league. To put it into perspective, West Ham finished bottom that season scoring just 18.5 points less than the average while Manchester United won the league with 28.5 points more than the average.</p> <h2>Other European Leagues</h2> <p>A similar pattern can be seen across all the major leagues in Europe (Figure 2), where the median points achieved was also lower than the average. The median points for the Bundesliga and Eredivisie were furthest from the average but it is worth bearing in mind that these two leagues play fewer matches than the EPL, La Liga and Ligue 1 so this is perhaps to be expected.</p> <p><img alt="Pelican" src="../../../../images/121120_euro_disparity_box_whiskers.png" /></p> <p><strong>Figure 2: Points Scored in 2010–2011</strong></p> <p>La Liga and Ligue 1 both show two teams that are classified as statistical outliers. It is no surprise that the two outliers in La Liga are Real Madrid and Barcelona, who both finished more than twenty points ahead of the rest of the league. In the case of Ligue 1, the champions Lille and relegated Arles-Avignon are both classed as outliers. 
A major reason for this is how close the middle of Ligue 1 finished that season – Monaco were relegated with 44 points, only seven points less than Bordeaux, who finished in seventh place.</p> <p>Since leagues play different numbers of matches it is difficult to compare them directly so I also looked at the difference in points scored per match by the top team and the middle team, and the middle team and bottom team (Table 1). The results show that La Liga was the least competitive of the leagues, with the champions scoring 1.737 points more per match than the bottom team. The EPL came out as the most competitive league, with the lowest difference between the top and bottom teams. However, Ligue 1 appears the most balanced, with the smallest difference between the top and bottom of the league compared with the middle. Interestingly, the Eredivisie appears unbalanced in the opposite way to most other leagues, with the bottom team further away from mid-table than the champions are from mid-table.</p> <table class="table"> <tbody> <tr> <td>League</td> <td>Top/Middle</td> <td>Middle/Bottom</td> <td>Total</td> </tr> <tr> <td>EPL</td> <td>0.868</td> <td>0.368</td> <td>1.237</td> </tr> <tr> <td>Ligue 1</td> <td>0.711</td> <td>0.763</td> <td>1.474</td> </tr> <tr> <td>La Liga</td> <td>1.289</td> <td>0.447</td> <td>1.737</td> </tr> <tr> <td>Bundesliga</td> <td>0.912</td> <td>0.441</td> <td>1.353</td> </tr> <tr> <td>Eredivisie</td> <td>0.765</td> <td>0.941</td> <td>1.706</td> </tr> </tbody> </table> <p><strong>Table 1: Difference in Points Per Match Between Top, Middle and Bottom Teams</strong></p> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodTue, 20 Nov 2012 19:30:00 +0000tag:,2012-11-20:2012/11/20ChanceAnalysis of André Villas-Boas Vs Harry Redknapp/2012/11/18<h2>Introduction</h2> <p>Since taking over as manager of Tottenham Hotspur, André Villas-Boas has been trapped in former Spurs manager Harry Redknapp’s shadow. Every tactical decision or team selection Villas-Boas makes is seemingly compared with Redknapp’s previous achievements. 
And after Tottenham’s apparent slow start to the season, Villas-Boas has come under heavy criticism from the media whose narrative seems to be that Tottenham are performing poorly. But is this criticism fair and are Tottenham really performing any worse than last season under Harry Redknapp?</p> <p>Find out by reading the rest of this article <a href="http://www.bettingexpert.com/blog/andre-villas-boas-vs-harry-redknapp">here</a>.</p> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodSun, 18 Nov 2012 19:30:00 +0000tag:,2012-11-18:2012/11/18Effect of Season Length on Deciding the League Champion/2012/11/12<h2>Introduction</h2> <p>In my <a href="http://pena.lt/y/2012/11/08/how-often-does-the-best-team-win-the-league/">previous article</a> I looked at the interplay between luck and skill in determining the league champions. There is another parameter though that also interacts with luck and that is the structure of the league itself. How many times have you heard the same tired, old cliché from football managers about how luck evens itself out over a season? But does it? Is a football season really long enough for the effects of chance to be cancelled out?</p> <p>I used the same mathematical model as before to simulate 10,000 seasons of a league containing 20 teams. Skill levels were randomly assigned to each team from a normally distributed population with a mean of 0.5 and a standard deviation of 0.1. 
The length of the season was then altered to see how frequently the team with the highest skill level won the league depending on the number of matches played – teams either played each other once, twice, four times or eight times per season.</p> <table class="table"> <tbody> <tr> <td><b>Number of Teams</b></td> <td><b>Frequency Teams Meet</b></td> <td><b>Mean Win %</b></td> <td><b>Best Team Win %</b></td> </tr> <tr> <td>20</td> <td>1</td> <td>60.2</td> <td>32.11</td> </tr> <tr> <td>20</td> <td>2</td> <td>75.3</td> <td>45.9</td> </tr> <tr> <td>20</td> <td>4</td> <td>82.2</td> <td>48.5</td> </tr> <tr> <td>20</td> <td>8</td> <td>85.2</td> <td>50.8</td> </tr> </tbody> </table> <p><strong>Table One: Effect of Season Length on League Champions</strong></p> <h2>Results</h2> <p>The results in Table One show that as the length of the season increases the probability of the team with the highest skill rating winning the league increases too. The champions also win a greater percentage of their matches. Therefore, the more matches that are played the less influence chance seems to have in determining the overall league champion.</p> <p>The second row of Table One matches the structure of four of the major leagues in Europe – Premier League, Serie A, La Liga and Ligue 1 – which all contain 20 teams that play each other twice per season. The Eredivisie and Bundesliga only contain 18 teams though, so what effect does this have? Rerunning the mathematical model with 18 teams gives a lower frequency for the best team winning the league of 28.8%. This suggests that the smaller size of these two leagues makes them somewhat more competitive as there are fewer matches for luck to be evened out.</p> <p>The Scottish Premier League (SPL) is smaller again, containing just 12 teams. The structure of the league is unusual in Europe, with teams playing each other three times, either twice away and once at home or vice versa. 
The league then splits in half and teams play a further match against the remaining five teams in their half of the league. If we apply the mathematical model to this structure then we come out with a frequency of 19.3% for the best team winning the league. This means the SPL should be one of the most competitive leagues in Europe, yet it has only ever been won by two teams – Celtic and Rangers. This is likely due to the large disparity in talent between Glasgow’s two largest teams and the rest of the league cancelling out the effect of chance.</p> <p>Interestingly, with Rangers now relegated from the SPL for financial irregularities, the league is the closest it has ever been. It was thought that without Rangers present in the SPL, Celtic would go on to dominate a very one-sided division. Yet with Hibernian currently sitting top of the league, the reduction in disparity from the loss of Rangers may actually make it the most competitive and exciting year in the SPL’s history.</p> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodMon, 12 Nov 2012 19:30:00 +0000tag:,2012-11-12:2012/11/12ChanceHow Often Does The Best Team Win The League?/2012/11/08<h2>Introduction</h2> <p>How often does the best team win the league? Probably not as often as you think, as it is not just talent that is required for success; a decent amount of luck is needed too.</p> <h2>Methodology</h2> <p>To investigate how big a role luck plays compared with ability I created a mathematical simulation based on the English Premier League (EPL) containing 20 teams that play each other twice per season. Each team was randomly assigned a skill level drawn from a normally distributed population with a mean of 0.5 and a standard deviation linked to the spread of talent across the league so that the disparity between the top and bottom clubs could be controlled. 
The simulation was then run for 10,000 seasons at various disparity levels and the number of times the team with the highest skill level won the league was measured.</p> <table class="table"> <tbody> <tr> <td>Mean Skill Level</td> <td>Disparity</td> <td>Mean Win %</td> <td>Best Team Win %</td> </tr> <tr> <td>0.5</td> <td>0</td> <td>65.5</td> <td>0</td> </tr> <tr> <td>0.5</td> <td>0.02</td> <td>65.9</td> <td>10.3</td> </tr> <tr> <td>0.5</td> <td>0.04</td> <td>67.4</td> <td>22.7</td> </tr> <tr> <td>0.5</td> <td>0.06</td> <td>69</td> <td>32.4</td> </tr> <tr> <td>0.5</td> <td>0.08</td> <td>72.1</td> <td>41.2</td> </tr> <tr> <td>0.5</td> <td>0.1</td> <td>76.4</td> <td>46.4</td> </tr> </tbody> </table> <p><strong>Table 1: Effect of Disparity on League Champions</strong></p> <p>The first row in Table 1 shows what would happen if all teams in the league were identical. Each team has a skill level of 0.5, meaning that they would each be expected to win 50% of their matches and lose 50% (ignoring draws to keep the model simple). Due to random chance though some teams will win more than 50% and some will lose more than 50%. You can see that the average percentage of matches won by the league champions over 10,000 seasons was 65.5%, so in an evenly matched EPL you would just need to be lucky enough to win roughly an extra 15% of matches to be champions.</p> <p>As the disparity increases though, the influence of chance decreases and the best team goes on to win the league more often, and wins more of its matches in the process. 
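</p> <p>For readers who want to experiment, the kind of simulation described above can be sketched in a few lines of Python. This is not the model used for the tables: the article does not say how two skill levels are turned into a match result, so the sketch assumes the probability of team A beating team B is A’s skill divided by the sum of both skills, with no draws. The function name, its parameters and the random tie-break are all illustrative.</p>

```python
import random


def simulate_best_team_rate(n_teams=20, meetings=2, disparity=0.1, seasons=1000, seed=0):
    """Fraction of simulated seasons won by the team with the highest skill level."""
    rng = random.Random(seed)
    best_team_titles = 0
    for _ in range(seasons):
        # Skill levels drawn from a normal distribution with mean 0.5; the spread
        # (disparity) is the standard deviation. Clamped so every skill stays positive.
        skills = [max(0.01, rng.gauss(0.5, disparity)) for _ in range(n_teams)]
        wins = [0] * n_teams
        for a in range(n_teams):
            for b in range(a + 1, n_teams):
                # Assumed win probability (not taken from the article); draws are ignored.
                prob_a_wins = skills[a] / (skills[a] + skills[b])
                for _ in range(meetings):
                    if prob_a_wins > rng.random():
                        wins[a] += 1
                    else:
                        wins[b] += 1
        # The champion is the team with the most wins; ties are broken at random.
        champion = max(range(n_teams), key=lambda team: (wins[team], rng.random()))
        best_team = max(range(n_teams), key=lambda team: skills[team])
        if champion == best_team:
            best_team_titles += 1
    return best_team_titles / seasons


example_rate = simulate_best_team_rate(seasons=200)
```

<p>Changing <code>disparity</code> or <code>meetings</code> gives a feel for how the spread of talent and the length of the season affect how often the best team finishes top, although the exact percentages will differ from those in the tables because the match model here is only an assumption.</p> <p>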
Take a look at this season’s EPL and while it is possible that QPR could go on to fluke wins in all their remaining matches and go on to win the league it would take a colossal amount of luck compared to say the amount of luck Manchester City would need to finish ahead of Manchester United since their skill levels are closer.</p> <p>This leads to the question though of what is preferable, an evenly matched, competitive league in which luck is a major determining factor in winning or a league that is perhaps fairer as it has enough disparity that the best team is predominantly likely to win?</p> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodThu, 08 Nov 2012 19:30:00 +0000tag:,2012-11-08:2012/11/08ChanceThe Poisson Model So Far/2012/11/02<h2>Introduction</h2> <p>In my <a href="http://pena.lt/y/2012/10/29/using-poisson-to-predict-football-matches/">last article</a> I wrote about my experiences using the Poisson distribution to predict the outcome of football matches. The results so far have been rather disappointing so I thought I would have a look at where things were going wrong.</p> <h2>Probabilities</h2> <p>The first place I decided to look was at the probabilities generated for the matches predicted correctly compared with those predicted incorrectly. I suspected that maybe the model was struggling with matches between more evenly matched teams. For example, for last week’s match between Stoke and Sunderland the predicted outcome was a home win with a probability of 51%. This still leaves us with a 49% chance though that the game will finish with an away win or a draw instead making it potentially difficult to predict accurately.</p> <p>Overall, the average probability for games correctly predicted was 64% compared with 56% in the games where the prediction failed. At first look it would therefore appear that the model does struggle somewhat with games between more closely matched teams. 
However, when you look at the variability in the data it is not possible to discern between the two percentages (Figure 1). In fact comparing the data sets using analysis of variance (ANOVA) gives a p-value of 0.32 suggesting no statistical difference between the two percentages based on the current data.</p> <p><img alt="Pelican" src="../../../../images/121102_poissonprob.png" /></p> <p><strong>Figure 1: Average probabilities of matches correctly / incorrectly predicted by the Poisson model</strong></p> <p>Next I looked at which outcomes were being incorrectly predicted and a problem immediately became apparent. So far the model has predicted 50 matches of which 58% were predicted to be home wins, 34% as away wins and 8% as draws. Looking at what really happened though, of those 50 matches 42% were actually home wins, 30% away wins and 28% were draws (Figure 2). This suggests the model is under-predicting the likelihood of draws by quite a large margin and is actually predicting them as home wins.</p> <p><img alt="Pelican" src="../../../../images/121102_poissonproportions.png" /></p> <p><strong>Figure 2: Proportion of Match Outcomes - Poisson vs Actual</strong></p> <h2>Conclusions</h2> <p>A quick Google revealed two possible fixes. Karlis and Ntzoufras recommend replacing the independent Poisson with a bivariate Poisson to add an element of correlation between the home and away team’s scores. However, even with this they still needed to inflate the diagonal of the score matrix to try and improve the prediction of draws, suggesting that moving to the bivariate Poisson is not necessarily much of an improvement. An alternative proposal by Dixon and Coles was to stick with the two independent Poisson calculations but add in an additional parameter to modify the probabilities of 0-0, 1-1, 1-0 and 0-1 scores occurring.</p> <p>So where does this leave the current Poisson model? For me, it is time to move on to other ideas. 
The Poisson model is one of the most widely used models for predicting football outcomes so I will return to it in the future to try out the Karlis and Ntzoufras and Dixon and Coles adjustments, but I have a few other ideas to write about first.</p> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodFri, 02 Nov 2012 19:30:00 +0000tag:,2012-11-02:2012/11/02PoissonRUsing Poisson to Predict Football Matches/2012/10/29<h2>Introduction</h2> <p>The <a href="http://thepowerofgoals.blogspot.co.uk/">Power Of Goals</a> recently blogged about using the Poisson distribution to predict the outcome of football matches. I have been evaluating the predictive ability of the Poisson for the English Premier League (EPL) this season so I thought I would share my experiences too.</p> <p>For anyone who is unaware, the number of goals scored by each team in a football match roughly follows a Poisson distribution. As you can see in Figure 1, it is not exact though as the Poisson distribution underestimates the likelihood of no goals being scored and overestimates one, two and three goals being scored. By four goals and upwards the Poisson starts to underestimate again. The actual difference between the Poisson and what is observed in the EPL is reasonably small though so it just requires a small fudge factor to bring the two into line.</p> <p><img alt="Pelican" src="../../../../images/121029_poissonvsactual.png" /></p> <p><strong>Figure 1: Poisson Distribution vs Observed</strong></p> <p>To carry out the predictions I have written a script in <a href="http://www.r-project.org/">R</a> that scrapes the Premier League table directly from the BBC’s website. The script then calculates attack and defence coefficients for each team by comparing their goals scored and conceded with the overall EPL average home and away. The predicted number of goals scored in a particular match can then be calculated by scaling the EPL’s average goals by the two teams’ attack and defence coefficients. 
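</p> <p>A minimal sketch of this calculation is shown here in Python rather than the R used for the original script. The function names are illustrative, and multiplying the league average by one attack coefficient and one defence coefficient is an assumed form of the scaling. The means of 1.85 and 2.08 goals are inferred values that approximately reproduce the first entries of Table 1; they are not figures quoted in the article.</p>

```python
import math


def expected_goals(league_average, attack, defence):
    """Scale the league average by a team's attack and its opponent's defence coefficient."""
    return league_average * attack * defence


def poisson_probability(goals, mean):
    """Probability of scoring exactly `goals` when the expected number is `mean`."""
    return math.exp(-mean) * mean ** goals / math.factorial(goals)


def score_matrix(row_mean, column_mean, max_goals=8):
    """Probability of each score line, treating the two teams' goals as independent."""
    size = max_goals + 1
    return [
        [poisson_probability(r, row_mean) * poisson_probability(c, column_mean)
         for c in range(size)]
        for r in range(size)
    ]


def outcome_probabilities(matrix):
    """Sum the matrix into (row team wins, draw, column team wins)."""
    size = len(matrix)
    row_win = sum(matrix[r][c] for r in range(size) for c in range(r))
    draw = sum(matrix[i][i] for i in range(size))
    column_win = sum(matrix[r][c] for r in range(size) for c in range(r + 1, size))
    return row_win, draw, column_win


# Hypothetical coefficients, purely to show the scaling step.
example_mean = expected_goals(1.5, 1.2, 0.9)

# Inferred means (an assumption) that roughly match the first entries of Table 1.
matrix = score_matrix(1.85, 2.08)
row_win, draw, column_win = outcome_probabilities(matrix)
```

<p>The output of <code>expected_goals</code> is the mean number of goals a team is predicted to score in the match.</p> <p>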
This can then be mapped to the Poisson distribution to generate a probability matrix for each particular score line (Table 1). From this, the probabilities can be summed to find the odds that each match will end as a home win, draw or away win.</p> <table class="table"> <tbody> <tr> <td><strong>Goals</strong></td> <td><strong>0</strong></td> <td><strong>1</strong></td> <td><strong>2</strong></td> <td><strong>3</strong></td> <td><strong>4</strong></td> <td><strong>5</strong></td> <td><strong>6</strong></td> <td><strong>7</strong></td> <td><strong>8</strong></td> </tr> <tr> <td><strong>0</strong></td> <td>1.96</td> <td>4.08</td> <td>4.24</td> <td>2.94</td> <td>1.53</td> <td>0.64</td> <td>0.22</td> <td>0.07</td> <td>0.02</td> </tr> <tr> <td><strong>1</strong></td> <td>3.63</td> <td>7.56</td> <td>7.86</td> <td>5.45</td> <td>2.83</td> <td>1.18</td> <td>0.41</td> <td>0.12</td> <td>0.03</td> </tr> <tr> <td><strong>2</strong></td> <td>3.36</td> <td>7.00</td> <td>7.27</td> <td>5.04</td> <td>2.62</td> <td>1.09</td> <td>0.38</td> <td>0.11</td> <td>0.03</td> </tr> <tr> <td><strong>3</strong></td> <td>2.08</td> <td>4.32</td> <td>4.49</td> <td>3.11</td> <td>1.62</td> <td>0.67</td> <td>0.23</td> <td>0.07</td> <td>0.02</td> </tr> <tr> <td><strong>4</strong></td> <td>0.96</td> <td>2.00</td> <td>2.08</td> <td>1.44</td> <td>0.75</td> <td>0.31</td> <td>0.11</td> <td>0.03</td> <td>0.01</td> </tr> <tr> <td><strong>5</strong></td> <td>0.36</td> <td>0.74</td> <td>0.77</td> <td>0.53</td> <td>0.28</td> <td>0.12</td> <td>0.04</td> <td>0.01</td> <td>0.00</td> </tr> <tr> <td><strong>6</strong></td> <td>0.11</td> <td>0.23</td> <td>0.24</td> <td>0.16</td> <td>0.09</td> <td>0.04</td> <td>0.01</td> <td>0.00</td> <td>0.00</td> </tr> <tr> <td><strong>7</strong></td> <td>0.03</td> <td>0.06</td> <td>0.06</td> <td>0.04</td> <td>0.02</td> <td>0.01</td> <td>0.00</td> <td>0.00</td> <td>0.00</td> </tr> <tr> <td><strong>8</strong></td> <td>0.01</td> <td>0.01</td> <td>0.01</td> <td>0.01</td> <td>0.01</td> 
<td>0.00</td> <td>0.00</td> <td>0.00</td> <td>0.00</td> </tr> </tbody> </table> <p><strong>Table 1: Example Goal Probabilities (%)</strong></p> <p>Since the predictions are based on past performance this season, I waited until week five of the EPL to start testing the model so I had at least a month’s worth of previous results to work with. The first week went well, with the model correctly predicting the outcome of six of the ten matches that weekend. Table 2 shows the predicted outcome of each match and its probability (%). From these probabilities I also calculated decimal odds (the reciprocal of the probability) and compared them with those available from Betfair.</p> <table class="table"> <tbody> <tr> <td><strong>Home</strong></td> <td><strong>Away</strong></td> <td><strong>Prediction</strong></td> <td><strong>Probability (%)</strong></td> <td><strong>Odds</strong></td> <td><strong>Betfair</strong></td> <td><strong>Result</strong></td> </tr> <tr> <td>Swansea</td> <td>Everton</td> <td>HOME</td> <td>56.3</td> <td>1.78</td> <td>3.35</td> <td>AWAY</td> </tr> <tr> <td>Chelsea</td> <td>Stoke City</td> <td>HOME</td> <td>63.4</td> <td>1.58</td> <td>1.39</td> <td>HOME</td> </tr> <tr> <td>Southampton</td> <td>Aston Villa</td> <td>AWAY</td> <td>49.2</td> <td>2.03</td> <td>3.1</td> <td>HOME</td> </tr> <tr> <td>West Brom</td> <td>Reading</td> <td>HOME</td> <td>41.1</td> <td>2.43</td> <td>1.82</td> <td>HOME</td> </tr> <tr> <td>West Ham</td> <td>Sunderland</td> <td>HOME</td> <td>35.7</td> <td>2.80</td> <td>2.24</td> <td>DRAW</td> </tr> <tr> <td>Wigan</td> <td>Fulham</td> <td>AWAY</td> <td>40.1</td> <td>2.49</td> <td>3.25</td> <td>AWAY</td> </tr> <tr> <td>Liverpool</td> <td>Man 
Utd</td> <td>AWAY</td> <td>75.6</td> <td>1.32</td> <td>2.82</td> <td>AWAY</td> </tr> <tr> <td>Newcastle</td> <td>Norwich</td> <td>HOME</td> <td>82.9</td> <td>1.21</td> <td>1.84</td> <td>HOME</td> </tr> <tr> <td>Man City</td> <td>Arsenal</td> <td>AWAY</td> <td>37.1</td> <td>2.70</td> <td>1.78</td> <td>DRAW</td> </tr> <tr> <td>Tottenham</td> <td>QPR</td> <td>HOME</td> <td>41.1</td> <td>2.43</td> <td>1.51</td> <td>HOME</td> </tr> </tbody> </table> <p><strong>Table 2: EPL Week 5 Predictions</strong></p> <p>Since then, the Poisson model has correctly predicted between 30% and 60% of matches each week (Figure 2). So far, the average accuracy is 46%, which is slightly higher than the 33% we could expect from randomly guessing each result.</p> <p><img alt="Pelican" src="../../../../images/121029_poissoncorrect.png" /></p> <p><strong>Figure 2: Weekly Performance of Poisson Predictive Model</strong></p> <p>I am hopeful the model’s success rate will improve over the course of the season as it gets more data to work with. There are also further improvements that can be made. For example, the model currently considers the goals scored by each team to be independent events. However, it may be that the two should be correlated, as it seems intuitive that the more goals one team scores the less likely the opposition is to score. At the moment though I wouldn’t place too much faith in the Poisson model.</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>richard vadoret - August 13, 2013</strong></p> <p>Hi Martin ,</p> <p>your graph in figure 1 show that your poisson distribution can predict Under/Over 2.5 final soccer score with good accuracy. 
isn’t it ?</p> <p>thanks</p> <p>richard</p> <div class="hline"></div> <p><strong>Adarsh - October 11, 2014</strong></p> <p>Try poisson distribution on France Ligue 2… quite accurate…</p> <div class="hline"></div> <p><strong>Martin Eastwood - October 15, 2014</strong></p> <p>Cool, will take a look!</p>Martin EastwoodMon, 29 Oct 2012 19:30:00 +0000tag:,2012-10-29:2012/10/29PoissonRInfluence Of Clean Sheets/2012/10/26<h2>Introduction</h2> <p>To make sense of the statistics available for football we need to understand their context, so I am planning to start off simple by looking at baselines for various events and statistics while I build up the information required to start a mathematical model.</p> <h2>Clean Sheets</h2> <p>While most football analytics seems to focus heavily on goals, I am going to start off with defending and the all-important clean sheet. Clean sheets have been fairly consistent throughout the English Premier League’s (EPL) history, occurring in around 27% of matches between 1993 and 2011 (Figure 1). The data shows some variability around the mean with perhaps the slightest hint of an upwards trend, but in general the total number of clean sheets per season has remained broadly constant.</p> <p><img alt="Pelican" src="../../../../images/121026_cleansheets.png" /></p> <p><strong>Figure 1: Total English Premier League Clean Sheets</strong></p> <h2>Home And Away</h2> <p>If we split the data by home and away then we can immediately see a significant difference (Figure 2; <em>p&lt;0.001</em>). On average, the home team will keep a clean sheet 33% of the time while the away team will only manage it in 22% of their matches. Interestingly, both sets of data appear to follow broadly similar patterns with peaks and troughs occurring in the same years. 
I hope to explore this in more detail in the future.</p> <p><img alt="Pelican" src="../../../../images/121026_homeawaycleansheets.png" /></p> <p><strong>Figure 2: Clean Sheets Home and Away</strong></p> <p>Clean sheets are valuable commodities as they guarantee you a minimum of one point. As the cliche goes, if you keep a clean sheet you cannot lose. Looking back over the EPL’s history shows that a clean sheet at home is actually worth 2.1 points on average. This means that over the course of a season obtaining a clean sheet in 33% of home matches would be expected to generate 13.2 points (0.33 × 19 home matches × 2.1). Away from home a clean sheet is of lower value, generating just 1.8 points on average. Over the course of a season, clean sheets in 22% of away matches would therefore bring in an additional 7.5 points (0.22 × 19 away matches × 1.8).</p> <h2>The English Premier League</h2> <p>We can use these baselines to examine how teams are performing in terms of clean sheets home and away. Figure 3 shows the proportion of matches in which each team in the EPL obtained clean sheets for the 2011-2012 season. The teams in the upper right quadrant all achieved an above average number of clean sheets both home and away. In comparison, West Brom’s defence performed very well at home yet they struggled to obtain clean sheets away from the Hawthorns. Liverpool were the opposite of West Brom, keeping clean sheets away from Anfield but struggling at home. Bolton, Blackburn and Wolves all generated very low numbers of clean sheets home and away and were all relegated from the EPL. 
Norwich are an interesting exception as they possessed the worst away record for clean sheets yet managed to finish in a respectable 12th position last year.</p> <p><img alt="Pelican" src="../../../../images/121026_homeawaycomparison.png" /></p> <p><strong>Figure 3: Proportion of Matches With Clean Sheets Home and Away</strong></p> <h2>League Position</h2> <p>If we carry out a linear regression on 2011-2012’s data (Figure 4) we can see the correlation between the number of clean sheets a team kept over the season and their final league position. The r<sup>2</sup> value of 0.72 for the regression shows that the two are strongly correlated with each other, so any team not keeping clean sheets could be expected to finish lower down the league table. This does not bode well for current champions Manchester City, who have conceded goals in seven of their eight EPL matches this season.</p> <p><img alt="Pelican" src="../../../../images/121026_cleansheetregression.png" /></p> <p><strong>Figure 4: Correlation of Final League Position to Number of Clean Sheets 2011/2012</strong></p> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodFri, 26 Oct 2012 19:30:00 +0100tag:,2012-10-26:2012/10/26EPLFootball + Mathematics/2012/10/22<p>There are plenty of analytical football blogs already out there on the internet so I thought long and hard about whether to bother with pena.lt/y. I didn’t want to add to the high level of mundane background noise that already pervades the web and I am not out to compete with anyone to have the biggest, greatest or most popular blog in history.</p> <p>Instead, pena.lt/y is somewhere for me to jot down thoughts and ideas that interest me before I forget them. I am a data scientist and spend a large portion of my time analysing data to try and work out what is happening and why. 
Because of this, I have a fascination with numbers and data and trying to glean as much knowledge as possible from them.</p> <p>This is a hugely exciting time for football analytics. There has never been as much data available as there is now and no one truly knows what it really means, how it all links together or what to do with it. The first step in understanding it all is to set out what we want to achieve. Is our aim to predict the outcome of matches in advance, optimize tactics, identify players’ strengths, analyse the opposition and develop strategies to beat them, spot players at risk of injury, and so on?</p> <p>Understanding football is also an interesting intellectual challenge. While some sports, such as baseball, lend themselves well to statistical analysis, football appears to be inherently more complicated. With 22 people running distances of up to 10 kilometres each and making hundreds of passes just to score a single goal, it is difficult to identify the key data that explains the final outcome. How many passes does it take to score a goal? How many sideways passes have the same value as a forward pass? Does having a high possession percentage improve your chances of winning?</p> <p>I am also interested in the influence of randomness, or luck, on football. The difference between a nil-nil draw and winning one-nil can be down to a scuffed shot trickling across the line so instinctively it feels like luck has a large role. Yet when you look at international competitions, such as the World Cup, where you would expect luck to have a larger effect, it seems to be the same handful of teams winning. Does skill really outweigh luck and if so by how much?</p> <p>There is a huge amount of potential for statistics and analysis to influence the sport; we just need to untangle the data. 
I don’t expect to unearth some magical formula that will answer all our questions but hopefully this blog will satisfy my curiosity and maybe provide some interesting insights for other people into football.</p> <h2>Comments</h2> <p><div class="hline"></div></p>Martin EastwoodMon, 22 Oct 2012 19:30:00 +0100tag:,2012-10-22:2012/10/22It's Fergie Time/2012/10/19<h2>Introduction</h2> <p>After Manchester United’s recent defeat to Tottenham, Sir Alex Ferguson was once again furious about the amount of injury time played. He even went as far as to claim the four minutes Chris Foy added was an ‘insult’.</p> <blockquote>They gave us four minutes, that’s an insult to the game</blockquote> <p>However, while it is a common theme in the United manager’s post-match interviews that Manchester United do not receive enough added time, many other football fans think the opposite. In many people’s opinion Sir Alex Ferguson’s constant complaints pressure referees into awarding United excessive stoppage time.</p> <p>I wanted to see which was correct so I started off by looking at the amount of injury time added to all the Premier League matches played this season and calculated the average for each team. This is total injury time so accounts for time added to both the first and second halves of the match.</p> <p><img alt="Pelican" src="../../../../images/20121012_injurytime.png" /></p> <p><strong>Figure One: Average stoppage time in seconds added so far in the Premier League</strong></p> <p>As you can see above, Manchester United are towards the top of the list but there are still three teams that on average receive more injury time in their matches. Fans of West Ham perhaps get the best value for money for their match tickets as they are top of the list with just under 500 seconds added to each match.</p> <p>Somewhat surprisingly Manchester City and Arsenal are both near the bottom, with 335 and 366 seconds, respectively. 
This suggests that so far in the season, bigger clubs are not necessarily receiving any bias based on their status.</p> <p>As we have only had seven weeks of the Premier League played so far we cannot draw too many firm conclusions from this relatively small sample set. Currently though there is no statistical difference (<em>p &gt; 0.05</em>) between the amount of stoppage time Manchester United have received compared with any other team in the Premier League. Maybe this is why Sir Alex Ferguson is getting so worked up…</p> <h2>Comments</h2> <p><div class="hline"></div></p> <p><strong>Jonas - December 12, 2012</strong></p> <p>More Or Less on BBC radio 4 did a piece on this a couple of weeks ago. They looked closer at differences when Manchester is under or draw, and also compared them to the other teams i EPL. If I recall correctly their conclusion was that better teams get more injurytime, and that there was some difference wether they played at home or away.</p> <p>http://www.bbc.co.uk/programmes/b01ny0fc</p> <p><strong>Martin Eastwood - December 12, 2014</strong></p> <p>Interesting, thanks for the link Jonas!</p>Martin EastwoodFri, 19 Oct 2012 19:30:00 +0100tag:,2012-10-19:2012/10/19EPLHello World/2012/10/18<p>Hello!!</p> <p>Hopefully this blog will be live soon once I have worked out how to use it…</p>Martin EastwoodThu, 18 Oct 2012 19:30:00 +0100tag:,2012-10-18:2012/10/18