Expected Goals For All
It seems that everybody has their own expected goals model for football nowadays, but they all seem to be top secret and all appear to give different results. So I thought I’d post a quick example of one technique here to try to stimulate a bit of chat about the best way to model them.
The Data
Over the past few weeks I have tediously collected several thousand xy co-ordinates for shot locations from Squawka and converted them into approximate distances from goal in metres, assuming that an average football pitch is 100m x 65m.
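As a rough illustration of the conversion step, here is a minimal Python sketch. It assumes (hypothetically, since the coordinate system is not described here) that each shot's x and y are percentages of pitch length and width, with the attacked goal centred at (100, 50). The function name and constants are mine, and the 100m x 65m pitch is the one assumed above.

```python
import math

PITCH_LENGTH_M = 100.0  # pitch size assumed in the post
PITCH_WIDTH_M = 65.0


def shot_distance_m(x_pct, y_pct):
    """Straight-line distance in metres from a shot to the centre of the goal.

    Assumption (hypothetical): x_pct and y_pct are percentages of pitch
    length and width, and the attacked goal is centred at (100, 50).
    """
    dx = (100.0 - x_pct) / 100.0 * PITCH_LENGTH_M  # metres from the goal line
    dy = (50.0 - y_pct) / 100.0 * PITCH_WIDTH_M    # metres off the centre line
    return math.hypot(dx, dy)


# A central shot taken 12% of the pitch length out from goal
print(round(shot_distance_m(88.0, 50.0), 2))
```

If you want distance from the goal line rather than from the centre of the goal, use dx on its own. Rescaling to a different pitch size only means changing the two constants.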
Goals Versus Distance
Figure 1 below shows the relationship between the probability of scoring a goal and how far from the goal line the shot is taken.
There seems to be a little bit of noise in the data, particularly around the 12-13m mark, but overall I was pleasantly surprised by how neat the data looks: there is a pretty clear non-linear relationship between the likelihood of scoring and how far from goal the shot is taken.
So how do we model this relationship? Obviously we cannot just stick a linear regression through the graph, as the relationship is clearly not linear, so one possibility is to use a polynomial instead of a straight line (Figure 2).
Unfortunately, this does not give particularly good results: low-order polynomials (the orange line) do not fit tightly enough to the non-linearity in the relationship, while higher-order polynomials (the red line) start to fit to the noise in the data, leading to problems with over-fitting.
So what do we do now? Well, looking closer, the curve drops steeply at short range and then flattens out, so one option is to fit a power function to it. We can do this pretty easily by taking the log (base 10) of both the distances and the scoring probabilities, fitting a linear regression to the logged values and plotting the resulting curve against our non-logged data (Figure 3).
This gives an extremely good fit with the data and seems a plausible choice. Goal scoring is commonly modelled as Poisson distributed, and the Poisson and exponential distributions are inherently linked (the exponential distribution describes the time between individual events in a Poisson process), so a rapidly decaying curve feels like a natural shape to try. Strictly speaking, that link concerns waiting times rather than shot distance, and a power function is not the same thing as an exponential, so treat this as intuition rather than justification.
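The log-log fitting trick can be sketched in a few lines of Python with NumPy. The data below is not real shot data: it is generated from a known power curve with illustrative parameters (slope -1, intercept 0) purely so that you can see the regression recover them. All variable names are my own.

```python
import numpy as np

# Hypothetical stand-in data, generated from a known power curve so the
# sketch checks itself. Real input would be the observed scoring rate
# at each distance.
distance_m = np.arange(2.0, 36.0, 2.0)
true_slope, true_intercept = -1.0, 0.0  # illustrative values only
p_goal = 10 ** true_intercept * distance_m ** true_slope

# Straight-line regression in log-log space:
#   log10(p) = slope * log10(distance) + intercept
slope, intercept = np.polyfit(np.log10(distance_m), np.log10(p_goal), deg=1)


def power_curve(d):
    """Back-transform the fitted line into a curve on the original scale."""
    return 10 ** intercept * d ** slope


print(slope, intercept)
```

One thing to keep in mind is that a regression on logged values minimises error on the log scale, so it weights the long-range, low-probability points more heavily than a direct fit on the raw probabilities would.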
If we calculate the r squared value for the fit of the power curve then we get a value of 0.84, meaning 84% of the variance in the scoring probabilities shown in Figure 1 can be attributed to how far away the player taking the shot is from the goal. This is pretty impressive as it leaves just 16% attributed to other reasons, such as the angle of the shot, goalkeeper positioning, defensive pressure, the shooting player’s talent etc.
Before you ask, over the coming weeks I’ll be looking at whether adding these additional factors to the model can improve it, or whether the added complexity is not worth it just to chase the remaining 16%.
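For anyone wanting to check an r squared value themselves, here is a minimal sketch of the usual definition (one minus the residual sum of squares over the total sum of squares). The numbers are made up for illustration and are not the data behind the 0.84. Whether that figure was computed on the logged or the non-logged values is not stated, so this sketch simply assumes whatever scale you pass in.

```python
import numpy as np


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Made-up scoring rates and model predictions, for illustration only
observed = [0.40, 0.22, 0.12, 0.09, 0.05]
predicted = [0.38, 0.20, 0.14, 0.08, 0.06]
print(r_squared(observed, predicted))
```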
Using the Expected Goals Model
But how do we use the model? Although everybody else’s models seem to be top secret, I’m going to give mine away. The coefficient for the regression is -1.036884 and the intercept is 0.05950286.
To put this into action, all you need to do is raise the distance from the goal in metres to the power of the coefficient and multiply by 10 to the power of the intercept (the regression was fitted on base-10 logs). For example, a shot from 8 metres gives:
8^-1.036884 * 10^0.05950286 = 0.132771 expected goals
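The same calculation as a small Python function, using the coefficient and intercept given above. The function and constant names are my own, and the guard against non-positive distances is my addition, since the power curve is undefined at zero.

```python
COEFFICIENT = -1.036884   # regression coefficient from the post
INTERCEPT = 0.05950286    # regression intercept from the post (base-10 logs)


def expected_goals(distance_m):
    """Expected goals for a single shot taken distance_m metres from goal."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    return distance_m ** COEFFICIENT * 10 ** INTERCEPT


# The worked example: a shot from 8 metres
print(f"{expected_goals(8.0):.4f}")  # roughly 0.1328, in line with the figure above
```

To get a team's expected goals, sum expected_goals(d) over the distances of all its shots.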
So how about we give it a proper test and try it out on this season’s English Premier League to date? The results are shown in Table 1 and overall give a root mean square error of 8.2 goals, which seems a pretty reasonable starting point for developing the model further.
# | Team | Goals | expG | Residual |
---|---|---|---|---|
1 | Man City | 68.00 | 46.90 | 21.10 |
2 | Liverpool | 63.00 | 42.56 | 20.44 |
3 | Arsenal | 48.00 | 35.26 | 12.74 |
4 | Chelsea | 48.00 | 42.56 | 5.44 |
5 | Man Utd | 41.00 | 35.26 | 5.74 |
6 | Southampton | 37.00 | 29.05 | 7.95 |
7 | Everton | 37.00 | 33.31 | 3.69 |
8 | Newcastle | 32.00 | 30.32 | 1.68 |
9 | Swansea | 32.00 | 26.89 | 5.11 |
10 | Tottenham | 32.00 | 31.21 | 0.79 |
11 | WBA | 30.00 | 31.17 | -1.17 |
12 | West Ham | 28.00 | 26.69 | 1.31 |
13 | Aston Villa | 27.00 | 24.20 | 2.80 |
14 | Stoke | 26.00 | 25.29 | 0.71 |
15 | Sunderland | 25.00 | 25.63 | -0.63 |
16 | Hull | 25.00 | 23.95 | 1.05 |
17 | Fulham | 24.00 | 24.39 | -0.39 |
18 | Cardiff | 19.00 | 24.67 | -5.67 |
19 | Norwich | 19.00 | 27.61 | -8.61 |
20 | Crystal Palace | 18.00 | 25.03 | -7.03 |
Table 1: Expected Goals For The English Premier League To Date
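If you want to reproduce the error figure, the sketch below recomputes the residuals and the root mean square error from the Goals and expG columns of Table 1. Because the table values are rounded to two decimal places, expect the result to land close to the 8.2 quoted above rather than match it to the last digit.

```python
import math

# Goals and expG columns from Table 1, in table order (Man City first)
goals = [68, 63, 48, 48, 41, 37, 37, 32, 32, 32,
         30, 28, 27, 26, 25, 25, 24, 19, 19, 18]
exp_g = [46.90, 42.56, 35.26, 42.56, 35.26, 29.05, 33.31, 30.32, 26.89, 31.21,
         31.17, 26.69, 24.20, 25.29, 25.63, 23.95, 24.39, 24.67, 27.61, 25.03]

# Residual = actual goals minus expected goals, as in the table
residuals = [g - x for g, x in zip(goals, exp_g)]

# Root mean square error across the 20 teams
rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
print(rmse)
```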
You can also see a pretty clear pattern: the teams at the top of the league have generally over-performed their goal expectancy while those towards the bottom end have under-performed it. This would seem reasonable, as we are predicting average goal expectancy. The top teams are obviously above average so should perhaps do better with their chances, while the lower teams are below average so would be expected to do worse.
What Next?
I’m not claiming this to be the only way of calculating expected goals, or even the best way but hopefully it will encourage more discussion of how to calculate expected goals rather than a lot of secret black boxes all giving different results.
I hope to write more about expected goals over the coming weeks in order to test this equation to see how well it really works, to hopefully improve it further and to try and understand what the metric can and cannot tell us.
In the meantime, feel free to use my equation to calculate expected goals. All I ask is that you don’t try to pass the equation off as your own (you know who you are!!) and that, if you use it, you acknowledge me and link back to my site.
Be warned, though: it’s a work in progress, so it is subject to change as and when I improve things…
Enjoy!
Hey, I’m a chemical engineer, so while my knowledge of statistics is limited to its application in my field, I would really like to get into football analytics, and I’d love to contribute, if possible, to your expected goals model. Could you tell me where the best publicly available data is, and, more importantly, the best way to access it? Thank you, Chris
Hi Chris,
I took all the data for constructing the model from Squawka. It’s a tedious process but with a bit of patience you can transcribe approximate xy coordinates for shots from their site.
Would love to know the process behind transcribing the shots into x,y coordinates if you have the time..
Interesting model.
I think the average pitch in the Premier League is quite a bit bigger than 100m x 65m.
Most of the pitches in the Premier League are 105m x 68m: http://www.openplay.co.uk/blog/premiership-football-pitch-sizes-2013-2014/
Just for my curiosity, do you rely on all the data from Squawka?
Thanks for the link, not seen that one before. Should be fairly trivial to rescale should people wish to.
Yes I used Squawka to get all the xy coordinates.
Hi, nice work.
I find this conclusion very interesting “This is pretty impressive as it leaves just 16% attributed to other reasons”. Maybe this is the reason why the better teams outperform the prediction.
I just do not understand how this helps to predict the expected goals in a particular future game.
Yes, presumably somewhere in that 16% is player talent, defensive pressure, goalkeeper skill etc. So far my equation is more explanatory than predictive. More work would be needed to produce a predictive model from shots.
I’m sure the spike in probability at the 12m mark is the result of penalties. Analyzing only goals from open play would increase the R^2 I would suspect.
Yes, when I next get some free time I’ll look at removing them, although the fit of the curve is so good it will probably have minimal effect.
Do you know whether the decay curve exponent is replicable across seasons?
Also do you know how much difference angle of the shot makes and if other “black box”-guys additionally use this? The fit by averaging across angles is good, but wide players may have a detriment when cast against a benchmark without it – or maybe the difference is small.
Well done for compiling the data from squawka – when I had a quick search the only way to get x-y’s seemed to be by eye and a ruler from their pitch plots!!
I’ve not looked season-by-season yet but the decay curve is created from multiple seasons’ worth of data aggregated together.
The angle presumably makes some difference but considering how high the r-squared is it must only account for a relatively small proportion of expected goals although I will look into this in more detail as soon as I get some free time.
Thanks!
Hey, just thought I’d cryptically point out there’s a better function to fit to the data than that.
Ooh, you can’t just leave it at that, tell me more
Well if I’m not mistaken, the probability of scoring from 0 yards is, according to your model, more than 100%. Not even Ronaldo will score more than 100% of the time, not even from the goal line
Why don’t you try a decaying exponential instead? That has all the properties you’re looking for, but it will also be well behaved at x=0.
Yes, the 0 yards issue is a concern. I thought about forcing the curve through 1 but dislike taking that sort of brute force approach. The decaying exponential sounds like a really good idea. I’ll look into that, thanks!
I.e.: http://fooplot.com/#W3sidHlwZSI6MCwiZXEiOiIxL3giLCJjb2xvciI6IiMwNEZGMDAifSx7InR5cGUiOjAsImVxIjoiZV4oLXgpIiwiY29sb3IiOiIjRkYwMDAwIn0seyJ0eXBlIjoxMDAwfV0-
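To make the point above concrete, here is a rough Python sketch that compares the published power curve with a decaying exponential of the form a * exp(-b * distance). It shows the power curve going above 100% at very short range while the exponential stays finite at zero. The points used for the exponential fit are sampled from the published curve as a stand-in, not taken from real shot data, and fitting by regressing ln(p) on distance is my assumption about one simple way to do it.

```python
import numpy as np

COEFFICIENT, INTERCEPT = -1.036884, 0.05950286  # values published in the post


def power_curve(d):
    """The post's power-function model."""
    return d ** COEFFICIENT * 10 ** INTERCEPT


# Stand-in points sampled from the published curve (not real shot data)
distance_m = np.arange(2.0, 36.0, 2.0)
p_goal = power_curve(distance_m)

# Decaying exponential p = a * exp(-b * d), fitted as a straight line:
#   ln(p) = ln(a) - b * d
neg_b, ln_a = np.polyfit(distance_m, np.log(p_goal), deg=1)
a, b = np.exp(ln_a), -neg_b


def exp_curve(d):
    """Decaying exponential alternative; finite at d = 0, where it equals a."""
    return a * np.exp(-b * d)


print(power_curve(1.0))  # above 1, i.e. more than a 100% chance of scoring
print(exp_curve(0.0))    # finite value at the goal line
```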
Hey Martin,
I had a very noobish doubt. Please don’t mind it. Can you explain how you calculated the probability of scoring (Y axis) to arrive at the scatter plot in Figure 1?
Thanks in advance.
Poisson distribution of course. My bad!
If you could run me through the process or maybe send me some useful links, it would be greatly appreciated!
I split the shots into different bins by distance and then calculated the probability per bin. I then used this set of aggregated probabilities to construct the model and fit the curve.
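A minimal sketch of that binning step in Python with NumPy, for anyone who wants to try it. The bin width of 2 metres, the 40 metre cut-off, the function name and the handful of example shots are all assumptions of mine for illustration, not the values or data actually used.

```python
import numpy as np


def scoring_rate_by_bin(distances_m, is_goal, bin_width_m=2.0, max_distance_m=40.0):
    """Group shots into distance bins and return (bin centres, goals / shots).

    bin_width_m and max_distance_m are illustrative defaults. Bins that
    contain no shots are dropped.
    """
    distances_m = np.asarray(distances_m, dtype=float)
    is_goal = np.asarray(is_goal, dtype=float)
    edges = np.arange(0.0, max_distance_m + bin_width_m, bin_width_m)
    shots, _ = np.histogram(distances_m, bins=edges)
    goals, _ = np.histogram(distances_m, bins=edges, weights=is_goal)
    centres = (edges[:-1] + edges[1:]) / 2.0
    has_shots = shots > 0
    return centres[has_shots], goals[has_shots] / shots[has_shots]


# Hypothetical shots: distance in metres, and 1 if the shot was scored
distances = [3.0, 3.5, 9.0, 9.5, 9.9, 21.0]
scored = [1, 0, 1, 0, 0, 0]
centres, rates = scoring_rate_by_bin(distances, scored)
print(centres, rates)
```

The resulting bin centres and scoring rates are what you would then feed into the curve fit.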
Thanks a lot!
I need to brush up on my stats concepts.
How are you gathering xy coordinates from Squawka? Or are you using some other metric to approximate?
It took quite a bit of effort!