It seems that everybody has their own expected goals model for football nowadays, but they all seem to be top secret and all appear to give different results, so I thought I’d post a quick example of one technique here to try and stimulate a bit of chat about the best way to model them.
Over the past few weeks I have tediously collected several thousand xy co-ordinates for shot locations from Squawka and converted them into approximate distances from goal in metres, assuming that an average football pitch is 100m x 65m.
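As a rough sketch of that conversion in Python: the function below assumes, purely for illustration, that each shot arrives as percentage coordinates (0 to 100 along the pitch, 0 to 100 across it) with the target goal centred at (100, 50). The coordinate system, the name `shot_distance_m` and the choice of straight-line distance to the centre of the goal are all assumptions rather than a description of Squawka's data or of the exact method used for this post.

```python
import math


def shot_distance_m(x_pct, y_pct, pitch_length_m=100.0, pitch_width_m=65.0):
    """Approximate distance in metres from a shot to the centre of the goal.

    Assumes (hypothetically) percentage coordinates with the goal at (100, 50).
    """
    # Metres between the shot and the goal line, measured along the pitch.
    dx = (100.0 - x_pct) / 100.0 * pitch_length_m
    # Metres between the shot and the centre line of the goal, across the pitch.
    dy = (y_pct - 50.0) / 100.0 * pitch_width_m
    return math.hypot(dx, dy)


# A shot 12% of the pitch length from the goal line, dead centre.
print(shot_distance_m(88, 50))
```

The pitch dimensions are keyword arguments, so the same shots can be rescaled to a different pitch size without touching the rest of the calculation.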
Figure 1 below shows the relationship between the probability of scoring a goal and how far away from the goal line the shot is taken.
Figure 1: Shots Versus Distance From Goal
There seems to be a little bit of noise in the data, particularly around the 12-13m mark, but overall I was pleasantly surprised by how neat the data looks – there seems to be a pretty clear non-linear relationship between the likelihood of scoring and how far away from the goal the shot is taken.
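One way to get from raw shots to points like those in Figure 1 is to split the shots into distance bins and take goals divided by shots in each bin, which is the approach described in the comments below. The sketch here is illustrative only: the 1m bin width, the function name and the handful of toy shots are assumptions, not the real data.

```python
from collections import defaultdict


def scoring_probability_by_bin(shots, bin_width_m=1.0):
    """Return {bin start in metres: goals / shots} for (distance_m, scored) pairs."""
    totals = defaultdict(lambda: [0, 0])  # bin start -> [goals, shots]
    for distance_m, scored in shots:
        bin_start = int(distance_m // bin_width_m) * bin_width_m
        totals[bin_start][0] += int(scored)
        totals[bin_start][1] += 1
    return {start: goals / n for start, (goals, n) in sorted(totals.items())}


# Toy shots, made up for the example: (distance in metres, was it a goal?)
toy_shots = [(5.2, True), (5.9, False), (11.0, True),
             (11.5, False), (11.7, False), (11.9, False)]
binned = scoring_probability_by_bin(toy_shots)
print(binned)
```

Narrow bins keep more detail but leave fewer shots in each one, so the choice of width is a trade-off between resolution and noise.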
So how do we model this relationship? Obviously we cannot just stick a linear regression through the graph, as the relationship is clearly not linear, so one possibility is to use a polynomial instead of a straight line (Figure 2).
Figure 2: Fitting a Polynomial
Unfortunately, this does not give particularly good results as low order polynomials (the orange line) do not fit tightly enough to the non-linearity in the relationship while higher-order polynomials (the red line) start to fit to the noise in the data leading to problems with over-fitting.
So what do we do now? Well, looking closer, the shape of the curve appears exponential, so one option is to fit a power function to it. We can do this pretty easily by taking the base-10 log of both the distance and the scoring probability, fitting a linear regression to the logged values and plotting the resulting curve against our non-logged data (Figure 3).
Figure 3: Power Curve
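The log-then-regress step can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind Figure 3: the data are synthetic (generated from the coefficients published further down, plus a little seeded noise), the name `fit_power_curve` is made up, and base-10 logs are assumed because the final equation multiplies by a power of 10.

```python
import math
import random


def fit_power_curve(distances, probabilities):
    """Least-squares line through (log10 d, log10 p); returns (coefficient, intercept)."""
    xs = [math.log10(d) for d in distances]
    ys = [math.log10(p) for p in probabilities]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    coefficient = sxy / sxx
    intercept = y_mean - coefficient * x_mean
    return coefficient, intercept


# Synthetic binned probabilities, NOT the Squawka data: a power curve with noise.
rng = random.Random(0)
distances = list(range(2, 36))
probabilities = [10 ** 0.05950286 * d ** -1.036884 * math.exp(rng.gauss(0, 0.05))
                 for d in distances]

coefficient, intercept = fit_power_curve(distances, probabilities)
print(coefficient, intercept)
```

One practical wrinkle: a bin with no goals has a probability of zero, whose log is undefined, so such bins would need to be merged or dropped before fitting this way.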
This gives an extremely good fit with the data and seems a plausible choice. We know goal scoring is Poisson distributed so it would seem natural to fit expected goals using an exponential shaped curve since Poisson and exponential distributions are inherently linked – the exponential distribution in fact describes the time taken between individual events occurring in a Poisson process.
If we calculate the r squared value for the fit of the Power curve then we get a value of 0.84, meaning 84% of the variance in goal scoring can be attributed to how far away the player taking the shot is from the goal. This is pretty impressive as it leaves just 16% attributed to other reasons, such as the angle of the shot, goalkeeper positioning, defensive pressure, the shooting player’s talent etc.
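For anyone wanting to reproduce this kind of figure on their own data, r squared is one minus the ratio of the residual sum of squares to the total sum of squares. The helper below is a generic sketch with made-up numbers; it does not settle whether the 0.84 above was measured on the logged or the non-logged probabilities, which would give different values.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_residual / ss_total


# Made-up binned probabilities and model predictions, for illustration only.
observed = [0.30, 0.18, 0.12, 0.07, 0.05]
predicted = [0.28, 0.19, 0.11, 0.08, 0.05]
print(r_squared(observed, predicted))
```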
Before you ask, over the coming weeks I’ll be looking at whether adding these additional factors into the model can improve it, or whether chasing the remaining 16% is not worth the added complexity.
But how do we use the model? Although everybody else’s models seem to be top secret I’m going to give mine away. The coefficient for the regression is $-1.036884$ and the intercept is $0.05950286$.
To put this into action all you need to do is raise the distance away from the goal in metres to the power of the coefficient and multiply by 10 to the power of the intercept. For example, a shot from 8 metres gives:
$8^{-1.036884} \times 10^{0.05950286} = 0.132771$ expected goals
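The same calculation as a small Python function, using the coefficient and intercept quoted above. The function names are illustrative, and summing per-shot values to get a team total is an assumption about how the equation would normally be applied.

```python
COEFFICIENT = -1.036884
INTERCEPT = 0.05950286


def expected_goals(distance_m):
    """Expected goals for one shot taken distance_m metres from goal."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    return distance_m ** COEFFICIENT * 10 ** INTERCEPT


def total_expected_goals(distances_m):
    """Sum of per-shot expected goals for a list of shot distances."""
    return sum(expected_goals(d) for d in distances_m)


print(round(expected_goals(8), 6))
print(total_expected_goals([8, 12, 20]))
```

Because the curve keeps rising as the distance shrinks, the value climbs above 1 at very short range (about 1.15 at 1m), which is the issue raised in the comments below.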
So how about we give it a proper test and try it out on this season’s English Premier League to date? The results are shown in Table 1 and overall give a root mean square error of 8.2 goals, which seems a pretty reasonable starting point from which to develop the model further.
Pos | Team | Goals | expG | Residual |
---|---|---|---|---|
1 | Man City | 68.00 | 46.90 | 21.10 |
2 | Liverpool | 63.00 | 42.56 | 20.44 |
3 | Arsenal | 48.00 | 35.26 | 12.74 |
4 | Chelsea | 48.00 | 42.56 | 5.44 |
5 | Man Utd | 41.00 | 35.26 | 5.74 |
6 | Southampton | 37.00 | 29.05 | 7.95 |
7 | Everton | 37.00 | 33.31 | 3.69 |
8 | Newcastle | 32.00 | 30.32 | 1.68 |
9 | Swansea | 32.00 | 26.89 | 5.11 |
10 | Tottenham | 32.00 | 31.21 | 0.79 |
11 | WBA | 30.00 | 31.17 | -1.17 |
12 | West Ham | 28.00 | 26.69 | 1.31 |
13 | Aston Villa | 27.00 | 24.20 | 2.80 |
14 | Stoke | 26.00 | 25.29 | 0.71 |
15 | Sunderland | 25.00 | 25.63 | -0.63 |
16 | Hull | 25.00 | 23.95 | 1.05 |
17 | Fulham | 24.00 | 24.39 | -0.39 |
18 | Cardiff | 19.00 | 24.67 | -5.67 |
19 | Norwich | 19.00 | 27.61 | -8.61 |
20 | Crystal Palace | 18.00 | 25.03 | -7.03 |
Table 1: Expected Goals For The English Premier League To Date
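For reference, the root mean square error can be recomputed from the rounded Goals and expG columns of Table 1. This is only a check on the published table, not the original calculation, and the variable names are illustrative.

```python
import math

# (team, goals, expG) copied from Table 1.
table_1 = [
    ("Man City", 68, 46.90), ("Liverpool", 63, 42.56), ("Arsenal", 48, 35.26),
    ("Chelsea", 48, 42.56), ("Man Utd", 41, 35.26), ("Southampton", 37, 29.05),
    ("Everton", 37, 33.31), ("Newcastle", 32, 30.32), ("Swansea", 32, 26.89),
    ("Tottenham", 32, 31.21), ("WBA", 30, 31.17), ("West Ham", 28, 26.69),
    ("Aston Villa", 27, 24.20), ("Stoke", 26, 25.29), ("Sunderland", 25, 25.63),
    ("Hull", 25, 23.95), ("Fulham", 24, 24.39), ("Cardiff", 19, 24.67),
    ("Norwich", 19, 27.61), ("Crystal Palace", 18, 25.03),
]

# Residual = actual goals minus expected goals, as in the table.
residuals = [goals - exp_g for _, goals, exp_g in table_1]
rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))
print(round(rmse, 2))
```

On these rounded values the result lands close to the 8.2 quoted above, with the two largest residuals (Man City and Liverpool) contributing well over half of the squared error.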
You can also see a pretty clear pattern in that the teams at the top of the league have generally over-performed their goal expectancy while those towards the bottom end have under-performed it. This would seem reasonable as we are predicting average goal expectancy and the top teams are obviously above average so should perhaps do better with their chances, while the lower teams are below average so would be expected to perform worse.
I’m not claiming this to be the only way of calculating expected goals, or even the best way but hopefully it will encourage more discussion of how to calculate expected goals rather than a lot of secret black boxes all giving different results.
I hope to write more about expected goals over the coming weeks in order to test this equation to see how well it really works, to hopefully improve it further and to try and understand what the metric can and cannot tell us.
In the meantime, feel free to use my equation to calculate expected goals. All I ask is that you don’t try and pass the equation off as your own (you know who you are!!) and that if you use it then please acknowledge me and link back to my site.
Be warned though it’s a work in progress so is subject to change as and when I improve things…
Enjoy!
Christopher Hoeger - February 13, 2014
Hey, I’m a chemical engineer, so while my knowledge of statistics is limited to its application in my field, I would really like to get into football analytics, and I’d love to contribute, if possible, to your expected goals model. Could you tell me where the best publicly available data is, and, more importantly, the best way to access it? Thank you, Chris
Martin Eastwood - February 13, 2014
Hi Chris,
I took all the data for constructing the model from Squawka. It’s a tedious process but with a bit of patience you can transcribe approximate xy coordinates for shots from their site.
Ali - March 19, 2014
Would love to know the process behind transcribing the shots into x,y coordinates if you have the time..
Claus Moeller - February 16, 2014
Interesting model.
I think the average Premier League pitch is way bigger than 100m x 65m.
Most of the pitches in the Premier League are 105m x 68m – http://www.openplay.co.uk/blog/premiership-football-pitch-sizes-2013-2014/
Just for my curiosity, do you rely on all the data from Squawka?
Martin Eastwood - February 16, 2014
Thanks for the link, not seen that one before. Should be fairly trivial to rescale should people wish to. Yes I used Squawka to get all the xy coordinates.
Hugo Varandas - February 19, 2014
Hi, nice work. I find this conclusion very interesting “This is pretty impressive as it leaves just 16% attributed to other reasons”. Maybe this is the reason why the better teams outperform the prediction.
I just do not understand how this helps to predict the expected goals in a particular future game.
Martin Eastwood - February 19, 2014
Yes, presumably somewhere in that 16% is player talent, defensive pressure, goalkeeper skill etc. So far my equation is more explanatory than predictive. More work would be needed to produce a predictive model from shots.
Justin - February 20, 2014
I’m sure the spike in probability at the 12m mark is the result of penalties. Analyzing only goals from open play would increase the R^2 I would suspect.
Martin Eastwood - February 20, 2014
Yes, when I next get some free time I’ll look at removing them, although the fit of the curve is so good it will probably have minimal effect.
Antony - March 15, 2014
Do you know whether the decay curve exponent is replicable across seasons?
Also do you know how much difference angle of the shot makes and if other “black box”-guys additionally use this? The fit by averaging across angles is good, but wide players may have a detriment when cast against a benchmark without it – or maybe the difference is small.
Well done for compiling the data from squawka – when I had a quick search the only way to get x-y’s seemed to be by eye and a ruler from their pitch plots!!
Martin Eastwood - March 17, 2014
I’ve not looked season-by-season yet but the decay curve is created from multiple seasons’ worth of data aggregated together.
The angle presumably makes some difference, but considering how high the r-squared is it must only account for a relatively small proportion of expected goals. I will look into this in more detail as soon as I get some free time.
Thanks!
Benjamin Lindqvist - April 18, 2014
Hey, just thought I’d cryptically point out there’s a better function to fit to the data than that.
Martin Eastwood - April 18, 2014
Ooh you can’t just leave it at that, tell me more :-)
Benjamin Lindqvist - April 18, 2014
Well if I’m not mistaken, the probability of scoring from 0 yards is, according to your model, more than 100%. Not even Ronaldo will score more than 100% of the time, not even from the goal line :D
Why don’t you try a decaying exponential instead? That has all the properties you’re looking for, but it will also be well behaved at x=0.
Martin Eastwood - April 18, 2014
Yes, the 0 yards issue is a concern. I thought about forcing the curve through 1 but dislike taking that sort of brute force approach. The decaying exponential sounds like a really good idea. I’ll look into that, thanks!
Benjamin Lindqvist - April 18, 2014
I.e.: http://fooplot.com/#W3sidHlwZSI6MCwiZXEiOiIxL3giLCJjb2xvciI6IiMwNEZGMDAifSx7InR5cGUiOjAsImVxIjoiZV4oLXgpIiwiY29sb3IiOiIjRkYwMDAwIn0seyJ0eXBlIjoxMDAwfV0-
Anonymous - April 22, 2014
Hey Martin,
I had a very noobish doubt. Please don’t mind it. Can you explain how you calculated the probability of scoring (Y axis) to arrive at the scatter plot in Figure 1?
Thanks in advance. :)
Anonymous - April 22, 2014
Poisson distribution of course. My bad!
Anonymous - April 22, 2014
If you could run me through the process or maybe send me some useful links, it would be greatly appreciated!
Martin Eastwood - April 22, 2014
I split the shots into different bins by distance and then calculated the probability per bin. I then used this set of aggregated probabilities to construct the model and fit the curve.
Anonymous - April 22, 2014
Thanks a lot! :)
I need to brush up on my stats concepts.
abhinav - July 26, 2014
how are you gathering xy coordinates from squawka? or are you using some other metric to approximate?
Martin Eastwood - July 26, 2014
It took quite a bit of effort!