This all worked pretty well, giving us an r squared value of 0.95. However, although the r squared value was good, there was still a flaw in the model that we need to fix.
Eagle-eyed readers will have noticed that the fit of the curve broke down for very short distances, meaning the probability of scoring from zero metres was actually slightly above one. And as reader Benjamin Lindqvist commented, not even Ronaldo will score more than 100% of the time, not even from the goal line. Benjamin also had a good suggestion to improve this: adding an exponential decay function into the model to make it behave better around zero.
If you aren’t familiar with exponential decay it basically means that a value decreases at a rate proportional to its current value. It’s a phenomenon that crops up fairly frequently in science and the natural world. For example, air pressure decays exponentially as you go higher up into the Earth’s atmosphere and radioactivity decreases exponentially over time.
A general equation for exponential decay is shown in Figure 1, where Y(t) is the value at time t, a is the starting value, k is the decay constant and t is time.
Figure 1: Exponential Decay
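As a rough illustration, and assuming the standard form of the equation, Y(t) = a * exp(-k * t), here is a minimal Python sketch of exponential decay. The function name and the values chosen for a and k are purely illustrative and are not taken from the model.

```python
import math

def exponential_decay(t, a, k):
    """Value at time t, given starting value a and decay constant k."""
    return a * math.exp(-k * t)

# Illustrative values only (assumptions, not fitted to any data)
a, k = 1.0, 0.5
values = [exponential_decay(t, a, k) for t in range(5)]

# "Decreases at a rate proportional to its current value" means each
# unit step multiplies the value by the same factor, exp(-k)
ratios = [values[i + 1] / values[i] for i in range(4)]
print(values)
print(ratios)
```

The constant ratio between successive values is what distinguishes exponential decay from, say, a straight-line decline.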
So how do we apply this to football? Well, the first thing to do is replace time with metres and assume that the probability of scoring a goal decreases exponentially with the distance from goal at which the shot is taken.
Next we need to find the correct value for the decay constant, as this controls the shape of the curve. Rather than doing this manually through trial and error, we can use something such as R’s optim function to find it for us. We can also tweak the equation to add in a multiplier for the independent variable and an intercept, as found in a traditional regression model, giving us the fit shown in Figure 2.
Figure 2: Shots Versus Distance From Goal
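The fitting step can be sketched roughly as follows. This is not the actual code behind Figure 2: it swaps R’s optim for a plain grid search over the decay constant, it assumes a simplified model of the form p(d) = a * exp(-k * d) + c, which may differ from the final equation, and it runs on synthetic points generated from made-up parameters rather than real shot data. The names fit_decay and k_grid are hypothetical.

```python
import math

def fit_decay(distances, probs, k_grid):
    """Least-squares fit of p(d) = a * exp(-k * d) + c.

    For a fixed k the model is linear in a and c, so those two are
    solved in closed form; k is chosen from k_grid by brute force.
    Returns (sse, a, k, c) for the best k.
    """
    n = len(distances)
    mean_y = sum(probs) / n
    best = None
    for k in k_grid:
        x = [math.exp(-k * d) for d in distances]
        mean_x = sum(x) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, probs))
        a = sxy / sxx              # multiplier on the decay term
        c = mean_y - a * mean_x    # intercept
        sse = sum((a * xi + c - yi) ** 2 for xi, yi in zip(x, probs))
        if best is None or sse < best[0]:
            best = (sse, a, k, c)
    return best

# Synthetic, noise-free points from made-up parameters (NOT real shot data)
true_a, true_k, true_c = 0.9, 0.25, 0.05
distances = list(range(0, 31))
probs = [true_a * math.exp(-true_k * d) + true_c for d in distances]

k_grid = [i / 1000 for i in range(1, 1001)]  # candidate decay constants
sse, a, k, c = fit_decay(distances, probs, k_grid)
print(f"a={a:.3f} k={k:.3f} c={c:.3f}")
print(f"predicted value at zero metres: {a + c:.3f}")
```

With this form the prediction at zero metres is simply a + c, so it is easy to check that the fitted curve meets the Y axis below 1.0. A general-purpose optimiser such as optim searches over all the parameters at once rather than stepping through a grid.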
Notice how the orange line now hits the Y axis just below 1.0? This fixes the problem we had before where it was possible to score more than one goal from a single shot. In fact, if you’re standing on the goal line the model now predicts around 0.96 expected goals, so very likely to score but with a small chance of screwing up (yes Edin Džeko I’m looking at you).
The new curve fit also pushes the r squared value up to 0.9883, meaning 98.83% of the variance for the probability of scoring from a shot can be accounted for using just distance from goal along the X and Y axes.
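For anyone unfamiliar with r squared, here is a minimal sketch of how it is usually calculated, assuming the standard definition of one minus the ratio of the residual sum of squares to the total sum of squares. The observed and predicted numbers below are invented for illustration and are not from the shot database.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Made-up scoring probabilities per distance bin (illustration only)
observed = [0.9, 0.6, 0.4, 0.2, 0.1]
predicted = [0.85, 0.65, 0.4, 0.2, 0.1]
print(round(r_squared(observed, predicted), 4))
```

A value of 1.0 would mean the curve passes exactly through every point, while values near zero mean it does no better than predicting the average every time.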
The final equation (Figure 3) is slightly more complicated now but it’s still pretty simple to use.
Figure 3: Expected Goals Equation Incorporating Exponential Decay
Figure 4: Equation for d
Here, dx and dy are the differences between the x coordinates and the y coordinates, in metres, of the shot location and the goal location.
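As a hedged sketch, and assuming that d is the straight-line (Euclidean) distance between the shot and the goal, i.e. the square root of dx squared plus dy squared, the calculation could look like this in Python. The function name and the coordinate convention (x, y in metres) are assumptions for illustration.

```python
import math

def shot_distance(shot_xy, goal_xy):
    """Straight-line distance in metres between a shot and the goal."""
    dx = shot_xy[0] - goal_xy[0]  # difference in x coordinates
    dy = shot_xy[1] - goal_xy[1]  # difference in y coordinates
    return math.hypot(dx, dy)     # sqrt(dx**2 + dy**2)

# Hypothetical shot taken 3 m across and 4 m out from the goal location
print(shot_distance((3.0, 4.0), (0.0, 0.0)))
```

The resulting d is the value that would then be plugged into the expected goals equation.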
As ever, let me know what you think!
OI - April 24, 2014
- Did you use two different samples for this article and the last one about the y-axis? Some points seem to be relocated from Figure 3 in the last piece to Figure 2 here. For example, there is hardly a difference in scoring probability between 5 and 6 metres distance in the other diagram, whereas in this diagram the difference is approximately 5%!
- Are blocked shots included in your calculations?
Martin Eastwood - April 24, 2014
Yes, in between those articles I increased the number of shots in my database by nearly 25% so hopefully some of the noise for distances where I didn’t have many shots should be smoothed out. Everything categorised as a shot by Squawka is included in the calculation except for penalties and own goals.
Benjamin Lindqvist - April 24, 2014
Glad to have been of help. If you’re interested in hearing more negativity, I think your function is now probably overfitted :)
If you have Skype, feel free to add me (benjaminlindqvist). Not all topics regarding football and numbers are suited to the public!
Martin Eastwood - April 25, 2014
Ah the perennial conflict between optimising and over-fitting :)
It’s certainly a risk considering the number of data points I have, but I’m not too concerned at the moment as it’s a fairly simple curve rather than some high-order polynomial weaving between the data points. Plus, even though the exponential decay certainly improved the fit, the actual expected goal values predicted haven’t really changed too much, so in the grand scheme of things any over-fitting probably isn’t having that much of an impact at the moment. It’s certainly something to bear in mind though!
I’ve added you to Skype, would be good to have a chat sometime if you’re free :)
PeP - April 28, 2014
I’m very intrigued by your expected goals model and I’m very impressed with the accuracy. Would it be possible to include the Z-axis in your model, or is the lack of data holding you back on this?
Martin Eastwood - April 28, 2014
By z axis do you mean location in the goal? If so, then there is no reason it couldn’t be incorporated into the model; I just don’t have the data available yet.
PeP - April 28, 2014
I meant the height off the pitch at which the ball is struck. For example, if the xy coordinates were kept the same, was the ball struck off the ground, was it struck on the volley, or was it an overhead kick?
Benjamin Lindqvist - May 5, 2014
I highly doubt that would be a convex relationship, so that would be hard to fit into this particular model.
Max - May 20, 2014
This is amazing… How did you fit the power curve? Did you find it by trial and error?
Martin Eastwood - May 20, 2014
No, it was created using mathematical optimisation techniques rather than trial and error as they can do a better job than me!
EV - July 21, 2014
This is an amazing piece of work Martin. And thank you very much for sharing it with us. The effort of gathering the data must have been enormous. I have some questions that will probably be interesting for you:
I assume you grouped the shot distances into 1m intervals, and got the probability of a goal inside each interval as number of goals / number of shots inside that interval? Then you used some max likelihood to fit the curve? However, the number of shots inside each distance interval is different. I’m guessing there were far more shots around 12m than shots around 3m, and probably no shots at 0m, but you would fit a point at (0m, P(goal)=1) for common sense anyway. Does this introduce a bias where some shots are given more weight than others in fitting the curve? So when predicting the total number of goals for a season, will this model predict significantly under the actual number of goals? (Because the decay constant is too fast due to the heavier weight given to shorter distances.)
Is it possible to fit a curve that gives equal weight to each shot? I imagine such a curve is likely to significantly underestimate the probability of scoring from short distances, but will predict season totals more accurately. But the real question is, which curve would predict individual teams’ or even individual matches’ goals more accurately?
Martin Eastwood - July 21, 2014
Yes, the shots are binned by distance so there will be different numbers of shots per bin. One way to investigate whether this causes any biases could be to bin by percentiles instead to normalize the bin sizes. Hopefully the fit of the curve will be stable enough that it wouldn’t really affect the results too much but there is only one way to find out…