Expected goals are one of the hot topics in the football analytics community at the moment, and I’ve previously written a number of articles discussing how to calculate them. If you haven’t read those pieces yet it’s probably worth taking a quick look to set the context for the rest of this article.
A few weeks back I published a simple equation for calculating expected goals that received a lot of positive feedback from readers, as it was easy to use and pretty accurate, with an r squared value of 0.86. This effectively means the equation is capable of explaining 86% of the variance in the shots data I have collected from Squawka.
For such a basic equation this is a really good result. I’d purposely tried to keep things simple so that the equation was easy enough for non-mathematicians to use, in the hope of encouraging other people to adopt it. Rather than keep these sorts of things to myself I’d much rather share them around and see them get used elsewhere.
One of the restrictions I’d set myself for this was to only use the distance the player shooting was from the goal along the X axis so that the equation only needed data along one dimension. However, I received a lot of messages through Twitter and on the blog asking about the Y axis so let’s take a look…
So the first question to ask was whether the Y axis was even worth bothering with. After all, the r squared value when just using distance along the X axis was already 0.86, which only left around 14% of the variance in the data to account for.
Well, it turns out that how far away you are from the goal along the Y axis does have an impact (Figure 1). Unsurprisingly, the further away you are, the less likely you are to score. Before you ask, the r squared value is 0.88 (I have learnt now to include r squared values for pretty much all charts otherwise I get bombarded by requests for them :-)).
Figure 1: Shots Versus Distance From Goal Along Y Axis
Okay, we know the Y axis has an effect on expected goals, but how do we factor this into my previous equation? There are a number of mathematical techniques we can use to handle multiple dimensions. However, I am keen to make this as simple as possible so that the layperson can use it, so let’s keep it basic and go with Pythagoras’ Theorem, a topic most people have touched on at high school at some point.
If we know the xy coordinates of the player taking the shot and the xy coordinates of the goal then using Pythagoras’ Theorem we can calculate the total distance between the two points. Figure 2 shows the equation for this, where dx is the distance between the two x coordinates, dy is the distance between the two y coordinates and AB is the total distance the player is from the goal.
$AB=\sqrt{dx^2+dy^2}$
Figure 2: Calculating the distance between two points
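For anyone who would rather see this as code, here is a minimal Python sketch of the distance calculation. The function name `shot_distance` and its parameters are my own for illustration, and it assumes the shot and goal coordinates are already in the same units (metres, say) on the same pitch coordinate system.

```python
import math

def shot_distance(shot_x, shot_y, goal_x, goal_y):
    """Straight-line distance AB between the shot location and the goal.

    Hypothetical helper: assumes both points use the same units and axes.
    """
    dx = goal_x - shot_x  # distance between the two x coordinates
    dy = goal_y - shot_y  # distance between the two y coordinates
    return math.hypot(dx, dy)  # sqrt(dx^2 + dy^2)

# A shot 3 units from the goal along X and 4 units along Y is 5 units away.
print(shot_distance(3.0, 4.0, 0.0, 0.0))
```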
I did this for all 17,000 shots I have collected so far from Squawka (excluding penalties) to get their total distances from goal and calculated the probability of scoring from different distances based on the number of shots taken versus goals scored (Figure 3).
Figure 3: Shots Versus Total Distance From Goal
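As a rough illustration of how scoring probabilities like these can be worked out from shots taken versus goals scored, here is a small Python sketch. It is not the exact code behind the chart: the function name `conversion_by_distance`, the one-unit bin width and the tiny made-up sample are all assumptions for the example.

```python
from collections import defaultdict

def conversion_by_distance(shots, bin_width=1.0):
    """Group shots into distance bins and return goals / shots for each bin.

    shots: iterable of (distance, is_goal) pairs.
    bin_width: assumed bin size; the right value depends on your data.
    """
    totals = defaultdict(lambda: [0, 0])  # bin start -> [shots, goals]
    for distance, is_goal in shots:
        bin_start = int(distance // bin_width) * bin_width
        totals[bin_start][0] += 1
        if is_goal:
            totals[bin_start][1] += 1
    return {b: goals / n for b, (n, goals) in sorted(totals.items())}

# Made-up sample purely to show the shape of the input and output.
sample = [(5.2, True), (5.8, False), (12.1, False),
          (12.9, False), (12.4, True), (12.0, False)]
print(conversion_by_distance(sample))
```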
As before, I’m using a power curve to fit the line through the data and as you can see it’s a pretty good fit. So what is the effect of adding in the Y axis? Well the r squared value has changed from 0.86 to…
drumroll
0.95
Yep, including both the x and y axes in the expected goals model accounts for 95% of the variance in the data. This barely leaves any room for the shooting player’s talent to have any effect or even for defensive pressure to play a part.
At first I thought this seemed a bit odd but thinking about it in more detail it actually seems logical. It doesn’t make much difference whether you are shooting from five metres out against a strong defence or a weak one; you still have the same chance of scoring from that particular position.
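If you want to try a power-curve fit and r squared on your own numbers, one common approach is a straight-line least-squares fit on the log10 of both distance and probability. The sketch below does that in plain Python. Treat it as an assumption rather than a description of my workflow: the function name `fit_power_curve` is made up for the example, the r squared is measured on the log-transformed values, and any distance bins with zero goals would need to be excluded first because you cannot take the log of zero.

```python
import math

def fit_power_curve(distances, probabilities):
    """Fit probability = 10**intercept * distance**exponent.

    Ordinary least squares on log10(distance) vs log10(probability).
    Returns (exponent, intercept, r_squared), with r squared in log space.
    """
    xs = [math.log10(d) for d in distances]
    ys = [math.log10(p) for p in probabilities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    intercept = mean_y - exponent * mean_x
    ss_res = sum((y - (intercept + exponent * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return exponent, intercept, 1 - ss_res / ss_tot

# Synthetic points that follow an exact power law, just to check the fit.
distances = [2, 4, 8, 16]
probabilities = [2 * d ** -1.5 for d in distances]
print(fit_power_curve(distances, probabilities))
```

On real data the points will scatter around the curve, so the r squared will come out below 1.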
However, playing against a strong defence will likely mean you will get into that good position less often so your overall expected goals will be lower. Conversely, better players will be able to get into those good positions more often than weaker players so their overall expected goals will be higher.
In other words, at the individual shot level expected goals seems to be all about a player’s position with respect to the goal when they shoot. Other factors, such as player talent, defensive pressure etc., are probably not visible until you start looking at larger samples, such as expected goals per fixture or even per season.
Anyway, here’s the final equation:
$ExpG=Distance^{-1.33796}\times10^{0.4720605}$
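Here is the equation as a small Python sketch, along with the simplest way to use it: add up the values for every shot to get a total for a fixture. The function names are hypothetical, and the distances are assumed to be in the same units as the shot data the curve was fitted to (I talk in metres above).

```python
def expected_goals(distance):
    """Expected goals for one shot, using the equation above.

    distance: total distance from goal (must be greater than zero).
    """
    return 10 ** 0.4720605 * distance ** -1.33796

def fixture_expected_goals(distances):
    """Sum the per-shot values to get a total for a fixture."""
    return sum(expected_goals(d) for d in distances)

# Hypothetical shot distances for one team in one match.
shots = [8.0, 12.5, 20.0, 25.0]
print(fixture_expected_goals(shots))
```

One thing to watch out for: because this is a power curve, the raw value climbs above 1 for shots from a little over 2 metres or closer, so you may want to cap it at 1 if you are treating it strictly as a probability.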
Let me know what you think!
Lorenzo - April 17, 2014
How do you collect data from Squawka?
Did you write a simple scraper, or is there a public API?
Martin Eastwood - April 25, 2014
There is no API available so I collected the data myself from their site
Jonas - April 17, 2014
I am not sure if I agree with your interpretation that there is little room for player talent. If I have understood what you have done correctly, you are aggregating the shot data to get frequencies (or probabilities if you want) for each 1 meter interval. You are in other words looking across all teams, both the good ones and the bad ones. In your previous post you can clearly see that some of the good teams have rather large positive residuals. These residuals I think are interesting, as they show how good a team is after controlling for where they shoot from.
My guess is that if you look at the instances where a shot from 20 meters or more yielded a goal, they are not going to be evenly distributed across all teams, but the good teams are going to be over-represented.
Martin Eastwood - April 17, 2014
It’s a really good point and one that I agree with – I’ll take a look at this in more detail in a future post.
My point though is that at the point a shot is taken the most overwhelming factor in whether it results in a goal seems to be the position the shot is taken from. If talent was a major factor at this stage then I would expect more variability in the data and a much poorer fit in the graph.
Where I think talent will play a major role is the positions players get into to take those shots. I would expect better players to be shooting more frequently from better positions and good defences to concede fewer shots from those good positions.
I’ll hopefully take a look at all this in the coming weeks to see whether my hypotheses hold up.
Max - April 17, 2014
What happens when you take out headers?
Martin Eastwood - April 25, 2014
Will be looking into that in more detail soon
John - April 18, 2014
Well, I have the equation, but how can I use it?
I can know the probability of a goal from a certain distance, but how can I get the total expectation?
Martin Eastwood - April 18, 2014
If, for example, a particular shot had a probability of 0.5 then the shot would go in once every two shots so is worth half a goal
John - April 18, 2014
Thanks for the quick reply.
Ok, but during a football game I can shoot from everywhere, so how can I calculate the total probability during that game and how can this be calculated in relation to the teams?
How can I find values like those in your app?
Martin Eastwood - April 18, 2014
Just sum up the expectancies from the shots to get the total per fixture.
These values aren’t in the app yet though, maybe next season…
John - April 18, 2014
Oh right, and how can you calculate the fixtures at the moment?
OI - April 18, 2014
What I like best about your metric is that it is a continuous one. The expected goal algorithms I have known up to now use zonation, which is discrete. I’ve always thought that this is where improvement is needed, and if I had had the data, I might have examined the concrete influence of shooting distance on my own. But now you did it, laudably.
If I were you, having the necessary data, I would be totally curious about “x-/y-axis distance”. According to your last posts, the y-axis distance explains a higher percentage of shot conversion than the x-axis distance (0.88 > 0.86). How does it affect angles? E.g., imagine the triangle with dx and dy as the legs and AB as the hypotenuse.
The angle next to the goal could be quite a good measurement for a shot’s “centrality”, or concretely its sine: If a shot is taken straight in front of the goal, this angle is 90° (and the triangle doesn’t exist anymore). If you walk some metres to the left or to the right, the angle declines. This fits the sine, which also has its peak at 90°.
I can imagine dividing the distance AB by the sine of this angle: If a shot is taken from a central position, “AB adjusted” doesn’t increase a lot (sin 90° = 1). The effects aren’t too big as long as the angle isn’t too small. That’s because the sine curve doesn’t slope steeply just before and after 90° (e.g. sin 60° = 0.87). I think this fits the probability of scoring well: It is not a big disadvantage to shoot from slightly lateral positions, but it becomes one if you shoot from too far from centre.
It’s just one proposal to investigate out of thousands that could be done, but that one I find most interesting. Another one would be to consider the defensive side of your ExpG-metric too. But I’m sure you have developed some good plans on your own!
Martin Eastwood - April 18, 2014
Thanks, making it continuous was important to me as I also dislike the discrete approach of using zones. The angle is something I’ve been thinking about and I think it should be one of the next steps to factor in to the model. Distance is important but I agree that the angle of the shot must play a key role too. It’s definitely high up the todo list!
Tom Green - July 10, 2014
Hi Martin
Really interesting stuff. I’m still new to expected goals models and the X,Y stuff. If a shot is taken from 20 metres out, in the centre of the pitch, what would its co-ordinate be?
Thanks
Tom