pena.lt/y

Massey Ratings For Football Part Two

Martin Eastwood — Thu, 04 Dec 2014 19:30:00 +0000

Introduction

In Part One I introduced Massey Ratings and how they can be used to rank football teams in a way that accounts for their strength of schedule. Next, we’ll take a look at how Massey Ratings can be extended further to look at teams’ attack and defence strengths separately.

Massey Ratings

The idea behind Massey Ratings is that they rate teams such that the difference between any two teams is equal to the expected margin of victory between them. For example, if a team rated -1.0 played a team rated +1.0 then we’d expect the average goal difference between them to be two goals.

Since Massey Ratings look at goal difference rather than goals scored or conceded they account for a team’s overall strength and combine both their attack and defence strengths together into a single value. This means with a bit of mathematics we should be able to decompose a Massey Rating to split out these two constituent parts.

Attack And Defence

In Part One we originally defined the Massey Rating as shown below in Equation One:

$y = r_a - r_b$

where $y$ is the margin of victory for a fixture, $r_a$ is the rating of team a and $r_b$ is the rating of team b. Let’s take this a step further and define the total goals a team should score in a match as Equation Two below:

$y_a = o_a - d_b$

where $y_a$ is the number of goals team a is expected to score, $o_a$ is team a’s attack strength and $d_b$ is team b’s defence strength.

Extending this further we can say the total goals a given team should score over the course of a season is therefore equal to its attack strength multiplied by the number of matches played minus the sum of the defence strengths of all its opponents. Since we know what the teams’ overall ratings are, how many matches they’ve played, how many goals they scored and who their opponents were, we’re pretty close to having what we need.
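Written out as a formula, the season total described above looks like this (the symbols $f_a$, $n_a$ and $b_k$ are notation introduced here purely for illustration):

$f_a = n_a o_a - \sum_{k=1}^{n_a} d_{b_k}$

where $f_a$ is the total goals team a is expected to score, $n_a$ is the number of matches team a has played, $o_a$ is its attack strength and $d_{b_k}$ is the defence strength of its opponent in its k-th match.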

Decompose The Massey Matrix

Next we need to decompose the Massey Matrix we created in Part One into its diagonal and off-diagonal elements to give us two new matrices, G and P, which we use in Equation Three below:

$(G - P)r = p$

where G is a diagonal matrix of the total games each team has played, P is a matrix of the number of pairwise matchups between each pair of teams, r is the vector of the teams’ Massey Ratings and p is the vector of the teams’ goal differentials.

From here, Ken Massey uses some clever algebra to derive the equivalent of Equation Four below:

$(G + P)d = Gr - f$

where G is the diagonal matrix of total games played, P is the matrix of pairwise matchups, r is the vector of overall Massey Ratings, d is the vector of defensive ratings and f is the vector of goals scored by each team.

If you are interested in finding out more about the mathematics behind this then I heartily recommend taking a look through Ken Massey’s thesis, where he explains it in much more detail than I’ve gone into here.
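For a rough idea of where Equation Four comes from, here is a sketch of the algebra. It assumes, as in Massey’s thesis, that a team’s overall rating is the sum of its attack and defence ratings, $r = o + d$, where $o$ is the vector of attack ratings:

$f = Go - Pd$

$f = G(r - d) - Pd$

$(G + P)d = Gr - f$

The first line is Equation Two summed over every match each team has played, the second substitutes $o = r - d$ and the third simply rearranges. Once d is known, the attack ratings follow from $o = r - d$.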

Calculating The Ratings

Finally, we can now solve this linear system to get the attack and defence ratings for each team.
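As a rough illustration of this step, here is a minimal R sketch using three made-up teams and results. The data, variable names and loop are assumptions for illustration rather than the code used to produce the figures. It builds G and P, solves for the overall ratings r with the ratings constrained to sum to zero, then solves Equation Four for the defence ratings d and takes the attack ratings as o = r - d.

```r
# Made-up results for three teams, purely for illustration
results <- data.frame(
  home       = c("A", "B", "C"),
  away       = c("B", "C", "A"),
  home_goals = c(2, 1, 0),
  away_goals = c(0, 1, 3),
  stringsAsFactors = FALSE
)

teams <- sort(unique(c(results$home, results$away)))
n <- length(teams)

# G: games played on the diagonal, P: pairwise matchup counts off the diagonal
G <- matrix(0, n, n, dimnames = list(teams, teams))
P <- G
scored   <- setNames(numeric(n), teams)  # f in Equation Four
conceded <- setNames(numeric(n), teams)

for (i in seq_len(nrow(results))) {
  h <- results$home[i]
  a <- results$away[i]
  G[h, h] <- G[h, h] + 1
  G[a, a] <- G[a, a] + 1
  P[h, a] <- P[h, a] + 1
  P[a, h] <- P[a, h] + 1
  scored[h]   <- scored[h] + results$home_goals[i]
  scored[a]   <- scored[a] + results$away_goals[i]
  conceded[h] <- conceded[h] + results$away_goals[i]
  conceded[a] <- conceded[a] + results$home_goals[i]
}

# Overall ratings: solve (G - P) r = p, swapping the last equation
# for "ratings sum to zero" so the system has a unique solution
M <- G - P
p <- scored - conceded
M[n, ] <- 1
p[n] <- 0
r <- setNames(as.vector(solve(M, p)), teams)

# Defence ratings from (G + P) d = G r - f, then attack as o = r - d
d <- setNames(as.vector(solve(G + P, G %*% r - scored)), teams)
o <- r - d

print(round(cbind(overall = r, attack = o, defence = d), 3))
```

Because Equation Two subtracts the opponent’s defence strength, a higher defence rating here means fewer goals conceded. Plugging o and d back into G o - P d reproduces each team’s goals scored, which is a handy sanity check.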

Figure One: Defensive Massey Ratings

Figure Two: Offensive Massey Ratings

It’s no surprise that Manchester City and Chelsea rate highly for offensive strength but Everton are somewhat surprisingly rated the third best offensive team even though they only rank mid-table in the league. Everton may only have a goal difference of +2 at the moment but they are actually the joint third highest goal scorers in the Premier League. They are performing well offensively; it’s their defence that is letting them down, and it is actually ranked worse than relegation-threatened Burnley’s.

QPR also rate pretty high in terms of attacking strength for a team in the relegation zone. Looking at their results for this season though they managed to score two against Manchester City, scored against Chelsea and are one of the few teams to actually get a goal against Southampton so they are performing well offensively against the league’s stronger teams. Like Everton though, their defence is performing poorly and dragging down their overall performance.

What’s that at the bottom of the offensive chart in red? Why it’s Aston Villa whose attack is so poor it actually gets a negative rating! I’ve mentioned in my last two articles about how Aston Villa’s Pythagorean and Massey Ratings show them to be seriously over-placed in the league and once again here’s another metric showing how poor they are. Bizarrely, Villa are somehow in twelfth place having managed a pitiful eight goals from fourteen matches. Although they are mid-table in the league and their defensive rating is pretty good, from an offensive point of view Aston Villa’s numbers suggest they are perhaps rather fortuitous to be so far away from the relegation zone…

Further Improvements

So far the Massey Ratings have considered each match a team plays equally but Ken Massey suggests they can be improved further by weighting matches based on their importance. For example, playing a cup match against a team from a lower division is probably less relevant to calculating the ratings than say a league match against a close rival. By weighting matches appropriately we can reduce the influence less relevant matches have on a team’s ratings and potentially improve their accuracy.
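One way this weighting could be done is with weighted least squares, sketched below in R. This is an assumption about the implementation rather than Massey’s exact scheme: X is a match-by-team matrix with a 1 for one team and -1 for the other in each row, y is the margin of victory from the first team’s point of view, w is a made-up vector of match weights and `massey_weighted` is a hypothetical helper name.

```r
# Weighted Massey ratings: a sketch, not a reference implementation
massey_weighted <- function(X, y, w) {
  W <- diag(w, nrow = length(w))      # one weight per match
  M <- t(X) %*% W %*% X               # weighted Massey matrix
  p <- as.vector(t(X) %*% W %*% y)    # weighted goal differentials
  n <- ncol(X)
  M[n, ] <- 1                         # force the ratings to sum to zero
  p[n] <- 0
  setNames(as.vector(solve(M, p)), colnames(X))
}

# Three made-up matches between teams A, B and C
X <- matrix(c( 1, -1,  0,
               0,  1, -1,
              -1,  0,  1),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("A", "B", "C")))
y <- c(2, 0, -3)

equal_weights <- massey_weighted(X, y, w = c(1, 1, 1))
# Treat the third match as half as important as the others
down_weighted <- massey_weighted(X, y, w = c(1, 1, 0.5))

print(round(rbind(equal_weights, down_weighted), 3))
```

With all weights set to one this reduces to the ordinary Massey Ratings, so the weights only change the result when matches are treated differently.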

Example Code

If you are interested in having a go with Massey Ratings then I’ve put some example R code on GitHub. You’ll need to add your own data though as I’ve stripped out the section where it connects to my database for security reasons.

Comments

Peter - December 4, 2014

Great read as always.

Currently in the process of teaching myself R. Just wondering if you could give me a pointer as I’m really interested in giving this a go! What headers should the data be ordered in? Is this all taken from a league table, or from the results csv on the football-data website?

Cheers,

Peter

Martin Eastwood - December 5, 2014

It was all taken from my PostgreSQL database so you’ll need to make sure your data matches the naming conventions used in the code or change the code to match your data.

Kevin - December 5, 2014

Have you thought about improving the ratings by using expected goals, rather than goals, in your matrices?

Martin Eastwood - December 5, 2014

Not tried, but it’s an interesting idea!

Peter - December 8, 2014

Hi Martin,

I have given it a go (through Excel, not R) and while I have taken a different approach, things seem to look fairly consistent regarding the overall ratings. I’ll cautiously refer to it as an Adjusted Massey… I’m thinking decomposing these attack/defense ratings may prove a challenge however. I’m using it in conjunction with Pythagorean Expectation to gauge overall performance, and will have a blog post up fairly soon (with due reference to pena.lt/y/ for lighting the way of course)!

Cheers,

Peter

Martin Eastwood - December 9, 2014

Cool, look forward to reading it Peter!

Massey Ratings For Football Part One

Martin Eastwood — Thu, 27 Nov 2014 19:30:00 +0000

Introduction

We all know the league table can lie and one of the common causes of this is strength of schedule. Take Southampton: at the time of writing they are second in the Premier League twelve matches in, yet still haven’t played Chelsea, Manchester City, Manchester United or Arsenal. Without wishing to be dismissive of Southampton, who undoubtedly are a very talented team, there’s a pretty decent chance that they’d currently be lower down the league table had these fixtures come up earlier in the season instead of Leicester, Hull or Aston Villa.

Massey Ratings

So if we can’t rely on the league table to tell us which teams are performing best what do we do? One alternative is to use Massey Ratings. This is a method devised by Ken Massey back in 1997 for his honours thesis that rates teams based on what opposition they’ve played. The system was originally designed for American Football but it can be adapted to football fairly trivially.

The idea behind Massey Ratings is that they rate teams such that the difference between any two teams is equal to the expected margin of victory between them, as shown in Equation One below:

$y = r_a - r_b$

where $y$ is the margin of victory for a fixture, $r_a$ is the rating of team a and $r_b$ is the rating of team b.

Error

In an ideal world we’d have enough data that we could calculate true ratings for each team. However, with players moving from one team to another and football seasons typically lasting just 38 matches, we never have sufficient data for that, so we have to settle for approximating ratings based on previous match results. This means we need to modify Equation One to add in an error term that accounts for any unexplained variation in the outcome of games (Equation Two below).

$y = r_a - r_b + e$

where $y$ is the margin of victory, $r_a$ is the rating of team a, $r_b$ is the rating of team b and $e$ is the remaining error in the model.

So far so good, but how do we know what the ratings of team a and team b should equal? Well, to start with we want that error term we added into Equation Two to be as small as possible, so we use a technique called Least Squares to find the set of ratings that minimises the sum of the squared errors across the past data we have.
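In symbols, the quantity Least Squares minimises can be written as follows (the match index $i$ and the subscripts $a_i$ and $b_i$ are notation introduced here for illustration):

$\min_r \sum_{i=1}^{m} e_i^2 = \min_r \sum_{i=1}^{m} \left(y_i - (r_{a_i} - r_{b_i})\right)^2$

where $m$ is the number of past matches, $y_i$ is the margin of victory in match $i$ and $a_i$ and $b_i$ are the two teams that played in it.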

The Matrix

Things get slightly trickier here but let’s say our past data comprises m matches involving n teams. We know what the margin of victory was for each match and who won but not the ratings for each team, so we have m equations we need to solve to find the n unknown rating values, which we can write as Equation Three below:

$y = Xr + e$

where y is the vector of margins of victory, r is the vector of ratings we are trying to find, e is the vector of remaining errors and X is an m x n matrix of coefficients in which each row represents a matchup, containing a 1 for the winning team and -1 for the losing team. Unfortunately though, this gives us a very sparse matrix that is likely to be highly over-determined, making it difficult to find a unique solution to the system.

The Massey Matrix

Thankfully Massey discovered that you can modify the matrix such that the diagonal elements equal the number of games each team has played and the off-diagonal elements equal the negation of the number of matchups teams have played against each other, giving Equation Four below:

$p=Mr$

where M is the modified Massey Matrix, p is a vector of the score differentials and r is the vector of unknown ratings.

We are getting closer now but the matrix still doesn’t necessarily have a unique set of ratings, so Massey modifies it further by setting every element of the bottom row to one and the corresponding element of p to zero. This constraint creates a full rank matrix for us and forces the ratings to sum to zero.
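To make the steps above more concrete, here is a minimal R sketch with three made-up matches. It relies on the standard least squares result that the Massey Matrix is the transpose of X multiplied by X, and that p is the transpose of X multiplied by y. The data and variable names are assumptions for illustration only.

```r
teams <- c("A", "B", "C")

# Each row is one match: 1 for the first team, -1 for the second
X <- matrix(c( 1, -1,  0,    # A v B
               0,  1, -1,    # B v C
              -1,  0,  1),   # C v A
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, teams))

# Margin of victory from the first team's point of view (made-up scores)
y <- c(2, 0, -3)

# Massey Matrix and goal differentials via the normal equations
M <- crossprod(X)                 # t(X) %*% X
p <- as.vector(crossprod(X, y))   # t(X) %*% y

# M is rank deficient, so it has no unique solution on its own
print(qr(M)$rank)

# Replace the bottom row with ones and the last element of p with zero
# so that the ratings are forced to sum to zero
M_c <- M
M_c[nrow(M_c), ] <- 1
p_c <- p
p_c[length(p_c)] <- 0

ratings <- setNames(as.vector(solve(M_c, p_c)), teams)
print(round(ratings, 3))
```

The diagonal of M holds the number of games each team has played and the off-diagonal entries hold the negated number of matchups, matching the description above.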

Massey Ratings For The English Premier League

Finally, using some linear algebra we can solve the system and get the ratings for each team, shown below in Figure One.

Figure One: EPL Massey Ratings

It’s no surprise that Chelsea are ranked far ahead of anybody else in first place but Southampton do actually get ranked in second place, showing that even accounting for their easier schedule to date they deserve to be second in the league at the moment.

Interestingly, Swansea get ranked fourth rather than their current position of seventh in the league. However, Swansea have already played five of the six teams above them so their Massey Rating shows they are performing better than their raw points tally would suggest.

At the bottom of the table it’s not looking good for Aston Villa. I showed in my last article how their Pythagorean meant they were over-performing to be even as high as they are, and this is now backed up by their Massey Rating, which ranks them in one of the relegation spots.

Next Steps

In my next article I’ll show how we can take Massey Ratings a step further and decompose teams’ overall ratings into separate ratings for both attack and defence. I’ll also add some example code too so you can have a go calculating them yourself.

In the meantime, if you are interested in finding out more about the maths behind Massey Ratings then take a look at Ken Massey’s honours thesis which goes into the theory in much more depth than my brief overview here.

English Premier League Pythagorean

Martin Eastwood — Tue, 04 Nov 2014 19:30:00 +0000

I’ve not posted this for a while so here is the latest Pythagorean for the English Premier League.

Football Pythagorean

If you’ve seen this before, it’s an adaptation of the baseball Pythagorean that allows you to estimate how many points a team would be expected to achieve on average based on the number of goals they have scored and conceded. It’s a simple equation but it is surprisingly accurate.

Take a look at my previous blog posts if you want to find out more about the theory behind it, how it was tested and what the equation itself actually looks like.

The Season So Far

Figure One below shows the difference between how many points teams have achieved in the English Premier League and how many points their Pythagorean record predicts they should have.

Figure One: EPL Pythagorean Results So Far

Worryingly for Aston Villa they’ve currently got five points more than would be expected based on their goal record. These points were all from their crazy start to the season in which they were undefeated in their first four matches, somehow coming out with ten points even though they scored just four goals. However, they appear to have regressed somewhat since then managing just a single goal and zero points from their last six matches. It’s not looking good…

Chelsea are also up five points more than expected but in contrast things could not be looking better. They are playing well and gaining more points than their goal scoring record suggests. All the signs of potential champions – a good team that are exceeding their expected points. If you want to win the league you have to be good and lucky!

Comments

Anton Bashtavy - December 22, 2014

I’m not sure about Aston Villa – they appear to be unlucky with their shots/goals ratio. They score from every 15th shot on average (with 10 being the league average), and it’s not because they shoot a lot.

If both “clutch luck” and shots/goals ratio revert to the mean, it’ll be more or less the same in terms of points per game.

Predicting Football Using R

Martin Eastwood — Sun, 02 Nov 2014 19:30:00 +0000

I recently gave a presentation to the Manchester R Users' Group discussing how to predict football results using R. My presentation gave a brief overview of how to create a Poisson model in R and apply the Dixon and Coles adjustment to it to account for dependence in the scores.

The slides are below for anybody interested and contain enough example R code to get you started. Unfortunately, there are no slide notes but hopefully the slides should be descriptive enough to get you going!

Example code from the presentation can be found at my GitHub account
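For anyone curious what the Dixon and Coles adjustment does to the scoreline probabilities, here is a minimal R sketch. It is not the code from the slides: the expected goals `lambda` and `mu` and the dependence parameter `rho` are made-up values (in practice rho has to be estimated from data), and `dc_tau` is a hypothetical helper name for the correction factor from Dixon and Coles’ paper.

```r
# Dixon and Coles correction factor for low-scoring results.
# x = home goals, y = away goals, lambda / mu = expected home / away goals
dc_tau <- function(x, y, lambda, mu, rho) {
  if (x == 0 && y == 0) return(1 - lambda * mu * rho)
  if (x == 0 && y == 1) return(1 + lambda * rho)
  if (x == 1 && y == 0) return(1 + mu * rho)
  if (x == 1 && y == 1) return(1 - rho)
  1
}

# Made-up parameters, for illustration only
lambda <- 1.5
mu <- 1.1
rho <- -0.1
max_goals <- 6

# Independent Poisson probabilities: rows are home goals, columns away goals
probs <- outer(dpois(0:max_goals, lambda), dpois(0:max_goals, mu))

# Only the 0-0, 0-1, 1-0 and 1-1 scorelines are adjusted
adjusted <- probs
for (x in 0:1) {
  for (y in 0:1) {
    adjusted[x + 1, y + 1] <- probs[x + 1, y + 1] * dc_tau(x, y, lambda, mu, rho)
  }
}

print(round(adjusted[1:3, 1:3], 4))
```

With a negative rho the 0-0 and 1-1 draws become slightly more likely and the 1-0 and 0-1 results slightly less likely, while the total probability stays the same.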

Predicting Football Using R from Martin Eastwood

Comments

Anonymous - November 3, 2014

Can you explain the +/- Dixon/Coles adjustment?

Martin Eastwood - November 3, 2014

Sure, if you are interested in the theory behind it then I recommend reading Dixon and Coles’ paper where they propose their adjustment to account for dependency between the scores – http://www.math.ku.dk/~rolf/teaching/thesis/DixonColes.pdf

Peter - November 4, 2014

Is it possible for you to share the R code, how you implemented this adjustment in the model?

Martin Eastwood - November 4, 2014

Hi Peter – I’m not planning on adding the Dixon and Coles adjustment to the code as it was intended just as a simple demonstration rather than a full model. The adjustment requires carrying out an optimisation to estimate rho, which in turn requires a cost function etc, so it increases the complexity of the example considerably.

Jonas - November 3, 2014

Do you apply the Dixon & Coles adjustment to the probabilities you got from the independent goals model? Do you estimate the rho parameter independently of the other parameters then?

Martin Eastwood - November 3, 2014

Hi Jonas, yes that’s right. You’ll need to run an optimisation to get rho and then use that to modify the probabilities from the Poisson model.

Jonas - November 3, 2014

That’s a neat trick, probably much easier than to fit the complete Dixon & Coles model :)

Seth Dobson - November 4, 2014

Hi Martin! Thanks for posting this. Looking forward to trying it out on the SPFL.

Have you ever tried the fbRanks package in R?

Martin Eastwood - November 4, 2014

I didn’t even know it existed, will take a look!

Expected Goals: Foot Shots Versus Headers

Martin Eastwood — Thu, 28 Aug 2014 19:30:00 +0100

Introduction

My last article on expected goals introduced the concept of using exponential decay to estimate the probability of scoring based on the shooter’s distance from the goal. The article received lots of feedback (thanks everyone!!), with a couple of common comments standing out that I wanted to address.

Simplifying The Model

One common theme was whether the model was at risk of over-fitting and this is certainly something I was concerned about myself. In fact, I have since simplified the model to the equation below to help minimise this risk:

$\mathrm{expg} = \exp(-\mathrm{distance} / a)$

Figure 1: Simplified Expected Goals Equation

As well as reducing the complexity of the model and making it easier to calculate the expected goals, the new equation has fewer parameters so the potential for overfitting is lower. The correlation between actual / expected goals has fallen slightly from 0.98 to 0.97 but the advantages of the simpler equation far outweigh such a minimal change.

Headers Versus Foot Shots

Another common question was whether it was important to split out headers and foot shots into separate models as the previous articles have so far ignored headers due to lack of data.

To investigate this I have been busy all summer collecting more shot data. I’m up to 45,000 shots in total now, including around 7,500 headers so I’m at the point where I’m happy to start the preliminary work comparing foot / headed shots although I certainly want more headers before drawing any definite conclusions.

I’ve run through all the curve fitting again for both headers and foot shots and plotted the resulting probability curves in Figure 2 below.

Figure 2: Expected Goals: Shots Versus Headers

As you can see, headers have a noticeably lower chance of leading to a goal. The gap between head and foot shots appears largest around the ten metre mark, where foot shots have pretty much twice the probability of scoring. By 22 metres the chance of scoring from a header is virtually zero, while foot shots don’t reach this level until around 40 metres out.

Conclusions

But is this difference significant and do we actually need to bother creating separate expected goals models for headers and foot shots?

Well, if we compare the two probability curves against each other then the p value comes out at 0.064. Typically we take p values of 0.05 or lower to signify significance, so by that count the difference between the two is not statistically significant.

However, p values should never be about some absolute cut off where <= 0.05 equals significance and everything else can just be ignored.

Having a value close to significance suggests that there may be a real difference there, especially as the amount of data for headers is still limited, so it’s certainly possible that headers and foot shots will warrant separate models. Luckily with the current equation this is really simple to do as we just need to alter the value of a as shown below in the appendix. This is an area I’ll be exploring in more detail as I add more headers to my database.

Appendix: Using the Expected Goals Model

To use the expected goals model you just need two numbers:

x = distance from goal in metres along x axis

y = distance from centre of goal in metres along y axis

These can then be used to calculate the total distance the shot is taken from:

$\mathrm{distance} = \sqrt{x^2 + y^2}$

The expected goals for the shot is then just:

$\mathrm{expected\ goals} = \exp(-\mathrm{distance} / a)$

where a = 4.4 for headers and 7.1 for foot shots.

Example

Here’s an example for a player taking a header from the penalty spot.

x = 11 as penalty spots are roughly 11 metres from the goals (equal to 12 yards)

y = 0 as penalty spots should be level with the centre of the goal

$\mathrm{distance} = \sqrt{11^2 + 0^2} = 11$

$\mathrm{expected\ goals} = \exp(-11 / 4.4) \approx 0.08$

So on average, a header from the penalty spot would be worth around 0.08 goals.

Easy, just don’t forget you need to use negative distance inside the exponential!
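If you would rather let R do the arithmetic, the calculation above can be wrapped up in a small function. This is a minimal sketch: `expected_goals` and its argument names are hypothetical, while the values of a are the ones given in this appendix.

```r
# Expected goals from shot location, using a = 4.4 for headers
# and a = 7.1 for foot shots
expected_goals <- function(x, y, type = c("foot", "header")) {
  type <- match.arg(type)
  a <- if (type == "header") 4.4 else 7.1
  distance <- sqrt(x^2 + y^2)   # total distance from the centre of the goal
  exp(-distance / a)            # note the negative distance
}

# Header from the penalty spot: roughly 0.08
print(round(expected_goals(11, 0, "header"), 2))

# The same position for a foot shot: roughly 0.21
print(round(expected_goals(11, 0, "foot"), 2))
```

Both x and y are in metres, as in the definitions above.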

Comments

Antony Lee - September 1, 2014

think the model fitted is much better now, as the previous one with a constant implied that there was a non-zero probability of scoring from 100m!

did you manage yet, as you now have 45000 data points, to check the stationarity in the underlying process by comparing season-on-season fitting parameters?

Martin Eastwood - September 2, 2014

Thanks, I’ll be taking a look at that soon!

Matthew Langston - September 4, 2014

Do you use any particular software or program to calculate the xy co-ordinates from the Squawka stats page?

Martin Eastwood - September 4, 2014

All the processing of the data and model fitting etc was done using R and SQL

OI - September 5, 2014

Firstly, thank you very much for continuously sharing your model to the readers. This is particularly valuable for other bloggers like me, and I’ll probably publish some Expg results on my German blog linked above (if you permit).

Secondly, I think that the separation of headers is a very large step forward. I have to repeat my thanks. As you’ve examined yourself, there is a (nearly) significant difference between headers and foot shots on the long term, and certainly the difference is even more significant in smaller sample sizes (for single chances). Having tried out the older version, I had the feeling that the ExpG values are generally too low. The results from the new formula fit my subjective impressions much better.

Thirdly, I still see a possibility to improve your model (although this might sound a bit ridiculous with R2=0.97). In my opinion, the angle is a bit underrepresented. I know you include “dy”, but imagine a foot shot from dx=1 and dy=6.5. The total distance is 6.58 and the ExpG value 0.396. The angle to the middle of the goal of 9° is very sharp. I can’t imagine that players really convert this chance in 39.6 of 100 tries. As R2=0.97 for distance alone proves, the overall difference might not be so big, but similar to head/feet there can be a big difference for a single chance.

What about the angle of view (see here: http://blog.kickdex.com/post/52303980749/angle-of-view)? The angle of view for the example shot is 14°, whereas it is 36,7° for a shot from the penalty spot (distance: 6,58m vs. 11m, angle of view 14° vs. 36,7°!). I deduced a formula to compute the angle of view from dy and dx. Unfortunately, it is clearly more complicated than the simple Pythagoras, and I don’t know how to paste a screenshot of it in the comment section. The mathematical text by itself would be unreadable. Are you interested in the angle of view? If yes, we should find a possibility to share the formula, if not, I’d completely understand that you prefer simplicity, especially with the simple version being very accurate (I mainly ask because I myself want to know if the angle of view is more accurate than distance alone ;).)

Martin Eastwood - September 5, 2014

Thanks for the message, yes you are welcome to use the ExpG results on your blog but I would appreciate it if you acknowledge me and provide a link back to my site :)

Also, thanks for the link about the angle of view, I have not seen that before and it certainly looks interesting. I’ll add it to my todo list to investigate further when I get some free time and will let you know how I get on!

Jamie - September 7, 2014

I don’t know how you collected the data (from squawka?) but I can see how it might be possible to extract the location & result of each shot from squawka. I can’t see how to distinguish between shots & headers though.

Did you have to collect them separately or were you able to filter the data later and do you have any suggestions for collecting such data?

Also, I don’t know how much you use/keep track of your fixture predictions but there seems to be some ‘errors’. For example: Metz v Nantes has Predicted Goals = 0.001 & 0.000 respectively.

It is a shame you aren’t able to post more often.

Gareth Owen-Smith - November 12, 2014

Hi – really interesting methodology, thanks for sharing! I have had my own go at scraping shot data off squawka using selenium webdriver in python and trying to get a model based on both x-distance and y-distance, based on your approach (using R, which I’m happy to share, if you want?). I haven’t separated by foot shots or headers, but that should be easy enough to do later. My expected goals model based on x, y coordinates (in yards), is: xG = exp(-x/9.67)*exp(-y/11.0)

Martin Eastwood - November 12, 2014

Looks interesting Gareth! Definitely take a look at splitting out the headers / foot shots though as I expect you’ll see a difference in the model coefficients between the two.

Mathematically Optimising Your Fantasy Football Team

Martin Eastwood — Thu, 24 Jul 2014 19:30:00 +0100

Introduction

The Premier League Fantasy Football is back ready for the new season so I thought I’d run through an example of how linear programming can help you select your team. If you haven’t come across linear programming before, it’s a mathematical optimisation technique that can be used to maximise the total number of points your team is worth within a set of constraints, e.g. staying within budget and not signing too many players from the same team.

Collecting The Data

The first thing we are going to need to do is scrape some data to optimise our team with so let’s fire up R. We are going to need the names of all the players that are available, what team they play for, how much they cost to sign and most importantly how many points they are worth. Conveniently, we can exploit the structure of the Premier League’s website to get the data and use it as a pseudo API.

DISCLAIMER: there is a fine line between scraping someone’s web site and creating a denial-of-service attack so make sure you spread out your calls to the website. Trying to scrape all the data in quick succession can put unnecessary strain on the site’s servers. If you scrape somebody’s data please ensure you do it in a way that does not impact the service they are providing!

# Load libraries
library(lpSolve)
library(stringr)
library(RCurl)
library(jsonlite)
library(plyr)
# Scrape the data, one player id at a time
df = ldply(1:521, function(x){
  # Scrape responsibly kids, we don't want to ddos
  # the Fantasy Premier League's website
  Sys.sleep(2.5)
  url = sprintf("http://fantasy.premierleague.com/web/api/elements/%s/?format=json", x)
  json = fromJSON(getURL(url))
  # Rescale the cost so it is in millions of pounds
  json$now_cost = json$now_cost / 10
  # Keep only the fields we need for the optimisation
  data.frame(json[names(json) %in%
    c('web_name', 'team_name', 'type_name', 'now_cost', 'total_points')])
})

Constraints

Now we have the data we need to think about the constraints we will have to build into the linear system. For example, we can only spend a maximum of £100 million, we cannot have more than three players from the same team and are restricted to two goalkeepers, five defenders, five midfielders and three forwards.

# Create the constraints
num_gk = 2
num_def = 5
num_mid = 5
num_fwd = 3
max_cost = 100
# Create vectors to constrain by position
df$Goalkeeper = ifelse(df$type_name == "Goalkeeper", 1, 0)
df$Defender = ifelse(df$type_name == "Defender", 1, 0)
df$Midfielder = ifelse(df$type_name == "Midfielder", 1, 0)
df$Forward = ifelse(df$type_name == "Forward", 1, 0)
# Create vector to constrain by max number of players allowed per team
team_constraint = unlist(lapply(unique(df$team_name), function(x, df){
  ifelse(df$team_name==x, 1, 0)
}, df=df))
# Next we need the constraint directions: four position counts ("="),
# then the budget and one row per team ("<=")
const_dir <- c("=", "=", "=", "=", rep("<=", 1 + length(unique(df$team_name))))

The Objective

We also need to create the vector defining our objective, which is to maximise the number of points the team is worth within the constraints we are setting.

# The vector to optimize against
objective = df$total_points

Solving The Matrix

Finally, we put all the constraints into a matrix and let R solve the linear system to create our mathematically optimised team selection.

# Put the complete matrix together, one row per constraint
const_mat = matrix(c(df$Goalkeeper, df$Defender, df$Midfielder, df$Forward,
                     df$now_cost, team_constraint),
                   nrow=(5 + length(unique(df$team_name))),
                   byrow=TRUE)
# Right-hand sides: position counts, budget, then a maximum of three players per team
const_rhs = c(num_gk, num_def, num_mid, num_fwd, max_cost,
              rep(3, length(unique(df$team_name))))
# And solve the linear system
x = lp("max", objective, const_mat, const_dir, const_rhs, all.bin=TRUE, all.int=TRUE)
# Print the selected players, grouped by position
print(arrange(df[which(x$solution==1),], desc(Goalkeeper), desc(Defender),
  desc(Midfielder), desc(Forward), desc(total_points)))

The Results

The team the linear solver selected is shown in the table below – this is the team with the highest possible number of points that can be achieved using the constraints we are working within.

Position	Team	Points	Name	Cost (£)
Goalkeeper	Everton	160	Howard	5.5
Goalkeeper	Crystal Palace	144	Speroni	5
Defender	Everton	180	Coleman	7
Defender	Chelsea	172	Terry	6.5
Defender	Arsenal	157	Mertesacker	6
Defender	Arsenal	155	Koscielny	6
Defender	Southampton	149	Fonte	5.5
Midfielder	Man City	241	Yaya Touré	11
Midfielder	Liverpool	205	Gerrard	9
Midfielder	Crystal Palace	131	Puncheon	6
Midfielder	Stoke	126	Sidwell	5.5
Midfielder	West Ham	125	Noble	5.5
Forward	Arsenal	187	Giroud	8.5
Forward	Liverpool	179	Lambert	7.5
Forward	Aston Villa	106	Weimann	5.5

Limitations

Now, before the internet gets grumpy and starts trolling me (whenever I’ve mentioned using mathematics for fantasy football people seem to get very irate) there are a few obvious limitations worth pointing out. First of all the new football season hasn’t started so I’m using the points totals from last season. This means all the players at the promoted teams and any new signings to the Premier League will have zero points and so will not get selected. I’m planning on running this script regularly throughout the coming season though to help guide my transfers, so as these players gain points they will start to get selected by the linear solver if they perform well enough.

Also, we’ve set the constraints to optimise for the best squad. You may want to spend all your money on the best possible first eleven and go for budget substitutes instead. For example, the table below shows what happens if you optimise for eleven players playing 1-3-4-3 at a total price of £82 million (this leaves enough to buy four substitutes at £4.5 million each).
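To give a flavour of how the first-eleven version might be set up, here is a minimal self-contained sketch. The player pool is made up purely for illustration (it is not the scraped data), and the variable names are assumptions; only the formation, the £82 million budget and the limit of three players per team come from the text above.

```r
library(lpSolve)

# Made-up player pool, purely for illustration
players <- data.frame(
  name   = paste0("P", 1:16),
  team   = rep(c("T1", "T2", "T3", "T4"), times = 4),
  pos    = c(rep("Goalkeeper", 2), rep("Defender", 5),
             rep("Midfielder", 5), rep("Forward", 4)),
  cost   = c(5.5, 5.0, 7.0, 6.5, 6.0, 5.5, 5.0, 11.0,
             9.0, 8.5, 6.0, 5.5, 8.5, 7.5, 7.5, 5.5),
  points = c(160, 144, 180, 172, 157, 149, 120, 241,
             205, 178, 131, 126, 187, 179, 152, 106),
  stringsAsFactors = FALSE
)

positions <- c("Goalkeeper", "Defender", "Midfielder", "Forward")
formation <- c(1, 3, 4, 3)   # 1-3-4-3
max_cost  <- 82
teams     <- unique(players$team)

# One constraint row per position, one for cost and one per team
pos_rows  <- t(sapply(positions, function(p) as.numeric(players$pos == p)))
team_rows <- t(sapply(teams, function(tm) as.numeric(players$team == tm)))
const_mat <- rbind(pos_rows, players$cost, team_rows)
const_dir <- c(rep("=", length(positions)), "<=", rep("<=", length(teams)))
const_rhs <- c(formation, max_cost, rep(3, length(teams)))

# Maximise total points with a binary "picked or not" variable per player
sol <- lp("max", players$points, const_mat, const_dir, const_rhs, all.bin = TRUE)

eleven <- players[sol$solution > 0.5, ]
print(eleven)
print(sum(eleven$cost))
```

The only changes from the full-squad version are the position counts and the budget, so the same approach should work for any other formation.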

Position	Team	Points	Name	Cost (£)
Goalkeeper	Everton	160	Howard	5.5
Defender	Everton	180	Coleman	7
Defender	Chelsea	172	Terry	6.5
Defender	Southampton	149	Fonte	5.5
Midfielder	Man City	241	Yaya Touré	11
Midfielder	Liverpool	205	Gerrard	9
Midfielder	Liverpool	178	Lallana	8.5
Midfielder	Stoke	126	Sidwell	5.5
Forward	Arsenal	187	Giroud	8.5
Forward	Liverpool	179	Lambert	7.5
Forward	Southampton	152	Rodriguez	7.5

Interesting (for me at least) is that the cost of the players is fairly evenly spread across the team. Typically, when I select my fantasy football teams I tend to splash the cash on the big name strikers and then go for cheap defenders. However, based on these results that’s looking like a bad decision so this season I’m going to follow the data and actually sign some decent defenders. Wish me luck…

Appendix

All code is available on GitHub

Comments

Peer - July 25, 2014

Hi. Interesting article.

I would be interested to know if the code can be utilised to pick the best available player with the added variables of points achieved per minutes played?

Martin Eastwood - July 25, 2014

Sure, it’s just a question of constructing the necessary constraints for the solver to optimise against

Neal Thurman - July 26, 2014

The theory is fine but for what you’re recommending to have any utility it needs to account for changes in situation from last season. Lambert isn’t likely to start now that he’s at Liverpool, Rodriguez is currently recovering from an injury, Sidwell isn’t likely to be the focal point at Stoke that he was at Fulham. What would be interesting is to see what the top 20 or 50 configurations look like and to see if the “spread the money around” strategy is dominant to buying a few very expensive players and filling in with bargains or if it just happens that the best outcome last season was spreading it around but there was a “galactico” strategy that was almost as good. Regardless, interesting food for thought. Cheers – Neal

EV - July 29, 2014

Good work. Will follow this. Possibly try different objective functions, like points/min played? and maybe adjust for this season’s schedule?

This approach has real potential for greatness.

jester112358 - July 30, 2014

Loved the post.

Didn’t read the code, but in the best 11 scenario: have you set the team to play 3-4-3, or why does the code leave out Sagna, who has more points at the same price as Sidwell?

Martin Eastwood - July 30, 2014

Yes, I set the constraints to use a 3-4-3 formation.

Luis Pacheco - August 7, 2014

Thank you so much for the script! I was doing this with Excel and it is not as easy as just clicking enter.

I found that with 85 budget the best starting eleven for points was the 5-3-2. Second 4-4-2. I always played the 3-4-3. Now, I’m going to change it!

Martin Eastwood - August 8, 2014

That’s really interesting, looks like my trusty 3-4-3 I’ve been using for the past few years may not be the optimal formation!

Shalin - August 8, 2014

Hi Martin,

As someone who has a beginner knowledge of analytics and related tools, I had a few queries related to obtaining the data required for this solver. It seems you directly scrape the data into R from the FPL API. Are there any other reliable data sources available that you would recommend?

I believe it would also be possible to run such a solver in Excel. Your thoughts?

Martin Eastwood - August 8, 2014

If you want player data then Squawka and WhoScored are probably your best places to look. Yes, Excel and Open Office both have solvers built in so I expect it’s possible to do something similar with them.

Shalin - August 8, 2014

Hi Martin,

Thanks!

Shalin.

Pete - August 15, 2014

Interesting – I always wondered what the perfect optimisation of value would look like – was thinking of trying it out but Rodriguez and Lallana are out and Coleman might be. Give it another go but without injured players – quick! Haha

Brendan - August 25, 2014

This is very cool

Is it possible to make the objective function take into account the presence of a captain? I can’t think of a way of doing this that keeps it a linear constraint

Martin Eastwood - August 25, 2014

Thanks, it’s something I’d like to include but I’ve not come up with a suitable way to add it in yet.

marko - December 17, 2014

teams of 12 players with requirement that one player is played twice? maybe…

any chance of running the algorithm with points so far this season and current values?

Martin Eastwood - December 17, 2014

Good idea, will post a follow up when I get chance with the updated team recommendation. Thanks!

Expected Goals And Exponential Decay

Martin Eastwood — Tue, 22 Apr 2014 19:30:00 +0100

Introduction

In my last article on expected goals I showed how to incorporate the distance from goal along the Y axis into the expected goal model using Pythagoras' Theorem.

This all worked pretty well, giving us an r squared value of 0.95. However, while the r squared value was good there was still a flaw in the model we need to fix.

Better than Ronaldo

Eagle-eyed readers will have noticed that the fit of the curve broke down for very short distances, meaning the probability of scoring from zero metres was actually slightly above one. And as reader Benjamin Lindqvist commented, not even Ronaldo will score more than 100% of the time, not even from the goal line. Benjamin also had a good suggestion to improve this, adding an exponential decay function into the model to make it behave better around zero.

Exponential Decay

If you aren’t familiar with exponential decay it basically means that a value decreases at a rate proportional to its current value. It’s a phenomenon that crops up fairly frequently in science and the natural world. For example, air pressure decays exponentially as you go higher up into the Earth’s atmosphere and radioactivity decreases exponentially over time.

A general equation for exponential decay is shown in Figure 1, where y(t) is the value at time t, a is the starting value, k is the (positive) decay constant and t is time.

$y(t)=ae^{-kt}$

Figure 1: Exponential Decay

So how do we apply this to football? Well, the first thing to do is replace time with metres and assume that the probability of scoring a goal decreases exponentially based upon the distance from goal the shot is taken from.

Next we need to find the correct value for the decay constant as this controls the shape of the curve. Rather than doing this manually through trial and error, we can use something such as R’s optim function to find it for us. We can also tweak the equation to add in a multiplier for the independent variable and an intercept as found in a traditional regression model giving us the fit shown in Figure 2.

Figure 2: Shots Versus Distance From Goal
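For anyone wanting to try a fit like this without R, here is a minimal Python sketch. It is not the code behind the article: instead of `optim` it swaps in a simple grid search over the decay scale, solving for the multiplier and intercept by ordinary least squares at each step. The real binned shot data isn't shown here, so the points below are synthetic and noise-free, generated from the final equation given later in this post; the fit should therefore simply recover the same numbers.

```python
import math

# Assumption: synthetic points generated from the article's final equation,
# standing in for the real binned shot data.
TRUE_SCALE, TRUE_MULT, TRUE_INTERCEPT = 4.79, 0.921985, 0.036212
distances = [float(d) for d in range(0, 31)]
probabilities = [
    TRUE_MULT * math.exp(-d / TRUE_SCALE) + TRUE_INTERCEPT for d in distances
]


def fit_linear(xs, ys):
    """Ordinary least squares for ys = mult * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    mult = sxy / sxx
    return mult, mean_y - mult * mean_x


def fit_decay(ds, ps, scales):
    """Try each candidate decay scale and keep the lowest squared error."""
    best = None
    for scale in scales:
        xs = [math.exp(-d / scale) for d in ds]
        mult, intercept = fit_linear(xs, ps)
        sse = sum((mult * x + intercept - p) ** 2 for x, p in zip(xs, ps))
        if best is None or sse < best[0]:
            best = (sse, scale, mult, intercept)
    return best[1:]


# Candidate scales from 1.00 to 20.00 metres in steps of 0.01
candidate_scales = [i / 100 for i in range(100, 2001)]
scale, mult, intercept = fit_decay(distances, probabilities, candidate_scales)
print(f"scale={scale:.2f}, multiplier={mult:.6f}, intercept={intercept:.6f}")
```

The scale here is the 4.79 metres in the exponent, so the decay constant is 1/4.79. With real, noisy data a general-purpose optimiser is the more sensible choice; the grid search just keeps the sketch dependency-free.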

Notice how the orange line now hits the Y axis just below 1.0? This fixes the problem we had before where it was possible to score more than one goal from a single shot. In fact, if you’re standing on the goal line the model now predicts around 0.96 expected goals, so very likely to score but with a small chance of screwing up (yes Edin Džeko I’m looking at you).

The new curve fit also pushes the r squared value up to 0.9883, meaning 98.83% of the variance for the probability of scoring from a shot can be accounted for using just distance from goal along the X and Y axes.

The final equation (Figure 3) is slightly more complicated now but it’s still pretty simple to use.

$expg=e^{-d/4.79}*0.921985+0.036212$

Figure 3: Expected Goals Equation Incorporating Exponential Decay

where:

$d=\sqrt{dx^2+dy^2}$

Figure 4: Equation for d

and dx, dy are the differences between the x coordinates and y coordinates, in metres, of the shot location and the goal location.
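A minimal Python sketch of the equation in use is shown below. The function name and the example shot offsets are made up for illustration; only the constants come from the equation above.

```python
import math


def expected_goals(dx, dy):
    """Expected goals for a shot dx, dy metres from the goal (decay model)."""
    d = math.hypot(dx, dy)  # straight-line distance, sqrt(dx^2 + dy^2)
    return math.exp(-d / 4.79) * 0.921985 + 0.036212


# Hypothetical shots given as (dx, dy) offsets from the goal in metres
for dx, dy in [(0.0, 0.0), (3.0, 4.0), (11.0, 0.0), (20.0, 5.0)]:
    print(f"dx={dx:4.1f} dy={dy:4.1f} -> {expected_goals(dx, dy):.3f} expected goals")
```

The first shot, taken from the goal line itself, gives the value of roughly 0.96 mentioned above, and every other distance gives something smaller.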

As ever, let me know what you think!

Comments

OI - April 24, 2014

Two Questions:

- Did you use two different samples for this article and the last one about the y-axis? Some points seem to be relocated from Figure 3 in the last piece to Figure 2 here. For example, there is hardly a difference in scoring probability between 5 and 6 metres distance in the other diagram, whereas in this diagram the difference is approximately 5%!

- Are blocked shots included in your calculations?

Martin Eastwood - April 24, 2014

Yes, in between those articles I increased the number of shots in my database by nearly 25% so hopefully some of the noise for distances where I didn’t have many shots should be smoothed out. Everything categorised as a shot by Squawka is included in the calculation except for penalties and own goals.

Benjamin Lindqvist - April 24, 2014

Hi Martin,

Glad to have been of help. If you’re interested in hearing more negativity, I think your function is now probably overfitted :)

If you have Skype, feel free to add me (benjaminlindqvist). Not all topics regarding football and numbers are suited to the public!

Martin Eastwood - April 25, 2014

Ah the perennial conflict between optimising and over-fitting :)

It’s certainly a risk considering the number of data points I have but I’m not too concerned at the moment as it’s a fairly simple curve rather than some high-order polynomial weaving between the data points. Plus, even though the exponential decay certainly improved the fit the actual expected goal values predicted haven’t really changed too much, so in the grand scheme of things any over-fitting probably isn’t having that much of an impact at the moment. It’s certainly something to bear in mind though!

I’ve added you to Skype, would be good to have a chat sometime if you’re free :)

PeP - April 28, 2014

Hi Martin,

I’m very intrigued by your expected goals model and I’m very impressed with the accuracy. Would it be possible to include the Z-axis into your model or is the lack of data holding you back on this?

Martin Eastwood - April 28, 2014

By z axis do you mean location in the goal? If so, then there is no reason it couldn’t be incorporated into the model, I just don’t have the data available yet.

PeP - April 28, 2014

I meant at what height the ball is struck from off the pitch. For example if the xy coordinates were kept the same then was the ball struck from off the ground, was the ball struck on the volley or was it an overhead kick.

Benjamin Lindqvist - May 5, 2014

I highly doubt that would be a convex relationship so that would be hard to fit into this particular model.

Max - May 20, 2014

This is amazing… How did you get the power curve? Did you find it by trial and error?

Martin Eastwood - May 20, 2014

No, it was created using mathematical optimisation techniques rather than trial and error as they can do a better job than me!

EV - July 21, 2014

This is an amazing piece of work Martin. And thank you very much for sharing it with us. The effort of gathering the data must have been enormous. I have some questions that will probably be interesting for you:

I assume you group the shot distances into 1m intervals, and got the probability of a goal inside each interval as number of goals / number of shots inside that interval? Then you used some max likelihood to fit the curve? However, the number of shots inside each distance interval is different, I’m guessing there were far more shots around 12m than shots around 3m, and probably no shots at 0m, but you would fit a point at (0m,P(goal)=1) for common sense anyway. Does this introduce a bias where some shots are given more weight than others in fitting the curve? So when predicting the total number of goals for a season, this model will be predicting significantly under the actual number of goals? (Because decay constant is too fast due to the heavier weight given to shorter distances)

Is it possible to fit a curve that gives equal weight to each shot? I imagine such a curve is likely to significantly under estimate the probability of scoring from short distances, but will predict season totals more accurately. But the real question is, which curve would predict individual teams’ or even individual matches’ goals more accurately?

Martin Eastwood - July 21, 2014

Hi EV,

Yes the shots are binned by distance so there will be different numbers of shots per bin. One way to investigate whether this causes any biases could be to bin by percentiles instead to normalize the bin sizes. Hopefully the fit of the curve will be stable enough that it wouldn’t really affect the results too much but there is only one way to find out…

Expected Goals: The Y Axis

Martin Eastwood — Wed, 16 Apr 2014 19:30:00 +0100

Introduction

Expected goals are one of the hot topics in the football analytics community at the moment and I’ve previously written a number of articles discussing how to calculate them. If you haven’t read those pieces yet it’s probably worth taking a quick look to set the context for the rest of this article.

The Story So Far

A few weeks back I published a simple equation for calculating expected goals that received a lot of positive feedback from readers as it was easy to use and was pretty accurate based on its r squared value of 0.86. This effectively means the equation is capable of explaining 86% of the variance in the shots data I have collected from Squawka.

For such a basic equation this is a really good result. I’d purposely tried to keep things simple so that the equation was easy enough for non-mathematicians to use in order to try and encourage its adoption by other people. Rather than keep these sort of things to myself I’d much rather share them around and see them get used elsewhere.

One of the restrictions I’d set myself for this was to only use the distance the player shooting was from the goal along the X axis so that the equation only needed data along one dimension. However, I received a lot of messages through Twitter and on the blog asking about the Y axis so let’s take a look…

The Y Axis

So the first question to ask was whether the Y axis was even worth bothering with, after all the r squared value when just using distance along the X axis was already 0.86 which only left around 14% of the variance in the data to account for.

Well, it turns out that how far away you are from the goal along the Y axis does have an impact (Figure 1). Unsurprisingly the further away you are then the less likely you are to score. Before you ask, the r squared value is 0.88 (I have learnt now to include r squared values for pretty much all charts otherwise I get bombarded by requests for them :-)).

Figure 1: Shots Versus Distance From Goal Along Y Axis

Adding The Y Axis Into The Equation

Okay, we know the Y axis has an effect on expected goals but how do we factor this into my previous equation? There are a number of mathematical techniques we can use to solve for multiple dimensions. However, I am keen to try and make this as simple as possible so that the lay-person can use it so let’s keep it basic and go with Pythagoras’ Theorem, a topic most people have touched on at High School at some point.

If we know the xy coordinates of the player taking the shot and the xy coordinates of the goal then using Pythagoras’ Theorem we can calculate the total distance between the two points. Figure 2 shows the equation for this where dx is the distance between the two x coordinates, dy is the distance between the two y coordinates and AB is the total distance the player is from the goal.

$AB=\sqrt{dx^2+dy^2}$

Figure 2: Calculating the distance between two points
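In code the calculation is only a couple of lines. The Python sketch below is purely illustrative: the function name is made up, and the coordinate convention (metres, with the goal at the origin) is an assumption rather than anything defined by the data source.

```python
import math


def distance_from_goal(shot_xy, goal_xy):
    """Total distance AB between the shot location and the goal."""
    dx = shot_xy[0] - goal_xy[0]
    dy = shot_xy[1] - goal_xy[1]
    return math.sqrt(dx ** 2 + dy ** 2)


# Hypothetical shot: 12 m out from the goal line and 5 m to one side
print(distance_from_goal((12.0, 5.0), (0.0, 0.0)))  # 13.0
```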

I did this for all 17,000 shots I have collected so far from Squawka (excluding penalties) to get their total distances from goal and calculated the probability of scoring from different distances based on the number of shots taken versus goals scored (Figure 3).

Figure 3: Shots Versus Total Distance From Goal

As before, I’m using a power curve to fit the line through the data and as you can see it’s a pretty good fit. So what is the effect of adding in the Y axis? Well the r squared value has changed from 0.86 to…

drumroll

0.95

Yep, including both the x and y axis into the expected goals model accounts for 95% of the variance in the data. This barely leaves any room for the shooting player’s talent to have any effect or even for defensive pressure to play a part.

At first I thought this seemed a bit odd but thinking about it in more detail it actually seems logical. It doesn’t make much difference whether you are shooting from five metres out against a strong defence or a weak one, you still have the same chance of scoring from that particular position.

However, playing against a strong defence will likely mean you will get into that good position less often so your overall expected goals will be lower. Conversely, better players will be able to get into those good positions more often than weaker players so their overall expected goals will be higher.

In other words, at the individual shot level expected goals seems to be all about a player’s position in respect to the goal when they shoot. Other factors, such as player talent, defensive pressure etc are probably not visible until you start looking at larger samples, such as expected goals per fixture or even per season.

Anyway, here’s the final equation:

$ExpG=Distance^{-1.33796}*10^{0.4720605}$
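Here is a minimal Python sketch of the equation applied to a handful of shots. The shot distances are hypothetical, and adding the individual values together to get a match total follows the approach described in the comments below.

```python
def exp_g(distance):
    """Expected goals from total distance to goal in metres (power curve)."""
    return (distance ** -1.33796) * (10 ** 0.4720605)


# Hypothetical total distances (metres) for one team's shots in a single match
shot_distances = [6.0, 11.5, 18.0, 24.0, 9.0]

for d in shot_distances:
    print(f"{d:5.1f} m -> {exp_g(d):.3f}")

# Sum the per-shot values to get the total for the fixture
match_exp_g = sum(exp_g(d) for d in shot_distances)
print(f"Expected goals for the match: {match_exp_g:.2f}")
```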

Let me know what you think!

Comments

Lorenzo - April 17, 2014

How to collect data from Squawka?

Did you write a simple scraper or is there some public API?

Martin Eastwood - April 25, 2014

There is no API available so I collected the data myself from their site

Jonas - April 17, 2014

I am not sure if I agree with your interpretation that there is little room for player talent. If I have understood what you have done correctly, you are aggregating the shot data to get frequencies (or probabilities if you want) for each 1 meter interval. You are in other words looking across all teams, both the good ones and the bad ones. In your previous post you can clearly see that some of the good teams have rather large positive residuals. These residuals I think are interesting, as they show how good a team is controlled for where they shoot from.

My guess is that if you look at the instances where a shot from 20 meters or more yielded a goal, they are not going to be evenly distributed across all teams, but the good teams are going to be over represented.

Martin Eastwood - April 17, 2014

It’s a really good point and one that I agree with – I’ll take a look at this in more detail in a future post.

My point though is that at the point a shot is taken the most overwhelming factor in whether it results in a goal seems to be the position the shot is taken from. If talent was a major factor at this stage then I would expect more variability in the data and a much poorer fit in the graph.

Where I think talent will play a major role is the positions players get into to take those shots. I would expect better players to be shooting more frequently from better positions and good defences to concede fewer shots from those good positions.

I’ll hopefully take a look at all this in the coming weeks to see whether my hypotheses hold up.

Max - April 17, 2014

What happens when you take out headers?

Martin Eastwood - April 25, 2014

Will be looking into that in more detail soon

John - April 18, 2014

Well, I have the equation, but how can I use it?

I can know the probability of a goal from a certain distance, but how can I get the total expectation?

Martin Eastwood - April 18, 2014

If, for example, a particular shot had a probability of 0.5 then the shot would go in once every two shots so is worth half a goal

John - April 18, 2014

Thanks for the quick reply.

Ok, but during a football game I can shoot from everywhere, so how can I calculate the total probability during that game and how can this be calculated in relation of the teams?

How can I find values like those in your app?

Martin Eastwood - April 18, 2014

Just sum up the expectancies from the shots to get the total per fixture.

These values aren’t in the app yet though, maybe next season…

John - April 18, 2014

Oh right, and how can you calculate the fixtures at the moment?

OI - April 18, 2014

What I like best about your metric is that it is a continuous one. The expected goal algorithms I have known up to now use zonation, which is discrete. I’ve always thought that this is the need for improvement, and if I had had the data, I might have examined the concrete influence of shooting distance on my own. But now you did it, laudably.

If I were you, having the necessary data, I would be totally curious about “x-/y-axis distance”. According to your last posts, the y-axis distance explains a higher percentage of shot conversion than the x-axis distance (0,88>0,86). How does it affect angles? E.g., imagine the triangle with the dx, dy as the legs and AB as the hypotenuse.

The angle next to the goal could be a quite good measurement for a shot’s “centrality”, or concretely its sine: If a shot is taken straight in front of the goal, this angle is 90° (and the triangle doesn’t exist anymore). If you walk some metres to the left or to the right, the angle declines. This fits the sine that has its peak also at 90°.

I can imagine dividing the distance AB by the sine of this angle: If a shot is taken from a central position, “AB adjusted” doesn’t increase a lot (sin90°= 1). The effects aren’t too big as long as the angle isn’t too small. That’s because of the sine curve that doesn’t slope extremely before and after 90° (e.g. sin60°= 0,87). I think this suits well to the probability of scoring: It is not a big disadvantage to shoot from slightly lateral positions, but it becomes one if you shoot from too far from centre.

It’s just one proposal to investigate out of thousands that could be done, but that one I find most interesting. Another one would be to consider the defensive side of your ExpG-metric either. But I’m sure you have developed some good plans on your own!

Martin Eastwood - April 18, 2014

Thanks, making it continuous was important to me as I also dislike the discrete approach of using zones. The angle is something I’ve been thinking about and I think it should be one of the next steps to factor in to the model. Distance is important but I agree that the angle of the shot must play a key role too. It’s definitely high up the todo list!

Tom Green - July 10, 2014

Hi Martin

Really interesting stuff. I’m still new to expected goals models and the X,Y stuff. If a shot is taken from 20 metres out, in the centre of the pitch, what would its co-ordinate be?

Thanks

Tom

English Premier League Pythagorean Update

Martin Eastwood — Fri, 04 Apr 2014 19:30:00 +0100

Introduction

I’ve not posted an update on the Pythagorean for the English Premier League (EPL) for a while so the latest figures are below.

Football Pythagorean

In case you haven’t seen it before, my football Pythagorean is an adaptation of the baseball pythagorean that allows you to quickly estimate how many points a team would be expected to achieve on average based on the number of goals they have scored and conceded. It’s a pretty simple little equation but it is surprisingly accurate.

Take a look at my previous blog posts here about it if you want to find out more about the theory behind it, how it was tested and what the equation itself actually looks like.

The Season So Far

Figure One below shows the difference between the actual points each Premier League team has achieved this season and how many my Pythagorean predicts they should have on average. For teams in green the difference is positive meaning they have more points than expected while those teams in red have fewer points than expected based on the number of goals they have scored and conceded.

Figure One: EPL Pythagorean Results So Far

Once again Tottenham are way ahead of where they would be expected to be, with an astonishing 15 points extra. Either Spurs are doing something fantastically efficient this season or they are extremely lucky to be where they are in the league. Take those 15 points away and they drop down to 10th place just ahead of Stoke. This season has been a bit of a write off for Spurs compared with pre-season expectations but it could / should have been so much worse based on their Pythagorean.

Down at the other end of the table Hull should probably be feeling quite pleased with themselves as they are looking a pretty safe bet to avoid relegation even with their Pythagorean of -5.

Poor Swansea though have the lowest Pythagorean in the league. On average teams with their goal record would expect to have achieved roughly nine more points than their current total. In fact if Swansea and Tottenham both had the average points their goals suggest then the Swans would actually be the higher placed of the two teams!

Let’s see what happens if / when regression towards the mean starts to kick in…

Comments

Nick - April 16, 2014

Great blog and the EI index is very interesting. It seems that teams that overachieve according to their EI index are those that have suffered heavy defeats with a big goal margin (Tottenham losing 4-0, 4-0 and 6-0 to the top 3, Arsenal losing to them 6-3, 6-0 and 5-1; Norwich 7-0 and 5-1). Obviously, their expected points are negatively affected by these few heavy defeats. Wouldn’t it make sense to cap the margin of victory/defeat to 3 goals and re-calculate the EI?

Martin Eastwood - April 16, 2014

Perhaps. I did reduce the value of each goal in the original equation but there may well be a better way of doing it than the polynomial function I used.

Expected Goals Updated

Martin Eastwood — Sat, 01 Mar 2014 19:30:00 +0000

Introduction

When I introduced my Expected Goals model a few weeks back a number of people commented on the bump in the curve where I had included penalty shots in the data set used to fit the model. The reason I’d originally left penalties in was I felt their number was too few to have an impact on the fit of the model and at the time I hadn’t actually tracked which shots were and were not from penalties.

Since that decision seemed to cause quite a kerfuffle I have since gone back to the raw data, removed all the penalties and refitted the curve. While I was at it I also added in more shots I had collected and rescaled all the co-ordinates to use a larger pitch (105 x 68m) as Claus Moeller had suggested my estimate of Premier League pitch size was too small.

As expected, the difference in the fit of the curve is very small (Figure 1) but it has pushed the r squared value up to 0.86 from 0.84, meaning that 86% of the variance in goal scoring is due to the distance from the goal the shot is taken from and just 14% is due to other reasons, such as player talent, defensive pressure, goalkeeper etc.

Figure 1: Shots Versus Distance From Goal

The equation for expected goals is now updated to -1.014718 for the coefficient and 0.05082859 for the intercept so for my previous example a shot from 8 metres gives:

$8^{-1.014718}*10^{0.05082859}=0.1362846$ expected goals

Comments

Actual Goals Versus Expected Goals

Martin Eastwood — Sat, 15 Feb 2014 19:30:00 +0000

Introduction

Since my last article about how to calculate expected goals one question has come up more than any other and that is about the correlation between expected goals and actual goals so here you go:

Figure 1: Expected Goals For Versus Actual Goals For

Figure 2: Expected Goals Away Versus Actual Goals Away

The correlations look pretty good, 0.86 for goals for and 0.72 for goals away. I’m not sure yet why the correlations differ slightly for home / away and whether it means anything or is just down to noise in the data but I’ll keep an eye on that as I collect more shots over the course of the season.

Another question that popped up a few times was whether my expected goals correlated with actual goals better than Total Shot Ratio (TSR) does and the answer is yes it does.

This is to be expected really as expected goals account for shot location while TSR considers all shots to be equal when clearly they are not – a shot from one metre out is vastly more likely to lead to a goal than a shot from 20 metres out.

Figure 3: TSR Versus Actual Goals For

Figure 4: TSR Versus Actual Goals Away

There is still a heap of work to do to improve / optimise / characterise the expected goals model further but it is a promising start for it so far. I’ll post more updates as I progress with the model’s development over the coming weeks.

Comments

Expected Goals For All

Martin Eastwood — Wed, 12 Feb 2014 19:30:00 +0000

Introduction

It seems that everybody has their own expected goals models for football nowadays but they all seem to be top secret and all appear to give different results so I thought I’d post a quick example of one technique here to try and stimulate a bit of chat about the best way to model them.

The Data

Over the past few weeks I have tediously collected several thousand xy co-ordinates for shot locations from Squawka and converted them into approximate distances from goal in metres, assuming that an average football pitch is 100m x 65m.

Goals Versus Distance

Figure 1 below shows the relationship between the probability of scoring a goal and how far away from the goal line the shot is taken from.

Figure 1: Shots Versus Distance From Goal

There seems to be a little bit of noise in the data, particularly around the 12-13m mark but overall I was pleasantly surprised how neat the data looks – there seems to be a pretty clear non-linear relationship between the likelihood of scoring and how far away from the goal the shot is taken from.

So how do we model this relationship? Obviously we cannot just stick a linear regression through the graph as the relationship is clearly not linear so one possibility is to use a polynomial instead of a straight line (Figure 2).

Figure 2: Fitting a Polynomial

Unfortunately, this does not give particularly good results as low order polynomials (the orange line) do not fit tightly enough to the non-linearity in the relationship while higher-order polynomials (the red line) start to fit to the noise in the data leading to problems with over-fitting.

So what do we do now? Well, looking closer the shape of the curve appears exponential so one option is to fit a Power function to it. We can do this pretty easily by taking the log of the data, fitting a linear regression against it and plotting this against our non-logged data (Figure 3).

Figure 3: Power Curve
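The log-then-regress trick can be sketched in a few lines of Python. The real shot data isn't reproduced in this post, so the points below are synthetic: they are generated from the coefficient and intercept published further down, which means the regression should simply hand the same two numbers back. Treat it as an illustration of the technique rather than the actual fitting code.

```python
import math

# Assumption: synthetic points generated from the published power curve
# stand in for the real binned shot data.
distances = list(range(1, 31))
probabilities = [d ** -1.036884 * 10 ** 0.05950286 for d in distances]

# Take the log of both axes so the power curve becomes a straight line
log_d = [math.log10(d) for d in distances]
log_p = [math.log10(p) for p in probabilities]

# Ordinary least squares on the logged data
n = len(log_d)
mean_x = sum(log_d) / n
mean_y = sum(log_p) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_d, log_p))
sxx = sum((x - mean_x) ** 2 for x in log_d)
coefficient = sxy / sxx
intercept = mean_y - coefficient * mean_x

print(f"coefficient={coefficient:.6f}, intercept={intercept:.8f}")


def power_curve(distance):
    """Map the fitted line back onto the original, non-logged scale."""
    return distance ** coefficient * 10 ** intercept
```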

This gives an extremely good fit with the data and seems a plausible choice. We know goal scoring is Poisson distributed so it would seem natural to fit expected goals using an exponential shaped curve since Poisson and exponential distributions are inherently linked – the exponential distribution in fact describes the time taken between individual events occurring in a Poisson process.

If we calculate the r squared value for the fit of the Power curve then we get a value of 0.84, meaning 84% of the variance in goal scoring can be attributed to how far away the player taking the shot is from the goal. This is pretty impressive as it leaves just 16% attributed to other reasons, such as the angle of the shot, goalkeeper positioning, defensive pressure, the shooting player’s talent etc.

Before you ask, over the coming weeks I’ll be looking at whether adding these additional factors into the model can improve it or whether the added complexity is not worth it just to chase the remaining 16%.

Using the Expected Goals Model

But how do we use the model? Although everybody else’s models seem to be top secret I’m going to give mine away. The coefficient for the regression is $-1.036884$ and the intercept is $0.05950286$.

To put this into action all you need to do is raise the distance away from the goal in metres to the power of the coefficient and multiply by 10 to the power of the intercept. For example, a shot from 8 metres gives:

$8^{-1.036884} * 10^{0.05950286} = 0.132771$ expected goals
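If you would rather let a computer do the arithmetic, here is a minimal Python sketch of the same calculation. The function and parameter names are just for illustration; the default values are the coefficient and intercept given above.

```python
def expected_goals(distance_m, coefficient=-1.036884, intercept=0.05950286):
    """Raise the distance to the coefficient, then multiply by 10 to the intercept."""
    return distance_m ** coefficient * 10 ** intercept


# The worked example above: a shot from 8 metres
print(f"{expected_goals(8):.6f} expected goals")
```

Keeping the two numbers as parameters means the function still works if the coefficient and intercept are refitted later on.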

So how about we give it a proper test and try it out on this season’s English Premier League to date? The results are shown in Table 1 and overall give a root mean square error of 8.2 goals, which seems a pretty reasonable starting point for developing the model further from.

Rank	Team	Goals	expG	Residual
1	Man City	68.00	46.90	21.10
2	Liverpool	63.00	42.56	20.44
3	Arsenal	48.00	35.26	12.74
4	Chelsea	48.00	42.56	5.44
5	Man Utd	41.00	35.26	5.74
6	Southampton	37.00	29.05	7.95
7	Everton	37.00	33.31	3.69
8	Newcastle	32.00	30.32	1.68
9	Swansea	32.00	26.89	5.11
10	Tottenham	32.00	31.21	0.79
11	WBA	30.00	31.17	-1.17
12	West Ham	28.00	26.69	1.31
13	Aston Villa	27.00	24.20	2.80
14	Stoke	26.00	25.29	0.71
15	Sunderland	25.00	25.63	-0.63
16	Hull	25.00	23.95	1.05
17	Fulham	24.00	24.39	-0.39
18	Cardiff	19.00	24.67	-5.67
19	Norwich	19.00	27.61	-8.61
20	Crystal Palace	18.00	25.03	-7.03

Table 1: Expected Goals For The English Premier League To Date
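The root mean square error can be checked directly from Table 1. The Python sketch below copies the goals and expected goals from the table, recomputes the residuals and takes the square root of their mean square; the result comes out close to the 8.2 goals quoted above.

```python
import math

# (team, actual goals, expected goals) copied from Table 1
table_1 = [
    ("Man City", 68, 46.90), ("Liverpool", 63, 42.56),
    ("Arsenal", 48, 35.26), ("Chelsea", 48, 42.56),
    ("Man Utd", 41, 35.26), ("Southampton", 37, 29.05),
    ("Everton", 37, 33.31), ("Newcastle", 32, 30.32),
    ("Swansea", 32, 26.89), ("Tottenham", 32, 31.21),
    ("WBA", 30, 31.17), ("West Ham", 28, 26.69),
    ("Aston Villa", 27, 24.20), ("Stoke", 26, 25.29),
    ("Sunderland", 25, 25.63), ("Hull", 25, 23.95),
    ("Fulham", 24, 24.39), ("Cardiff", 19, 24.67),
    ("Norwich", 19, 27.61), ("Crystal Palace", 18, 25.03),
]

# Residual = actual goals minus expected goals
residuals = [goals - exp_g for _, goals, exp_g in table_1]

# Root mean square error across all 20 teams
rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))
print(f"RMSE: {rmse:.2f} goals")
```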

You can also see a pretty clear pattern in that the teams at the top of the league have generally over-performed the goal expectancy while those towards the bottom end have under-performed it. This would seem reasonable as we are predicting average goal expectancy and the top teams are obviously above average so should perhaps do better with their chances, while the lower teams are below average so would be expected to perform worse?

What Next?

I’m not claiming this to be the only way of calculating expected goals, or even the best way, but hopefully it will encourage more discussion of how to calculate expected goals rather than a lot of secret black boxes all giving different results.

I hope to write more about expected goals over the coming weeks in order to test this equation to see how well it really works, to hopefully improve it further and to try and understand what the metric can and cannot tell us.

In the meantime, feel free to use my equation to calculate expected goals. All I ask is that you don’t try to pass the equation off as your own (you know who you are!!) and that if you use it you acknowledge me and link back to my site.

Be warned though it’s a work in progress so is subject to change as and when I improve things…

Enjoy!

Comments

Christopher Hoeger - February 13, 2014

Hey, I’m a chemical engineer, so while my knowledge of statistics is limited to its application in my field, I would really like to get into football analytics, and I’d love to contribute, if possible, to your expected goals model. Could you tell me where the best publicly available data is, and, more importantly, the best way to access it? Thank you, Chris

Martin Eastwood - February 13, 2014

Hi Chris,

I took all the data for constructing the model from Squawka. It’s a tedious process but with a bit of patience you can transcribe approximate xy coordinates for shots from their site.

Ali - March 19, 2014

Would love to know the process behind transcribing the shots into x,y coordinates if you have the time..

Claus Moeller - February 16, 2014

Interesting model.

I think the average of the pitches in Premier League, is way bigger than 100m x 65m.

Most of the pitches in Premier League is 105m x 68m – http://www.openplay.co.uk/blog/premiership-football-pitch-sizes-2013-2014/

Just for my curiosity, do you rely on all the data from Squawka?

Martin Eastwood - February 16, 2014

Thanks for the link, not seen that one before. Should be fairly trivial to rescale should people wish to. Yes I used Squawka to get all the xy coordinates.

Hugo Varandas - February 19, 2014

Hi, nice work. I find this conclusion very interesting “This is pretty impressive as it leaves just 16% attributed to other reasons”. Maybe this is the reason why the better teams outperform the prediction.

I just do not understand how does this help to predict the expected goals in a particular future game.

Martin Eastwood - February 19, 2014

Yes, presumably somewhere in that 16% is player talent, defensive pressure, goalkeeper skill etc. So far my equation is more explanatory than predictive. More work would be needed to produce a predictive model from shots.

Justin - February 20, 2014

I’m sure the spike in probability at the 12m mark is the result of penalties. Analyzing only goals from open play would increase the R^2 I would suspect.

Martin Eastwood - February 20, 2014

Yes, when I next get some free time I’ll look at removing them, although the fit of the curve is so good it will probably have minimal effect.

Antony - March 15, 2014

Do you know whether the decay curve exponent is replicable across seasons?

Also do you know how much difference angle of the shot makes and if other “black box”-guys additionally use this? The fit by averaging across angles is good, but wide players may have a detriment when cast against a benchmark without it – or maybe the difference is small.

Well done for compiling the data from squawka – when I had a quick search the only way to get x-y’s seemed to be by eye and a ruler from their pitch plots!!

Martin Eastwood - March 17, 2014

I’ve not looked season-by-season yet but the decay curve is created from multiple seasons’ worth of data aggregated together.

The angle presumably makes some difference but considering how high the r-squared is it must only account for a relatively small proportion of expected goals although I will look into this in more detail as soon as I get some free time.

Thanks!

Benjamin Lindqvist - April 18, 2014

Hey, just thought I’d cryptically point out there’s a better function to fit to the data than that.

Martin Eastwood - April 18, 2014

Ooh you can’t just leave it at that, tell me more :-)

Benjamin Lindqvist - April 18, 2014

Well if I’m not mistaken, the probability of scoring from 0 yards is, according to your model, more than 100%. Not even Ronaldo will score more than 100% of the time, not even from the goal line :D

Why don’t you try a decaying exponential instead? That has all the properties you’re looking for, but it will also be well behaved at x=0.

Martin Eastwood - April 18, 2014

Yes the 0 yards issue is a concern. I thought about forcing the curve through 1 but dislike taking that sort of brute force approach. The decaying exponential sounds a really good idea. I’ll look into that, thanks!

Benjamin Lindqvist - April 18, 2014

I.e.: http://fooplot.com/#W3sidHlwZSI6MCwiZXEiOiIxL3giLCJjb2xvciI6IiMwNEZGMDAifSx7InR5cGUiOjAsImVxIjoiZV4oLXgpIiwiY29sb3IiOiIjRkYwMDAwIn0seyJ0eXBlIjoxMDAwfV0-

Anonymous - April 22, 2014

Hey Martin,

I had a very noobish doubt. Please don’t mind it. Can you explain how you calculated the probability of scoring (Y axis) to arrive at the scatter plot in Figure 1?

Thanks in advance. :)

Anonymous - April 22, 2014

Poisson distribution of course. My bad!

Anonymous - April 22, 2014

If you could run me through the process or maybe send me some useful links, it would be greatly appreciated!

Martin Eastwood - April 22, 2014

I split the shots into different bins by distance and then calculated the probability per bin. I then used this set of aggregated probabilities to construct the model and fit the curve.
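As a rough illustration of the binning approach described in this reply, the sketch below groups shots into distance bins, works out the proportion scored in each bin, and fits a straight line to the base-10 logs of the binned values. The bin width, the least-squares fit on log values and the sample shots are all assumptions made for the example, not details confirmed in the post.

```python
import math
from collections import defaultdict


def bin_probabilities(shots, bin_width=1.0):
    """shots: iterable of (distance_m, scored) pairs -> {bin centre: P(goal)}."""
    totals = defaultdict(lambda: [0, 0])  # bin index -> [goals, shots]
    for distance, scored in shots:
        idx = int(distance // bin_width)
        totals[idx][0] += int(scored)
        totals[idx][1] += 1
    return {(idx + 0.5) * bin_width: g / n for idx, (g, n) in sorted(totals.items())}


def fit_power_law(points):
    """Least-squares line through (log10 x, log10 y); returns (coefficient, intercept).

    Points with a zero probability are skipped because their log is undefined.
    """
    pts = [(math.log10(x), math.log10(y)) for x, y in points if x > 0 and y > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up shots: (distance in metres, whether it was scored).
shots = [(7.2, True), (7.8, False), (12.1, False), (12.6, True), (12.9, False)]
binned = bin_probabilities(shots)
print(binned)
print(fit_power_law(binned.items()))
```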

Anonymous - April 22, 2014

Thanks a lot! :)

I need to brush up on my stats concepts.

abhinav - July 26, 2014

how are you gathering xy coordinates from squawka? or are you using some other metric to approximate?

Martin Eastwood - July 26, 2014

It took quite a bit of effort!

Comparing Players Using Cluster Analysis

Martin Eastwood — Mon, 10 Feb 2014 19:30:00 +0000

Introduction

As there were a couple of presentations at the recent Opta Pro Forum talking about identifying player similarities I thought I’d give a quick example of how to do something similar using k-means cluster analysis.

The Data

All the data used in the analysis was taken from public websites, such as whoscored, squawka, transfermarkt etc and painstakingly matched together to try and get as much information on each player as possible.

The first stage of analysis was to normalize the data so it was all in the same range to avoid biasing the clustering. If you think about how many goals a typical player scores per match compared with how many passes they play then the scale is quite different. Since k-means clustering uses Euclidean Distance the clusters formed are influenced strongly by the magnitudes of the variables, especially by outliers. By normalizing all data into the same range this bias can be avoided.
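To make the scale problem concrete, here is a minimal sketch of one common way to put every variable into the same range. Min-max scaling to the range 0 to 1 is an assumption here, since the exact normalization used is not stated, and the player numbers are made up.

```python
def min_max_scale(column):
    """Rescale a list of numbers so the smallest becomes 0 and the largest 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it all to zero.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


# Made-up per-match figures for three players.
goals_per_match = [0.1, 0.4, 0.9]
passes_per_match = [20.0, 45.0, 70.0]

# Before scaling, passes dominate any Euclidean distance between players;
# after scaling, both variables lie between 0 and 1.
print(min_max_scale(goals_per_match))
print(min_max_scale(passes_per_match))
```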

Principal Component Analysis

As well as normalizing the data, I performed Principal Component Analysis (PCA) on it. This step isn’t essential but it is a handy way of reducing the dimensions in the data down to a more manageable size by squashing all the data together into new variables known as principal components.

These principal components are created in such a way that the first one accounts for as much of the variance in the data as possible, the second one accounts for as much of the remaining variance as possible, and so on.

As you can see in Figure 1 below, the first component represents roughly 70% of all the variance in the data, with each additional component accounting for less and less. This means we can represent pretty much all the information in the data using just five components, and around 80% of it using just two.

Figure 1: PCA scree plot showing amount of variance accounted for by each principal component
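The numbers behind a scree plot like this can be sketched with NumPy. The example below runs PCA via a singular value decomposition of the centred data and reports the share of variance captured by each component. The random array is a stand-in for the real player data, which is not available here, so the printed percentages will not match the figure.

```python
import numpy as np


def pca(data, n_components=2):
    """Return (scores, explained_variance_ratio) for a samples x features array."""
    data = np.asarray(data, dtype=float)
    centred = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    scores = centred @ vt[:n_components].T
    return scores, explained


rng = np.random.default_rng(0)
players = rng.random((50, 6))  # made-up stand-in: 50 players, 6 normalized stats

scores, explained = pca(players)
print(explained)             # share of variance per component (the scree plot)
print(np.cumsum(explained))  # running total, e.g. how much two components capture
```

The first two columns of `scores` are the kind of coordinates that can then be plotted and clustered.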

Clustering The Players

The next step was to then run the k-means clustering algorithm on the data. As shown in Figure 2 the players split relatively neatly into five distinct coloured clusters when plotted by the first two principal components.

Figure 2: Players split into different clusters by colour
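For readers who have not met k-means before, here is a small pure-Python sketch of the standard algorithm (often called Lloyd's algorithm): assign each point to its nearest centre, move each centre to the mean of its points, and repeat until nothing changes. The points are made-up 2D coordinates standing in for the first two principal components; this is not the code used to produce the figure.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm on a list of tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
    return centroids, labels


# Two obviously separate groups of made-up points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Because the distance is Euclidean, variables on larger scales would dominate the assignment step, which is why the data are normalized first.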

Goalkeepers

As a quick test we can look at the grey cluster located at the bottom of the image in more detail to see which players are contained within it (Figure 3). If you click the image to zoom in on it you can see it’s done a pretty good job of pulling out the goalkeepers from the rest of the players. This is to be expected since goalkeepers’ stats should be pretty distinct from outfield players’ but it’s reassuring to check the technique passes this first simple test before we move on.

Figure 3: The grey cluster up close

Vincent Kompany

Now that we have separated out the goalkeepers we can take a look at how well the technique copes with outfield players, starting with Manchester City’s central defender Vincent Kompany located at the centre of Figure 4. The results are pretty good, with Kompany surrounded by players predominantly considered to be defenders. As you move up the image the players start to get a bit more attacking with people like David Luiz, Phil Jones and Fabian Delph starting to appear.

Figure 4: Clustering of Vincent Kompany

Adnan Januzaj

Next up is Adnan Januzaj, one of the few Manchester United players to be having anything resembling a decent season this year. Again the results look pretty plausible (Figure 5), with Januzaj surrounded by predominantly attacking midfielders. There are a couple of slightly surprising results in there though, such as Manchester City’s strikers Álvaro Negredo and Edin Džeko.

Figure 5: Clustering of Adnan Januzaj

Mikel Arteta

Finally, I added in Arsenal’s midfielder Mikel Arteta (Figure 6). This one was probably the most surprising of all the players I’ve looked at as there seems to be quite a mix of players around Arteta, including both offensive and defensive players, although perhaps this is actually representative of Arteta’s role at Arsenal?

Figure 6: Clustering Mikel Arteta

Next Steps

For a first go the results are pretty promising but there are plenty of ways the technique could be improved. At the moment I have used all the data I had available for each player but I suspect more specific results could be obtained by filtering the data.

For example, there may be specific attributes of a player you want to match on e.g. looking for attackers by just their creative output may be more useful than including their tackles, interceptions etc, which may be of minor importance to their role.

Finally, all the data used here are aggregated. A really interesting next step would be to include xy co-ordinates for shot locations, interceptions, passes etc to cluster players based on the locations of their actions on the pitch (donations of xy data will be gratefully accepted :)).

Comments

Filipe Rodrigues - February 20, 2014

Hello Martin, great job once again.

Can you tell me what data was used to complete this post?

Martin Eastwood - February 20, 2014

It is collected from all over the Internet and then algorithmically matched together. Sadly it’s not an easy job to acquire.

Dank - June 29, 2014

Hello Martin, i want to know which the values of the axes, can you explain it me?

Martin Eastwood - June 29, 2014

Hi Dank, the axes represent the first and second principal components. I don’t have time to explain it all here at the moment so in the meantime it’s probably worth taking a look at the Wikipedia entry as a starting point – http://en.wikipedia.org/wiki/Principal_component_analysis

Antony - March 12, 2014

Hi Martin – really interesting analysis which shows what you can do with a pure top down data driven approach.

I did a similar analysis with the MCFC/Opta data last year but instead of using PCA I first just considered midfielders and then qualitatively chose the attacking and defensive attributes I wanted to determine out/underperformance in versus the sample (by scaling per 90 and normalising by mean and variance). By adding up defensive and offensive (rather than projecting onto eigenvectors) a 2D plot revealed many qualitative knowns and some surprises. The population nicely clustered into the enforcers, schemers, creators and luxuries respectively and then PCA helped to characterise the variance within each cluster. Would be interesting to compile these types of systems for many seasons and see what the principal dynamical modes are, like improving within a cluster or moving between them, e.g. the trajectory of Giggs and now Gerrard.

Interesting the different approaches we used in the ordering and application of qualitative intuition vs. quantitative rigour, which I think is the marriage that has to be progressed and its limitations understood for analytics to really take off and be adopted.

Also like the piece on ExpG….amazing fit to the decay curve!

Martin Eastwood - March 13, 2014

Hi Antony,

That sounds really cool! At some point I want to go back and try something similar with the clustering using subsets of the players’ attributes. At the moment I use all the players’ stats but it would be interesting to try just clustering players based on passing stats or defensive stats etc.

I think splitting analyses out over seasons will be a really important thing to do to assess trajectories of players’ careers and how they develop / change with age or move between clusters. Just need the data to do it :)

EPL 2013/2014: Football Pythagorean So Far

Martin Eastwood — Mon, 20 Jan 2014 19:30:00 +0000

Introduction

Welcome back! Now that I'm no longer part of Onside Analysis I'm free to start blogging again so let's start off by taking a look at how my football Pythagorean is doing for the English Premier League so far this season.

Football Pythagorean

In case you haven’t seen it before, my football Pythagorean is an adaptation of the baseball Pythagorean that allows you to quickly estimate how many points a team would be expected to achieve on average based on the number of goals they have scored and conceded. It’s a pretty simple little equation but it is surprisingly accurate!

The Season So Far

Figure One below shows the difference between the actual points each Premier League team has achieved this season and how many my Pythagorean predicts they should have on average. For teams in green the difference is positive so they actually have more points than expected, while those in red have gained fewer points than would be expected based on the number of goals they have scored and conceded.

Figure One: EPL Pythagorean Results So Far

The stand out team here is obviously Tottenham, who have somehow managed to end up with eleven points more than would be expected based on their goals. Spurs’ Pythagorean has looked pretty big for a while now so I suppose you could look at this two ways – either they have developed an extremely effective and efficient system or they have been lucky to get as many points as they have. I’ll let you decide on the answer to that one…

Interestingly, Manchester City are pretty close to their expected points total despite their enormous goal difference. One reason for this is that my football Pythagorean is not linear so as you score more goals they become less valuable to help account for high scoring matches, such as most of City’s home games this season! This helps prevent over-prediction of expected points for teams scoring heavily – having a good goal difference is obviously helpful but whether you win by one goal or five goals you still only get three points from the match.

As it stands though Manchester City are in second place behind Arsenal who have acquired six points more than expected, meaning that typically we would not expect Arsenal to be top based on their results so far this season.

How Will The Season End?

As well as looking at how teams are doing so far, we can also extrapolate the results and predict how the teams will end up at the end of the season (Table One). This is a very simplistic prediction, for example it does not take into account strength of schedules, but it is fairly accurate – the r squared value for Pythagorean predicted points versus actual points across multiple leagues worldwide was 0.938 with an average error of less than four points – so it should give a reasonable estimate of how the Premier League will finish next May.

	Team	Points
1	Manchester City	84.50
2	Arsenal	83.59
3	Chelsea	80.99
4	Liverpool	73.75
5	Everton	72.20
6	Tottenham Hotspur	65.89
7	Manchester United	62.56
8	Newcastle United	59.32
9	Southampton	54.42
10	Aston Villa	41.52
11	Hull City	40.97
12	West Bromwich Albion	40.74
13	Swansea City	39.66
14	Stoke City	36.85
15	Norwich City	35.78
16	West Ham United	33.91
17	Sunderland	32.28
18	Crystal Palace	31.30
19	Fulham	30.65
20	Cardiff City	29.29

Table One: Pythagorean Predicted Final Standings For The English Premier League 2013/2014

Comments

Rasmus Dam - January 20, 2014

Hi Martin

Great to see you posting again. It’s an interesting and easy to understand post. Furthermore your model is easy to use for other league forecasts. If interested I’ve done so for the Danish Superliga in these two posts:

http://super-analyse.blogspot.dk/2014/01/forecast-superliga-1314.html

http://super-analyse.blogspot.dk/2013/06/pythagorean-expectation-in-football.html

As a side comment: Isn’t the R^2 between EP and actual final points less than 0.938 after 22 rounds? (I have a value of 0.7 after 22 rounds calculated from last season in Denmark)

Sorry for linking, but just wanted to let you know that you inspired me to do the same to danish football so you could see it if interested :)

Martin Eastwood - January 21, 2014

Hi Rasmus

Thanks for the links, it’s great to see my Pythagorean equation getting used elsewhere :-)

Yes you are correct the r2 value I quoted was for the end of the season. Perhaps I should have made that clearer in my last post? It’s good to see that you found the same results as me and that the predicted points stabilises pretty fast in Danish football too.

UEFA Champions League – Route To The Final

Martin Eastwood — Mon, 30 Sep 2013 19:30:00 +0100

Introduction

With the UEFA Champions League group stage now underway I took a quick look at what it typically takes for teams to reach the final.

I started off by looking at how well teams from the major six domestic leagues (England, France, Italy, Spain, Germany and Portugal) performed in the UEFA Champions League based on what position they qualified in domestically (Figure 1) as this affects at what point they enter the competition…

Find out more by reading the full article on the Onside Analysis blog here.

Analysing Football Teams Using Cluster Analysis and Principal Component Analysis

Martin Eastwood — Fri, 30 Aug 2013 19:30:00 +0100

Introduction

The amount of football data available is growing rapidly – with every passing week of the season more matches are played and even more data gets collected. This is great as it allows us to increase our understanding of the game but it also means we quickly end up with more information than could ever be analysed manually.

Instead, we can use techniques such as cluster analysis and principal component analysis (PCA) to critically analyse these large sets of football data to identify important patterns and relationships that can help explain a team’s performances.

Find out more by reading the full article on the Onside Analysis blog here.

Announcement

Martin Eastwood — Wed, 19 Jun 2013 19:30:00 +0100

Introduction

You may have noticed that my blogging has slowed down over the past few weeks and the reason is that I have joined Onside Analysis as a computational statistician.

I am really excited about my new role as it means that I will be working on football analysis full time instead of trying to squeeze it in around my day job, family, sleep etc. I am not sure exactly what this means for my blog here, but the plan is that I will be contributing to the Onside Analysis blog so keep an eye on that if you’re interested in what I have been writing about so far.

I’ll also still be around on Twitter so please keep in touch :)

Betting With The Eastwood Index And Kelly Criterion

Martin Eastwood — Thu, 23 May 2013 19:30:00 +0100

Introduction

I demonstrated in my last post that the odds calculated using the Eastwood Index were slightly more accurate than the bookmakers over the course of the football season. My next goal is to work out the optimal way of using this edge to make a profit, starting off with the Kelly Criterion.

The Kelly Criterion

The first port of call for any staking plan is the Kelly Criterion, a method developed by John Larry Kelly Jr to determine the optimal bet size based on how far the odds are perceived to be in your favour.

The equation used to calculate the Kelly Criterion is shown in Figure 1 where $p$ is your expected probability of winning, $b$ is the decimal odds offered and $f$ is the Kelly Criterion, i.e. the recommended fraction of your bankroll to bet.

$f=(pb-1)/(b-1)$

Figure 1: Kelly Criterion

Let’s run through a quick example using Fulham’s last match of the season against Swansea City. Bet365 offered Fulham to win at odds of 4.75, which is equivalent to an expected win probability of around 21%, while let’s say you think the probability of Fulham winning is actually closer to 24%.

$f = (0.24 * 4.75 - 1) / (4.75 - 1)$

$f = 0.14 / 3.75$

$f = 0.0373$

$f = 3.73\%$

So according to the Kelly Criterion we should be willing to risk 3.73% of our bankroll on this bet.
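The worked example can be reproduced with a few lines of Python. This is only a sketch of the formula in Figure 1 for decimal odds; the function name is illustrative.

```python
def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Kelly stake as a fraction of bankroll: f = (p * b - 1) / (b - 1)."""
    if decimal_odds <= 1:
        raise ValueError("decimal odds must be greater than 1")
    return (p * decimal_odds - 1) / (decimal_odds - 1)


# Fulham at decimal odds of 4.75 with an estimated win probability of 24%.
implied_probability = 1 / 4.75
stake_fraction = kelly_fraction(0.24, 4.75)

print(f"bookmaker implied probability: {implied_probability:.1%}")
print(f"recommended stake: {stake_fraction:.2%} of bankroll")
```

A negative result means your estimated probability is below the one implied by the odds, so there is no value in the bet.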

Applying the Kelly Criterion to the Eastwood Index

So what is the best way of applying the Kelly Criterion to the Eastwood Index? There are numerous different strategies that could be used but to start off with I’ve gone purely with value bets.

For each match I calculated the Kelly Criterion based on the Home, Draw and Away odds from Bet365 and looked for the outcome where the recommended bet was the largest. The reason for this was that the larger the recommended bet then the greater the difference between my probabilities and the bookmaker’s odds so the greater the potential value of the bet.
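A sketch of that selection step might look like the following. The probabilities and odds are made up, and returning no bet when every outcome has a negative Kelly value is an assumption rather than something stated in the post.

```python
def best_value_bet(model_probs, bookmaker_odds):
    """Return (outcome, kelly_fraction) for the outcome with the largest Kelly stake.

    model_probs and bookmaker_odds are dicts keyed by outcome name; the odds
    are decimal odds. Returns (None, 0.0) if no outcome offers positive value
    (an assumption made for this sketch).
    """
    kelly = {
        outcome: (model_probs[outcome] * bookmaker_odds[outcome] - 1)
        / (bookmaker_odds[outcome] - 1)
        for outcome in model_probs
    }
    outcome = max(kelly, key=kelly.get)
    if kelly[outcome] <= 0:
        return None, 0.0
    return outcome, kelly[outcome]


# Hypothetical match: model probabilities versus one bookmaker's decimal odds.
probs = {"home": 0.50, "draw": 0.27, "away": 0.23}
odds = {"home": 2.10, "draw": 3.40, "away": 3.60}

pick, fraction = best_value_bet(probs, odds)
print(pick, round(fraction, 4))
```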

Figure Two shows the results over the course of the season. Starting off with a bankroll of £100 there was a slight loss over the first half of the season followed by pretty steady growth to finally finish with £114 in the bank. This gave a return on investment (ROI) of 14% for the Eastwood Index based on the 2012–2013 Premier League season.

Figure 2: Bankroll over 2012-2013 Season

Fractional Kelly

Being relatively risk averse, I’ve found that using a fractional Kelly Criterion is preferable for me. Although using the full Kelly Criterion is optimal for maximizing growth of the bankroll long term, there is more risk of being caught out by variance and an unlucky streak wiping out your bank balance.

Betting a fractional value, such as half the recommended amount, slows down growth but helps protect you from volatility. As a comparison, take a look at Figure Three where I bet a full Kelly Criterion on each match and you can see that at its peak the ROI reaches 73%, yet at the end of the season variance has pulled the bankroll down below its starting value causing a loss overall.

Figure 3: Volatility Betting The Full Kelly Criterion
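The trade-off between full and fractional Kelly can be seen with a toy example. The sketch below applies the same bet five times in a row and compares a losing streak under full and half Kelly; the probability, odds and run of results are made up for illustration and are not taken from the season's bets.

```python
def settle_bet(bankroll, p, decimal_odds, won, fraction=1.0):
    """Stake `fraction` of the Kelly amount and return the bankroll afterwards."""
    kelly = (p * decimal_odds - 1) / (decimal_odds - 1)
    stake = bankroll * max(kelly, 0.0) * fraction  # never stake on negative value
    if won:
        return bankroll + stake * (decimal_odds - 1)
    return bankroll - stake


def run(results, fraction, bankroll=100.0, p=0.24, decimal_odds=4.75):
    """Apply the same hypothetical bet to a sequence of win/loss results."""
    for won in results:
        bankroll = settle_bet(bankroll, p, decimal_odds, won, fraction)
    return bankroll


losing_streak = [False] * 5
print(round(run(losing_streak, fraction=1.0), 2))  # full Kelly
print(round(run(losing_streak, fraction=0.5), 2))  # half Kelly
```

Half Kelly loses less during the bad run, at the cost of winning less when a bet comes in.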

Early Season Dip

One intriguing aspect of Figure Two is why the bankroll grew so much more in the second half of the season compared with the first? Partly this may be due to random variance but another possibility is betting on recently promoted teams.

At the moment promoted teams take over the EI ratings of the equivalent relegated team: the team promoted as champions takes the EI rating of the team relegated third from bottom, the team promoted second takes the rating of the team relegated second from bottom, and the team promoted via the playoffs takes the rating of the team finishing bottom.

These ratings will not be exactly correct for the promoted teams but should over time move towards the right levels to reflect the teams’ performances. Looking through the history of the bets made, though, those involving a promoted team during the first half of the season lost money overall, while those not involving promoted teams made a profit.

By the end of the season I had made a profit out of the promoted teams, suggesting that the teams’ ratings had sufficiently corrected themselves. This means, though, that I could improve the ROI even further by avoiding bets on the promoted teams early on in the season or by improving the way the promoted teams’ ratings are handled, such as correcting their EI ratings faster.
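The hand-over rule can be written down as a simple mapping. In this sketch the function name, team labels and rating values are all hypothetical; only the pairing order comes from the text.

```python
def seed_promoted_ratings(relegated_ratings, promoted_teams):
    """Give each promoted team the rating of its equivalent relegated team.

    relegated_ratings: ratings of the teams that finished third from bottom,
        second from bottom and bottom, in that order.
    promoted_teams: names of the champions, the runners-up and the play-off
        winners, in that order.
    """
    if len(relegated_ratings) != 3 or len(promoted_teams) != 3:
        raise ValueError("expected three relegated ratings and three promoted teams")
    return dict(zip(promoted_teams, relegated_ratings))


# Made-up rating values purely for illustration.
seeded = seed_promoted_ratings(
    [1480.0, 1465.0, 1440.0],
    ["Champions", "Runners-up", "Play-off winners"],
)
print(seeded)
```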

Conclusions

This is only a quick look at a very simple strategy for placing bets using the Eastwood Index; there are still plenty of improvements that can be made to improve results further. Yet even with this relatively naive approach the ROI is 14%, which is much more than I would have made sticking my money into a bank account.

Applying the Eastwood Index to betting is also a great way to identify the strengths and weaknesses of the model as the ROI gives a clear indicator of what works, what doesn’t work and what can be optimized further.

Addendum

I was asked on Twitter what the ROI works out at per bet for the Eastwood Index without betting the Kelly Criterion. Using a fixed bet of £1 per bet gave an overall profit for the season of £17 over 380 matches, which works out at an ROI of around 4.5% per bet.
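The two ROI figures quoted in this post use slightly different bases, which a short sketch makes explicit: the Kelly run is measured against the starting bankroll, while the flat-stake figure is measured against the total amount staked.

```python
def roi(profit: float, amount_risked: float) -> float:
    """Return on investment as a fraction of the amount put at risk."""
    return profit / amount_risked


# Kelly staking: £100 starting bankroll finishing on £114.
kelly_roi = roi(114 - 100, 100)

# Flat staking: £17 profit from £1 on each of 380 matches.
flat_roi = roi(17, 380 * 1)

print(f"{kelly_roi:.1%}", f"{flat_roi:.1%}")
```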

Comments

James - June 1, 2013

Hey Martin.

I used your EI for betting purposes for the last few weeks of the season and obtained 21% ROI, so i must have been lucky in picking a good week to start. Have you any plans to work out the ROI for specific situations? For example just looking at when the ‘value’ selection was a home win, draw, etc. so that you can see where your model is better than the bookies and where it isn’t.

Or, do a graph with each teams ROI to see if anything sticks out.

Thanks.

Martin Eastwood - June 1, 2013

Excellent work Jamie, always good to see someone take money off the bookies :)

Good idea, I’ll try and take a look at those over the summer!

Amir - June 4, 2013

Just be careful with jumping into conclusions over a small sample…

Martin Eastwood - June 4, 2013

Agreed, one year’s worth of data is certainly not definitive for football.

Did The Eastwood Index Beat The Bookmakers?

Martin Eastwood — Tue, 21 May 2013 19:30:00 +0100

Introduction

It’s the end of the season so it’s time to review how the Eastwood Index performed over the year and how it compared with the bookmakers.

Ranked Probability Scores

One of the most important aspects to me is how accurate the forecasts were and I’ve assessed this using Ranked Probability Scores, as recommended by Constantinou and Fenton. I’ve discussed Ranked Probability Scores on the blog before but for people new to them they measure the difference between the forecasts and what really happened. Scores range between 0–1 and represent the amount of error in the predictions so lower Ranked Probability Scores are better and signify greater accuracy.
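For readers new to the metric, here is a minimal sketch of the Ranked Probability Score for a single match with three ordered outcomes (home, draw, away). It compares the cumulative forecast with the cumulative observed outcome; the example forecast is made up.

```python
def ranked_probability_score(forecast, outcome_index):
    """RPS for one match.

    forecast: probabilities for the ordered outcomes, e.g. (home, draw, away).
    outcome_index: position of the outcome that actually happened.
    """
    r = len(forecast)
    cumulative_forecast = 0.0
    cumulative_observed = 0.0
    total = 0.0
    # The last category is skipped because both cumulative sums reach 1 there.
    for i in range(r - 1):
        cumulative_forecast += forecast[i]
        cumulative_observed += 1.0 if i == outcome_index else 0.0
        total += (cumulative_forecast - cumulative_observed) ** 2
    return total / (r - 1)


forecast = (0.5, 0.3, 0.2)  # hypothetical home / draw / away probabilities
for index, name in enumerate(("home", "draw", "away")):
    print(name, round(ranked_probability_score(forecast, index), 3))
```

Because the outcomes are ordered, forecasting a home win when the match ends in a draw is penalised less than when it ends in an away win. Averaging the score over every match in a season gives a single figure that can be compared between forecasters.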

Comparison With Bookmakers

Looking at Figure 1 you can see that the Eastwood Index has consistently outperformed the bookmakers all season – and this isn’t just one bookmaker that the Eastwood Index has beaten but the combined knowledge of the industry as I’ve aggregated multiple bookmakers’ odds together and stripped out the overround to make the comparison as tough as possible.

Figure 1: Eastwood Index Vs Aggregated Bookmakers
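One simple way to turn bookmakers' odds into comparable probabilities is sketched below. Rescaling the inverse odds so they sum to one and then taking a plain average across bookmakers are both assumptions for this example; the post does not say which method was used, and the odds shown are made up.

```python
def implied_probabilities(decimal_odds):
    """Convert home/draw/away decimal odds to probabilities that sum to 1.

    The overround is removed by proportional rescaling (an assumption).
    """
    raw = [1 / odd for odd in decimal_odds]
    overround = sum(raw)  # greater than 1 when the bookmaker has a margin
    return [p / overround for p in raw]


def aggregate(bookmakers):
    """Average the normalised probabilities across several bookmakers."""
    normalised = [implied_probabilities(odds) for odds in bookmakers]
    return [sum(column) / len(normalised) for column in zip(*normalised)]


# Hypothetical home / draw / away odds from two bookmakers.
bookmakers = [(2.00, 3.40, 3.80), (2.05, 3.30, 3.75)]
consensus = aggregate(bookmakers)
print([round(p, 3) for p in consensus])
```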

Interestingly, the difference in accuracy seems to be greatest at both ends of the season. I expected the start of the season to be difficult to forecast as new teams have been promoted, players have been bought and sold, and managers may have changed clubs but the Eastwood Index seems to have coped with these variables better than the bookmakers’ odds have.

Over the course of the season the bookmakers’ forecasts improved until there was very little difference between them and the Eastwood Index but I was somewhat surprised to see how far out their accuracy drifted over the final few weeks of the season.

In theory these should be the easiest matches to forecast as we have the most information but in reality they can be tricky as teams’ motivations change. For example, Manchester United have been playing their reserve goalkeeper so he gets enough appearances to earn his winner’s medal while Swansea’s players may as well have been on holiday since they won the League Cup.

These changes seem to have thrown the bookmakers’ odds out quite noticeably while the Eastwood Index’s accuracy has remained constant. In fact, it suggests that bookmakers may be over-compensating for these apparent end-of-season effects as the Eastwood Index does not currently take them into account and has not struggled because of it.

Conclusions

Overall, I am pleased with the Eastwood Index’s debut season. I was slightly reluctant to publish the forecasts at first in case the model did not hold up but it has remained accurate throughout the year. The next stage of its development is to identify any patterns in where its forecasts differ from the bookmakers’ and how that could be combined with various staking strategies, as well as looking at expanding to cover other leagues too.

Comments

George - May 21, 2013

Cool outcome. From my cursory looks into this, it is definitely possible, it’s just a case of trying to maximise any returns by only taking picks of a certain ratio etc. I’ve looked at other sports as well and sometimes find you can get a good number on a particular team for a period of weeks etc. Out of interest how are you translating %’s into home, away or draw (or are you assuming where the EI percentage is greater than the bookmakers % you are taking that result)? Things to look at going forward could be things like home field advantage (e.g. like the Clarke, Norman paper from 1995) or the effect of travel or fatigue. Good luck.

Martin Eastwood - May 21, 2013

Thanks George, that’s the tricky bit – how to convert from my %’s to a winning staking strategy. My plan for the summer is to work on that further as I’ve got some ideas ready to test out for next season :)

Alex - May 21, 2013

Okay – perhaps this is naive, but if you’ve demonstrated that your index has outperformed the bookies all season, why can’t you invest, say, $100/matchday and split the bets according to the outcomes predicted by the EI over the course of the season? You should win the majority of the bets and show some return for your money, no?

Martin Eastwood - May 22, 2013

Yes and I would make a profit over the season doing that. What I want to work out though is what would be the best way of splitting that $100/matchday between the matches to maximize the profit made, e.g. something like the Kelly Criterion.

Alex - May 23, 2013

Ahh, I see. This is extremely interesting stuff. I’ve been tinkering around with a Monte Carlo approach in Matlab (using data from Football Manager 2013) for predicting game results, but I haven’t had great – or even remotely convincing – success. Your stuff here is brilliant.

Martin Eastwood - May 23, 2013

Thanks Alex :)

Is the data easy to extract from Football Manager 13? I’ve never touched it since the old Championship Manager days because it is too addictive but it could be an interesting data source…

Alex - June 6, 2013

Yep! There’s a few editors available (just google for ‘em), it becomes pretty straightforward to pull data.

Man, if you can show that you can outperform the bookies by even 4%, I think this would make a great alternative to savings accounts. Are you going to be publishing predictions for matchdays over the next season? I think putting in, say, $20 a matchday might be fun.

George - May 23, 2013

Don’t know which source you use for data but I use football-data.co.uk, which is great for any of the European Leagues (including the Premiership). What’s really nice for doing this kind of thing, is they usually put stuff in an xls or csv, so you can just do your thing straight away. They also usually have a range of bookmaker odds (as I am expecting that certain bookmakers deal with “sharp”er customers than others (if you know what I mean) so I expect that you could probably find a bookmaker that the EI is more prone to exploiting (or perhaps that the EI for a particular team is more prone to exploring).

Martin Eastwood - May 23, 2013

Something else to add to my to-do list, which is the best bookmaker for using the EI with!

amir - May 23, 2013

Please read this article for optimal allocation of bets: http://www.academia.edu/1027427/Algorithms_for_optimal_allocation_of_bets_on_many_simultaneous_events

amir - May 23, 2013

I see you already wrote another article stating Kelly Criterion…

Martin Eastwood - May 23, 2013

Thanks Amir that link looks really interesting!

Lars - October 18, 2013

If I see this correctly, you introduced the EI in February 2013.

Now you are comparing its predictions with the whole of the 2012/13 season.

That is not really “beating” the bookmakers. To beat them, you need to make your predictions BEFORE the matches not after them.

If you beat them again next year without changing your methods I’ll give you all the credit you deserve.

Martin Eastwood - October 18, 2013

Lars – I actually went back and recreated predictions for every match of the season based on just the data that would have been available at the point in time when the match was originally played.

Nic - November 24, 2013

Thanks for the good read. It’s really interesting.

When you compare against the bookmakers are you taking their opening odds? Because after the odds are published the fluctuations are due to general populace betting patterns.

This could account for your sudden strong finish vis-a-vis the bookmakers as the public was betting more against academic results etc which then the bookmakers have to tweak the odds for.

Another question. What’s the average spread a bookmaker keeps. To beat them do you mean you overcome the spread as well?

Martin Eastwood - November 25, 2013

Thanks Nic,

The odds were taken from the football-data website. These were then normalised to account for the overround and aggregated IIRC. Not sure how much the average bookmaker keeps as the overround, charges etc vary so much between traditional high street bookmakers and online betting.

Cheers,

Martin

Patrick - December 10, 2013

You state that you stripped out the bookmakers over-round in your comparison. Should you not be comparing your predictions against the bookmakers prices WITH over-round included?

Otherwise is it not a little bit pointless (in betting terms) as you won’t be able to take prices with the over-round removed? ie the edge you’ve found only exists when betting against 100% books…

Cheers

Martin Eastwood - December 13, 2013

Yes, I agree. I was undecided as to the best way of looking at it, so I ended up removing the over-round as that gave me the worst results and I wanted to go with the worst-case scenario. If I leave the over-round in, which is probably more realistic, then I actually got even better results.

EI Match Probabilities for the English Premier League

Martin Eastwood — Fri, 17 May 2013 19:30:00 +0100

We have finally reached the end of the season so for the last time in 2012-2013 here are the Eastwood Index’s (EI) probabilities for the English Premier League.

Once the season is over and done with I’ll be looking back at how the EI has performed and how well its predictions compare with the bookmakers so look out for that next week!

Home Team	Away Team	Home (%)	Draw (%)	Away (%)
Chelsea	Everton	52	28	20
Liverpool	QPR	71	18	11
Man City	Norwich	77	14	9
Newcastle	Arsenal	23	32	45
Southampton	Stoke	42	31	27
Swansea	Fulham	46	30	24
Tottenham	Sunderland	67	21	13
West Brom	Man United	13	30	57
West Ham	Reading	49	29	23
Wigan	Aston Villa	43	31	27

Comments

George - May 17, 2013

Cool website. I really enjoy making projections and football seems to be the more interesting sport to do this because of the variety of approaches (given the low scoring nature of it which makes it less predictable). I go the least squares way and assume error is normally distributed around predictions from a rating system (and its the only one I could do easily in Excel given my limited maths). I think the guys at DTech go the ordered probit/logistical regression route (which seems to be the way to go I think) and I haven’t figured out your method yet but it looks interesting. They all seem to have similar numbers though:

e.g.: for Chelsea/Everton (Home/Draw/Away)

Your way: 52/28/20

DTech: 55/23/22

Least Squares: 59/22/18

All of which are around the bookmaker odds for that game 60/24/19 (random bookmaker picked)

Out of interest, have you noticed any kind of preferred ratio of your projections to bookmakers odds along the lines of the Dixon/Coles paper into this kind of thing?

Keep up the good work.

Martin Eastwood - May 18, 2013

Thanks George! Once the season finishes I’m planning on looking back at how I’ve done over the year compared with the bookmakers to see if there are any patterns to my projections versus theirs.

George - May 18, 2013

Thanks. Re: the bookmakers that’s what I’ve found, just because you can generate a number similar to theirs – what does it actually tell you (as we don’t know what their perspective is)? What biases is it accounting for? Don’t know if you saw the Steven Levitt paper in 2004 on this kind of thing (was on the NFL though), and the various papers done on football (e.g. Graham and Stott from DTech in 2008) and bookmaker efficiency. I know DTech have worked out they would make something like 10% when their number differed from bookmaker numbers over however many years they have been doing it. The Dixon/Coles paper also found that when the ratio of their probability against the bookmakers probability was about 1.2 it was optimum for generating profit. What this tells us though – I don’t know.

Good luck with everything and I look forward to reading it.

Martin Eastwood - May 21, 2013

Thanks George, I’ve not seen the Levitt paper before.

Jimmy - August 23, 2013

Very interesting, been trying myself to see how accurate the bookies have been with their odds. Scraping my data off Oddsportal, using a Java app to analyse it, and then generating my own rating based on form etc. Only point of note I have noticed so far is that home teams with odds between 1.999 and 2.5 are the best to bet on, consistently generating a % profit across all leagues. I assume these are games where bookies are loosest with their odds.

Martin Eastwood - August 23, 2013

yes the bookies are pretty good but they are certainly not perfect and there are situations where they can be beaten – the tricky bit is being able to do it consistently over a long period of time :)

Jimmy - August 23, 2013

I’d like to see you try your system against the Iranian pro league. Very obscure league to be betting on I know but I have found my own rating system giving me accuracy of 70% with average odds of over 2.2. I was raking it in last season. Not so much this time round as the bookies seem to have tightened their odds on that particular pony.

Martin Eastwood - August 23, 2013

cool, well done it’s always great to hear about people taking money off the bookies :)

I haven’t tried my system outside of Europe or MLS yet, now you have got me really interested in trying it out on more leagues!

EI Match Probabilities for the English Premier League

Martin Eastwood — Fri, 10 May 2013 19:30:00 +0100

Here are the latest match probabilities for the English Premier League calculated using the Eastwood Index (EI).

Somewhat surprisingly, Liverpool are only just favorites to beat Fulham with the odds so close that a draw would seem the likely outcome.

Down at the bottom of the table Newcastle versus QPR and Norwich versus West Brom look likely to finish tied, while Sunderland are slight favorites against Southampton meaning Wigan desperately need to take points off Arsenal to stand any chance of avoiding relegation.

Home Team	Away Team	Home (%)	Draw (%)	Away (%)
Aston Villa	Chelsea	21	32	47
Stoke	Tottenham	24	32	44
Everton	West Ham	67	20	12
Fulham	Liverpool	31	33	36
Norwich	West Brom	37	32	31
QPR	Newcastle	32	32	35
Sunderland	Southampton	49	29	22
Man United	Swansea	76	15	9
Arsenal	Wigan	71	18	11
Reading	Man City	8	28	64

Comments

Acka Bilk - May 10, 2013

Interesting to see that you make Sunderland such strong favourites against Southampton. On what basis can that be? Are you just looking purely at very recent results and the league table?

Martin Eastwood - May 10, 2013

It is based on results over the past few seasons, with greater weighting applied the more recent the result is. The two teams are rated reasonably similarly at the moment so part of Sunderland’s advantage is likely due to playing at home.

Saze - May 14, 2013

If the elo ratings are based on results over the past few seasons then how do they apply to newly promoted teams that have never played in the premier league before? Also when you say “over the past few seasons” how many seasons are you exactly talking about.

And have you ever considered crafting elo ratings for individual players and then using this to create an elo rating for a whole team as team lineups do usually vary from match to match and can sometimes significantly affect the outcome of a match. The downside to this however is that team lineups are only announced 30-45 minutes before the match starts so you won’t really be able to make a prediction until the match has practically started although it can still be useful to look at which players are performing at the very top level.

Martin Eastwood - May 14, 2013

My EI ratings are currently based on the three previous seasons’ data plus this season. Over the summer I hope to look at whether extending this offers any advantages.

Newly promoted teams are tricky. For the EPL teams are assigned the relevant relegated team’s rating so the team promoted in first place get the team relegated third from bottom’s rating. It’s not perfect but over time the rating will correct itself and move towards the correct value. For the MLS there is no relegation to worry about so I can avoid this.

I am interested in trying to model teams based on their players but this still requires more research into what stats to use per player – passes, tackles, shots, etc etc??

Saze - May 14, 2013

I understand that a promoted teams elo rating will correct itself over time but wouldn’t just creating elo ratings for the Championship, league one and league 2 overcome this issue.

Moreover if you are interested in individual player elo ratings then castrol rankings is a good player ranking site; “http://www.castrolfootball.com/rankings/rankings/?team=&comp=&nation=&position=&search=&offset=0&jump=1 ” You can find out more on how they create the rankings in their FAQ section.

Martin Eastwood - May 14, 2013

I’m not sure it would really work. For example Cardiff’s EI would be based on how they have performed against Championship quality teams so would not reflect how they would perform against Premier League teams.

Goalimpact - May 15, 2013

‘Individual player ELOs’

This is what I do. It works quite well, however it is far from being easily implemented. If you are interested please drop by my blog Goalimpact.com

Saze - May 16, 2013

Very interesting. I like the “top-down” approach that you have taken.

MLS Player Salaries: 2013

Martin Eastwood — Fri, 10 May 2013 19:30:00 +0100

Introduction

The latest Major League Soccer (MLS) salaries were released recently by the MLS Players’ Union so I thought I would post a quick summary of the data.

Average Salary By Team

The first thing I was interested in was average salaries per team and whether there had been any changes compared with previous seasons (Figure 1).

The trend over the past few years has been pretty constant, with LA Galaxy and New York Red Bulls having the highest outgoings on wages, which again continues for 2013.

Toronto have typically followed in a distant third place but this season sees them overtaken by Seattle following the addition of Obafemi Martins to their roster.

click the legend headers to show / hide each season’s data and hover the data points for more information

Number of Players

Next I looked at the number of players currently playing in each position. The results are pretty much the same as 2012, with a marginal gain in the number of forwards and defenders registered for this season (Figure 2).

click the legend headers to show / hide each season’s data and hover the data points for more information

Average Salary By Position

Next I looked at average salary by position and it is probably no surprise that forwards receive the most remuneration (Figure 3). In fact, the higher up the field you are, the more money you earn, with goalkeepers earning the least followed by defenders, midfielders, attacking midfielders and then forwards.

The only players outside of this trend are defensive midfielders, who earn even less than goalkeepers. In terms of salary this appears to be the least appreciated position by quite a large margin. If you are out to make money then you are much better off specializing as either a clear-cut defender or midfielder rather than something perhaps between the two. Or even better, learn to score goals…

click the legend headers to show / hide each season’s data and hover the data points for more information

The Big Earners

Although the average salaries show that forwards earn noticeably more than any other position, the actual value is skewed by a few high-profile players earning big bucks. Table 1 shows the top ten earners in the MLS compared with the overall league average. Of the ten players, eight are forwards and two are midfielders. The highest paid defender is Toronto’s Darren O’Dea, ranked 18th overall and the highest paid goalkeeper is Portland’s Donovan Ricketts, ranked just 41st overall.

Club	Last Name	First Name	Pos	Base Salary	Compensation
NY	Henry	Thierry	F	$3,750,000	$4,350,000
LA	Keane	Robbie	F	$4,000,000	$4,333,333
NY	Cahill	Tim	M	$3,500,000	$3,625,000
LA	Donovan	Landon	F	$2,500,000	$2,500,000
MTL	Di Vaio	Marco	F	$1,000,008	$1,937,508
SEA	Martins	Obafemi	F	$1,600,000	$1,725,000
TOR	Koevermans	Danny	F	$1,250,000	$1,663,323
VAN	Miller	Kenny	F	$1,114,992	$1,132,492
SEA	Montero	Fredy	F	$700,000	$856,000
DAL	Ferreira	David	M-F	$625,000	$730,000
			League Average	$141,903	$159,849

Since these star players are skewing the averages, we can analyse the median salary instead using box and whiskers plots (Figure 4). These show the distribution of the different salaries for each position where the thick line across the center of the box is the median salary, the top and bottom of the box are the 75th and 25th percentiles and the whiskers represent 1.5x the interquartile range. Outliers outside of this range are then plotted as dots.

Looking at the median salaries there is actually very little difference between the outfield players. The average MLS player’s salary is also clearly nothing like that of the league’s star players. In fact, if we remove the top twenty earners then the overall league average falls from $159,849 to $113,516, with a median of $83,000 and a mode of $46,500, which is the league minimum for first teamers (roster positions 1-24).
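To make the skew concrete, here is a minimal Python sketch that uses only the ten compensation figures and the league average from Table 1. It compares the mean and median of the top ten and computes the quartiles and 1.5x interquartile range fences that a box and whiskers plot is built from. The variable names are my own, and `statistics.quantiles` uses its default interpolation method, which may differ slightly from whatever a given plotting package uses.

```python
from statistics import mean, median, quantiles

# Total compensation for the ten highest earners, copied from Table 1.
top_ten = [4_350_000, 4_333_333, 3_625_000, 2_500_000, 1_937_508,
           1_725_000, 1_663_323, 1_132_492, 856_000, 730_000]

# League-wide average compensation, also from Table 1.
LEAGUE_AVERAGE = 159_849

top_mean = mean(top_ten)
top_median = median(top_ten)

# Box plot ingredients: quartiles, interquartile range and the 1.5 x IQR
# fences beyond which a salary would be drawn as an outlier dot.
q1, q2, q3 = quantiles(top_ten, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"top ten mean:   ${top_mean:,.0f}")
print(f"top ten median: ${top_median:,.0f}")
print(f"mean is {top_mean / LEAGUE_AVERAGE:.1f}x the league average")
print(f"IQR fences: ${lower_fence:,.0f} to ${upper_fence:,.0f}")
```

Even within the top ten the mean sits above the median, which is the same effect that pulls the league-wide average up.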

Conclusions

This is only a quick overview of the data and there is still a lot more to explore so feel free to get in touch if there is anything in particular you want to have a look at.

Comments

Dennis - May 10, 2013

Could you just add in the position-specific medians and averages, as well as both of those without the top-payed players? The last chart does a lot of work but is a little short on the details.

Also, it looks like the median salary for defenders is higher than midfielders. Why is that?

Thanks!

Martin Eastwood - May 10, 2013

Good idea, I’ll take a look!

The Eastwood Index, MLS and Parity

Martin Eastwood — Tue, 07 May 2013 19:30:00 +0100

Introduction

I showed in my last post how Major League Soccer (MLS) is a much more closely matched league than the English Premier League (EPL), with the wage cap and draft system increasing the parity between teams.

The Eastwood Index

This high level of parity can also be seen using the Eastwood Index (EI), a rating system designed to calculate odds of match outcomes when different teams play each other.

The Eastwood Index rates teams so that the average rating is 2000 and the higher the rating the better a team is compared with the rest of the league.

EI ratings increase when teams win matches or draw against superior opposition and decrease when teams lose matches or draw against weaker opposition. The size of the gain or loss in ratings is linked to the quality of the opposition so that beating a superior team is worth more than winning against a lower ranked team.

The change in EI rating is also weighted by the goal difference in the match so that the greater the difference in goals scored or conceded then the greater the change in ratings. Home advantage is also included in the calculations so that the home team is expected to perform better when playing at home compared with away.
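The exact EI formula is not given here, so the following is only an illustrative Elo-style sketch of the behaviour described above, not the EI itself. The logistic curve, the `k` factor, the `scale`, the size of `home_advantage` and the goal-difference multiplier are all assumptions picked for the example.

```python
def expected_score(home_rating, away_rating, home_advantage=100.0, scale=400.0):
    """Expected result (0 to 1) for the home team on a logistic, Elo-style curve.

    home_advantage and scale are hypothetical values, not EI parameters.
    """
    diff = (home_rating + home_advantage) - away_rating
    return 1.0 / (1.0 + 10.0 ** (-diff / scale))


def update_ratings(home_rating, away_rating, home_goals, away_goals,
                   k=20.0, home_advantage=100.0):
    """Return the new (home, away) ratings after one match."""
    if home_goals > away_goals:
        actual = 1.0      # home win
    elif home_goals == away_goals:
        actual = 0.5      # draw
    else:
        actual = 0.0      # away win

    # Hypothetical goal-difference weighting: bigger margins move ratings more.
    margin = abs(home_goals - away_goals)
    weight = 1.0 + 0.5 * max(margin - 1, 0)

    expected = expected_score(home_rating, away_rating, home_advantage)
    change = k * weight * (actual - expected)

    # The update is zero-sum, so the league average rating stays fixed.
    return home_rating + change, away_rating - change


# A weaker home side drawing with a much stronger visitor gains rating points.
print(update_ratings(1800.0, 2300.0, 1, 1))
```

Because whatever one team gains the other loses, the league average stays where it started, which matches the idea of ratings centred on 2000.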

Major League Soccer EI Ratings

Currently, the highest rated team in MLS is LA Galaxy, with an EI of 2506 (Table 2) while the lowest is Toronto FC, with an EI of just 1303 (Table 1). Outside of this, the majority of teams are fairly evenly matched in MLS and are rated around 1880 – 2300 demonstrating the parity in the league.

Position	Club	EI Rating
1	New York Red Bulls	2225
2	Sporting Kansas City	2374
3	Houston Dynamo	2210
4	Montreal Impact	2052
5	Columbus Crew	2082
6	Philadelphia Union	1809
7	New England Revolution	1610
8	Chicago Fire	2063
9	Toronto FC	1303

Table 1: MLS Eastern Conference EI Ratings

Position	Club	EI Rating
1	FC Dallas	2150
2	LA Galaxy	2506
3	Real Salt Lake	2271
4	Portland Timbers	1804
5	Colorado Rapids	1871
6	Chivas USA	1433
7	San Jose Earthquakes	2267
8	Vancouver Whitecaps	1690
9	Seattle Sounders FC	2392

Table 2: MLS Western Conference EI Ratings

It is still a bit early in the season to draw too many conclusions but if we combine the two MLS conferences together then Columbus currently come out as mid-table, with an EI rating of 2082. This is just 424 lower than the top team (LA Galaxy) and 779 higher than the bottom of the league (Toronto FC), and close to the theoretical league average EI of 2000.

MLS Compared with EPL

Compare this with the EPL (Table 3) and you can see an immediate difference in the level of parity. Taking the average of West Ham and Stoke to be the middle of the table, a mid-placed EPL team’s EI is below the theoretical average at 1634, just 264 better than QPR at the bottom of the table and a gigantic 1431 away from Manchester United. The top of the EPL has been very much a league-within-a-league for a while now, with average teams vastly more likely to be relegated than they are to win anything or even reach the European qualification spots.

Position	Club	EI Rating
1	Manchester United	3064
2	Manchester City	2909
3	Chelsea	2598
4	Arsenal	2627
5	Tottenham Hotspur	2514
6	Everton	2351
7	Liverpool	2291
8	West Bromwich Albion	1883
9	Swansea City	1797
10	West Ham United	1520
11	Stoke City	1747
12	Fulham	1806
13	Aston Villa	1704
14	Southampton	1611
15	Sunderland	1741
16	Norwich City	1607
17	Newcastle United	1814
18	Wigan Athletic	1650
19	Reading	1397
20	Queens Park Rangers	1369

Table 3: EPL EI Ratings
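The difference in parity can also be summarised numerically. The short Python sketch below takes the ratings straight from Tables 1 to 3 and compares the range and standard deviation of the two leagues. Using the population standard deviation as a measure of parity is my choice for illustration rather than anything built into the EI.

```python
from statistics import mean, pstdev

# EI ratings copied from Tables 1 and 2 (both MLS conferences combined).
mls = {
    "New York Red Bulls": 2225, "Sporting Kansas City": 2374,
    "Houston Dynamo": 2210, "Montreal Impact": 2052, "Columbus Crew": 2082,
    "Philadelphia Union": 1809, "New England Revolution": 1610,
    "Chicago Fire": 2063, "Toronto FC": 1303, "FC Dallas": 2150,
    "LA Galaxy": 2506, "Real Salt Lake": 2271, "Portland Timbers": 1804,
    "Colorado Rapids": 1871, "Chivas USA": 1433,
    "San Jose Earthquakes": 2267, "Vancouver Whitecaps": 1690,
    "Seattle Sounders FC": 2392,
}

# EI ratings copied from Table 3.
epl = {
    "Manchester United": 3064, "Manchester City": 2909, "Chelsea": 2598,
    "Arsenal": 2627, "Tottenham Hotspur": 2514, "Everton": 2351,
    "Liverpool": 2291, "West Bromwich Albion": 1883, "Swansea City": 1797,
    "West Ham United": 1520, "Stoke City": 1747, "Fulham": 1806,
    "Aston Villa": 1704, "Southampton": 1611, "Sunderland": 1741,
    "Norwich City": 1607, "Newcastle United": 1814, "Wigan Athletic": 1650,
    "Reading": 1397, "Queens Park Rangers": 1369,
}


def spread(ratings):
    """Summarise how widely a league's ratings are dispersed."""
    values = list(ratings.values())
    return {
        "mean": mean(values),
        "range": max(values) - min(values),
        "stdev": pstdev(values),
    }


for name, league in (("MLS", mls), ("EPL", epl)):
    s = spread(league)
    print(f"{name}: mean={s['mean']:.0f} range={s['range']} stdev={s['stdev']:.0f}")
```

A smaller range and standard deviation mean the teams are bunched more tightly around the league average.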

Parity

Compared with the EPL, MLS is a very evenly matched league where the margins between the top and bottom of the conferences are small, making it a really exciting league to follow as virtually any team is in with a chance of reaching the playoffs at the start of the season.

Comments

EI Match Probabilities for the English Premier League

Martin Eastwood — Fri, 03 May 2013 19:30:00 +0100

Here are the latest match probabilities for the English Premier League calculated using the Eastwood Index (EI).

Please note that I have not included next week’s mid-week matches yet as the odds for those will change depending on how this weekend’s matches finish. I’ll try and add those on Tuesday once I have all the data available.

Edit – Table 1 is now updated with the odds for this week’s mid week matches.

Home Team	Away Team	Home (%)	Draw (%)	Away (%)
Fulham	Reading	59	25	16
Norwich	Aston Villa	44	30	26
Swansea	Man City	15	31	54
Tottenham	Southampton	68	20	12
West Brom	Wigan	54	27	19
West Ham	Newcastle	37	32	32
QPR	Arsenal	13	30	57
Liverpool	Everton	44	30	26
Man United	Chelsea	60	24	16
Sunderland	Stoke	45	30	25
Man City	West Brom	72	18	10
Wigan	Swansea	41	31	28
Chelsea	Tottenham	47	30	23

Comments

How Much Does Luck Affect MLS?

Martin Eastwood — Thu, 02 May 2013 19:30:00 +0100

Introduction

Following my recent article for Betting Expert quantifying how large a role luck plays in the English Premier League (EPL) I thought it would be interesting to look at Major League Soccer (MLS) too.

MLS Structure

MLS is structured differently to the EPL as it has followed other North American sports in implementing wage caps and player drafts. Unlike the current salary free-for-all in the EPL, MLS clubs are currently limited to spending a maximum of $2.95 million in wages over their first 20 roster spots, with up to three additional designated players paid (partially) outside of this salary cap.

MLS also has a draft system that takes place each January during which teams can sign players graduating from college or otherwise signed by the league. The draft is split into three rounds and is designed to give priority to the league’s weaker teams allowing them first choice of players ahead of the more successful teams.

MLS also has a shorter season than the EPL, with teams playing just 34 matches compared with the EPL’s 38. This is important because the more matches that are played, the more opportunity talent has to overcome luck.

Overall, this all works towards increasing parity in the MLS and making it a more evenly balanced league, which in turn should enhance the role luck plays.

Results

Using my adaptation of Tom ‘Tango’ Tiger’s baseball equation I calculated the average win rate in MLS going back to 2004 and the variance of the win rate. I then calculated the variance expected due to luck and subtracted one from the other to get the amount of variance attributed to talent.
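The adapted equation itself is not reproduced in this post, so the Python sketch below only shows the general shape of the decomposition: observed variance is treated as talent variance plus luck variance. It assumes the luck term is the simple binomial variance p(1 - p) / n, which ignores draws, and the win rates in the example are made-up numbers rather than real MLS data.

```python
from statistics import mean, pvariance


def luck_and_talent_shares(win_rates, games_per_team):
    """Split the observed variance in win rate into luck and talent shares.

    Assumes luck variance is binomial, p * (1 - p) / n, which is a
    simplification (draws are ignored) and not necessarily the adapted
    equation used for the results in this post.
    """
    avg = mean(win_rates)
    observed_var = pvariance(win_rates)
    if observed_var == 0:
        raise ValueError("win rates are identical, nothing to decompose")

    luck_var = avg * (1 - avg) / games_per_team
    # Luck cannot explain more than all of the observed variance.
    luck_share = min(luck_var / observed_var, 1.0)
    return luck_share, 1.0 - luck_share


# Hypothetical win rates for eight teams over a 34-match season.
example_rates = [0.65, 0.55, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25]
luck, talent = luck_and_talent_shares(example_rates, games_per_team=34)
print(f"luck: {luck:.0%}, talent: {talent:.0%}")
```

Fewer matches per season make the luck term larger, which is why a shorter season leaves less room for talent to show through.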

Luck accounts for around 35% of a team’s win rate in the EPL and I was expecting MLS to be higher, but it initially came out at a staggering 82% for MLS. Instinctively this seems too high and I suspect it is inaccurate due to the changes in MLS’s structure over the years. For example, back in 2004 there were only ten teams and one conference while there are currently 19 teams and two conferences. There have also been changes to the level of the salary cap and the number of designated players allowed over this time period too.

So I went back and reprocessed the results using just the 2010–2012 data. Although this reduces the sample size considerably it leaves us with data more representative of the current state of MLS. And the results this time? Luck accounted for around 57% of a team’s win percentage compared with just 43% for talent.

So compared with the EPL, the structure of MLS does appear to increase parity and enhance the influence luck has in deciding the league champions. In fact, being lucky is probably the more important of the two, although luck on its own is not enough – you need to be a talented team with luck on its side to win the MLS Cup.

Comments

How Much Does Luck Affect Football?

Martin Eastwood — Tue, 30 Apr 2013 19:30:00 +0100

Introduction

I’ve written a new article for Betting Expert quantifying how much luck affects football. Take a look here as it is probably more than you are expecting!

Comments

What Is A Meaningful Sample Size?

Martin Eastwood — Sun, 28 Apr 2013 19:30:00 +0100

Introduction

I had an article published at Betting Expert last week looking at how to determine statistically how much data you need to make accurate predictions.

Find out more by reading the rest of this article here.

Comments

EI Match Probabilities for the English Premier League

Martin Eastwood — Fri, 26 Apr 2013 19:30:00 +0100

It’s Friday again so here are this weekend’s match probabilities for the English Premier League.

I was a little surprised to see Manchester United come out as favourites against Arsenal, even though they are away from home, but the odds are so close that it looks like a potential draw. However, it all depends on Sir Alex Ferguson’s squad selection. With the league won, will he rest the bigger stars and let some of the second-string players reach enough appearances to be eligible for a winners medal? Anders Lindegaard, for example, still needs to play another two matches this season to claim his medal.

Other possible draws include Southampton Vs West Brom and Newcastle Vs Liverpool, while you’d hope Reading Vs QPR will not be a draw as a single point is useless for either team.

Home Team	Away Team	Home (%)	Draw (%)	Away (%)
Man City	West Ham	78	13	9
Everton	Fulham	58	25	17
Southampton	West Brom	39	31	29
Stoke	Norwich	47	29	24
Wigan	Tottenham	20	32	48
Newcastle	Liverpool	35	32	33
Reading	QPR	44	30	26
Chelsea	Swansea	65	22	13
Arsenal	Man United	31	32	37
Aston Villa	Sunderland	40	31	29

Comments

EI Match Probabilities for the English Premier League

Martin Eastwood — Fri, 19 Apr 2013 19:30:00 +0100

It’s been a busy day but I’ve finally got the probabilities for this weekend’s matches completed.

There are some pretty close games, with QPR Vs Stoke, Tottenham Vs Man City and Liverpool Vs Chelsea all looking like potential draws. Plus you could maybe throw Sunderland Vs Everton and even West Ham Vs Wigan into that group too.

The only clear favourites are Manchester United and Norwich so it’s going to be a tricky week to call.

Home Team	Away Team	Home (%)	Draw (%)	Away (%)
Fulham	Arsenal	26	32	42
Norwich	Reading	53	27	20
QPR	Stoke	37	32	31
Sunderland	Everton	28	33	39
Swansea	Southampton	49	29	22
West Brom	Newcastle	45	30	25
West Ham	Wigan	41	31	28
Tottenham	Man City	31	33	36
Liverpool	Chelsea	36	32	32
Man United	Aston Villa	79	12	9

Comments

EI Match Predictions for the English Premier League

Martin Eastwood — Fri, 12 Apr 2013 19:30:00 +0100

Here we go with this week’s predictions from the Eastwood Index (EI)!

The EI doesn’t hold out much chance for Wigan or West Ham getting three points off the two Manchester clubs this week, with the lowest odds I think the EI has ever produced. Interestingly, both of Fulham’s matches look like possible draws.

Home Team	Away Team	Home (%)	Draw (%)	Away (%)
Arsenal	Norwich	70	19	11
Aston Villa	Fulham	35	32	33
Everton	QPR	70	18	11
Reading	Liverpool	22	32	46
Southampton	West Ham	50	29	21
Newcastle	Sunderland	48	29	23
Stoke	Man United	10	29	61
Arsenal	Everton	51	28	21
Man City	Wigan	77	14	9
West Ham	Man United	7	27	66
Fulham	Chelsea	32	32	36

Comments

Andrew Beasley - April 13, 2013

Hi Martin – is that Villa v Fulham prediction the closest you’ve had (as in a range of just three between largest and smallest)?

Ever had a 33/33/34 for instance?

Cheers.

Martin Eastwood - April 13, 2013

I think it is the closest yet. I wonder if equal odds of home and away win suggest a draw is likely?

EI Match Predictions for the English Premier League

Martin Eastwood — Fri, 05 Apr 2013 19:30:00 +0100

A couple of weeks back I demonstrated how the EI is more accurate than the bookies based on rank probability scores, but a few people have asked if I can do something a bit simpler. Figure 1 shows how often the EI picked the winner as being the favourite compared with aggregated bookmakers’ odds. It’s pretty close but the EI seems to have a small but reasonably constant margin over the bookmakers so far this season.

Figure 1: EI Versus Bookmakers
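For anyone wanting to try both measures on their own forecasts, here is a minimal Python sketch of the two ideas: the rank probability score for a home/draw/away forecast and the simpler "did the favourite win" check. The function names are my own, the probabilities are this week’s Man United versus Man City figures from the table below, and the outcome passed in is purely hypothetical.

```python
def rank_probability_score(probs, outcome_index):
    """Rank probability score for an ordered forecast (home, draw, away).

    Lower is better; 0 means the forecast put 100% on what happened.
    """
    observed = [1.0 if i == outcome_index else 0.0 for i in range(len(probs))]
    cumulative_forecast = 0.0
    cumulative_observed = 0.0
    total = 0.0
    # Sum squared differences of the cumulative distributions, skipping the
    # final category because both cumulative sums reach 1 there.
    for p, o in zip(probs[:-1], observed[:-1]):
        cumulative_forecast += p
        cumulative_observed += o
        total += (cumulative_forecast - cumulative_observed) ** 2
    return total / (len(probs) - 1)


def favourite_was_right(probs, outcome_index):
    """True if the most likely outcome in the forecast is what happened."""
    return probs.index(max(probs)) == outcome_index


forecast = [0.51, 0.28, 0.21]   # home, draw, away
HOME, DRAW, AWAY = 0, 1, 2

# Hypothetical outcome: suppose the home team won.
print(rank_probability_score(forecast, HOME))
print(favourite_was_right(forecast, HOME))
```

The favourite check only counts hits and misses, while the rank probability score also rewards a forecast for how much probability it put on, or near, the actual result.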

Last week turned out to be a pretty good week with the EI managing to correctly predict the winner in eight out of the ten matches played. I’d made a few minor tweaks before posting last week’s odds to try and enhance the way draws and away wins are calculated so hopefully the EI will be able to maintain its edge over the bookmakers.

Anyway, here are this week’s odds:

Home Team	Away Team	Home (%)	Draw (%)	Away (%)
Reading	Southampton	38	32	30
Norwich	Swansea	41	31	28
Stoke	Aston Villa	50	29	21
West Brom	Arsenal	26	32	42
Liverpool	West Ham	66	22	12
Tottenham	Everton	49	29	22
Chelsea	Sunderland	66	21	13
Newcastle	Fulham	44	30	26
QPR	Wigan	39	31	30
Man United	Man City	51	28	21

Comments

Dan - April 11, 2013

Hi Martin,

Enjoy reading your articles.

Just wondered if you considered your Liverpool-West Ham prediction as a success versus the bookmakers.

I ask as although you suggest Liverpool are the most likely winners (66%) the odds makers priced them up as 1/3 (75%) chances. So in that case would you say your data would trigger a lay of Liverpool at 1/3 rather than a back?

Cheers,

Dan.

Martin Eastwood - April 11, 2013

Hi Dan

That’s an interesting point. I am still working on what is the best strategy to use with the model. Ideally I will be able to identify a pattern between where my odds diverge from the bookmakers and from there work out where the real value lies.

Liverpool’s odds from the bookmakers are always a little strange though as they supposedly get bet on very heavily from Asia, which leads to the odds moving towards Liverpool so I suspect they were rated more likely to win than they actually were by the bookmakers.

Cheers,

Martin

Understanding Total Shot Ratio in Football

Martin Eastwood — Tue, 02 Apr 2013 19:30:00 +0100

Introduction

The use of Total Shot Ratio (or TSR) seems to have slowly been gaining ground so I thought it would be worth analyzing the statistic in more detail to see what it can and cannot do.

What is Total Shot Ratio?

Put simply Total Shot Ratio is the proportion of shots taken by one team compared with another. It can be calculated by dividing the number of shots taken by a team by the total shots overall (Figure 1).

$TSR = \frac{ShotsFor}{ShotsFor + ShotsAgainst}$

Figure 1: Total Shot Ratio
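As a quick illustration, the calculation is a one-liner in Python. The function name and the shot counts below are invented for the example, and a match with no shots at all is treated as an error rather than guessed at.

```python
def total_shot_ratio(shots_for, shots_against):
    """Share of all shots in a match that were taken by one team."""
    total = shots_for + shots_against
    if total == 0:
        raise ValueError("no shots in the match, TSR is undefined")
    return shots_for / total


# Hypothetical match: the home team takes 15 shots, the away team 5.
home_tsr = total_shot_ratio(15, 5)   # 0.75
away_tsr = total_shot_ratio(5, 15)   # 0.25

# The two ratios for a match always sum to 1, so their average is 0.5.
print(home_tsr, away_tsr, (home_tsr + away_tsr) / 2)
```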

It is often used as a surrogate for dominance as the presumption is that the team taking the majority of the shots will be controlling the match and possibly limiting the opposition’s ability to shoot at goal.

Total Shot Ratio Data

Using data taken from the football-data.co.uk website I calculated the Total Shot Ratios for all matches from the English Premier League going back to the 2001-2002 season, giving a total of 8360 data points, which are normally distributed (Figure 2).

Figure 2: Distribution of Total Shot Ratios

The average Total Shot Ratio is always 0.5, because for every value above 0.5 you always have an equivalent value below it for the opposition. For example, if the home team’s Total Shot Ratio is 0.75 then the away team’s ratio must be 0.25.

$(0.75 + 0.25) / 2 = 0.5$

The standard deviation, which is a measure of the dispersion of the data around the average value, was 0.166.

Total Shot Ratio: Correlation With Goals Scored

Since Total Shot Ratios are being used to show dominance in a match it makes sense to assess the correlation with the number of goals scored. The higher a team’s Total Shot Ratio is then the more shots it is having compared with the opposition so the expectation would be that they would score more goals. However, this does not seem to be the case as the relationship between the two is extremely weak (Figure 3; $r^2=0.079$).

Figure 3: Correlation Between Total Shot Ratio and Goals Scored
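For reference, an r-squared value like the ones quoted in this post can be computed as the square of the Pearson correlation coefficient. The Python sketch below does this by hand with no external libraries. The handful of Total Shot Ratio and goals values are invented purely to show the mechanics and are not taken from the football-data.co.uk data.

```python
from math import sqrt


def r_squared(xs, ys):
    """Square of the Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    r = covariance / sqrt(var_x * var_y)
    return r * r


# Invented example values: one TSR and one goals tally per match.
example_tsr = [0.35, 0.42, 0.50, 0.55, 0.61, 0.70, 0.48, 0.66]
example_goals = [1, 0, 2, 1, 0, 3, 2, 1]

print(f"r squared = {r_squared(example_tsr, example_goals):.3f}")
```

An r-squared close to 0 means the ratio explains almost none of the variation in goals, while a value close to 1 would mean it explains nearly all of it.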

Total Shot Ratio: Correlation With Goal Difference

So how about the relationship between Total Shot Ratio and goal difference instead? Since teams with higher Total Shot Ratios are thought to be dominating matches, perhaps they are more likely to have a higher goal difference in the match as they may also be less likely to concede goals? Again, though, the correlation is weak (Figure 4; $r^2=0.11$).

Figure 4: Correlation Between Total Shot Ratio and Goal Difference
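
For anyone wanting to reproduce this kind of figure, here is a rough sketch of how an $r^2$ value can be computed from paired match data. It assumes $r^2$ means the squared Pearson correlation, and the numbers in the example are made up for illustration; they are not the football-data.co.uk values used above.

```python
def r_squared(xs, ys):
    """Square of the Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations around each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov ** 2 / (var_x * var_y)


# Made-up per-match values, for illustration only.
tsr = [0.35, 0.50, 0.62, 0.45, 0.70, 0.55]
goal_diff = [-1, 0, 2, 1, 0, -1]
print(round(r_squared(tsr, goal_diff), 3))
```

A value near 0 means the ratio explains almost none of the match-to-match variation, while a value near 1 means it explains nearly all of it.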

Total Shot Ratio: Correlation With Match Outcomes

Additionally, the relationship between Total Shot Ratio and match outcome is also poor ($r^2=0.066$), suggesting that Total Shot Ratio also has very little influence on the likelihood of a team winning a particular match. Just because you are taking a greater proportion of the shots does not mean you are any more likely to win.

Total Shot Ratio: Correlation With Points Per Season

Although the match-by-match correlations above are weak, there is the suggestion of a trend, so it may be that Total Shot Ratio is heavily luck driven in the short term and that we need more matches before we can see the overall effects of a higher ratio. For example, looking at the correlation between Total Shot Ratio and points over an entire season shows a pretty decent relationship between the two (Figure 5). This suggests that, over the long term, possessing a higher Total Shot Ratio is in fact associated with fewer matches being lost per season.

Figure 5: Correlation Between Total Shot Ratio and Points

Total Shot Ratio: How Much Data is Enough?

So if Total Shot Ratio is only becoming meaningful over longer periods of time then how much data do we actually need before it becomes a useful metric? To look at this I calculated the overall Total Shot Ratio per season by team and then randomized the order of each match that season. I then looked at how the deviation changed over the course of a season compared with the overall ratio, e.g. after five matches, ten matches etc. (Figure 6).

Figure 6: Deviation in Total Shot Ratio by Sample Size

As more data is used to calculate the Total Shot Ratio it moves closer towards its true value and the deviation decreases as the effect of any outlier matches becomes less influential. With fewer matches being used to calculate the Total Shot Ratio there is more dispersion and variability in the calculated value. Interestingly, there is still a reduction in the deviation moving from 30 matches to 38 matches, suggesting that we may need at least a full season’s worth of data to get an accurate measure of a team’s Total Shot Ratio.
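
The resampling idea can be sketched roughly as below. This is not the original analysis: the season is simulated with random shot counts, the ratio for a subset is assumed to be calculated from pooled shots, and "deviation" is assumed to mean the average absolute gap from the full-season ratio. All function names are hypothetical.

```python
import random


def tsr(matches):
    """Pooled TSR over a list of (shots_for, shots_against) tuples."""
    shots_for = sum(f for f, _ in matches)
    shots_against = sum(a for _, a in matches)
    return shots_for / (shots_for + shots_against)


def mean_abs_deviation(season, sample_size, n_shuffles=1000, seed=0):
    """Average gap between the TSR of a random subset and the full-season TSR."""
    rng = random.Random(seed)
    full = tsr(season)
    total = 0.0
    for _ in range(n_shuffles):
        # Drawing k matches without replacement is the same as taking the
        # first k matches of a randomly reordered season.
        sample = rng.sample(season, sample_size)
        total += abs(tsr(sample) - full)
    return total / n_shuffles


# Simulated 38-match season with random shot counts (not real data).
sim = random.Random(42)
season = [(sim.randint(5, 20), sim.randint(5, 20)) for _ in range(38)]

for n in (5, 10, 20, 30, 38):
    print(n, round(mean_abs_deviation(season, n), 3))
```

By construction the deviation is zero once all 38 matches are included, so this only shows how quickly a partial season converges on the full-season value, not how close that value is to a team's underlying ability.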

Total Shot Ratio: Calculated Sample Size

Another option to find out how much data we need is to calculate the sample size required to identify specific differences in Total Shot Ratio. There are a number of different methods for this but the commonly used t-test sample size estimation suggests that to be 95% certain that two teams with a difference in Total Shot Ratio of 0.1 are actually different from each other takes 45 matches.

So, to be statistically certain that a team with a Total Shot Ratio of 0.6 actually has a higher ratio than a team with a Total Shot Ratio of 0.5 rather than it just being down to random variability requires over a season’s worth of matches to be played.

As the differences become smaller, the number of matches required increases even further – to identify a difference in Total Shot Ratio of 0.05 takes nearly five seasons’ worth of matches!
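
As a hedged illustration of where numbers like these come from, the sketch below uses the standard normal approximation to a two-sample t-test sample size. The standard deviation of 0.166 is the one reported earlier; the two-sided 5% significance level and the 80% power are my assumptions, since the exact settings used are not stated.

```python
from math import ceil
from statistics import NormalDist


def matches_needed(difference, sd=0.166, alpha=0.05, power=0.80):
    """Approximate matches per team needed to detect a given TSR difference.

    Normal approximation to the two-sample t-test; the alpha and power
    defaults are assumptions, not values quoted in the article.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) * sd / difference) ** 2)


for diff in (0.10, 0.05):
    n = matches_needed(diff)
    print(f"difference {diff:.2f}: about {n} matches ({n / 38:.1f} seasons)")
```

Under these assumptions the results land close to the figures quoted above. Halving the difference roughly quadruples the matches needed, because the sample size scales with the inverse square of the difference.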

Conclusions

In the short term, Total Shot Ratio appears to be virtually meaningless in terms of goals scored or match outcomes as its variability is so high.

Over the long term though, skill outweighs luck and Total Shot Ratio becomes increasingly correlated with outcomes. However, it may take a long time for this to occur and may be less accurate than other statistics available if you are interested in predicting performance.

Finally, this article is not intended to say “do not use Total Shot Ratio” as it is still an interesting metric. Rather, make sure that you are aware of its abilities and limitations if you are planning on using it for analysis.

Comments

Bob - April 2, 2013

You’re getting closer to the holy grail. Quality of shot data is important. TSR can be refined into something slightly less random (though shots alone are still less random than goals).

Martin Eastwood - April 2, 2013

I totally agree, the quality of shots is important. You can improve your TSR by taking lots and lots of shots but they are not necessarily going to improve your chances of scoring. It is the quality of shots that is important, not the amount of them.

sidereal - April 2, 2013

A good compromise might be looking at shots in the box rather than overall shots. It’s a little bit more recordkeeping (though Opta has them for leagues it tracks) without having to subjectively evaluate shot quality. And I suspect it’d correlate better over a shorter sample size. I can run the correlation between TBSR and results in MLS when I get some time.

Martin Eastwood - April 2, 2013

Sounds good, would be interesting to see how the correlation looks

sidereal - April 4, 2013

Had time to run this today. With two years of MLS data I found substantially lower correlations than your EPL data. Possibly because of the smaller sample size. More likely because MLS shot quality is lower and more random.

R squared for TSR to goals is 0.0245 and to GD is 0.05. Switching to box shots improves those marginally to 0.042 and 0.087.

But at the season level the improvement mostly goes away. TSR to seasonal PPG is 0.324 and TBSR to seasonal PPG is 0.349.

Martin Eastwood - April 4, 2013

Thanks, that is really interesting to see!

There is also much more parity in the MLS compared with other leagues too, which may be having an effect as the more closely matched the teams are then the more impact luck has on determining outcomes.

Bob - April 2, 2013

It isn’t actually that difficult to measure quality of shots, as long as you’re prepared to put in 10-15 minutes work per week (per league). For the top five leagues in Europe anyway, plus a few others (including the npower leagues from next season).

I agree shots alone has it’s limitations but over a 20-25 game sample, I do think TSR is an extremely valuable measure and one that has called a few regressions this season (Sunderland the most obvious) that most observers did not see coming.

shuddertothink - April 3, 2013

What would be the best way to quantify ‘shot quality’?

The best we have on ‘shot quality’ is Shots on target to points has an r2 of .0685 in 2012/13.

There may be issues with the sample size of just 600 data points in comparison to 7600 or so in Martin’s 10 year sample.

As was stated skill outweighs luck given a bigger sample

Turkish - November 29, 2013

Do you teach any courses at the moment on football based prediction?

I am sure a lot of people would be keen to see you present your information – would you be interested on doing that?

Martin Eastwood - December 4, 2013

Hi Matthew,

I don’t have any courses planned but it would be something really interesting to do if there was enough interest from people!

Cheers,

Martin

Dzof - June 14, 2014

Thanks for the article, very interesting.

Do you have the numerical values for figure 4 published anywhere (Correlation between shot ratio and goal difference)? Or raw data?

I basically was looking for the observed probability a team wins/draws/loses a game given a certain shot ratio, e.g. When a team has 60-70% shot ratio, what % of games do they win?

Extending this, the other question that comes to mind is if this probability is consistent across seasons / teams / leagues.

Keep up the good work!

Martin Eastwood - June 14, 2014

Not to hand but I’m planning an update to the site over the summer to provide access to that sort of thing so keep an eye out for that!

Michael - October 31, 2014

Thanks for the article. I was looking for statistics like that! You said there are other statistics for predicting performance which are more accurate than the TSR. Which ones were you talking about?

Martin Eastwood - October 31, 2014

Hi Michael, thanks for the message. Take a look at expected goals to start with :)

EI Match Predictions for the English Premier League

Martin Eastwood — Thu, 28 Mar 2013 19:30:00 +0000

After last week’s international matches, domestic football is finally back so here are this weekend’s match predictions using my EI predictive model. Let’s see if it can keep up its good form and continue to beat the bookmakers!

Home Team	Away Team	Home (%)	Draw (%)	Away (%)
Sunderland	Man Utd	10	29	61
Arsenal	Reading	74	16	10
Man City	Newcastle	70	18	11
Southampton	Chelsea	20	32	48
Swansea	Tottenham	27	32	40
West Ham	West Brom	30	32	37
Wigan	Norwich	43	30	26
Everton	Stoke City	60	24	16
Aston Villa	Liverpool	28	30	42
Fulham	QPR	59	25	16

Comments

How Accurate Are The EI Football Predictions?

Martin Eastwood — Thu, 21 Mar 2013 19:30:00 +0000

Introduction

Unfortunately time caught up with me last week and I was unable to post any predictions from my Eastwood Index. However, since then I have been busy validating the results to see how accurate the predictions really are using the 296 matches played in the English Premier League so far this season.

Ranked Probability Scores

I have previously discussed the problems of trying to determine the accuracy of probability-based models and Jonas posted a suggestion in the comments section recommending the use of ranked probability scores, which turned out to be a really interesting idea.

Ranked probability scores were originally proposed by Epstein back in 1969 as a way to compare probabilistic forecasts against categorical data. Their main advantage over other techniques is that as well as looking at accuracy, they also account for distance in the predictions, i.e. how far out inaccurate predictions are from what actually happened.

They are also easy to calculate. The equation for ranked probability scores is shown in Figure 1 for those of a mathematical disposition, where $K$ is the number of possible outcomes, and $CDF_{fc}$ and $CDF_{obs}$ are the cumulative forecast and observed probabilities up to outcome $k$.
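
For readers who prefer code to equations, here is a minimal sketch of the calculation for a single match. It assumes the commonly used form of the score, where the squared differences between the cumulative forecast and cumulative observation are summed and divided by $K-1$, and it assumes the outcomes are ordered home win, draw, away win. The function name and the example probabilities are illustrative only.

```python
from itertools import accumulate


def ranked_probability_score(forecast, outcome_index):
    """RPS for one match.

    forecast:      probabilities for the ordered outcomes, e.g. (home, draw, away)
    outcome_index: position of the outcome that actually happened
    """
    k = len(forecast)
    observed = [1.0 if i == outcome_index else 0.0 for i in range(k)]
    cdf_fc = list(accumulate(forecast))
    cdf_obs = list(accumulate(observed))
    # The final cumulative value is 1 for both series, so only the first
    # K-1 terms can contribute to the sum.
    squared = sum((f - o) ** 2 for f, o in zip(cdf_fc[:-1], cdf_obs[:-1]))
    return squared / (k - 1)


# Hypothetical forecast: 50% home win, 30% draw, 20% away win.
forecast = (0.5, 0.3, 0.2)
for index, label in enumerate(("home win", "draw", "away win")):
    print(label, round(ranked_probability_score(forecast, index), 3))
```

Because the comparison is made on cumulative probabilities, a forecast that favoured the home team is penalised more heavily for an away win than for a draw, which is where the "ranked" part of the name comes from.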

Interpreting Ranked Probability Scores

Ranked probability scores range from 0 to 1 and are negatively orientated, meaning that the lower the score the better. For simplicity, you can think of them as representing the amount of error in the predictions, where a score of zero means your predictions are perfect.

The Results

I started off looking at how well I would have done if I had just guessed at random for each match in the English Premier League this season rather than using the Eastwood Index and obtained a ranked probability score of 0.231.

Next, I looked at how well the bookmakers’ odds predicted matches. To do this I aggregated the odds from multiple bookmakers, partly to reduce the comparisons needed and partly because aggregating data often improves predictions and I wanted to give the Eastwood Index the toughest test possible. This gave a ranked probability score of 0.193 for the bookmakers.

Finally I calculated the score for the Eastwood Index and got…

drum roll please

a ranked probability score of 0.191. Okay, it is not much lower than the bookmakers but it does mean that so far this season the Eastwood Index has been more accurate at predicting football matches than the combined odds of the gaming industry which is really pleasing for me.

Conclusions

Most importantly though, this suggests that the Eastwood Index works. I had originally set myself the target of being able to compete with the bookmakers as I consider them to be the gold standard of prediction for football. These are large companies employing professional odds compilers to generate their odds so for me to be able to beat them, even by a small amount, using a bunch of equations is a big success for the Eastwood Index.

It is still early days and it is still a relatively small number of predictions (n=296) so I will be continuing to monitor the results to check the accuracy doesn’t change over time. It is a fantastic start though and great inspiration to continue developing and improving the Eastwood Index further!

Comments

amir - March 23, 2013

Have you calculated the RPS for the same 296 matches for the bookmakers too? Otherwise, you put them at disadvantage to begin with, as early matches are harder to predict. Also, have you used the data from this year’s EPL to improve the EI. If you did, you probably did over-fitting…

Martin Eastwood - March 23, 2013

Hi Amir

The EI and bookmakers’ RPS were tested using exactly the same set of matches.

The EI was developed using historical data from the EPL rather than relying on this season’s data, to prevent overfitting the model.

Thanks for leaving your comment.

amir - March 23, 2013

The results are very impressive then.

Looking forward to read more details about the EI methodology!

Martin Eastwood - March 23, 2013

Thanks!

Lars - January 20, 2014

The Ranked Probability Scores method looks straightforward and really meaningful.

Thanks for bringing it up, I plan to use it myself, too.

Martin Eastwood - January 20, 2014

Yes, it’s a great way to measure accuracy. I’m using it more and more for assessing football models now.

Lars - January 28, 2014

I have worked my way through Epstein’s paper now, a few comments:

Contrary to what is written here, a high RPS is good, not bad. Even without understanding the equations, you can see in the table 2 on page 987 of the paper that the score is 1 when the prediction is correct and is <1 the worse the prediction is.

Secondly, I have tried to work out where you get the simplified equation from that you show above.

In my eyes, for football (3-way result) it should rather be:

RPS = S – 0.5 * (P_d+2*P_a) in case of home win

RPS = S – 0.5 * (P_h+P_a) in case of draw

RPS = S – 0.5 * (2*P_h+P_d) in case of away win

where:

S = 1.5 – 0.25*(P²_h +(P_h+P_d)²+ (P_d+P_a)²+P²_a )

and:

P_h is the probability for a home win, P_d for a draw and P_a for an away win.

Maybe this can be corrected above or let me know where I am wrong.

Martin Eastwood - January 29, 2014

Perhaps the Epstein paper doesn’t make it particularly clear but the RPS is the sum of the squared differences between the forecast and observation. Therefore the more accurate the forecast then the smaller the difference is and the lower the RPS.

If you are interested in digging deeper into RPS then this book has quite a good chapter on it IIRC – http://www.amazon.co.uk/Statistical-Atmospheric-Sciences-International-Geophysics/dp/0123850223/ref=tmm_hrd_title_0?ie=UTF8&qid=1390988812&sr=8-2-fkmr0

Lars - January 29, 2014

I disagree. The “sum of the squared differences between the forecast and observation” does not take into account the ranking characteristic. And your formula from above does not either.

There would really be no reason to call it RPS if it was just a sum of squared differences.

What Epstein does makes sense and is clearly different. And I did not find that the paper leaves open questions or was not particularly clear.

Martin Eastwood - January 29, 2014

Have you checked whether Epstein has an additional 1- term in his paper? RPS typically ranges from 0-1, with 0 considered the perfect score. Or perhaps you’re looking at Ranked Probability Skill Scores where higher values are better?

Lars - January 29, 2014

Whether 0 or 1 is defined as perfect is a minor issue. It is more where the “ranked” comes in.

For that I have found on google a site that uses the same formula as you:

http://www.eumetcal.org/resources/ukmeteocal/verification/www/english/msg/ver_prob_forec/uos2b/uos2b_ko1.htm

And they add to their description that CDF is the CUMULATIVE distribution.

Now that makes more sense and I think you probably do the same it is just not mentioned above. I did not get this the whole time.

I see now why you do it and why it is called RPS. That’s fine.

Epstein still does something different though (see my comment above where I apply Epstein to football).

Martin Eastwood - January 29, 2014

Thanks Lars

Adam - March 17, 2014

Hi Martin,

Have you ever used the Ranked Probability Skill Score to evaluate your model rather than just the RPS? I’ve been using the RPS however I’ve seen on sites discussing probabilistic forecast verification, a mention of the RPSS, but i’m unsure as to how to apply it to football predictions. It compares the forecast to a ‘reference forecast’, and I’m wondering what the equivalent is in football (a sample climatology is the example given for weather forecasting). It’s defined here: http://www.cawcr.gov.au/projects/EPSverif/scores/scores.html

Cheers,

Adam.

Martin Eastwood - March 17, 2014

Hi Adam,

I’ve never tried the RPSS, perhaps you could use aggregated bookmaker odds as the ‘reference’ and compare against that?

Thanks

Martin

Ian - March 27, 2014

Hi Martin,

Just wanted to make sure I understand, is CDFfc, effectively your estimated probability of said outcome? As such, in a match where a bookmaker offered 1/2 on a team to win, this would be 0.67? Then the CDFobs would be either 0 or 1?

Am I right in thinking that therefore in theory, a coin toss prediction, where the prediction of 50% is in fact the perfect probability, would have an RPS of approximately 0.25?

Also, any idea why it divides by (K-1) rather than just K?

Cheers,

Ian

Ian - July 11, 2014

Hi Martin,

Did you have any thoughts on what I asked above?

Sorry to persist,

Ian

Martin Eastwood - July 12, 2014

Apologies Ian, I completely missed your comment.

In terms of the coin toss you wouldn’t use RPS as it is intended for when there are more than two possible outcomes. Instead, you would use Brier Scores which is effectively the same thing but for situations with two outcomes, giving the mean squared error of the forecasts. And yes, for the coin toss example you would expect a Brier Score of 0.25.

Not sure about the k-1 part, I’d have to go back and look at Epstein’s paper. I’ve not read it for a while..

Thomas - July 27, 2014

Hi Martin,

Why are your values so constant after the first matches? I would expect to see a spike for those matches where an underdog wins?

Bests and thanks for the great work,

Thomas

Thomas - July 27, 2014

Sorry the comment was intended for post: http://pena.lt/y/2013/05/21/did-the-eastwood-index-beat-the-bookmakers/

Martin Eastwood - July 27, 2014

It’s the average rps of all the forecasts so individual matches tend not to cause spikes due to the smoothing from the aggregation.

Is Brendan Rodgers Improving Liverpool?

Martin Eastwood — Wed, 13 Mar 2013 19:30:00 +0000

Introduction

As well as using my EI Index to predict future matches, it can also be used to look back at how teams’ performances have changed over time. An interesting example is Liverpool, who sacked Kenny Dalglish at the end of the 2011–2012 season to bring in Brendan Rodgers from Swansea City.

The green line in Figure One shows the weekly EI rating for Kenny Dalglish’s Liverpool team over the course of the 2011–2012 season, with the black line showing the moving three-match average. Up until around Christmas time Liverpool were making decent progress in terms of EI, improving from a rating of 2127 up to a peak of 2247 following their 3-1 victory against Newcastle United.

However, Liverpool’s form plummeted soon after that, with 11 losses out of their remaining 19 matches dragging Liverpool’s EI back down rapidly. Their worst performances in terms of EI were losses against Bolton Wanderers and Wigan Athletic, both of which Liverpool’s EI ratings suggested they should have had a good chance of winning. Despite a small flurry at the end of the season, Liverpool still finished with an EI lower than they started with.

In contrast, the red line in Figure One shows Liverpool’s weekly EI ratings under Brendan Rodgers, with the black line again showing the moving three-match average.

The first few matches of the season did not go particularly well for Rodgers and Liverpool’s EI dropped even lower than under Dalglish. The obvious narrative here is that Liverpool may have needed time to adjust to Rodgers’ tactical changes but it’s also worth noting they had a tough start to the season, with fixtures against Manchester City, Arsenal and Manchester United all within the opening few weeks.

Since then, Liverpool has shown a pretty steady increase in EI over the rest of the season. There have been a few drops along the way due to unexpected losses against teams such as Aston Villa and Stoke but their EI rating is currently on course to exceed Dalglish’s peak by the end of the season.

To put these numbers into context, Chelsea are currently in fourth position with an EI of 2464 while Tottenham Hotspur finished fourth last season with an EI of 2329. While Liverpool’s EI isn’t quite that high yet, if they can maintain their current rate of improvement then their EI rating suggests they have a decent chance of challenging for a Champions League place next season.

Addendum

In case anyone wonders why Brendan Rodgers’ starting EI is lower than Kenny Dalglish’s final EI – Brendan Rodgers lost his first match against West Bromwich Albion so the difference between the two is the loss in EI caused by that particular match.

Comments

How Much Risk Should a Football Manager Take?

Martin Eastwood — Mon, 11 Mar 2013 19:30:00 +0000

Introduction

How much risk should a football manager take if their team is the underdog in a match? Should they take on the opposing team to try and win the game or sit back and just try not to lose? The decision made is inherently linked to how risk averse the club’s manager is, but is there actually an optimal strategy to use when in this position?

Exploring tactics using the EI Index

My EI Index considers a team’s performance to be normally distributed around their true skill level so for any given match we can predict the probabilities that a team will perform above or below their average rating. By looking at how different tactics change these performance curves we can see how they affect each team’s chance of winning.

The average EI rating is 2000, with better teams having higher ratings. The larger the difference in EI between two teams, the greater the chance the higher rated team will outscore the lower rated team. However, since teams’ performances vary from match to match it is possible for the lower rated team to out-perform the higher rated team and win the match.

The Underdog

Looking at Figure 1 as an example, the underdog (orange) on average plays with an EI rating of 1800 and the favourite (blue) plays with an EI rating of 2000. Overall the favourite is expected to out-perform the underdog and win the match, yet in around 15% of matches the underdog will actually play above the favourite’s EI rating of 2000.

All Out Attack

So what happens if the underdog decides to play the more high risk strategy of attacking the match and going all out for the win? We would expect the more risk a team takes then the more variance we will see in their performances as they have more chance of scoring and yet more chance of conceding too.

Figure 2 shows what happens when the underdog’s variance doubles. Notice how there is now much more of the orange curve exceeding the favourite’s average EI rating of 2000. In fact, in this example the underdog now has around a 31% chance of playing above the favourite’s average and so is much more likely to win the match than before.

There is a down side though as there is also more orange distributed towards lower EI ratings meaning that although they have a greater chance of winning, the underdog has also seriously increased their chances of a humiliatingly large defeat.

Playing Safe

Let’s compare this to Figure 3, where the underdog sits back and plays conservatively, hoping they will not get beaten. This low risk strategy reduces their performance variance, meaning they are much less likely to out-perform their opponents and win the match. In fact, reducing their variance by a half cuts their chance of playing above the favourite’s average to just 2%, in exchange for maybe grabbing a draw or minimising the risk of an embarrassing defeat.
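
The three scenarios can be roughly reproduced with a normal distribution, as sketched below. The baseline spread of 200 rating points is my own assumption, since the value is not stated here; it was chosen because it gives figures close to those quoted when the spread (the standard deviation) is doubled or halved. The function name is hypothetical.

```python
from statistics import NormalDist


def chance_above(target, mean, sd):
    """Probability that a normally distributed performance exceeds target."""
    return 1 - NormalDist(mu=mean, sigma=sd).cdf(target)


UNDERDOG_EI = 1800
FAVOURITE_EI = 2000
BASE_SD = 200  # assumed spread in rating points, not a published EI parameter

scenarios = (
    ("baseline", BASE_SD),
    ("all out attack", 2 * BASE_SD),
    ("playing safe", BASE_SD / 2),
)
for label, sd in scenarios:
    p = chance_above(FAVOURITE_EI, UNDERDOG_EI, sd)
    print(f"{label}: {p:.0%}")
```

Under these assumptions the underdog plays above the favourite's average in roughly 16%, 31% and 2% of matches respectively, in line with Figures 1 to 3. Note that this only compares the underdog with the favourite's average rating; it ignores the favourite's own match-to-match variation.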

The Favourite

What about the favourite, how should they respond to a change in risk by their opponents? The optimal choice would appear to be a low risk approach that reduces the variance in their performance and minimizes the chance of playing at a level below the underdog’s expected performance (Figure 4). This means there is less chance of a glamorous, high-scoring win for them but, importantly, less chance of making a mistake and throwing an easy victory away.

The Real World

So what tactics should a manager choose? In the case of the underdog it surely makes sense to take the high risk strategy of attacking the match to increase their chances of winning all three points. The downside is of course that they are at more risk of losing by a heavier score line. But whether a team loses by one goal or by four goals, the net outcome in terms of points is the same – zero.

Over the course of a season it is much more beneficial to gain additional points at the risk of worse goal difference. One extra victory is worth more in league placement than having a superior goal difference to the teams around you. Plus, you need to hold on and scrape three draws to equal the benefit of getting that one extra win.

To counter this high risk approach, the favourite should then play safe to reduce their risk of a poor performance and try to maintain the relative difference in expected performances.

Overall this means we would expect the lower rated team to attack the game and take risks while the higher rated team plays safe and waits for the underdog to make an error.

Conclusions

Is this what actually happens in football though?

Personally, I suspect not. It is difficult to quantify the actual risk teams are taking in matches so it is impossible to say for sure but from personal observations it seems much more likely that the underdog will play safe to try and avoid defeat and hopefully grab a lucky draw even though they are then at a much lower chance of winning the match.

There are certainly times when this approach has worked, such as this season’s Champions League match when Celtic beat Barcelona against the odds. But it is perhaps not the best strategy long term over the course of a season to maximize a team’s outcomes.

So why would a team not play to an optimal strategy? There are likely many competing reasons of which one is that football managers are not statisticians and cannot necessarily be expected to view matches from a probability or risk-based viewpoint.

Another explanation also seems to be the public and media’s perception. One victory and three heavy losses are worth more in points and league placements than two draws and two narrow 1-0 defeats yet the manager presiding over the three heavy losses would come under much more criticism even though he had achieved more. A manager’s career is short and unstable – it doesn’t take much for a trigger-happy chairman to wield the managerial axe so rightly or wrongly many managers seem to be focussed on the goal of retaining their job ahead of anything else.

Luck and variation will always play a big role in a team’s results, that’s why football is so exciting, but playing the least risky strategy available may not be the best approach for the smaller teams. Sometimes being brave is best.

Comments

GoalImpact - March 11, 2013

I came to an equal conclusion here (sorry German)

http://www.goalimpact.com/2013/02/der-sturm-gewinnt-spiele-die-abwehr.html

Jupp Heynkes once said: Attack wins games, defense championships. The opposite is true for the underdog.

Martin Eastwood - March 12, 2013

Thanks for the link, I put it through Google translate and it looks like we came to similar conclusions that the underdog needs to take risks and attack the match rather than sit back and play safe.

Looks like a really interesting site, I look forward to reading more :)

Miguel - March 13, 2013

I disagree with your premise and think there is actually an optimized strategy for the underdog.

There is an assumption that a team that plays it safe does so in order to not get scored on. While this is true, they also play defensively because it increases their chance of scoring. The more defensive they play, the more numbers the favorite must send up to attack, the less numbers the favorite has defending, the more probability of scoring.

So when the underdog plays defensively, it not only decreases the chance to receive a goal but it also increases their chance to score one.

GoalImpact - March 13, 2013

I agree with your assessment. However, I’m not sure it necessarily contradicts Martin’s statement. IF playing defensive increases the goal difference, i.e. increasing the mean of the own distribution and/or decreasing the opponents, it may be worthwhile going for it. Otherwise, seeking their chances may be a better way to optimize the team’s number of points.

Martin Eastwood - March 13, 2013

Yes, it is more about managers being prepared to take the risks needed to win the match rather than being negative and playing to not lose. There may well be cases where that risk lies in playing defensively.

2ndMan - March 13, 2013

Good article, in terms of getting more points I certainly think it’s worth a risk, but agree that when you consider squad harmony and morale then playing it safe may be better for the long term. Allardyce’s “respect the point” comes to mind.

I disagree completely though Miguel, you say defending means the opposing team commit more men to the attack, but that also requires you commit more men to defend. Generally the team in possession is gonna have at least 1 more defender back than attackers you have forward, and the more men you bring back to defender the greater the opposing teams advantage (having 3 attackers v 4 defenders, down to 2 v 3 and 1 v 2).

Miguel - March 15, 2013

2ndMan, the arithmetic that you are using cannot be applied to how the game is realistically played. In other words, 1 defender does not cancel out 1 attacker or, even, 2 defenders cancel out 1 attacker. The game is played at a very fluid pace and when a team is playing very defensively and they are able to get a turnover in the middle of the field, they can exploit the space that is available. And the space, which is the key, is the difference maker.

http://www.youtube.com/watch?v=P2jq2NP2osM

Look at this video, it is a video of a series of counterattack goals by Real Madrid. I know they are one of the best teams in the world, but in almost everyone of those plays, the defenders have the numerical advantage. The huge disadvantage that the defense has is that they are running back to cover the space and it is that space, that is usually not present even when you are attacking with your whole team, that creates the offensive advantage. I would love to see a statistical analisis of success rate of a counterattack, but I can guarantee you a much bigger percentage of goals are scored when the defense is running back to cover the space in front of the box, than when they are positioned there to begin with, regardless of the numbers each team has attack or defense, and this is the offensive advantage that a team has when playing it safe and counterattacking, that they dont have when attacking with numbers.

Nick - March 20, 2013

There is also a mode of thought that the more “possessions” there are in a game, the more likely the better team is to take advantage of those possessions. I believe this came from the NBA.

Therefore an underdog who limits the changes of possession, either by keeping the ball, making the opposition “over-pass”, time wasting, slowing the game down, etc is actually shortening the length of the game and is increasing the chances of an upset.

Does anyone know if there are figures for the average number of team possessions in EPL, for example?

Martin Eastwood - March 21, 2013

Good points Nick.

I don’t think that data for average number of team possessions is currently available. As far as I am aware Opta calculate possession percentages from completed passes rather than the number of actual possessions each team has had during the match.

EI Match Predictions for the English Premier League

Martin Eastwood — Fri, 08 Mar 2013 19:30:00 +0000

Here we go again!

Last week was another success for the EI, with seven out of the ten predicted favourites winning their matches. It is still a bit early to be drawing too many conclusions but so far that is 14 out of 19 for the EI, which seems a pretty good start to me!

Jonas commented on my recent post discussing the difficulties of assessing probability-based models and suggested trying Ranked Probability Scores. That looks like a really good idea, so look out for it once I have a bit more data to play with.
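For anyone curious what a Ranked Probability Score looks like in practice, here is a minimal sketch for a single match. The function name `ranked_probability_score` and the outcome ordering (home, draw, away) are my own choices for illustration and are not part of the EI itself. Lower scores are better, with 0 meaning a perfect forecast.

```python
def ranked_probability_score(probs, outcome_index):
    """Ranked Probability Score for one match.

    probs: forecast probabilities in the assumed order (home, draw, away).
    outcome_index: 0 for a home win, 1 for a draw, 2 for an away win.
    """
    n = len(probs)
    observed = [1.0 if i == outcome_index else 0.0 for i in range(n)]
    cumulative_forecast = 0.0
    cumulative_observed = 0.0
    total = 0.0
    # Compare the cumulative forecast with the cumulative outcome; the last
    # category is skipped because both cumulative sums equal 1 there.
    for i in range(n - 1):
        cumulative_forecast += probs[i]
        cumulative_observed += observed[i]
        total += (cumulative_forecast - cumulative_observed) ** 2
    return total / (n - 1)


# A 48/24/28 forecast scored against a hypothetical home win.
print(round(ranked_probability_score([0.48, 0.24, 0.28], 0), 4))
```

Averaging this score over many matches gives a single number for comparing models, which is the part that needs more data than I have so far.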

It is only a small gameweek this week due to the FA Cup but here are the predictions anyway.

Home Team	Away Team	Home (%)	Draw (%)	Away (%)
Norwich	Southampton	48	24	28
QPR	Sunderland	36	27	37
Reading	Aston Villa	42	26	32
West Brom	Swansea	46	25	29
Newcastle	Stoke	49	24	27
Liverpool	Tottenham	38	27	35

Comments

Chris Pope - March 8, 2013

I am only a curious amateur stat freak , but love how close this weeks predictions are. Love the blog and am telling everyone about it.

Martin Eastwood - March 9, 2013

Thanks Chris :)

It is a really tough week to predict so I expect the accuracy of the EI to drop a bit, but over enough data it should all cancel itself out as there will be some easier weeks too.

EI Match Predictions for the English Premier League

Martin Eastwood — Fri, 01 Mar 2013 19:30:00 +0000

Since last week’s predictions turned out to be so popular I thought I would continue testing the EI index in public so here are this week’s predictions. Fingers crossed they turn out well again!

Home Team	Away Team	Home (%)	Draw (%)	Away (%)
Chelsea	West Brom	60	20	21
Everton	Reading	64	17	19
Man United	Norwich	76	10	14
Southampton	QPR	49	24	27
Stoke	West Ham	56	21	23
Sunderland	Fulham	42	26	32
Swansea	Newcastle	42	26	32
Wigan	Liverpool	31	27	42
Tottenham	Arsenal	44	25	31
Aston Villa	Man City	14	25	61

Comments

EI Match Predictions for the English Premier League

Martin Eastwood — Thu, 28 Feb 2013 19:30:00 +0000

The EI

Last week was a big test for my new EI Index – it had finally reached the point where I was confident it was working well enough to post its predictions in public.

For those people who haven’t come across it before, the EI index is a mathematical system I have been developing for ranking football teams and predicting the outcomes of matches. Using the EI it is possible to predict the odds for each team winning, drawing or losing its match.

Results

So how did the EI do?

Well, this is the tricky bit as the EI is a probability model. For linear models it is relatively simple to assess accuracy as you get an R-squared value showing you how well your predictions match the observed result. The higher the R-squared then the better you did.

For a probability-based model though you cannot do this. An obvious alternative is to just look at whether the model’s predicted favourites won their matches. On this measure the EI performed fantastically well, with seven of its nine predicted favourites winning, giving it an overall success rate of 78%.

But we need to be careful here as this can be a misleading way of looking at accuracy. Just because Manchester City had a 54% probability of beating Chelsea doesn’t mean they will win the match purely because it is the most probable result. Instead, it means that if this match was played 100 times then Manchester City would be expected to win 54 of them and not win 46 of them.

Rather than looking at the accuracy of the predicted favourite winning we really need to look at the accuracy of the predicted probabilities. Are teams ranked with a 50% chance of winning actually winning 50% of the time? Are teams ranked with a 25% chance of winning actually winning 25% of the time?

This can only be done by making lots and lots of predictions so over the coming weeks I will keep making them until I have enough predictions to get an estimate of how good they are.
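As a rough sketch of how that check could work once enough predictions exist, the code below groups predictions into 10% bands and compares the average predicted chance in each band with how often the predicted event actually happened. The helper name `calibration_table` and the input format are assumptions for illustration, and the data at the bottom is made up purely to show the output format.

```python
from collections import defaultdict


def calibration_table(predictions, bucket_width=10):
    """Compare predicted win percentages with how often the wins happened.

    predictions: iterable of (predicted_percent, won) pairs, e.g. (54, True).
    Returns {bucket_start: (mean_predicted_percent, observed_percent, count)}.
    """
    buckets = defaultdict(list)
    for percent, won in predictions:
        # Put 100% into the top bucket rather than a bucket of its own.
        start = min(percent // bucket_width * bucket_width, 100 - bucket_width)
        buckets[start].append((percent, won))

    table = {}
    for start in sorted(buckets):
        rows = buckets[start]
        mean_predicted = sum(p for p, _ in rows) / len(rows)
        observed = 100.0 * sum(1 for _, won in rows if won) / len(rows)
        table[start] = (mean_predicted, observed, len(rows))
    return table


# Made-up predictions, not real EI results.
toy = [(50, True), (55, False), (20, False), (25, True)]
for start, (predicted, observed, count) in calibration_table(toy).items():
    print(f"{start}-{start + 10}%: predicted {predicted:.1f}%, "
          f"observed {observed:.1f}%, n={count}")
```

For a well-calibrated model the predicted and observed columns should sit close together in every band, although each band needs a decent number of matches before the comparison means much.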

Overall though it is a very exciting start for the EI, bring on next week!

Comments

Jonas - March 1, 2013

Have you heard about the rank probability score (RPS)? It seems, to me at least, to be a reasonable measure of how well a probabilistic model fares.

You should check out the paper “Solving the Problem of Inadequate Scoring Rules for Assessing Probabilistic Football Forecast Models” by Anthony Costa Constantinou and Norman Elliott Fenton.

Martin Eastwood - March 1, 2013

Thanks Jonas, that looks really interesting!

GoalImpact - March 12, 2013

Hi Martin,

Thanks for sharing your approach. This appears to be more sound to me than most of the rankings around. One easy way to check your prediction quality would be the power stat. Just sort the predicted outcomes by probability and make an xy chart with the sorted predictions on the x axis and the cumulative real outcome on the y axis. Then calculate the Gini on that graph.

Martin Eastwood - March 12, 2013

Good idea, I like the sound of that. Will be giving that a try :)

EI Match Predictions for the English Premier League

Martin Eastwood — Thu, 21 Feb 2013 19:30:00 +0000

For a bit of fun, here is a trial run at predicting this weekend’s EPL matches using my EI ratings. I haven’t compared these with anyone else’s odds yet but they generally look about what I would have expected.

Poor QPR don’t seem to have much chance of holding out against Manchester United; even playing at home they are only rated as having a 9% chance of winning.

It looks like it could be a good weekend for Arsenal to bounce back from their Champions League defeat as they have a massive 67% chance of beating Aston Villa.

Personally, I am surprised Newcastle are rated quite so highly against Southampton. I wonder if this may be due to Newcastle being so strong last season while Southampton only have this season’s data for generating EI ratings from? If so, it may be that I need to go back and tweak the equation weightings slightly to account for situations like this.

Match	Home (%)	Draw (%)	Away (%)
Fulham Vs Stoke	46	25	29
Arsenal Vs Aston Villa	67	15	17
Norwich Vs Everton	29	27	43
QPR Vs Man United	9	23	68
Reading Vs Wigan	43	25	31
West Brom Vs Sunderland	47	24	28
Man City Vs Chelsea	54	22	24
Newcastle Vs Southampton	53	22	25
West Ham Vs Tottenham	19	27	54

Comments

JVent - February 25, 2013

Awesome EI predictions! You only missed two, but nobody could have known that Reading’s Pavel Pogrebnyak was going to receive a red card, so really that is only one wrong prediction. I hope you share the formula later on once you have perfected it.

Also, you have one of the most interesting blogs about football.

Martin Eastwood - February 25, 2013

Thanks! Was definitely a good start, let’s see if it continues :)

Casey - February 27, 2013

Oh goodness I would love to apply this to MLS. New season starts Saturday. No new teams.

Introducing the Eastwood Index

Martin Eastwood — Thu, 21 Feb 2013 19:30:00 +0000

Introduction

My past couple of articles have focused on Elo ratings and how they can be applied to football teams to rank them against each other and to estimate win probabilities.

Problems with Elo Ratings

On the whole the Elo system works okay, but it was not designed with football in mind and so there are some issues with it. For example, it can only handle two distinct outcomes – winning and losing.

Elo ratings try to get around this problem by considering each draw to be half a win and half a loss. However, this means that the win probabilities calculated using the Elo equation are actually the probability of winning or drawing versus the probability of losing or drawing, which isn’t particularly useful.

For a game such as chess, which Elo ratings were originally developed for, this may not be too much of an issue as tied matches are comparatively rare but in football draws are a common occurrence so we really need to be able to model three outcomes – win, loss and draw.

The Eastwood Index

So instead of combining draws with wins and losses, we need to be able to calculate their probabilities individually. To do this I have been developing my own ranking system, which for want of a better name I am currently calling the Eastwood Index, or EI for short. (It feels rather pretentious to be naming it after myself, so if anyone has any better names for it then feel free to let me know!)

The Eastwood Index allows football teams to be ranked using a mathematical rating system that evaluates relative strength based on previous performances weighted so that more recent matches have a greater impact on a team’s ranking.

Methodology

Teams’ EI ratings are scaled so that the average rating is 2000. The higher the rating, the better a team is compared with the rest of the league.

EI ratings increase when a team wins a match or draws against superior opposition. Conversely, EI ratings decrease when teams lose matches or draw against weaker opposition. The size of this increase or decrease in ratings is linked to the quality of the opposition. For example, beating a superior team is worth more than winning against a lower ranked team.

The change in EI rating is also weighted by the score line so that the greater the difference in goals scored or conceded, the greater the change in ratings. Home advantage is also included in the calculations so that a team is expected to perform better at home than away.

So far this all sounds similar to an Elo rating. However, the Eastwood Index has a major advantage over Elo in that it is multinomial, meaning it can function with multiple outcomes. This makes it possible to accurately calculate the probabilities of teams winning, drawing or losing matches.

A further advantage of the Eastwood Index is that it does not rely on the Logistic distribution the same way Elo ratings do. The use of the Logistic distribution in Elo ratings originates from chess, where it was considered to predict chess outcomes reasonably well. Football and chess are different games with different outcomes so instead the Eastwood Index uses custom curves developed using football data. This means that predictions for football should be more accurate using the EI compared with Elo ratings.

Example

The underlying mathematics for the EI is completely different to how an Elo rating is calculated but rather than wade through a list of equations it is simpler to show how it works using the recent Liverpool versus Swansea match.

Prior to the game Liverpool had an EI rating of 2151 compared with Swansea’s rating of 1891. Team performances are considered to be normally distributed around their rating so on any given day a team may play above or below their true skill level. By comparing the distribution curves for the two teams we can then calculate the probabilities of each outcome of the match before it is played.

Although both teams have similar ratings Liverpool has the home advantage giving them overall a 52% chance of a win compared with a 25% chance of Swansea winning and a 23% chance of a draw (Figure 1).

Figure 1: Predicting Liverpool Versus Swansea City

We can also use these probabilities to calculate the expected points from the match. If these two teams were to play the same match repeatedly then on average Liverpool would be expected to earn (0.52 * 3) + (0.23 * 1) = 1.79 points while Swansea would be expected to earn (0.25 * 3) + (0.23 * 1) = 0.98 points.
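The expected points arithmetic above can be written as a couple of lines of code. This is just a sketch of that one calculation using the probabilities quoted for this match; the function name `expected_points` is my own choice for illustration.

```python
def expected_points(p_win, p_draw, win_points=3, draw_points=1):
    """Average points a team would earn if the match were replayed many times."""
    return p_win * win_points + p_draw * draw_points


# Probabilities from the Liverpool versus Swansea example.
liverpool = expected_points(0.52, 0.23)
swansea = expected_points(0.25, 0.23)
print(round(liverpool, 2), round(swansea, 2))  # prints 1.79 0.98
```

Note that the two values do not add up to 3 because a draw only hands out two points in total.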

Once we know the actual result we can then update the EI for each team based on their current ratings and the score line, which was Liverpool 5 – 0 Swansea. Since Liverpool already had a higher EI rating and had beaten somewhat lesser opposition they would expect only a small rise in their EI but taking into account their high score in the match Liverpool’s rating moves up to 2183 while Swansea’s falls to 1859.

Conclusions

The EI Index offers a potentially superior way of rating football teams compared with other ranking systems, with the advantage that it can predict wins, losses and draws, and uses mathematics specifically designed to accurately model football data.

I will be discussing the EI in more detail in future posts and showing how it can be used to analyse and predict football matches.

As ever, get in touch if you have any comments or questions!

Comments

Lars - February 21, 2013

Hello Martin,

I admire the courage to come up with a completely new ranking system. There are a lot of things to be taken into consideration if you want to set up a solid theoretic basis for such a complex problem. That is why I shy away from developing something completely new and rather stick to Elo, certainly not perfect but still very good in my eyes.

Please tell us more about the maths behind it so that substantiated comments are possible. Until then, let me give you my first thoughts I had when reading this:

1) It seems that by using the 3-point rule for the ranking, you leave the ground of zero-sum-games. This would imply that two teams can increase (or decrease) their average ranking just by playing each other. I wonder if that is intended?

2) If you do not want to be pretentious, name it after what it does or its unique feature (multinomial or whatever).

Martin Eastwood - February 21, 2013

Hi Lars,

Yes Elo is certainly adequate and I am not trying to criticize it. Rather I am just trying to improve things further although I am sure there is still more to do as this is just the first version of the system. In answer to your questions:

1) The mathematics is designed to ensure the system is zero-sum so the average rating for the league will always be 2000

2) perhaps the Multinomial Football Index? I may just stick with EI and then just avoid referring to what the E stands for :)

Rob - February 21, 2013

Hi Martin

Enjoy your blogs and find them very interesting .

Just trying to get my head around your example of Liverpool v Swansea. If I am correct then you award extra points for a 5-0 win (goals) but Swansea had rested the majority of the team if memory serves me correctly so could you take that into account when awarding points or is Subjectivity a dangerous state to avoid when putting together ratings ?

Martin Eastwood - February 21, 2013

Thanks Rob. At the moment it is based purely on the actual match result, so far I haven’t found a good way to quantify whether a team has put out a weakened squad for a match.

Baloo - February 21, 2013

I use a similar rating system (purely to assess opposition strength) and if I had a team rated 250 points higher, it would mean they are around 1.25 goal favourites. Add on home advantage and you get Liverpool in at 1.65 goal jollies (ie roughly 73.5% to win the game).

My pricing method is a lot more complicated but I also had Liverpool around 1.65 goal favourites and bet accordingly.

How did you get to 52%?

Martin Eastwood - February 22, 2013

The 52% is based on the difference between the two team’s performance distribution curves. For such closely matched teams though 73.5% sounds slightly high?

Baloo - February 22, 2013

What do you mean by performance distribution?

I would strongly disagree that they are closely matched on performance (or even just shot) data, which is essentially what drives the betting markets. They are closely matched only in pure results terms.

Liverpool continually divide opinion however, and I have to admit I’m in the minority when it comes to rating them.

They are an outlier, just as Man Utd are also an outlier but at the opposite end of the spectrum.

Martin Eastwood - February 22, 2013

The model considers individual performances to be distributed around the team’s current EI rating. Yes Liverpool are quite a controversial team to rate, personally I never think the odds for them look how I would expect them to. I agree about Man Utd they are a huge outlier this season and based on many of their individual stats they would not be expected to be where they are.

Vasilis - August 30, 2013

Hi, I have a question. At the beginning of a season, do you make a subjective evaluation of all teams, do you start all teams with 2000 points, or do you rely on last year’s ranking to handicap better teams? And once the start-up points have been awarded, does your system take into account purely results, or do you feed it with subjective criteria as well?

Martin Eastwood - August 30, 2013

All teams initially started off equal on 2000 points. Promoted teams then take over the equivalent relegated team’s rating and the other teams’ ratings carry over from one season to the next. At the moment it is based purely on results but at some point I plan to investigate whether subjective criteria can help account for changes, e.g. manager changes / transfers / summer breaks etc.

Vasilis - August 30, 2013

Nevertheless, don’t you think that not all teams have the same probability of winning the trophy? I mean, if you figured out a way to evaluate odds for each team winning 1st place, and then compiled them around 2000, but with weights in order to have a more realistic initial point, wouldn’t that be more accurate? Also, does your model take into account the total number of teams participating in the league? I mean, do you use the 2000 initial points for the Scottish Premier League (12 teams), and for the English Premier League (20 teams) too? I guess that a team losing all its matches would have its points limit close to zero, but not negative, correct? Hence shouldn’t the 2000 points be adjustable for each league?

Martin Eastwood - August 30, 2013

When I initially set up the model I ran it over the three previous seasons’ data so that all the teams’ ratings had time to move from 2000 to values that represent each team’s strength, so there is no need to weight them.

The effect of league sizes is something that intrigues me as ideally I would like to run the model over multiple leagues and seasons and have ratings comparable between them all, so some form of weighting will be required, although I am not sure of the best solution. Still trying to find an answer that I am happy with.

Nick - April 16, 2014

EI is great but your statement that draws in chess are relatively rare is not correct. “For a game such as chess, which Elo ratings were originally developed for, this may not be too much of an issue as tied matches are comparatively rare but in football draws are a common occurrence” Actually, quite the opposite – no other game has as high ratio of draws as chess, over 50% of chess games played at high (professional) level are drawn. https://en.wikipedia.org/wiki/Draw_(chess)#Frequency_of_draws

Martin Eastwood - April 16, 2014

I stand corrected Nick :-)

Johan - April 23, 2014

Hi Martin.

I’m relatively new to your blog but reading a new article every morning on my way to work. Interesting content (and comments).

Your system the EI index sounds somewhat similar to Paul Steele’s power ratings. Would you mind sharing the formula (like Steele) so it’s possible to compare their performance on the same data sets?

I found Mr. Steele’s system to work quite well especially for Home wins (roughly 66% correct) but it would be good fun if there was a system that could beat it especially on away wins (roughly 40%).

I can send you his formula if you like (pls advise me of your email).

Best regards,

Johan

Empecinator - July 7, 2014

Hi Martin,

Thanks for the blog as it is very interesting and easy to read. However, same as Baloo, I don’t see how you get to the 52% chance of victory for L’pool, i.e. how do you infer the multinomial probabilities from the Win-Lose calculated with the Ea logistic formula (explained in a previous article)?

I understand it can be calculated from the overlay distributions but will you (or have you) publish it?

Cheers

Martin Eastwood - July 12, 2014

Hi Juan,

I’ve moved away from the EI as I couldn’t find a good way of forecasting scores from it so while it worked well for 1X2 probabilities it was not much use for Asian handicaps. It’s unlikely I’ll bother publishing it now since I no longer use it or keep it up to date.

Understanding Elo Ratings Part Two

Martin Eastwood — Thu, 07 Feb 2013 19:30:00 +0000

Introduction

Now that we understand the theory behind Elo Ratings, let’s take a look at how to calculate them and how to make them more relevant to football.

Calculating Elo Ratings

The equation for calculating a team’s Elo rating is shown below in Figure 1, where $Ra_{new}$ is the team’s new Elo rating after a match, $Ra_{old}$ is the team’s previous Elo rating before the match and $k$ is a weighting factor. $Sa$ is the outcome of the match normalised to the range 0–1 so that 0 is a loss, 0.5 is a draw and 1 is a win.

$Ra_{new}=Ra_{old}+k(Sa-Ea)$

Figure 1: Elo Rating equation

$Ea$ is the expected probability of the team winning the match and is calculated using the equation in Figure 2 where $Rb-Ra$ is the difference in Elo ratings between the two teams.

$Ea=\frac{1}{1+10^{(Rb-Ra)/400}}$

Figure 2: Expected win probability equation
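To make the two equations concrete, here is a minimal sketch of them in code. The function names are my own for illustration, and the default of 15 for $k$ is simply the value reported later in this article for the English Premier League rather than anything universal.

```python
def expected_score(rating_a, rating_b):
    """Ea from Figure 2: expected result for team a against team b."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_rating(rating_a, rating_b, result_a, k=15):
    """Figure 1: new rating for team a.

    result_a is Sa: 1 for a win, 0.5 for a draw and 0 for a loss.
    """
    return rating_a + k * (result_a - expected_score(rating_a, rating_b))


# Two equally rated teams: the winner gains k / 2 points.
print(expected_score(1500, 1500))    # prints 0.5
print(update_rating(1500, 1500, 1))  # prints 1507.5
```

Because the two expected scores for a match always sum to one, whatever one team gains under this update the other loses.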

Win Expectancy

The calculation for $Ea$ is actually slightly different from the original Elo equation as it uses a logistic distribution for player performances rather than a normal distribution. The use of the logistic distribution stems from the chess community, who suggested that it fit player performances better than the normal distribution did. In practice, the differences between the two are relatively minor, with the logistic curve placing more performances in the tails of the distribution, meaning players are slightly more likely to over- or under-perform (Figure 3).

Figure 3: Comparison of logistic and normal distributions

Weighting Factor

The constant $k$ in the equation controls how many points are gained or lost each match. Increasing $k$ will apply more weight to recent matches while lowering it will allow historic matches to have more of an effect on a team’s Elo rating. Therefore, using an inappropriate value for $k$ may lead to inaccurate Elo ratings being calculated.

Eloratings.net is a website that applies Elo ratings to international football. They use a weighting of 60 for a World Cup final, 50 for continental championship finals and major intercontinental tournaments, 40 for World Cup and continental qualifiers, 30 for all other tournament matches and 20 for international friendly matches. However, since none of these weightings apply directly to domestic football and since Eloratings.net does not explain how they were determined, I decided to calculate my own.

Using Least Squares I optimized the value of $k$ to minimize the error of the predicted outcomes versus the actual match results using data from the English Premier League. Overall, the most accurate predictions were obtained using a value of 15 for $k$.

Figure 4: Effect of k on error of Elo prediction
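For readers who want to try something similar, here is a hedged sketch of one way to search for $k$. It is not the exact procedure behind Figure 4: it assumes a hypothetical list of matches in date order, starts every team on 1500, ignores home advantage and goal difference, and simply tries a grid of candidate values, keeping the one with the smallest sum of squared errors between $Sa$ and $Ea$.

```python
def expected_score(rating_a, rating_b):
    """Expected result for team a against team b."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def squared_error_for_k(matches, k, start_rating=1500.0):
    """Replay the matches with a given k and sum the squared prediction errors.

    matches: list of (home, away, home_result) in date order, where
    home_result is 1 for a home win, 0.5 for a draw and 0 for an away win.
    """
    ratings = {}
    error = 0.0
    for home, away, home_result in matches:
        r_home = ratings.get(home, start_rating)
        r_away = ratings.get(away, start_rating)
        e_home = expected_score(r_home, r_away)
        error += (home_result - e_home) ** 2
        # Zero-sum update: whatever the home team gains the away team loses.
        change = k * (home_result - e_home)
        ratings[home] = r_home + change
        ratings[away] = r_away - change
    return error


def best_k(matches, candidates=range(5, 105, 5)):
    """Return the candidate k with the smallest total squared error."""
    return min(candidates, key=lambda k: squared_error_for_k(matches, k))


# Made-up results just to show the call; real use needs seasons of data.
toy_matches = [("A", "B", 1), ("B", "C", 0.5), ("C", "A", 0), ("A", "B", 1)]
print(best_k(toy_matches))
```

A proper optimiser would do the same job as the grid search, but a grid has the advantage of giving you the whole error curve to plot.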

Goal Difference

Another modification we can make so that the Elo ratings are more applicable to football is to take into account the number of goals scored, so that beating the opposition by two goals, for example, is better than winning by just one.

We can do this by scaling $k$ by the goal difference so that the larger the difference the more points are gained by the victor and the more lost by the loser. There are a number of ways this can be done but in my method each additional goal a team scores becomes increasingly less important. For example, going from 1–0 to 2–0 is much more critical in terms of winning a game than going from 8–0 to 9–0.

Eloratings.net used a similar approach where their scaling reduces the weightings for goal differences of two and three. However, for goal differences of four upwards their scale (intentionally or unintentionally) becomes linear and from then on applies equal weightings to each additional goal scored. Instead, I have used a sigmoid function to smoothly reduce the weightings of each goal scored to create the curve shown in Figure 5, which is then used to produce the scaling factors shown in Table 1.

Figure 5: Goal difference scaling factor smoothed using a sigmoid

Goal Difference	Scaling Factor
10.00	2.99
9.00	2.88
8.00	2.77
7.00	2.64
6.00	2.49
5.00	2.32
4.00	2.11
3.00	1.85
2.00	1.51
1.00	1.00

Table 1: Goal difference scaling factors
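As a sketch of how Table 1 might be applied, the snippet below looks up the scaling factor for a result and multiplies $k$ by it. The values are copied from the table, but the function name, leaving $k$ unscaled for draws and capping margins above ten goals at the ten-goal factor are all my own assumptions for the example.

```python
# Scaling factors copied from Table 1, keyed by goal difference.
GOAL_DIFFERENCE_SCALING = {
    1: 1.00, 2: 1.51, 3: 1.85, 4: 2.11, 5: 2.32,
    6: 2.49, 7: 2.64, 8: 2.77, 9: 2.88, 10: 2.99,
}


def scaled_k(k, goal_difference):
    """Scale k by the margin of victory using the factors in Table 1."""
    margin = abs(goal_difference)
    if margin == 0:
        return k  # Assumption: a draw leaves k unscaled.
    margin = min(margin, 10)  # Assumption: cap at the largest tabulated margin.
    return k * GOAL_DIFFERENCE_SCALING[margin]


# With k = 15, a 2-0 win is worth roughly half as much again as a 1-0 win.
print(scaled_k(15, 1), round(scaled_k(15, 2), 2))
```

The scaled value then takes the place of $k$ in the rating update, so a heavy win moves both teams’ ratings further than a narrow one.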

Home Advantage

If two teams with equal Elo ratings play each other then in theory they should both have an identical chance of winning the match; however, in football the home team typically has a noticeable advantage.

Looking back at the 2011–2012 English Premier League season, home wins accounted for 47% of results compared with just 24% for away wins. The remainder of the results are draws, which Elo ratings consider to be half a win, so including these gives us a final win expectancy of 61% for the home team and 39% for the away team.

To account for this we can give the home team’s Elo a temporary boost of 75 points. For two equally matched teams this then raises the win expectancy for the home team from 50% to 61%, matching what we see in the English Premier League.
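The effect of the 75 point boost can be checked with a few lines of code. This is only a sketch of the calculation described above; the function names are mine, and the boost is added to the home rating for the prediction only, not stored.

```python
def expected_score(rating_a, rating_b):
    """Expected result for team a against team b."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


HOME_ADVANTAGE = 75  # Temporary boost for the home team, in Elo points.


def home_win_expectancy(home_rating, away_rating, home_advantage=HOME_ADVANTAGE):
    """Win expectancy for the home team after applying the temporary boost."""
    return expected_score(home_rating + home_advantage, away_rating)


# Two equally rated teams: 50% without the boost, about 61% with it.
print(round(expected_score(1500, 1500), 2))       # prints 0.5
print(round(home_win_expectancy(1500, 1500), 2))  # prints 0.61
```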

Relegations and Promotions

Another issue to consider is how to deal with relegations and promotions. We could calculate Elo ratings for each tier of the league so that a team already has a rating when it gets promoted or alternatively we could award each promoted team the average Elo rating of 1500. A nice feature of Elo ratings is that they are self-correcting so although these arbitrary ratings may not be accurate they would gradually alter to the correct level.

This does have the unfortunate side effect of skewing the other teams’ Elo values though. The gain and loss of Elo points is zero sum, meaning that for every Elo point a team gains another team has to lose one. So adding in teams with different Elo ratings would distort the other teams’ ratings by altering the overall number of points available in the league.

The simplest way to deal with this problem is to give the promoted teams the equivalent relegated team’s Elo rating. So the best promoted team takes the Elo rating of the best relegated team, the second best promoted team takes the Elo rating of the second best relegated team, and so on. This then keeps the correct number of Elo points in the league and maintains the parity in points between teams.
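The rating handover can be sketched in a few lines. The function name and the team names below are hypothetical; the only logic is pairing each relegated team with the promoted team of the same rank and passing its rating across, which leaves the league’s total points unchanged.

```python
def swap_in_promoted(ratings, relegated, promoted):
    """Give each promoted team the rating of the equivalent relegated team.

    relegated and promoted are lists of team names ordered best first.
    Returns a new ratings dict; the input is left untouched.
    """
    if len(relegated) != len(promoted):
        raise ValueError("need the same number of relegated and promoted teams")
    updated = dict(ratings)
    for old_team, new_team in zip(relegated, promoted):
        updated[new_team] = updated.pop(old_team)
    return updated


# Hypothetical league of four teams with two going down and two coming up.
ratings = {"Alpha": 1620, "Beta": 1510, "Gamma": 1460, "Delta": 1410}
new_ratings = swap_in_promoted(ratings, ["Gamma", "Delta"], ["Epsilon", "Zeta"])
print(new_ratings)
print(sum(ratings.values()) == sum(new_ratings.values()))  # prints True
```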

Conclusions

Elo ratings are a really quick and easy way to compare teams directly and calculate win expectancies. While techniques like the Pythagorean Expectation look at how teams perform over a long period of time, Elo ratings can be used to look at teams on a match–by–match basis.

Comments

Lars - February 7, 2013

Thanks for this article.

I invite you to have a look at my website, where I am doing Elo ratings for European club football.

A lot of the conclusions I have come to are similar to yours.

My least-squares curve for the weighting factor however looks a bit different, with a minimum at k=20 and not as symmetrical.

Glad to see that the Elo system becomes more and more popular.

Martin Eastwood - February 7, 2013

That is a really nice website Lars :)

It’s also good to see we have come to similar conclusions with regards to k factor etc.

Your use of the Poisson looks interesting too. I have played around with various Poisson models before but I have not tried combining it with the Elo before, an intriguing idea!

Stefan - February 7, 2013

Very interesting article, thanks!

I was wondering about two things:

1) In Figure 3, what are the parameters of the logistic distribution? Is it plotted for the same mean and variance as the normal distribution? Maybe this should be stated in the text.

2) What would happen if you used the actual match result, in particular the fraction of total goals scored, for “Ea” instead of just 0, 0.5, 1 for loss, draw, or win, respectively? For example, if the match ended 2-1, team 1 would have scored 66% of the goals, so the actual outcome would be 0.66 (and 0.33 for team 2). This would also naturally account for the weighting of goal differences, since the fraction of goals scored is much more similar when comparing 5-1 and 6-1 wins than for 2-1 and 3-1 wins. However, then one obviously has to deal with results where only one team scores and the outcome is always 100%/0% regardless of the goal difference.

I remember that the European eSports Leage ESL uses a similar ranking system for online games, see: http://www.esl.eu/eu/faq/rankmodules/ (interestingly also for the FIFA games^^) As far as I know, they added an offset of 1 to the match result, such that 1-0 is actually interpreted as 2-1, 2-0 as 3-1, and so on.

Martin Eastwood - February 7, 2013

1) Yes, the two distributions should be comparable with each other

2) That is a really interesting idea, I will have a play around with that and see if it works.

Thanks for the comments Stefan!

Stefan - February 7, 2013

If I had to guess, it could be that this underestimates the value of a close win (and I assume the majority of wins are with one goal difference). If this is the case you could play around with a sigmoid that pushes the actual match outcome away from 0.5 towards 0/1, but still allows for some gradual changes. Intuitively speaking, it should make sense to include the full information of the match score, which is more than just a ternary event (win, loss, or draw).

On a related note, it would be interesting to analyze whether the absolute goal difference or the goal ratio carries more predictive information.

Ian - September 2, 2013

Hi Martin, Can you please explain in layman’s terms how you calculated the optimal K value?

Same thing for the scaling factor – if it’s not too difficult.

Martin Eastwood - September 15, 2013

Most stats packages will have some sort of optimisation routines built in, e.g. optim in R or Solver in Excel. These will iteratively work through a range of numbers looking to minimise or maximise some value for you. So for example you would look to minimise error or maximise likelihood to get the optimal value.

Ian - October 19, 2013

Thanks Martin, Just saw your reply now.

Can you (or Mick) run me through how to use solver to get the optimal K value? I’d really appreciate it.

Currently I’m regressing every team’s rating to the mean after each season. Would you guys recommend that I continue to do that after finding the optimal K value? Or would that become unnecessary?

Michael Podger - October 16, 2013

Hello Martin, great article

I analysed 4 years of A-League football (about 550 matches), using Solver to minimise the average error between the predicted and actual margin. The optimum K value was 75! Can you think of any reason why our K values would be so different ?

Thanks Mick

Martin Eastwood - October 16, 2013

Some leagues do seem to optimise to different k values. I found MLS to be quite different too due to its high level of parity so perhaps something similar with A league?

Mick Podger - October 18, 2013

Thanks Martin, thats probably it. A-league has a lot of equalisation measures which create more variability from year to year than you’d probably get in Premier League.

Nick - January 14, 2014

Hello. I don’t understand your probability formula. EA + EB = 1, but we have three possible results.

Martin Eastwood - January 20, 2014

Elo ratings only have two outcomes – win and loss – so the draw probability gets split between the two.

Adam - April 12, 2014

I noticed that the Ea formula here is different from that in Wikipedia:

http://en.wikipedia.org/wiki/World_Football_Elo_Ratings

Any suggestions?

Martin Eastwood - April 12, 2014

If I remember correctly, my version is a modification to the original elo to use logistic regression to improve its accuracy a little bit more.

Adam - April 12, 2014

Thank you for your kind reply Martin! I’ve got another question if you do not mind.

Does Rb stands for the Away team and Ra for the Home team?

Martin Eastwood - April 12, 2014

No problem Adam, yes Rb should be the away team

Adam - April 12, 2014

Many thanks Martin!

Nathan Hause - August 12, 2014

First off I’d like to say great series of articles. I found them extremely helpful and surprisingly easy to understand.

I am commenting to inquire as to your success predicting wins/losses/draws with this model. After some minor tinkering with your Ea formula, some of the major statistics you outlined (home win %, away win %) and with some help from another site that had Elo ratings I was able to fashion together a prediction model. It was able to accurately predict the outcome for 11 of the final 14 matches in the EPL season including draws (I didn’t have enough data to go back further and test it). I am wondering if this a relatively high percentage or if you have been able to make a model that has more success.

Your input would be greatly appreciated, thanks.

Martin Eastwood - August 13, 2014

Hi Nathan – yes that looks a very high success rate! It’s a small sample size though so really you need to test it over more matches to see whether that level of accuracy is sustainable. Let me know how you get on with it :-)

Understanding Elo Ratings

Martin Eastwood — Thu, 31 Jan 2013 19:30:00 +0000

What are Elo Ratings?

The Elo rating system was originally devised by Arpad Elo as a way to calculate the relative skill levels of chess players. Although the system was created specifically for chess, it has also been adapted to many other games and sports, including international football.

How Do They Work?

The fundamental principle behind Elo ratings is that the performance of a team in each match can be considered a random variable sampled from a normally distributed population centred on the team’s true skill level. Although performances will vary from match-to-match, the true skill level of the team is likely to only change slowly over time so can be considered to be the mean value of all their performance values.

For example, Figure 1 shows a team with an Elo rating of 1500. On any given day their actual performance could vary from anywhere below 1000 to above 2000. But over a reasonable period of time their performances will average out to 1500.

Figure 1: Possible performances for a team with Elo of 1500

Why are Elo Ratings Useful?

Elo ratings have no units and taken in isolation their specific values are of little interest. However, they become useful when comparing teams together as they can be used to determine the expected outcome between two teams based on the difference between their Elo ratings.

The range used for Elo ratings is somewhat arbitrary with Elo himself suggesting they should be scaled so that a difference of two hundred points equates to the higher ranked team having a win probability of 75%. In addition, Elo ratings are generally scaled so that an average team has a rating of 1500.

Predicting Match Results Using Elo

Plotting two teams’ Elo distributions together gives a nice way of visualizing their expected performances. Figure 2 shows Team 1 with an Elo rating of 1100 compared with Team 2 with an Elo of 1500. The most likely outcome is that both teams will play to their average ratings and so Team 2 will win overall as they have the higher rating. However, the two teams’ performance distributions overlap, so it is possible for Team 1 to outperform Team 2 and win the match.

Figure 2: Comparison of two teams’ Elo performance probabilities

The more these performance distributions overlap, the greater the chance of the lower rated team winning the match. The actual probability of victory can then be calculated from these two distributions by subtracting one from the other to get the normal difference distribution between the two (Figure 3).

Figure 3: Probability of Elo differential occurring

The centre of this new distribution is equal to the difference between the two ratings (1500 – 1100), meaning the most likely outcome is that Team 2 play like a team with an Elo rating four hundred points higher than Team 1. As we move further to the left the difference between the two teams decreases until we reach a negative differential, at which point Team 1 actually start to play better than Team 2, albeit with a low probability of occurrence.

The actual probability of this occurring can be plotted using cumulative frequency to show the overall chance of winning based on the Elo differential (Figure 4). So for our example above, Team 1 with its differential of -400 actually has around a 9% chance of winning the match while Team 2 has a 91% chance of winning.

Figure 4: Probability of winning based on Elo differential
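As a rough check on the figures quoted above, the sketch below computes the win probability for a given Elo differential in two ways. The function names and parameter values are assumptions for illustration rather than anything taken from this post: `win_prob_normal` assumes each team's performance has a standard deviation of 200 points, and `win_prob_logistic` uses the widely used logistic approximation with a scale of 400.

```python
import math


def win_prob_normal(diff, sd=200.0):
    """Chance of out-performing the opponent when both performances are normal.

    The difference of two independent normal performances with the same
    standard deviation sd is itself normal with standard deviation sd * sqrt(2).
    sd=200 is an assumed value, not one stated in the post.
    """
    z = diff / (sd * math.sqrt(2))
    # Standard normal cumulative distribution evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))


def win_prob_logistic(diff, scale=400.0):
    """Logistic approximation to the same curve (scale=400 is an assumption)."""
    return 1.0 / (1.0 + 10 ** (-diff / scale))


for diff in (200, -400):
    print(diff, round(win_prob_normal(diff), 3), round(win_prob_logistic(diff), 3))
```

Under these assumptions both versions give roughly 76% for a differential of +200 and somewhere between 8% and 9% for a differential of -400, in line with the values discussed above.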

So now we understand the theory behind Elo ratings, my next post will look at how they can be calculated and applied to football teams.

Comments

Lars - February 1, 2013

“two hundred points equates to the higher ranked team having a win probability of 75%. Following this advice means that an average team should have an Elo rating of 1500.”

That conclusion is false. The win probability would be exactly the same if the average Elo rating was 2500 or 9500. The Elo scale is not absolute, only the difference between teams’ Elo values is relevant.

Apart from that very good introduction, congratulations. I am really looking forward to the sequel.

Martin Eastwood - February 2, 2013

Good point Lars, I have re-phrased that sentence to avoid any confusion.

Thanks for the feedback!

Predicting Football Matches Using Shot Data Part Two

Martin Eastwood — Fri, 25 Jan 2013 19:30:00 +0000

Introduction

Having found that the correlation between goals scored and shots on target was the strongest of the various shooting variables I had available to me, I decided to see how well they could predict the outcome of a football match.

Creating The Model

The obvious approach would have been to just do a linear regression for goals scored against number of shots on target and then predict the average number of goals each team would be expected to score. This doesn’t provide much insight though. The average score line might be of interest if each team was going to play each other 20 or 30 times a season but for a single game it is pretty much irrelevant.

What is of more use is to predict the actual odds for each possible outcome between the teams. In other words what is the probability of each team winning, drawing or losing?

To do this I looked at how many shots on target each team achieved and conceded each match compared with the league’s average to estimate how many they would be expected to have against each other. This was then mapped to the distribution of their shots on target over the season so far, and their shot conversion rate was used to calculate the probabilities of the different numbers of goals they could score. Each match was then played one million times as part of a Monte Carlo simulation to see what the likely outcomes were.
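The general shape of this kind of simulation can be sketched as follows. This is a simplified illustration rather than the actual model: it assumes shots on target are roughly normal with variance equal to the mean, that every shot on target is converted independently at a fixed rate, and the function name and all input values are hypothetical.

```python
import random


def simulate_match(sot_home, sot_away, conv_home, conv_away, n_sims=20000, seed=0):
    """Estimate home win / draw / away win probabilities by simulation.

    sot_*  : expected shots on target for each team (hypothetical inputs)
    conv_* : chance that a shot on target becomes a goal (hypothetical inputs)
    """
    rng = random.Random(seed)

    def sample_goals(mean_sot, conv):
        # Assumption: shots on target are roughly normal with variance equal
        # to the mean, truncated at zero and rounded to a whole number.
        shots = max(0, round(rng.gauss(mean_sot, mean_sot ** 0.5)))
        # Each shot on target is converted independently with probability conv.
        return sum(rng.random() < conv for _ in range(shots))

    counts = {"home": 0, "draw": 0, "away": 0}
    for _ in range(n_sims):
        home_goals = sample_goals(sot_home, conv_home)
        away_goals = sample_goals(sot_away, conv_away)
        if home_goals > away_goals:
            counts["home"] += 1
        elif home_goals < away_goals:
            counts["away"] += 1
        else:
            counts["draw"] += 1
    # Convert counts into estimated probabilities.
    return {outcome: n / n_sims for outcome, n in counts.items()}


probs = simulate_match(sot_home=6.0, sot_away=3.0, conv_home=0.3, conv_away=0.3)
print(probs)
```

The sketch uses 20,000 simulated matches to keep the run time short; increasing `n_sims` reduces the sampling noise in the estimated probabilities.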

Are the Predictions Accurate?

One difficulty with a model like this is to assess its accuracy. With a traditional linear model you can just look at the $r^2$ value to see how well your predictions match the actual results. The higher the $r^2$ value, the better your model is.

But with a probability model this doesn’t work. For example take the situation where the probability model predicts Team A have a 75% chance of beating Team B. Even if the model has calculated these odds perfectly then Team A will still lose 25% of the time, making it look like the prediction was incorrect.

One alternative is to identify what the most probable outcome for each match was – win, draw or loss – and compare that with what actually happened to see if they match. To do this I applied the model retrospectively to all the matches from the 2011–2012 English Premier League season and overall the proportions of outcomes predicted did match closely what actually happened (Figure 1).

Figure 1: Proportion of outcomes predicted compared with actual results for 2011–2012 English Premier League season

Another test we can do is to compare the Shot on Target model with other models to see how well they compare. Again I picked the most probable outcome from my odds and this time compared it with those from Bet365 for the entire 2011–2012 English Premier League season. I also randomly guessed the outcome for each match to see how the model compared with pure luck.

Prediction Results

Overall, the Shot on Target model’s most probable outcome correctly matched what actually happened for 43% of the matches tested compared with 52% for Bet365 and 33% from randomly guessing.
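The comparison described above comes down to taking the most probable outcome for each match and counting how often it matches the real result. A minimal sketch is shown below; the function names and the four example matches are made up purely for illustration.

```python
def most_probable(probs):
    """Return the outcome label with the highest predicted probability."""
    return max(probs, key=probs.get)


def hit_rate(predictions, results):
    """Fraction of matches where the most probable outcome was the real one."""
    hits = sum(most_probable(p) == r for p, r in zip(predictions, results))
    return hits / len(results)


# Hypothetical predictions and results for four matches.
predictions = [
    {"home": 0.50, "draw": 0.30, "away": 0.20},
    {"home": 0.25, "draw": 0.30, "away": 0.45},
    {"home": 0.40, "draw": 0.35, "away": 0.25},
    {"home": 0.30, "draw": 0.28, "away": 0.42},
]
results = ["home", "draw", "home", "away"]

print(hit_rate(predictions, results))  # 3 of the 4 picks match, so 0.75
```

Guessing uniformly at random between three outcomes would be expected to score about one in three on this measure, which is where the 33% baseline comes from.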

Interestingly, even the bookies only managed to get the most probable outcome correct for around half the matches, so the Shot on Target model is doing pretty well at 43% and isn’t that far behind the professional odds compilers. Also, this is only the first stage of the model and there are still plenty of ways it can be tweaked to try and improve its accuracy further.

Comments

Laurie - January 31, 2013

Reading your blog it seems that shots on goal is a good thing to study to determine a win/draw/lose prediction of a game. I have an ELO system on Excel and also power ratings. However I want to add shot:shots on target:goals to my Excel system, but am unsure how to go about doing this. I’m hoping you can help. Maybe send me an email with some advice. I’d be very grateful for your help.

Cheers

Laurie

BernieW - February 4, 2013

I did notice that Barca did not perform as well when their shots & shots on target were below their average PLUS the opposition had their shots & shots on target above the average. Do you use clear chances as a metric? I cannot find this statistic recorded by any site and I get the “feeling” that this would be more accurate a predictor. It would be great to have this for all the 5/6 major leagues for a few seasons.

Martin Eastwood - February 4, 2013

It is an interesting idea. I am looking at ways to improve the model by adding in extra metrics so I’ll take a look at it if i can find any data available

David - February 3, 2014

Great post..I have always wanted to use something other than linear regression. Could you make a short tutorial or send me an email on how you did the part where you create the model.

How do you estimate how many shots on target they are expected to have against each other and how do you end up with the score probabilities.

I would love to be able to recreate this model you have made.

Looking forward to your reply and keep up the good work.

Nick - April 16, 2014

I came across a blog post (http://www.soccerstatistically.com/blog/2011/11/9/how-to-succeed-in-the-epl-chances-created-and-chance-convers.html) that analyzes chances created and goals scored. There is a correlation (adding in the chance conversion rate). The author uses data from Opta.

Martin Eastwood - April 16, 2014

Thanks for the link!

Nick - April 16, 2014

I have thought about the same – goal chances, although a little bit subjective and more difficult to define than shots on target (is a shot from 30 yards that blazed over the bar a goal chance?), should be a much better predictor than shots on target because shots on target do not tell you anything about the quality of those shots and exclude great goal chances that resulted in shots off target or no shot at all.

Nick - June 20, 2014

This may be interesting for you: http://www.pinnaclesports.com/online-betting-articles/05-2014/world-cup-total-shots-ratio.aspx

Martin Eastwood - June 21, 2014

Thanks for the link Nick!

What Is The Chance of Bradford City reaching Wembley?

Martin Eastwood — Tue, 22 Jan 2013 19:30:00 +0000

With League Two’s Bradford City only one match away from playing at Wembley in the League Cup final I thought it would be interesting to see what the chances were of them getting this far.

It has been an unbelievable cup run for Bradford City as they have had to play against teams in higher divisions nearly every step of the way. The first round of the cup pitted them against League One team Notts County who they managed to defeat in extra time. This was then followed a couple of weeks later with a 2–1 victory away at Championship side Watford.

The third round was somewhat kinder to them as they played at home against Burton Albion, a team from their own division. However, having seen off Burton the fourth round then took them to Premier League side Wigan Athletic who they managed to defeat on penalties.

The quarter final again put Bradford against another Premier League team, with Arsenal this time Bradford’s victims. Next up was Aston Villa who lost 3–1 to Bradford at the Valley Parade although Villa did come away with an away goal, which could prove to be critical for them.

To work out the probability of Bradford’s cup run I collected the odds for each match from www.oddsportal.com and removed the overround to get the true odds. The overround is the bookies’ profit margin, created by offering odds lower than the actual true odds of the event occurring. To remove it we just need to scale the odds by the excess so that they add up to exactly 100%.

Once we have the true odds we can then work out the cumulative probability of the cup run by multiplying the odds together (note that bookies’ odds generally refer to what happens over the first ninety minutes of the match so Bradford City beating Notts County in extra time is actually classed as a draw rather than an away win).

$\text{Prob(Cup Run)} = \text{prob draw with Notts County} \times \text{prob beating Watford} \times \text{prob} \ldots$

Overall, the probability of Bradford City’s cup run so far is 0.008%. If we take into account tonight’s match then the chance of Bradford’s cup run taking them all the way to Wembley is around 0.001% or 1 in 100,000. It’s not quite a lottery win, which is around 100 times less likely again, but it is a fantastic achievement for Bradford City and is likely to be a once in a lifetime experience for their fans.
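The two steps described above, removing the overround and multiplying the match probabilities together, can be sketched as below. The function names and the odds are hypothetical and are not the figures used for Bradford's cup run; the sketch assumes decimal odds and the simple proportional scaling described in the text.

```python
def remove_overround(decimal_odds):
    """Turn decimal odds for every outcome of one match into probabilities.

    The implied probabilities (1 / odds) add up to more than 1 because of the
    bookmaker's margin, so each one is divided by their total.
    """
    implied = [1.0 / odds for odds in decimal_odds]
    total = sum(implied)
    return [p / total for p in implied]


def cumulative_probability(probabilities):
    """Probability of a run of independent results all happening."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result


# Hypothetical home / draw / away odds for a single match.
home, draw, away = remove_overround([1.50, 4.00, 7.00])
print(home, draw, away)

# Hypothetical chain of three results, each with its own true probability.
print(cumulative_probability([draw, away, 0.40]))
```

Multiplying the probabilities together treats each match as independent, which is a simplifying assumption.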

Comments

Predicting Football Matches Using Shot Data

Martin Eastwood — Mon, 21 Jan 2013 19:30:00 +0000

Introduction

Previously on this blog I have discussed my attempts at using the Poisson distribution to predict the number of goals scored in football matches. So far, the results have been disappointing as the mathematical model I constructed under-predicted the number of draws that occurred. This is something I intend to go back and address at some point by adding in the Dixon and Coles adjustment but in the meantime I thought I would try predicting the outcome of matches using shots instead.

Shot Data

There were a number of reasons for working with shots instead of using goals directly. First of all, shots and goals are inherently linked together. For every goal scored there has to be a shot taken. Secondly, not every shot taken leads to a goal, giving us a much larger data set to work with compared with just goals alone. Thirdly, the number of shots taken in a match is pretty much normally distributed (Figure 1) whereas the number of goals scored is closer to a Poisson distribution. This is useful as many statistical tests rely on a normal distribution of data.

Figure 1: Frequency of total shots in English Premier League matches 2009–2012

The first stage of developing the model was to determine what variables to use for it. Looking at data over a whole season showed a decent correlation between goals scored and total shots taken ($r^2$=0.62), shots on target ($r^2$=0.76), shots blocked ($r^2$=0.59) and shots wide ($r^2$=0.32; Figure 2).

Figure 2: Correlation between goals scored and various shooting parameters for the 2011–2012 English Premier League Season

Variability

Unfortunately when you start looking at the data match-by-match the correlations become much weaker. Over the course of an entire season a lot of the variability in the data starts to even out but over a single match it is not the case and variables such as luck can play a much bigger role. For example it is likely that the teams with the most shots on target will score the most goals overall per season as skill would start to dominate over luck. However, this isn’t always the case for an individual game – we have all seen matches where a team has scored a lucky goal and then managed to hold on for the win even though the opposition has showered their goal with shots for ninety minutes.

Because of this I decided to exclude many of the variables as they have little value over a single match. Instead, I focussed on using just shots on target data as this had the highest correlation with goals match-by-match. As with the total number of shots taken, the data is also roughly normally distributed although it is skewed towards zero (Figure 3) as obviously no matter how bad a team is it cannot achieve less than zero shots on target in a match (although Blackburn Rovers came close by managing to go the entire match against Tottenham Hotspur in 2012 without taking even a single shot, let alone managing to get one on target!).

Figure 3: Frequency of Shots on Target in English Premier League matches 2011–2012

In my next post I will explain more about how the Shot on Target model works and discuss its accuracy.

Comments

Ilia - July 30, 2013

Hi Martin,

I’m curious how you got such a high r2 value for shots on goal. When I try to do the same calculations on the EPL I can’t get more than 0.17. Any idea on what I might be doing wrong?

Martin Eastwood - July 30, 2013

I aggregated shots on goal with goals scored over a full season. Maybe you are looking at individual matches, in which case the $r^2$ will likely be much lower?

Predicting The Premier League Using The Refined Pythagorean Equation

Martin Eastwood — Fri, 18 Jan 2013 19:30:00 +0000

Introduction

New article for Betting Expert looking at the current Premier League standings compared with the predictions from my refined version of the Pythagorean Expectation.

Comments

How Early In The Season Can Pythagorean Predictions Be Made?

Martin Eastwood — Wed, 02 Jan 2013 19:30:00 +0000

Introduction

The next stage for developing my refined version of the Pythagorean equation (known as the MPE) is to characterise how much data it actually needs to make accurate football predictions.

Methodology

To investigate this I selected Manchester City, Swansea City and Wolverhampton Wanderers from the English Premier League’s 2011–2012 season. The reason for choosing these teams was that they represented the top, middle and bottom of the league so I could test the MPE equation across teams of varying quality and league position.

I then used the MPE equation to predict the total points at the end of the season for each team week-by-week to see how the prediction changed throughout the year. Figure 1 shows the difference between the predicted points and the actual points achieved at the end of the season for each of the three teams.

Figure 1: Difference Between Predicted and Actual Points

Results

The prediction settled down very quickly for Manchester City and from match three onwards the root mean square error (RMSE) was just 1.96 points. This means that after just three games the MPE equation was accurately predicting how many points Manchester City would have at the end of the season to within two points.
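For readers unfamiliar with the measure, the RMSE of a series of weekly predictions against the final points total can be computed as in the sketch below. The function name and the numbers are hypothetical and are not Manchester City's actual weekly predictions.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical end-of-season predictions made after four different weeks,
# compared with a hypothetical final total.
weekly_predictions = [91, 88, 90, 87]
final_points = 89

error = rmse(weekly_predictions, [final_points] * len(weekly_predictions))
print(round(error, 2))
```

Because the errors are squared before averaging, a few large misses increase the RMSE more than many small ones.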

For Swansea City the prediction was slightly more problematic as they didn’t score during their first four matches and the MPE equation needs goals to have been scored before a valid prediction can be made. Swansea City finally scored in their fifth match in a 3–0 victory over West Bromwich Albion and from then on the prediction steadily improved and was within three points of their actual total after their next six matches.

Wolverhampton Wanderers’ season was an interesting one to predict as they had a very misleading start, with two wins and a draw in their first three matches giving a predicted points total of 83. At this point though it all went disastrously wrong for them and they lost their next five matches in a row, by which time their predicted points had dropped all the way down to 30. Wolverhampton Wanderers eventually finished bottom of the league with 25 points.

Overall, the MPE equation appears to give stable results and the only real requirement is that goals have been scored. Based on the data in Figure 1 accurate predictions can be made early in the season as there is very little change in the predictions from week ten of the season onwards.

Comments

amir - March 23, 2013

That is very interesting.

What Has Caused Dimitar Berbatov’s Recent Lack of Goals?

Martin Eastwood — Sat, 15 Dec 2012 19:30:00 +0000

Introduction

Up until week 12 of the season, Dimitar Berbatov was one of the English Premier League’s top goal scorers and goal creators. However, since then he has gone 450 minutes without registering either a goal or an assist, coinciding with Bryan Ruiz’s injury. Check out my guest article in which I analyse the effect the absence of Ruiz has had on Berbatov’s performances here

Comments

Using the Pythagorean Expectation Across Leagues Worldwide

Martin Eastwood — Mon, 10 Dec 2012 19:30:00 +0000

Introduction

I showed in my last post that my initial version of the Pythagorean Expectation (MPE) predicted total points for the English Premier League (EPL) pretty well, with an RMSE of approximately four points over the course of a whole season (see here for an explanation of using RMSE to measure the error of the predictions). The next stage for the equation’s development is to see whether it can be applied to other leagues too. Having one MPE equation that could be used globally across leagues is preferable to having to create specific equations for each league.

The Eredivisie

At the recommendation of Scoreboard Journalism's Simon Gleave I started with the Eredivisie, the top flight division in Holland. The reason for choosing the Eredivisie is that it is a unique league, with high rates of goal scoring and a number of results in recent years that appear as potential outliers. For example, in the 2009–2010 season Ajax scored 43 goals more than Twente and conceded three fewer yet still finished second to them in the league. At the other end of the table Willem II finished 15th in 2007–2008 with a goal difference of -9 while the two teams immediately above them had goal differences of -30 and -24, respectively. These sorts of results make the Eredivisie difficult to predict and so provide a good stress test for the MPE equation.

Applying the MPE to the final Eredivisie standings from 1999–2000 to 2011–2012 worked surprisingly well, with an overall RMSE of 4.35 points. It is slightly higher than the 4.08 previously obtained for the EPL but this is perhaps to be expected since the original MPE equation was generated using just data from the EPL.

To see whether the Dutch league needed its own version of MPE I recreated the equation based on just Eredivisie data and the overall error dropped to 4.21, a decrease of around 3%. Such a minor improvement suggests that the equation may be stable across leagues and so we will not need league-specific versions.

To test this hypothesis further I collected 223 league tables from around the world and optimised the MPE against this larger data set. The reason for this was three-fold. Firstly, the original equation I published was created just from EPL data so any peculiarities to the EPL could bias results for other leagues.

Secondly, the previous data set was smaller so any outliers in the data could have a large effect on the finalised results. By using a larger data set the influence of any outliers will be minimised.

Thirdly, and perhaps most importantly, this gave enough data to cross-validate the equation by randomly splitting the league tables up into training and validation sets. Initially, the MPE had been trained and tested using the same data. Now it has been tested on data different from that against which it was optimised, reducing the risk of Type III errors occurring.

Figure 1 shows the RMSE for the predictions for fifteen leagues randomly selected as a validation set. The overall RMSE across the entire validation set is 3.88 points and is plotted as the vertical dotted line. The overall RMSE is now reduced to below four points and this new version of the MPE equation appears suitable for use globally across different leagues.

Figure 1: Results For Validation of MPE Equation

The finalised MPE Pythagorean Expectation is shown in Figure 2. Based on the data shown here this new version of the MPE equation is suitable for use across multiple leagues worldwide, with an average error of less than 4 points per season.

$\text{predicted points} = \frac{\text{goalsfor}^{1.2299}}{\text{goalsfor}^{1.16793} + \text{goalsaway}^{1.20053}} \times 2.29761 \times \text{numberofgamesplayed}$

Figure 2: MPE Equation
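For anybody who prefers code to algebra, the equation above translates directly into a short function. This is a sketch rather than an official implementation: the function and argument names are my own, `goals_against` corresponds to `goalsaway` in the equation (goals conceded), and the example inputs are purely illustrative.

```python
def mpe_points(goals_for, goals_against, games_played):
    """Predicted points using the exponents and scaling factor shown above."""
    if goals_for + goals_against == 0:
        # With no goals at all the ratio would be 0 / 0, so no prediction
        # can be made yet.
        raise ValueError("at least one goal is needed before predicting")
    ratio = goals_for ** 1.2299 / (
        goals_for ** 1.16793 + goals_against ** 1.20053
    )
    return ratio * 2.29761 * games_played


# Illustrative inputs: 93 goals scored and 29 conceded over 38 matches.
print(round(mpe_points(93, 29, 38), 1))
```

To project a final total part way through a season, one option is to compute the points per game so far and scale it by the total number of matches, which assumes the current scoring rates continue.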

Comments

Applying the Pythagorean Expectation to Football: Part Two

Martin Eastwood — Mon, 03 Dec 2012 19:30:00 +0000

Introduction

In my previous article, I discussed how to apply the baseball Pythagorean expectation to football and how to measure the error of the predictions using RMSE. This second article will demonstrate how to optimize the equation further to improve its accuracy.

Accuracy

One of the major reasons for the error in the predictions is the occurrence of draws in football. The Pythagorean expectation only looks at wins and losses and presumes that if a team scores zero goals then it will achieve zero points. This is of course incorrect: it is perfectly feasible for a team to fail to score but still gain a point through a nil-nil draw, so we need to take this into account.

Howard Hamilton of Soccermetrics has published an updated Soccer Pythagorean equation that does just that, and it does a good job of it. For the 2011–2012 season, Howard Hamilton reports an RMSE of 3.81 compared with the RMSE of 5.65 I reported for my previous version of the Pythagorean equation. The downside to Howard Hamilton’s equation though is that it is rather complicated. While the original Pythagorean equation is simple enough to be used by any football fan, Howard Hamilton’s equation requires a decent understanding of mathematics to use it.

Because of that, I thought I would tweak the original Pythagorean formula a bit further to try and improve its accuracy without adding too much extra complexity to it. One easy way to do this is to scale the points scored per match to take into account the occurrence of draws. Applying least squares to this reduces the RMSE for the 2011–2012 season to 4.04 points, just 6% higher than Howard Hamilton’s equation. This is based on only one season’s data though, so to get a true idea of how well my enhanced Pythagorean expectation works (abbreviated to MPE) I optimized the equation based on a much larger data set and applied it to the last 10 English Premier League (EPL) seasons (Figure 1).

Figure 1: MPE Prediction by Season in the EPL

The MPE works well, with an average residual (the difference between predicted points and actual points) of 4.08 points. This compares nicely with Howard Hamilton’s published value of 3.81 and is less than half of the error the original Pythagorean Expectation equation gave. It is also worth noting that Howard Hamilton’s RMSE of 3.81 is for just one season, and of the ten seasons analysed here using the MPE, two actually have an RMSE lower than 3.81.

Plotting the MPE predicted points versus actual points for the last ten EPL seasons shows visually how well the MPE equation works (Figure 2). The correlation between the predicted and actual points scored is excellent, with an $r^2$ value of 0.938.

Figure 2: MPE Predicted Points Versus Actual Points in the EPL

So based on the initial work so far I am pleased that the MPE version of the Pythagorean expectation gives results comparable to Howard Hamilton’s more detailed and advanced derivation but without quite as much added complexity. The final equation for anybody who wants to give it a try is shown below in Figure 3.

$\text{predicted points} = \frac{\text{goalsfor}^{1.22777}}{\text{goalsfor}^{1.072388} + \text{goalsaway}^{1.127248}} \times 2.499973 \times \text{numberofgamesplayed}$

Figure 3: MPE Equation

Comments

mark - December 3, 2012

Hi Martin,

nice, clear explanation.

Have you looked at using pythag to predict future games rather than explain previous points totals ? I’ve always thought that the drive to reduce rmse for predicted vs actual points can lead to overfitting of the non repeatable luck driven component of the actual games.

I’ve looked at pythag match ups as predictors of future games, but only in the NFL ,not soccer and pythag league points from one year to predict total points in the next season for soccer here http://thepowerofgoals.blogspot.co.uk/2012/11/a-predictive-pythagorean-for-football.html

Game by game pythag is also a novel twist.

Interesting subject, but a bit confused as to where it’s going at the mo. Posts like yours will go a long way towards clarifying things. Once again, nice read.

Mark

admin - December 3, 2012

Hi Mark,

Thanks for the comments, they are really interesting points. I agree about the rmse, football is so variable and luck-driven that an error of zero is unrealistic and probably not particularly useful for making predictions from as will not reflect the future accurately.

I am interested in looking at the predictive power of the Pythagorean and will be investigating that further. I am also interested though in what else the Pythagorean shows, such as how much Everton would need to improve in terms of goals scored / conceded to really challenge for a top four place or for a relegation-threatened team to stay up etc. Hopefully over the next few weeks I will be able to find out how useful the Pythagorean can be for this sort of thing in football.

Thanks again!

Martin

Jonas - December 12, 2012

Thanks for an interesting blog. I have just one question. I see that you have fitted two different exponents for “GoalsFor”; have you tried fitting the formula with the same parameter for “GoalsFor” in both the numerator and denominator? This would seem more intuitive (not that I find the Pythagorean very intuitive), but then again you would probably get a larger RMSE.

admin - December 12, 2012

Yes you could simplify it by using one exponent but the RMSE will go up.

Andrew Ferris - December 17, 2012

Applying your equation to the current league table for the premiership on a team by team basis, the total points would equal 450, as opposed to the real points total currently 451 (correct 17/12/12), which is amazing accuracy. However, it is out by 2.7 points per team on average, and out by 11 points for Manchester United, which would be great in real life as I detest them! The league table would look as follows:

Team	Points
Manchester City	34
Manchester United	31
Chelsea	28
Arsenal	28
Everton	27
Tottenham Hotspur	25
Swansea City	25
West Bromwich Albion	25
Stoke City	25
West Ham United	23
Liverpool	23
Fulham	22
Norwich City	19
Sunderland	19
Newcastle United	18
Southampton	17
Aston Villa	16
Reading	15
Wigan Athletic	15
Queens Park Rangers	14

Max Steele - October 4, 2013

Hi Martin,

I am slightly confused by the last graph. Surely it is more helpful to measure deviation from the line y=x rather than the line of best fit? I don’t really see what success a high correlation coefficient has if for example the best fit line was vertical. To highlight this, realise that you could achieve this with r^2 = 1 if you just predicted every team to have the same points total.

Using the line y=x would have the added benefit of allowing you to see where your model is performing better/worse (i.e. low point scorers underestimated etc.) with systematic deviations from the line y=x.

Thanks, Max

M - December 5, 2013

Why the formula has GoalsAway instead of GoalsAgainst?

Martin Eastwood - December 29, 2013

Yes, perhaps GoalsAgainst would have been a better name. In case it isn’t clear to anybody else, I am referring to goals conceded.

Rick Tee - September 6, 2014

I have been working on this for a month or so and what I’ve found is that accuracy drops from about 60% in the BPL to 40% in the Championship and lower. I also noticed a strange phenomenon where the outcome was the reverse of the test result, i.e. if team B were deemed the winner then team A would actually be the winner.

Current result	%
BPL	70%
Cha	60%
LG1	60%
LG2	55%
CNF	70%

I should add I am also counting draws in the accuracy quoted above. I will continue to test but as the season is still ‘finding its feet’ I’m sure things will change.

Rick Tee - September 6, 2014

Thought I would add, for the draws I am using variables x and y, which are set differently for each league, e.g. BPL x=75 y=84, CNF x=57.25 y=55.

Hw = 70, Aw = 75, D = 80. If the points fall between these two numbers then a draw is the predicted outcome.

I know the numbers are widely different but the system is essentially the same. I just thought I would share some information on how I’ve been calculating a draw. If anyone finds it useful I will try to explain my system in more detail.

Applying the Pythagorean Expectation to Football: Part One

Martin Eastwood — Mon, 26 Nov 2012 19:30:00 +0000

Introduction

The Baseball Pythagorean Expectation is a formula originally derived by Bill James to estimate how many games a baseball team could be expected to win over a season based on the number of runs they score and concede (Figure 1). Teams winning fewer games than their Pythagorean prediction are considered to have been unlucky while those outperforming the prediction are thought to have had luck on their side.

$\text{win ratio} = \frac{\text{runs scored}^2}{\text{runs scored}^2 + \text{runs allowed}^2}$

Figure 1: The Baseball Pythagorean Expectation

The formula works well for baseball, giving predictions generally within three games of what actually happens. The Pythagorean expectation has also been applied successfully to other sports, including American football and basketball. However, so far the equation has not worked particularly well for predicting football matches.

Table 1 shows goals scored and conceded in the English Premier League (EPL) during the 2011–2012 season, along with the actual points and Pythagorean predicted points. Looking at the difference between predicted and actual points it is clear that the Pythagorean expectation is over-predicting at the top of the table and under-predicting at the bottom.

Team	GF	GA	Pts	Pythag Pts
Manchester City	93	29	89	104
Manchester United	89	33	89	100
Arsenal	74	49	70	79
Tottenham Hotspur	66	41	69	82
Newcastle United	56	51	65	62
Chelsea	65	46	64	76
Everton	50	40	56	70
Liverpool	47	40	52	66
Fulham	48	51	52	54
West Bromwich Albion	45	52	47	49
Swansea City	44	51	47	49
Norwich City	52	66	47	44
Sunderland	45	46	45	56
Stoke City	36	53	45	36
Wigan Athletic	42	62	43	36
Aston Villa	37	53	38	37
Queens Park Rangers	43	66	37	34
Bolton Wanderers	46	77	36	30
Blackburn Rovers	48	78	31	31
Wolverhampton Wanderers	40	82	25	22
			RMSE	8.4

Table 1: Pythagorean Expectation for the EPL 2011–2012
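
The Pythag Pts column looks consistent with scaling the baseball ratio by the 114 points available over a 38-match season at three points per win. That scaling is an assumption inferred from the numbers rather than something stated above. As a rough worked check for Manchester City’s row:

$\frac{93^2}{93^2 + 29^2} \times 114 = \frac{8649}{9490} \times 114 \approx 103.9$

which rounds to the 104 shown in the table. The same arithmetic gives roughly 22 for Wolverhampton Wanderers, again matching the table.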

We can quantify this error by calculating the root-mean-square error (RMSE). This technique squares the difference between the predicted and actual points, averages the squares and then takes the square root of that average. It sounds complicated but all the squaring and square-rooting does is make all the numbers positive. Imagine we predicted just two values and were -10 points out for the first and +10 points out for the second. If we just averaged these two numbers then the average error would be zero, making it look like our prediction was perfect when it obviously was not. Instead, if we square the numbers first and then take the square root of the average we get the correct error of ten points. Doing this calculation for Table 1 gives us an RMSE of 8.4 points, meaning that on average the Pythagorean expectation was around eight points out for the 2011–2012 season.
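
The two-value example can be sketched in a few lines of Python. The point totals below are made up purely for illustration, and the function names are my own.

```python
import math

def mean_error(predicted, actual):
    """Plain average of the signed errors (positive and negative cancel out)."""
    errors = [p - a for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)

def rmse(predicted, actual):
    """Root-mean-square error: square, average, then square-root."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Two hypothetical teams that both finished on 50 points
actual = [50, 50]
predicted = [40, 60]  # 10 points under and 10 points over

print(mean_error(predicted, actual))  # 0.0, which hides the error completely
print(rmse(predicted, actual))        # 10.0, the typical size of the miss
```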

The more accurate the predictions, the lower the RMSE will be. One way to improve the prediction is to alter the exponent used in the equation. In other words, instead of raising goals scored and conceded to the power of two we use different values. Figure 2 shows what happens to the RMSE as the exponent is changed from 0.1 to 3. Looking at the chart, the RMSE is lowest using an exponent of 1.35, giving an average error of 5.75 points, nearly three points lower than before.
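
A minimal sketch of this kind of exponent sweep, using the goals and points from Table 1, is shown here. It assumes predicted points are the Pythagorean ratio multiplied by the 114 points available in a 38-match season. That scaling reproduces the rows of the Pythag Pts column I checked, but it is my assumption, so the exact RMSE values may differ slightly from those quoted.

```python
import math

# (goals for, goals against, actual points) copied from Table 1
TABLE_1 = [
    (93, 29, 89), (89, 33, 89), (74, 49, 70), (66, 41, 69), (56, 51, 65),
    (65, 46, 64), (50, 40, 56), (47, 40, 52), (48, 51, 52), (45, 52, 47),
    (44, 51, 47), (52, 66, 47), (45, 46, 45), (36, 53, 45), (42, 62, 43),
    (37, 53, 38), (43, 66, 37), (46, 77, 36), (48, 78, 31), (40, 82, 25),
]
MAX_POINTS = 38 * 3  # assumed scaling: 38 matches, three points per win

def pythag_points(gf, ga, exponent):
    """Pythagorean ratio scaled to a points total."""
    return MAX_POINTS * gf ** exponent / (gf ** exponent + ga ** exponent)

def rmse_for(exponent):
    """RMSE between predicted and actual points for one exponent."""
    squared = [(pythag_points(gf, ga, exponent) - pts) ** 2
               for gf, ga, pts in TABLE_1]
    return math.sqrt(sum(squared) / len(squared))

# Try exponents 0.10, 0.15, ..., 3.00 and keep the one with the lowest RMSE
exponents = [step / 20 for step in range(2, 61)]
best = min(exponents, key=rmse_for)

print("exponent 2.0 ->", round(rmse_for(2.0), 2))
print("best exponent", best, "->", round(rmse_for(best), 2))
```

A grid search like this is crude but easy to read; a proper optimiser would find the same minimum more precisely.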

Figure 2: Effect of Altering Exponent on RMSE

The next logical step to improve the prediction further is to try using a different exponent for each part of the equation. This makes the formula harder to optimize, but by applying a technique called least squares we come up with optimal exponents of 1.39, 1.43 and 0.98. Unfortunately this has little effect on the RMSE, reducing it by just 0.1 to 5.65 points.

So far the predictions are still nearly six points out but in part two of this article I will discuss why the error is high and show how to improve it further to increase the accuracy of the predictions.

Disparity in European Football Leagues

Martin Eastwood — Tue, 20 Nov 2012 19:30:00 +0000

Introduction

Having mentioned the effect disparity plays on determining the league champions in previous posts I thought it would be interesting to look at the actual levels of disparity currently present in football.

English Premier League

I started off looking at the English Premier League (EPL) over the past decade and plotted the points achieved each season as a Tukey Box-and-Whiskers plot (Figure 1). Looking at Figure 1, the spread of points across the league each season is broadly consistent. There have been a few years where individual teams have done particularly well, such as Chelsea in 2004-2005, or particularly badly, such as Derby County in 2007-2008, but there are no obvious changes over time.

Figure 1: Points Scored in EPL Per Season

One noticeable feature, however, is that the median value for every season (the thick black line in the middle of each box) is lower than the overall average (plotted as the horizontal dotted line), suggesting the data is skewed. Looking at the 2010–2011 season as an example, half the teams scored fewer than 47 points while half scored 47 or more. In comparison, the average points scored that season was 51.5. This means that an average mid-table EPL team is closer to relegation than it is to winning the league. To put it into perspective, West Ham finished bottom that season with just 18.5 points fewer than the average while Manchester United won the league with 28.5 points more than the average.

Other European Leagues

A similar pattern can be seen across all the major leagues in Europe (Figure 2), where the median points achieved was also lower than the average. The median points for the Bundesliga and Eredivisie were furthest from the average, but it is worth bearing in mind that these two leagues play fewer matches than the EPL, La Liga and Ligue 1 so this is perhaps to be expected.

Figure 2: Points Scored in 2010-2011

La Liga and Ligue 1 both show two teams that are classified as statistical outliers. It is no surprise that the two outliers in La Liga are Real Madrid and Barcelona, who both finished more than twenty points ahead of the rest of the league. In the case of Ligue 1, the champions Lille and relegated Arles-Avignon are both classed as outliers. A major reason for this is how close the middle of Ligue 1 finished that season – Monaco were relegated with 44 points, only seven points fewer than Bordeaux who finished in seventh place.

Since leagues play different numbers of matches it is difficult to compare them directly, so I also looked at the difference in points scored per match by the top team and the middle team, and by the middle team and the bottom team (Table 1). The results show that La Liga was the least competitive of the leagues, with the champions scoring 1.737 points more per match than the bottom team. The EPL came out as the most competitive league, with the lowest difference between the top and bottom teams. However, Ligue 1 appears the most balanced, with the smallest difference between the top and bottom of the league compared with the middle. Interestingly, the Eredivisie appears unbalanced in the opposite way to most other leagues, with the bottom team further away from mid-table than the champions are from mid-table.

League	Top/Middle	Middle/Bottom	Total
EPL	0.868	0.368	1.237
Ligue 1	0.711	0.763	1.474
La Liga	1.289	0.447	1.737
Bundesliga	0.912	0.441	1.353
Eredivisie	0.765	0.941	1.706
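
As a rough check on how these per-match gaps can be computed, here is a small sketch using the EPL figures quoted earlier: a 51.5 average, champions 28.5 above it (80 points) and the bottom team 18.5 below it (33 points). Treating the 47-point median as the "middle team" is my assumption, as is the function name.

```python
def per_match_gaps(top, middle, bottom, matches):
    """Points-per-match gaps: top to middle, middle to bottom, top to bottom."""
    top_middle = (top - middle) / matches
    middle_bottom = (middle - bottom) / matches
    total = (top - bottom) / matches
    return top_middle, middle_bottom, total

# EPL 2010-2011: 80 points for the champions, 47 for the middle (median,
# assumed), 33 for the bottom team, over a 38-match season
top_middle, middle_bottom, total = per_match_gaps(80, 47, 33, 38)
print(round(top_middle, 3), round(middle_bottom, 3), round(total, 3))
# 0.868 0.368 1.237
```

These match the EPL row of the table, which suggests the other rows were built the same way using each league's own season length.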

Analysis of André Villas-Boas Vs Harry Redknapp

Martin Eastwood — Sun, 18 Nov 2012 19:30:00 +0000

Introduction

Since taking over as manager of Tottenham Hotspur, André Villas-Boas has been trapped in former Spurs manager Harry Redknapp’s shadow. Every tactical decision or team selection Villas-Boas makes is seemingly compared with Redknapp’s previous achievements. And after Tottenham’s apparent slow start to the season, Villas-Boas has come under heavy criticism from the media whose narrative seems to be that Tottenham are performing poorly. But is this criticism fair and are Tottenham really performing any worse than last season under Harry Redknapp?

Find out by reading the rest of this article here.

Effect of Season Length on Deciding the League Champion

Martin Eastwood — Mon, 12 Nov 2012 19:30:00 +0000

Introduction

In my previous article I looked at the interplay between luck and skill in determining the league champions. There is another parameter though that also interacts with luck and that is the structure of the league itself. How many times have you heard the same tired, old cliché from football managers about how luck evens itself out over a season? But does it? Is a football season really long enough for the effects of chance to be cancelled out?

I used the same mathematical model as before to simulate 10,000 seasons of a league containing 20 teams. Skill levels were randomly assigned to each team from a normally distributed population with a mean of 0.5 and a standard deviation of 0.1. The length of the season was then altered to see how frequently the team with the highest skill level won the league depending on the number of matches played – teams either played each other once, twice, four times or eight times per season.
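
A simulation along these lines might look like the sketch below. The match model is my own assumption rather than the one actually used: team A beats team B with probability 0.5 plus the difference in their skill levels (clamped to between 0 and 1), draws are ignored and ties for first place are broken at random.

```python
import random

def best_team_title_rate(n_teams=20, meetings=2, sd=0.1, seasons=1000, seed=0):
    """Fraction of simulated seasons won by the most skilful team."""
    rng = random.Random(seed)
    titles_for_best = 0
    for _ in range(seasons):
        # Skill levels drawn from a normal distribution with mean 0.5
        skills = [rng.gauss(0.5, sd) for _ in range(n_teams)]
        wins = [0] * n_teams
        for a in range(n_teams):
            for b in range(a + 1, n_teams):
                # Assumed match model: win probability shifts linearly
                # with the skill difference, clamped to [0, 1]
                p_a = min(1.0, max(0.0, 0.5 + skills[a] - skills[b]))
                for _ in range(meetings):
                    if rng.random() < p_a:
                        wins[a] += 1
                    else:
                        wins[b] += 1
        most_wins = max(wins)
        # Break ties for the title at random (an assumption)
        champion = rng.choice([t for t in range(n_teams) if wins[t] == most_wins])
        best = max(range(n_teams), key=lambda t: skills[t])
        if champion == best:
            titles_for_best += 1
    return titles_for_best / seasons

for meetings in (1, 2, 4, 8):
    print(meetings, best_team_title_rate(meetings=meetings, seasons=300))
```

Because the match model is a guess, the percentages will not match the table exactly, but the same trend should appear as the number of meetings increases.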

Number of Teams	Frequency Teams Meet	Mean Win %	Best Team Win %
20	1	60.2	32.11
20	2	75.3	45.9
20	4	82.2	48.5
20	8	85.2	50.8

Table One: Effect of Season Length on League Champions

Results

The results in Table One show that as the length of the season increases, the probability of the team with the highest skill rating winning the league increases too. The champions also win a greater percentage of their matches. Therefore, the more matches that are played, the less of an influence chance seems to play in determining the overall league champion.

The second row of Table One matches the structure of four of the major leagues in Europe – Premier League, Serie A, La Liga and Ligue 1 – which all contain 20 teams that play each other twice per season. The Eredivisie and Bundesliga only contain 18 teams though, so what effect does this have? Rerunning the mathematical model with 18 teams gives a lower frequency for the best team winning the league of 28.8%. This suggests that the smaller size of these two leagues makes them somewhat more competitive as there are fewer matches for luck to be evened out.

The Scottish Premier League (SPL) is smaller again, containing just 12 teams. The structure of the league is unusual in Europe, with teams playing each other three times, either twice away and once at home or vice versa. The league then splits in half and teams play a further match against the remaining five teams in their half of the league. If we apply the mathematical model to this structure then we come out with a frequency of 19.3% for the best team winning the league. This means the SPL should be one of the most competitive leagues in Europe, yet it has only ever been won by two teams – Celtic and Rangers. This is likely due to the large disparity in talent between Glasgow’s two largest teams and the rest of the league cancelling out the effect of chance.

Interestingly, with Rangers now relegated from the SPL for financial irregularities, the league is the closest it has ever been. It was thought that without Rangers present in the SPL, Celtic would go on to dominate a very one-sided division. Yet with Hibernian currently sitting top of the league, the reduction in disparity from the loss of Rangers may actually make it the most competitive and exciting year in the SPL’s history.

How Often Does The Best Team Win The League?

Martin Eastwood — Thu, 08 Nov 2012 19:30:00 +0000

Introduction

How often does the best team win the league? Probably not as often as you think as it is not just talent that is required for success; a decent amount of luck is needed too.

Methodology

To investigate how big a role luck plays compared with ability I created a mathematical simulation based on the English Premier League (EPL) containing 20 teams that play each other twice per season. Each team was randomly assigned a skill level drawn from a normally distributed population with a mean of 0.5 and a standard deviation linked to the spread of talent across the league so that the disparity between the top and bottom clubs could be controlled. The simulation was then run for 10,000 seasons at various disparity levels and the number of times the team with the highest skill level won the league was measured.

Mean Skill Level	Disparity	Mean Win %	Best Team Win %
0.5	0	65.5	0
0.5	0.02	65.9	10.3
0.5	0.04	67.4	22.7
0.5	0.06	69	32.4
0.5	0.08	72.1	41.2
0.5	0.1	76.4	46.4

Table 1: Effect of Disparity on League Champions

The first row in Table One shows what would happen if all teams in the league were identical. Each team has a skill level of 0.5, meaning that they would each be expected to win 50% of their matches and lose 50% (ignoring draws to keep the model simple). Due to random chance though some teams will win more than 50% and some will lose more than 50%. You can see that the average number of matches won by the league champions over 10,000 seasons was 65.5% so in an evenly matched EPL you would just need to be lucky enough to win an extra 15% of matches to be champions.
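
The zero-disparity case is easy to sketch, since every match becomes a coin flip. The code below is an illustration under that assumption (no draws, 20 teams meeting twice), not the original simulation, and the function name is hypothetical.

```python
import random

def champion_win_share(n_teams=20, meetings=2, seasons=1000, seed=0):
    """Average share of matches won by the champions when all teams are equal."""
    rng = random.Random(seed)
    matches_each = (n_teams - 1) * meetings
    total_share = 0.0
    for _ in range(seasons):
        wins = [0] * n_teams
        for a in range(n_teams):
            for b in range(a + 1, n_teams):
                for _ in range(meetings):
                    # Identical teams: each match is a fair coin flip
                    winner = a if rng.random() < 0.5 else b
                    wins[winner] += 1
        # The champions are simply whoever got luckiest this season
        total_share += max(wins) / matches_each
    return total_share / seasons

print(round(100 * champion_win_share(), 1))
```

The printed figure should land in the mid-60s, in line with the first row of the table.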

As the disparity increases though, the influence of chance decreases and the best team wins the league more often, winning more of their matches in the process. Take this season’s EPL: while it is possible that QPR could fluke wins in all their remaining matches and go on to win the league, it would take a colossal amount of luck compared with the amount Manchester City would need to finish ahead of Manchester United, since their skill levels are much closer.

This leads to the question though of what is preferable, an evenly matched, competitive league in which luck is a major determining factor in winning or a league that is perhaps fairer as it has enough disparity that the best team is predominantly likely to win?

The Poisson Model So Far

Martin Eastwood — Fri, 02 Nov 2012 19:30:00 +0000

Introduction

In my last article I wrote about my experiences using the Poisson distribution to predict the outcome of football matches. The results so far have been rather disappointing so I thought I would have a look at where things were going wrong.

Probabilities

The first place I decided to look was at the probabilities generated for the matches predicted correctly compared with those predicted incorrectly. I suspected that maybe the model was struggling with matches between more evenly matched teams. For example, for last week’s match between Stoke and Sunderland the predicted outcome was a home win with a probability of 51%. This still leaves us with a 49% chance though that the game will finish with an away win or a draw instead making it potentially difficult to predict accurately.

Overall, the average probability for games correctly predicted was 64% compared with 56% in the games where the prediction failed. At first look it would therefore appear that the model does struggle somewhat with games between more closely matched teams. However, when you look at the variability in the data it is not possible to distinguish between the two percentages (Figure 1). In fact, comparing the data sets using analysis of variance (ANOVA) gives a p-value of 0.32, suggesting no statistically significant difference between the two percentages based on the current data.

Figure 1: Average probabilities of matches correctly / incorrectly predicted by the Poisson model

Next I looked at which outcomes were being incorrectly predicted and a problem immediately became apparent. So far the model has predicted 50 matches of which 58% were predicted to be home wins, 34% as away wins and 8% as draws. Looking at what really happened though, of those 50 matches 42% were actually home wins, 30% away wins and 28% were draws (Figure 2). This suggests the model is under-predicting the likelihood of draws by quite a large margin and is actually predicting them as home wins.

Figure 2: Proportion of Match Outcomes - Poisson vs Actual

Conclusions

A quick Google revealed two possible fixes. Karlis and Ntzoufras recommend replacing the independent Poisson with a bivariate Poisson to add an element of correlation between the home and away teams’ scores. However, even with this they still needed to inflate the diagonal of the score matrix to try and improve the prediction of draws, suggesting that moving to the bivariate Poisson is not necessarily much of an improvement. An alternative proposal by Dixon and Coles was to stick with the two independent Poisson calculations but add in an additional parameter to modify the probabilities of 0-0, 1-1, 1-0 and 0-1 scores occurring.

So where does this leave the current Poisson model? For me, it is time to move on to other ideas. The Poisson model is one of the most widely used models for predicting football outcomes so I will return to it in the future to try out the Karlis and Ntzoufras and Dixon and Coles adjustments, but I have a few other ideas to write about first.

Using Poisson to Predict Football Matches

Martin Eastwood — Mon, 29 Oct 2012 19:30:00 +0000

Introduction

The Power Of Goals recently blogged about using the Poisson distribution to predict the outcome of football matches. I have been evaluating the predictive ability of the Poisson for the English Premier League (EPL) this season so I thought I would share my experiences too.

For anyone who is unaware, the number of goals scored by each team in a football match roughly follows a Poisson distribution. As you can see in Figure 1, it is not exact though as the Poisson distribution underestimates the likelihood of no goals being scored and overestimates one, two and three goals being scored. By four goals and upwards the Poisson starts to underestimate again. The actual difference between the Poisson and what is observed in the EPL is reasonably small though so it just requires a small fudge factor to bring the two into line.

Figure 1: Poisson Distribution vs Observed

To carry out the predictions I have written a script in R that scrapes the Premier League table directly from the BBC’s website. The script then calculates attack and defence coefficients for each team by comparing their goals scored and conceded with the overall EPL average home and away. The predicted number of goals scored in a particular match can then be calculated by scaling the EPL’s average goals by the two teams’ attack and defence coefficients. This can then be mapped to the Poisson distribution to generate a probability matrix for each particular scoreline (Table 1). From this, the probabilities can be summed to find the probability that each match will end as a home win, draw or away win.

Goals	0	1	2	3	4	5	6	7	8
0	1.96	4.08	4.24	2.94	1.53	0.64	0.22	0.07	0.02
1	3.63	7.56	7.86	5.45	2.83	1.18	0.41	0.12	0.03
2	3.36	7.00	7.27	5.04	2.62	1.09	0.38	0.11	0.03
3	2.08	4.32	4.49	3.11	1.62	0.67	0.23	0.07	0.02
4	0.96	2.00	2.08	1.44	0.75	0.31	0.11	0.03	0.01
5	0.36	0.74	0.77	0.53	0.28	0.12	0.04	0.01	0.00
6	0.11	0.23	0.24	0.16	0.09	0.04	0.01	0.00	0.00
7	0.03	0.06	0.06	0.04	0.02	0.01	0.00	0.00	0.00
8	0.01	0.01	0.01	0.01	0.01	0.00	0.00	0.00	0.00

Table 1: Example Goal Probabilities (%)
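
For illustration, a score matrix like Table 1 can be built from two independent Poisson distributions as sketched below. This is Python rather than the R script mentioned above. The rates of 1.85 and 2.08 are back-calculated from the ratios of neighbouring cells in Table 1, so they are approximations, and since the table does not say which side is the home team the names are kept neutral.

```python
import math

def poisson_pmf(goals, rate):
    """Probability of scoring exactly `goals` given an expected rate."""
    return math.exp(-rate) * rate ** goals / math.factorial(goals)

def score_matrix(rate_rows, rate_cols, max_goals=8):
    """matrix[i][j] = P(row team scores i and column team scores j)."""
    return [[poisson_pmf(i, rate_rows) * poisson_pmf(j, rate_cols)
             for j in range(max_goals + 1)]
            for i in range(max_goals + 1)]

def outcome_probabilities(matrix):
    """Sum the cells below, on and above the diagonal."""
    rows_win = draw = cols_win = 0.0
    for i, row in enumerate(matrix):
        for j, p in enumerate(row):
            if i > j:
                rows_win += p
            elif i == j:
                draw += p
            else:
                cols_win += p
    return rows_win, draw, cols_win

matrix = score_matrix(1.85, 2.08)  # approximate rates inferred from Table 1
print(round(100 * matrix[0][0], 2))  # probability (%) of a 0-0
rows_win, draw, cols_win = outcome_probabilities(matrix)
print(round(rows_win, 3), round(draw, 3), round(cols_win, 3))
```

Capping the matrix at eight goals drops a tiny amount of probability, so the three outcomes sum to slightly less than one.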

Since the predictions are based on past performance this season, I waited until week five of the EPL to start testing it so I had at least a month’s worth of previous results to work with. The first week went well, with the model correctly predicting the outcome of six of the ten matches that weekend. Table 2 shows the predicted outcome of each match and the probability (%) the model assigned to it. From this I also calculated the odds and compared them with those available from Betfair.

Home	Away	Prediction	Probability (%)	Odds	Betfair	Result
Swansea	Everton	HOME	56.3	1.78	3.35	AWAY
Chelsea	Stoke City	HOME	63.4	1.58	1.39	HOME
Southampton	Aston Villa	AWAY	49.2	2.03	3.1	HOME
West Brom	Reading	HOME	41.1	2.43	1.82	HOME
West Ham	Sunderland	HOME	35.7	2.80	2.24	DRAW
Wigan	Fulham	AWAY	40.1	2.49	3.25	AWAY
Liverpool	Man Utd	AWAY	75.6	1.32	2.82	AWAY
Newcastle	Norwich	HOME	82.9	1.21	1.84	HOME
Man City	Arsenal	AWAY	37.1	2.70	1.78	DRAW
Tottenham	QPR	HOME	41.1	2.43	1.51	HOME

Table 2: EPL Week 5 Predictions
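
The Odds column appears to be the fair decimal odds implied by each probability, that is, 100 divided by the percentage. A quick sketch to check a few rows (the function name is my own):

```python
def decimal_odds(probability_pct):
    """Fair decimal odds implied by a probability given in percent."""
    return 100.0 / probability_pct

for pct in (56.3, 63.4, 82.9):
    print(pct, round(decimal_odds(pct), 2))
# 56.3 1.78
# 63.4 1.58
# 82.9 1.21
```

These agree with the Swansea, Chelsea and Newcastle rows. The Betfair prices include a bookmaker's margin, so they are not directly comparable with fair odds.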

Since then, the Poisson model has correctly predicted between 30% and 60% of matches each week (Figure 2). So far, the average accuracy is 46%, which is slightly higher than the 33% we could expect from randomly guessing each result.

Figure 2: Weekly Performance of Poisson Predictive Model

I am hopeful the model’s success rate will improve over the course of the season as it gets more data to work with. There are also further improvements that can be made. For example, the model currently considers the goals scored by each team to be independent events. However, it may be that the two should be correlated, as it would seem intuitive that the more goals one team scores the less likely the opposition is to score. At the moment though I wouldn’t place too much faith in the Poisson model.

Comments

richard vadoret - August 13, 2013

Hi Martin ,

your graph in figure 1 show that your poisson distribution can predict Under/Over 2.5 final soccer score with good accuracy. isn’t it ?

thanks

richard

Adarsh - October 11, 2014

Try poisson distribution on France Ligue 2… quite accurate…

Martin Eastwood - October 15, 2014

Cool, will take a look!

Influence Of Clean Sheets

Martin Eastwood — Fri, 26 Oct 2012 19:30:00 +0100

Introduction

To make much sense of the statistics available for football we need to have an understanding of their context so I am planning on starting off simple by looking at baselines for various events and statistics while I build up the information required to start a mathematical model.

Clean Sheets

While most football analytics seems to focus heavily on goals, I am going to start off with defending and the all-important clean sheet. Clean sheets have been fairly consistent throughout the English Premier League’s (EPL) history, occurring in around 27% of matches between 1993 and 2011 (Figure 1). The data shows some variability around the mean with perhaps the slightest hint of an upwards trend, but in general the total number of clean sheets per season has remained constant.

Figure 1: Total English Premier League Clean Sheets

Home And Away

If we split the data by home and away then we can immediately see a significant difference (Figure 2; p<0.001). On average, the home team will keep a clean sheet 33% of the time while the away team will only manage it in 22% of their matches. Interestingly, both sets of data appear to follow broadly similar patterns with peaks and troughs occurring in the same years. I hope to explore this in more detail in the future.

Figure 2: Clean Sheets Home and Away

Clean sheets are valuable commodities as they guarantee you a minimum of one point. As the cliche goes, if you keep a clean sheet you cannot lose. Looking back over the EPL’s history shows that a clean sheet at home is actually worth 2.1 points on average. This means that over the course of a season obtaining a clean sheet in 33% of matches would be expected to generate 13.2 points. Away from home a clean sheet is of lower value, generating just 1.8 points each. Over the course of a season this would therefore bring in an additional 7.5 points.
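
The season totals follow from simple multiplication, sketched here on the assumption that each team plays 19 home and 19 away matches in a 38-match EPL season (the function name is hypothetical):

```python
def clean_sheet_points(clean_sheet_rate, matches, points_per_clean_sheet):
    """Expected points from clean sheets over a set of matches."""
    return clean_sheet_rate * matches * points_per_clean_sheet

home = clean_sheet_points(0.33, 19, 2.1)  # 19 home matches assumed
away = clean_sheet_points(0.22, 19, 1.8)  # 19 away matches assumed
print(round(home, 1), round(away, 1))  # 13.2 7.5
```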

The English Premier League

We can use these baselines to examine how teams are performing in terms of clean sheets home and away. Figure 3 shows the proportion of matches in which each team in the EPL obtained clean sheets for the 2011-2012 season. The teams in the upper right quadrant all achieved an above average number of clean sheets both home and away. In comparison, West Brom’s defence performed very well at home yet they struggled to obtain clean sheets away from the Hawthorns. Liverpool were the opposite of West Brom, keeping clean sheets away from Anfield but struggling at home. Bolton, Blackburn and Wolves all generated very low numbers of clean sheets home and away and were all relegated from the EPL. Norwich are an interesting exception as they possessed the worst away record for clean sheets yet managed to finish in a respectable 12th position last year.

Figure 3: Proportion of Matches With Clean Sheets Home and Away

League Position

If we carry out linear regression on 2011-2012’s data (Figure 4) we can see the correlation between the number of clean sheets a team kept over the season and their final league position. The r² value of 0.72 for the regression shows that the two are strongly correlated with each other, so any team not keeping clean sheets could be expected to finish lower down the league table. This does not bode well for current champions Manchester City, who have conceded goals in seven of their eight EPL matches this season.

Figure 4: Correlation of Final League Position to Number of Clean Sheets 2011/2012

Football + Mathematics

Martin Eastwood — Mon, 22 Oct 2012 19:30:00 +0100

There are plenty of analytical football blogs already out there on the internet so I thought long and hard about whether to bother with pena.lt/y. I didn’t want to add to the high level of mundane background noise that already pervades the web and I am not out to compete with anyone to have the biggest, greatest or most popular blog in history.

Instead, pena.lt/y is somewhere for me to jot down thoughts and ideas that interest me before I forget them. I am a data scientist and spend a large portion of my time analysing data to try and work out what is happening and why. Because of this, I have a fascination with numbers and data and trying to glean as much knowledge as possible from them.

This is a hugely exciting time for football analytics. There has never been as much data available as there is now and no one truly knows what it really means, how it all links together or what to do with it. The first step in understanding it all is to set out what we want to achieve. Is our aim to predict the outcome of matches in advance, optimize tactics, identify players’ strengths, analyse the opposition and develop strategies to beat them, spot players at risk of injury, and so on?

Understanding football is also an interesting intellectual challenge. While some sports, such as baseball, lend themselves well to statistical analysis, football appears to be inherently more complicated. With 22 people running distances of up to 10 kilometres each and making hundreds of passes just to score a single goal it is difficult to identify the key data that explains the final outcome. How many passes does it take to score a goal? How many sideways passes have the same value as a forward pass? Does having a high possession percentage improve your chances of winning?

I am also interested in the influence of randomness, or luck, on football. The difference between a nil-nil draw and winning one-nil can be down to a scuffed shot trickling across the line so instinctively it feels like luck has a large role. Yet when you look at international competitions, such as the World Cup, where you would expect luck to have a larger effect, it seems to be the same handful of teams winning. Does skill really outweigh luck and if so by how much?

There is a huge amount of potential for statistics and analysis to influence the sport; we just need to untangle the data. I don’t expect to unearth some magical formula that will answer all our questions but hopefully this blog will satisfy my curiosity and maybe provide some interesting insights for other people into football.

It's Fergie Time

Martin Eastwood — Fri, 19 Oct 2012 19:30:00 +0100

Introduction

After Manchester United’s recent defeat to Tottenham, Sir Alex Ferguson was once again furious about the amount of injury time played. He even went as far as to claim the four minutes Chris Foy added was an ‘insult’.

They gave us four minutes, that’s an insult to the game

However, while it is a common theme in the United manager’s post-match interviews that Manchester United do not receive enough added time, many other football fans think the opposite. In many people’s opinion, Sir Alex Ferguson’s constant complaints pressure referees into awarding United excessive stoppage time.

I wanted to see which was correct so I started off by looking at the amount of injury time added to all the Premier League matches played this season and calculated the average for each team. This is total injury time, so it accounts for time added to both the first and second halves of the match.

Figure One: Average stoppage time in seconds added so far in the Premier League

As you can see above, Manchester United are towards the top of the list but there are still three teams that on average receive more injury time in their matches. Fans of West Ham perhaps get the best value for money for their match tickets as they are top of the list with just under 500 seconds added to each match.

Somewhat surprisingly Manchester City and Arsenal are both near the bottom, with 335 and 366 seconds, respectively. This suggests that so far in the season, bigger clubs are not necessarily receiving any bias based on their status.

As we have only had seven weeks of the Premier League played so far we cannot draw too many firm conclusions from this relatively small sample set. Currently though there is no statistically significant difference (p > 0.05) between the amount of stoppage time Manchester United have received compared with any other team in the Premier League. Maybe this is why Sir Alex Ferguson is getting so worked up…

Comments

Jonas - December 12, 2012

More Or Less on BBC Radio 4 did a piece on this a couple of weeks ago. They looked more closely at the differences when Manchester United are losing or drawing, and also compared them with the other teams in the EPL. If I recall correctly their conclusion was that better teams get more injury time, and that there was some difference depending on whether they played at home or away.

http://www.bbc.co.uk/programmes/b01ny0fc

Martin Eastwood - December 12, 2014

Interesting, thanks for the link Jonas!

Hello World

Martin Eastwood — Thu, 18 Oct 2012 19:30:00 +0100

Hello!!

Hopefully this blog will be live soon once I have worked out how to use it…