If you want to get in touch, you can use the comment form below, leave a comment on a blog post or find me on Twitter @penaltyblog. If you want to contact me privately, you can use the form below.

Hi Martin,

I've been looking through your website and I really appreciate the work you have done. It's all great work!

I'm working on my own personal project where I want to investigate predicting football matches using the Dixon and Coles model vs Neural Networks to see which approach yields more accurate predictions.

Now for the Dixon and Coles model, I have read through the paper but I'm not entirely sure how they calculate the attack and defence rankings. Would you be able to shed some light on this? Also I have done my development so far in Java, would you say that Java is not suitable to implement the model and I would be better with using R (mind, that I have no experience of using R).

Thanks,

Aneesh

Thanks for the comments Aneesh!

There is an example of how to calculate the Dixon and Coles attack strengths in my presentation here, which should help get you started.

In theory anything you can do in R you could also do in Java but it would likely mean writing a lot more code as R comes with all the statistical functions you need built in. Personally, I'd recommend taking a look at R or even Python. If you already have some experience at coding you'll soon pick them up and you may find it easier to achieve what you want to do.

I look forward to seeing how your neural net does!

Thanks,

Martin

Hello Martin,

I would like to know what are your thoughts on calculating a team's attack and away strengths using a simple calculation as follows vs the Dixon & Coles approach (their approach isn't explained in their paper)?

Calculate the average number of goals scored by a team / average number of goals scored in a league by all teams

Calculate the average number of goals conceded by a team / average number of goals conceded in a league by all teams

So team's average divided by "average team's" average
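For reference, the ratio described above can be sketched in a few lines of Python. This is only an illustration of the simple calculation in the question: the function name `ratio_strengths` and the `(home, away, home_goals, away_goals)` tuple layout are assumptions made for the example, and it ignores opposition quality and home advantage.

```python
from collections import defaultdict


def ratio_strengths(matches):
    """Return {team: (attack_ratio, defence_ratio)} from simple averages.

    matches: iterable of (home_team, away_team, home_goals, away_goals).
    The tuple layout is an assumption made for this sketch.
    """
    scored = defaultdict(int)
    conceded = defaultdict(int)
    played = defaultdict(int)
    total_goals = 0

    for home, away, home_goals, away_goals in matches:
        scored[home] += home_goals
        conceded[home] += away_goals
        scored[away] += away_goals
        conceded[away] += home_goals
        played[home] += 1
        played[away] += 1
        total_goals += home_goals + away_goals

    # League-wide average goals per team per match. Every goal scored is
    # also a goal conceded, so the same average serves both ratios.
    league_avg = total_goals / sum(played.values())

    return {
        team: (
            scored[team] / played[team] / league_avg,    # attack ratio
            conceded[team] / played[team] / league_avg,  # defence ratio
        )
        for team in played
    }
```

A value above 1 for the first ratio means the team scores more than the average team, and a value above 1 for the second means it concedes more than the average team.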

Hi Ian

You'll get much better results if you calculate the team strengths using a Poisson regression, so that you can account for each team's opposition, as that will affect how many goals they would have been expected to score. You'll also need to account for home / away, as home-field advantage has a notable effect too.

Thanks,

Martin

Thanks for the reply Martin,

Do you have any examples for doing a Poisson regression to calculate strengths? Not code examples but actual mathematical formulae examples?

Thanks,

Ian

Hi Ian

The best place is the paper published by Dixon and Coles in which they discuss their methodology for using the Poisson to predict football matches - http://www.math.ku.dk/~rolf/teaching/thesis/DixonColes.pdf

Thanks,

Martin
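As a brief summary of the formulae in that paper, the basic model (before the low-score adjustment and time weighting) treats the home goals X and away goals Y in a match between home team i and away team j as independent Poisson variables. Here α is attack strength, β is defensive weakness and γ is the home advantage. The log form on the second line is my own restatement, included to show why this can be fitted as a Poisson regression.

```latex
X_{ij} \sim \mathrm{Poisson}(\alpha_i \beta_j \gamma), \qquad
Y_{ij} \sim \mathrm{Poisson}(\alpha_j \beta_i)

\log E[X_{ij}] = \log \alpha_i + \log \beta_j + \log \gamma, \qquad
\log E[Y_{ij}] = \log \alpha_j + \log \beta_i

\frac{1}{n} \sum_{i=1}^{n} \alpha_i = 1
```

The last line is the constraint the paper places on the attack parameters so that the model is identifiable.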

Thanks for the link to the paper.

After reading through it, it doesn't seem they state how they calculate the attack and defence strengths nor the home advantage (alpha, beta and gamma values). I understand how they improve their model to account for low scoring matches and a time weighting function as that is all detailed in the paper but the initial calculation of alpha, beta and gamma is not.

Ian

Hi Ian

This may have more what you are looking for then - http://www1.maths.leeds.ac.uk/~voss/projects/2010-sports/JamesGardner.pdf

Martin
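For anyone else stuck at the same point: the alpha, beta and gamma values are not calculated directly but estimated by maximum likelihood, i.e. by searching for the parameter values that make the observed scores most probable. The sketch below only evaluates the log-likelihood, including the low-score adjustment, for a given set of parameters. The function names and the dictionary layout are assumptions for the example, time weighting is left out, and the search itself would be handed to a numerical optimiser that is not shown.

```python
import math


def tau(x, y, lam, mu, rho):
    """Dixon and Coles adjustment for the four low-scoring results."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0


def poisson_log_pmf(k, rate):
    """Log of the Poisson probability of observing k events."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def log_likelihood(matches, attack, defence, gamma, rho=0.0):
    """Sum the log-likelihood over matches for the given parameters.

    matches: iterable of (home_team, away_team, home_goals, away_goals).
    attack, defence: dicts mapping team name to a positive strength.
    gamma: home advantage multiplier; rho: low-score dependence term.
    """
    total = 0.0
    for home, away, x, y in matches:
        lam = attack[home] * defence[away] * gamma  # expected home goals
        mu = attack[away] * defence[home]           # expected away goals
        total += (
            math.log(tau(x, y, lam, mu, rho))
            + poisson_log_pmf(x, lam)
            + poisson_log_pmf(y, mu)
        )
    return total
```

With `rho=0` the adjustment disappears and this reduces to the independent Poisson model.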

Thanks for the link Martin, it was very useful.

I'm in the process of implementing the Dixon and Coles model and seeing if I can improve its predictions by adding extra factors (morale, weather, etc.). I have the model almost implemented and just need to do some final tweaks, but I would like to create predictions every week.

So say I have a season's worth of historic data, I use it to calculate the attack/defence strengths to make predictions for week 1 matches. So when I want to make predictions for week 2, I'd use the historic data plus actual results for week 1 to calculate new strengths for predictions and then repeat the process every week. However I don't think this is necessarily optimal. To simplify, say you had ten weeks of data and you used those ten weeks to calculate the strengths for week 11. Now it comes to week 12, rather than using weeks 1-11 to calculate 12's strengths, couldn't you just use the strengths that have already been calculated (and stored) for weeks 1-10 and then just do a calculation for week 11 only and then combine the calculated strengths to get the strengths for week 12?

Ian

Hi Ian

Sure, there's no intrinsic reason why you couldn't update your model rather than recalculate everything. In my experience though people do recalculate all the data as it's often simpler to structure the code that way.

Martin
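The "recalculate everything each week" structure can be sketched as a simple loop. This is a minimal illustration only: `walk_forward` is a hypothetical helper, `fit` stands for whatever function fits your model from a list of matches, and the week-keyed dictionary is an assumed data layout.

```python
def walk_forward(matches_by_week, fit, first_week):
    """Yield (week, model) pairs, refitting on all earlier weeks each time.

    matches_by_week: dict mapping week number to a list of matches.
    fit: any callable that takes a list of matches and returns a model.
    first_week: the first week to produce a model for.
    """
    weeks = sorted(matches_by_week)
    for week in weeks:
        if week < first_week:
            continue
        # Use every match played strictly before the week being predicted.
        history = [
            match
            for earlier in weeks
            if earlier < week
            for match in matches_by_week[earlier]
        ]
        yield week, fit(history)
```

Each model only ever sees results from before the week it predicts, which keeps the structure simple at the cost of repeating the fit every week.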

Hi Martin,

I have been producing some predictions using the Dixon and Coles model using independent Poisson regression as you've demonstrated in one of your posts but I have also added the adjustment for low scoring matches.

Now for the current Premier League season, I have passed all the results up to today to the model to produce predictions. I have also tried giving the model past results for only the two teams involved in a match, rather than for all teams, because if we have a match between Team A and Team B, surely only the past results for those teams matter and we don't care about the results for Team Z, who hasn't played A or B. However, I have noticed differences in the predictions between these two approaches. Sometimes the probabilities change a bit to reinforce a particular result, but in some cases the predicted result is completely different.

Would you have any idea for why this may be the case?

Ian

Hi Ian

Sounds like you are training your model on two different sets of data? If so, that will be giving you different coefficients and cause the difference you are seeing.

Martin

Hi Martin,

I'm trying two different approaches.

The first approach is including results for all teams so when predicting the result of Team A vs Team B, you will use all the previous matches played by Team A, B, C, D…X, Y, Z.

So Team C vs Team J would be included when predicting Team A vs Team B. This is the standard approach I have seen.

The alternative I am suggesting is to only use past results which involve Team A and Team B. So we would have results of Team A vs Team E, Team A vs Team P and Team B vs Team Y, Team B vs Team U. We don’t care about Team N vs Team R or Team W vs Team K for example.

Wouldn’t this work better? How would the result of Team U vs Team O influence Team A vs Team B? Surely we only care about the matches Team A and Team B have played against other teams?

Ian

Hi Ian

Yep, the main reason people fit it all into one model is the convenience of not having to do all the different combinations of model fits, which can be quite time consuming and more fiddly to implement. You'll get slightly different coefficients that way but generally it doesn't change too much. If you are seeing noticeable differences then I wonder if you have a small sample size for a team somewhere. Have you validated the two different methods though to see which provides the best accuracy?

Martin

Hi Martin,

Was reading some posts on your blog, really interesting posts you write.

I have a question, maybe idea about new blog post.

Was looking online how to calculate league strength.

For example, England, premier league, was looking at something to read past league results, each single game, round position, to add some weight on it, and to try to get "league strength" for past seasons, then to try to predict round, final standings? What do you think, what are pros and cons ? How would you calculate league strength ?

Thank you for your time. Have a nice day.

Hi Ivan

League strength is something I really want to look at. I'd like to create some sort of weighting so I can compare different leagues against each other. For example, how much harder is the English Premier League compared with the English Championship or the Dutch Eredivisie? I haven't found a nice way of doing this yet but it is on my to-do list!

Martin

Hi Martin,

I'm a keen follower of your posts and the football analytics industry as a whole. I'm currently studying economics at Newcastle University which involves me undertaking a dissertation in applying classic economic theories to football analysis. As it stands, my work primarily revolves around the use of Shapley values to gauge players' marginal contributions on the pitch. In particular I was wondering if this is something you've done any research into or if there's currently the market for football clubs wanting this type of information. I'm aware of the work you've done in producing a mathematical attribute model and would really appreciate being able to understand more about your work and how closely it would link with mine in the use for comparing and evaluating players.

Thanks for your time.

Regards,

Liam Kirk

Hi Liam,

Sounds interesting. I've never looked into Shapley values specifically but it's something that's on my todo list when I get enough free time! I don't think too many clubs will be using them at the moment but there's no reason they wouldn't be interested provided you can make the output relevant to them and simple to understand - the majority of people within football don't have particularly mathematical backgrounds so you need to package the results in a way that resonates with them.

Hope that helps!

Martin
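For readers unfamiliar with Shapley values: a player's value is their marginal contribution to a group, averaged over every order in which the group could be assembled. The sketch below computes this exactly by enumerating orderings, so it only suits small groups. The `value` function, which scores any set of players, is entirely hypothetical here, and how to define it for football is the hard part of the problem.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values by averaging over all player orderings.

    players: list of hashable player identifiers.
    value: callable taking a frozenset of players and returning a number
           (a hypothetical scoring function supplied by the caller).
    """
    totals = {player: 0.0 for player in players}
    orderings = list(permutations(players))

    for order in orderings:
        coalition = frozenset()
        for player in order:
            with_player = coalition | {player}
            # Marginal contribution of this player to the current group.
            totals[player] += value(with_player) - value(coalition)
            coalition = with_player

    return {player: total / len(orderings) for player, total in totals.items()}
```

The values always sum to the worth of the full group, which is what makes them attractive for sharing credit between players.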

Hey,

Maybe you can help me.

I have a question about Elo ratings. I want to change the S. Normally it is 1, 0 or 0.5, but a win or loss is only half the truth. I want to use shots on target as S. Example: Team A has 7 shots on target and the combined probability of those shots is maybe 54%. Then 0.54 would be my S.

What do you think, a good approach?

Hi Sing,

Go for it and see what happens. Maybe try using the total shot ratio as S, would be interesting to see the results

Martin
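A minimal sketch of that idea, using the standard Elo update with the total shot ratio substituted for the usual 1 / 0.5 / 0 result. The K-factor of 20 and the 400-point scale are conventional defaults assumed for the example, not values from this blog, and the function names are made up for illustration.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def shot_ratio(shots_for, shots_against):
    """Share of the match's total shots taken by the team (0.5 if none)."""
    total = shots_for + shots_against
    return 0.5 if total == 0 else shots_for / total


def update_elo(rating_a, rating_b, s_a, k=20.0):
    """Return the new (rating_a, rating_b) given A's score s_a in [0, 1]."""
    delta = k * (s_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

For example, `update_elo(1500, 1500, shot_ratio(7, 3))` moves the team with 7 of the 10 shots up and its opponent down by the same amount, regardless of the final score.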

Hi Martin,

Baseball stats guru Tom Tango pointed me in your direction. I am thinking about building a version of his Fans' Scouting Report (http://www.tangotiger.net/scout/) for football and he suggested you would be the one to know if it has already been done. My Googling efforts haven't turned up anything, but I am hampered by the football/American football dilemma; any search that combines 'football' and 'scouting' gives me thousands of results about the NFL combine.

Do you know if such a thing (using the wisdom of the crowd to evaluate football players) exists? I'd love to get any other feedback you have on the idea, especially any suggestions for ratings categories. On the off chance you'd be interested in being a part of the project, I'd welcome that too!

Thanks,

Jake

A bit of background about me: I'm a PhD student in archaeology at the University of Washington with a good deal of experience with Python, R, and SQL (and HTML, but I guess that's not as useful in the Age of Wordpress). I am a devoted follower of Arsenal and Columbus Crew. With the way Arsenal have unraveled as of late, I've been more seriously pursuing some data-driven side projects that I've had in mind for a while as a way to distract myself from the annual implosion.

Hi Jacob,

As far as I'm aware, this hasn't been done before for association football (soccer). It's an interesting idea though and I'm definitely happy to help out if you want. I can imagine using the qualitative data generated by this sort of thing to help form Bayesian priors etc.

Martin

Hi! First of all thank you for an interesting blog! My name is Viking and I'm currently looking into predictions for football, in particular the EPL. My plan is to evaluate a number of different systems such as the Elo type system used by the Euro Club Index as well as later systems such as Glicko 2 and TrueSkill from Microsoft that also models the std deviation of a team's skill and uses a Bayesian approach. I find your Eastwood index very interesting as well, would you by any chance be interested in sharing your work on the Eastwood index in more detail? I also noticed you seem to have stopped using it? Any reasons why? Best regards, Viking Jacobsson

Hi Viking,

I stopped running my Eastwood Index as I started using a Poisson model based on a Dixon and Coles style method instead. This has the advantage of being able to calculate odds for things like over / unders etc. rather than just 1x2.

I do have plans to bring my Eastwood Index back as a way of ranking and comparing football teams. It will probably be next season by the time I publish anything on it though.

Martin
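To illustrate why a goals-based model gives more than 1x2: once you have an expected goals rate for each team, every scoreline has a probability, and any market is just a sum over the relevant scorelines. The sketch below assumes independent Poisson goals with no low-score adjustment, truncates at 10 goals per team, and uses made-up function and parameter names.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals for a Poisson rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def market_probabilities(home_rate, away_rate, max_goals=10, line=2.5):
    """Sum scoreline probabilities into 1x2 and over/under markets."""
    probs = {"home": 0.0, "draw": 0.0, "away": 0.0, "over": 0.0, "under": 0.0}
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            p = poisson_pmf(x, home_rate) * poisson_pmf(y, away_rate)
            # 1x2 market
            if x > y:
                probs["home"] += p
            elif x == y:
                probs["draw"] += p
            else:
                probs["away"] += p
            # Total goals market
            if x + y > line:
                probs["over"] += p
            else:
                probs["under"] += p
    return probs
```

The same scoreline grid could be summed differently for other markets, such as correct score or both teams to score.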

Hey, great work! I've only just found this website and it's become my new favorite. Will you be releasing the code for player ratings soon, or are you keeping it private? I would love to see it and test it myself. Thank you and keep up the great work.

Thanks for your message James, the code will not be released now but I'm hoping to make the data available soon - may be a fun project for over the summer while there's no football on!

Hi, would it be possible to see your code for player ratings for educational purposes? I am doing a dissertation in my economics course about the football industry, and I would like to include a section about player ratings and player market value (using transfermarkt.co.uk) and the large disparity between skill and value. I promise to reference you in my work; your code would be an incredible help. Thank you.

Thanks for your message Alan, you are welcome to discuss the model in your dissertation but unfortunately I will not be releasing the code.

Hi Martin,

Before anything, sorry for my poor English.

My name is Thomas and I'm a French apprentice data scientist.

Like you, I really love football, and two months ago I decided to make my own model to predict match scores.

I have already scraped a lot of data from several websites and painstakingly joined it together.

I have the past 3 seasons of the French Ligue 1 (scores, match stats, player ratings, etc.).

I use the 2013 and 2014 seasons and the first half of the 2015 season as the training set, and the second half of the 2015 season as the test set. Because I refit each model for each new matchday, my models are always readjusted before the day I want to predict.

My point is that even though I have a lot of data, only a few variables seem to be significant, and I can only predict about 52% of match results (over the last 19 matchdays, 10 matches per matchday). I tried a Poisson model, an LDA model, multinomial regression and an SVM model.

If you were willing to give me some advice, I would be glad to share some of my data to show what I have, and then maybe you could tell me what I could do.

Anyway, thanks for your time, and sorry for being quite rude maybe :).

PS: I use R for EVERYTHING.

Hi Thomas,

The best advice I can give you is to read the paper by Dixon and Coles - http://www.math.ku.dk/~rolf/teaching/thesis/DixonColes.pdf.

Their technique works much better than the standard Poisson model and provides a good basis for developing further.

Thanks for answering so quickly !

Unfortunately I have already read this paper, but even though I can understand almost every single formula, it's quite hard to translate it into R.

But thank you anyway, I will focus on this then. ;)

Hi Martin,

Just wanted to let you know I've riffed off your linear programming post to perform (approximately) the same procedure.

You can see the result here - https://github.com/pearcedom/FF/blob/master/16-17/lineup-select/linearp-selection.md

Thanks for the groundwork!

Awesome, thanks Dominic I'll take a look!

Hi, I would like to thank you and congratulate you for all the work you have been doing on this blog, and in sports analytics in general. I would like to suggest a post where you gather the sources you get your data from, and perhaps also a pool of academic references for the different concepts and algorithms you mention from time to time. Keep up the good work.

Thanks Nick! It's difficult to write about where to get data from without upsetting the commercial data providers whose data I take...

I do like the idea of writing more about some of the academic references and algorithms I use though, so I'll add it to my todo list!

Hi Martin,

I stumbled across your blog yesterday and have really enjoyed reading your work. I'm actually in the process of trying to switch into data science (after a decade in finance) and am going on a bootcamp in the states in a couple of months (Metis) to help me acquire the skills and make the switch.

I remember reading the Dixon and Coles paper a few years ago and thinking...I wish I knew HOW to analyse this kind of thing in practice. The reason I found your blog was because I was thinking about ideas for my final project (a presentation of the skills learned) and had an idea of looking at football in some way.

Do you mind pointing me in the direction of where you scrape your data from?

Regards,

Matt

P.S - can you run a player comparison on N'Golo Kante? As a Leicester fan, him leaving has had a devastating impact on the team. Leicester (who do a lot of technical scouting) have now signed two players to try and replace him (Mendy and Ndidi) and it would be interesting to see if your model also identifies these two players.

Thanks for the kind words Matt!

Squawka and whoscored are great sources of data, as is the BBC's website. Based on Kante's performances last season, the closest few players are Francis Coquelin, Daniel Drinkwater, Isaac Cofie and Asier Illarramendi.

Hi Martin,

Some great work. I am pretty new to it all, hope you have a second to answer some questions.

So if I want to use expected goals in my next scouting video, I saw your equation, but is there actually a standard model? I am having trouble finding one online. I've seen various goal zone maps and equations, but I work with an agent, and presenting players with any old model is going to be confusing considering not many folks know what they're looking at.

Is that just the way it is? I read that you mentioned most people don't show their model. Is that still true? Is it a work in progress, or is everyone just individualising their own 'special' approach and not agreeing to use one simple model?

I definitely want to get on board, is there somewhere I can pay to guarantee the correct model or is there no correct model?

Matt

I hope you can understand my queries, cheers

Hi Adam,

There is no 'correct' expected goals model at the moment as people have tended to create their own models using a variety of techniques, often with questionable mathematical rigour.

People have also been rather secretive about how they've done things so no standard approach has evolved and frustratingly different models often give different results to each other.

If you are looking for a commercial package then Opta are probably the best people to speak to as they calculate expected goals themselves. Otherwise, you are welcome to freely use the equations I've published on my blog here.

Martin

Hi Martin,

I have recently had a link to your site forwarded to me and having had a brief read of some of your blogs I just wanted to drop you a message to say how much I appreciate you making all of this widely available.

Not that you need a random performance analyst giving you a big 'Well done', but as much as you are probably aware of the quality of this site and your work, it's always nice to hear from someone who appreciates it.

It gives great insight which has certainly given me food for thought on my own work with elite sports analysis.

Thanks again,

Amber

Thank you for the kind words Amber, I really appreciate your message and hope you find the info here useful for your work :)

Martin

"This training set was created from approximately 30,000 shots taken from the English Premier League's 2012/2013 and 2013/2014 seasons."

Could you please tell me where you got this data and, if possible, provide a link to it? I'm working on a similar project. Thanks!

All the shot data was collected from whoscored and squawka. However, as I don't own the rights to the data, I am unable to share it with you.

Martin