I've discussed various techniques for ranking football teams on my blog before, such as using Massey Ratings to account for strength of schedule, but I've not covered the most famous ranking algorithm of them all yet, Google's PageRank.
The PageRank algorithm (Figure One) was initially developed by Larry Page and Sergey Brin in the mid-nineties whilst they were working on a research project at Stanford University. When Page and Brin later founded Google, PageRank became the cornerstone of how their search engine ranked web pages and determined the most relevant set of results for a user's query.
$PR(A) = (1-d) + d\left(\frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)}\right)$
where:
$PR(A)$ is the PageRank of page $A$
$PR(T_i)$ is the PageRank of each page $T_i$ that links to page $A$
$C(T_i)$ is the number of outbound links on page $T_i$
$d$ is a damping factor between 0 and 1
Figure One: Google's PageRank Algorithm
Google's search algorithm has evolved considerably since then, with updates such as Panda, Hummingbird and RankBrain brought in to help deal with content farms and to better understand ambiguous queries. However, PageRank remains the central method for determining a web page's rank in the search results.
The PageRank algorithm essentially counts links on the web and treats them like votes of support. The more links there are leading to a specific web page, the more votes there are for that page being of high quality. Not all votes are counted equally though. Votes coming from pages that are themselves considered high quality count for much more than votes from pages with few links leading to them.
This puts us in a bit of a tricky situation though, as it means the rank of a web page is dependent on the ranks of all the pages linking to it, which are themselves dependent on the ranks of all the pages pointing to them, and so on. Plus, when you factor in that two web pages can link to each other, you end up with enough circularity to make this initially seem an impossible calculation.
It turns out that this problem can be solved fairly easily through brute-force iteration. We start off by giving each web page a default score, then repeatedly move through the system updating every page's rank based on the ranks of all the pages linking to it, until the whole system converges and the ranks settle down to their true values (or at least close enough that we don't mind the remaining error).
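To make the iteration concrete, here is a minimal sketch in Python of the formula from Figure One. The function name `pagerank`, the starting score of 1.0, the damping factor of 0.85, the tolerance and the tiny three-page web are all my own assumptions for illustration, and the sketch assumes each page lists a given target only once.

```python
def pagerank(links, d=0.85, tol=1e-9, max_iter=1000):
    """Iteratively apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)).

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    ranks = {page: 1.0 for page in pages}  # default starting score (assumed)

    for _ in range(max_iter):
        new_ranks = {}
        for page in pages:
            # Each page linking here passes on an equal share of its own rank.
            incoming = sum(
                ranks[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_ranks[page] = (1 - d) + d * incoming

        # Stop once no rank moved by more than the tolerance.
        change = max(abs(new_ranks[p] - ranks[p]) for p in pages)
        ranks = new_ranks
        if change < tol:
            break
    return ranks


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy example C ends up on top because it collects all of B's vote plus half of A's, while B only ever receives half of A's vote.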
If you want to be more elegant though, you can actually skip this brute force approach and solve the whole system using linear algebra but I'm going to leave that for a future article.
Instead of using links as votes of support from one web page to another, we can use goals as votes of support from one team to another, where the more goals a team concedes, the stronger their vote of support for the opposition that scored against them.
A handy feature of the PageRank algorithm is that web pages only get one vote that ends up being shared out equally between all the other pages they are linking to. When we apply this to football it means that the more goals you score against a team, the greater the share of their vote you receive.
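As a rough sketch of how that might look in Python, the code below shares each team's single vote out in proportion to the goals it conceded to each opponent. This is an illustration under my own assumptions rather than the exact code behind the rankings: the function name `football_pagerank`, the damping factor of 0.85, the made-up teams and scores, and the choice to let a team that concedes nothing simply cast no vote are all hypothetical.

```python
from collections import defaultdict


def football_pagerank(matches, d=0.85, tol=1e-9, max_iter=1000):
    """Rank teams, treating goals conceded as votes for the scoring team.

    `matches` is a list of (home, away, home_goals, away_goals) tuples.
    """
    # conceded[team][opponent] = goals that opponent scored against team
    conceded = defaultdict(lambda: defaultdict(float))
    teams = set()
    for home, away, home_goals, away_goals in matches:
        teams.update((home, away))
        conceded[away][home] += home_goals
        conceded[home][away] += away_goals

    ranks = {team: 1.0 for team in teams}
    for _ in range(max_iter):
        new_ranks = {team: 1 - d for team in teams}
        for team in teams:
            total = sum(conceded[team].values())
            if total == 0:
                continue  # assumption: no goals conceded means no vote cast
            for scorer, goals in conceded[team].items():
                # The scorer's share of the vote is its share of the goals.
                new_ranks[scorer] += d * ranks[team] * goals / total

        change = max(abs(new_ranks[t] - ranks[t]) for t in teams)
        ranks = new_ranks
        if change < tol:
            break
    return ranks


# Made-up results purely for illustration.
matches = [
    ("Reds", "Blues", 3, 0),
    ("Blues", "Greens", 2, 1),
    ("Greens", "Reds", 0, 1),
]
team_ranks = football_pagerank(matches)
ranking = sorted(team_ranks, key=team_ranks.get, reverse=True)
print(ranking)
```

Here the Reds come out on top: they take three quarters of the Blues' vote and a third of the Greens' vote, while conceding nothing themselves.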
Table One below shows the rankings for the top 100 European teams over the past year based on domestic league and European fixtures (domestic cup competitions are not currently included) as calculated using Google's PageRank algorithm.
Rank | Team |
---|---|
1 | paris saint-germain |
2 | fc barcelona |
3 | bayern münchen |
4 | atlético madrid |
5 | borussia dortmund |
6 | real madrid |
7 | sevilla fc |
8 | manchester city |
9 | arsenal fc |
10 | olympique lyon |
11 | valencia cf |
12 | vfl wolfsburg |
13 | tottenham hotspur |
14 | juventus |
15 | sl benfica |
16 | villarreal cf |
17 | cska moskva |
18 | athletic bilbao |
19 | chelsea fc |
20 | shakhtar donetsk |
21 | as roma |
22 | celta vigo |
23 | inter |
24 | ssc napoli |
25 | bayer leverkusen |
26 | as monaco |
27 | bor. mönchengladbach |
28 | stade reims |
29 | as saint-étienne |
30 | lazio roma |
31 | fc lorient |
32 | zenit st. petersburg |
33 | fk krasnodar |
34 | 1899 hoffenheim |
35 | toulouse fc |
36 | acf fiorentina |
37 | everton fc |
38 | hellas verona |
39 | malmö ff |
40 | sampdoria |
41 | olympique marseille |
42 | lille osc |
43 | rsc anderlecht |
44 | celtic fc |
45 | rb salzburg |
46 | dinamo moskva |
47 | olympiakos piräus |
48 | real sociedad |
49 | afc ajax |
50 | liverpool fc |
51 | montpellier hsc |
52 | manchester united |
53 | fc porto |
54 | fc augsburg |
55 | psv eindhoven |
56 | club brugge kv |
57 | leicester city |
58 | asteras tripolis |
59 | west ham united |
60 | sassuolo calcio |
61 | southampton fc |
62 | girondins bordeaux |
63 | crystal palace |
64 | lokomotiv moskva |
65 | brøndby if |
66 | molde fk |
67 | dinamo kiev |
68 | fc københavn |
69 | fenerbahçe |
70 | paok saloniki |
71 | ogc nice |
72 | fc schalke 04 |
73 | werder bremen |
74 | stoke city |
75 | terek grozniy |
76 | sporting cp |
77 | espanyol barcelona |
78 | kaa gent |
79 | bsc young boys |
80 | torino fc |
81 | galatasaray |
82 | dnipro dnipropetrovsk |
83 | hamburger sv |
84 | stade rennes |
85 | levante ud |
86 | fc twente |
87 | rosenborg bk |
88 | gd estoril |
89 | ac milan |
90 | fc basel |
91 | fc nantes |
92 | hannover 96 |
93 | panathinaikos |
94 | az alkmaar |
95 | sunderland afc |
96 | sm caen |
97 | rubin kazan |
98 | fk ural |
99 | 1. fsv mainz 05 |
100 | standard liège |
Table One: Top 100 European Teams As Ranked By The Google PageRank Algorithm
The initial results seem pretty feasible, with the top five spots comprising Paris Saint-Germain, Barcelona, Bayern München, Atlético Madrid and Borussia Dortmund.
Olympique Lyon are something of a surprise in tenth position, but they beat Paris Saint-Germain earlier in the season and have been playing in the Champions League, so perhaps they are better than my preconceptions? FK Krasnodar also appear higher than I was expecting, but looking back at their results they did quite well in the Europa League this season, including beating Borussia Dortmund, so it's perhaps not unreasonable for them to appear in the top half of the rankings too.
There are a number of ideas for improving this concept further, such as adding a decay into the data so more recent results carry greater importance in the rankings, or adding in home field advantage (which is currently missing), so no doubt there'll be an update to this blog in the future once I've refined things.
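One possible way to add the decay, sketched below in Python, is to multiply each match's goals by a weight that halves every so many days before feeding them into the ranking. The function name `decay_weight`, the exponential form, the 180-day half-life and the example dates are all assumptions on my part rather than anything I've settled on.

```python
from datetime import date, timedelta


def decay_weight(match_date, today, half_life_days=180.0):
    """Weight for a match: 1.0 if played today, halving every half-life."""
    age_days = (today - match_date).days
    return 0.5 ** (age_days / half_life_days)


# Hypothetical dates purely for illustration.
today = date(2016, 3, 1)
recent = decay_weight(today - timedelta(days=7), today)
old = decay_weight(today - timedelta(days=360), today)
print(round(recent, 3), round(old, 3))
```

A goal scored a week ago would then count for almost a full goal, while one scored 360 days ago would count for a quarter of one.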