Introduction
Football analytics has come a long way in recent years, moving from simple league tables to more sophisticated methods of quantifying team performance. If you’ve ever looked at Elo ratings or FIFA rankings, you know that rating systems attempt to provide a clearer picture of how good a team really is, beyond just the wins and losses. But are these systems as accurate as they could be?
Imagine two teams: Team A beats Team B 1-0 in a closely fought match, while Team C thrashes Team D 5-0. Should Team A and Team C gain the same rating boost? Many traditional rating systems don't differentiate much between these results, even though one clearly signals a more dominant performance. This is where Pi Ratings come in — a dynamic rating system designed to better reflect team ability by considering score discrepancies, home vs. away performances, and recent form.
Pi Ratings were first introduced by Constantinou & Fenton in their research on dynamic football team ratings. Their study showed that Pi Ratings not only provided a more accurate measure of team strength compared to traditional systems like Elo, but also demonstrated profitability against bookmaker odds over five English Premier League seasons.
In this article, we’ll explore:
- Where traditional rating systems, like Elo, lack the nuance for football’s specific demands
- How Pi Ratings improve upon them
- Some real-world results comparing the two
- How you can use Pi Ratings in your own football analytics work
If you’re interested in football (soccer) data or just want a better way to evaluate your team’s strength, this is for you. Let’s dive in.
Why Traditional Ratings Systems (Like Elo) Fall Short
Rating systems play an important role in football analytics, providing a way to compare teams and predict future performance. One of the most widely used methods is the Elo rating system, originally developed for ranking chess players and later adapted for sports, including football. While Elo has proven useful in capturing team strength over time, it has several key limitations when applied to football.
Elo Ratings: A Brief Overview
Elo ratings work by assigning a numerical value to each team, which is adjusted after every match based on the result. If a higher-rated team wins, it gains only a small increase in its rating, whereas an underdog victory leads to a more significant rating adjustment. The formula accounts for the expected probability of winning, meaning that an upset results in a greater shift than an expected victory.
The appeal of Elo lies in its simplicity: teams are ranked on a single scale, and their relative strength is updated dynamically based on match outcomes. However, despite its widespread use, Elo has several shortcomings that reduce its effectiveness in football.
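For reference, the standard Elo mechanics described above can be sketched in a few lines. This is a minimal illustration, not any particular football implementation; the K-factor of 20 and the scale of 400 are conventional defaults, chosen here for the example:

```python
def elo_expected(r_home, r_away, scale=400):
    """Expected score for the home team under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((r_away - r_home) / scale))

def elo_update(r_home, r_away, result, k=20):
    """result: 1 for a home win, 0 for an away win, 0.5 for a draw."""
    expected = elo_expected(r_home, r_away)
    delta = k * (result - expected)  # upsets produce a larger shift
    return r_home + delta, r_away - delta
```

Note that `result` collapses the scoreline entirely: a 1-0 win and a 5-0 win both enter the update as `result = 1`, which is exactly the limitation discussed below.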
Draws Are Not Handled
A significant limitation of the standard Elo system is that it only accounts for wins and losses — it doesn't natively handle draws. While this works well in sports like chess or tennis, it's a major drawback in football, where draws are a common and meaningful outcome.
Some football adaptations of Elo attempt to account for this by treating draws as "half a win," but prediction-wise, Elo doesn't model draws probabilistically — it simply doesn't output a draw probability. We'll return to this point later when we compare how Pi and Elo predictions perform head-to-head.
Score Margins Are Ignored
Elo ratings consider only whether a team wins, loses, or draws, but do not take into account the margin of victory. A 1-0 win is treated the same as a 5-0 win, even though the latter provides a much stronger indication of dominance. Since goal differences carry valuable information about team strength, ignoring them can lead to inaccurate assessments of performance.
Home and Away Performances Are Not Handled Separately
Football teams often exhibit significantly different performances at home and away due to factors such as crowd support, pitch familiarity, and travel fatigue. Traditional Elo ratings apply the same formula regardless of match location, failing to account for these home and away discrepancies. Some adaptations of Elo introduce a fixed home advantage adjustment, but this is often static and uniform across teams, whereas in reality, home advantage could vary by club and competition.
Slow Adaptation to Recent Form
Elo ratings update dynamically, but the changes are incremental and cumulative. This means that a team experiencing a sudden surge or decline in form may not have its rating adjust quickly enough to reflect its current strength. For instance, a team suffering from injuries to key players or undergoing a managerial change may take several matches before its Elo rating adequately reflects the shift in performance.
What Are Pi Ratings?
Pi Ratings, introduced by Constantinou & Fenton (2013), are a dynamic rating system designed specifically for football, addressing several key shortcomings of traditional methods like Elo.
Understanding Pi Ratings
Pi Ratings are built on three key principles:
- Score Margins Matter: A team winning 5-0 should receive a greater rating boost than a team winning 1-0, as the larger margin suggests a stronger performance. Conversely, a narrow loss may not indicate a major decline in ability, whereas a heavy defeat should trigger a more significant rating drop.
- Home and Away Ratings Are Separate: Instead of using a fixed home advantage adjustment, Pi Ratings maintain distinct ratings for home and away performances. This allows the system to recognize that some teams perform significantly better at home than away, while others are more consistent across venues.
- Recent Performance Is More Important: A team’s current form is more relevant than performances from months ago. Pi Ratings incorporate a learning rate that ensures recent matches influence a team’s rating more strongly than older results; you can think of this as somewhat similar to the Dixon and Coles decay rate discussed in my previous article on predicting football match outcomes.
Interpreting Pi Ratings
Each team begins with a rating of 0, which represents the level of an average team in the data. The system is zero-centered, meaning that when one team gains rating points, the other team loses the same amount. This ensures that all ratings are relative — a team with a rating of +1.0 is one goal better, on average, than the typical opponent. This property also makes it possible to compare teams across leagues or seasons.
How Pi Ratings Are Calculated
Pi Ratings update dynamically after each match by comparing what was expected to happen with what actually happened — specifically, in terms of the goal difference. Here's how the process works:
- All teams start with a neutral rating of 0, representing the average team in the dataset. Ratings rise or fall based on performance over time
- Before a match, an expected goal difference is calculated using the home team’s Pi Rating and the away team’s Pi Rating
- After the match, the actual goal difference is compared to this expectation. If a team overperforms or underperforms significantly, the rating adjustment is larger.
- The greater the surprise, the bigger the adjustment. For example, if a team was expected to lose 2–0 but instead wins 3–0, that’s a major signal of strength and leads to a sharp increase in rating.
- Home and away performances are rated separately. A strong home result boosts the home rating, and vice versa for away games. However, each update also slightly nudges the other (home affects away, and away affects home), allowing the model to learn cross-context performance with a catch-up learning rate.
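The steps above can be sketched in simplified form. This is an illustrative simplification rather than penaltyblog's actual implementation: the dampening function, the learning rates `LAMBDA` and `GAMMA`, and the expected-goal-difference formula are assumptions based on the general shape of Constantinou & Fenton's system, not its exact equations:

```python
import math

# Assumed example values: LAMBDA is the direct learning rate, GAMMA the
# "catch-up" rate that nudges the team's other-venue rating.
LAMBDA, GAMMA = 0.035, 0.7

def dampened_error(error, c=3):
    """Diminishing weight for large surprises: sign(e) * c * log10(1 + |e|)."""
    return math.copysign(c * math.log10(1 + abs(error)), error)

def pi_update(home, away, goal_diff):
    """home/away are dicts holding separate 'home' and 'away' ratings."""
    expected = home["home"] - away["away"]  # expected goal difference
    error = goal_diff - expected            # how surprising was the result?
    home_adj = LAMBDA * dampened_error(error)
    away_adj = LAMBDA * dampened_error(-error)
    new_home = {"home": home["home"] + home_adj,
                "away": home["away"] + GAMMA * home_adj}  # cross-venue nudge
    new_away = {"away": away["away"] + away_adj,
                "home": away["home"] + GAMMA * away_adj}
    return new_home, new_away
```

The log-dampening means a 5-0 win moves the rating more than a 1-0 win, but not five times as much — huge scorelines aren't five times the evidence of a huge gap in ability.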
Now that we’ve covered the fundamentals of how Pi Ratings work, let’s see them in action. In the rest of the article, we’ll apply the Pi system to recent football seasons and compare its performance to an Elo model, to see whether Pi Ratings provide a more accurate representation of team strength.
Installing the penaltyblog Python Package
If you've not used the penaltyblog Python package before, you can install it using pip.
pip install penaltyblog
Calculating Pi Ratings Using penaltyblog
Let’s start by downloading some historical match data using penaltyblog, which provides an easy way to import football results from football-data.co.uk.
import pandas as pd
import penaltyblog as pb
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
df = pd.concat([
pb.scrapers.FootballData("ENG Premier League", "2018-2019").get_fixtures(),
pb.scrapers.FootballData("ENG Premier League", "2019-2020").get_fixtures(),
pb.scrapers.FootballData("ENG Premier League", "2020-2021").get_fixtures(),
pb.scrapers.FootballData("ENG Premier League", "2021-2022").get_fixtures(),
pb.scrapers.FootballData("ENG Premier League", "2022-2023").get_fixtures(),
pb.scrapers.FootballData("ENG Premier League", "2023-2024").get_fixtures(),
pb.scrapers.FootballData("ENG Premier League", "2024-2025").get_fixtures(),
])
df = df[df["date"] < "2025-04-01"]
Next, we'll loop through all the historical results from oldest to latest and update each team's Pi Rating based on the score line.
pi_ratings = pb.ratings.PiRatingSystem()
for idx, row in df.iterrows():
    goal_diff = row["goals_home"] - row["goals_away"]
    pi_ratings.update_ratings(row["team_home"], row["team_away"], goal_diff, row["date"])
Manchester City's Pi Rating
Let's take a look at Manchester City's Pi Rating over time.
ratings = [x for x in pi_ratings.rating_history if x["team"] == "Man City"]
ratings = pd.DataFrame(ratings).sort_values("date")
plt.figure(figsize=(10, 5))
plt.plot(ratings["date"], ratings["home_rating"], label="home", linewidth=3)
plt.plot(ratings["date"], ratings["away_rating"], label="away", linewidth=3)
plt.xlabel("Date", fontsize=14)
plt.ylabel("Pi Rating", fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(fontsize=14)
plt.tight_layout()
plt.show()
Figure 1: Manchester City's Pi Rating over time
We can see that it took about a year of data for Manchester City’s Pi Rating to stabilise after starting at zero, followed by a long stretch of dominance. A noticeable dip occurred during the first half of the 2023/2024 season, before a strong run of form saw them recover and eventually claim the title.
The dramatic downturn in the 2024/2025 season — at least by Manchester City’s usual standards — is clearly reflected in the ratings, with the lowest point arriving at the end of 2024. As someone who’s watched a lot of Manchester City over the years, I’d say the ratings align closely with how the team’s form felt at the time.
Chelsea's Pi Rating
Let's take a look at Chelsea's Pi Rating next and overlay the dates they changed managers.
ratings = [x for x in pi_ratings.rating_history if x["team"] == "Chelsea"]
ratings = pd.DataFrame(ratings).sort_values("date")
plt.figure(figsize=(10, 5))
plt.plot(ratings["date"], ratings["home_rating"], label="home", linewidth=3)
plt.plot(ratings["date"], ratings["away_rating"], label="away", linewidth=3)
plt.xlabel("Date", fontsize=14)
plt.ylabel("Pi Rating", fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(fontsize=14)
date_num = mdates.datestr2num("2024-07-01")
plt.vlines(x=date_num-5, ymin=-0.25, ymax=1.5, linestyles="dotted", colors="black")
plt.text(date_num, plt.ylim()[1]-0.1, "Maresca", ha="left", va="top", rotation=90)
date_num = mdates.datestr2num("2023-07-01")
plt.vlines(x=date_num-5, ymin=-0.25, ymax=1.5, linestyles="dotted", colors="black")
plt.text(date_num, plt.ylim()[1]-0.1, "Pochettino", ha="left", va="top", rotation=90)
date_num = mdates.datestr2num("2023-04-06")
plt.vlines(x=date_num-5, ymin=-0.25, ymax=1.5, linestyles="dotted", colors="black")
plt.text(date_num, plt.ylim()[1]-0.1, "Lampard", ha="left", va="top", rotation=90)
date_num = mdates.datestr2num("2022-09-08")
plt.vlines(x=date_num-10, ymin=-0.25, ymax=1.5, linestyles="dotted", colors="black")
plt.text(date_num, plt.ylim()[1]-0.1, "Potter", ha="left", va="top", rotation=90)
date_num = mdates.datestr2num("2021-01-26")
plt.vlines(x=date_num-15, ymin=-0.25, ymax=1.5, linestyles="dotted", colors="black")
plt.text(date_num, plt.ylim()[1]-0.1, "Tuchel", ha="left", va="top", rotation=90)
date_num = mdates.datestr2num("2019-07-04")
plt.vlines(x=date_num-15, ymin=-0.25, ymax=1.5, linestyles="dotted", colors="black")
plt.text(date_num, plt.ylim()[1]-0.1, "Lampard", ha="left", va="top", rotation=90)
plt.tight_layout()
plt.show()
Figure 2: Chelsea's Pi Rating over time
Figure 2 tells a familiar story for Chelsea fans: manager changes often followed periods of declining performance, as reflected in sharp dips in the Pi Ratings. The chart clearly shows how Graham Potter and Frank Lampard oversaw particularly difficult spells, with both home and away ratings dropping significantly. Mauricio Pochettino, by contrast, made a noticeable impact with the Pi Ratings climbing steadily during his tenure. Based on how quickly Pochettino turned Chelsea's rating around, it's perhaps a shame that he was only able to stay for one season.
Using Pi Ratings to Predict Matches
Beyond rating teams, Pi Ratings can also be used to estimate the probability of a home win, draw, or away win between two sides. In my previous article, we explored how well more detailed goals-based models performed on this dataset. Now, let’s put Pi Ratings to the test on the same matches and see how their predictive power compares.
Let's start off by downloading the same data we used in the previous article.
import numpy as np
import pandas as pd
import penaltyblog as pb
from tqdm import tqdm
df = pd.concat([
pb.scrapers.FootballData("NLD Eredivisie", "2015-2016").get_fixtures(),
pb.scrapers.FootballData("NLD Eredivisie", "2016-2017").get_fixtures(),
pb.scrapers.FootballData("NLD Eredivisie", "2017-2018").get_fixtures(),
pb.scrapers.FootballData("NLD Eredivisie", "2018-2019").get_fixtures(),
pb.scrapers.FootballData("NLD Eredivisie", "2019-2020").get_fixtures(),
pb.scrapers.FootballData("NLD Eredivisie", "2020-2021").get_fixtures(),
pb.scrapers.FootballData("NLD Eredivisie", "2021-2022").get_fixtures(),
pb.scrapers.FootballData("NLD Eredivisie", "2022-2023").get_fixtures(),
pb.scrapers.FootballData("NLD Eredivisie", "2023-2024").get_fixtures(),
pb.scrapers.FootballData("NLD Eredivisie", "2024-2025").get_fixtures(),
])
df = df.sort_values('date').set_index('date', drop=False)
ftr_map = {"H": 0, "D": 1, "A": 2}
df['ftr_numeric'] = df['ftr'].map(ftr_map)
Next, we loop through the test data, using the Pi Ratings to predict each fixture's outcome. After making a prediction, we update the model with the actual result. To evaluate how well the model performs, we use the Ranked Probability Score (RPS), which measures the accuracy of probabilistic predictions.
I explained more about the code and the RPS metric in my previous article, so I won’t repeat everything here, but if you're new to these concepts, I definitely recommend giving it a read first.
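As a quick refresher, here's a minimal, self-contained version of the metric (an illustrative implementation rather than penaltyblog's internal one):

```python
import numpy as np

def ranked_probability_score(probs, outcome):
    """RPS for one match: probs = [p_home, p_draw, p_away]; outcome is 0, 1 or 2."""
    observed = np.zeros(len(probs))
    observed[outcome] = 1.0
    # Compare cumulative predicted vs cumulative observed probabilities
    cum_diff = np.cumsum(probs) - np.cumsum(observed)
    return float(np.sum(cum_diff[:-1] ** 2) / (len(probs) - 1))
```

Because it works on cumulative probabilities, RPS rewards predictions that are close in an ordinal sense: predicting a home win when the match is drawn is penalised less than predicting a home win when the away side wins.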
start_date = df.query("season == '2023-2024'")["date"].min()
run_dates = df["date"][df["date"] >= start_date].unique()
pi_ratings = pb.ratings.PiRatingSystem()
predictions = []
observed = []
for date in tqdm(run_dates, desc="Processing dates"):
    test = df[df["date"] == date]
    homes = test["team_home"].values
    aways = test["team_away"].values
    fthg = test["fthg"].values
    ftag = test["ftag"].values
    outcomes = test["ftr_numeric"].values

    for i in range(len(test)):
        try:
            # Predict before seeing the result, then update with the actual outcome
            probabilities = pi_ratings.calculate_match_probabilities(homes[i], aways[i])
            predictions.append([probabilities["home_win"], probabilities["draw"], probabilities["away_win"]])
            observed.append(outcomes[i])

            goal_diff = fthg[i] - ftag[i]
            pi_ratings.update_ratings(homes[i], aways[i], goal_diff)
        except Exception:
            continue
rps = pb.metrics.rps_average(predictions, observed)
print(f"RPS: {rps}")
RPS: 0.19905621086882397
The average RPS for Pi Ratings on this dataset was 0.199, which gives us a solid benchmark to compare against other models. While this number doesn’t mean much in isolation, lower values indicate more accurate probabilistic predictions — so the real test is how it stacks up next to Elo and the goals-based models from the previous article.
Comparing Elo and Pi Ratings
As mentioned earlier, standard Elo Ratings only handle wins and losses. This makes them less suitable for football, where draws are common and carry meaningful information — especially when it comes to making probabilistic predictions.
To allow for a fair comparison with Pi Ratings, I extended the standard Elo system in penaltyblog to handle draws in both rating updates and match predictions. In the rating updates, draws are treated as being worth half a win (0.5), a common adaptation in football Elo models.
To handle draws — which are common in football but not natively supported by standard Elo — I added a Gaussian-shaped draw probability model. This approach assumes draws are most likely when teams are closely matched and become less likely as the rating gap widens.
The draw probability is computed as:
$$p_{\text{draw}} = b \cdot \exp\left(-\frac{d^{2}}{2w^{2}}\right)$$
Where:
- $b$ (draw_base) is the peak draw probability when the two teams' ratings are equal (default 0.3)
- $d$ (elo_diff) is the difference in Elo ratings between the two teams
- $w$ (draw_width) is the width of the Gaussian, controlling how quickly the draw probability decreases as the rating gap widens
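The formula itself is a one-liner. The sketch below is an illustration of the shape of the model rather than penaltyblog's exact code, and the draw_width of 200 is an example value chosen for demonstration:

```python
import math

def draw_probability(elo_diff, draw_base=0.3, draw_width=200):
    """Gaussian-shaped draw probability: highest when ratings are equal."""
    # draw_width=200 is an illustrative choice, not a library default
    return draw_base * math.exp(-(elo_diff ** 2) / (2 * draw_width ** 2))
```

Once the draw probability is fixed, the remaining probability mass `1 - p_draw` can be split between home and away wins in proportion to the standard Elo expected scores.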
start_date = df.query("season == '2023-2024'")["date"].min()
run_dates = df["date"][df["date"] >= start_date].unique()
elo_ratings = pb.ratings.Elo()
predictions = []
observed = []
for date in tqdm(run_dates, desc="Processing dates"):
    test = df[df["date"] == date]
    homes = test["team_home"].values
    aways = test["team_away"].values
    outcomes = test["ftr_numeric"].values

    for i in range(len(test)):
        try:
            # Predict before seeing the result, then update with the actual outcome
            probabilities = elo_ratings.predict_match_outcome_probs(homes[i], aways[i])
            predictions.append([probabilities["home_win"], probabilities["draw"], probabilities["away_win"]])
            observed.append(outcomes[i])

            elo_ratings.update_ratings(homes[i], aways[i], outcomes[i])
        except Exception:
            continue
rps = pb.metrics.rps_average(predictions, observed)
print(f"RPS: {rps}")
RPS: 0.2041672568256303
The results show that Pi Ratings outperform Elo when it comes to predicting football matches. While both systems are simple and fast, Pi Ratings offer several advantages that make them better suited to the dynamics of football. They were designed specifically for the sport, natively handle draws, and update team ratings based on score margins — capturing not just whether a team won or lost, but how convincingly.
Pi Ratings also separate home and away performances and adapt more quickly to changes in form. These enhancements result in more accurate probability estimates, as reflected in the lower Ranked Probability Score. For anyone working with football data — especially in situations where simplicity and speed matter — Pi Ratings offer a powerful upgrade over traditional Elo.
Comparing Ratings Models to Full Statistical Models
While Pi Ratings offer a clear improvement over traditional Elo in the context of football, it's worth asking how they stack up against more advanced, purpose-built statistical models. These models — like Dixon and Coles, Bivariate Poisson, and Zero-Inflated Poisson — are designed specifically for predicting football scores, often incorporating detailed assumptions about goal distributions, time decay, and team-specific strengths.
Let’s compare the predictive accuracy of these models against both Pi Ratings and Elo, using the same dataset and evaluation metric (Ranked Probability Score).
| Model | Ranked Probability Score (RPS) |
|---|---|
| Dixon and Coles | 0.19137780685608083 |
| Weibull Count | 0.19141358825225932 |
| Poisson | 0.19154229559464445 |
| Zero-inflated Poisson | 0.19154043298013113 |
| Negative Binomial | 0.19155750459845977 |
| Bivariate Poisson | 0.19161764011301444 |
| Pi Ratings | 0.19905621086882397 |
| Elo Ratings | 0.2041672568256303 |
Table 1: Ranked Probability Scores for the different model types
While the statistical models clearly lead in predictive accuracy, this comes with trade-offs. They often require more data, greater model complexity, and longer computation times — along with domain-specific assumptions that aren’t always easy to explain or adapt.
Pi Ratings, by contrast, strike an effective middle ground. They’re more accurate than Elo, require only basic match results, and are fast and interpretable — making them ideal for many real-world applications where simplicity, speed, and solid predictive power matter more than squeezing out marginal gains.
Conclusions
If you’re rating chess players, Elo is exactly what you need — it’s simple, elegant, and perfectly suited to win/loss outcomes. But football is a different game: draws are common, scorelines matter, and form fluctuates quickly. That’s where Pi Ratings shine.
They offer a smarter, football-specific alternative to Elo — capturing performance nuances, handling home and away differences, and adapting more quickly to change. While not as accurate as heavyweight statistical models, Pi Ratings strike a valuable middle ground: fast, interpretable, and amazingly powerful for something so simple.
If you're building a football model, running simulations, or just want a better way to track your team’s strength — start with Pi.