Better Metrics for Football Forecasts: Moving Beyond the Ranked Probability Score

Published on May 01, 2025 by Martin Eastwood

Introduction

What if the metric you've been using to evaluate football forecasts is quietly rewarding worse predictions?

Probabilistic forecasting is a crucial tool in sports analytics - whether it's to inform betting strategies, guide player and tactical decisions, or simply enhance the way we follow the sport. As these models have grown in popularity, however, one important question has often been overlooked: how do we evaluate the quality of our forecasts?

Despite the increasing use of football (soccer) prediction models, the discussion around how best to score and compare them has been relatively limited. In many cases, evaluation methods are taken for granted without much scrutiny. This is a significant gap because without the right tools for measuring forecast quality, it's easy to draw misleading conclusions about model performance.

On my blog, I've often used the Ranked Probability Score (RPS) to assess the accuracy of predictive models. RPS has an intuitive appeal as it rewards forecasts not just for getting the result right, but also for placing higher probability on outcomes close to the actual result. A forecast that leaned towards a draw when the match ended in a narrow home win is treated more favourably than one that backed an away win.

For a long time, that logic seemed sound. However, I've recently been questioning whether RPS is really the best tool for evaluating football forecasts - especially when our ultimate goal is to identify the most informative models as efficiently and fairly as possible.

In this article, I'll explain why RPS might not be the optimal choice, introduce alternative scoring metrics such as Log Loss (also known as the Ignorance Score) and the multiclass Brier score, and share some experiments I've run to explore which metrics are best suited to evaluating football prediction models.

Background: How We Score Probabilistic Football Forecasts

When evaluating a model that outputs probabilities, such as predicting a 45% chance of a home win, 30% draw, and 25% away win, we need a way to judge how good those probabilities actually were once the match result is known. That's where scoring rules come in.

A scoring rule is simply a mathematical method that assigns a numerical score to a forecast based on the eventual outcome. A good scoring rule rewards forecasts that were well-calibrated and honest, and penalizes forecasts that placed high confidence on the wrong results.

The Ranked Probability Score (RPS)

The Ranked Probability Score (RPS) has become a popular choice for evaluating football forecasts. Its appeal lies in measuring how close a forecast was to the actual outcome while taking the ordering of results into account.

In football, match outcomes - home win, draw, away win - can be thought of as ordered, with a draw being closer to a home win than an away win is. The RPS rewards forecasts that spread probability sensibly across outcomes near the correct result, not just those that hit the exact outcome.

Technically, RPS calculates the squared differences between the cumulative forecast probabilities and the cumulative actual outcomes across all possible results. It sums these differences to produce a final score, with lower scores indicating better forecasts.
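
To make the calculation concrete, here is a minimal sketch of RPS for a single three-way forecast. It is a hand-rolled NumPy version with a hypothetical rps function name, not the penaltyblog implementation used later in this article, but it follows the cumulative-difference definition described above.

import numpy as np


def rps(probs, outcome):
    """Ranked Probability Score for a single forecast.

    probs: forecast probabilities in outcome order (home, draw, away).
    outcome: index of the observed result (0=home, 1=draw, 2=away).
    """
    observed = np.zeros(len(probs))
    observed[outcome] = 1.0

    # Squared differences between the cumulative forecast and the
    # cumulative observed distribution, averaged over the K-1 steps
    cum_diff = np.cumsum(probs) - np.cumsum(observed)
    return np.sum(cum_diff**2) / (len(probs) - 1)


# A home win: a draw-leaning forecast scores better (lower) than an away-leaning one
print(rps(np.array([0.35, 0.40, 0.25]), 0))  # ~0.242
print(rps(np.array([0.35, 0.25, 0.40]), 0))  # ~0.291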

In summary:

  • Correct predictions get rewarded.
  • Predictions that almost got it right (e.g., favouring a draw when it was a home win) are penalized less harshly than predictions that were way off (e.g., favouring an away win).

This sensitivity to distance has often been cited as a key advantage of RPS but, as we will explore, it also introduces potential problems.

Alternative Scoring Rules: Brier Score and Log Loss

While RPS rewards proximity to the correct outcome, there are other scoring rules that take a different approach:

The Brier Score (Multiclass Version)

The Brier score measures the squared differences between the predicted probabilities and the actual outcome across all possible results, but without considering the ordering between outcomes. A wrong prediction for an away win is penalized the same as a wrong prediction for a draw if the match ends in a home win.

In formula terms, it’s simply the mean squared error between the forecast probabilities and the actual result (coded as 1 for the outcome that occurred, 0 for those that did not).

Like RPS, lower Brier scores indicate better forecasts. However, the Brier score is insensitive to distance, treating all incorrect outcomes as equally wrong.
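
As a quick illustration of that insensitivity, here is a minimal hand-rolled sketch (a hypothetical multiclass_brier function, not the penaltyblog implementation): swapping probability between the draw and the away win leaves the score unchanged when the home team wins.

import numpy as np


def multiclass_brier(probs, outcome):
    """Multiclass Brier score for a single forecast."""
    observed = np.zeros(len(probs))
    observed[outcome] = 1.0

    # Mean squared error between the forecast and the one-hot outcome,
    # with no notion of ordering between outcomes
    return np.mean((probs - observed) ** 2)


# A home win: both forecasts give the home win 35%, so they score identically
print(multiclass_brier(np.array([0.35, 0.40, 0.25]), 0))  # ~0.215
print(multiclass_brier(np.array([0.35, 0.25, 0.40]), 0))  # ~0.215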

Log Loss (Ignorance Score)

The log loss, also known as the ignorance score, takes an even sharper approach. It focuses only on the probability assigned to the outcome that actually occurred, ignoring how probability was distributed across other outcomes.

Log loss measures how surprised the forecast was by the actual result using concepts from information theory.

  • If you assigned a high probability to the correct result, you get a low (good) score.
  • If you assigned a low probability to the correct result, you're heavily penalized.

Smaller log loss values (closer to zero) indicate better forecasts. Importantly, log loss is a local scoring rule, meaning it doesn't reward or penalize probability placed on outcomes that didn't happen.
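
A minimal sketch of log loss for a single match makes this locality obvious (again a hand-rolled, hypothetical ignorance function rather than the penaltyblog implementation; base-2 logarithms are used so the score reads as bits of surprise). Only the probability placed on the observed result matters:

import numpy as np


def ignorance(probs, outcome):
    """Log loss (ignorance score) for a single forecast, in bits."""
    # Only the probability assigned to the observed outcome is used
    return -np.log2(probs[outcome])


# A home win: reshuffling the draw and away probabilities changes nothing
print(ignorance(np.array([0.70, 0.10, 0.20]), 0))  # ~0.515
print(ignorance(np.array([0.70, 0.20, 0.10]), 0))  # ~0.515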

Key Properties of a Good Scoring Rule

When choosing how to evaluate football forecasts, a few key properties matter:

  • Propriety: A scoring rule is proper if it encourages honest forecasting - assigning probabilities that truly reflect beliefs.
  • Locality: A scoring rule is local if it only depends on the probability of the outcome that occurred (log loss is local; RPS and Brier are not).
  • Sensitivity to Distance: Some rules (like RPS) reward probability placed on outcomes "near" the correct one; others (like Brier and log loss) do not.

These properties have deep implications for how fairly and effectively different metrics evaluate forecasts, and, as we'll see, they raise important questions about whether RPS is really the best choice.

What’s Wrong with the Ranked Probability Score?

At first glance, the RPS seems like an ideal tool for evaluating football forecasts. It rewards accurate predictions and gives partial credit to forecasts that are almost right, reflecting the natural ordering of outcomes like home win, draw, and away win.

However, when we dig deeper into the properties of RPS, issues emerge, especially when we think carefully about what we actually want a scoring rule to do.

Sensitivity to Distance Isn't Always Useful

RPS assumes that outcomes have a natural order, and that being close to the right result is better than being completely wrong. But in practice, this sensitivity to distance may not make sense for evaluating probabilistic forecasts.

Consider a match where a team wins 3-0 at home. From a probability standpoint, a forecast that heavily backed a draw is no closer to reality than one that backed an away win. The match was decisively a home win, and both alternative forecasts were wrong. Yet RPS would treat the forecast favouring a draw more kindly even though both forecasts misrepresented what actually happened.

In probabilistic forecasting, what matters most is the probability placed on the outcome that actually occurred, not how close other outcomes seemed. Sensitivity to distance risks rewarding forecasts that were still wrong, simply because they were less wrong in some arbitrary way.

Non-Locality Dilutes the Signal

A second issue is that RPS is non-local: it takes into account the entire probability distribution, not just the probability assigned to the true outcome.

This means a forecast can achieve a better RPS score by adjusting probabilities on outcomes that did not happen. In theory, a forecast could place relatively little probability on the actual outcome but still score well by distributing probabilities nicely across nearby results.

This dilutes the signal we care about most: how much belief did the forecast put on the correct outcome? A good evaluation metric should focus sharply on that and not get distracted by how probabilities were assigned to events that never occurred.

Inefficiency in Identifying Better Forecasts

From a practical standpoint, another downside of RPS is that it can require more data - as in more matches and outcomes - to reliably identify which forecasting model is actually better.

In simulation experiments (including those I’ll describe later), scoring rules like log loss tend to distinguish better models more quickly and more reliably than RPS. This matters because in the real world, we often have limited sample sizes, especially when evaluating new models, seasonal predictions, or niche competitions.

A scoring rule that uses the available information more efficiently gives us a better chance of recognizing model quality earlier.

RPS Summary

While the Ranked Probability Score has attractive theoretical features, it also carries significant practical and conceptual drawbacks.

  • It can reward forecasts that are still substantially wrong.
  • It spreads focus across outcomes that didn’t happen.
  • It may be slower and less reliable at identifying genuinely better models.

Given these issues, it's worth asking whether we should move beyond RPS when evaluating football forecasts and whether alternatives like log loss (Ignorance Score) and the multiclass Brier score might offer a better foundation.

In the next sections, I'll show some ideas that explore exactly that question.

Experiment One: RPS Can Be Slower to Identify Better Forecasts

To compare how well different scoring rules distinguish better forecasts in football, I ran a simple simulation.

For each match:

  • Forecast A represented a "better" forecast - more accurate and better calibrated.
  • Forecast B represented a "worse" forecast - slightly less accurate and less confident.

Specifically:

  • Forecast A assigned probabilities of 60% home win, 25% draw, and 15% away win.
  • Forecast B assigned 50% home win, 30% draw, and 20% away win - a slightly more spread-out, less confident forecast.

In each simulation:

  • I simulated match outcomes by randomly drawing results according to the better forecast (Forecast A).
  • For each match, I scored both forecasts using three different metrics:
    • Log Loss (Ignorance Score)
    • Multiclass Brier Score
    • Ranked Probability Score (RPS)
  • I then compared the total scores for Forecast A and Forecast B.
    • If a metric gave a lower (better) score to Forecast A, it was counted as a "correct" selection for that metric.

I repeated this process 1,000,000 times for each sample size tested - ranging from just 10 matches up to 1000 matches - and calculated the proportion of simulations where each scoring rule correctly identified Forecast A as better.

This setup mimics a realistic football forecasting evaluation:

  • Limited numbers of matches.
  • Models that are not drastically different - a common scenario in practice.
  • Need for scoring rules to efficiently and reliably identify better forecasts even when differences are subtle.

The code for this is shown below. However, please note that you may get different results due to the random sampling. The general pattern should be similar though.

import numpy as np
import penaltyblog as pb


def simulate(sample_size, n_repeats):
    forecast_a = np.array([0.6, 0.25, 0.15])
    forecast_b = np.array([0.5, 0.3, 0.2])

    outcomes = np.random.choice(3, size=(n_repeats, sample_size), p=forecast_a)

    results = {"brier": 0, "logloss": 0, "rps": 0}

    for i in range(n_repeats):
        o = outcomes[i]

        # Broadcast forecasts
        probs_a = np.repeat(forecast_a[np.newaxis, :], sample_size, axis=0)
        probs_b = np.repeat(forecast_b[np.newaxis, :], sample_size, axis=0)

        score_a_brier = pb.metrics.multiclass_brier_score(probs_a, o)
        score_b_brier = pb.metrics.multiclass_brier_score(probs_b, o)

        score_a_logloss = pb.metrics.ignorance_score(probs_a, o)
        score_b_logloss = pb.metrics.ignorance_score(probs_b, o)

        score_a_rps = pb.metrics.rps_average(probs_a, o)
        score_b_rps = pb.metrics.rps_average(probs_b, o)

        results["brier"] += int(score_a_brier < score_b_brier)
        results["logloss"] += int(score_a_logloss < score_b_logloss)
        results["rps"] += int(score_a_rps < score_b_rps)

    for key in results:
        results[key] /= n_repeats

    return results


def run_experiment(sample_sizes, n_repeats=1000000):
    results = {metric: [] for metric in ["brier", "logloss", "rps"]}

    for n in sample_sizes:
        res = simulate(n, n_repeats)
        for metric in res:
            results[metric].append(res[metric])

    return results

results = run_experiment([10, 25, 50, 100, 200, 500, 1000], 1000000)
print(results)

Results

Sample Size | Log Loss | Brier Score | RPS
10          | 62.9%    | 63.0%       | 61.3%
25          | 70.4%    | 72.9%       | 67.7%
50          | 76.9%    | 77.1%       | 76.5%
100         | 84.9%    | 87.0%       | 84.1%
200         | 93.3%    | 92.6%       | 92.1%
500         | 99.0%    | 98.8%       | 99.0%
1000        | 100.0%   | 100.0%      | 100.0%

Table 1: Proportion of simulations where each scoring rule correctly identified the better forecast

Interpreting the Results

With one million repetitions per sample size, the results paint a stable and revealing picture. At smaller sample sizes (10–100 matches), RPS consistently underperforms relative to both Log Loss and Brier Score. For example, at just 25 matches, RPS correctly identifies the better forecast only 67.7% of the time, compared to 70.4% for Log Loss and 72.9% for Brier Score.

At the smaller sample sizes (up to 100 matches), Brier Score slightly edges out Log Loss in raw accuracy, while Log Loss pulls ahead at 200 and 500 matches. Either way, Log Loss remains at least as effective overall, and its theoretical strengths as a strictly proper, local scoring rule make it a more principled choice for model evaluation, especially when distinguishing subtle differences between forecasts.

These results reinforce the idea that RPS is the least efficient of the three: it requires more data to reach the same level of confidence in model comparison, making it a less suitable option for practical use in football analytics.

Experiment Two: RPS Can Favour a Forecast That Believes Less in the Truth

One of the key problems with the Ranked Probability Score is that it's non-local, meaning it evaluates forecasts based on the entire cumulative distribution, not just the outcome that actually occurred. This can lead to unintuitive results.

Results

Forecast | Home | Draw | Away | P(Correct) | RPS   | Log Loss
A        | 0.70 | 0.10 | 0.20 | 0.70       | 0.065 | 0.515
B        | 0.65 | 0.30 | 0.05 | 0.65       | 0.062 | 0.621

Table 2: RPS Can Favour a Forecast That Believes Less in the Truth
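
If you want to check these numbers yourself, the table can be reproduced with the same penaltyblog metric functions used in the experiment code above (assuming, as there, that they accept a list of forecast rows and a list of observed outcome indices, with 0 denoting a home win):

import numpy as np
import penaltyblog as pb

forecast_a = np.array([[0.70, 0.10, 0.20]])
forecast_b = np.array([[0.65, 0.30, 0.05]])
outcome = [0]  # the match ended in a home win

# RPS prefers Forecast B's smoother distribution...
print(pb.metrics.rps_average(forecast_a, outcome))      # ~0.065
print(pb.metrics.rps_average(forecast_b, outcome))      # ~0.062

# ...while log loss rewards Forecast A's greater belief in the actual result
print(pb.metrics.ignorance_score(forecast_a, outcome))  # ~0.515
print(pb.metrics.ignorance_score(forecast_b, outcome))  # ~0.621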

Interpreting the Results

Although Forecast A assigns a higher probability to the correct outcome (70% vs 65%), the Ranked Probability Score (RPS) gives a better (lower) score to Forecast B. This happens because RPS evaluates the entire distribution of probabilities, and Forecast B’s more even spread is treated as closer to the true outcome in cumulative terms.

In contrast, Log Loss (a local scoring rule) correctly favours Forecast A, since it focuses solely on the probability assigned to the observed result. This example highlights a key flaw of RPS: it can reward worse forecasts simply because they are smoother, even when they express less belief in what actually happened.

Experiment Three: A Real-World Test Using Bookmaker Odds

The previous experiments in this article used synthetic forecasts - carefully constructed examples designed to highlight how different scoring rules behave under controlled conditions. While these are useful for understanding the theoretical properties of metrics like RPS, Log Loss, and Brier Score, real-world forecasting is rarely so tidy.

To evaluate how these scoring rules perform in more practical settings, I next used historical bookmaker odds - widely regarded as one of the most accurate publicly available sources of football match probabilities - and intentionally distorted them to simulate worse forecasts.

The goal was to see which scoring rules could reliably detect the better forecast under more realistic, subtly challenging conditions.

Here's how the experiment was structured:

  • Bookmaker odds from Bet365 (sourced via football-data.co.uk) were converted into probabilities with the overround removed (the code below uses Shin's method via penaltyblog's implied module).
  • These bookmaker-derived probabilities were treated as the "true" forecast (Forecast A).
  • A slightly degraded forecast (Forecast B) was created by distorting the original probabilities using temperature scaling, a method that flattens or sharpens the distribution to simulate reduced model quality.
  • For each match, an outcome was simulated from Forecast A, and both forecasts were scored using:
    • Log Loss (Ignorance Score)
    • Brier Score
    • Ranked Probability Score (RPS)
  • The process was repeated one million times, and I tracked how often each scoring rule correctly favoured the higher-quality (bookmaker) forecast.

If you want to have a go at repeating this, code is shown below. However, please note that you may get slightly different results due to the random sampling. The general pattern should be similar though.

import numpy as np
import penaltyblog as pb
import pandas as pd


def distort_probs(probs, temperature=1.25):
    """Distort probability distribution using temperature scaling."""
    scaled = np.power(probs, 1 / temperature)
    return scaled / scaled.sum()

def run_betting_experiment(odds_list, n_repeats=100000, distort_fn=distort_probs):
    """Compare scoring rules on bookmaker vs distorted forecasts."""
    correct = {"logloss": 0, "brier": 0, "rps": 0}

    for _ in range(n_repeats):
        # Sample match odds from the bookmaker data
        odds = odds_list[np.random.randint(len(odds_list))]
        true_probs = pb.implied.shin(odds)["implied_probabilities"]

        # Distort to create worse forecast
        distorted_probs = distort_fn(true_probs)

        # Simulate an outcome based on true_probs
        outcome = np.random.choice(3, p=true_probs)

        # Score both forecasts
        log_a = pb.metrics.ignorance_score([true_probs], [outcome])
        log_b = pb.metrics.ignorance_score([distorted_probs], [outcome])

        brier_a = pb.metrics.multiclass_brier_score([true_probs], [outcome])
        brier_b = pb.metrics.multiclass_brier_score([distorted_probs], [outcome])

        rps_a = pb.metrics.rps_average([true_probs], [outcome])
        rps_b = pb.metrics.rps_average([distorted_probs], [outcome])

        # Compare which forecast scored better
        correct["logloss"] += int(log_a < log_b)
        correct["brier"] += int(brier_a < brier_b)
        correct["rps"] += int(rps_a < rps_b)

    # Normalize to proportion
    for k in correct:
        correct[k] /= n_repeats

    return correct

df = pd.concat([
    pb.scrapers.FootballData("ENG Premier League", "2020-2021").get_fixtures(), 
    pb.scrapers.FootballData("ENG Premier League", "2021-2022").get_fixtures(), 
    pb.scrapers.FootballData("ENG Premier League", "2022-2023").get_fixtures(),    
    pb.scrapers.FootballData("ENG Premier League", "2023-2024").get_fixtures(),
    pb.scrapers.FootballData("ENG Premier League", "2024-2025").get_fixtures(),
])

home = df["b365_h"]
draw = df["b365_d"]
away = df["b365_a"]

odds_list = np.array(list(zip(home, draw, away)))

results = run_betting_experiment(odds_list, n_repeats=1000000)
print(results)

Results

After running one million simulated matches, the proportion of cases in which each scoring rule correctly identified the better forecast was:

Scoring Rule | % Correct
Log Loss     | 58.6%
Brier Score  | 57.8%
RPS          | 56.1%

Table 3: Proportion of cases in which each scoring rule correctly identified the better forecast.

Interpreting the Results

Even in this more realistic setting, where the differences between forecasts were subtle and the "ground truth" forecasts came from actual bookmaker odds, Log Loss continued to outperform the other metrics. Although all three scoring rules performed similarly, Log Loss was the most consistent at identifying the better forecast.

The gap may seem small, but in practice, these marginal improvements in sensitivity and reliability matter, especially when comparing models in leagues with fewer matches or when evaluating forecasts from similar models. RPS, once again, proved less effective at distinguishing higher-quality forecasts, consistent with the findings from the earlier synthetic experiments.

Conclusion

Across both synthetic and real-world tests, Log Loss and the Brier Score consistently proved to be more sensitive and reliable metrics for evaluating football forecasts than the Ranked Probability Score (RPS). While all three scoring rules eventually converge with enough data, RPS lags behind at smaller sample sizes and can even reward worse forecasts under certain conditions as a direct consequence of its non-locality. And while the Brier Score sometimes outperforms Log Loss in raw accuracy, particularly at moderate sample sizes, Log Loss's sharper theoretical grounding and consistency make it my preferred choice.

In real football analytics and betting, we rarely have huge sample sizes to evaluate models with. We often deal with limited data, closely matched forecasts, and the need for efficient, trustworthy evaluation tools. Based on this evidence, Log Loss stands out as the most appropriate scoring rule for comparing football predictive models, offering sharper discrimination, stronger theoretical foundations, and more reliable guidance when it matters most.