Introduction
If you've ever converted betting odds to probabilities, you've likely noticed they sum to a value greater than 100%. This surplus is the bookmaker's margin, often called the "overround" or "vig", and it's how they guarantee a profit. For any serious quantitative analysis, removing this margin to find the "true" underlying probabilities is a critical first step.
The challenge, however, is that there is no single, universally agreed-upon way to remove this margin. The process depends on our assumptions about how the bookmaker has applied it. Is it distributed evenly across all outcomes, or is it weighted towards the longshots?
To help analysts tackle this problem with greater flexibility, I've updated the implied odds functions in the `penaltyblog` package. This article explores the new features and takes a deep dive into the different theoretical models you can now use.
What's New in penaltyblog?
The core of this update is a move to a more modern and robust architecture. I've replaced standard dictionaries with type-safe dataclasses, namely `OddsInput` and `ImpliedProbabilities`.
This change provides several key benefits for users:
- **Clarity and Safety:** The new structure makes the API self-documenting. You know exactly what data to provide and what the function will return, allowing tools like `mypy` to catch potential errors before you run your code.
- **Ease of Use:** With type-safe objects, IDEs can provide better auto-completion, making the library more intuitive to work with.
All functionality is now consolidated into a single, powerful function: `penaltyblog.implied.calculate_implied()`. This serves as a unified entry point for all the supported methods.
A Deep Dive into Implied Probability Methods
The fundamental question when removing the overround is: how is the margin distributed? To illustrate the different approaches, we'll use a typical 1X2 football market with the following decimal odds: Home: 2.7, Draw: 2.3, Away: 4.4.
The unadjusted probabilities, calculated as $1/\text{odds}$, are:
- $1/2.7 = 0.370$
- $1/2.3 = 0.435$
- $1/4.4 = 0.227$

The sum of these is $1.032$, which gives us a bookmaker's margin of $3.2\%$. Each method below attempts to reduce this sum to exactly $1.0$ based on a different set of assumptions.
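These figures are quick to verify in a couple of lines:

```python
# Example 1X2 odds: Home, Draw, Away
odds = [2.7, 2.3, 4.4]

# Raw (unadjusted) probabilities are the reciprocals of the decimal odds
raw = [1 / o for o in odds]
margin = sum(raw) - 1

print([round(p, 3) for p in raw])  # [0.37, 0.435, 0.227]
print(f"{margin:.1%}")             # 3.2%
```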
The Basic Approaches
These two methods are the most common and serve as excellent baselines.
**Multiplicative** (`"multiplicative"`): This is the simplest and most widely used method. It assumes the margin is distributed proportionally across all outcomes and normalises the raw probabilities by dividing each one by their sum.
$$p_i = \frac{1/o_i}{\sum_{j=1}^{n} 1/o_j}$$
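As a minimal standalone sketch (independent of `penaltyblog`), multiplicative normalisation is just a division by the booksum:

```python
def multiplicative(odds):
    """Divide each raw probability by the overall booksum so they sum to one."""
    raw = [1 / o for o in odds]
    booksum = sum(raw)
    return [p / booksum for p in raw]

print([round(p, 4) for p in multiplicative([2.7, 2.3, 4.4])])
# [0.3587, 0.4211, 0.2201]
```

These values match the library output shown in the worked example later on.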
**Additive** (`"additive"`): This method assumes the margin is an equal, fixed amount subtracted from each outcome's raw probability.
$$p_i = \frac{1}{o_i} - \frac{M}{n}$$
where $M$ is the total margin and $n$ is the number of outcomes.
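In code, that subtraction looks like this (again a sketch, not the library's internals):

```python
def additive(odds):
    """Subtract an equal share of the margin from each raw probability."""
    raw = [1 / o for o in odds]
    margin = sum(raw) - 1
    return [p - margin / len(odds) for p in raw]

print([round(p, 4) for p in additive([2.7, 2.3, 4.4])])
# [0.3596, 0.424, 0.2165]
```

One well-known drawback: for extreme longshots the flat subtraction can push a probability below zero.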
Iterative Approaches
These methods are more complex and often require solving for a parameter that normalises the probability distribution.
**Power** (`"power"`): This method raises the raw probabilities to a power, $k$, which is solved for iteratively. It provides more flexibility than the basic methods in modelling how the margin is applied.

$$p_i = \left(\frac{1}{o_i}\right)^k \quad \text{where } k \text{ is chosen so that } \sum_{j=1}^{n} p_j = 1$$
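Because each raw probability is below one, the powered sum decreases monotonically as $k$ grows, so a simple bisection finds $k$. This sketch is my own illustration of the idea, not necessarily how the library solves it:

```python
def power_method(odds, tol=1e-12):
    """Solve for the exponent k that makes the powered probabilities sum to one."""
    raw = [1 / o for o in odds]
    lo, hi = 0.5, 2.0  # bracket that safely contains k for typical margins
    while hi - lo > tol:
        k = (lo + hi) / 2
        if sum(p ** k for p in raw) > 1:
            lo = k  # sum still too large: a higher power shrinks probabilities below one
        else:
            hi = k
    k = (lo + hi) / 2
    return [p ** k for p in raw]

probs = power_method([2.7, 2.3, 4.4])
print(round(sum(probs), 6))  # 1.0
```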
**Shin** (`"shin"`): Developed by Hyun Song Shin, this method is derived from a model that assumes a mix of informed and uninformed bettors in the market. It solves for a parameter, $z$, which can be interpreted as the proportion of uninformed money. It is a popular method in academic literature for modelling the favourite-longshot bias.
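In the formulation usually quoted in the literature, each true probability is a function of $z$ and the bookmaker's implied probabilities, and $z$ is solved so the recovered probabilities sum to one. The sketch below is my own illustration and may differ in detail from `penaltyblog`'s implementation:

```python
import math

def shin(odds, tol=1e-12):
    """Recover probabilities under Shin's model by solving for z."""
    pi = [1 / o for o in odds]  # bookmaker's implied probabilities
    booksum = sum(pi)

    def recovered(p, z):
        return (math.sqrt(z ** 2 + 4 * (1 - z) * p ** 2 / booksum) - z) / (2 * (1 - z))

    lo, hi = 0.0, 0.3  # z is small for realistic margins
    while hi - lo > tol:
        z = (lo + hi) / 2
        if sum(recovered(p, z) for p in pi) > 1:
            lo = z  # probabilities still sum above one: increase z
        else:
            hi = z
    z = (lo + hi) / 2
    return [recovered(p, z) for p in pi]

probs = shin([2.7, 2.3, 4.4])
print(round(sum(probs), 6))  # 1.0
```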
**Odds Ratio** (`"odds_ratio"`): Proposed by Keith Cheung, this method models the relationship between the bookmaker's probabilities and the true probabilities using an odds ratio formulation, solving for a constant $c$.
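One common way to write this relation is $\pi_i/(1-\pi_i) = c \cdot p_i/(1-p_i)$, where $\pi_i$ is the bookmaker's implied probability and $p_i$ the true one; $c$ is then solved so the $p_i$ sum to one. This is my reading of the method, sketched with a bisection, and the library's exact formulation may differ:

```python
def odds_ratio_method(odds, tol=1e-12):
    """Solve for c in the odds-ratio relation between implied and true probabilities."""
    pi = [1 / o for o in odds]

    def true_prob(p, c):
        # Rearranging pi/(1-pi) = c * p/(1-p) for the true probability p
        return p / (c + p - c * p)

    lo, hi = 1.0, 3.0  # c exceeds 1 whenever the booksum exceeds one
    while hi - lo > tol:
        c = (lo + hi) / 2
        if sum(true_prob(p, c) for p in pi) > 1:
            lo = c
        else:
            hi = c
    c = (lo + hi) / 2
    return [true_prob(p, c) for p in pi]

probs = odds_ratio_method([2.7, 2.3, 4.4])
print(round(sum(probs), 6))  # 1.0
```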
**Differential Margin Weighting** (`"differential_margin_weighting"`): This approach, popularised by Joseph Buchdahl, works on the assumption that the margin is applied proportionally to the odds themselves, which results in a greater proportion of the margin being applied to longshots.

$$\text{fair odds}_i = \frac{n \cdot o_i}{n - M \cdot o_i}$$
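Plugging the example odds into this formula (again my sketch, not the library code):

```python
def differential_margin_weighting(odds):
    """Inflate each price proportionally to its odds, then invert to probabilities."""
    raw = [1 / o for o in odds]
    margin = sum(raw) - 1
    n = len(odds)
    fair_odds = [n * o / (n - margin * o) for o in odds]
    return [1 / fo for fo in fair_odds]

probs = differential_margin_weighting([2.7, 2.3, 4.4])
print([round(p, 4) for p in probs])  # [0.3596, 0.424, 0.2165]
```

Note that inverting the formula gives $1/o_i - M/n$, i.e. in probability space it coincides with the additive adjustment, which is consistent with the identical RPS the two methods record in the results.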
**Logarithmic** (`"logarithmic"`): This method converts the bookmaker's probabilities into "log-odds" space and assumes the margin is applied as an equal subtraction from each outcome. It solves for a single constant, $c$, that is removed from all log-odds values to ensure the final probabilities sum to $1.0$.
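My reading of this description, sketched below, shifts each logit down by a constant $c$ and maps back through the logistic function; the library's exact transform may differ:

```python
import math

def logarithmic(odds, tol=1e-12):
    """Shift each outcome's log-odds down by c until the probabilities sum to one."""
    raw = [1 / o for o in odds]
    logits = [math.log(p / (1 - p)) for p in raw]

    def total(c):
        return sum(1 / (1 + math.exp(-(l - c))) for l in logits)

    lo, hi = 0.0, 1.0  # bracket for the shift c
    while hi - lo > tol:
        c = (lo + hi) / 2
        if total(c) > 1:
            lo = c  # probabilities still too large: shift further down
        else:
            hi = c
    c = (lo + hi) / 2
    return [1 / (1 + math.exp(-(l - c))) for l in logits]

probs = logarithmic([2.7, 2.3, 4.4])
print(round(sum(probs), 6))  # 1.0
```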
Putting It All Together: A Practical Example
The new `calculate_implied` function makes it easy to apply and compare these methods.
```python
import penaltyblog as pb

# Our example odds for a Home, Draw, Away market
odds = [2.7, 2.3, 4.4]
market_names = ["Home", "Draw", "Away"]

# Calculate using the default Multiplicative method
result_mult = pb.implied.calculate_implied(
    odds,
    market_names=market_names
)
print(f"Multiplicative: {result_mult.probabilities}")
# >> Multiplicative: {'Home': 0.3587, 'Draw': 0.4211, 'Away': 0.2201}

# Now try Shin's method
result_shin = pb.implied.calculate_implied(
    odds,
    method="shin",
    market_names=market_names
)
print(f"Shin: {result_shin.probabilities}")
# >> Shin: {'Home': 0.3593, 'Draw': 0.4232, 'Away': 0.2174}
```
An Empirical Test: Which Method is Most Accurate?
While the theoretical differences are interesting, the crucial question is whether one method produces more accurate probabilities than another. To test this, we can take a historical dataset of odds and outcomes and evaluate the accuracy of each method using a proper scoring rule.
For this analysis, I used the closing odds for all 380 matches from the 2024/25 English Premier League season. For each match, I calculated the implied 1X2 probabilities for Bet365's odds using every method. I then scored these probabilities against the actual match outcome (Home Win, Draw, or Away Win) using the Ranked Probability Score (RPS).
The RPS is a measure of the distance between a probabilistic forecast and the observed outcome, where a lower score indicates a more accurate forecast.
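To make the metric concrete, here is a hand-rolled RPS for a single match. (`penaltyblog` provides this via `pb.metrics.rps_average`; this standalone version is just for illustration.) It compares the cumulative forecast distribution against the cumulative observed outcome:

```python
def rps(probs, outcome):
    """Ranked Probability Score for one forecast over ordered outcomes (lower is better)."""
    cum_forecast, cum_observed, score = 0.0, 0.0, 0.0
    for i in range(len(probs) - 1):
        cum_forecast += probs[i]
        cum_observed += 1.0 if i == outcome else 0.0
        score += (cum_forecast - cum_observed) ** 2
    return score / (len(probs) - 1)

# A home win (outcome 0) scored against the multiplicative probabilities
print(round(rps([0.3587, 0.4211, 0.2201], 0), 4))  # 0.2299
```

A perfect forecast (all mass on the observed outcome) scores exactly zero.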
```python
import penaltyblog as pb
import pandas as pd

# Get the fixture data for the 2024-2025 Premier League season
df = pb.scrapers.FootballData("ENG Premier League", "2024-2025").get_fixtures()

# Map the full time result to a numerical outcome (0 for Home, 1 for Draw, 2 for Away)
df["outcome"] = df["ftr"].map({"H": 0, "D": 1, "A": 2})


def normalise(x, method):
    """
    Calculates implied probabilities from betting odds using a specified method.

    Args:
        x (pd.Series): A row of the DataFrame containing 'b365_h', 'b365_d', 'b365_a' columns.
        method (str): The normalisation method to use.

    Returns:
        pd.Series: A pandas Series with the normalised probabilities for home, draw, and away.
    """
    res = pb.implied.calculate_implied(
        x[["b365_h", "b365_d", "b365_a"]],
        method=method
    )
    # Return the probabilities as a Series with named columns
    return pd.Series(res.probabilities, index=["prob_h", "prob_d", "prob_a"])


# The normalisation methods to compare
methods = [
    "additive",
    "multiplicative",
    "power",
    "differential_margin_weighting",
    "shin",
    "odds_ratio",
    "logarithmic",
]

results = {}

for method in methods:
    # Apply the normalise function to each row of the DataFrame
    norm_probs = df.apply(normalise, axis=1, args=(method,))

    # Calculate the average Ranked Probability Score (RPS) for the current method
    rps = pb.metrics.rps_average(norm_probs[["prob_h", "prob_d", "prob_a"]], df["outcome"])

    # Store the RPS to five decimal places, matching the results table below
    results[method] = round(rps, 5)

print(results)
```
Results
The results show a tight race with a compelling conclusion. While the simple multiplicative method was technically the most accurate with the lowest RPS of 0.19724, the differences among the top methods are negligible.
The more theoretically grounded Odds Ratio and Logarithmic methods were virtually indistinguishable in performance, tying for a close second place. This suggests that for a highly efficient market like the English Premier League, several different models for how bookmakers apply their margin can produce similarly accurate probabilities.
| Method | RPS |
|---|---|
| multiplicative | 0.19724 |
| logarithmic | 0.19730 |
| odds_ratio | 0.19730 |
| shin | 0.19731 |
| additive | 0.19736 |
| differential_margin_weighting | 0.19736 |
| power | 0.19739 |
Conclusion
Removing the bookmaker's margin is a foundational step in quantitative sports analysis. With the latest update to `penaltyblog`, analysts now have a comprehensive and easy-to-use toolkit to perform this task with greater precision.