Imagine you're watching a soccer match and your team's midfielder has the ball at the halfway line. How dangerous is this position? What about if they dribble forward 10 yards? Or make a pass to the wing? Expected Threat (xT), originally developed by Sarah Rudd and popularised by Karun Singh, attempts to answer these questions by quantifying the offensive value of every position on the pitch.
Unlike simpler metrics such as expected goals (xG) that only measure shot quality, xT evaluates both immediate shooting opportunities and the potential for creating future scoring chances. This makes it useful for analyzing buildup play and measuring contributions from those players who don't directly create shots.
Karun Singh's implementation of expected threat relies on iterative calculations that gradually converge to the final xT values. While this works, we can take a more elegant route and use linear algebra to calculate the xT values directly. By reformulating the problem as a system of linear equations, we gain the following benefits:

- All xT values are solved for simultaneously, with no need to track convergence.
- The computation is faster and scales better to large datasets.
- The implementation is shorter and arguably more elegant.
In this article, I'll show how to use linear algebra to calculate xT efficiently, and walk through the process in Python, providing a step-by-step guide for implementing this approach in practice.
Let's start off by understanding the theory behind xT, which can be described by the following equation:

$$xT_{x,y} = \underbrace{s_{x,y} \cdot g_{x,y}}_{\text{shot threat}} + \underbrace{m_{x,y} \cdot \sum_{z=1}^{16} \sum_{w=1}^{12} T_{(x,y)\to(z,w)} \, xT_{z,w}}_{\text{move threat}}$$
At its core, xT breaks down the value of any position on the pitch into two components: the immediate threat of scoring (shot threat) and the potential to create better opportunities through movement of the ball (move threat). Let's examine each component in more detail:
The first term, $s_{x,y} \cdot g_{x,y}$, represents the immediate threat of scoring from position $(x,y)$: the probability of taking a shot from that zone, $s_{x,y}$, multiplied by the probability that a shot from there results in a goal, $g_{x,y}$.
The second term captures the potential threat from moving the ball: the probability of moving the ball from that zone, $m_{x,y}$, multiplied by the xT of each possible destination zone $(z,w)$, weighted by the transition probability $T_{(x,y)\to(z,w)}$ of the ball actually ending up there.
So, the equation attempts to measure a fundamental concept in soccer: the value of a position comes both from the immediate scoring opportunity and the potential to create better opportunities through moving the ball. For example, when Phil Foden carries the ball from the wing into the penalty area, he's moving from a low-xT position (around 0.01) to a much more dangerous position (around 0.3).
The recursive nature of this equation, where xT appears on both sides, is what typically necessitates an iterative solution. However, by recognizing that xT defines a system of linear equations, we can solve it directly with matrix algebra using Python's linear algebra libraries.
To calculate xT, we need event data capturing how the ball moves around the pitch. While the model can incorporate various types of ball movements, for simplicity we'll focus on three fundamental events in this article:

- Passes (moving the ball from one location to another)
- Shots
- Goals
And for each event, we require the following information:

- The starting coordinates (`x`, `y`)
- The ending coordinates (`end_x`, `end_y`), where applicable
- The event type (`event_type`)
Here's a sample of what my data looks like - note that these were events collected for the English Premier League between 2017 and 2023, giving us around 3,000,000 relevant events to fit the model to. Unfortunately, I can't share the data here but there are plenty of free data sources that you can use to replicate this.
| | x | y | end_x | end_y | event_type |
|---|---|---|---|---|---|
| 0 | 45.8 | 21.8 | 53.3 | 26.7 | pass |
| 3 | 52.3 | 11.3 | 45.6 | 8.6 | pass |
| 4 | 38.4 | 50.1 | 40.0 | 64.8 | pass |
| 5 | 2.3 | 23.6 | 32.4 | 17.3 | pass |
| 6 | 62.9 | 9.4 | 63.8 | 1.6 | pass |
Table 1: Example of the data used here to calculate xT
The first step in calculating xT is dividing the pitch into a grid. This transforms the continuous space of the soccer / football pitch into discrete zones that we can analyze. Each zone in the grid will have its own xT value, representing the expected threat when the ball is in that position.
The original methodology calculates xT through iteration:

1. Initialize the xT value of every grid zone to zero.
2. Update each zone's xT using the equation above, plugging the current xT estimates into the move term.
3. Repeat until the values stop changing by more than a small tolerance, i.e. until convergence.
While effective, this method requires tracking convergence and is computationally more intensive for larger datasets.
Instead, we can reformulate the problem as a matrix equation:
$$\mathbf{X} = \mathbf{S} + \mathbf{M}\mathbf{T}\mathbf{X}$$
Where:

- $\mathbf{X}$ is the vector of xT values, one per grid zone
- $\mathbf{S}$ is the vector of shot payoffs, $s_{x,y} \cdot g_{x,y}$, for each zone
- $\mathbf{M}$ is a diagonal matrix of the move probabilities $m_{x,y}$
- $\mathbf{T}$ is the transition matrix of probabilities $T_{(x,y)\to(z,w)}$ between zones
This approach allows us to solve for all xT values simultaneously, eliminating the need for iteration.
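To see why no iteration is needed, rearrange the matrix equation to isolate $\mathbf{X}$:

$$\mathbf{X} = \mathbf{S} + \mathbf{M}\mathbf{T}\mathbf{X} \implies (\mathbf{I} - \mathbf{M}\mathbf{T})\,\mathbf{X} = \mathbf{S} \implies \mathbf{X} = (\mathbf{I} - \mathbf{M}\mathbf{T})^{-1}\,\mathbf{S}$$

In practice we won't compute the inverse explicitly; instead we'll solve the system $(\mathbf{I} - \mathbf{M}\mathbf{T})\,\mathbf{X} = \mathbf{S}$ with `np.linalg.solve`, which is faster and more numerically stable than explicit inversion.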
The choice of grid resolution for the pitch depends on the trade-offs between computational efficiency and spatial precision. Some common resolutions and their pros and cons are outlined below:
| Resolution | Advantages | Disadvantages |
|---|---|---|
| $12 \times 8$ (Coarse) | More data per zone, faster computation, more stable estimates | Less spatial detail |
| $16 \times 12$ (Fine) | Better spatial precision, finer tactical insights | Sparser data, longer computation |
| $20 \times 15$ (Very Fine) | Highly detailed spatial patterns | Even sparser data, computationally intensive |
Table 2: Grid resolution trade-offs
For this article, we'll use a $16 \times 12$ grid. This resolution strikes a balance between capturing detailed spatial patterns, such as distinguishing the penalty spot from the edge of the box, and maintaining sufficient data in each zone for reliable estimates. Additionally, the computational requirements are manageable with this resolution.
Next, let's dive into the Python implementation, showing how to construct the required matrices efficiently.
Before we can calculate xT, we need to understand how players use each zone of the pitch. This involves computing three key probabilities:

- $s_{x,y}$: the probability that a player shoots when in a given zone
- $m_{x,y}$: the probability that a player moves (passes) the ball when in a given zone
- $g_{x,y}$: the probability that a shot from a given zone results in a goal
To calculate xT, we need to map raw pitch coordinates onto our $16 \times 12$ grid system. Here’s how to map any position on the pitch to its corresponding grid cell:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mplsoccer import Pitch
def map_coordinates_to_grid_vectorized(x, y, grid_width=16, grid_height=12):
"""
Vectorized function to convert pitch coordinates (x, y) into grid cell indices,
handling missing values.
Parameters:
x (pd.Series or np.array): X-coordinates on the pitch (normalized to 0-100).
y (pd.Series or np.array): Y-coordinates on the pitch (normalized to 0-100).
grid_width (int): Number of grid cells across the width of the pitch.
grid_height (int): Number of grid cells across the height of the pitch.
Returns:
tuple: Two arrays (grid_x, grid_y) representing grid cell indices. Missing values
will be represented as None.
"""
PITCH_WIDTH = 100
PITCH_HEIGHT = 100
# Handle missing values (NaN or None)
valid = ~pd.isna(x) & ~pd.isna(y)
# Initialize arrays with None for invalid coordinates
grid_x = np.full_like(x, None, dtype=object)
grid_y = np.full_like(y, None, dtype=object)
# Compute grid indices only for valid rows
grid_x[valid] = np.floor(x[valid] * grid_width / PITCH_WIDTH).clip(0, grid_width - 1).astype(int)
grid_y[valid] = np.floor(y[valid] * grid_height / PITCH_HEIGHT).clip(0, grid_height - 1).astype(int)
return grid_x, grid_y
df = pd.read_csv("path/to/my/data.csv")
# Map starting coordinates to grid cells
df["grid_x"], df["grid_y"] = map_coordinates_to_grid_vectorized(df["x"], df["y"])
# Map ending coordinates to grid cells, handling missing values
df["grid_x_end"], df["grid_y_end"] = map_coordinates_to_grid_vectorized(df["end_x"], df["end_y"])
| | x | y | end_x | end_y | grid_x | grid_y | event_type |
|---|---|---|---|---|---|---|---|
| 0 | 45.8 | 21.8 | 53.3 | 26.7 | 7 | 2 | pass |
| 3 | 52.3 | 11.3 | 45.6 | 8.6 | 8 | 1 | pass |
| 4 | 38.4 | 50.1 | 40.0 | 64.8 | 6 | 6 | pass |
| 5 | 2.3 | 23.6 | 32.4 | 17.3 | 0 | 2 | pass |
Table 3: Mapping coordinates to grid cells
By mapping our continuous pitch coordinates into discrete grid zones, we’ve converted our raw event data to a structured format suitable for xT calculations. For instance, a position at (50, 50) would map to grid zone (8, 6) in our $16 \times 12$ grid.
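As a quick sanity check, we can confirm the (50, 50) example directly using the function defined above:

```python
# Quick check: the centre of the pitch should land in grid cell (8, 6)
gx, gy = map_coordinates_to_grid_vectorized(pd.Series([50.0]), pd.Series([50.0]))
print(gx[0], gy[0])  # expected output: 8 6
```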
Next, we need to calculate three core matrices:

- A pass count matrix: the number of passes starting in each grid zone
- A shot count matrix: the number of shots taken from each grid zone
- A goal count matrix: the number of goals scored from each grid zone
These matrices will be used to calculate the probabilities ($s_{x,y}$, $g_{x,y}$, and $T_{(x,y)\to(z,w)}$) used in the xT computation.
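Concretely, in terms of the per-zone counts we're about to compute, these probabilities are:

$$s_{x,y} = \frac{\text{shots}_{x,y}}{\text{shots}_{x,y} + \text{passes}_{x,y}}, \qquad m_{x,y} = \frac{\text{passes}_{x,y}}{\text{shots}_{x,y} + \text{passes}_{x,y}}, \qquad g_{x,y} = \frac{\text{goals}_{x,y}}{\text{shots}_{x,y}}$$

$$T_{(x,y)\to(z,w)} = \frac{\text{passes from } (x,y) \text{ to } (z,w)}{\text{passes from } (x,y)}$$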
Using our grid coordinates, we'll now populate the matrices that count the number of passes, shots, and goals in each grid zone. The following code achieves this by iterating through the event data and incrementing the corresponding grid cell:
# Initialize the empty grids
GRID_HEIGHT, GRID_WIDTH = 12, 16
pass_count = np.zeros((GRID_HEIGHT, GRID_WIDTH))
shot_count = np.zeros((GRID_HEIGHT, GRID_WIDTH))
goal_count = np.zeros((GRID_HEIGHT, GRID_WIDTH))
# Filter events by type
pass_df = df.query("event_type == 'pass'")
goals_df = df.query("event_type == 'goal'")
shots_df = df.query("event_type == 'shot'")
# Group by grid coordinates and count events
pass_counts = pass_df.groupby(["grid_y", "grid_x"]).size()
shot_counts = shots_df.groupby(["grid_y", "grid_x"]).size()
goal_counts = goals_df.groupby(["grid_y", "grid_x"]).size()
# Fill in the grids using the grouped data
for (y, x), count in pass_counts.items():
pass_count[y, x] = count
for (y, x), count in shot_counts.items():
shot_count[y, x] = count
for (y, x), count in goal_counts.items():
goal_count[y, x] = count
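Before moving on, it's worth a quick check that the grid totals line up with the raw event counts (rows with missing grid coordinates are dropped by the groupby, so the grid totals may be slightly lower):

```python
# Sanity check: grid totals vs. raw event counts
print(int(pass_count.sum()), len(pass_df))
print(int(shot_count.sum()), len(shots_df))
print(int(goal_count.sum()), len(goals_df))
```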
To ensure the data looks correct, let’s visualize the count matrices as heatmaps. Each heatmap provides a spatial representation of the frequency of each event type across the pitch.
statistic = shot_count
y, x = statistic.shape
x_grid = np.linspace(0, 100, x + 1)
y_grid = np.linspace(0, 100, y + 1)
cx = x_grid[:-1] + 0.5 * (x_grid[1] - x_grid[0])
cy = y_grid[:-1] + 0.5 * (y_grid[1] - y_grid[0])
stats = dict(statistic=statistic, x_grid=x_grid, y_grid=y_grid, cx=cx, cy=cy)
pitch = Pitch(pitch_type='opta', line_zorder=2, pitch_color='#22312b', line_color='#efefef')
fig, ax = pitch.draw(figsize=(8, 4.125))
pcm = pitch.heatmap(stats, ax=ax, cmap='viridis')
Figure 1: Shot Count heatmap
statistic = goal_count
y, x = statistic.shape
x_grid = np.linspace(0, 100, x + 1)
y_grid = np.linspace(0, 100, y + 1)
cx = x_grid[:-1] + 0.5 * (x_grid[1] - x_grid[0])
cy = y_grid[:-1] + 0.5 * (y_grid[1] - y_grid[0])
stats = dict(statistic=statistic, x_grid=x_grid, y_grid=y_grid, cx=cx, cy=cy)
pitch = Pitch(pitch_type='opta', line_zorder=2, pitch_color='#22312b', line_color='#efefef')
fig, ax = pitch.draw(figsize=(8, 4.125))
pcm = pitch.heatmap(stats, ax=ax, cmap='viridis')
Figure 2: Goal Count heatmap
statistic = pass_count
y, x = statistic.shape
x_grid = np.linspace(0, 100, x + 1)
y_grid = np.linspace(0, 100, y + 1)
cx = x_grid[:-1] + 0.5 * (x_grid[1] - x_grid[0])
cy = y_grid[:-1] + 0.5 * (y_grid[1] - y_grid[0])
stats = dict(statistic=statistic, x_grid=x_grid, y_grid=y_grid, cx=cx, cy=cy)
pitch = Pitch(pitch_type='opta', line_zorder=2, pitch_color='#22312b', line_color='#efefef')
fig, ax = pitch.draw(figsize=(8, 4.125))
pcm = pitch.heatmap(stats, ax=ax, cmap='viridis')
Figure 3: Pass Count heatmap
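Since the plotting code above is repeated verbatim for each count matrix, you could wrap it in a small helper function. Here's a sketch using the same mplsoccer calls as above:

```python
def plot_grid_heatmap(statistic, cmap='viridis'):
    """Plot a grid statistic as a heatmap on an Opta-style pitch."""
    y, x = statistic.shape
    x_grid = np.linspace(0, 100, x + 1)
    y_grid = np.linspace(0, 100, y + 1)
    cx = x_grid[:-1] + 0.5 * (x_grid[1] - x_grid[0])
    cy = y_grid[:-1] + 0.5 * (y_grid[1] - y_grid[0])
    stats = dict(statistic=statistic, x_grid=x_grid, y_grid=y_grid, cx=cx, cy=cy)
    pitch = Pitch(pitch_type='opta', line_zorder=2, pitch_color='#22312b', line_color='#efefef')
    fig, ax = pitch.draw(figsize=(8, 4.125))
    pitch.heatmap(stats, ax=ax, cmap=cmap)
    return fig, ax

# Usage, e.g.: plot_grid_heatmap(pass_count)
```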
Now that we have the event counts for each grid cell, we can calculate the probabilities of each event type occurring in a given cell. We’ll handle cases where division by zero might occur by replacing NaN or infinite values with 0.
Notice that we also scale the move probability by 0.79. At the moment, our move probabilities assume a 100% chance that a pass successfully moves the ball to its target location. This is clearly not true, as passes can be misplaced or intercepted, so we scale the move probability by the average pass accuracy in the dataset. This prevents us from overestimating the value of a pass when we solve for the xT values using the linear algebra approach. There are more sophisticated ways to handle this, but for now, this approach does a good job of balancing the value of a pass against that of a shot.
with np.errstate(divide='ignore', invalid='ignore'):
    move_probability = pass_count / (pass_count + shot_count)
    shot_probability = shot_count / (pass_count + shot_count)
    goal_probability = goal_count / shot_count
# Replace NaN / inf values caused by zero denominators with 0
move_probability = np.nan_to_num(move_probability, nan=0.0, posinf=0.0, neginf=0.0)
shot_probability = np.nan_to_num(shot_probability, nan=0.0, posinf=0.0, neginf=0.0)
goal_probability = np.nan_to_num(goal_probability, nan=0.0, posinf=0.0, neginf=0.0)
# Scale the move probability by the average pass completion rate
move_probability *= 0.79
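As an aside, the 0.79 figure doesn't have to be hard-coded. If your event data includes a pass outcome column (the `outcome` column and its values below are hypothetical and depend on your data provider), the average pass accuracy can be derived directly:

```python
# Hypothetical: derive the average pass completion rate from an outcome column
# (column name and values depend on your data provider)
avg_pass_accuracy = (pass_df["outcome"] == "successful").mean()

# ...and use it in place of the hard-coded 0.79:
# move_probability *= avg_pass_accuracy
```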
We are now ready to calculate the pass transition matrix, which describes the probability of moving the ball from one grid zone to another. This matrix will have dimensions equal to the total number of grid cells (e.g., $192 \times 192$ for a $16 \times 12$ grid). Each element at row $i$ and column $j$ represents the probability of transitioning from grid zone $i$ to grid zone $j$ based on the observed transitions from our event data.
We start off by aggregating the event data by the source and target grid zones. This will give us a count of the number of passes from each source zone to each target zone, which we will use to calculate the transition probabilities between grid zones.
These probabilities are then assigned to the appropriate indices of the corresponding sub-matrix of `t_matrices`, mapping the transition probabilities from each starting grid zone to all possible ending zones. This process encodes the spatial patterns of ball movement across the pitch into a format that can be used more easily in the xT computation.
# Aggregate passes by start and end grid locations
pass_end_aggs = (
pass_df
.groupby(["grid_x", "grid_y", "grid_x_end", "grid_y_end"])
.size()
.reset_index(name="count_end") # Count passes ending at each location
)
pass_start_aggs = (
pass_df
.groupby(["grid_x", "grid_y"])
.size()
.reset_index(name="count_start") # Count passes starting from each location
)
# Merge the start and end aggregates and calculate pass probabilities
pass_aggs = pass_end_aggs.merge(pass_start_aggs, on=["grid_x", "grid_y"], how="left")
pass_aggs["pass_prob"] = pass_aggs["count_end"] / pass_aggs["count_start"]
# Drop unnecessary columns
pass_aggs = pass_aggs.drop(columns=["count_end", "count_start"])
# Initialize the transition matrices
GRID_HEIGHT, GRID_WIDTH = 12, 16
t_matrices = np.zeros((GRID_HEIGHT, GRID_WIDTH, GRID_HEIGHT, GRID_WIDTH))
# Populate the transition matrices
for _, row in pass_aggs.iterrows():
start_x, start_y = int(row["grid_x"]), int(row["grid_y"])
end_x, end_y = int(row["grid_x_end"]), int(row["grid_y_end"])
t_matrices[start_y, start_x, end_y, end_x] = row["pass_prob"]
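A quick way to validate the transition matrices is to check that the outgoing probabilities from each starting zone sum to roughly one (zones with no recorded passes will sum to zero, and zones containing passes with missing end coordinates may sum to slightly less than one):

```python
# Outgoing transition probabilities per starting zone should sum to ~1
row_sums = t_matrices.reshape(GRID_HEIGHT * GRID_WIDTH, -1).sum(axis=1)
print(row_sums.min(), row_sums.max())
```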
Now that we have the pass transition matrices, we can finally calculate the expected threat (xT) for each grid zone. Rather than using the original convergence method, we will take the more efficient approach of using linear algebra to solve for xT in one step.
The code below constructs a system of linear equations representing how xT flows across the pitch. Here's how it works at a high level:

- It builds the matrix $\mathbf{A} = \mathbf{I} - \mathbf{M}\mathbf{T}$ from the move probabilities and the transition matrices `t_matrices`, which describe how xT flows from one grid zone to others during possession.
- It builds the vector $\mathbf{b}$ as the element-wise product of the shot and goal probabilities, i.e. the immediate payoff of shooting from each zone.
- It uses `np.linalg.solve` to find $\mathbf{x}$, the expected threat (xT) for each grid zone.

The output represents how both movement and shooting probabilities contribute to the scoring potential at every location on the pitch, producing the xT surface map.
N_CELLS = GRID_WIDTH * GRID_HEIGHT
# Flatten input probabilities
shot_prob_flat = shot_probability.flatten()
goal_prob_flat = goal_probability.flatten()
move_prob_flat = move_probability.flatten()
# Initialize matrix A as an identity matrix
A = np.eye(N_CELLS)
# Compute adjustments to A for movement probabilities and transitions
for i in range(N_CELLS):
# Map flat index to grid coordinates
y, x = divmod(i, GRID_WIDTH)
# Subtract the scaled transition matrix from A
A[i] -= move_prob_flat[i] * t_matrices[y, x].flatten()
# Create vector b (shooting payoff)
b = shot_prob_flat * goal_prob_flat
# Solve the linear system A * x = b
xt_grid_flat = np.linalg.solve(A, b)
# Reshape the result into the grid format
xt_grid = xt_grid_flat.reshape(GRID_HEIGHT, GRID_WIDTH)
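If you want to convince yourself that this direct solution matches the original iterative method, here's a minimal sketch of the iteration using the arrays defined above; the maximum difference from `xt_grid` should be negligible:

```python
# Iterative version for comparison: repeatedly apply the xT equation
xt_iterative = np.zeros((GRID_HEIGHT, GRID_WIDTH))
for _ in range(100):
    xt_iterative = (
        shot_probability * goal_probability
        + move_probability * np.einsum('yxzw,zw->yx', t_matrices, xt_iterative)
    )

# Compare against the direct linear algebra solution
print(np.abs(xt_iterative - xt_grid).max())
```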
Let's now visualize the xT surface map, to check the expected threat at every location on the pitch.
statistic = xt_grid
y, x = statistic.shape
x_grid = np.linspace(0, 100, x + 1)
y_grid = np.linspace(0, 100, y + 1)
cx = x_grid[:-1] + 0.5 * (x_grid[1] - x_grid[0])
cy = y_grid[:-1] + 0.5 * (y_grid[1] - y_grid[0])
stats = dict(statistic=statistic, x_grid=x_grid, y_grid=y_grid, cx=cx, cy=cy)
pitch = Pitch(pitch_type='opta', line_zorder=2, pitch_color='#22312b', line_color='#efefef')
fig, ax = pitch.draw(figsize=(8, 4.125))
pcm = pitch.heatmap(stats, ax=ax, cmap='viridis')
for i, row in enumerate(statistic):
for j, val in enumerate(row):
ax.text(cx[j], cy[i], f'{val:.2f}', ha='center', va='center', color='white', fontsize=6)
Figure 4: xT Grid
The results look sensible, with the highest expected threat being in the zones closest to the opponent's goal and the xT decreasing as you move away from the goal. Let's compare this to the xT grid created using the original convergence method shown on the Soccermatics website. There are some minor differences in values, most likely caused by the different datasets used, but the general shape of our xT surface map is reassuringly similar.
Figure 5: Soccermatics xT Grid calculated using the convergence method
Now that we've calculated the xT grid for every cell on the pitch, we can take the analysis a step further by evaluating individual player contributions. By mapping players' actions onto the xT grid, we can quantify each player's offensive impact and their ability to create and capitalize on scoring opportunities.
There are several ways to approach this analysis. For instance, we could focus solely on successful on-ball events, consider only events with positive xT values, or even calculate the net xT by subtracting the xT of negative events from positive ones. For simplicity, we’ll focus on successful passes with positive xT values. This allows us to quantify each player’s offensive impact while avoiding penalizing players for negative events. Additionally, it accounts for context — teams like Manchester City, known for their high possession and intricate passing style, often make backward passes to maintain control without any negative intent. Including such passes as negative contributions could unfairly skew the analysis.
For this article, we’ll also focus exclusively on passes rather than including goals. While it’s perfectly valid to incorporate goals, our interest here lies in evaluating players involved in the build-up play leading to scoring opportunities. Goals can often be better analyzed using a separate expected goals (xG) model, which is specifically designed to quantify the quality and likelihood of scoring from individual shots. By isolating passes, we can concentrate on the contributions players make during the progression and creation phases of an attack.
Let's take a look at the top players by xT for the English Premier League 2023/24 season.
# Load the data
player_df = pd.read_csv("path/to/player/data.csv")
# Map the coordinates to the grid
player_df["grid_x"], player_df["grid_y"] = map_coordinates_to_grid_vectorized(player_df["x"], player_df["y"])
player_df["grid_x_end"], player_df["grid_y_end"] = map_coordinates_to_grid_vectorized(player_df["end_x"], player_df["end_y"])
# Keep only passes with valid start and end grid cells
player_df = player_df.query("event_type == 'pass'").dropna(
    subset=["grid_x", "grid_y", "grid_x_end", "grid_y_end"]
)
# Calculate the xT difference per action (end zone xT minus start zone xT)
player_df["xt_diff"] = player_df.apply(
    lambda row: xt_grid[int(row["grid_y_end"]), int(row["grid_x_end"])] - xt_grid[int(row["grid_y"]), int(row["grid_x"])],
    axis=1
)
# Filter for positive xT differences and aggregate by player
top_players = (
player_df.query("xt_diff > 0")
.groupby("player_id", as_index=False)["xt_diff"]
.sum()
.sort_values("xt_diff", ascending=False)
.head(10)
)
| | Name | xT |
|---|---|---|
| 1 | Pascal Groß | 21.5 |
| 2 | Rodri | 21.3 |
| 3 | Bruno Fernandes | 18.7 |
| 4 | Martin Ødegaard | 18.4 |
| 5 | Trent Alexander-Arnold | 17.3 |
| 6 | Kieran Trippier | 15.2 |
| 7 | Declan Rice | 15.1 |
| 8 | Bruno Guimarães | 14.3 |
| 9 | Morgan Gibbs-White | 13.6 |
| 10 | James Ward-Prowse | 13.2 |
Table 4: Top 10 players by xT for the English Premier League 2023/24 season
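As a side note, the `apply` call used above to compute `xt_diff` can be slow on a full season of events. The same calculation can be done with NumPy fancy indexing (a sketch, assuming the data has already been filtered to rows with valid grid cells as above):

```python
# Vectorized alternative to the row-wise apply for xt_diff
start_y = player_df["grid_y"].astype(int).to_numpy()
start_x = player_df["grid_x"].astype(int).to_numpy()
end_y = player_df["grid_y_end"].astype(int).to_numpy()
end_x = player_df["grid_x_end"].astype(int).to_numpy()
player_df["xt_diff"] = xt_grid[end_y, end_x] - xt_grid[start_y, start_x]
```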
By using linear algebra, we’ve introduced an efficient and elegant way to quantify player contributions to their team’s offensive efforts. This approach not only simplifies the computation of xT but also opens the door for more scalable analyses using larger datasets.
It’s also worth noting that this article has been intentionally simplified to (hopefully) make it more accessible and easy to understand. There are numerous ways to refine and expand upon this approach, which I'm planning on exploring in a future blog post.
Submit your comments below, and feel free to format them using Markdown if you want. Comments typically take up to 24 hours to appear on the site and be answered, so please be patient.
Thanks!