Blog List.

Calculating Expected Threat in Python Using Linear Algebra

Introduction

Imagine you're watching a soccer match and your team's midfielder has the ball at the halfway line. How dangerous is this position? What about if they dribble forward 10 yards? Or make a pass to the wing? Expected Threat (xT), originally developed Sarah Rudd and popularised by Karun Singh, attempts to answer these questions by quantifying the offensive value of every position on the pitch.

Unlike simpler metrics such as expected goals (xG) that only measure shot quality, xT evaluates both immediate shooting opportunities and the potential for creating future scoring chances. This makes it useful for analyzing buildup play and measuring contributions from those players who don't directly create shots.

Computational Approach

Karun Singh's implementation of expected threat relies on iterative calculations that gradually converge to the final xT values. While this approach works, we can actually take the more elegant approach of using linear algebra to calculate the xT values directly. By reformulating the problem as a system of linear equations, we can achieve the following benefits:

  • Calculate all values in a single step rather than multiple iterations
  • Eliminate the need to check for convergence
  • Improve processing speed by replacing nested loops with efficient matrix operations
  • Reduce code complexity
  • Express the model in standard linear algebra notation

In this article, I'll show how to use linear algebra to calculate xT efficiently, and walk through the process in Python, providing a step-by-step guide for implementing this approach in practice.

Understanding Expected Threat

Let's start off by understanding the theory behind xT, which can be described by the following equation:

At its core, xT breaks down the value of any position on the pitch into two components: the immediate threat of scoring (shot threat) and the potential to create better opportunities through movement of the ball (move threat). Let's examine each component in more detail:

Shot Threat Component

The first term represents the immediate threat of scoring from position $(x,y)$:

  • $s_{x,y}$ represents the probability of taking a shot from this position
  • $g_{x,y}$ represents the probability of scoring if a shot is taken

Move Threat Component

The second term captures the potential threat from moving the ball:

  • $m_{x,y}$ is the probability of moving the ball (instead of shooting)
  • $T_{(x,y)\to (z,w)}$ represents the probability of moving the ball from position $(x,y)$ to position $(z,w)$
  • $\mathrm{xT}_{z,w}$ is the expected threat at the destination position

So, the equation attempts to measure a fundamental concept in soccer: the value of a position comes both from the immediate scoring opportunity and the potential to create better opportunities through moving the ball. For example, when Phil Foden carries the ball from the wing into the penalty area, he's moving from a low-xT position (around 0.01) to a much more dangerous position (around 0.3).

The recursive nature of this equation, where xT appears on both sides, is what typically necessitates an iterative solution. However, by recognizing xT as a system of linear equations, we can solve it directly using matrix algebra that can be solved efficiently using Python's linear algebra libraries.

Data

To calculate xT, we need event data capturing how the ball moves around the pitch. While the model can incorporate various types of ball movements, for simplicity we'll focus on three fundamental events in this article:

  • Shots
  • Goals
  • Passes

Data Structure

And for each event, we require the following information:

  • Starting position $(x, y)$
  • Ending position of passes $(\text{end_x}, \text{end_y})$
  • Event type (shot, goal, or pass)

Here's a sample of what my data looks like - note that these were events collected for the English Premier League between 2017 and 2023, giving us around 3,000,000 relevant events to fit the model to. Unfortunately, I can't share the data here but there are plenty of free data sources that you can use to replicate this.

x y end_x end_y event_type
0 45.8 21.8 53.3 26.7 pass
3 52.3 11.3 45.6 8.6 pass
4 38.4 50.1 40.0 64.8 pass
5 2.3 23.6 32.4 17.3 pass
6 62.9 9.4 63.8 1.6 pass

Table 1: Example of the data used here to calculate xT

Calculating xT

The Grid Approach

The first step in calculating xT is dividing the pitch into a grid. This transforms the continuous space of the soccer / football pitch into discrete zones that we can analyze. Each zone in the grid will have its own xT value, representing the expected threat when the ball is in that position.

Iterative vs Matrix Approach

The original methodology calculates xT through iteration:

  • Initialize each zone with base shooting probabilities
  • Update xT values by considering contributions from neighboring zones through ball movements
  • Repeat the process until xT values stabilize (typically requiring 5 iterations)

While effective, this method requires tracking convergence and is computationally more intensive for larger datasets.

Instead, we can reformulate the problem as a matrix equation:

$$\mathbf{X} = \mathbf{S} + \mathbf{M}\mathbf{T}\mathbf{X}$$

Where:

  • $\mathbf{X}$: vector of xT values we want to find
  • $\mathbf{S}$: vector representing direct shooting threat for each zone
  • $\mathbf{M}$: Diagonal matrix containing movement probabilities for each zone
  • $\mathbf{T}$: Transition matrix capturing probabilities of moving between zones

This approach allows us to solve for all xT values simultaneously, eliminating the need for iteration.

Choosing Grid Resolution

The choice of grid resolution for the pitch depends on the trade-offs between computational efficiency and spatial precision. Some common resolutions and their pros and cons are outlined below:

Resolution Advantages Disadvantages
$12 \times 8$ (Coarse) More data per zone, Faster computation, more stable estimates Less spatial detail
$16 \times 12$ (Fine) Better spatial precision, finer tactical insights Sparser data, Longer computation
$20 \times 15$ (Very Fine) Highly detailed spatial patterns Even sparser data, computationally intensive

Table 2: Grid resolution trade-offs

For this article, we'll use a $16 \times 12$ grid. This resolution strikes a balance between capturing detailed spatial patterns, such as distinguishing the penalty spot from the edge of the box, and maintaining sufficient data in each zone for reliable estimates. Additionally, the computational requirements are manageable with this resolution.

Next, let's dive into the Python implementation, showing how to construct the required matrices efficiently.

Computing Transition Probabilities

Before we can calculate xT, we need to understand how players use each zone of the pitch. This involves computing three key probabilities:

Required Probabilities

  • Shot Probability ($s_{x,y}$): The likelihood of a shot being taken from each zone, calculated as the ratio of shots taken to all events in that zone.
  • Goal Probability ($g_{x,y}$): The conversion rate of shots from each zone, determined as the ratio of goals scored to shots taken from that zone.
  • Movement Probability ($T_{(x,y)\to(z,w)}$): The likelihood of moving the ball from zone $(x, y)$ to zone $(z, w)$, based on observed transitions in the data.

Mapping to Grid Coordinates

To calculate xT, we need to map raw pitch coordinates onto our $16 \times 12$ grid system. Here’s how to map any position on the pitch to its corresponding grid cell:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mplsoccer import Pitch


def map_coordinates_to_grid_vectorized(x, y, grid_width=16, grid_height=12):
    """
    Vectorized function to convert pitch coordinates (x, y) into grid cell indices,
    handling missing values.

    Parameters:
        x (pd.Series or np.array): X-coordinates on the pitch (normalized to 0-100).
        y (pd.Series or np.array): Y-coordinates on the pitch (normalized to 0-100).
        grid_width (int): Number of grid cells across the width of the pitch.
        grid_height (int): Number of grid cells across the height of the pitch.

    Returns:
        tuple: Two arrays (grid_x, grid_y) representing grid cell indices. Missing values
               will be represented as None.
    """
    PITCH_WIDTH = 100
    PITCH_HEIGHT = 100

    # Handle missing values (NaN or None)
    valid = ~pd.isna(x) & ~pd.isna(y)

    # Initialize arrays with None for invalid coordinates
    grid_x = np.full_like(x, None, dtype=object)
    grid_y = np.full_like(y, None, dtype=object)

    # Compute grid indices only for valid rows
    grid_x[valid] = np.floor(x[valid] * grid_width / PITCH_WIDTH).clip(0, grid_width - 1).astype(int)
    grid_y[valid] = np.floor(y[valid] * grid_height / PITCH_HEIGHT).clip(0, grid_height - 1).astype(int)

    return grid_x, grid_y

df = pd.read_csv("path/to/my/data.csv")

# Map starting coordinates to grid cells
df["grid_x"], df["grid_y"] = map_coordinates_to_grid_vectorized(df["x"], df["y"])

# Map ending coordinates to grid cells, handling missing values
df["grid_x_end"], df["grid_y_end"] = map_coordinates_to_grid_vectorized(df["end_x"], df["end_y"])
x y end_x end_y grid_x grid_y event_type
0 45.8 21.8 53.3 26.7 7 2 pass
3 52.3 11.3 45.6 8.6 8 1 pass
4 38.4 50.1 40.0 64.8 6 6 pass
5 2.3 23.6 32.4 17.3 0 2 pass

Table 3: Mapping coordinates to grid cells

By mapping our continuous pitch coordinates into discrete grid zones, we’ve converted our raw event data to a structured format suitable for xT calculations. For instance, a position at (50, 50) would map to grid zone (8, 6) in our $16 \times 12$ grid.

Next, we need to calculate three core matrices:

  • pass_count: Stores the number of passes originating from each grid zone.
  • shot_count: Records the number of shots taken from each grid zone.
  • goal_count: Tracks the number of goals scored from each grid zone.

These matrices will be used to calculate the probabilities ($s_{x,y}$, $g_{x,y}$, and $T_{(x,y)\to(z,w)}$) used in the xT computation.

Populating Event Count Matrices

Using our grid coordinates, we'll now populate the matrices that count the number of passes, shots, and goals in each grid zone. The following code achieves this by iterating through the event data and incrementing the corresponding grid cell:

# Initialize the empty grids
GRID_HEIGHT, GRID_WIDTH = 12, 16

pass_count = np.zeros((GRID_HEIGHT, GRID_WIDTH))
shot_count = np.zeros((GRID_HEIGHT, GRID_WIDTH))
goal_count = np.zeros((GRID_HEIGHT, GRID_WIDTH))

# Filter events by type
pass_df = df.query("event_type == 'pass'")
goals_df = df.query("event_type == 'goal'")
shots_df = df.query("event_type == 'shot'")

# Group by grid coordinates and count events
pass_counts = pass_df.groupby(["grid_y", "grid_x"]).size()
shot_counts = shots_df.groupby(["grid_y", "grid_x"]).size()
goal_counts = goals_df.groupby(["grid_y", "grid_x"]).size()

# Fill in the grids using the grouped data
for (y, x), count in pass_counts.items():
    pass_count[y, x] = count

for (y, x), count in shot_counts.items():
    shot_count[y, x] = count

for (y, x), count in goal_counts.items():
    goal_count[y, x] = count

Visualizing Event Count Matrices

To ensure the data looks correct, let’s visualize the count matrices as heatmaps. Each heatmap provides a spatial representation of the frequency of each event type across the pitch.

Shot Count heatmap

statistic = shot_count
y, x = statistic.shape
x_grid = np.linspace(0, 100, x + 1)
y_grid = np.linspace(0, 100, y + 1)
cx = x_grid[:-1] + 0.5 * (x_grid[1] - x_grid[0])
cy = y_grid[:-1] + 0.5 * (y_grid[1] - y_grid[0])
stats = dict(statistic=statistic, x_grid=x_grid, y_grid=y_grid, cx=cx, cy=cy)

pitch = Pitch(pitch_type='opta', line_zorder=2, pitch_color='#22312b', line_color='#efefef')
fig, ax = pitch.draw(figsize=(8, 4.125))

pcm = pitch.heatmap(stats, ax=ax, cmap='viridis')

Pelican

Figure 1: Shot Count heatmap

Goal Count heatmap

statistic = goal_count
y, x = statistic.shape
x_grid = np.linspace(0, 100, x + 1)
y_grid = np.linspace(0, 100, y + 1)
cx = x_grid[:-1] + 0.5 * (x_grid[1] - x_grid[0])
cy = y_grid[:-1] + 0.5 * (y_grid[1] - y_grid[0])
stats = dict(statistic=statistic, x_grid=x_grid, y_grid=y_grid, cx=cx, cy=cy)

pitch = Pitch(pitch_type='opta', line_zorder=2, pitch_color='#22312b', line_color='#efefef')
fig, ax = pitch.draw(figsize=(8, 4.125))

pcm = pitch.heatmap(stats, ax=ax, cmap='viridis')

Pelican

Figure 2: Goal Count heatmap*

Pass Count heatmap

statistic = pass_count
y, x = statistic.shape
x_grid = np.linspace(0, 100, x + 1)
y_grid = np.linspace(0, 100, y + 1)
cx = x_grid[:-1] + 0.5 * (x_grid[1] - x_grid[0])
cy = y_grid[:-1] + 0.5 * (y_grid[1] - y_grid[0])
stats = dict(statistic=statistic, x_grid=x_grid, y_grid=y_grid, cx=cx, cy=cy)

pitch = Pitch(pitch_type='opta', line_zorder=2, pitch_color='#22312b', line_color='#efefef')
fig, ax = pitch.draw(figsize=(8, 4.125))

pcm = pitch.heatmap(stats, ax=ax, cmap='viridis')

Pelican

Figure 3: Pass Count heatmap

Calculate Event Probabilities

Now that we have the event counts for each grid cell, we can calculate the probabilities of each event type occurring in a given cell. We’ll handle cases where division by zero might occur by replacing NaN or infinite values with 0.

Notice that we also scale the move probability by 0.79. At the moment, our move probabilities presume a 100% chance that a pass successfully moves the ball to its target location. This is clearly not true as passes can be misplaced or intercepted, so we scale the move probability by the average pass accuracy in the dataset. This prevents us from overestimating the value of a pass when we solve for xT values using the linear algebra approach. There are more sophisticated ways to handle this, but for now, this approach does a good job of balancing the value of a pass versus that of a shot.

with np.errstate(divide='ignore', invalid='ignore'):
    move_probability = pass_count / (pass_count + shot_count)
    shot_probability = shot_count / (pass_count + shot_count)
    goal_probability = goal_count / shot_count
    goal_probability = np.nan_to_num(goal_probability, nan=0.0, posinf=0.0, neginf=0.0)
    move_probability *= 0.79 

Pass Transition Matrices

We are now ready to calculate the pass transition matrix, which describes the probability of moving the ball from one grid zone to another. This matrix will have dimensions equal to the total number of grid cells (e.g., $192 \times 192$ for a $16 \times 12$ grid). Each element at row $i$ and column $j$ represents the probability of transitioning from grid zone $i$ to grid zone $j$ based on the observed transitions from our event data.

We start off by aggregating the event data by the source and target grid zones. This will give us a count of the number of passes from each source zone to each target zone, which we will use to calculate the transition probabilities between grid zones.

These probabilities are then assigned to the appropriate indices in the corresponding sub-matrix of t_matrices, mapping the transition probabilities from each starting grid zone to all possible ending zones. This process encodes the spatial patterns of ball movement across the pitch into a format that can be used more easily for the xT computation.

# Aggregate passes by start and end grid locations
pass_end_aggs = (
    pass_df
    .groupby(["grid_x", "grid_y", "grid_x_end", "grid_y_end"])
    .size()
    .reset_index(name="count_end")  # Count passes ending at each location
)

pass_start_aggs = (
    pass_df
    .groupby(["grid_x", "grid_y"])
    .size()
    .reset_index(name="count_start")  # Count passes starting from each location
)

# Merge the start and end aggregates and calculate pass probabilities
pass_aggs = pass_end_aggs.merge(pass_start_aggs, on=["grid_x", "grid_y"], how="left")
pass_aggs["pass_prob"] = pass_aggs["count_end"] / pass_aggs["count_start"]

# Drop unnecessary columns
pass_aggs = pass_aggs.drop(columns=["count_end", "count_start"])

# Initialize the transition matrices
GRID_HEIGHT, GRID_WIDTH = 12, 16
t_matrices = np.zeros((GRID_HEIGHT, GRID_WIDTH, GRID_HEIGHT, GRID_WIDTH))

# Populate the transition matrices
for _, row in pass_aggs.iterrows():
    start_x, start_y = int(row["grid_x"]), int(row["grid_y"])
    end_x, end_y = int(row["grid_x_end"]), int(row["grid_y_end"])
    t_matrices[start_y, start_x, end_y, end_x] = row["pass_prob"]

The xT Surface Map

Now that we have the pass transition matrices, we can finally calculate the expected threat (xT) for each grid zone. Rather than using the original convergence method, we will take the more efficient approach of using linear algebra to solve for xT in one step.

The code above constructs a system of linear equations to represent how xT flows across the pitch. Here's how it works at a high level:

  • Input probabilities: the probability matrices are flattened into vectors, representing the likelihood of events at each grid zone.
  • Matrix Setup ($\mathbf{A}$): the identity matrix ($\mathbf{A}$) is initialized, representing the default relationship where each grid cell depends only on itself. This is then adjusted by subtracting scaled transition probabilities from t_matrices, which describe how xT flows from one grid zone to others during possession.
  • Payoff Vector ($\mathbf{b}$): the payoff vector ($\mathbf{b}$) is constructed by combining shooting and goal probabilities, reflecting the immediate scoring threat for each grid cell.
  • Solving the System: The equation $\mathbf{A} \cdot \mathbf{x} = \mathbf{b}$ can now be solved using NumPy's np.linalg.solve to find $\mathbf{x}$, the expected threat (xT) for each grid zone.
  • Reshaping the Result: Finally, the solution vector $\mathbf{x}$ is reshaped back into a 2D grid format, aligning with the original pitch layout.

Our output now represents how both movement and shooting probabilities contribute to the scoring potential at every location on the pitch, producing the xT surface map.

N_CELLS = GRID_WIDTH * GRID_HEIGHT

# Flatten input probabilities
shot_prob_flat = shot_probability.flatten()
goal_prob_flat = goal_probability.flatten()
move_prob_flat = move_probability.flatten()

# Initialize matrix A as an identity matrix
A = np.eye(N_CELLS)

# Compute adjustments to A for movement probabilities and transitions
for i in range(N_CELLS):
    # Map flat index to grid coordinates
    y, x = divmod(i, GRID_WIDTH)

    # Subtract the scaled transition matrix from A
    A[i] -= move_prob_flat[i] * t_matrices[y, x].flatten()

# Create vector b (shooting payoff)
b = shot_prob_flat * goal_prob_flat

# Solve the linear system A * x = b
xt_grid_flat = np.linalg.solve(A, b)

# Reshape the result into the grid format
xt_grid = xt_grid_flat.reshape(GRID_HEIGHT, GRID_WIDTH)

Let's now visualize the xT surface map, to check the expected threat at every location on the pitch.

statistic = xt_grid

y, x = statistic.shape
x_grid = np.linspace(0, 100, x + 1)
y_grid = np.linspace(0, 100, y + 1)
cx = x_grid[:-1] + 0.5 * (x_grid[1] - x_grid[0])
cy = y_grid[:-1] + 0.5 * (y_grid[1] - y_grid[0])
stats = dict(statistic=statistic, x_grid=x_grid, y_grid=y_grid, cx=cx, cy=cy)

pitch = Pitch(pitch_type='opta', line_zorder=2, pitch_color='#22312b', line_color='#efefef')
fig, ax = pitch.draw(figsize=(8, 4.125))

pcm = pitch.heatmap(stats, ax=ax, cmap='viridis')

for i, row in enumerate(statistic):
    for j, val in enumerate(row):
        ax.text(cx[j], cy[i], f'{val:.2f}', ha='center', va='center', color='white', fontsize=6)

Pelican

Figure 4: xT Grid

The results look sensible, with the highest expected threat being in the zones closest to the opponent's goal and the xT decreasing as you move away from the goal. Let's compare this to the xT grid created using the original convergence method shown on the Soccermatics website. There are some minor differences in values, most likely caused by the different datasets used, but the general shape of our xT surface map is reassuringly similar.

Pelican

Figure 5: Soccermatics xT Grid calculated using the convergence method

Calculating Player xT

Now that we’ve calculated the xT grid for every cell on the pitch, we can take the analysis a step further by evaluating individual player contributions. By mapping player's actions onto the xT grid, we’ll quantify the offensive impact of each player, quantifying their ability to create and capitalize on scoring opportunities.

There are several ways to approach this analysis. For instance, we could focus solely on successful on-ball events, consider only events with positive xT values, or even calculate the net xT by subtracting the xT of negative events from positive ones. For simplicity, we’ll focus on successful passes with positive xT values. This allows us to quantify each player’s offensive impact while avoiding penalizing players for negative events. Additionally, it accounts for context — teams like Manchester City, known for their high possession and intricate passing style, often make backward passes to maintain control without any negative intent. Including such passes as negative contributions could unfairly skew the analysis.

For this article, we’ll also focus exclusively on passes rather than including goals. While it’s perfectly valid to incorporate goals, our interest here lies in evaluating players involved in the build-up play leading to scoring opportunities. Goals can often be better analyzed using a separate expected goals (xG) model, which is specifically designed to quantify the quality and likelihood of scoring from individual shots. By isolating passes, we can concentrate on the contributions players make during the progression and creation phases of an attack.

Let's take a look at the top players by xT for the English Premier League 2023/24 season.

# Load the data
player_df = df = pd.read_csv("path/to/player/data.csv")

# Map the coordinates to the grid
player_df["grid_x"], player_df["grid_y"] = map_coordinates_to_grid_vectorized(player_df["x"], player_df["y"])
player_df["grid_x_end"], player_df["grid_y_end"] = map_coordinates_to_grid_vectorized(player_df["end_x"], player_df["end_y"])

# Calculate the xT difference per action
player_df["xt_diff"] = player_df.apply(
    lambda row: xt_grid[int(row["grid_y_end"]), int(row["grid_x_end"])] - xt_grid[int(row["grid_y"]), int(row["grid_x"])],
    axis=1
)

# Filter for positive xT differences and aggregate by player
top_players = (
    player_df.query("xt_diff > 0")
    .groupby("player_id", as_index=False)["xt_diff"]
    .sum()
    .sort_values("xt_diff", ascending=False)
    .head(10)
)
Name xT
1 Pascal Groß 21.5
2 Rodri 21.3
3 Bruno Fernandes 18.7
4 Martin Ødegaard 18.4
5 Trent Alexander-Arnold 17.3
6 Kieran Trippier 15.2
7 Declan Rice 15.1
8 Bruno Guimarães 14.3
9 Morgan Gibbs-White 13.6
10 James Ward-Prowse 13.2

Table 4: Top 10 players by xT for the English Premier League 2023/24 season

Conclusion

By using linear algebra, we’ve introduced an efficient and elegant way to quantify player contributions to their team’s offensive efforts. This approach not only simplifies the computation of xT but also opens the door for more scalable analyses using larger datasets.

It’s also worth noting that this article has been intentionally simplified to (hopefully) make it more accessible and easy to understand. There are numerous ways to refine and expand upon this approach, which I'm planning on exploring in a future blog post.

Comments

Get In Touch!

Submit your comments below, and feel free to format them using MarkDown if you want. Comments typically take upto 24 hours to appear on the site and be answered so please be patient.

Thanks!

About

Pena.lt/y is a site dedicated to football analytics. You'll find lots of research, tutorials and examples on the blog and on GitHub.

Social Links

Get In Touch

You can contact pena.lt/y through the website here