Introduction

I'm excited to announce a new feature for matchflow: direct integration with the Stats Perform (Opta) Soccer API!

For a while now, matchflow (part of my penaltyblog package) has been a powerful tool for building data pipelines from local files. But what if you could start those pipelines directly from Stats Perform's Opta API?

That's exactly what this new addition enables. If you have access to the Stats Perform Opta API, you can now build lazy, powerful matchflow pipelines straight from their feeds. This update is all about letting you focus on your analysis, not on the data engineering "grunt work" that comes with using a complex API.

The "Why?": Handling the Hard Parts for You

Let's be honest: while the Opta API is an incredibly rich data source, it's not always the easiest to work with. You have to manage:

Authentication tokens
Building the correct URL for each of the 20+ feeds
Handling paginated endpoints (and knowing which ones are paginated)
Parsing the deeply nested, complex JSON responses
Flattening statistics and un-nesting event data

The new opta connector, available in penaltyblog.matchflow, handles that for you.

Lazy Execution: It builds a matchflow plan. No API calls are made until you run .collect(), .to_pandas(), or write to a file.
Automatic Pagination: Feeds like matches or venues are automatically paginated. You get a single, clean stream of all records, whether it takes 1 request or 100.
Smart Parsing: It automatically unnests and flattens data. For example, event data (MA3) is yielded one event at a time, and match stats (MA2) are yielded one player-stat or team-stat record at a time. It turns complex JSON into analysis-ready rows.
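To make the lazy-execution and pagination points concrete, here is a minimal plain-Python sketch of the general pattern. It is not matchflow's actual implementation: `fetch_page`, `iter_records`, and the fake three-page "API" are hypothetical names invented purely for illustration.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical stand-in for a paginated API: three pages of two records each.
FAKE_PAGES: List[List[Dict]] = [
    [{"id": 1}, {"id": 2}],
    [{"id": 3}, {"id": 4}],
    [{"id": 5}, {"id": 6}],
]
calls_made: List[int] = []  # records which pages were actually requested


def fetch_page(page: int) -> List[Dict]:
    """Pretend to call the API; returns an empty list past the last page."""
    calls_made.append(page)
    return FAKE_PAGES[page] if page < len(FAKE_PAGES) else []


def iter_records(fetch: Callable[[int], List[Dict]]) -> Iterator[Dict]:
    """Yield records one at a time, requesting the next page only when needed."""
    page = 0
    while True:
        records = fetch(page)
        if not records:
            return
        yield from records
        page += 1


stream = iter_records(fetch_page)  # building the stream makes no requests
assert calls_made == []

all_records = list(stream)  # consuming the stream walks every page
print(len(all_records), "records from", len(calls_made), "requests")
```

The caller only ever sees one flat stream of records; how many requests it took is an implementation detail hidden behind the generator.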

Getting Started: A Quick Example

First, let's set our Stats Perform credentials as environment variables:

export OPTA_AUTH_KEY="your_auth_key_here"
export OPTA_RT_MODE="your_rt_mode_here" # (e.g., "b")

Now, we can import Flow and use its opta connector to build a flow. Let's get a list of all active tournament calendars (Feed OT2).

from penaltyblog.matchflow import Flow

flow = (
    Flow.opta.tournament_calendars(status="active")
    .select("id", "competitionCode", "name")
)

flow.show()

competitionCode	id	name
ACO	ax1yf4nlzqpcji4j8epdgx3zl	Africa Cup of Nations Qualification
17Q	64fygrchlfuz3q4lc7k2ffj84	Africa U17 Cup of Nations Qualification
ACO	7dauoeun2gnkofl7f4y510s4f	Africa U20 Cup of Nations
20Q	4fht4nyqpp5dzzv6ucm057dp0	Africa U20 Cup of Nations Qualification
23Q	27zeqzs85uxv3eej1mhot6gic	Africa U23 Cup of Nations Qualification

The Real Power: Building a Pipeline

The true magic happens when you combine this source with matchflow's other methods. Let's build a common pipeline: getting all shots from a single match.

We'll use Feed MA3 (Match Events) and filter it down.

from penaltyblog.matchflow import Flow, where_opta_event

# The match we want to analyze
FIXTURE_UUID = "zhs8gg1hvcuqvhkk2itb54pg" 

# 1. Start the flow from the Opta MA3 (events) feed
# This is lazy! No data is downloaded yet.
event_flow = Flow.opta.events(fixture_uuid=FIXTURE_UUID)

# 2. Chain matchflow methods to build our pipeline
# We'll use the new 'where_opta_event' helper
shot_flow = event_flow.filter(
    where_opta_event(["Goal", "Attempt Saved", "Miss", "Post"])
)

# 3. Now, execute the full pipeline to 
# download and filter the data
shot_df = shot_flow.to_pandas()

# You now have a clean DataFrame of *only* the shot events
shot_df[['playerName', 'timeMin', 'timeSec', 'x', 'y', 'typeId']].head()

	playerName	timeMin	timeSec	x	y	typeId
0	H. Ekitiké	2	4	76.8	68.7	15
1	Mohamed Salah	3	39	85.6	31.3	15
2	V. van Dijk	4	27	91.4	50.5	13
3	A. Semenyo	5	55	91.1	45	13
4	Evanilson	9	44	94.9	41.1	13

This is what it's all about. You defined your source (opta.events) and your transformation (.filter), and matchflow handled the API call, JSON parsing, and filtering, delivering a clean DataFrame of shots.

Say Goodbye to "Magic Numbers"

You might have noticed in the example above that I filtered for event types like "Goal" and "Miss" directly. If you’ve worked with Opta data before, you know that you usually have to memorize specific IDs to do this. You stare at your code wondering, "Is Event Type 15 a Goal or an Attempt Saved? Is Qualifier 1 a Long Ball or a Cross?"

If you are not careful, you end up with code full of "magic numbers" that is hard to read and hard to debug:

# The Old Way: Hard to read, hard to debug
shot_flow = event_flow.filter(
    lambda x: x['typeId'] in [13, 14, 15, 16]  # Wait, which ID is which?
)

I wanted to fix this developer experience. That's why I’ve included a comprehensive mapping of standard Opta definitions directly into matchflow.

In addition to where_opta_event, I've also added where_opta_qualifier. These helpers let you filter using human-readable names, handling the ID lookups behind the scenes.

Here is how you might filter for specific pass types without needing to look up a documentation PDF:

from penaltyblog.matchflow import where_opta_event, where_opta_qualifier

# The New Way: Readable and explicit
# Get all Passes that are specifically 'Long balls'
long_ball_flow = event_flow.filter(
    where_opta_event("Pass"),
    where_opta_qualifier("Long ball")
)

These helpers support list inputs for multiple events and value checks for qualifiers, making your analysis code self-documenting and much easier to share with colleagues.
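For intuition, the idea behind a name-based predicate can be sketched in a few lines of plain Python. This is only an illustration and not how where_opta_event or where_opta_qualifier are implemented: `event_named`, `has_qualifier`, the partial lookup tables, and the assumed record shape (a `qualifier` list of `qualifierId` dicts) are all hypothetical, so check the IDs against your own feed documentation.

```python
from typing import Callable, Dict, Iterable, Union

# Partial, illustrative lookup tables (assumed IDs; verify against Opta docs).
EVENT_TYPE_IDS: Dict[str, int] = {
    "Pass": 1,
    "Miss": 13,
    "Post": 14,
    "Attempt Saved": 15,
    "Goal": 16,
}
QUALIFIER_IDS: Dict[str, int] = {"Long ball": 1, "Cross": 2}


def event_named(names: Union[str, Iterable[str]]) -> Callable[[Dict], bool]:
    """Build a predicate matching records whose typeId maps to one of `names`."""
    if isinstance(names, str):
        names = [names]
    wanted = {EVENT_TYPE_IDS[name] for name in names}
    return lambda record: record.get("typeId") in wanted


def has_qualifier(name: str) -> Callable[[Dict], bool]:
    """Build a predicate matching records that carry the named qualifier."""
    wanted = QUALIFIER_IDS[name]
    # Assumes each record has a "qualifier" list of {"qualifierId": ...} dicts.
    return lambda record: any(
        q.get("qualifierId") == wanted for q in record.get("qualifier", [])
    )


# Made-up records for demonstration only.
events = [
    {"typeId": 1, "qualifier": [{"qualifierId": 1}]},  # pass, long ball
    {"typeId": 1, "qualifier": [{"qualifierId": 2}]},  # pass, cross
    {"typeId": 16, "qualifier": []},                   # goal
]

is_pass = event_named("Pass")
is_long_ball = has_qualifier("Long ball")
is_shot = event_named(["Goal", "Miss"])

long_balls = [e for e in events if is_pass(e) and is_long_ball(e)]
shots = [e for e in events if is_shot(e)]
```

Keeping the name-to-ID tables in one place means a wrong ID gets fixed once, rather than in every lambda scattered through your analysis code.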

Why matchflow? Cloud Storage & More

While the API integration is the headline feature of this update, it's worth remembering why matchflow exists in the first place. It isn't just a script to download data; it's a complete processing engine for football data.

Because matchflow is built on top of the excellent fsspec library, you aren't limited to saving your data locally. You can stream the API response directly into S3, Google Cloud Storage, or Azure Blob Storage.

This is huge for building data lakes or automated pipelines. You don't need to download the data into memory and then upload it in a separate step, because matchflow handles the streaming for you.

# Stream Opta events directly to an S3 bucket
(
    Flow.opta.events(fixture_uuid=FIXTURE_UUID)
    .to_jsonl("s3://my-football-data/raw/events.jsonl")
)

And because matchflow is lazy, it optimizes your plan before execution. If you chain a .limit(10) to an API call, matchflow knows to stop processing as soon as it has 10 records, saving you time and compute resources.
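The early-stopping behaviour is easy to see with ordinary Python generators. The sketch below is a generic illustration rather than matchflow's planner: `event_source` is a hypothetical stand-in for an API-backed stream, and `itertools.islice` plays the role of a limit step.

```python
from itertools import islice
from typing import Dict, Iterator

pulled = 0  # how many records the source was actually asked to produce


def event_source(total: int) -> Iterator[Dict]:
    """Hypothetical stand-in for a large, API-backed stream of records."""
    global pulled
    for i in range(total):
        pulled += 1
        yield {"eventId": i}


# Take the first 10 records; the source is never asked for an eleventh.
first_ten = list(islice(event_source(1_000_000), 10))
print(len(first_ten), "records kept,", pulled, "records pulled")
```

Because each step pulls from the one before it on demand, a limit at the end of the chain bounds the work done at the start of it.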

The "Killer Feature": Joining Streams

If you have worked with football data before, you know it is often fragmented. You get match events in one feed, but the player details live in a completely different one.

Traditionally, linking these together forces you to make a choice:

The RAM Hog: Load everything into Pandas DataFrames and merge them (risking memory crashes with large historical dumps).
The Hard Way: Write complex Python loops to manually look up player IDs as you iterate through events.

matchflow solves this with lazy joins. You can join two data streams just like you would in SQL or Pandas, but the result remains a lazy iterator, so joined rows are only produced as you consume them.
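To show what joining streams lazily can look like in principle, here is a small plain-Python sketch of a streaming left join. It does not use or reproduce matchflow's own join method: `lazy_left_join`, the field names, and the sample records are all made up for illustration.

```python
from typing import Dict, Iterable, Iterator


def lazy_left_join(
    left: Iterable[Dict], right: Iterable[Dict], on: str
) -> Iterator[Dict]:
    """Stream `left`, enriching each row with the matching row from `right`.

    Only the (usually smaller) right side is held in memory as a lookup table.
    Nothing runs until the first joined row is requested.
    """
    lookup = {row[on]: row for row in right}
    for row in left:
        # Left-side values win if both sides share a field name.
        yield {**lookup.get(row[on], {}), **row}


# Made-up sample data: an events stream and a player-details stream.
events = iter([
    {"playerId": "p1", "typeId": 15},
    {"playerId": "p2", "typeId": 13},
    {"playerId": "p9", "typeId": 16},  # no matching player record
])
players = [
    {"playerId": "p1", "position": "Forward"},
    {"playerId": "p2", "position": "Defender"},
]

joined = lazy_left_join(events, players, on="playerId")  # still lazy here
joined_rows = list(joined)                               # work happens now
```

The events side is never materialised as a whole, which is what keeps memory use flat even when the left stream is a large historical dump.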

Pro Tip: Working with IP Whitelists

Stats Perform feeds are often IP-restricted, which can be difficult if you are working from a laptop or a dynamic cloud environment.

To solve this, matchflow supports standard Python proxy dictionaries. If you have a static IP proxy set up to match your whitelist, you can pass it directly to any opta method:

# Define your secure proxy
proxies = {
    "http": "http://user:pass@10.10.1.10:3128",
    "https": "http://user:pass@10.10.1.10:1080",
}

# Pass it into your flow
event_flow = Flow.opta.events(
    fixture_uuid=FIXTURE_UUID, 
    proxies=proxies
)

What Feeds Are Supported?

I've added support for a wide range of the most popular feeds, with more to come. You can now build flows from:

Tournament Feeds (OT2, MA0): tournament_calendars(), tournament_schedule()
Venue & Area Feeds (OT3, OT4): venues(), areas()
Match Feeds (MA1, MA2, MA3): matches(), match(), match_stats_player(), match_stats_team(), events()
Advanced Match Feeds (MA4, MA5): pass_matrix(), possession()
Player/Team Feeds (PE2, TM1, TM2, TM3, TM4): player_career(), teams(), team_standings(), squads(), player_season_stats(), team_season_stats()
And more! Including referees(), rankings(), injuries(), and transfers().

Conclusion

This latest update transforms matchflow from a file processing tool into an end-to-end solution for working with Opta feeds. By handling authentication, pagination, and complex JSON parsing for you, it removes the friction between you and the insights you're looking for.

Whether you are building a simple shot map or a complex, cloud-based data lake, the new Opta connector is designed to make your life easier.

This feature is available now in the latest version of penaltyblog. I’m excited to see what you build with it. If you have any feedback or run into issues, please let me know!