MatchFlow: A JSON-Native Query Engine for Football Data

Over the years, I’ve spent a lot of time wrangling football data - from open data projects to internal club work. A lot of it comes as JSON these days: events, lineups, matches, tracking, metadata. And while that format is powerful and expressive, the tools we use to work with it often aren’t.

What’s the first step in most workflows?

Flatten everything. Hope the structure is consistent. Write some glue code to handle edge cases. Cross fingers that json_normalize() doesn’t choke. And repeat, every time.

I got tired of that. So I started building MatchFlow, a Python library that lets me work with football data more naturally.

Why I Built MatchFlow

I don’t want to spend hours figuring out how to flatten the data just to get started on a new project. I want to load the data, explore it, and see what’s there. That’s where MatchFlow really helps. Most of what I want to do is simple:

Show me all the shots in a match
Group events by player
Compute some summary stats
Build a dataset for modelling

The problem is, doing that with nested JSON-style data often means jumping through hoops: flattening things, renaming fields, reshaping arrays, cleaning as you go. It gets in the way of thinking clearly.

MatchFlow is my attempt to fix that, or at least make it less annoying.

It’s built around a simple idea: don’t fight the structure. Just let it flow through, and shape it as you go.

What MatchFlow Does

MatchFlow builds a directed acyclic graph (DAG) of operations - a chain of lazy, composable steps that can be optimized and executed efficiently. That architecture keeps MatchFlow fast and leaves room for upcoming features like pushdown filtering, transform fusion, and automatic caching.
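To make the lazy idea concrete, here is a minimal sketch in plain Python. It is a toy, not MatchFlow's actual implementation: the LazyFlow class, its internals, and the load_events source are all invented for illustration, and the plan is a simple linear chain rather than a full DAG. The point is only that each method records a step and nothing is read until collect() is called.

```python
from typing import Any, Callable, Iterable, Iterator

Record = dict[str, Any]


class LazyFlow:
    """Toy stand-in for a lazy pipeline (hypothetical, not MatchFlow's code)."""

    def __init__(self, source: Callable[[], Iterable[Record]], steps: tuple = ()):
        self._source = source  # only called when the plan is executed
        self._steps = steps    # recorded plan: (kind, function) pairs

    def filter(self, predicate: Callable[[Record], bool]) -> "LazyFlow":
        # Return a new plan with one more step; no records are read here.
        return LazyFlow(self._source, self._steps + (("filter", predicate),))

    def assign(self, **fields: Callable[[Record], Any]) -> "LazyFlow":
        def add_fields(record: Record) -> Record:
            return {**record, **{name: fn(record) for name, fn in fields.items()}}

        return LazyFlow(self._source, self._steps + (("assign", add_fields),))

    def __iter__(self) -> Iterator[Record]:
        # Execution: wrap the source in one lazy iterator per recorded step.
        stream: Iterable[Record] = self._source()
        for kind, fn in self._steps:
            stream = filter(fn, stream) if kind == "filter" else map(fn, stream)
        return iter(stream)

    def collect(self) -> list[Record]:
        return list(self)


reads = []  # tracks when the source is actually consumed


def load_events() -> Iterator[Record]:
    reads.append("opened")
    yield {"type": "Shot", "x": 108.0}
    yield {"type": "Pass", "x": 60.0}
    yield {"type": "Shot", "x": 95.0}


plan = (
    LazyFlow(load_events)
    .filter(lambda r: r["type"] == "Shot")
    .assign(in_final_third=lambda r: r["x"] >= 80.0)
)
reads_before_collect = list(reads)  # still empty: building the plan read nothing
shots = plan.collect()              # the source is consumed here
```

Because the plan is just data until it runs, an engine is free to inspect and reorder it first, which is what makes ideas like pushdown filtering possible.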

You can load data from JSON files, folders, or even directly from APIs like StatsBomb, and then apply transformations step by step:

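from penaltyblog.matchflow import Flow, where_equals

# `model` is assumed to be an already-fitted model whose .predict() accepts
# a single event record; it is not defined in this snippet.
flow = (
    Flow.from_folder("data/events/")
    .filter(where_equals("type", "Shot"))
    .assign(xT=lambda r: model.predict(r))
    .select("player.name", "xT", "location")
    .to_json("shots.json")
)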

Nothing is loaded into memory until you ask for it. You can stream, group, filter, join, and export. And most importantly, you can do all of that without flattening everything first.
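As a rough illustration of what streaming from a folder can look like, here is a small generator-based sketch using only the standard library. The iter_records helper and the file names are assumptions made up for this example, not part of MatchFlow. Each file is opened only when the stream reaches it, so memory use is bounded by the largest single file rather than the whole folder.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path
from typing import Any, Iterator


def iter_records(folder: str) -> Iterator[dict[str, Any]]:
    """Yield records one at a time, opening each file only when it is reached."""
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)  # one file in memory at a time
        # A file may hold a single record or a list of records.
        yield from (data if isinstance(data, list) else [data])


# Demo with two small, made-up event files in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "match_1.json").write_text(
        json.dumps([{"type": "Shot"}, {"type": "Pass"}]), encoding="utf-8"
    )
    Path(folder, "match_2.json").write_text(
        json.dumps([{"type": "Shot"}]), encoding="utf-8"
    )

    records = iter_records(folder)        # nothing has been read yet
    first_two = list(islice(records, 2))  # only match_1.json is opened
```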

It Works with the Data We Actually Get

Football data is full of nested structures:

Events have players, tags, locations
Matches contain teams, lineups, substitutions
Freeze-frames are arrays of objects inside a single field

Flattening all that up front means you lose some of the meaning, or have to write extra code to glue it back together. MatchFlow lets you work with it in place, using dot notation (player.name, player.location.0, etc.) and stream-like transformations.
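For a feel of how dot-notation access over nested records can work, here is a small sketch. The get_path helper and the sample event are invented for this example and are not MatchFlow's own resolver; the idea is simply to walk the record one path segment at a time, treating numeric segments as list indexes.

```python
from typing import Any

_MISSING = object()  # sentinel so that a stored None is not mistaken for "absent"


def get_path(record: Any, path: str, default: Any = None) -> Any:
    """Resolve a dotted path such as 'player.location.0' against nested data."""
    current = record
    for part in path.split("."):
        if isinstance(current, dict):
            current = current.get(part, _MISSING)
        elif isinstance(current, (list, tuple)) and part.isdigit():
            index = int(part)
            current = current[index] if index < len(current) else _MISSING
        else:
            current = _MISSING
        if current is _MISSING:
            return default
    return current


# A made-up event record, shaped loosely like the examples in this post.
event = {
    "type": {"name": "Shot"},
    "player": {"name": "Lionel Messi", "location": [102.3, 34.1]},
}

name = get_path(event, "player.name")
x_coord = get_path(event, "player.location.0")
missing = get_path(event, "player.location.5", default="n/a")
```

The record itself is never reshaped: the path is resolved at read time, so missing fields fall back to a default instead of breaking a flattening step.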

MatchFlow isn’t trying to replace dataframes - it’s built for a different job: helping you explore, transform, and build pipelines over nested, semi-structured data without flattening everything first.

Whether that means pulling down a bunch of JSON files and exploring them interactively, or turning them into a repeatable data engineering pipeline, MatchFlow handles both.

A Quick Example (with StatsBomb)

Here’s how you might pull shots from a StatsBomb match using the built-in API client:

from penaltyblog.matchflow import Flow, where_equals

shots = (
    Flow.statsbomb.events(match_id=22912)
    .filter(where_equals("type.name", "Shot"))
    .select("player.name", "location", "shot.statsbomb_xg")
    .collect()
)

No flattening. No munging. Just a stream of records that you can work with directly.

What's Included in v1

Read from JSON files, folders, generators, or the StatsBomb API
Chainable methods like .filter(), .select(), .assign(), .group_by(), .join()
Export with .to_json(), .to_jsonl(), and .to_pandas()
Nested field access using dot notation
Lazy evaluation - everything streams until you call .collect() or .to_pandas()

What’s Coming Next

🚀 Performance

A custom file format with partitioning and fast reading
Predicate pushdown using file-level indexes and Bloom filters
Compilation of query steps into native Python functions for speed
Cython/JIT-optimized inner loops for group-by and sorting
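Since Bloom filters may be unfamiliar, here is a rough sketch of how a per-file filter could let a query skip files that cannot contain a value. This is not MatchFlow's planned design, which isn't described here: the BloomFilter class, its sizes, the file names, and the player lists are all assumptions for illustration. A Bloom filter never gives false negatives, so skipping a file is always safe, but it can occasionally say "maybe" for a value that is not there.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Hypothetical index: one filter per file, built from the player names it holds.
players_by_file = {
    "match_1.json": ["Lionel Messi", "Luis Suárez"],
    "match_2.json": ["Harry Kane"],
}
index = {}
for filename, players in players_by_file.items():
    bloom = BloomFilter()
    for player in players:
        bloom.add(player)
    index[filename] = bloom

# Only files whose filter says "possibly present" need to be opened.
files_to_scan = [f for f, bloom in index.items() if bloom.might_contain("Lionel Messi")]
```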

🔧 Features

Support for Wyscout, Opta, and remote data (S3, GCS)
Built-in operations like extract_shots() and extract_passes()
DSL for natural query expressions like flow.query("player.name == 'Messi'")

🛠 Ergonomics

Plotting helpers for quick exploration
Command-line tools for scripting pipelines
Caching for faster re-runs

Try It Out

If you want to give it a spin:

pip install penaltyblog

You can also check out the docs - there are guides, an API reference, and examples of some common tasks.

Final Thoughts

I don’t think MatchFlow is for everyone. If you're happy working in SQL or pandas and it's doing what you need then great, stick with it. But if you’ve ever found yourself spending too much time writing flatteners, normalizers, or trying to keep your data in shape just to ask a basic question, MatchFlow might help.

It's early days, and I'm sure there are still edge cases I haven't uncovered. But this approach has saved me a lot of time over the years, and hopefully it can do the same for you.

I'd love to hear how you use it, or where it breaks. That’s how it gets better 😁
