Scraping Football Data Using the penaltyblog Python Package

Introduction

If you're interested in football analytics, or in trying to predict football results and odds for betting markets, then you are going to need data. This article walks through how to use the penaltyblog Python package to scrape football (soccer) data from different websites and how to join data from different sources together.

Installing `penaltyblog`

The first thing you need to do is install the penaltyblog package from PyPI using pip. Provided you've already got Python 3 and pip installed, you just need to run the command below in a terminal. If you've not got Python installed yet, then head over to Anaconda and install that first.

pip install penaltyblog

The Scrapers

penaltyblog contains scrapers for a number of different websites, including Understat, ESPN and football-data.co.uk. For consistency, each scraper takes the same arguments when it is created: a standardized competition name and a season. For example:

import penaltyblog as pb

understat = pb.scrapers.Understat("ENG Premier League", "2021-2022")
fb = pb.scrapers.FootballData("ENG Premier League", "2021-2022")
espn = pb.scrapers.ESPN("ENG Premier League", "2021-2022")

It doesn't matter which data source you are scraping: competition names always comprise the country's three-letter code followed by the competition's name, and the season is formatted as start_year-end_year. The scraper then maps this to whatever the data source calls the competition, so you don't need to remember whether you are scraping La Liga, LaLiga, La Liga Primera Division or whatever else a particular website calls it.
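
As a small illustration of that naming scheme, the sketch below builds the two strings by hand. The helper names season_label and competition_label are hypothetical and not part of penaltyblog; they only mirror the format described above, and they assume a season that spans two consecutive calendar years, as in the examples in this article.

```python
def season_label(start_year: int) -> str:
    """Format a season as start_year-end_year, e.g. 2021 -> '2021-2022'.

    Assumption: the season ends in the calendar year after it starts.
    """
    return f"{start_year}-{start_year + 1}"


def competition_label(country_code: str, name: str) -> str:
    """Join a three-letter country code and a competition name."""
    if len(country_code) != 3:
        raise ValueError("expected a three-letter country code")
    return f"{country_code.upper()} {name}"


# Build the arguments used throughout this article
print(competition_label("eng", "Premier League"), season_label(2021))
```

It is still worth checking the result against list_competitions(), since a scraper only accepts the competitions it knows about.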

You can get a list of available competitions for each scraper by calling its list_competitions() function:

import penaltyblog as pb

pb.scrapers.Understat.list_competitions()

['DEU Bundesliga 1',
 'ENG Premier League',
 'ESP La Liga',
 'FRA Ligue 1',
 'ITA Serie A',
 'RUS Premier League']

Fixtures

Let's go ahead and scrape some fixtures from Understat for the English Premier League:

import penaltyblog as pb

understat = pb.scrapers.Understat("ENG Premier League", "2021-2022")
under_fixtures = understat.get_fixtures()
under_fixtures.head()

	understat_id	datetime	team_home	team_away	goals_home	goals_away	xg_home	xg_away	forecast_w	forecast_d	forecast_l	season	competition	date
id
1628812800---brentford---arsenal	16376	2021-08-13 19:00:00	Brentford	Arsenal	2	0	1.888180	1.023850	0.6289	0.2287	0.1424	2021-2022	ENG Premier League	2021-08-13
1628899200---burnley---brighton	16378	2021-08-14 14:00:00	Burnley	Brighton	1	2	1.795480	1.685300	0.3894	0.2877	0.3229	2021-2022	ENG Premier League	2021-08-14
1628899200---chelsea---crystal_palace	16379	2021-08-14 14:00:00	Chelsea	Crystal Palace	3	0	1.187090	0.321701	0.6405	0.2822	0.0773	2021-2022	ENG Premier League	2021-08-14
1628899200---everton---southampton	16380	2021-08-14 14:00:00	Everton	Southampton	3	1	2.388630	0.580601	0.8359	0.1234	0.0407	2021-2022	ENG Premier League	2021-08-14
1628899200---leicester---wolverhampton_wanderers	16381	2021-08-14 14:00:00	Leicester	Wolverhampton Wanderers	1	0	0.668082	1.327140	0.1683	0.2750	0.5567	2021-2022	ENG Premier League	2021-08-14

If you look at the table above (you may need to scroll the table horizontally if you're on a small screen) then you'll notice that the column names have a consistent style to them, e.g. all in lowercase and formatted as snake case, to make them easier to work with.

Where possible, the column names are also consistent between data sources. For example, the column for the home team is always called team_home whatever the site you're scraping. This may seem trivial, but it's a huge time saver not having to remember what each data source calls the same thing.

Columns are always named as team_home, goals_home, xg_home etc so that if you print out the column names then related columns, such as team_home and team_away, appear next to each other. No more searching through a giant list of columns to try and find the one you want.

The data also comes with an id column as the dataframe's index so every row has a unique key associated with it comprising the timestamp plus the team names.
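
Judging by the ids in the table above, each one looks like a Unix timestamp for the match date, followed by the two team names in lowercase with spaces replaced by underscores. The sketch below reproduces that pattern purely for illustration. make_fixture_id is a hypothetical helper, the midnight-UTC timestamp is an assumption inferred from the values shown, and the package's actual implementation may differ.

```python
from datetime import datetime, timezone


def make_fixture_id(date_str: str, team_home: str, team_away: str) -> str:
    """Build an id shaped like '1628812800---brentford---arsenal'.

    Assumption: the timestamp is midnight UTC on the match date.
    """
    day = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    timestamp = int(day.timestamp())

    def slug(name: str) -> str:
        # Lowercase the team name and replace spaces with underscores
        return name.strip().lower().replace(" ", "_")

    return f"{timestamp}---{slug(team_home)}---{slug(team_away)}"


print(make_fixture_id("2021-08-13", "Brentford", "Arsenal"))
print(make_fixture_id("2021-08-14", "Chelsea", "Crystal Palace"))
```

Because the team names are part of the id, two data sources only produce matching ids for a fixture when they spell the team names the same way.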

Merging Data Sources

Combining scraped data from multiple data sources can be tricky but the penaltyblog scrapers can help with this too. As a somewhat contrived example, let's try and combine Understat's xG scores with football-data.co.uk's betting odds for Bet365.

The first thing we need to do is scrape the data from both sites.

import penaltyblog as pb

under = pb.scrapers.Understat("ENG Premier League", "2021-2022")
ufix = under.get_fixtures()
ufix = ufix[["team_home", "team_away", "xg_home", "xg_away"]]

fb = pb.scrapers.FootballData("ENG Premier League", "2021-2022")
fbfix = fb.get_fixtures()
fbfix = fbfix[["team_home", "team_away", "b365_h", "b365_d", "b365_a"]]

In theory, we should be able to merge the two datasets by joining on the team names. Let's give it a go and see what happens...

merged = ufix.merge(fbfix, on=["team_home", "team_away"], how="inner")
print(merged.shape)

Oh dear, we only have 240 fixtures instead of the 380 we would expect for a full season of Premier League fixtures 😕.

Unfortunately, the two data sources use different team names. For example, Understat uses Manchester City whereas football-data.co.uk uses Man City, so any fixture involving a team whose name differs between the two sources is dropped from the inner join.
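
A quick way to see which names are causing trouble is to compare the sets of team names from each source. The sketch below uses small hand-written sets as stand-ins for the scraped data, so the values are illustrative only (apart from the Manchester City example mentioned above).

```python
# Toy stand-ins for the team names found in each source (illustrative values)
understat_teams = {"Arsenal", "Brentford", "Manchester City"}
football_data_teams = {"Arsenal", "Brentford", "Man City"}

# Names that appear in one source but have no exact match in the other
only_in_understat = sorted(understat_teams - football_data_teams)
only_in_football_data = sorted(football_data_teams - understat_teams)

print("Only in Understat:", only_in_understat)
print("Only in football-data:", only_in_football_data)
```

With the real DataFrames, the sets could be built from the team columns, for example set(ufix["team_home"]) and set(fbfix["team_home"]).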

To get around this problem, all of the penaltyblog scrapers come with the ability to remap team names. penaltyblog doesn't have mappings for all the world's football teams (yet...) but there are enough for this example, and you can easily extend the mappings with your own team names.

The mappings themselves are just a standard Python dictionary, with the key as the team name you want to end up with and the value as a list of possible names to remap. The example below maps both Man Utd and Man United to Manchester United:

{
  "Manchester United": ["Man Utd", "Man United"],
}
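
To make the idea concrete, here is a minimal sketch of how a dictionary in this shape could be extended and applied. The helper names build_lookup and remap are hypothetical, and this is not necessarily how the scrapers apply the mappings internally; it only shows the principle of turning each alternative name into its preferred form.

```python
mappings = {
    "Manchester United": ["Man Utd", "Man United"],
}

# Extending the mappings is just a normal dictionary assignment
mappings["Manchester City"] = ["Man City"]


def build_lookup(mappings: dict) -> dict:
    """Invert {target: [aliases]} into {alias: target} for fast lookups."""
    lookup = {}
    for target, aliases in mappings.items():
        for alias in aliases:
            lookup[alias] = target
    return lookup


def remap(name: str, lookup: dict) -> str:
    """Return the preferred name, or the original if no mapping exists."""
    return lookup.get(name, name)


lookup = build_lookup(mappings)
print(remap("Man Utd", lookup))
print(remap("Man City", lookup))
print(remap("Arsenal", lookup))
```

Names without a mapping pass through unchanged, so only the teams that actually differ between sources need an entry.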

Let's try our join again but with the example team name mappings included. Notice how we can now simply join using the id column since it's unique per fixture and will be identical across both datasets now we've mapped the team names.

import penaltyblog as pb

mappings = pb.scrapers.get_example_team_name_mappings()

under = pb.scrapers.Understat("ENG Premier League", "2021-2022", mappings)
ufix = under.get_fixtures()

fb = pb.scrapers.FootballData("ENG Premier League", "2021-2022", mappings)
fbfix = fb.get_fixtures()

merged = ufix.merge(fbfix, left_index=True, right_index=True)
print(merged.shape)

Success! With just 12 lines of code (including the blank lines), we've scraped both Understat and football-data.co.uk and then merged the two datasets together using the unique id the scrapers automatically created for us.

What Else Can the Scrapers Do?

Depending on the data source, the scrapers have additional functions for collecting extra data. For example, the Understat scraper can get you shot data, including the XY coordinates, and the ESPN scraper can get you player and team level data too.

import penaltyblog as pb

under = pb.scrapers.Understat("ENG Premier League", "2021-2022")
shots = under.get_shots("16376")

print(shots.head())

	competition	season	datetime	minute	result	x	y	x_g	player	h_a	player_id	situation	shot_type	match_id	team_home	team_away	goals_home	goals_away	date	player_assisted	last_action
id
1628812800---brentford---arsenal	ENG Premier League	2021-2022	2021-08-13 19:00:00	10	MissedShots	0.913	0.539	0.053	Frank Onyeka	h	9681	OpenPlay	Head	16376	Brentford	Arsenal	2	0	2021-08-13	None	Aerial
1628812800---brentford---arsenal	ENG Premier League	2021-2022	2021-08-13 19:00:00	11	ShotOnPost	0.908	0.315	0.118	Bryan Mbeumo	h	6552	OpenPlay	RightFoot	16376	Brentford	Arsenal	2	0	2021-08-13	Ivan Toney	Throughball
1628812800---brentford---arsenal	ENG Premier League	2021-2022	2021-08-13 19:00:00	21	Goal	0.874	0.698	0.052	Sergi Canos	h	1078	OpenPlay	RightFoot	16376	Brentford	Arsenal	2	0	2021-08-13	Ethan Pinnock	BallRecovery
1628812800---brentford---arsenal	ENG Premier League	2021-2022	2021-08-13 19:00:00	27	MissedShots	0.812	0.478	0.066	Sergi Canos	h	1078	OpenPlay	RightFoot	16376	Brentford	Arsenal	2	0	2021-08-13	Frank Onyeka	Pass
1628812800---brentford---arsenal	ENG Premier League	2021-2022	2021-08-13 19:00:00	29	MissedShots	0.892	0.357	0.081	Bryan Mbeumo	h	6552	OpenPlay	RightFoot	16376	Brentford	Arsenal	2	0	2021-08-13	Kristoffer Ajer	Chipped
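
As one example of what shot-level data allows, the sketch below totals xG per player using the five rows printed above, copied into plain Python dictionaries so that it runs without scraping anything. With the real DataFrame, a pandas groupby on the player column would likely be the more natural tool.

```python
from collections import defaultdict

# The five shots shown above, reduced to the fields needed here
shots_sample = [
    {"player": "Frank Onyeka", "result": "MissedShots", "x_g": 0.053},
    {"player": "Bryan Mbeumo", "result": "ShotOnPost", "x_g": 0.118},
    {"player": "Sergi Canos", "result": "Goal", "x_g": 0.052},
    {"player": "Sergi Canos", "result": "MissedShots", "x_g": 0.066},
    {"player": "Bryan Mbeumo", "result": "MissedShots", "x_g": 0.081},
]

# Sum the expected goals of every shot taken by each player
xg_by_player = defaultdict(float)
for shot in shots_sample:
    xg_by_player[shot["player"]] += shot["x_g"]

# Print players from highest to lowest total xG
for player, total in sorted(xg_by_player.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{player}: {total:.3f}")
```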

Take a look at the penaltyblog documentation for more details and examples of each scraper.

What's Next?

There are plenty of other websites to add to the scrapers, with FBRef and WhoScored next up on the TODO list, but let me know if there are any others that should be included too.

After that, it's back to the modelling to add more techniques for predicting football results and betting markets.

Thanks for reading!