
Scraping Football Data Using the penaltyblog Python Package

Introduction

If you're interested in football analytics, or in trying to predict football results and odds for betting markets, then you are going to need data. This article walks through how to use the penaltyblog Python package to scrape football (soccer) data from different websites and how to join data from different sources together.

Installing penaltyblog

The first thing you need to do is install the penaltyblog package from PyPI using pip. Provided you've got Python 3 and pip already installed, you just need to run the command below in a terminal. If you've not got Python installed yet, then head over to Anaconda and install that first.

pip install penaltyblog
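
If you want to confirm the install worked before going any further, a quick check along these lines should do it. This is only a sketch using the standard library's importlib.metadata (Python 3.8+); the installed_version helper is a made-up name for this example and not part of penaltyblog.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if pip installed the package, otherwise None
print(installed_version("penaltyblog"))
```

If this prints None, pip has probably installed into a different Python environment from the one you're running.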

The Scrapers

penaltyblog contains scrapers for a number of different websites, including Understat, ESPN and football-data.co.uk. For consistency, each of the scrapers takes the same arguments when it is created: a standardized competition name and a season. For example:

import penaltyblog as pb

understat = pb.scrapers.Understat("ENG Premier League", "2021-2022")
fb = pb.scrapers.FootballData("ENG Premier League", "2021-2022")
espn = pb.scrapers.ESPN("ENG Premier League", "2021-2022")

It doesn't matter which data source you are scraping: competition names always comprise the country's three-letter code followed by the competition's name, and the season is start_year-end_year. The scraper then maps this to whatever the data source is calling the competition, so you don't need to remember whether you are scraping La Liga, LaLiga, La Liga Primera Division or whatever else a particular website calls it.
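
To make the naming convention concrete, here's a minimal sketch that builds and checks these strings in plain Python. The helper names (season_label, is_valid_season, split_competition) are invented for this example and are not penaltyblog functions, and the check assumes a season always spans two consecutive calendar years.

```python
import re


def season_label(start_year: int) -> str:
    """Build a start_year-end_year label, e.g. 2021 -> '2021-2022'."""
    return f"{start_year}-{start_year + 1}"


def is_valid_season(label: str) -> bool:
    """Check the label is two four-digit years, one year apart (assumption)."""
    match = re.fullmatch(r"(\d{4})-(\d{4})", label)
    return bool(match) and int(match.group(2)) == int(match.group(1)) + 1


def split_competition(name: str):
    """Split 'ENG Premier League' into ('ENG', 'Premier League')."""
    country, competition = name.split(" ", 1)
    return country, competition


print(season_label(2021))
print(is_valid_season("2021-22"))
print(split_competition("ENG Premier League"))
```

A check like this is handy if you're looping over several seasons and want to catch a typo before it reaches the scraper.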

You can get a list of available competitions for each scraper by calling its list_competitions() function:

import penaltyblog as pb

pb.scrapers.Understat.list_competitions()
['DEU Bundesliga 1',
 'ENG Premier League',
 'ESP La Liga',
 'FRA Ligue 1',
 'ITA Serie A',
 'RUS Premier League']

Fixtures

Let's go ahead and scrape ourselves some fixtures from Understat for the English Premier League:

import penaltyblog as pb

understat = pb.scrapers.Understat("ENG Premier League", "2021-2022")
under_fixtures = understat.get_fixtures()
under_fixtures.head()
understat_id datetime team_home team_away goals_home goals_away xg_home xg_away forecast_w forecast_d forecast_l season competition date
id
1628812800---brentford---arsenal 16376 2021-08-13 19:00:00 Brentford Arsenal 2 0 1.888180 1.023850 0.6289 0.2287 0.1424 2021-2022 ENG Premier League 2021-08-13
1628899200---burnley---brighton 16378 2021-08-14 14:00:00 Burnley Brighton 1 2 1.795480 1.685300 0.3894 0.2877 0.3229 2021-2022 ENG Premier League 2021-08-14
1628899200---chelsea---crystal_palace 16379 2021-08-14 14:00:00 Chelsea Crystal Palace 3 0 1.187090 0.321701 0.6405 0.2822 0.0773 2021-2022 ENG Premier League 2021-08-14
1628899200---everton---southampton 16380 2021-08-14 14:00:00 Everton Southampton 3 1 2.388630 0.580601 0.8359 0.1234 0.0407 2021-2022 ENG Premier League 2021-08-14
1628899200---leicester---wolverhampton_wanderers 16381 2021-08-14 14:00:00 Leicester Wolverhampton Wanderers 1 0 0.668082 1.327140 0.1683 0.2750 0.5567 2021-2022 ENG Premier League 2021-08-14

If you look at the table above (you may need to scroll the table horizontally if you're on a small screen), you'll notice that the column names follow a consistent style: all lowercase and formatted as snake case, which makes them easier to work with.

Where possible, the column names are also consistent between data sources. For example, the column for the home team is always called team_home whatever the site you're scraping. This may seem trivial, but it's a huge time saver not having to remember what each data source calls the same things.

Columns are always named as team_home, goals_home, xg_home etc so that if you print out the column names then related columns, such as team_home and team_away, appear next to each other. No more searching through a giant list of columns to try and find the one you want.
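
One way to see the benefit is to sort a handful of the column names from the table above. Because the home/away part comes last, an alphabetical sort keeps each pair together. This is just a plain Python illustration using names copied from the fixtures output.

```python
# A few column names copied from the Understat fixtures table above
columns = ["team_home", "xg_away", "goals_home", "team_away", "xg_home", "goals_away"]

# The shared prefix (team_, goals_, xg_) means sorting groups each pair
sorted_columns = sorted(columns)
print(sorted_columns)
```

Had the columns been called home_team, home_goals and so on, the same sort would put all the home columns together and separate each one from its away counterpart.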

The data also comes with an id column as the dataframe's index so every row has a unique key associated with it comprising the timestamp plus the team names.
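
Going by the ids in the table above, the key looks like a Unix timestamp for midnight UTC on the match date, followed by the lowercased team names with spaces swapped for underscores. The sketch below reproduces that format. It's inferred purely from the output shown here, so treat it as an assumption rather than penaltyblog's actual implementation; make_fixture_id is a hypothetical helper name.

```python
from datetime import date, datetime, timezone


def make_fixture_id(match_date: date, team_home: str, team_away: str) -> str:
    """Build an id in the style '1628812800---brentford---arsenal'."""
    # Assumption: the timestamp is midnight UTC on the match date,
    # not the kick-off time
    midnight = datetime(
        match_date.year, match_date.month, match_date.day, tzinfo=timezone.utc
    )
    timestamp = int(midnight.timestamp())

    def slug(name: str) -> str:
        # Lowercase and replace spaces, e.g. 'Crystal Palace' -> 'crystal_palace'
        return name.strip().lower().replace(" ", "_")

    return f"{timestamp}---{slug(team_home)}---{slug(team_away)}"


print(make_fixture_id(date(2021, 8, 13), "Brentford", "Arsenal"))
print(make_fixture_id(date(2021, 8, 14), "Chelsea", "Crystal Palace"))
```

Note that the key depends on the team names, so two sources only produce the same id for a fixture if they spell both teams identically.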

Merging Data Sources

Combining scraped data from multiple data sources can be tricky but the penaltyblog scrapers can help with this too. As a somewhat contrived example, let's try and combine Understat's xG scores with football-data.co.uk's betting odds for Bet365.

The first thing we need to do is scrape the data from both sites.

import penaltyblog as pb

under = pb.scrapers.Understat("ENG Premier League", "2021-2022")
ufix = under.get_fixtures()
ufix = ufix[["team_home", "team_away", "xg_home", "xg_away"]]

fb = pb.scrapers.FootballData("ENG Premier League", "2021-2022")
fbfix = fb.get_fixtures()
fbfix = fbfix[["team_home", "team_away", "b365_h", "b365_d", "b365_a"]]

In theory, we should be able to merge the two datasets by joining on the team names. Let's give it a go and see what happens...

merged = ufix.merge(fbfix, on=["team_home", "team_away"], how="inner")
print(merged.shape)

Oh dear, we only have 240 fixtures instead of the 380 we would expect for a full season of Premier League fixtures 😕.

Unfortunately, the two data sources use different team names. For example, Understat uses Manchester City whereas football-data uses Man City, so we can't join on them since they don't match.
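
A quick way to find the culprits is to compare the sets of team names from each source. The sketch below uses small hand-written lists purely for illustration (only the Manchester City / Man City pair comes from the example above), and unmatched_names and expected_matches are made-up helper names. It also shows why 240 is a plausible number: if mismatched names are the only problem, n clubs sharing a name give n * (n - 1) joinable fixtures, and 16 * 15 = 240.

```python
def unmatched_names(left, right):
    """Return names found only in the left source and only in the right one."""
    left, right = set(left), set(right)
    return sorted(left - right), sorted(right - left)


def expected_matches(n_shared_names: int) -> int:
    """Fixtures in a double round-robin between clubs whose names match."""
    return n_shared_names * (n_shared_names - 1)


# Toy data for illustration only, not the full scraped team lists
understat_teams = ["Arsenal", "Chelsea", "Manchester City"]
football_data_teams = ["Arsenal", "Chelsea", "Man City"]

only_understat, only_football_data = unmatched_names(
    understat_teams, football_data_teams
)
print(only_understat)
print(only_football_data)
print(expected_matches(20), expected_matches(16))
```

With the real data you could pass in something like set(ufix["team_home"]) and set(fbfix["team_home"]) instead of the toy lists.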

To get around this problem, all of the penaltyblog scrapers come with the ability to remap team names. penaltyblog doesn't have mappings for all the world's football teams (yet...), but there are enough for this example and you can easily extend the mappings with your own team names.

The mappings themselves are just a standard Python dictionary, with the key as the team name you want to end up with and the value as a list of possible names to remap. The example below maps both Man Utd and Man United to Manchester United:

{
  "Manchester United": ["Man Utd", "Man United"],
}
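
To show how a dictionary in this shape can be used, here's a minimal sketch that flips it into an alias lookup and applies it to a few names. This is not penaltyblog's own code, just one way the idea could work; build_lookup and normalise are hypothetical helpers, and the Manchester City entry is added by hand to show how you might extend the mappings.

```python
mappings = {
    "Manchester United": ["Man Utd", "Man United"],
}

# Extending the mappings is just adding another key (hypothetical entry)
mappings["Manchester City"] = ["Man City"]


def build_lookup(mappings: dict) -> dict:
    """Invert {canonical: [aliases]} into {alias: canonical}."""
    lookup = {}
    for canonical, aliases in mappings.items():
        for alias in aliases:
            lookup[alias] = canonical
    return lookup


def normalise(name: str, lookup: dict) -> str:
    """Return the canonical name, or the name unchanged if it has no mapping."""
    return lookup.get(name, name)


lookup = build_lookup(mappings)
for name in ["Man Utd", "Man City", "Arsenal"]:
    print(name, "->", normalise(name, lookup))
```

Names without an entry pass through untouched, so you only need mappings for the teams that actually differ between sources.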

Let's try our join again but with the example team name mappings included. Notice how we can now simply join using the id column since it's unique per fixture and will be identical across both datasets now we've mapped the team names.

import penaltyblog as pb

mappings = pb.scrapers.get_example_team_name_mappings()

under = pb.scrapers.Understat("ENG Premier League", "2021-2022", mappings)
ufix = under.get_fixtures()

fb = pb.scrapers.FootballData("ENG Premier League", "2021-2022", mappings)
fbfix = fb.get_fixtures()

merged = ufix.merge(fbfix, left_index=True, right_index=True)
print(merged.shape)

Success! With just 12 lines of code (including the blank lines) we've scraped both Understat and football-data.co.uk then merged the two data sets together using the unique id the scrapers automatically created for us.

What Else Can the Scrapers Do?

Depending on the data source, the scrapers have additional functions for collecting extra data. For example, the Understat scraper can get you shot data, including the XY coordinates, and the ESPN scraper can get you player and team level data too.

import penaltyblog as pb

under = pb.scrapers.Understat("ENG Premier League", "2021-2022")
shots = under.get_shots("16376")

print(shots.head())
competition season datetime minute result x y x_g player h_a player_id situation shot_type match_id team_home team_away goals_home goals_away date player_assisted last_action
id
1628812800---brentford---arsenal ENG Premier League 2021-2022 2021-08-13 19:00:00 10 MissedShots 0.913 0.539 0.053 Frank Onyeka h 9681 OpenPlay Head 16376 Brentford Arsenal 2 0 2021-08-13 None Aerial
1628812800---brentford---arsenal ENG Premier League 2021-2022 2021-08-13 19:00:00 11 ShotOnPost 0.908 0.315 0.118 Bryan Mbeumo h 6552 OpenPlay RightFoot 16376 Brentford Arsenal 2 0 2021-08-13 Ivan Toney Throughball
1628812800---brentford---arsenal ENG Premier League 2021-2022 2021-08-13 19:00:00 21 Goal 0.874 0.698 0.052 Sergi Canos h 1078 OpenPlay RightFoot 16376 Brentford Arsenal 2 0 2021-08-13 Ethan Pinnock BallRecovery
1628812800---brentford---arsenal ENG Premier League 2021-2022 2021-08-13 19:00:00 27 MissedShots 0.812 0.478 0.066 Sergi Canos h 1078 OpenPlay RightFoot 16376 Brentford Arsenal 2 0 2021-08-13 Frank Onyeka Pass
1628812800---brentford---arsenal ENG Premier League 2021-2022 2021-08-13 19:00:00 29 MissedShots 0.892 0.357 0.081 Bryan Mbeumo h 6552 OpenPlay RightFoot 16376 Brentford Arsenal 2 0 2021-08-13 Kristoffer Ajer Chipped
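
As a small taste of what you can do with the shot data, the sketch below totals xG per player for the five rows shown above. The values are typed in by hand from the player and x_g columns so the example runs without scraping anything; xg_by_player is a made-up helper name.

```python
from collections import defaultdict

# (player, x_g) pairs copied from the five shots printed above
shots_sample = [
    ("Frank Onyeka", 0.053),
    ("Bryan Mbeumo", 0.118),
    ("Sergi Canos", 0.052),
    ("Sergi Canos", 0.066),
    ("Bryan Mbeumo", 0.081),
]


def xg_by_player(rows):
    """Sum xG per player, rounded to three decimals to tidy float noise."""
    totals = defaultdict(float)
    for player, xg in rows:
        totals[player] += xg
    return {player: round(total, 3) for player, total in totals.items()}


print(xg_by_player(shots_sample))
```

On the full dataframe, a pandas groupby on the player column summing x_g should give the same totals for every shot in the match.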

Take a look at the penaltyblog documentation for more details and examples of each scraper.

What's Next?

There are plenty of other websites to add to the scrapers, with FBRef and whoscored next up on the TODO list, but let me know if there are any others that should be included too.

After that, it's back to the modelling to add more techniques for predicting football results and betting markets.

Thanks for reading!


About

Pena.lt/y is a site dedicated to football analytics. You'll find lots of research, tutorials and examples on the blog and on GitHub.
