If you're interested in football analytics, or trying to predict football results and odds for betting markets, then you are going to need data. This article walks through how to use the penaltyblog Python package to scrape football (soccer) data from different websites and how to join data from different sources together.
penaltyblog
The first thing you need to do is install the penaltyblog package from PyPI using pip. Provided you've already got Python 3 and pip installed, you just need to run the command below in a terminal. If you haven't got Python installed yet, head over to Anaconda and install that first.
pip install penaltyblog
penaltyblog contains scrapers for a number of different websites, including Understat, ESPN and football-data.co.uk. For consistency, each scraper takes the same arguments when it is created: a standardized competition name and a season. For example:
import penaltyblog as pb
understat = pb.scrapers.Understat("ENG Premier League", "2021-2022")
fb = pb.scrapers.FootballData("ENG Premier League", "2021-2022")
espn = pb.scrapers.ESPN("ENG Premier League", "2021-2022")
It doesn't matter which data source you are scraping: competition names always comprise the country's three-letter code followed by the competition's name, and the season is formatted as start_year-end_year. The scraper then maps this to whatever the data source calls the competition, so you don't need to remember whether you are scraping La Liga, LaLiga, La Liga Primera Division or whatever else a particular website calls it.
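To make the naming convention concrete, here is a minimal sketch of how a standardized competition name and season string could be picked apart and translated into a site-specific label. This is not penaltyblog's internal code: the SITE_NAMES table, the site keys and all three helper functions are hypothetical and purely illustrative.

```python
import re
from typing import Tuple

# Hypothetical lookup: standardized name -> label used by each (made-up) site key.
SITE_NAMES = {
    "ESP La Liga": {
        "site_a": "La Liga",
        "site_b": "LaLiga",
        "site_c": "La Liga Primera Division",
    },
}


def split_competition(name: str) -> Tuple[str, str]:
    """Split 'ENG Premier League' into ('ENG', 'Premier League')."""
    code, _, league = name.partition(" ")
    if len(code) != 3 or not code.isupper() or not league:
        raise ValueError(f"expected '<3-letter code> <name>', got {name!r}")
    return code, league


def parse_season(season: str) -> Tuple[int, int]:
    """Split 'start_year-end_year' into two integers."""
    match = re.fullmatch(r"(\d{4})-(\d{4})", season)
    if match is None:
        raise ValueError(f"expected 'YYYY-YYYY', got {season!r}")
    return int(match.group(1)), int(match.group(2))


def site_name(competition: str, site: str) -> str:
    """Translate a standardized competition name into one site's label."""
    return SITE_NAMES[competition][site]


print(split_competition("ENG Premier League"))
print(parse_season("2021-2022"))
print(site_name("ESP La Liga", "site_b"))
```

The point of the design is that your own code only ever sees the standardized form, and the per-site spelling lives in one lookup table.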
You can get a list of available competitions for each scraper by calling its list_competitions() function:
import penaltyblog as pb
understat = pb.scrapers.Understat.list_competitions()
['DEU Bundesliga 1',
'ENG Premier League',
'ESP La Liga',
'FRA Ligue 1',
'ITA Serie A',
'RUS Premier League']
Let's go ahead and scrape some fixtures from Understat for the English Premier League:
import penaltyblog as pb
understat = pb.scrapers.Understat("ENG Premier League", "2021-2022")
under_fixtures = understat.get_fixtures()
under_fixtures.head()
id | understat_id | datetime | team_home | team_away | goals_home | goals_away | xg_home | xg_away | forecast_w | forecast_d | forecast_l | season | competition | date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1628812800---brentford---arsenal | 16376 | 2021-08-13 19:00:00 | Brentford | Arsenal | 2 | 0 | 1.888180 | 1.023850 | 0.6289 | 0.2287 | 0.1424 | 2021-2022 | ENG Premier League | 2021-08-13 |
1628899200---burnley---brighton | 16378 | 2021-08-14 14:00:00 | Burnley | Brighton | 1 | 2 | 1.795480 | 1.685300 | 0.3894 | 0.2877 | 0.3229 | 2021-2022 | ENG Premier League | 2021-08-14 |
1628899200---chelsea---crystal_palace | 16379 | 2021-08-14 14:00:00 | Chelsea | Crystal Palace | 3 | 0 | 1.187090 | 0.321701 | 0.6405 | 0.2822 | 0.0773 | 2021-2022 | ENG Premier League | 2021-08-14 |
1628899200---everton---southampton | 16380 | 2021-08-14 14:00:00 | Everton | Southampton | 3 | 1 | 2.388630 | 0.580601 | 0.8359 | 0.1234 | 0.0407 | 2021-2022 | ENG Premier League | 2021-08-14 |
1628899200---leicester---wolverhampton_wanderers | 16381 | 2021-08-14 14:00:00 | Leicester | Wolverhampton Wanderers | 1 | 0 | 0.668082 | 1.327140 | 0.1683 | 0.2750 | 0.5567 | 2021-2022 | ENG Premier League | 2021-08-14 |
If you look at the table above (you may need to scroll horizontally if you're on a small screen), you'll notice that the column names have a consistent style: all lowercase and formatted as snake case, to make them easier to work with.
Where possible, the column names are also consistent between data sources. For example, the column for the home team is always called team_home whichever site you're scraping. This may seem trivial, but it's a huge time saver not having to remember what each data source calls the same thing.
Columns are always named team_home, goals_home, xg_home etc. so that if you print out the column names then related columns, such as team_home and team_away, appear next to each other. No more searching through a giant list of columns to try and find the one you want.
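A quick way to see the benefit of this convention is to sort the names alphabetically. The sketch below compares the suffix style used by the scrapers with a hypothetical prefix style (home_team, away_team and so on) that is not used by penaltyblog and appears here only for contrast.

```python
# Column names in the style used by the scrapers (venue as a suffix).
suffix_style = ["xg_away", "team_home", "goals_away", "xg_home", "team_away", "goals_home"]

# Hypothetical alternative style with the venue as a prefix, for comparison only.
prefix_style = ["home_xg", "away_team", "home_goals", "away_xg", "home_team", "away_goals"]

# With the suffix style, each pair of related columns ends up side by side.
print(sorted(suffix_style))

# With the prefix style, all away_* columns come first and the pairs are split up.
print(sorted(prefix_style))
```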
The data also comes with an id column as the dataframe's index, so every row has a unique key comprising the timestamp plus the team names.
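The ids in the table above look like a Unix timestamp for midnight UTC on the match date, followed by the lowercased team names with spaces turned into underscores, all joined by three dashes. The sketch below rebuilds that format. It is inferred from the sample output only; the function name make_fixture_id is hypothetical and the library's real implementation may differ.

```python
from datetime import date, datetime, timezone


def make_fixture_id(match_date: date, team_home: str, team_away: str) -> str:
    """Build an id like '1628812800---brentford---arsenal' (format inferred from the sample rows)."""
    # Midnight UTC on the match date, as whole seconds since the Unix epoch.
    midnight_utc = datetime(match_date.year, match_date.month, match_date.day, tzinfo=timezone.utc)
    timestamp = int(midnight_utc.timestamp())

    def slug(name: str) -> str:
        # Lowercase and replace spaces, e.g. 'Crystal Palace' -> 'crystal_palace'.
        return name.strip().lower().replace(" ", "_")

    return f"{timestamp}---{slug(team_home)}---{slug(team_away)}"


print(make_fixture_id(date(2021, 8, 13), "Brentford", "Arsenal"))
print(make_fixture_id(date(2021, 8, 14), "Chelsea", "Crystal Palace"))
```

Because the key depends on the team names, two sources only produce the same id for a fixture when they spell both teams identically.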
Combining scraped data from multiple data sources can be tricky, but the penaltyblog scrapers can help with this too. As a somewhat contrived example, let's try to combine Understat's xG scores with football-data.co.uk's betting odds for Bet365.
The first thing we need to do is scrape the data from both sites.
import penaltyblog as pb
under = pb.scrapers.Understat("ENG Premier League", "2021-2022")
ufix = under.get_fixtures()
ufix = ufix[["team_home", "team_away", "xg_home", "xg_away"]]
fb = pb.scrapers.FootballData("ENG Premier League", "2021-2022")
fbfix = fb.get_fixtures()
fbfix = fbfix[["team_home", "team_away", "b365_h", "b365_d", "b365_a"]]
In theory, we should be able to merge the two datasets by joining on the team names. Let's give it a go and see what happens...
merged = ufix.merge(fbfix, on=["team_home", "team_away"], how="inner")
print(merged.shape)
Oh dear, we only have 240 fixtures instead of the 380 we would expect for a full season of Premier League fixtures 😕.
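The figure of 380 follows from the league format: 20 clubs, each hosting every other club once. A short sanity check, using placeholder team names rather than real ones:

```python
from itertools import permutations

n_teams = 20
# Placeholder names; only the count matters here.
teams = [f"team_{i}" for i in range(n_teams)]

# Every ordered (home, away) pair is one fixture in a double round-robin.
fixtures = list(permutations(teams, 2))
expected = n_teams * (n_teams - 1)

print(len(fixtures), expected)
```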
Unfortunately, the two data sources use different team names. For example, Understat uses Manchester City whereas football-data.co.uk uses Man City, so we can't join on them since they don't match.
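One way to find out which names are responsible is to compare the sets of team names from each source. The sketch below uses small hand-written sets as stand-ins for the scraped data; only the Manchester City / Man City pair comes from the example above, and the other names are illustrative.

```python
# Illustrative samples, not scraped data.
understat_teams = {"Arsenal", "Brentford", "Manchester City"}
football_data_teams = {"Arsenal", "Brentford", "Man City"}

# Names that appear in one source but not the other are the ones blocking the join.
only_understat = sorted(understat_teams - football_data_teams)
only_football_data = sorted(football_data_teams - understat_teams)

print("Only in Understat:", only_understat)
print("Only in football-data:", only_football_data)
```

With the real dataframes, the sets could be built with something like set(ufix["team_home"]) and set(fbfix["team_home"]).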
To get around this problem, all of the penaltyblog scrapers come with the ability to remap team names. penaltyblog doesn't have mappings for all the world's football teams (yet) but there are enough for this example, and you can easily extend the mappings with your own team names.
The mappings themselves are just a standard Python dictionary, with the key as the team name you want to end up with and the value as a list of possible names to remap. The example below maps both Man Utd and Man United to Manchester United:
{
"Manchester United": ["Man Utd", "Man United"],
}
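To show how a dictionary in this shape can be applied, here is a minimal sketch that inverts it into an alias-to-canonical lookup and remaps individual names. The helpers build_lookup and remap are hypothetical and are not how penaltyblog necessarily does it internally.

```python
mappings = {
    "Manchester United": ["Man Utd", "Man United"],
}


def build_lookup(mappings):
    """Invert {canonical: [aliases]} into {alias: canonical}."""
    lookup = {}
    for canonical, aliases in mappings.items():
        for alias in aliases:
            lookup[alias] = canonical
    return lookup


def remap(name, lookup):
    """Return the canonical name, or the name unchanged if no mapping exists."""
    return lookup.get(name, name)


lookup = build_lookup(mappings)
print(remap("Man Utd", lookup))
print(remap("Arsenal", lookup))
```

Leaving unmapped names untouched means a missing entry shows up as an unmatched row in the merge rather than as an error.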
Let's try our join again, but with the example team name mappings included. Notice how we can now simply join using the id column, since it's unique per fixture and will be identical across both datasets now that we've mapped the team names.
import penaltyblog as pb
mappings = pb.scrapers.get_example_team_name_mappings()
under = pb.scrapers.Understat("ENG Premier League", "2021-2022", mappings)
ufix = under.get_fixtures()
fb = pb.scrapers.FootballData("ENG Premier League", "2021-2022", mappings)
fbfix = fb.get_fixtures()
merged = ufix.merge(fbfix, left_index=True, right_index=True)
print(merged.shape)
Success! With just 12 lines of code (including the blank lines) we've scraped both Understat and football-data.co.uk, then merged the two datasets together using the unique id the scrapers automatically created for us.
Depending on the data source, the scrapers have additional functions for collecting extra data. For example, the Understat scraper can get you shot data, including the XY coordinates, and the ESPN scraper can get you player and team level data too.
import penaltyblog as pb
under = pb.scrapers.Understat("ENG Premier League", "2021-2022")
shots = under.get_shots("16376")
print(shots.head())
id | competition | season | datetime | minute | result | x | y | x_g | player | h_a | player_id | situation | shot_type | match_id | team_home | team_away | goals_home | goals_away | date | player_assisted | last_action |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1628812800---brentford---arsenal | ENG Premier League | 2021-2022 | 2021-08-13 19:00:00 | 10 | MissedShots | 0.913 | 0.539 | 0.053 | Frank Onyeka | h | 9681 | OpenPlay | Head | 16376 | Brentford | Arsenal | 2 | 0 | 2021-08-13 | None | Aerial |
1628812800---brentford---arsenal | ENG Premier League | 2021-2022 | 2021-08-13 19:00:00 | 11 | ShotOnPost | 0.908 | 0.315 | 0.118 | Bryan Mbeumo | h | 6552 | OpenPlay | RightFoot | 16376 | Brentford | Arsenal | 2 | 0 | 2021-08-13 | Ivan Toney | Throughball |
1628812800---brentford---arsenal | ENG Premier League | 2021-2022 | 2021-08-13 19:00:00 | 21 | Goal | 0.874 | 0.698 | 0.052 | Sergi Canos | h | 1078 | OpenPlay | RightFoot | 16376 | Brentford | Arsenal | 2 | 0 | 2021-08-13 | Ethan Pinnock | BallRecovery |
1628812800---brentford---arsenal | ENG Premier League | 2021-2022 | 2021-08-13 19:00:00 | 27 | MissedShots | 0.812 | 0.478 | 0.066 | Sergi Canos | h | 1078 | OpenPlay | RightFoot | 16376 | Brentford | Arsenal | 2 | 0 | 2021-08-13 | Frank Onyeka | Pass |
1628812800---brentford---arsenal | ENG Premier League | 2021-2022 | 2021-08-13 19:00:00 | 29 | MissedShots | 0.892 | 0.357 | 0.081 | Bryan Mbeumo | h | 6552 | OpenPlay | RightFoot | 16376 | Brentford | Arsenal | 2 | 0 | 2021-08-13 | Kristoffer Ajer | Chipped |
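As one illustration of what shot-level data makes possible, the sketch below totals xG per player using the five rows shown above, copied by hand into plain tuples so it runs without scraping anything. With the real dataframe, something like shots.groupby("player")["x_g"].sum() in pandas would do the same job; that is a suggestion rather than something taken from the penaltyblog documentation.

```python
from collections import defaultdict

# (player, x_g) pairs copied from the five sample rows above.
sample_shots = [
    ("Frank Onyeka", 0.053),
    ("Bryan Mbeumo", 0.118),
    ("Sergi Canos", 0.052),
    ("Sergi Canos", 0.066),
    ("Bryan Mbeumo", 0.081),
]

# Accumulate xG per player.
xg_by_player = defaultdict(float)
for player, x_g in sample_shots:
    xg_by_player[player] += x_g

# Print players from highest to lowest total xG.
for player, total in sorted(xg_by_player.items(), key=lambda item: item[1], reverse=True):
    print(f"{player}: {total:.3f}")
```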
Take a look at the penaltyblog documentation for more details and examples of each scraper.
There are plenty of other websites to add to the scrapers, with FBRef and WhoScored next up on the TODO list, but let me know if there are any others that should be included too.
After that, it's back to the modelling to add more techniques for predicting football results and betting markets.
Thanks for reading!