My penaltyblog Python package has recently been updated to v0.5.1, so let's take a look at some of the new features.
By popular request, penaltyblog is now compatible with Python 3.7. The main reason for doing this is to allow it to run on Google Colab, which at the time of writing is still stuck on what is now a fairly old version of Python.
penaltyblog isn't included in Colab by default, but can easily be installed via pip by running the command below within one of your notebook's cells:
!pip install penaltyblog==0.5.1
Once it's installed, you can import penaltyblog as normal and use all of its functions.
import penaltyblog as pb

# scrape the 2022 Premier League season from Understat
understat = pb.scrapers.Understat("ENG Premier League", "2022")
fixtures = understat.get_fixtures()
fixtures.head()
Another exciting update is the addition of a new goals model based on Bayesian hierarchical modelling. I explained the theory behind this approach in a previous article so I won't repeat it here, but the model is now included in the package.
The hierarchical model follows the same API as all the other goals models, meaning you can optionally apply a decay weighting to the data so that more recent fixtures count for more when fitting the model.
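For reference, the decay is the standard Dixon and Coles exponential scheme (or at least that's the idea behind it): each fixture gets a weight of exp(-xi * days_ago), so larger xi values discount older games more aggressively. Here's a minimal sketch of the effect in plain numpy, assuming that formula:

import numpy as np

# exponential decay weights, assuming w = exp(-xi * days_ago)
xi = 0.001
days_ago = np.array([0, 30, 180, 365])
print(np.exp(-xi * days_ago))  # -> [1.0, 0.97, 0.84, 0.69]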
Here's a quick example to get you started:
import penaltyblog as pb

# download historical results from football-data.co.uk
fd = pb.scrapers.FootballData("ENG Premier League", "2021-2022")
fixtures = fd.get_fixtures()

# weight fixtures by recency using the Dixon and Coles decay (xi = 0.001)
fixtures["weights"] = pb.models.dixon_coles_weights(fixtures["date"], 0.001)

model = pb.models.BayesianHierarchicalGoalModel(
    fixtures["goals_home"],
    fixtures["goals_away"],
    fixtures["team_home"],
    fixtures["team_away"],
    fixtures["weights"],
)
model.fit()
prediction = model.predict("Man City", "Chelsea")
print(prediction)
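Carrying on from the example above, the prediction is a grid of scoreline probabilities that you can turn into whatever market you're interested in. The attribute names below may differ slightly between versions, so double-check them against the docs for your installed release:

# pull individual markets out of the returned probability grid
# (attribute names may vary between versions - check the docs)
print(prediction.home_win)   # probability of a home win
print(prediction.draw)       # probability of a draw
print(prediction.away_win)   # probability of an away win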
As well as the hierarchical model, I've also added a Bayesian bivariate Poisson model. I'll probably write a separate article at some point explaining the theory behind it, so again I won't go into too many details here.
However, as I've mentioned in previous articles, there's a common issue with Poisson-based models: they treat both teams' scores as independent of each other.
This doesn't reflect reality, where each team's goals scored / conceded are likely not independent. For example, if the score is 0-0 with 15 minutes to go, the underdog may settle for a draw and not push to score. Or if a team goes a goal down early on, they may park the bus to prevent a more humiliating scoreline.
The bivariate model attempts to account for this by modelling the two scores jointly. So instead of having separate, independent Poisson distributions for the home and away teams, we have one combined bivariate distribution.
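To make that concrete, the classic way to build a bivariate Poisson (and roughly the idea here, though I'm glossing over the details of the package's internals) is via a shared component: take three independent Poisson variables and set home goals = X1 + X3 and away goals = X2 + X3, so the shared X3 couples the two scores. A quick simulation in plain numpy shows the induced correlation:

import numpy as np

# trivariate reduction: a shared Poisson component couples the two scores
rng = np.random.default_rng(42)
x1 = rng.poisson(1.2, 100_000)  # home-only component
x2 = rng.poisson(0.9, 100_000)  # away-only component
x3 = rng.poisson(0.3, 100_000)  # shared component
home, away = x1 + x3, x2 + x3

# correlation is roughly 0.3 / sqrt(1.5 * 1.2), about 0.22 rather than zero
print(np.corrcoef(home, away)[0, 1])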
Here's another quick example to get you started. Notice how similar the code is to the hierarchical example above - all we have to do is change one word to switch out the model, making it easy to try out different approaches.
import penaltyblog as pb
fd = pb.scrapers.FootballData("ENG Premier League", "2021-2022")
fixtures = fd.get_fixtures()
fixtures["weights"] = pb.models.dixon_coles_weights(fixtures["date"], 0.001)
model = pb.models.BayesianBivariateGoalModel(
    fixtures["goals_home"],
    fixtures["goals_away"],
    fixtures["team_home"],
    fixtures["team_away"],
    fixtures["weights"],
)
model.fit()
prediction = model.predict("Man City", "Chelsea")
print(prediction)
So which model should you use? 🤷
Unfortunately, there's no simple answer here. If you want something fast and reliable then go with the Dixon and Coles model. Otherwise, if you've got more time / computational power, try out the Bayesian models. The only real way of knowing, though, is to backtest them on your data to find out which works best for your particular use case.
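If you do backtest, the ranked probability score is a natural metric for 1x2 predictions since it respects the ordering of home win / draw / away win. Here's a minimal hand-rolled sketch in plain numpy (the probabilities are made up for illustration); average it over held-out fixtures and the model with the lowest score wins:

import numpy as np

def rps(probs, outcome):
    """Ranked probability score: probs ordered (home, draw, away),
    outcome is the index of what actually happened. Lower is better."""
    cum_probs = np.cumsum(probs)
    cum_outcome = np.cumsum(np.eye(len(probs))[outcome])
    return np.sum((cum_probs - cum_outcome) ** 2) / (len(probs) - 1)

# hypothetical prediction of 50% home / 30% draw / 20% away, home win occurs
print(rps([0.5, 0.3, 0.2], 0))  # -> 0.145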
The scrapers in penaltyblog have also been updated to include So Fifa. The get_players function essentially scrapes the front page of the website, which contains top-level player data. You can control the number of pages to scrape and how the data should be sorted, making it easier to just get the top-ranked players if that's all you're interested in.
import penaltyblog as pb
sofifa = pb.scrapers.SoFifa()
player_info = sofifa.get_players(max_pages=2, sort_by="potential")
print(player_info.head())
You can then use the get_player function to get more detailed stats about players you're interested in, based on So Fifa's player ID.
import pandas as pd
import penaltyblog as pb
from time import sleep

sofifa = pb.scrapers.SoFifa()

# grab the front page sorted by player value, then fetch detailed
# stats for the top five player IDs
player_info = sofifa.get_players(max_pages=1, sort_by="value")

players = list()
for id_ in player_info.index[:5]:
    tmp = sofifa.get_player(id_)
    players.append(tmp)
    sleep(1)  # pause between requests to be polite

players = pd.concat(players)
print(players)
Remember to scrape nicely though, please don't crash someone's website by scraping too much / too fast.
My TODO list has plenty more modelling approaches to try and more websites to scrape, but if there's anything else you think would be good to include then let me know.
Thanks for reading!