A few years ago, I combined my day job as a data engineer, helping customers build data analytics, with my love of football and set out to beat the bookies. It’s an exciting game to play, and I have learned a lot about the merits of different predictive models, which I hope will inspire you too.

Casino gambling has always been a guaranteed way of eventually losing money. Roulette is probably the easiest game for understanding exactly why. There are 36 red and black numbers plus the green number 0. So that’s 37 possibilities in total. When betting on red or black, the probability of choosing correctly is 18/37, just under 50%. That green spot (or spots in American roulette) makes all the difference and guarantees that the casino will be profitable in the long term. The odds are squarely stacked against you – the house always wins.
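To make the house edge concrete, here is a small sketch of the expected profit on a one-unit bet on red. It assumes the usual even-money payout (win one unit or lose the stake), which is the standard rule but is not spelled out above, and 38 pockets for American roulette (0 and 00).

```python
from fractions import Fraction


def even_money_expected_value(winning_pockets: int, total_pockets: int) -> Fraction:
    """Expected profit per 1 unit staked on an even-money bet such as red/black."""
    p_win = Fraction(winning_pockets, total_pockets)
    p_lose = 1 - p_win
    # Win +1 unit with probability p_win, lose the 1-unit stake otherwise.
    return p_win * 1 + p_lose * (-1)


european = even_money_expected_value(18, 37)  # single green 0
american = even_money_expected_value(18, 38)  # green 0 and 00 (assumed layout)

print(f"European roulette: {float(european):.4f} units per unit staked")
print(f"American roulette: {float(american):.4f} units per unit staked")
```

Both values are negative, so the longer you play, the closer your average result per bet gets to a small but certain loss.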

A similar bias has always existed in bookmakers’ odds – from football to horse racing. The bookies try to ensure that the odds are set to their benefit, but calibrating these odds is harder than for roulette because the calculations are trickier. Some odds are more promising for customers than others, as the bookmakers have to take the market and public opinion into consideration. For example, teams on an unexpected winning streak will often be overestimated in the bookies’ odds, as people typically underestimate regression to the mean when it comes to a team’s performance.

And that raises an enticing possibility. Is it feasible to devise a better and more accurate way to discover inefficient bookmaker odds, where the probability of profit is greater than estimated? In short, is it possible to beat the bookies?

Working out how to beat the bookie

One statistical model that has been tested for sports betting analysis is the Poisson model. The basic assumption of this model is that the number of goals scored in a football match can be described by a Poisson distribution. Based on the average number of goals scored and conceded in previous matches, each team’s attacking strength and defensive strength can be calculated.
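As a rough illustration of how such strengths might be derived, the sketch below uses one common convention: a team’s average goals scored (or conceded) divided by the league average, so that 1.0 means “average”. The results, team names and function names are hypothetical, and home advantage is ignored for brevity; this is not necessarily the exact formula used for the figures in this article.

```python
from statistics import mean

# Hypothetical past results: (home_team, away_team, home_goals, away_goals)
matches = [
    ("A", "B", 2, 0),
    ("B", "A", 1, 1),
    ("A", "C", 3, 1),
    ("C", "A", 0, 2),
    ("B", "C", 1, 0),
    ("C", "B", 2, 2),
]


def team_strengths(results):
    """Return ({team: (attack, defence)}, league_avg), with 1.0 meaning league average."""
    scored, conceded = {}, {}
    for home, away, home_goals, away_goals in results:
        scored.setdefault(home, []).append(home_goals)
        conceded.setdefault(home, []).append(away_goals)
        scored.setdefault(away, []).append(away_goals)
        conceded.setdefault(away, []).append(home_goals)
    # Average goals per team per match across the whole league.
    league_avg = mean([g for goals in scored.values() for g in goals])
    strengths = {
        team: (mean(scored[team]) / league_avg, mean(conceded[team]) / league_avg)
        for team in scored
    }
    return strengths, league_avg


strengths, league_avg = team_strengths(matches)


def expected_goals(attacker, defender):
    """Expected goals for `attacker` against `defender` under this simple convention."""
    attack, _ = strengths[attacker]
    _, defence = strengths[defender]
    return attack * defence * league_avg


print(strengths)
print(expected_goals("A", "C"), expected_goals("C", "A"))
```

A high attack value means a team scores more than average; a high defence value means it concedes more than average, so it inflates the opponent’s expected goals.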

These features can be used to calculate the expected number of goals for an upcoming match. Based on this number of expected goals, two independent Poisson distributions (one for the home team and one for the away team) are used to calculate the probabilities of the different possible outcomes. It’s best to use a zero-inflated Poisson distribution when running the analysis, as the standard Poisson distribution underestimates the probability of zero goals.
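The calculation can be sketched in a few lines. The expected-goal values (2.0 and 0.5) and the zero-inflation weight (0.05) below are hypothetical, chosen only to show the mechanics; the function names are my own. With a zero-inflation weight of 0 the code reduces to the plain Poisson model.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k goals when lam goals are expected."""
    return lam ** k * exp(-lam) / factorial(k)


def zero_inflated_pmf(k, lam, zero_inflation):
    """Poisson with an extra probability mass `zero_inflation` placed on zero goals."""
    base = (1 - zero_inflation) * poisson_pmf(k, lam)
    return zero_inflation + base if k == 0 else base


def outcome_probabilities(home_xg, away_xg, zero_inflation=0.0, max_goals=10):
    """Sum score-line probabilities into (home win, draw, away win).

    Treats the two teams' goal counts as independent and truncates at max_goals.
    """
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = (zero_inflated_pmf(h, home_xg, zero_inflation)
                 * zero_inflated_pmf(a, away_xg, zero_inflation))
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


# Hypothetical expected goals for the home and away team.
home_win, draw, away_win = outcome_probabilities(2.0, 0.5)
print(f"home {home_win:.3f}, draw {draw:.3f}, away {away_win:.3f}")

# The same match with a small, assumed zero-inflation weight.
print(outcome_probabilities(2.0, 0.5, zero_inflation=0.05))
```

The probability of a single score line, such as 2-0, is just the product of the two individual goal-count probabilities, which is what the inner loop computes.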

I use an analytic database (OLAP) because I’m impatient and it allows me to analyse very large datasets quickly on a single computer and run data science libraries and frameworks directly on the data. These databases are normally prohibitively expensive, but you can find free trial versions online, which allow anyone to analyse up to 200 GB of data.

The Poisson distribution shows the probabilities for the possible score lines. Using this model, the most likely result for the opening match of the 2018/19 Premier League season (Manchester Utd v Leicester City) was a 2-0 win for Manchester United, with a probability of 15% (56% × 26%, the product of the two teams’ independent goal-count probabilities). To get the overall probability of a home win for Manchester Utd, you just sum the probabilities of all possible home-win score lines (1-0, 2-0, 2-1, and so on). That gives a home-win probability of 68.33%. The actual result was 2-1 to Manchester Utd, and Jamie Vardy only scored for Leicester in the 92nd minute of the game. So the prediction of the exact score wasn’t too far off, and the prediction of a home win was correct.

However, just betting on who will win a match is not the way forward for a profitable bottom line. To have long-term success, bets need to be varied, and you need to select the odds where the probability of a particular event occurring is higher than the bookmakers suggest. This approach is known as value betting.
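In code, the value-betting test is a one-line comparison: a bet has value when the model’s probability multiplied by the decimal odds is greater than 1. The sketch below uses the 68.33% home-win probability from the example above; the bookmaker odds (1.55 and 1.40) are hypothetical, as are the function names.

```python
def implied_probability(decimal_odds):
    """Probability implied by decimal odds, ignoring the bookmaker's margin."""
    return 1.0 / decimal_odds


def expected_profit_per_unit(model_probability, decimal_odds):
    """Expected profit per 1 unit staked if the model's probability is right."""
    return model_probability * decimal_odds - 1.0


def is_value_bet(model_probability, decimal_odds):
    """True when the model rates the event as more likely than the odds imply."""
    return expected_profit_per_unit(model_probability, decimal_odds) > 0


model_home_win = 0.6833  # home-win probability from the Poisson example

for odds in (1.55, 1.40):  # hypothetical bookmaker odds
    print(
        f"odds {odds}: implied {implied_probability(odds):.3f}, "
        f"value bet: {is_value_bet(model_home_win, odds)}"
    )
```

Note that the test only says the bet is attractive if the model is right; a badly calibrated model will happily flag losing bets as value.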

Using the Poisson model to identify value bets for Premier League matches over the last three years left earnings in the red overall, but not overwhelmingly so. In total, 1326 bets of €10 were placed with a yield of -0.7%. The diagram shows the profit per calendar week.
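Yield here means total profit divided by total stakes. The sketch below shows the calculation on four hypothetical bets and then works backwards from the figures above (1326 bets of €10 at a yield of -0.7%) to the implied overall loss. The bet list and function name are illustrative only.

```python
def betting_yield(bets):
    """Total profit divided by total stakes.

    Each bet is (stake, decimal_odds, won).
    """
    staked = sum(stake for stake, _, _ in bets)
    profit = sum(
        stake * (odds - 1) if won else -stake
        for stake, odds, won in bets
    )
    return profit / staked


# Hypothetical bets, just to show the mechanics.
sample_bets = [
    (10, 2.0, True),
    (10, 1.5, False),
    (10, 3.0, False),
    (10, 2.5, True),
]
sample_yield = betting_yield(sample_bets)
print(f"sample yield: {sample_yield:.1%}")

# Working backwards from the figures quoted in the text.
total_staked = 1326 * 10
implied_profit = total_staked * -0.007
print(f"staked €{total_staked}, implied profit €{implied_profit:.2f}")
```

On €13,260 staked, a yield of -0.7% corresponds to a loss of a little under €93 over the whole period.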

So how do you actually beat the bookie?

Unfortunately, the Poisson model suffers from several disadvantages when it comes to football analysis. For example, goals in a football match are not independent, so the use of two independent Poisson distributions is questionable. If one team scores a goal, the other team is motivated to score as well. The team attack and defence strengths also do not reflect the underlying performance or ‘form’ of a team. This can lead to a team being over- or underestimated when it is performing particularly well or particularly poorly.

With this in mind, a MACD (moving average convergence/divergence) analysis would be more appropriate here – while it’s normally used to identify performance trends of stocks, it could also be used to identify performance trends of football teams.
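For readers who have not met MACD before, here is a minimal sketch applied to a team’s match-by-match goal difference. The MACD line is a fast exponential moving average minus a slow one, and the signal line is a moving average of the MACD line. The goal differences are hypothetical, and the short spans (3, 6, 2) are my own assumption for a short football series; stock charts usually use 12, 26 and 9.

```python
def ema(values, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2 / (span + 1)
    smoothed = [values[0]]  # seed with the first observation
    for value in values[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


def macd(values, fast=12, slow=26, signal=9):
    """Return (macd_line, signal_line, histogram) for a series of values."""
    fast_ema = ema(values, fast)
    slow_ema = ema(values, slow)
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal_line = ema(macd_line, signal)
    histogram = [m - s for m, s in zip(macd_line, signal_line)]
    return macd_line, signal_line, histogram


# Hypothetical goal differences for one team over its last ten matches.
goal_difference = [-2, -1, -1, 0, 0, 1, 1, 2, 2, 3]

macd_line, signal_line, histogram = macd(goal_difference, fast=3, slow=6, signal=2)
print([round(m, 2) for m in macd_line])
```

A positive MACD line suggests recent results are better than the longer-run average, which is one way of putting a number on ‘form’.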

However, I believe that there are still some benefits to using the Poisson model. Because it is a statistical model that cannot easily be extended with new features, it is better to take its basic features (team attack and defence strength) and build a new base with TensorFlow. Such a neural network model can be extended with new features to improve the analysis step by step. Neural networks are capable of automatically learning the relationships between different features and predicting the outcome.

With this method, you only need to train the model with new features and test it. Feeding the team strength features into a Multi-Layer Perceptron (MLP) produces a model that shows an improvement in results over the last three seasons.
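To show the shape of such a network without depending on any framework, here is a plain-Python sketch of a single forward pass: four team strength inputs, one hidden layer and three outputs (home win, draw, away win). The layer sizes, the input values and the random, untrained weights are all assumptions for illustration; a real model would be built and trained with TensorFlow on historical matches, and its architecture may differ.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(rng, n_in, n_out):
    """Untrained layer with random weights and zero biases (placeholder only)."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


rng = random.Random(0)
hidden_w, hidden_b = random_layer(rng, 4, 8)  # 4 inputs -> 8 hidden units (assumed size)
output_w, output_b = random_layer(rng, 8, 3)  # 8 hidden units -> 3 outcomes


def predict(features):
    """features: [home_attack, home_defence, away_attack, away_defence]."""
    hidden = relu(dense(features, hidden_w, hidden_b))
    return softmax(dense(hidden, output_w, output_b))


# Hypothetical strength values for one match.
probs = predict([1.6, 0.4, 0.6, 1.6])
print(dict(zip(["home win", "draw", "away win"], probs)))
```

Because the weights are random, the printed probabilities are meaningless; the point is only that adding a new feature means widening the input list and retraining, rather than redesigning the model.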

The features are exactly the same as for the Poisson model, but the Team Strength MLP achieves a final yield of 3.6% profit on 1245 bets. This profitability is surprising, considering that the only information factored in is the number of goals scored and conceded.