This time next week, we will be in the throes of World Cup fever. As we speak, beers are being readied, projection screens are being mounted, and unrealistic levels of national pride and aspiration are building. Already, stats and predictions for almost every facet of the event are flooding in: Brazil are expecting 3.7 million visitors and a $3.03 billion boost to their economy; Panini are expecting £89.1 million in sticker sales in Brazil alone; and in the UK, Domino’s Pizza stands to make an estimated £84 million during the World Cup.
The one area where facts and figures seem comparatively scarce is in accurately predicting the World Cup winner. We can estimate how many people will be flying out to Brazil, how many pizzas the Brits will consume in front of their TVs, and how many stickers rabid fans will collect; but can we use data to predict who might actually win? We take a look at the sceptics, and at those, such as Goldman Sachs, who are confident that their data-driven models will correctly predict the World Cup winner.
The Sceptics: No, We Can’t
In short, football is one of the most challenging sports to apply analytics to. As The Economist reported last year, using a ‘Moneyball’ approach in football is incredibly difficult. Whereas baseball consists of discrete events which are easy to measure, football is a game where 22 people are continuously in play, constantly moving and interacting with each other in numerous different ways. This dynamic nature makes it hard to know what to measure, and harder still to measure it continuously.
Although it’s challenging, it’s not impossible; we recently reported that in the equally dynamic game of basketball, the NBA are using a camera system to capture intricate data which tells coaches the position of the ball and every player, for every second, in every game of the season. Similar strategies are being rolled out in football; companies like ProZone and Opta are tracking a number of metrics, including player position, types of pass and goal opportunities, during football games. They typically garner 2,000 data events per match.
The relative value of this data, however, is yet to be seen. There have been successes and failures when managers have relied on raw data to select players. One success story is the selection of Mathieu Flamini to replace Patrick Vieira on Arsene Wenger’s Arsenal team; he was brought to Wenger’s attention by data showing he ran impressive distances during a match, and has been performing well for Arsenal. On the other side of the coin, Alex Ferguson dropped Jaap Stam because data showed he was making fewer tackles per match than he used to. Despite what the data suggested, Stam transferred to an Italian team and proved to be a valuable asset there.
The underlying problem here is that although data can show you who runs the fastest or farthest, or who tackles most, a good football player is more than the sum of these parts. This data can only go so far in telling you how all these factors will combine in on-pitch performance.
The Scientists: Yes, We Can
When predicting their winner for the World Cup, Goldman Sachs moved away from the tricky realm of player-specific stats and took a more general approach. They looked at how national teams as a whole had fared in previous World Cups, and how the Elo ratings rank them now, to formulate a predictive model. They explained their methodology:
The predictions for each match are based on a regression analysis that uses the entire history of mandatory international football matches—i.e., no friendlies—since 1960. This gives us about 14,000 observations to estimate the coefficients of our model. The dependent variable in the regression analysis is the number of goals scored by each side in each match. Following the literature on modelling football matches, we assume that the number of goals scored by a particular side in a particular match follows a Poisson distribution.
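To give a feel for what the Poisson assumption buys you, here is a minimal sketch in Python. It is not Goldman Sachs’ model: it skips the regression entirely, assumes the two sides’ goal counts are independent, and simply takes an expected-goals rate for each side as given. The function names and the example rates (1.8 and 1.1) are hypothetical, chosen purely for illustration.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number is lam."""
    return exp(-lam) * lam ** k / factorial(k)


def outcome_probabilities(rate_a, rate_b, max_goals=15):
    """Return (P(A wins), P(draw), P(B wins)) for one match.

    Assumes each side's goals follow an independent Poisson
    distribution with the given expected-goals rate. Scorelines
    above max_goals per side are ignored, which is a negligible
    truncation for realistic football scoring rates.
    """
    win_a = draw = win_b = 0.0
    for goals_a in range(max_goals + 1):
        for goals_b in range(max_goals + 1):
            p = poisson_pmf(goals_a, rate_a) * poisson_pmf(goals_b, rate_b)
            if goals_a > goals_b:
                win_a += p
            elif goals_a == goals_b:
                draw += p
            else:
                win_b += p
    return win_a, draw, win_b


# Hypothetical rates, not taken from any published model.
win_a, draw, win_b = outcome_probabilities(1.8, 1.1)
print(f"A wins: {win_a:.3f}, draw: {draw:.3f}, B wins: {win_b:.3f}")
```

In a full model, the two rates would themselves come out of the regression (driven by things like Elo rating and home advantage), and the whole tournament would be simulated match by match to arrive at each team’s chance of lifting the trophy.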
What their model found was a staggering chance of victory for Brazil: 48.5%. They predicted Brazil would beat Argentina 3-1 in the final, with Argentina having a 14.1% chance of winning the tournament. Brazil’s impressive odds were attributed to a number of factors, including their Elo system ranking, their relatively strong performance in World Cups compared to other tournaments, and home advantage; the home team has won 30% of all World Cups since 1930. The model also predicted a 65% chance of a home-continent victory; bad news for Europeans, who have never won a World Cup hosted in the Americas.
Although this model relies entirely on past pedigree and doesn’t factor in future potential, there might be something in it. Goldman Sachs used a model similarly based on past performance to predict the number of medals Britain would win at the 2012 Olympics. They predicted 30 gold medals and 65 medals overall; Britain walked away with 29 golds and 65 medals overall.
So, can big data help us in predicting the World Cup? The only way to know for sure is to start watching next week, and see if Brazil do in fact claim a convincing victory. And, of course, if England perform disastrously in the penalty shootouts.
(Featured image credit: Francisco Aragão)
Eileen McNulty-Holmes – Editor
Eileen has five years’ experience in journalism and editing for a range of online publications. She has a degree in English Literature from the University of Exeter, and is particularly interested in big data’s application in humanities. She is a native of Shropshire, United Kingdom.
This is not Big Data: 14,000 historic results is TINY data. Visicalc on an Apple ][ could handle the regression required. That’s 30+ year old desktop technology!