Infinite Analytics CEO Akash Bhatia and Lead Scientist Joseph Kibe reveal how AI reads Wikipedia index noise to pull out the winners.
With the World Cup well underway, people are finally getting a chance to check the quality of their predictions.
There's a certain amount of hubris in trying to predict the tournament winner.
In football, goals are hard to come by relative to other sports, so a single game is an unreliable guide to which team is better.
Even so, what can we say about individual teams that is perhaps a bit more interesting than comparing how good they are at goal scoring or winning games?
One take, from football analysts Chris Anderson and David Sally in their book The Numbers Game: Why Everything You Know About Soccer is Wrong, contends that football is a 'weak link' sport: a team's ability to score or prevent a goal depends heavily on the whole team.
According to this theory, the quality of a team's worst players matters more than how great their star players are: Lionel Messi's prowess cannot overcome the effects of him playing on a team of other relatively weak players.
How true is this, though?
We can begin to pick apart this question by first figuring out how to compare all the players from the different World Cup national teams. It is a little tricky to figure out how good the worst players on a given team are.
Lesser players tend to play on lesser-known teams that receive less coverage.
Fortunately, in the context of the World Cup this problem is somewhat tractable.
For starters, all of the players from the different national teams have a Wikipedia entry.
As a rough proxy for player quality, it turns out that the length of a player's Wikipedia article tends to correlate pretty well with how good that player is.
It's not perfect, of course. The signal gets much noisier as the players get worse.
For our purposes, however, it's a good enough proxy.
Ranking players by Wikipedia article length, for example, we see that the best players do indeed bubble to the top, which is a good gut check.
Player | Wikipedia Index (indexed to Ronaldo) |
---|---|
Cristiano Ronaldo (Portugal) | 100 |
Lionel Messi (Argentina) | 100 |
Luis Suarez (Uruguay) | 61 |
Neymar (Brazil) | 58 |
Eden Hazard (Belgium) | 51 |
Thiago Silva (Brazil) | 49 |
Sergio Aguero (Argentina) | 46 |
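
The indexing step behind this table can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used for the article: the function name `wikipedia_index` is hypothetical, and the article lengths are placeholder numbers chosen to show the arithmetic, which scales every length so that the reference player scores 100.

```python
def wikipedia_index(article_lengths, reference="Cristiano Ronaldo"):
    """Scale article lengths so that the reference player's score is 100."""
    ref_len = article_lengths[reference]
    if ref_len <= 0:
        raise ValueError("reference article length must be positive")
    return {name: 100 * length / ref_len
            for name, length in article_lengths.items()}


# Placeholder lengths in arbitrary units, NOT real Wikipedia measurements.
example_lengths = {
    "Cristiano Ronaldo": 200_000,
    "Player A": 100_000,
    "Player B": 50_000,
}

index = wikipedia_index(example_lengths)

# List players from the highest index to the lowest.
for name, score in sorted(index.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.0f}")
```

Any consistent measure of length (characters, bytes, words) would work here, since only the ratio to the reference article matters.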
While some may quibble about using a Wikipedia article index, we feel it is a reasonable tool for exploring the weak link hypothesis, provided the scores correspond well to player quality on average.
For example, Philippe Coutinho, who earns a hefty 4 million euros annually playing for Barcelona most of the year, ranks 45th by our Wikipedia index: not at the very top, but roughly in line with his still high level of skill.
With that mechanism, then, we can begin to explore this weak link hypothesis.
What happens if we rank the teams by their three best players and, separately, by their three worst, and look at the top five in each case?
In other words, we rank by how much of a Ronaldo or Messi the average of the top three, or bottom three, players on a given national team represents.
Team | Wikipedia Index (Mean of the three best players) |
---|---|
Argentina | 58 |
Portugal | 47 |
Brazil | 43 |
Uruguay | 37 |
Germany | 36 |

Ranking by the three weakest players

Team | Wikipedia Index (Mean of the three worst players) |
---|---|
England | 11 |
Spain | 8 |
Germany | 8 |
Brazil | 7.5 |
Portugal | 7.5 |
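
The team-level figures are plain averages over the extremes of each squad. A minimal sketch of that calculation follows. The helper name `best_and_worst_means` and the squad scores are hypothetical, made up purely for illustration, and the code assumes each squad has at least three indexed players.

```python
from statistics import mean


def best_and_worst_means(player_scores, k=3):
    """Return (mean of the k highest scores, mean of the k lowest scores)."""
    if len(player_scores) < k:
        raise ValueError("need at least k players")
    ranked = sorted(player_scores, reverse=True)
    return mean(ranked[:k]), mean(ranked[-k:])


# Hypothetical index scores for one squad, for illustration only.
squad = [90, 60, 30, 20, 10, 8, 6, 4, 2]

best, worst = best_and_worst_means(squad)
print(f"best three: {best:.1f}, worst three: {worst:.1f}")
```

Sorting the teams by the first value gives a 'star player' ranking, and sorting by the second gives a 'weakest link' ranking.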
Perhaps unsurprisingly, the better teams populate the top of the rankings in both lists.
The consistently good German and Brazilian teams make the top five on both the best and the worst players using our standardised Wikipedia index.
But there are also some surprises.
The Argentinian team that boasts the excellent Lionel Messi drops from 1st to 13th when we look at the quality of the worst players.
We can get an early 'sneak peek' at how some of these dynamics might be playing out by looking at results from the competition so far.
The game between Portugal and Spain played on June 15 offered one early opportunity in the competition to test this framework.
According to the weak link hypothesis, the teams ought to be relatively well-matched.
The two teams' three worst players average about the same Wikipedia index, at around eight points.
By contrast, the Portuguese team considerably outranks the Spanish on the best players: Portugal's 47 compared to Spain's 32.
In the end, the Portuguese managed to tie the score late in the match, thanks to a goal from Cristiano Ronaldo.
The result suggests that the large advantage of Portugal's superior roster of star players was not enough for them to dominate the Spanish team, as one would expect if star players alone decided matches.
Probably one of the most anticipated upcoming matches is Brazil-Belgium in the quarter-finals on Friday.
Interestingly enough, both teams placed in the top 10 on our Wikipedia index list for worst-ranked players, with Brazil and Belgium placing 4th and 7th respectively.
Brazil is slightly ahead on our index, at 7.52 to Belgium's 7.16.
The average of the best players, according to our index, doesn't tell a vastly different story, with Brazil still on top.
While we have to give the edge to Brazil according to our analysis, it would be premature to count Belgium out.
Brazil may be the favourite but only by a hair's breadth.
For us, the France-Uruguay quarter-final will be extremely exciting to watch.
When we look at the average of the three best players, Uruguay clearly beats France with a Wikipedia index score of 36.96 to 31.61.
On the flip side, when we look at the averages for the worst-ranked players, France has the edge on Uruguay with a score of 6.57 to 5.05.
This matchup goes to the very core of your outlook on football. It poses an almost philosophical question: are matches decided by the best players on the team, or are you only as good as your weakest link?
We are true believers that football is a team sport. As such we go with the team that has the best of the worst.
Vive la France!
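
For readers who want to replay these quarter-final comparisons, here is a small sketch of the two competing lenses. The figures are the appendix values for the four teams rounded to two decimals. The `pick` helper and the 'best'/'worst' labels are our own hypothetical shorthand, and the rule of simply taking the higher index is a simplification for illustration, not a forecasting model.

```python
# (best-three mean, worst-three mean), rounded from the appendix tables.
TEAMS = {
    "Brazil": (43.60, 7.52),
    "Belgium": (35.79, 7.16),
    "France": (31.61, 6.57),
    "Uruguay": (36.96, 5.05),
}


def pick(team_a, team_b, by="worst"):
    """Return the team with the higher index under the chosen lens."""
    col = {"best": 0, "worst": 1}[by]
    return max((team_a, team_b), key=lambda team: TEAMS[team][col])


for a, b in [("Brazil", "Belgium"), ("France", "Uruguay")]:
    print(f"{a} v {b}: star players favour {pick(a, b, 'best')}, "
          f"weakest links favour {pick(a, b, 'worst')}")
```

The two lenses agree on Brazil-Belgium and disagree on France-Uruguay, which is what makes the latter the more telling test of the hypothesis.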
Only time will tell whether the 2018 World Cup further validates or rejects this weak link hypothesis.
Even so, it provides a more interesting, almost sociological lens through which to watch the tournament's final stages.
Appendix A: All Teams, Mean of the 3 Worst Players' Wikipedia Indices

Team | Wikipedia Index (Mean of the three worst players) |
---|---|
England | 10.86061861 |
Spain | 7.997310852 |
Germany | 7.962876646 |
Brazil | 7.523860674 |
Portugal | 7.508673065 |
Netherlands | 7.162879442 |
Belgium | 7.155433481 |
France | 6.56515968 |
Mexico | 6.52465043 |
Japan | 6.225279815 |
Serbia | 6.166061582 |
Russia | 5.949725521 |
Argentina | 5.714465431 |
Sweden | 5.607076942 |
Croatia | 5.412460505 |
Switzerland | 5.402890968 |
Iran | 5.349935943 |
Costa Rica | 5.302034494 |
Poland | 5.165561063 |
Denmark | 5.159459138 |
Senegal | 5.123358327 |
Uruguay | 5.048656797 |
Nigeria | 5.021964239 |
Colombia | 5.018442864 |
South Korea | 4.992664251 |
Iceland | 4.987718198 |
Morocco | 4.73022775 |
Panama | 4.590447989 |
Saudi Arabia | 4.482844454 |
Egypt | 4.46448492 |
Peru | 4.435158051 |
Tunisia | 4.103933779 |

Appendix B: All Teams, Mean of the 3 Best Players' Wikipedia Indices

Team | Wikipedia Index (Mean of the three best players) |
---|---|
Argentina | 58.05064221 |
Portugal | 46.56829935 |
Brazil | 43.59754934 |
Uruguay | 36.96405886 |
Germany | 36.23712346 |
Belgium | 35.79109162 |
Croatia | 31.64710556 |
France | 31.61081659 |
Spain | 31.54756625 |
Colombia | 31.50883112 |
Mexico | 27.99205298 |
England | 27.68945668 |
Netherlands | 21.88711817 |
Poland | 21.19354174 |
Egypt | 20.90742333 |
South Korea | 20.56318878 |
Denmark | 20.31631622 |
Japan | 19.65601814 |
Serbia | 19.44016593 |
Sweden | 19.03579921 |
Nigeria | 18.69151091 |
Switzerland | 18.0436586 |
Iceland | 17.25444056 |
Morocco | 17.00358965 |
Iran | 16.85991219 |
Russia | 16.82942945 |
Costa Rica | 16.30520158 |
Senegal | 15.67810119 |
Peru | 14.94621974 |
Panama | 14.10246072 |
Tunisia | 8.935582497 |
Saudi Arabia | 8.663522697 |