Saturday, August 30, 2014

Athletes vs. data scientists

Competitions among athletes have quite a long history. Armchair sports don't. Chess, which comes to mind first, became an important sport, but only in the 20th century.

An even younger example is data competitions. Kaggle, CrowdANALYTIX, and HackerRank are the major platforms here.

But do data scientists compete as furiously as athletes? Well, in some cases, yes. Here's one example:

(see appendix for how the datasets were constructed)

The Merk and Census competitions have about the same number of participants and comparable rewards (though winners of the Census competition had to be US citizens). It may seem surprising that their results look so different. I'll get back to this in the next post on data competitions.

Technically, all the competitions look alike. The lower bound is zero (minutes, seconds, errors), though only a comparison against the baseline makes sense. Over time, the baseline in sports has declined:

(Winning time for 100m. Source.)

A two-second (-18%) improvement in 112 years.

Competitions in a single dataset look like this (more is better):

(Restricted sample taken from

In general, the quality of predictions increases substantially over the first few weeks. Then the marginal returns to effort decrease. That's interesting because participants make hundreds of submissions to gain an improvement in the third decimal place. That's a lot of work for a normally modest monetary reward. And, well, the monetary reward makes no sense at all. A prize of $25–50K goes to winners who compete against 250 other teams. That adds up to thousands of hours of data analysis, basically unpaid. This unpaid work doesn't sound attractive even to sponsors (hosts), which are very careful about paying for crowdsourcing. So, yes, it's a sport, not work.

Athletics has no overfitting, but it's an issue in data competitions. For example, here is a comparison of public and private rankings for one of the competitions:

Username            Public rank   Private rank
Jared Huling        1             283
Attila Balogh       3             231
Issam Laradji       5             9
Ankush Shah         6             11
Thakur Raj Anand    8             247
Manuel Días         9             316

The public rank is computed from predictions on the public dataset. The private rank is based on a different sample unavailable before the finals. The difference is largely attributed to overfitting noisy data (and submitting best-performing random results).
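Here is a rough sketch of how noise alone can produce gaps like these. The setup is my assumption for illustration, not the actual competition: 250 teams with identical true skill, whose public and private scores differ only by independent Gaussian noise. The function names and the noise level are hypothetical.

```python
import random


def private_rank_of_public_winner(n_teams, rng, noise_sd=1.0):
    """Private rank (1 = best) of the team that tops the public leaderboard."""
    # Assumption: every team has the same true skill (0.0), so scores are pure noise.
    public = [rng.gauss(0.0, noise_sd) for _ in range(n_teams)]
    private = [rng.gauss(0.0, noise_sd) for _ in range(n_teams)]
    # Higher score is better; pick the team with the best public score.
    winner = max(range(n_teams), key=public.__getitem__)
    # Rank = 1 + number of teams with a strictly better private score.
    return 1 + sum(score > private[winner] for score in private)


def mean_private_rank(n_teams=250, n_trials=500, seed=0):
    """Average private rank of the public winner over many simulated contests."""
    rng = random.Random(seed)
    ranks = [private_rank_of_public_winner(n_teams, rng) for _ in range(n_trials)]
    return sum(ranks) / n_trials


mean_rank = mean_private_rank()
print(f"Average private rank of the public winner: {mean_rank:.1f} of 250")
```

In this extreme case the public winner ends up around the middle of the private leaderboard on average. Real competitions mix skill and noise, so this is only the noisy end of the spectrum.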

In data competitions, your training is not equal to your performance. That holds for sports as well. Athletes break world records during training sessions and then finish far away from the top in real competitions.

This has a perfectly good statistical explanation, apart from psychology. In official events, the sample is smaller: mostly a single trial. Several trials are allowed only in non-simultaneous sports, like the high jump. The sample is many times larger during training. And you're more likely to find an extreme result in a larger sample.
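The sample-size argument can be checked with a small simulation. This is a minimal sketch under assumptions of my own: a sprinter's 100m times are independent draws from a normal distribution with a hypothetical mean of 10.0 s and standard deviation of 0.1 s, with 200 training runs against one official run per season.

```python
import random


def best_time(n_runs, rng, mean=10.0, sd=0.1):
    """Fastest (smallest) time among n_runs independent draws."""
    return min(rng.gauss(mean, sd) for _ in range(n_runs))


def simulate(n_seasons=2000, n_training=200, seed=0):
    """Compare the best training run with a single official run, season by season."""
    rng = random.Random(seed)
    training_best = [best_time(n_training, rng) for _ in range(n_seasons)]
    official = [best_time(1, rng) for _ in range(n_seasons)]
    avg_training = sum(training_best) / n_seasons
    avg_official = sum(official) / n_seasons
    # Share of seasons in which the best training run beats the official run.
    share = sum(t < o for t, o in zip(training_best, official)) / n_seasons
    return avg_training, avg_official, share


avg_training, avg_official, share = simulate()
print(f"Average best training time: {avg_training:.3f} s")
print(f"Average official time:      {avg_official:.3f} s")
print(f"Training best is faster in {share:.1%} of seasons")
```

The athlete is equally good in both settings here. The only difference is how many draws each setting gets, and that alone makes the training best look much faster.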

Anyway, though these competitions look like fun and games, they're also simple models for understanding complex processes. Measuring performance has value for human lives. For instance, hiring in most firms is a single-trial interview. And HR folks use simple heuristic rules for candidate assessment. When candidates are nervous, they fail their trial.

Some firms, like major IT companies, do more interviews. Not because they want to help candidates, but because they have more stakeholders whose opinion matters. But this policy increases the number of trials, so these companies hire smarter.

We don't have many counterfactuals for HR failures, but we can see how inefficient single trials are compared to multiple trials in sports.

Appendix: The data for the first graph

This graph was constructed in the following way.

First, I took the data for major competitions:

  • Athletics, 100m, men. 2012 Olympic Games in London. Link.
  • Biathlon, 10km, men. 2014 Olympic Games in Sochi. Link.
  • Private leaderboard. Census competition on Kaggle. Link.
  • Private leaderboard. Merk competition on Kaggle. Link.

Naturally, ranking criteria are different: minutes for biathlon, seconds for athletics, weighted mean absolute error for Census, and R^2 for Merk. For all but Merk, less is better. I converted the Merk metric to the same direction by taking ( 1 − R^2 ). That is, I ranked players in the Merk competition by the share of variance left unexplained by the models.

Then, in each competition, I took the first-place result as 100% and expressed the other results as a percentage of it. Subtracting 100 gives each participant's gap to the winner in percent, which is what the graph shows.
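The conversion described in this appendix can be sketched in a few lines. The function names and the sample numbers below are made up for illustration. They are not the actual competition results.

```python
def to_lower_is_better(r_squared):
    """Turn R^2 (more is better) into unexplained variance (less is better)."""
    return 1.0 - r_squared


def gap_to_winner(results):
    """Percent gap to first place for lower-is-better results, ordered by place."""
    ordered = sorted(results)
    best = ordered[0]
    if best <= 0:
        raise ValueError("the winning result must be positive to serve as 100%")
    # Winner = 100%; express everyone as a percentage of that, then subtract 100.
    return [100.0 * (r / best) - 100.0 for r in ordered]


# Hypothetical sprint times in seconds (already lower-is-better).
sprint_gaps = gap_to_winner([9.90, 9.80, 10.00])

# Hypothetical R^2 scores, converted to 1 - R^2 before ranking.
r2_scores = [0.50, 0.45, 0.40]
r2_gaps = gap_to_winner([to_lower_is_better(r) for r in r2_scores])

print([round(g, 2) for g in sprint_gaps])
print([round(g, 2) for g in r2_gaps])
```

Dividing by the winner's result puts seconds, minutes, and error metrics on one scale, so all four competitions fit on a single chart.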
