Sunday, August 31, 2014

Answering questions already answered on Stack Exchange

In one of the previous posts, the relative distribution of upvotes on Stack Exchange suggested that readers have little demand for yet another answer to a question, because they give at least half of all upvotes to the first-placed answer alone:

(y-axis: the answer's mean fraction of upvotes in a question; x-axis is the position of the answer)

But the absolute number of upvotes tells why answers still appear:

(y-axis: mean of upvotes; other notation is the same)

Questions with few answers happen to be unpopular in general (that's why they have fewer answers in the first place). You can barely notice upvotes for questions with a single answer. But upvotes grow with the number of answers, as the graph shows. The last frame describes the case of 16 answers; it's better to be careful beyond that because the original sample has too few questions with many answers.

The bottom line is that answering a question with many answers may be more useful to users than answering questions with no answers at all. That's at least valid under the assumption of unconditional expectations. Controlling for time is the next most useful thing to do to understand how demand and supply operate in Q&A markets and what can be done to make them more efficient.

Saturday, August 30, 2014

Athletes vs. data scientists

Competitions among athletes have quite a long history. Armchair sports don't. Chess, which comes to mind first, became an important sport, but only in the 20th century.

An even younger example is data-related competitions. Kaggle, CrowdANALYTIX, and HackerRank are major platforms in this case.

But do data scientists compete as furiously as athletes? Well, in some cases, yes. Here's one example:

(see appendix for how the datasets were constructed)

The Merck and Census competitions have about the same number of participants and comparable rewards (but winners of the Census competition were restricted to US citizens only). It may seem surprising that their results look so different. I'll get back to this in the next post on data competitions.

Technically, all the competitions look alike. The lower bound is zero (minutes, seconds, errors), though only the baseline comparison makes sense. Over time, the baseline for sports declined:

(Winning time for 100m. Source.)

A two-second (-18%) improvement in 112 years.

Competitions in a single dataset look like this (more is better):

(Restricted sample taken from chmullig.com)

In general, the quality of predictions increases substantially over the first few weeks. Then marginal returns from effort decrease. That's interesting because participants make hundreds of submissions to beat numbers three places beyond the decimal point. That's a lot of work for a normally modest monetary reward. And, well, the monetary reward makes no sense at all. A prize of $25–50K goes to winners who compete with 250 other teams. That adds up to thousands of hours of data analysis, basically unpaid. This unpaid work doesn't sound attractive even to sponsors (hosts), which are very careful about paying for crowdsourcing. So, yes, it's a sport, not work.

Athletics has no overfitting, but that's an issue in data competitions. For example, here is a comparison of public and private rankings for one of the competitions:

Username           Public rank   Private rank
Jared Huling       1             283
Yevgeniy           2             7
Attila Balogh      3             231
Abhishek           4             6
Issam Laradji      5             9
Ankush Shah        6             11
Grothendieck       7             50
Thakur Raj Anand   8             247
Manuel Días        9             316
Juventino          10            27
(Source)

The public rank is computed from predictions on the public dataset. The private rank is based on a different sample unavailable before the finals. The difference is largely attributed to overfitting noisy data (and submitting best-performing random results).

In data competitions, your training is not equal to your performance. That's valid for sports as well. Athletes break world records during training sessions and then finish far away from the top in real competitions.

This has a purely statistical explanation, apart from psychology. In official events, the sample is smaller: mostly a single trial. Several trials are allowed only in non-simultaneous sports, like the high jump. The sample is many times larger during training. And you're more likely to find an extreme result in a larger sample.
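A minimal simulation sketch of this sampling argument. It assumes an athlete's times are independent normal draws with a made-up mean and spread (both numbers are hypothetical, not taken from any real athlete), and compares the average best result from one trial with the average best result from 100 trials.

```python
import random


def best_time(n_trials, rng, mean=10.0, sd=0.1):
    """Best (lowest) time among n_trials independent normal draws.

    mean and sd are hypothetical illustration values, in seconds.
    """
    return min(rng.gauss(mean, sd) for _ in range(n_trials))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_sim = 2000

# Average best result over many simulated "official events" (one trial each)
# and many simulated "training periods" (100 trials each).
official = sum(best_time(1, rng) for _ in range(n_sim)) / n_sim
training = sum(best_time(100, rng) for _ in range(n_sim)) / n_sim

print(f"mean best time, 1 trial:    {official:.3f}")
print(f"mean best time, 100 trials: {training:.3f}")
```

The athlete's ability is identical in both cases; only the number of draws differs, and the larger sample alone produces the better "record".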

Anyway, though these competitions look like fun and games, they're also simple models for understanding complex processes. Measuring performance has value for human lives. For instance, hiring in most firms is a single-trial interview. And HR folks use simple heuristic rules for candidate assessment. When candidates are nervous, they fail their trial.

Some firms, like major IT companies, do more interviews. Not because they want to help candidates, but because they have more stakeholders whose opinion matters. But this policy increases the number of trials, so these companies hire smarter.

We don't have many counterfactuals for HR failures, but we can see how inefficient single trials are compared to multiple trials in sports.

Appendix: The data for the first graph

This graph was constructed in the following way.

First, I took the data for major competitions:

  • Athletics, 100m, men. 2012 Olympic Games in London. Link.
  • Biathlon, 10km, men. 2014 Olympic Games in Sochi. Link.
  • Private leaderboard. Census competition on Kaggle. Link.
  • Private leaderboard. Merck competition on Kaggle. Link.
Naturally, ranking criteria differ: minutes for biathlon, seconds for athletics, weighted mean absolute error for Census, and R^2 for Merck. All but Merck are ranked so that less is better. I converted the Merck metric to the same direction by taking (1 − R^2). That is, I ranked players in the Merck competition by the variance left unexplained by their models.

Then in each competition, I took the first place's result as 100% and expressed the other results as a percentage of this result. After subtracting 100, I had the graph.
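The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the original script: the function names are mine, and the sample numbers are made up rather than taken from the real leaderboards.

```python
def unexplained_variance(r_squared_scores):
    """Turn higher-is-better R^2 scores into lower-is-better 1 - R^2."""
    return [1.0 - r2 for r2 in r_squared_scores]


def gap_to_winner(results):
    """Gap to the first place, in percent of the first place's result.

    results must be lower-is-better (times, errors, or 1 - R^2).
    The winner gets 0, everyone else a positive number.
    """
    ordered = sorted(results)
    best = ordered[0]
    return [100.0 * r / best - 100.0 for r in ordered]


# Hypothetical results, for illustration only.
times = [10.0, 10.5, 11.0]   # e.g. seconds
r2 = [0.50, 0.45, 0.40]      # e.g. R^2 scores

print(gap_to_winner(times))
print(gap_to_winner(unexplained_variance(r2)))
```

Because every series is divided by its own winner's result, the units (minutes, seconds, errors) drop out and the curves can share one axis.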

Tuesday, August 26, 2014

Stack Exchange and reward for being on the top

As mentioned in the previous posts, Stack Exchange has a very interpretable structure. It's a market in which demand for answering a question meets supply, and supply is paid with upvotes. Such a crude interpretation is necessary for learning how knowledge exchange works.

I once looked into the demand side of Stack Exchange, but now a few points on the supply side. In general, we are interested in efficient allocation of resources. Given that one answer is sometimes enough (especially for software development questions), many answers may be a waste.

And that's the distribution of answers per question:


Well, it's a peak at 2 with a long tail. The details:


number of answers Freq. Percent Cum.
1 2,123 18.35 18.35
2 2,601 22.48 40.83
3 2,138 18.48 59.3
4 1,458 12.6 71.9
5 967 8.36 80.26
6 674 5.82 86.09
7 461 3.98 90.07
8 325 2.81 92.88
9 190 1.64 94.52
10 135 1.17 95.69


About 80 percent of questions end up with five answers or fewer.

The Reward for Being on the Top

But what's the reward for having your answer on the top of the others? These are the means of fractions of total upvotes by the position a given answer occupies:


It says that the answer on the top has a stable advantage over all other answers to a given question. You can see that after the fifth answer, adding more answers does not decrease total upvotes given to the existing answers. And the first answer gets no less than half of all upvotes.

That's a huge bonus, since multiple other answers have to split the remaining half of upvotes. That may be discouraging for participants, as competition is high and the winner takes all.
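Here's a rough sketch of how such means can be computed. It assumes that an answer's position is its rank by upvotes within the question and that questions with zero total upvotes are skipped because the fraction is undefined. Both are my assumptions for illustration, and the input numbers are made up.

```python
from collections import defaultdict


def mean_fraction_by_position(questions):
    """Mean share of a question's upvotes received at each answer position.

    questions: list of lists, each inner list holding the upvote counts
    of the answers to one question. Position 1 is the most upvoted answer
    (an assumption of this sketch).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for upvotes in questions:
        total = sum(upvotes)
        if total == 0:
            continue  # fraction undefined, skip the question
        ranked = sorted(upvotes, reverse=True)
        for position, votes in enumerate(ranked, start=1):
            sums[position] += votes / total
            counts[position] += 1
    return {p: sums[p] / counts[p] for p in sorted(sums)}


# Hypothetical questions: three answers, two answers, and one with no upvotes.
sample = [[6, 3, 1], [4, 4], [0, 0]]
shares = mean_fraction_by_position(sample)
print(shares)
```

Note that each position's mean is taken only over questions that have an answer at that position, so the means across positions need not sum to one.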

Sample summary statistics



Variable Obs Mean Std. Dev. Min Max
upvotes 45463 5.390141 24.11624 0 1553
downvotes 45463 0.1868992 0.9164532 0 82
net (up - down) 45463 5.203242 23.94713 -19 1552
position 45463 4.449794 6.709625 1 114
total_answers 45463 7.899589 10.94102 1 114
relative position 45463 0.6272573 0.2922457 0.0087719 1
total_uv_b~q 45463 67.75934 221.0013 0 2488
frac_uv 44188 0.2440255 0.3074606 0 1

Monday, August 25, 2014

NYT-speak continued

The New York Times' choice of words tells much about history and the media. As seen before, their Chronicle shows great snapshots of the newspaper's wording evolution.

Here, a few more cases.

Information sourcing

As Robert Fisk once noted, the media now rely more on what officials said, rather than sourcing news by themselves. That's a confirmation:



Money and knowledge in crises



Though in general money and knowledge move in the same directions, money moves in greater magnitudes. Also, notice that in the Great Depression, as well as in the Great Recession, the NYT mentioned money less frequently. And the opposite happened during stagflation in the 70s.

Inflation and unemployment



Mentions of unemployment and inflation went in different directions before the 70s, right until the economy happened to have both. But that was a short period, and right now there is no relation between the two (at least, in wording).

In- and equality



Inequality was never an issue for the NYT, even in the late 20s, when inequality was extremely high. So, it's a new topic. Meanwhile, previous mentions of equality are generally associated with civil rights movements, as in the 60s.

"Make war, not love"




At their peak, war themes took up to 30% of the newspaper's materials. But local wars, like Iraq and Afghanistan, never drew so much attention.

Referring to minorities


A similar graph was in the previous post, but here changes in wording are clearer. Especially right after the Civil War, when politicians no longer needed support from the black population, and one hundred years later, when politicians and media had to update their vocabulary.

Sports becoming more popular




Saturday, August 23, 2014

Top 1% on Stack Exchange, or inequality in Q&A markets

Yesterday's review of voting activity showed how Stack Exchange users evaluate each other's contributions. Contrary to some views, users don't go berserk despite anonymity and in general balance their judgment.

Upvotes are a currency in a moneyless economy, like that of Stack Exchange. It doesn't mean people do things for upvotes. In this survey, reciprocity comes first as motivation, while upvotes lag behind. But, like money, upvotes often measure one's contribution to the community. Though upvotes are in unlimited supply, upvoting has its own costs (yes, the costs of clicking on buttons), and so it can measure something. For instance, inequality.

Gini

Here's the distribution of upvotes that a particular user got over time:


The long tail is wagging behind, but the inequality is very high.

The Gini coefficient for Stack Exchange is 0.85 for upvotes. That's higher than in South Africa (0.63), which has the highest inequality among countries.
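For reference, a minimal sketch of how a Gini coefficient can be computed from per-user upvote totals. It uses the standard sorted-rank formula; it is not necessarily the exact routine behind the 0.85 figure, and the sample inputs are made up.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    0 means everyone has the same amount; values close to 1 mean
    that almost everything belongs to a few.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with x sorted ascending
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


print(gini([1, 1, 1, 1]))  # perfectly equal
print(gini([0, 0, 0, 1]))  # one user holds everything
```

With a finite sample of n users, the maximum of this formula is (n − 1) / n rather than exactly 1.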

Top 1% and 0.1%

High inequality leads to the question of what the top 1% of users own. Well, they own 42.8% of all upvotes. Actually, it's just about the same percentage that the richest 1% of Americans now control in the national economy.

The top 0.1% of users own 15.4% of all upvotes.

Also, Pareto's law nearly holds: the top 20% of users own 87.7% of upvotes.
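Top shares like these come from sorting users by upvotes and summing the head of the list. A small sketch, with made-up numbers and a function name of my own:

```python
def top_share(values, fraction):
    """Share of the total held by the top `fraction` of holders.

    The group size is rounded to the nearest whole holder, with a
    minimum of one (a choice made for this sketch).
    """
    xs = sorted(values, reverse=True)
    total = sum(xs)
    if not xs or total == 0:
        return 0.0
    k = max(1, round(fraction * len(xs)))
    return sum(xs[:k]) / total


# Ten hypothetical users with 100 upvotes between them.
upvotes = [50, 20, 10, 5, 5, 4, 3, 2, 1, 0]
print(top_share(upvotes, 0.10))  # share of the single top user
print(top_share(upvotes, 0.20))  # share of the top two users
```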

How to get rich in this economy

In brief: by being rich. This is why:

(users with fewer than 10 votes in total are excluded as bypassers)

Until you earn about ten upvotes, the only thing that grows is your downvotes. That's a mentoring stage, when novices are downvoted as people who do not read the rules. But after you have accumulated this minimum, upvotes grow relative to downvotes. And they grow rapidly.

This result is robust to including bypassers back in the sample (see how the line changes the slope at ln(downvotes) ≈ 2):


So, in general, you have more when you already have enough.

Data Appendix

Analysis is done with a 1% sample of users. Data is available at data.stackexchange.com, while replication files will soon be posted online.

Friday, August 22, 2014

Unrestricted evil

Online services before Facebook were mostly anonymous. At least, no one required real names and SMS confirmations. You sign up and write anything. I mean anything.

There were scary fairy tales about writing anything without putting your real name on it. Various regulations imposed on the Internet, especially in non-democracies, rely on these tales. But is the opposite—real names and full disclosure—really necessary for a good-standing community?

Let's check. The anonymous culture is still alive on some resources. Today is not about 4chan, but about Stack Exchange, a major Q&A website. Stack Exchange has an open data tool for querying data. The tool is quite useful for testing various hypotheses about human communities. It would be a service to humanity if other web services offered similar openness, but so far we have few.

Stack Exchange (SE) doesn't ask for names and so on. Though real names are common and some employers ask candidates for links to their SE profiles, the service is basically anonymous. We expect dirty things to happen.

One dirty thing is excessive downvotes for questions and answers others post. Kind of vandalism. And here's the graph:


The graph shows net votes (upvotes - downvotes) for each user with 10+ upvotes or downvotes from a 1% sample (about 31K users). That small tick on the left is users who mostly downvote.

A slightly different perspective:

(The axes were log-linearized for easy reading.)

User behavior happens to be extremely balanced. Few users lean toward either extreme, only upvoting or only downvoting. Most of them try to be honest.

So, showing your name is not necessary for good behavior. Online communities can manage themselves without references to the official world, regulations, and witch-hunting. It only matters what environment people want to be in, and then they'll be able to recreate it online.

Thursday, August 21, 2014

Startups across countries

A few plots in addition to yesterday's post on startups.

Startups and economic development


Sources: CrunchBase.com dataset and Penn World Table 7.0.

That's not a bad fit for relations between startups and GDP. The number of startups in the dataset seems to be a good indicator of entrepreneurial activity in general.

Startup nation

Here's an illustration for Dan Senor and Saul Singer's thesis about Startup Nation:


Israel has relatively more startups than the US. Tel Aviv and Silicon Valley drive the numbers for their countries, so it's not exactly a nation-wide phenomenon. You could call the book Startup City, though the result is no less impressive.

Web data and language barriers

Like other sources based on voluntary reporting, CrunchBase may have data biased in one way or another. For example, it may underrepresent countries in which English is not a major language. And we expect a bias in favor of bigger firms. And here's the case:


China and Russia indeed either have bigger startups on average or just underreport to CrunchBase. The latter is more likely, because these are exactly the two major countries that stand behind a language firewall. They have their own Facebooks, Twitters, and Amazons. So, we expect them to be less active on CrunchBase. More so:


The surprising break after the 90th percentile separates countries into two groups. What are the groups? Look here:

(US and UK are excluded to make the graph readable. 100+ startup countries included.)

Group 1 consists of countries with < 0.02 startups per 1,000 inhabitants; Group 2 is the rest. As a result, Group 2 contains countries where English plays an explicitly large role. So, the break indeed looks like a language thing.
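The grouping rule is simple enough to write down. Below is a sketch with hypothetical countries and counts (the names and numbers are placeholders, not values from the dataset); only the 0.02 threshold comes from the text.

```python
def split_by_startup_density(countries, threshold=0.02):
    """Split countries by startups per 1,000 inhabitants.

    countries: list of (name, startups, population) tuples.
    Returns (group1, group2): names below the threshold, and the rest.
    """
    group1, group2 = [], []
    for name, startups, population in countries:
        density = 1000.0 * startups / population
        if density < threshold:
            group1.append(name)
        else:
            group2.append(name)
    return group1, group2


# Placeholder countries, for illustration only.
sample = [
    ("Country A", 100, 10_000_000),  # 0.01 startups per 1,000 inhabitants
    ("Country B", 500, 5_000_000),   # 0.10 startups per 1,000 inhabitants
]
low, high = split_by_startup_density(sample)
print(low, high)
```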

Nevertheless, language per se is not a big factor in development, so it doesn't bias the data on GDP in a systematic way. (You can also control the very first plot for the percentage of English-speaking population.)

Wednesday, August 20, 2014

Investing and failures in startups

The efficient market hypothesis got a bad press after 2008. Not surprisingly. It's a half-truth. For instance, what Robert Shiller identified as genuine mispricing Robert Lucas called a minor deviation. Also, the hypothesis has many interpretations, and here's one of them.


(data link)

On the left we have the mean amount of money that startups received over their lifetime. On the right is a crude measure of risk: the ratio of acquisitions to closed companies in the respective market. So, enterprise software has three successful acquisitions per failure. I dropped "operating" startups because it's difficult to interpret their success.
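A sketch of this risk measure, assuming the data comes as (market, status) pairs with statuses such as "acquired", "closed", and "operating". The status labels, the function name, and the sample rows are my assumptions for illustration.

```python
from collections import Counter


def acquisition_to_closure_ratio(startups):
    """Ratio of acquired to closed startups, per market.

    startups: iterable of (market, status) pairs. Any status other than
    "acquired" or "closed" (for example "operating") is ignored.
    Markets without closed startups are left out to avoid dividing by zero.
    """
    acquired = Counter()
    closed = Counter()
    for market, status in startups:
        if status == "acquired":
            acquired[market] += 1
        elif status == "closed":
            closed[market] += 1
    return {market: acquired[market] / n for market, n in closed.items() if n > 0}


# Hypothetical rows, for illustration only.
rows = [
    ("enterprise software", "acquired"),
    ("enterprise software", "acquired"),
    ("enterprise software", "acquired"),
    ("enterprise software", "closed"),
    ("clean tech", "acquired"),
    ("clean tech", "closed"),
    ("clean tech", "closed"),
    ("clean tech", "operating"),
]
print(acquisition_to_closure_ratio(rows))
```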

The graph is interesting because clean tech gets much funding but has one acquisition per two failures. Analytics gets small funds (not as sexy as it was called?), but gives very stable outcomes. These two are exceptions because in general funding matches the risk measure. And so it is in other markets: it's enough for one product (like housing) to have abnormal pricing for the entire market to be at risk.

That is an attempt to make complex things embarrassingly simple, of course. For example, some may insist that average funding is a measure of capital intensity, not of competition among investors. Or that we should honestly calculate returns, as was done here. But it all seems to be half-truths, including this piece. We have to keep watching.

Tuesday, August 19, 2014

The cost of being in the top, or when Zipf's law breaks

Many things in the world have a Zipfian distribution. Xavier Gabaix recently attempted to explain how this pattern may emerge for the case of cities. Lada Adamic covered how Zipf's law goes online. The literature is vast because any field has its own examples.

But Zipf's law is a market-type pattern: the outcome of many agents making decisions independently. When you have a planner, it may break.

Here's an example:


(Data comes from generous Wikipedians.)

Zipf's law implies that here you must see a nice linear relation along the ranks. Instead, you see three lines: top 3 (I), top 10 (II), and 30 songs (III). The slopes of the lines are different in an interesting way.

Lines II and III have the same slope and a discontinuity between 10 and 11. (Actually, the discontinuity is between 9 and 10, but the 10th song is Katy Perry's 2013 release, while songs around it are older. Kind of an outlier.)

This discontinuity is the bonus for being in the Top 10. If any magazine or website mentions YouTube's most viewed videos, it's usually something like the top 3 or the top 10. Other songs don't get much attention, even if they're equally good.

But there's another effect: the bonus for taking the first place. This is what Line I and Line II are about. Psy goes through the roof of Zipf's law just as Bieber did before him.
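One way to check a rank-size relation like this is to regress ln(views) on ln(rank) and look at the slope; under an exact Zipf's law the slope is −1. The sketch below fits that line by ordinary least squares on synthetic data (the view counts are generated, not the YouTube numbers), and the same function could be applied to each rank segment separately.

```python
import math


def loglog_slope(values, first_rank=1):
    """OLS slope of ln(value) on ln(rank).

    values are sorted in descending order and assigned consecutive ranks
    starting at first_rank.
    """
    ys = [math.log(v) for v in sorted(values, reverse=True)]
    xs = [math.log(first_rank + i) for i in range(len(ys))]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic views that follow Zipf's law exactly: views = C / rank.
views = [1_000_000 / rank for rank in range(1, 31)]
slope_all = loglog_slope(views)
slope_tail = loglog_slope(views[10:], first_rank=11)  # ranks 11 to 30 only
print(slope_all, slope_tail)
```

On data like the chart's, different slopes or intercepts across segments would show up as breaks between the fitted lines.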

So, why is it about planning? Well, without the media, there would be no tops. And tops are coordinating devices that say what we should watch, listen to, and do first of all.

That is planning.


Sunday, August 17, 2014

Changes in NYT-speak

The New York Times released a tool for tracking words in the newspaper's historical issues. Google had done this for all books published after 1500, but NYT is a reputable source, and it's interesting to compare the results.

Civil Rights Movement in the 60s

The African-American minority remains discriminated against even after the 1960s, but the media changed their policies in telling stories about it:


Democracy on War

Mentions of "democracy" hike during international conflicts to mobilize the population:


Ideological enemies

You have an instant peak as the enemy appears, but gradually the topic comes out of fashion. Also, Communism lasted a little longer, though it's nothing compared to the mass media's reaction to 9/11. That's interesting because the Cuban Missile Crisis and other Cold War conflicts were far more dangerous than anything Bin Laden could do.


The Times started preferring different wording for Nazis before Hitler disclosed his plans in 1939:



National priorities

After being a marginal topic for almost a century, security took over 12.5% of the Times' publications (while security in fact increased over the same period). Values like freedom remained at a stable 2.5% for over 150 years: