Tuesday, September 2, 2014

Pitfalls of rating systems

A few years ago, YouTube changed its rating mechanism from a five-star scale to an upvote–downvote system. And it makes sense once you look at the typical distribution of ratings:

In most cases, users pick either 1 or 5. That's not very efficient informationally, but it's a fact: users are reluctant to rate videos across the entire scale.

This J-shaped distribution creates problems because the mean makes little sense here. When a website reports an average rating of 3.0, it means one of two things: either one person rated the video at 1 and another at 5, or both rated it at 3, which is almost never the case.
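A small illustration with invented ratings makes the ambiguity concrete: both samples below average to exactly 3.0, yet they describe opposite audiences.

```python
from statistics import mean

polarized = [1, 5, 1, 5, 1, 5]   # the typical J-shape: only extremes
lukewarm = [3, 3, 3, 3, 3, 3]    # what the mean suggests, rarely real

# Identical means, opposite meanings.
print(mean(polarized), mean(lukewarm))  # 3 3
```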

In an economy guided by ratings, the difference between these two interpretations is large and unpleasant. Since almost no one rates near the mean, a decision based on that mean is uninformed. In the end, you watch something that you'd later rate at 1 or 5, not 3. It's as if you'd ordered a steak and the waiter brought you sushi.

The worst thing about this risk is that it's implicit. Users look at ratings to reduce the risk of making a wrong choice, but instead they gamble between 1 and 5. Fortunately, the ratings aren't entirely random: they're conditioned on observable traits, like gender, age, and interests. The means may then start working. Just check whether those 1s and 5s were set by distinct demographic groups.
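A sketch of what conditioning on an observed trait looks like, with made-up (rating, age-group) pairs: the overall mean is a useless 3.0, but the per-group means cleanly separate the 1s from the 5s.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (rating, age_group) pairs.
ratings = [(5, "18-24"), (5, "18-24"), (1, "45-54"),
           (1, "45-54"), (5, "18-24"), (1, "45-54")]

by_group = defaultdict(list)
for score, group in ratings:
    by_group[group].append(score)

overall = mean(score for score, _ in ratings)
print(overall)                                   # 3 — tells you nothing
print({g: mean(s) for g, s in by_group.items()})  # each group is unanimous
```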

Of course, this now takes hundreds of 1s and 5s, because the degrees of freedom go down with each factor we add to the equation. So how do we get more ratings?

The solution is exactly what YouTube did: replace the five-star scale with a binary choice. Users don't like investing time in choosing the proper rating, so a thumbs up or down reduces decision fatigue.
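With a binary scale, aggregation is trivial too: the rating is just the share of thumbs-up, and every vote carries information because there are no unused middle grades. The counts below are invented.

```python
# Hypothetical vote counts for one video.
ups, downs = 940, 60

like_ratio = ups / (ups + downs)
print(f"{like_ratio:.0%} liked this video")  # 94% liked this video
```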

More ratings make it possible to compute means for subgroups of users. These subratings become more relevant for those who search for content by its rating. Though YouTube hasn't built customized ratings yet, that's an option for many web services relying on user feedback.

While Uber and Fiverr could improve their rating systems by reducing them to binary choices, a scale is still a good choice for, say, IMDb. After watching a movie for two hours, you put more thought into the rating than after YouTube's typical three-minute clip. And then multiple peaks emerge for controversial movies:

The mean and the median sit near each other in a sort of Poisson-like central mass, while the other two peaks form around the radical 1 and 10. So you need more than two grades on the scale.

Conventional hits show the YouTube pattern, though:

It again looks like a Poisson-like distribution with a disproportionate number of 1s.

In the end, a good rating system has to balance the desirable number of votes against the size of the scale.
