OkCupid Statistical Analyses

by Gilbert Keith

OkCupid has posted another big data dump. The post was inspired by a graph plotting a woman’s average “rating” against the number of messages she receives per unit of time.

I do find the question somewhat interesting; I would have expected the variance in the rating/message distribution to uniformly decrease as the rating increased. That is, at lower ratings, the distribution would be more normal, whereas at high ratings, the distribution would have a high kurtosis, or a sharper peak. Instead, the graph shows that at lower ratings there is a high positive skew in the number of messages received, and as the ratings increase, the distribution becomes more normal.
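To make the terms concrete, here is a minimal Python sketch of how skew and kurtosis can be measured. The two samples are synthetic stand-ins, not OkCupid data: an exponential sample plays the part of a positively skewed message distribution, and a Gaussian sample plays the part of a "more normal" one. The function name and all the numbers are my own assumptions.

```python
import random

def skew_and_excess_kurtosis(xs):
    """Return (skewness, excess kurtosis) from the sample's central moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skew, excess_kurtosis

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic message counts (assumption): right-skewed vs. roughly symmetric.
skewed = [rng.expovariate(1.0) for _ in range(20000)]
symmetric = [rng.gauss(10.0, 2.0) for _ in range(20000)]

skew_a, kurt_a = skew_and_excess_kurtosis(skewed)
skew_b, kurt_b = skew_and_excess_kurtosis(symmetric)

print(f"skewed sample:    skew={skew_a:.2f}, excess kurtosis={kurt_a:.2f}")
print(f"symmetric sample: skew={skew_b:.2f}, excess kurtosis={kurt_b:.2f}")
```

A positive skew means a long right tail (a few women getting far more messages than most), while excess kurtosis near zero is what a normal distribution would give.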

The analysis goes on to show that for women with higher rating variances, the number of solicitations they are likely to receive is several basis points above the average. The writers then present a regression model that gives the number of messages a woman receives as a function of the number of 1 votes, 2 votes, 3 votes, and so on that she gets. The exercise yields a formula in terms of the 1, 2, 4, and 5 votes (with a constant term at the end).
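For readers who prefer symbols, the general shape of such a model can be written as below. This is only a sketch of the form: I am assuming that m_i stands for the number of votes of value i a woman received, and the coefficients β_i are placeholders, not OkCupid's fitted values.

```latex
% Full model: one term per vote value, plus a constant k
\text{msgs} = \beta_1 m_1 + \beta_2 m_2 + \beta_3 m_3 + \beta_4 m_4 + \beta_5 m_5 + k

% Model after dropping the m_3 term
\text{msgs} = \beta_1 m_1 + \beta_2 m_2 + \beta_4 m_4 + \beta_5 m_5 + k
```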

The post even explains why the 3 term is missing:

If You’re Into Algebra

We did a regression on the votes for and messages to a sample of 43,000 women. To keep everything consistent, all the women were straight, between the ages of 20 and 27, and lived in the same city. The formula given in the body of the post was the best-fit we found on our second regression, after dropping the m3 term because its p-value was very near 1.

msgs are the number of messages the woman received during the observation period. The constant k reflects her overall level of site activity. For this equation, R² = .28, which isn’t great in a lab or on a problem set, but is actually very good in a real-world environment.

The writers then go on to proffer a game-theoretic explanation for the results:

Suppose you’re a man who’s really into someone. If you suspect other men are uninterested, it means less competition. You therefore have an added incentive to send a message. You might start thinking: maybe she’s lonely. . . maybe she’s just waiting to find a guy who appreciates her. . . at least I won’t get lost in the crowd. . . maybe these small thoughts, plus the fact that you really think she’s hot, prod you to action. You send her the perfectly crafted opening message.


On the other hand, a woman with a preponderance of ‘4’ votes, someone conventionally cute, but not totally hot, might appear to be more in-demand than she actually is. To the typical man considering her, she’s obviously attractive enough to create the impression that other guys are into her, too. But maybe she’s not hot enough for him to throw caution (and grammar) to the wind and send her a message. It’s the curse of being cute.

The overall picture looks something like this:

My take:

All this analysis is well and good, but it has a serious gap. Notice that both my hypothesis and the data suggest that the message counts are heteroskedastic: their variance changes with the rating. The ordinary least squares method stops being the best linear unbiased estimator when the errors are not homoskedastic.
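One rough way to check for this is a Goldfeld-Quandt style comparison: sort the observations by the regressor, fit separate lines to the low and high ends, and compare the residual variances. The sketch below runs on synthetic data that I made up for illustration (ratings drawn uniformly from 1 to 10, noise that either grows with the rating or stays constant); none of it comes from OkCupid, and the function names are my own.

```python
import random

def ols_fit(xs, ys):
    """Least-squares intercept and slope for a single regressor."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def residual_variance(xs, ys):
    """Residual sum of squares divided by its degrees of freedom (n - 2)."""
    intercept, slope = ols_fit(xs, ys)
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return rss / (len(xs) - 2)

def goldfeld_quandt_ratio(xs, ys, drop_fraction=1 / 3):
    """Residual variance in the high-x group over that in the low-x group."""
    pairs = sorted(zip(xs, ys))                   # order observations by x
    k = int(len(pairs) * (1 - drop_fraction) / 2) # size of each end group
    low, high = pairs[:k], pairs[-k:]             # middle block is dropped
    var_low = residual_variance([p[0] for p in low], [p[1] for p in low])
    var_high = residual_variance([p[0] for p in high], [p[1] for p in high])
    return var_high / var_low

rng = random.Random(1)
xs = [rng.uniform(1.0, 10.0) for _ in range(3000)]

# Synthetic responses (assumption): same line, two different noise patterns.
ys_hetero = [2.0 + 3.0 * x + rng.gauss(0.0, 0.5 * x) for x in xs]
ys_homo = [2.0 + 3.0 * x + rng.gauss(0.0, 2.0) for x in xs]

ratio_hetero = goldfeld_quandt_ratio(xs, ys_hetero)
ratio_homo = goldfeld_quandt_ratio(xs, ys_homo)

print(f"variance ratio, noise grows with x: {ratio_hetero:.2f}")
print(f"variance ratio, constant noise:     {ratio_homo:.2f}")
```

A ratio close to 1 is what homoskedastic errors would produce; a ratio far from 1 is a warning sign. A proper test would compare the ratio against an F distribution, which I have left out to keep the sketch dependency-free.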


Heteroskedasticity is all kinds of bad if you’re trying to use simple OLS. To quote this seemingly standard textbook:

… estimators of variances Var(β) are biased without the homoskedasticity assumption. Since the OLS standard errors are based directly on these variances, they are no longer valid for constructing confidence intervals and t-statistics. The OLS t-statistics do not have t-distributions in the presence of heteroskedasticity…

In other words, the claims that the coefficient for m4 is negative and that the coefficient for m3 is not significantly different from zero are both suspect. Doubting them makes logical sense too. I mean, why would you not message someone you rated a 4? Or why would you want to message someone you rated a 1 and not message someone you rated a 3 or a 4?
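The textbook's point can be seen in a small simulation. The sketch below is not OkCupid's analysis: it generates synthetic data with a single regressor (a hypothetical rating between 1 and 5) and noise that grows with it, then compares the classical OLS standard error of the slope with White's heteroskedasticity-robust (HC0) standard error and with the spread of the slope estimates actually observed across repeated samples. All names and numbers are assumptions of mine.

```python
import math
import random

def slope_and_standard_errors(xs, ys):
    """Simple-regression slope with classical and White (HC0) standard errors."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    dx = [x - xbar for x in xs]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (y - ybar) for d, y in zip(dx, ys)) / sxx
    intercept = ybar - slope * xbar
    resid = [y - intercept - slope * x for x, y in zip(xs, ys)]
    # Classical formula: assumes every error has the same variance.
    s2 = sum(e * e for e in resid) / (n - 2)
    se_classical = math.sqrt(s2 / sxx)
    # HC0: weights each squared residual by its own squared deviation in x.
    se_hc0 = math.sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx
    return slope, se_classical, se_hc0

def simulate(rng, n=200):
    """Synthetic sample (assumption): true slope 2, noise sd grows as x squared."""
    xs = [rng.uniform(1.0, 5.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5 * x * x) for x in xs]
    return xs, ys

rng = random.Random(42)
results = [slope_and_standard_errors(*simulate(rng)) for _ in range(1000)]

slopes = [r[0] for r in results]
mean_slope = sum(slopes) / len(slopes)
# Spread of the slope across replications: the quantity a standard error estimates.
empirical_sd = math.sqrt(
    sum((b - mean_slope) ** 2 for b in slopes) / (len(slopes) - 1)
)
mean_classical = sum(r[1] for r in results) / len(results)
mean_hc0 = sum(r[2] for r in results) / len(results)

print(f"mean slope estimate:        {mean_slope:.3f}")
print(f"observed sd of the slope:   {empirical_sd:.3f}")
print(f"average classical OLS s.e.: {mean_classical:.3f}")
print(f"average HC0 robust s.e.:    {mean_hc0:.3f}")
```

In this setup the slope estimate itself stays close to the true value, but the classical standard error understates how much the slope really varies, so t-statistics built on it would look more convincing than they should. The robust version tracks the observed spread much more closely.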

The regression model presented in this analysis cannot answer fundamental questions like these. The game-theoretic explanation proposed to account for the anomalous results also seems incomplete to me, but I can’t quite pinpoint what is bothering me about it. If I do, I will update this post.

Lastly, my message to all: don’t take data and statistics at face value. As some random commenter on PZ Myers’ blog once said:

[I’ve] always said for years to pals of mine that statistics, when used alone, are absolute bubkes as evidence of anything.

Note: All pictures and content quoted from OkCupid belong to OkCupid.