Bad statistics

Data bing, data boom

People love data right now. Data science (or machine learning/AI engineering, if you prefer) is the best job in America, according to Glassdoor's rankings. "Data literacy plus empathy" makes CNBC's list of the most important skills for the near future. The word "data" appears forty times on this page. You can bet your favorite thought leader has "hungry for data" in their Twitter bio. With all this attention, the word can lose its meaning. When people talk about data skills, they really mean two things: technical skills and statistics skills.

Technical skills can be learned quickly--there are tons of programming bootcamps claiming they can land you a developer job in well under a year.* Here's a ranking of forty of them.

*I apologize if I left out your alma mater. You can request a link in the comments.

Statistics tends to be more nuanced. It is the only rigorous** way we can make sense of data and determine truth. In practice, it is probably the best way to potentially conclude that some observed effect might be true.

**-ish.

Our understanding of statistics also helps us decide what not to believe. It's the reason we ignore cherry-picked special cases--like if I wrote a TripAdvisor review titled "Washington, D.C. Is Overrun With Pigeons" and attached the following image:

3 stars.

You haven't heard any news about pigeons--no more than usual--so you don't believe it. And rightly so; this is just one street corner. You might have even noticed the right edge of the photo, where you can see the sneaker of the guy sacrificing bread to this mosh pit with wings. Here is what the rest of DC actually looked like:

Aaaahh! Aaaahhh!!

Setting a bad example
 
We like to think we have strong defenses against bad statistics: more people are learning to code (strengthening their logical thinking), and there are more data scientists than ever. So why are misuses of statistics everywhere? Are people still bad at it? And if they know better, how are they getting away with it?

I could bring up the time I witnessed a biology research presentation in which the central statistical test had its null and alternative hypotheses switched (in a legal context, this is the equivalent of "guilty until proven innocent"), but this isn't meant to be a statistics lesson. You can find one on a less entertaining site.
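(Fine, one quick freebie. Here's roughly what a correctly oriented test looks like in Python--the numbers and the 0.05 threshold are invented for illustration, not taken from that presentation:)

    # A minimal sketch of the conventional two-sample test framing.
    # The data below is made up; the point is the direction of the burden of proof.
    from scipy import stats

    control = [5.1, 4.9, 5.0, 5.2, 4.8]   # hypothetical untreated measurements
    treated = [5.8, 6.1, 5.9, 6.0, 5.7]   # hypothetical treated measurements

    # Null hypothesis (the "innocent" default): the means are equal.
    # Alternative: they differ. Never start by assuming the effect exists.
    t_stat, p_value = stats.ttest_ind(control, treated)

    alpha = 0.05
    if p_value < alpha:
        print("Reject the null: the data shows evidence of an effect.")
    else:
        print("Fail to reject the null: not enough evidence.")
        # "Fail to reject" is not "proven harmless" -- just not convicted.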

Instead, I'll talk about some places where you usually find poorly thought out stats and figures, and how to recognize them.

Law of averages

There's actually an issue with the very first article I mentioned--and with almost every one that shares its theme. Compare the data scientist median base salary of $110k with that of a software engineer ($101k) or a mobile developer ($85k). The gap is significant, but it doesn't reflect the whole truth.

Data science is a new and exploding field, and the largest companies have the most to gain by hiring data scientists--and pay the most for them. These companies tend to be located in expensive cities like New York and San Francisco, further driving up salaries. Plain old software engineering has been around much longer, so smaller companies in smaller towns offset the largest employers. The average salary for a "Computer Programmer," an even older job title, is far less, at $69k. The best thing you can do is compare average salaries among jobs within your own city; the best thing Glassdoor could have done is adjust for cost of living in each city. Unless I'm wrong--in which case, if you want to earn more as a programmer, it looks like a quick one-line change to your job title will get you a $30k raise!
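For what it's worth, the adjustment isn't hard. Here's a minimal sketch; the index values and salaries are made up for illustration, not Glassdoor's actual numbers:

    # Hypothetical cost-of-living normalization for salary comparisons.
    # 1.0 = national average; all figures below are invented.
    cost_of_living = {
        "San Francisco": 1.8,
        "New York": 1.7,
        "Columbus": 0.9,
    }

    salaries = {
        ("Data Scientist", "San Francisco"): 130_000,
        ("Software Engineer", "Columbus"): 95_000,
    }

    for (title, city), salary in salaries.items():
        adjusted = salary / cost_of_living[city]
        print(f"{title} in {city}: ${salary:,} raw, ${adjusted:,.0f} adjusted")

On those made-up numbers, the Columbus engineer comes out well ahead of the San Francisco data scientist once rent enters the picture.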



The laugh test

The things that should make you reconsider whether a statistic represents the truth, the whole truth, yada yada, are unfortunately the same things that make the numbers so powerful. In the data science salary case, the figure is so far out of the ballpark (unless you live in SF or NY) that it makes you feel something: that you need one of these jobs or else, that you're missing out on this hot new field.

College advertisements are another class of offender. Ever see an ad at the bus stop for a college in your own city that you've never heard of? Ever wonder how they're all number one at something? Well, I'll have you know I got my law degree from the best non-accredited four-minute university on the East Coast, right on my computer.

As a data set grows, the possibilities for statistical misuse grow exponentially.*** You can slice the data any way you want until you find something that sounds good. You can also do this unfairly. I won't give too much detail, but I once saw an ad for a technical and nursing college that claimed something along the lines of "our graduates are #1 in starting salaries for public schools in the northeast." That sounds pretty good! But the figure rests on an important detail: the school only offers technical, engineering, and nursing majors, which command higher starting pay than other majors. It isn't playing on the same field as schools that offer a wider range of majors.

***Source. Data points are elements of a set, and for each combination (subset) of those elements, there is a truth to be stretched--and a set of n points has 2^n subsets.
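(If you want to check that math, here's a sanity check--pure arithmetic, no data required:)

    # Each subset of a data set is a potential cherry-picked "finding."
    for n in [10, 20, 30]:
        print(f"{n} data points -> {2**n:,} possible subsets")
    # 10 -> 1,024; 20 -> 1,048,576; 30 -> 1,073,741,824.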

Insignificant but fun

Another type of misuse is most common in sports, political speculation, and anywhere else you'd hear numbers called "stats." There are useful, relevant stats, and you can usually tell the difference. The bad ones are the result of panning for correlations and clinging to them as predictors of future performance. They usually sound like: "in the past thirty years, the Giants have a 7-2 record against the Bengals in home games played in the snow when they were up by 3 or fewer points in the 3rd quarter and both teams had an injured kicker." Sure, the data says there's a good chance, but there's no real reason for the data to look like that. (You can pan these up from pure noise; see the sketch below.)

 "Sox are gonna have a good season."

Robot takeover

How can you tell the difference between a breakthrough finding and a spurious claim? What really separates correlation from causation?

That's actually... not a job for statistics. Not entirely. Statistics is a tool for determining correlation and likelihood; we have to fill in the rest. Well-designed experiments help a lot. The main thing people look for is a mechanism: some reason for the causation to exist. Apple falls faster when the planet is bigger? Gravity. Apple flying around in any old direction? No conclusion (yet).

That's why we trust scientists to conduct research, not statisticians or data scientists alone. All scientists use statistics, but their domain knowledge is critical to analyzing a result. Machine learning works remarkably well for a huge number of problems, but AI won't replace scientists--or you and me--any time soon.

A thousand times better off? I need a source on that.
