
Little data

I hear a lot lately about Big Data, the process of collecting lots and lots of information and then figuring out how to find meaning in all of it. This is an approach that has become popular in business (especially tech), in science, and even in personal management. When it comes to data, bigger is better. Why have megabytes when you can have gigabytes, and why gigabytes when you can have terabytes? Just think of all the great science you can do with all that data!

Only, I'm not sure it really is science. I mean, one of the fundamental tenets of science is that you test theories with evidence. That means the theory has to come before the evidence does. If you do the experiment first and then figure out the theory afterwards, you fairly quickly fall into the famed phlogiston school of "oh what a coincidence, my theory also explains this evidence". In other words, you're not using the evidence to test the theory; you're using the theory to describe the evidence.

Even when they work, there is something fundamentally missing from data-driven approaches; no matter how sophisticated, prediction isn't understanding. Let's say you have a powerful weather prediction model, trained on the biggest Big Data your big Big Data data warehouse can hold. It can tell you with 99.9% accuracy the weather tomorrow based on a thousand different dimensions of input. But do you actually know how the weather works? Have you learned anything about fluid dynamics? Can you turn your predictions into understanding?

I think the essential conflict is that understanding means less data. A terabyte of random-looking numbers can be replaced with a one-line formula if you know the rule that generated them. An enormously sophisticated and complex model can be replaced with a simple one if you figure out what the underlying mechanics are. The standard model of particle physics can fit on a t-shirt. If your model doesn't, either it's more complex than particle physics or you just don't understand it very well.
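To make that concrete, here's a toy sketch in Python (the generator and its constants are just an illustrative example, not any real dataset): a million numbers that look random, but every one of them comes from a one-line rule, so anyone who knows the rule can throw the numbers away.

    # A million "random-looking" numbers: a few megabytes to store...
    def lcg(seed, n, a=1664525, c=1013904223, m=2**32):
        """Linear congruential generator: x -> (a*x + c) mod m."""
        x = seed
        for _ in range(n):
            x = (a * x + c) % m
            yield x

    data = list(lcg(seed=42, n=1_000_000))

    # ...but if you understand the rule, the whole dataset collapses to:
    #   seed = 42, x -> (1664525*x + 1013904223) mod 2**32

The point isn't this particular generator; it's that once you understand the mechanism, keeping all that data around is redundant.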

Now, you might say particle physics is a bad example. The Large Hadron Collider generates terabytes upon terabytes of collision data; surely that's Big Data? That actually gets to the core of the issue: it's not that analysing lots of data can't be useful, but that it's a means, not an end. Nobody at CERN started with "hey, let's just smash a bunch of particles together and record it all, maybe we'll find some science in there". Theory came first, and the LHC is the experiment that comes after. If there were an easier experiment, some way that didn't require all that data and all that expense, do you think anyone would bother with particle colliders?

The most wonderful skill in all of science is to take a complex question and turn it into a simple question, and then use a simple answer to solve both. Big Data is too often a way of answering complex questions without making them simple. What you get is a complex answer, when what you really wanted was a simple one. Sometimes you might need a lot of data to answer a simple question, but often you only need a little, and I think it'd be good to see more hype for little data.