One of the premises of Big Data is that it can be “theory free”: rather
than starting with a hypothesis (“men at buffets eat more when women are
present,” “more people will click this button if I move it here,” etc.)
and then gathering data to validate your guess, you just gather a ton of
data and look for patterns in it.
The thing is, patterns emerge in every large dataset, without
necessarily being representative of a wider statistical truth. Think of
the celebrated rise and fall of Google Flu Trends:
researchers identified the 45 search terms whose popularity best tracked
the spread of the flu and concluded that these were predictors of flu, but
the predictive power turned out to be an illusion. Every place has 45
top search terms, all the time, and some of them will coincide with flu
outbreaks, but without a causal theory that you can test, all you know
for sure is that you’ve found an instance of correlation, with no way to
know whether the correlation is coincidence or a newly discovered iron
law.
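To see how easily this happens, here is a minimal sketch of the trap. None of it is Google’s data or methodology: the two-year window, the 50,000-term vocabulary, and every series in it are invented noise. It generates a random weekly “flu cases” series and tens of thousands of equally random “search term” series, then ranks the terms by how well they track the flu curve. The top 45 look like genuine predictors even though nothing here predicts anything.

```python
# Toy simulation: spurious correlations in a large dataset of pure noise.
# All sizes and series are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

weeks = 104        # two years of weekly observations (invented)
n_terms = 50_000   # candidate "search terms", all random noise (invented)

flu_cases = rng.normal(size=weeks)
search_volume = rng.normal(size=(n_terms, weeks))

# Pearson correlation of every term's series with the flu series.
flu_z = (flu_cases - flu_cases.mean()) / flu_cases.std()
term_z = (search_volume - search_volume.mean(axis=1, keepdims=True)) \
         / search_volume.std(axis=1, keepdims=True)
correlations = term_z @ flu_z / weeks

# The 45 best "predictors" look impressive despite being noise.
top45 = np.sort(np.abs(correlations))[-45:]
print(f"strongest chance correlation: {np.abs(correlations).max():.2f}")
print(f"45th strongest:               {top45[0]:.2f}")
```

With two years of weekly data and 50,000 candidate terms, the best chance correlation comes out somewhere around 0.4 to 0.5: strong enough to look like a discovery, produced by nothing at all.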
Big Data is still a useful tool for statisticians: it can be mined for
the intuitions that lead to new hypotheses, but those hypotheses then
need to be tested with statistical rigor.
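One form that rigor can take, sketched below as a continuation of the toy simulation above (again with invented data, not Google’s), is to treat the best-looking correlation as a hypothesis and re-measure it on data it has never seen. Out of sample, the apparent iron law collapses back toward zero.

```python
# Continuation of the toy simulation: test the "discovered" term on fresh data.
import numpy as np

rng = np.random.default_rng(1)
weeks, n_terms = 104, 50_000   # same invented sizes as above

def correlation_with_flu(flu, terms):
    """Pearson correlation of each term's weekly series with the flu series."""
    flu_z = (flu - flu.mean()) / flu.std()
    term_z = (terms - terms.mean(axis=1, keepdims=True)) \
             / terms.std(axis=1, keepdims=True)
    return term_z @ flu_z / flu.size

# Exploratory pass: "discover" the term that best tracks the flu.
flu_a = rng.normal(size=weeks)
terms_a = rng.normal(size=(n_terms, weeks))
r_a = correlation_with_flu(flu_a, terms_a)
best = int(np.argmax(np.abs(r_a)))

# Confirmatory pass: re-measure that one term on two fresh years of data.
flu_b = rng.normal(size=weeks)
terms_b = rng.normal(size=(n_terms, weeks))
r_b = correlation_with_flu(flu_b, terms_b)

print(f"in-sample correlation:     {r_a[best]:+.2f}")
print(f"out-of-sample correlation: {r_b[best]:+.2f}")
```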