We’ve been living with “big data” for a long time now. “Big data” is essentially a buzzword, along the lines of “sustainable.” We should really understand that “big” just means “larger than we have used in the past.” Yes, there are storage and architecture costs, but of course there always have been. Arguably, what’s made “big data” grow to the size we’re dealing with now is the cheapness of data storage. Walmart and Target wouldn’t collect all the data they do if they couldn’t store and use it economically.

To a financial economist, big data is old news. We can’t publish a paper these days that doesn’t have at least a few thousand observations. Many papers have several hundred thousand observations. The first thing we realize as researchers is that hypotheses and understanding of existing theory must precede data analysis. If you just run a massive correlation analysis, it’s absolutely certain you’ll get a bunch of correlations that are statistically significant but meaningless. A quick simulation makes the point.
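As a rough sketch of this multiple-comparisons trap (the 50 series, 1,000 observations, and 5% cutoff below are illustrative assumptions, not anything from a real data set), correlate a pile of pure-noise series and count how many pairs look “significant”:

```python
# Sketch: correlate many unrelated random series and count how many pairs
# look "statistically significant" at the 5% level purely by chance.
# The 50 series and 1,000 observations are illustrative assumptions.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(seed=42)
n_series, n_obs = 50, 1_000
data = rng.standard_normal((n_series, n_obs))  # pure noise, no real relationships

significant_pairs, total_pairs = 0, 0
for i in range(n_series):
    for j in range(i + 1, n_series):
        _, p_value = pearsonr(data[i], data[j])
        total_pairs += 1
        if p_value < 0.05:
            significant_pairs += 1

# With 1,225 pairs and no true relationships, roughly 5% of them (about 60)
# clear the 5% significance threshold anyway.
print(f"{significant_pairs} of {total_pairs} pairs 'significant' at the 5% level")
```

None of those “discoveries” mean anything, which is exactly why theory has to come before the data mining.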

For example, in investment analysis there’s a phenomenon called the small firm effect, which essentially says that smaller firms have higher returns than larger firms. It’s pretty well established, and firm size is often used in calculating required returns. But the small firm effect is not reliable: it went away during the late 1990s and early 2000s. An example from macroeconomics is the Phillips curve, which postulates an inverse relationship between unemployment and inflation. The Phillips curve was derived from empirical observations of the post-WW2 years, and was not the product of rigorous hypothesizing and testing. And since the late 1970s, the data hasn’t fit the Phillips curve idea.

In the context of financial economics, using a false theory to determine the required return on an asset will lead to mispricing and wrong decisions – selling when one should buy, buying when one should sell. In macroeconomics, false theory will lead to policy mistakes that can and do affect millions or more people.

None of this is to say that lots of data is not useful. It is useful for more rigorous testing of theory, but there is another caveat here. When dealing with very large data sets, it is quite common to see very precise parameter estimates in regressions. This is because standard errors shrink as the sample grows, roughly in proportion to one over the square root of the sample size. The problem is that this gives a false sense of the importance of the variables, because the parameter estimate itself can be very small even though it’s statistically significant. So we need theory and judgement and context to understand whether the effect of a variable is economically important.
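Here is a minimal sketch of that trap; the slope, noise level, and sample size are made-up numbers chosen only to illustrate the gap between statistical and economic significance:

```python
# Sketch: a regression where the true effect is economically trivial, yet
# statistically significant because the sample is enormous.
# The slope (0.005), unit-variance noise, and 2 million observations are
# illustrative assumptions, not estimates from any real data set.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(seed=0)
n = 2_000_000
x = rng.standard_normal(n)
y = 0.005 * x + rng.standard_normal(n)  # a one-SD move in x shifts y by ~0.5% of its SD

result = linregress(x, y)
# The standard error of the slope shrinks roughly like 1/sqrt(n), so the
# p-value is essentially zero even though the effect is economically tiny.
print(f"slope = {result.slope:.4f}, p-value = {result.pvalue:.1e}")
```

The regression is “right” in a narrow statistical sense; whether a coefficient that small matters for any real decision is a question the data alone can’t answer.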

We don’t get the judgement and context without understanding the theory and the circumstances of what we’re observing. You can have all the technical expertise in the world, but if you don’t know the capital asset pricing model (CAPM) or arbitrage pricing theory (APT), you won’t know why professional investment analysts calculate the required return for a specific asset the way they do, or how important the risk-free rate is in asset pricing. You might just find out that when there are more sunspots this week, the returns on agricultural firms will be higher next week.
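To make the CAPM reference concrete, here is a minimal sketch of the required-return calculation it implies; the 4% risk-free rate, 1.2 beta, and 9% expected market return are made-up illustrative inputs, not estimates for any real asset:

```python
# Sketch of the CAPM required-return calculation:
#   required return = risk-free rate + beta * (expected market return - risk-free rate)
# The inputs below are made-up illustrative values, not real estimates.

def capm_required_return(risk_free_rate: float, beta: float,
                         expected_market_return: float) -> float:
    """CAPM: required return on an asset given its beta."""
    market_risk_premium = expected_market_return - risk_free_rate
    return risk_free_rate + beta * market_risk_premium

required = capm_required_return(risk_free_rate=0.04, beta=1.2,
                                expected_market_return=0.09)
print(f"Required return: {required:.1%}")  # 0.04 + 1.2 * 0.05 = 10.0%
```

Notice that the risk-free rate enters twice – as the floor on the required return and as the baseline for the market risk premium – which is a big part of why it matters so much in asset pricing.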