Big data uncertainties

Research paper by Pierre-André G. Maugis

Indexed on: 11 Oct '16
Published on: 10 Sep '16
Published in: Journal of Forensic and Legal Medicine


Big data, the idea that an ever-larger volume of information is constantly being recorded, suggests that new problems can now be subjected to scientific scrutiny. However, can classical statistical methods be applied directly to big data? We analyze the problem by looking at two known pitfalls of big datasets. First, they are biased, in the sense that they do not offer a complete view of the populations under consideration. Second, they present a weak but pervasive level of dependence between all their components. In both cases we observe that the uncertainty of the conclusions obtained by statistical methods increases when those methods are applied to big data, either because of a systematic error (bias) or because of a larger degree of randomness (increased variance). We argue that the key challenge raised by big data is not only how to use it to tackle new problems, but also how to develop tools and methods able to rigorously articulate the new risks therein.
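The two pitfalls named in the abstract can be illustrated with a small numerical sketch. It is not the paper's own model: it assumes, purely for illustration, that the quantity of interest is a sample mean, that the bias is a fixed offset, and that the dependence is a single common pairwise correlation (an equicorrelation model). The function names variance_of_mean and mse_of_mean and the parameters bias, sigma2 and rho are hypothetical.

```python
# Illustrative sketch only: the equicorrelation model and all names below are
# assumptions made for this example, not the paper's own model or notation.

def variance_of_mean(n, sigma2=1.0, rho=0.0):
    """Variance of the mean of n observations that share variance sigma2
    and a common pairwise correlation rho (rho = 0 is the i.i.d. case)."""
    # Var(mean) = (1/n^2) * [n * sigma2 + n * (n - 1) * rho * sigma2]
    return sigma2 / n + (n - 1) / n * rho * sigma2


def mse_of_mean(n, bias=0.0, sigma2=1.0, rho=0.0):
    """Mean squared error of the sample mean: squared bias plus variance."""
    return bias ** 2 + variance_of_mean(n, sigma2, rho)


if __name__ == "__main__":
    # Compare the classical case with a biased sample and a weakly dependent one.
    for n in (10**2, 10**4, 10**6):
        iid = mse_of_mean(n)
        biased = mse_of_mean(n, bias=0.1)
        dependent = mse_of_mean(n, rho=0.01)
        print(f"n={n:>9,}  iid={iid:.2e}  biased={biased:.2e}  dependent={dependent:.2e}")
```

Under these assumptions the error in the independent, unbiased case shrinks like 1/n, whereas a fixed bias leaves a floor of bias**2 and a weak common correlation leaves a floor near rho * sigma2, no matter how many observations are added.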
