In our ongoing blog series on data mining for policy, we’ve been trying to synthesize a lot of information into short, bite-sized chunks for our audience. Well-intentioned as such efforts are, something valuable invariably ends up on the cutting room floor. In this case, we were a bit too hasty in presenting the definition of data mining itself, a definition one of our readers asked us to elaborate on. Our initial definition was put together through literature review and our earliest experiences with data mining, but revisiting it more recently has enabled us to uncover some further nuances that we hadn’t yet appreciated.
Our full report defines data mining as the “integrated use of data extraction, storage, pre-processing & analytical tools” to derive novel findings. We usually follow up that definition with one for big data, based on the four Vs: Volume, Velocity, Variety and Veracity. These methods emerge at the intersection of computer science, statistics and econometrics, making them inherently interdisciplinary. What I want to focus on here is the triumvirate of data collection, processing and analysis. Our definitions present data mining as an integrated approach to processing and analysis, whereas big data adds in the integration of the data collection component.
Thus, big data and data mining pull together these three pieces: data collection, processing and analysis.
The essence of data mining and big data is that they are integrated approaches to these three components; integration is supposed to be their distinctive feature. But in reflecting on this definition, I realized that there was a substantial problem: these components are integrated in every approach to analysis, not just the new ones we’re trying to define.
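To make the three components concrete, here is a deliberately minimal sketch of the kind of pipeline our report’s definition describes: extraction, storage, pre-processing and analysis feeding into one another. Every function name and data value below is invented purely for illustration; no real data source or tool is implied.

```python
# Illustrative sketch only: a toy "integrated" pipeline with the stages
# named in the report's definition. All names and values are hypothetical.

import statistics

def extract():
    """Stand-in for data extraction (e.g. querying a source)."""
    return [12.0, 15.5, 14.2, 13.8, 90.0]  # raw toy observations

def store(records):
    """Stand-in for storage; here simply an in-memory copy."""
    return list(records)

def preprocess(records):
    """Pre-processing: drop implausible outliers before analysis."""
    median = statistics.median(records)
    return [r for r in records if abs(r - median) < 10]

def analyse(records):
    """Analysis: a simple summary statistic."""
    return statistics.mean(records)

raw = extract()
stored = store(raw)
clean = preprocess(stored)
result = analyse(clean)
print(result)
```

The point of the sketch is that no stage stands alone: the analysis only makes sense given the pre-processing rule, which in turn only makes sense given how the data were extracted.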
Similarly, the revolutions in physics in the early twentieth century (specifically the moves towards relativity theory and quantum mechanics) demonstrated more clearly than ever before that, even in that most pure of sciences, the context in which a measurement is taken cannot be disregarded in interpreting its meaning. En passant, Ernst Cassirer offers wonderful, if dense, explorations of the significance of measurement, both for relativity theory and for quantum mechanics. My own work on the subject is deeply indebted to his ideas.
If all approaches are integrated, then what is novel and unique about data mining and big data? It’s the velocity of iteration. Changes to official statistical collection processes take place over very long time frames, passing through processes of peer review, painstaking inspection of results and interactions between many stakeholders (including those who collect the raw data, those who collate it into official statistics and those who use them). These processes are incredibly time-consuming, but they also date from an era when data collection, processing and analysis were much more labour intensive. Conducting surveys and censuses is an extremely expensive endeavour (their cost being an important driver of interest in big data, in fact), and one therefore cannot afford to be too haphazard in one’s approach.
By contrast, data mining and big data are incredibly fast and nimble, by virtue of their deployment of computing power and data sources never before available to humankind. These new resources make it possible to run many experiments within a time frame that would have been unthinkable even a few decades ago. Because data collection, processing and analysis are always integrated, these rapid iterations require each piece to be constantly reassessed for its fit with the others. As any piece of the puzzle evolves, the others must adapt to retain that fit. When the process slows down, involving long timelines, large groups of stakeholders and structured decision-making procedures, we can lose sight of this interdependence. Once the process speeds up, however, the interdependence resurfaces and we can no longer fail to notice it.
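The point about iteration speed can be caricatured in a few lines of code: a toy workflow in which each “experiment” changes the pre-processing rule (here, a hypothetical outlier threshold) and immediately re-runs the analysis, so the fit between processing and analysis is re-checked on every cycle rather than once a decade. All names, thresholds and data below are invented for illustration.

```python
# Hypothetical sketch: rapid iteration over the whole pipeline.
# Each experiment tweaks the pre-processing rule and immediately
# re-runs the downstream analysis.

import statistics

raw = [12.0, 15.5, 14.2, 13.8, 90.0, 11.9, 250.0]  # toy data

def run_experiment(threshold):
    """Re-run processing (outlier filter) and analysis (mean) together."""
    median = statistics.median(raw)
    clean = [r for r in raw if abs(r - median) < threshold]
    return len(clean), statistics.mean(clean)

# Iterate quickly over candidate processing rules and inspect the
# effect each one has on the analysis.
for threshold in (5, 100, 1000):
    kept, mean = run_experiment(threshold)
    print(f"threshold={threshold}: kept {kept} points, mean={mean:.1f}")
```

Tightening or loosening the filter visibly changes the analytical result, which is exactly the interdependence that slow, formal revision processes can obscure.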
At the end of the day, what are data mining and big data? They are integrated approaches to data collection, processing and analysis, just like all the rest. However, they are highly iterative, and these short cycle times mean that collection, processing and analysis tools are in a much more dynamic state of flux, causing their interdependence to resurface. These short cycle times are enabled by the ubiquity of computing power and of the data to apply it to, which together are pushing our notions of analysis and of knowledge to evolve.
Science-Metrix’s final report for this data mining project is available from the Publications Office of the European Union.
Data Mining. Knowledge and technology flows in priority domains within the private sector and between the public and private sectors. (2017). Prepared by Science-Metrix for the European Commission. ISBN 978-92-79-68029-8; DOI 10.2777/089
All views expressed are those of the individual author and are not necessarily those of Science-Metrix or 1science.