Throughout 2015 and 2016, we at Science-Metrix worked on a project for the European Commission that focused on data mining and big data analytics in the context of policymaking, specifically research & innovation policy. While carrying out this work, we learned some fascinating and valuable things, and so rather than leave all that knowledge locked up in a full report that’s hundreds of pages long (before annexes!), we’re sharing the key insights from the project through a series of blog posts. Here’s a free sample to pique your interest: it’s highly valuable to cross big data sources with more established sources that are better understood. This introductory post outlines the context of the project; stay tuned for the rest of the series!
First, as an overview, this series of blog posts will cover the project context, the technical framework we developed (and refined) for data mining in a policy context, the policy findings we discovered in conducting some data mining case studies, the enduring challenges, and future directions. A full table of contents for the series can be found below (with hyperlinks), and we’ll be rolling out one post per week, on Wednesday mornings.
The main task for this project was to assess the current potential for using data mining for research & innovation (R&I) policymaking, including costs and benefits, feasibility, and validity for informing decision-making around R&I. The process for accomplishing this task started with a literature review and interviews with experts working in the area, to canvass the state of play and inform the development of a technical framework to guide data mining projects for R&I policy. This first half of the project concluded with a presentation of the framework to an expert workshop, the aim of which was to get a signal check on our work to date and to refine our approach before setting off to test it.
The second half of the project was to take that framework and apply it to six case studies, addressing six R&I policy questions to identify the strengths and weaknesses of the framework, while identifying any policy-relevant conclusions available. Based on the testing phase, results were presented at a final expert workshop to validate the revisions and recommendations that arose during this second phase. The purpose of the report—and the framework that it presents—is to bridge the current gap between policy analysts (who generally have little facility with quantitative analysis) and data scientists (who generally have little familiarity with the policymaking process). These skills shortages are amplified by the weak linkages between the people who work on either side of that divide.
Before moving forward, it’s worth defining some of the key terms we’re using here. What constitutes big data is defined by the three Vs outlined by Einav & Levin and Clarke & Margetts, along with a fourth V contributed by Gök, Waterworth & Shapira:
- Volume: big data are in large quantity
- Velocity: big data record events in real time, or close to real time
- Variety: big data have a high dimensionality, covering many different parameters (and much of big data is unstructured)
- Veracity: big data often need adjusting to accurately represent the target population, and these processes are often sensitive to methodological decisions
Accuracy and representativeness are sometimes improved by volume, whereas in other cases grappling with a high volume actually presents an additional challenge to addressing veracity. Furthermore, we define big data with an inherent novelty component; while some very well-established databases might meet the 4V definition above, in the context at hand we’ll reserve the term big data for data sets that are still emerging. Examples of some established sources would include surveys and national statistics. Novel sources would include social media data (content and structure), web scraping data (again, content and structure), data exhausts and repurposed administrative data.
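To make the veracity point concrete, here is a minimal sketch of one common adjustment, post-stratification weighting, in Python. It is an illustration only, not the method used in the project: the group labels, population shares and values are hypothetical, and the function names are our own.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Weight each group so the weighted sample matches the population shares."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    weights = {}
    for group, share in population_shares.items():
        sample_share = counts.get(group, 0) / n
        if sample_share == 0:
            # A group absent from the sample cannot be reweighted at all.
            raise ValueError(f"no observations for group {group!r}")
        weights[group] = share / sample_share
    return weights


def weighted_mean(values, groups, weights):
    """Mean of values, with each observation weighted by its group's weight."""
    total = sum(weights[g] * v for v, g in zip(values, groups))
    norm = sum(weights[g] for g in groups)
    return total / norm


# Hypothetical data: urban users are over-represented (8 of 10 observations),
# while the assumed target population is split 50/50.
groups = ["urban"] * 8 + ["rural"] * 2
values = [1.0] * 8 + [0.0] * 2  # e.g. 1.0 = reports using some service
population_shares = {"urban": 0.5, "rural": 0.5}

weights = poststratification_weights(groups, population_shares)
raw_mean = sum(values) / len(values)
adjusted_mean = weighted_mean(values, groups, weights)
```

In this toy case the raw mean overstates the urban pattern and the adjusted mean pulls it back toward the population split. The result depends on which groups are chosen and which population shares are assumed, which is exactly the sensitivity to methodological decisions noted under veracity.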
Also, while data mining is often used to refer only to the step of analyzing big data sources—to discover novel patterns in the data—we use the term here much more broadly, including steps such as data identification, extraction, storage, pre-processing, analysis and interpretation.
Benefits & challenges
Big data, data mining, machine learning and other terms referring to this constellation of activities are a very hot topic now in many spheres. Side-stepping the hype for a moment, consider that some estimates have concluded that data mining could increase European GDP by nearly 2% (more than €200 billion) within the next five years. Worldwide, the market for big data technology and services is expected to increase from €1.8 billion in 2013 to €5.3 billion in 2018. Moving to cloud-based IT services could save up to 87% of the energy consumed by decentralized computing.
In R&I policy specifically, interlinking data sets provides data for a huge variety of new types of investigations, to inform decisions about structural and financial programs. Some examples from successfully completed projects include assessments of the time to uptake of research into innovation (specifically in the context of nanotech), the need for a core set of firms to be able to capitalize on publicly supported tech programs, the innovation activities of firms based on their web data, and the evolving needs for different skills in the professional workforce.
Some of the benefits of big data over traditional data sources include access to information that might not otherwise be available. For instance, information about where individuals are employed and the type of work they do is traditionally only available in aggregate form (disaggregated data are often not publicly accessible), but much of this information can be collected through professional networking sites such as LinkedIn, though even there, gaining access is far from trivial.
Big data can also give much broader coverage than a survey instrument. Additionally, survey instruments and other data collection schemes need to be updated intentionally, and this is undertaken only so often. User-generated data evolve as user focus evolves, enabling them to more effectively capture developing phenomena. User-generated data are also less expensive than survey or administrative data, and do not add to the administrative burden on respondents. Lastly, the multidimensionality of big data provides a richness that traditional, siloed data sets cannot offer.
While the benefits of big data are notable, there are certainly challenges to confront in its usage as well. As noted above, linking novel data sets to ones that are better understood is a valuable approach in helping to wade into new territory, gradually exploring the meaning of new data. However, cross-linking these sources when traditional data are available only in aggregate means that the value of big, disaggregated data sources is considerably diminished. Interlinking data sets is also a technical challenge, as much of the information is presently in siloes.
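The cross-linking constraint can be sketched in a few lines of Python. This is a hypothetical illustration rather than anything from the project: the records, region codes and official counts are invented, and the function names are our own. When the established source exists only as regional totals, the disaggregated records have to be rolled up to the same level before the two can be compared.

```python
from collections import defaultdict


def aggregate_by_region(records):
    """Roll individual records up to regional counts (individual detail is lost)."""
    totals = defaultdict(int)
    for record in records:
        totals[record["region"]] += 1
    return dict(totals)


def coverage_ratio(big_data_counts, official_counts):
    """Share of the officially reported total that the novel source captures."""
    return {
        region: big_data_counts.get(region, 0) / official
        for region, official in official_counts.items()
        if official > 0
    }


# Hypothetical disaggregated records, e.g. scraped professional profiles.
profiles = [
    {"region": "A", "occupation": "engineer"},
    {"region": "A", "occupation": "analyst"},
    {"region": "A", "occupation": "engineer"},
    {"region": "B", "occupation": "technician"},
]

# Hypothetical official statistics, available only as regional totals.
official_counts = {"A": 6, "B": 4}

ratios = coverage_ratio(aggregate_by_region(profiles), official_counts)
```

The comparison is useful as a sanity check on coverage (here the novel source captures half of region A but only a quarter of region B), yet the occupation field plays no part in it: once the data are matched at the aggregate level, the fine-grained detail that made the novel source attractive drops out.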
As the volume of existing data is increasing exponentially—as is the value—regulatory and legislative frameworks are under intense pressure to keep pace with the rapid evolution of data itself, meaning that data can be in nebulous circumstances with respect to ownership and re-use rights. The differences between for-profit exploitation and public-good usage of data are not well clarified and treated in existing legal frameworks, and public trust in data usage for social benefit is still in its early stages.
There is also reason for people to be skeptical, as extensively interlinked data can profile subgroups within a population, locking people into cycles of self-fulfilling prophecy (for instance, if correlations between income level and crime rate were to prompt increased policing in poor neighbourhoods). One interesting real-life example of data-driven discrimination is a program that used smartphone gyroscope data to locate potholes and target repairs, except that poorer areas were sparsely covered because fewer people there had smartphones, disadvantaging these areas in terms of pothole detection and repair.
Many mature econometric analysis approaches are not designed to cope with such high dimensionality, calling for the innovation of new mathematical tools to fully capitalize on these data sources. Text mining tools are predominantly developed for the English language, significantly restricting our ability to apply similar approaches across cultural contexts, while the most cutting-edge tools (analytical and otherwise) are seldom interoperable across platforms. Managing large data sets requires significant investments in hardware, software and IT support.
Skills gaps and lack of awareness across the data–policy divide still make it difficult to design structures to effectively implement data-driven policy systems, in turn undermining the push for the investments that are required to realize them. Finally, optimism about data mining often distracts users from the fact that this approach still requires a robust study design (and evaluation framework, for policy purposes) with clear questions, to ensure that findings are relevant. Data mining is very powerful and can produce astoundingly valuable results, but there’s a lot more to it than just trawling the net for a sleek solution to the challenges of modern governance. Stay tuned to this series to find out just how much more is involved.
Table of contents for our Data Mining series:
- Structure and management of data mining for policy
- Case study findings for research & innovation policy
  - Cross-boundary collaboration [6 September 2017]
  - Innovation, growth and prosperity [13 September 2017]
  - Open Access: policies and outcomes [20 September 2017]
- Enduring challenges
  - Data access [27 September 2017]
  - Risk, failure and continuous learning [5 October 2017]
Items in the table of contents will provide live hyperlinks as content is rolled out.
Science-Metrix’s final report for this data mining project is pending publication by the European Commission. Upon the report’s release, the bibliographic details will be as follows:
Data Mining. Knowledge and technology flows in priority domains within the private sector and between the public and private sectors. (2017). Prepared by Science-Metrix for the European Commission. ISBN 978-92-79-68029-8; DOI 10.2777/089