In my previous post, I outlined the initial technical framework developed by Science-Metrix in the course of the data mining project for the European Commission documented in this blog series. This initial data mining framework, strongly inspired by existing frameworks, provided a solid foundation on which to build. However, to support data mining in a policy context, and in research & innovation policy specifically, further development was needed. This post covers some of the more novel work we undertook in our project, bringing new suggestions into the data mining space. Two main challenges drove these further developments: the generality of the framework, treated here, and the framework’s reliance on feedback loops, covered next week.
As mentioned last week, this framework is very general, which is both its advantage and its undoing: it is flexible enough to apply in any thematic context, but vague enough to require further specification before being applied in any particular area. The first major challenge for our initial framework, then, is that it’s not specific enough to research & innovation (R&I) policy. When engaging with data scientists, policy experts tend to have limited awareness of the full breadth of indicators, data sources and analysis algorithms that are already out there in the R&I space. Put simply, a lightning tour of the state of the art in data science is a very helpful way to facilitate the engagement of R&I policy experts in data mining projects.
Accordingly, Science-Metrix developed several domain-specific components: plug-ins to the general framework that help to illustrate the range of possible analyses for data mining in R&I, along with decision-making tools to help navigate that range of possibilities. These tools are intended to further facilitate engagement between policy experts and data scientists, to help them narrow in on a valuable study design. But note that these plug-ins would be hopeless as a full-fledged replacement for a data scientist! Furthermore, the plug-ins are not comprehensive; they just give a broad overview of established options. Recall that big data in this context are partially defined by their novelty, meaning that, by definition, these tools can only be a stepping stone to a true data mining project.
The domain-specific components plug into the framework as follows:
The five domain-specific components are the following:
(1) Inventory of R&I indicators: This inventory covers more than 1,000 indicators, aggregated from key documents in the international R&I policy scene. The indicators are all categorized, based on a system inspired by the Oslo model of innovation (which focuses on firms as the engines of innovation, in line with our project’s focus). A set of questions is also provided to help users narrow in on the sets of indicators that would best fulfill their policy needs. Additionally, browsing the inventory is very helpful to find inspiration; creativity and open-mindedness really enrich this process.
However, while the indicator inventory is valuable in operationalization (translating a policy question into a data mining problem), it relies on the question being adequately developed beforehand. (We developed tools to facilitate that as well, not shown here.) Just as the framework and plug-ins presented here cannot alter the fundamental need for data scientists to be involved in this process, neither do they remove the fundamental need for policy experts to be involved. These tools facilitate collaboration; they don’t do away with it.
(2) Inventory of data sources: This inventory covers about 100 data sources, characterizing each along several dimensions, including access method & cost, confidentiality, file format, data type, structure, size, statistical population, geographical coverage, time coverage, sector coverage, and level of aggregation. Each data source is related to the indicators listed in the inventory above, using a unique identifier system to interlink the inventories.
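To make the interlinking concrete, here is a minimal sketch of how two inventories might reference each other through identifiers. It is an illustration only, not the structure used in the project: the class names, field names, identifier formats (`IND-001`, `SRC-01`) and example entries are all hypothetical, and only a few of the dimensions listed above are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indicator:
    indicator_id: str  # hypothetical identifier format
    name: str
    category: str


@dataclass(frozen=True)
class DataSource:
    source_id: str             # hypothetical identifier format
    name: str
    access_cost: str           # e.g. "free" or "paid"
    level_of_aggregation: str  # e.g. "firm" or "country"
    indicator_ids: tuple = ()  # identifiers of the indicators this source can feed


def sources_for_indicator(indicator_id, sources):
    """Return the data sources linked to a given indicator identifier."""
    return [s for s in sources if indicator_id in s.indicator_ids]


# Invented example entries, for illustration only.
indicators = [
    Indicator("IND-001", "Business R&D expenditure", "inputs"),
    Indicator("IND-002", "Patent applications", "outputs"),
]
sources = [
    DataSource("SRC-01", "Example firm survey", "paid", "firm", ("IND-001",)),
    DataSource("SRC-02", "Example patent register", "free", "firm", ("IND-002",)),
    DataSource("SRC-03", "Example national accounts", "free", "country", ("IND-001",)),
]

# Which freely accessible sources could feed indicator IND-001?
free_matches = [
    s.name
    for s in sources_for_indicator("IND-001", sources)
    if s.access_cost == "free"
]
print(free_matches)
```

Keeping the link as a plain identifier, rather than nesting one record inside the other, means either inventory can be browsed or filtered on its own and joined to the other only when needed.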
As noted in our first post in the data mining series, crossing sources is especially valuable. Characterizing “crossability” in the inventory is not possible, though, as it is a specific relationship between sources rather than a property of sources individually. Thus, no crossability parameter is described in the inventory. Furthermore, confidentiality issues are an important concern in data mining, and while confidentiality is described for sources individually in the inventory, crossing sources sometimes produces emergent effects that impact confidentiality.
This is known as the mosaic effect: Although individual data sources are each anonymized, crossing these sources can give such a detailed picture that individuals (in this case, usually individual firms) in fact become once again recognizable. The risk of the mosaic effect cannot be accounted for in the inventory, but must nonetheless be kept in mind during study design.
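A toy example may help show why this risk only appears once sources are crossed. The sketch below uses invented records and assumes, purely for illustration, that two anonymized sources share a pseudonymous firm key (`p1` to `p4`). It counts the size of the smallest group of firms sharing the same attribute values: a group of size 1 means a firm is unique on those attributes and could potentially be singled out.

```python
from collections import Counter


def smallest_group(records, keys):
    """Size of the smallest group of records sharing the same values for `keys`."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return min(counts.values())


# Two invented, individually anonymized sources keyed by a pseudonymous firm id.
source_a = {
    "p1": {"sector": "biotech", "region": "North"},
    "p2": {"sector": "biotech", "region": "North"},
    "p3": {"sector": "software", "region": "South"},
    "p4": {"sector": "software", "region": "South"},
}
source_b = {
    "p1": {"size_band": "large"},
    "p2": {"size_band": "small"},
    "p3": {"size_band": "small"},
    "p4": {"size_band": "large"},
}

# Crossing the sources: merge the attributes of firms present in both.
crossed = [{**source_a[pid], **source_b[pid]} for pid in source_a if pid in source_b]

k_a = smallest_group(source_a.values(), ["sector", "region"])
k_b = smallest_group(source_b.values(), ["size_band"])
k_crossed = smallest_group(crossed, ["sector", "region", "size_band"])

print(k_a, k_b, k_crossed)
```

In this toy data, every firm shares its profile with at least one other firm in each source taken alone, but every firm becomes unique once the two sources are combined. A check of this kind is only a rough screen; it does not replace a proper confidentiality assessment during study design.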
(3) Inventory of data extraction, transformation and loading (ETL) tools: A range of tools is characterized here by their various functions, once again with critical questions provided to facilitate navigating the inventory. Much of this material is of such a technical nature that it is well beyond the ken of policy experts; in fact, even many quantitative analysts accustomed to working with pre-built databases have only a marginal awareness of tools such as these, which are usually the speciality of database managers and data quality experts. Nonetheless, it’s important to consider the policy-relevant impacts that methodological decisions will have, so data scientists and policymakers must collaborate on these decisions, even though they may appear to be “just technical.”
The hardware required to undertake these tasks can be expensive to set up, and upfront costs such as these can be a substantial barrier for policy shops looking to experiment with data mining. However, many cloud-based options are now readily available, and can be used for a considerable range of data mining project needs, lowering the barrier to entry. One concern in a government context is whether these cloud-based solutions comply with policies and legislation around privacy, confidentiality and security.
(4) Inventory of analysis methods: These tools are sorted by the type of project goal they support (data mining for description, prediction or hypothesis testing), by type of method (classification, regression, clustering or association), and by type of variable supported (discrete or continuous). A decision tree is also supplied to help users navigate the inventory based on their study question.
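As a rough sketch of how such an inventory can be navigated, the snippet below treats each step of a decision tree as a filter on one of the three dimensions just listed. The four entries and the way they are classified are my own illustrative assumptions, not an extract from the project’s inventory or its decision tree, and `Method`, `INVENTORY` and `shortlist` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Method(NamedTuple):
    name: str
    goals: frozenset   # project goals the method can support
    method_type: str   # classification, regression, clustering or association
    variables: str     # kind of variable mainly handled (an assumption of this sketch)


# Illustrative entries only; a real inventory would be far larger.
INVENTORY = [
    Method("k-means", frozenset({"description"}), "clustering", "continuous"),
    Method("Apriori", frozenset({"description"}), "association", "discrete"),
    Method("logistic regression", frozenset({"prediction", "hypothesis testing"}),
           "classification", "discrete"),
    Method("linear regression", frozenset({"prediction", "hypothesis testing"}),
           "regression", "continuous"),
]


def shortlist(goal: str, variables: str, method_type: Optional[str] = None):
    """Narrow the inventory step by step, as a decision tree would."""
    return [
        m.name
        for m in INVENTORY
        if goal in m.goals
        and m.variables == variables
        and (method_type is None or m.method_type == method_type)
    ]


print(shortlist("prediction", "continuous"))
print(shortlist("description", "discrete"))
```

The output of a shortlist like this is a starting point for discussion with a data scientist, not a final choice of method.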
Visualization tools are also included in this inventory, acknowledging in a prominent way the importance of communicating findings as a critical step on the pathway towards implementation of those findings. The inclusion of visualization tools in this inventory also acknowledges the inseparable connection between analysis methods and approaches to communication. Communication should not be considered as “just the last step in the project,” but should rather be an ongoing consideration throughout the project. Feeding such an ongoing discussion is harder when the data science team and the policy team do not work together frequently, whether in the context of a single project or across projects. Communication is a skill that must be regularly honed if it is to keep its trenchant edge.
(5) Inventory of data pre-processing tools: This inventory covers tools for various functions, including cleaning and enriching data, imputing missing values, transforming data, and reducing their dimensionality. As noted in previous posts, data preparation can constitute 60%–80% of personnel time spent on a project, and the results of data mining projects are often particularly sensitive to data preparation methods. Pre-processing tools must therefore be selected and applied with due caution.
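The sensitivity to preparation choices is easy to see even in a tiny example. The sketch below imputes one missing value in an invented series using two common strategies, the median and the mean of the observed values. The function name `impute` and the numbers are hypothetical and chosen only to illustrate the point.

```python
from statistics import mean, median


def impute(values, strategy="median"):
    """Replace missing values (None) with the median or mean of the observed ones."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = median(observed)
    elif strategy == "mean":
        fill = mean(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Invented R&D spending figures with one missing observation and one large outlier.
rd_spend = [10.0, None, 20.0, 90.0]

by_median = impute(rd_spend, "median")
by_mean = impute(rd_spend, "mean")

print(by_median)
print(by_mean)
```

Because of the single large value, the mean-based fill is twice the median-based one here, and any downstream average or model would inherit that difference. Which strategy is appropriate depends on the data and the policy question, which is one more reason these choices are worth discussing jointly.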
These domain-specific plug-ins provide a considerable level of detail about what data, indicators and tools exist for data mining in the R&I policy context. With the collaboration of a good data scientist, these plug-ins can help move discussions forward, and contribute to the evolution of an appropriate study design to address the policy question at hand. The framework on its own is already valuable, but the plug-ins can help policymakers to get up to speed much more quickly on some of the specific decisions that go into building and undertaking this kind of project, enabling them to participate more actively.
The second major difficulty—the iterative nature of data mining projects, and the resulting dependence on the feedback mechanisms of the data mining framework—will be the subject of next week’s post. Stay tuned.
Science-Metrix’s final report for this data mining project is available from the Publications Office of the European Union.
Data Mining. Knowledge and technology flows in priority domains within the private sector and between the public and private sectors. (2017). Prepared by Science-Metrix for the European Commission. ISBN 978-92-79-68029-8; DOI 10.2777/089
Note: All views expressed are those of the individual author and are not necessarily those of Science-Metrix or 1science.