Continuing on in our series of posts on data mining for policymaking, this post presents the initial technical framework developed by Science-Metrix to guide the conduct of data mining projects in a government context (with some shout-outs to other contexts as well). This seven-step framework formed the basis of our case studies, and effectively lays out the steps through which data mining projects progress from stem to stern. It’s a great introduction for those who have never participated in data mining. For more experienced practitioners, the framework provides the basis for understanding the main challenges to data mining in a policy context, as well as the recommendations we put forward to address those challenges and get more value out of the process.
The framework presented here traces its lineage to the knowledge discovery in databases (KDD) framework, originally developed in the 1990s to guide academics through data mining projects. The KDD framework covers the full range of steps, from understanding the state of knowledge in which the question arises and formulating the research question, all the way through to analysis and the re-integration of findings back into that broader context of knowledge.
While the KDD framework is the basis for this structure, the CRISP-DM framework (which focuses on data mining in the business sector) is the most popular framework according to a survey by KDnuggets. CRISP-DM, along with many of the other popular approaches, shares many structural similarities with the KDD framework, so while KDD is our basis, this is not to the exclusion of other approaches. (Additionally, the survey found that RapidMiner is the most popular software for data mining. RapidMiner is free, so it's a good starting point if you're looking to explore a little.) Our initial framework tried to distill the essence common to the various existing models, and would later evolve under the pressures encountered during the case studies.
The aim of this framework is to guide policymakers and data scientists through the process of data mining for policy research. Such projects rely on strong collaboration between these two groups, who can often be separated by a sizeable chasm. The framework presented here helps to create a bridge, providing an avenue for each to engage with the other. What the framework is not—and could not hope to be—is a tool for policymakers to undertake data mining projects on their own; for policymakers, it is a tool to facilitate engaging with data scientists, not a substitute for a data scientist. Here’s our initial framework depicted visually, with a verbal description to follow:
This framework has seven steps.
(1) Application domain understanding: This is essentially a review of the existing state of knowledge, including previous studies, the policy context, and the broader goal of the data mining project. Logic models are valuable resources to track down, as they depict the policy context (and make the underlying thinking explicit). The data mining project will assess the ecosystem depicted by this logic model and/or the model's adequacy for depicting that ecosystem.
This first step is also where the driving policy question begins to be translated into an empirical research problem to address using data mining, specifically through the selection of appropriate indicators. This translation (which I’ve also called “operationalization” in other places) is one of the most tangible membranes between evidence and policy. The permeability of this membrane goes a long way in determining how relevant the evidence will be for the policy application, and how likely it is that the policy discussions will be influenced by the evidence gathered. It’s a junction point that seems to get much less attention than it deserves, given how crucial it can be—especially in circumstances where the policymaker and the data scientist don’t work in the same unit (or even in the same organization).
(2) Data understanding: In this step, the data sources are identified, and the data are extracted and stored. The data are also characterized, elucidating the variables they cover, their completeness (whether there are missing values), their accuracy, their coherence (whether a consistent methodology has been applied to take all measurements), and the normality of their distributions.
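To make the characterization step concrete, here is a minimal pure-Python sketch of a per-variable profile covering completeness, distinct values, and numeric ranges. The records and field names are invented for illustration; a real project would likely lean on a dedicated profiling library rather than rolling its own:

```python
def characterize(records, fields):
    """Return a per-field profile: completeness, distinct values, numeric range."""
    profile = {}
    n = len(records)
    for field in fields:
        values = [r.get(field) for r in records]
        present = [v for v in values if v is not None]
        numeric = all(isinstance(v, (int, float)) for v in present)
        profile[field] = {
            "completeness": len(present) / n if n else 0.0,  # share of non-missing values
            "distinct": len(set(present)),
            "min": min(present) if present and numeric else None,
            "max": max(present) if present and numeric else None,
        }
    return profile

# Invented example records: two firms with sector info, one missing employee count
records = [
    {"firm_id": 1, "employees": 12, "sector": "ICT"},
    {"firm_id": 2, "employees": None, "sector": "ICT"},
    {"firm_id": 3, "employees": 250, "sector": "Health"},
]
print(characterize(records, ["employees", "sector"]))
```

Even a rough profile like this surfaces the gaps (here, a missing employee count) that drive decisions in the later data preparation step.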
(3) Identification of analysis methods: Based on the project goal, the study question, and the data sources selected, analysis methods are chosen at this step. The goal of the study can be a description of the state of affairs (descriptive data mining), the development of a hypothesis about the causal mechanisms at work in the ecosystem (predictive data mining), or the testing of a causal hypothesis already encoded in the logic model (hypothesis testing).
As big data sources are often of high dimensionality (they cover lots of different angles), overfitting is more of a concern in these projects than in more traditional research approaches, where dimensionality is typically lower. Algorithms for high-dimensional analysis are also an area where further development will be interesting to follow in the coming years; most of our analysis tools evolved in a world of considerably less computing power, where analysis at this scale wasn't feasible, but that has changed quickly.
A neat corollary of dimensionality issues is that certain approaches can be screened out simply with a reasonable guess of the anticipated magnitude of a given effect. That is to say, if you're trying to control for a lot of incidental factors, and you don't have a very long and complete time series of data, you won't find anything unless your effect size is very large. For instance, if you're looking at the impact of innovation on employment growth, and you're controlling for firm size, age, sector, country, average educational attainment, era, and so forth, then you'll need a huge number of companies and a very high-quality time series of observations to uncover an effect.
This intuitively makes sense: if you’re applying a lot of filters to screen out the noise, then your signal has to be very strong to not get lost in the filtering. A particularly nifty feature is that you can reverse-engineer this process: based on the number of observations and control variables in play, you can calculate the threshold below which no signal will be detectable. For example, based on your study design and data availability, you may need a 2% increase in innovation to lead to a 50% increase in employment for results to be statistically significant—and given that such a massive effect is way beyond reasonable expectation, you could then start planning a new approach without spending any more time on one that is unworkable. Of course, that depends on you having a sufficient grasp of background studies to know which magnitudes of effect are reasonable to expect and which aren’t.
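The reverse-engineering idea above can be sketched with the standard normal-approximation power formula for a two-group comparison. The z-values below correspond to roughly 5% significance and 80% power, and the group sizes and standard deviation are invented for illustration; a real study would use a proper power-analysis tool and its own design parameters:

```python
import math

def minimum_detectable_effect(n_per_group, sigma, z_alpha=1.96, z_power=0.84):
    """Smallest true difference in group means detectable at ~5% significance
    with ~80% power, using the normal approximation for a two-group comparison
    with equal group sizes and common standard deviation sigma."""
    return (z_alpha + z_power) * sigma * math.sqrt(2.0 / n_per_group)

# With 200 firms per group and an outcome standard deviation of 10 percentage
# points of employment growth, any true effect smaller than this is invisible:
mde = minimum_detectable_effect(200, 10.0)
print(f"minimum detectable effect: {mde:.2f} percentage points")
```

If the threshold that comes out of a calculation like this sits well above any effect size that the background literature would deem plausible, the study design is unworkable and can be abandoned before any analysis time is sunk into it.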
(4) Data preparation: This step brings data from their state at extraction to a state of readiness for the analysis approaches selected in step (3) above. This includes standardization (such as ensuring names are spelled the same way in all cases), enrichment (such as geocoding postal codes to spatial coordinates, or parsing and structuring text by various attributes), identification of outliers and missing values, addressing redundancy (such as implementing composite indicators to remove redundancy between individual indicators that are strongly correlated with each other), and transforming the data format to fit the input format needed for analysis algorithms. Data preparation can often amount to 60%–80% of personnel time spent on a data mining project.
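Two of the preparation tasks listed above, name standardization and outlier identification, can be sketched in a few lines of pure Python. The alias table, records, and z-score threshold are invented for illustration:

```python
import statistics

# Hypothetical lookup table mapping known spelling variants to a canonical form
ALIASES = {"I.B.M.": "IBM", "Intl Business Machines": "IBM"}

def standardize_name(name):
    """Map known spelling variants to a canonical form, else return as-is."""
    cleaned = name.strip()
    return ALIASES.get(cleaned, cleaned)

def flag_outliers(values, z_threshold=3.0):
    """Return indices of values more than z_threshold std. devs. from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mean) > z_threshold * sd]

names = ["IBM", "I.B.M. ", "Intl Business Machines"]
print([standardize_name(n) for n in names])  # all three map to "IBM"

revenues = [10, 12, 11, 9, 10, 500]  # the 500 looks suspicious
print(flag_outliers(revenues, z_threshold=2.0))
```

Flagged outliers then need a human judgment call: some are data-entry errors to correct or drop, while others are genuine extreme observations that belong in the analysis.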
(5) Analysis: Once data extraction, analysis method selection and data preparation are complete, this step is basically the implementation of the study itself—running the program that’s been painstakingly designed on data that have been meticulously prepared.
(6) Evaluation: Results are assessed in this step to determine their meaning for the policy context at hand, their novelty with respect to the existing knowledge base, and the further relevant questions that they raise.
(7) Knowledge consolidation & deployment: Reporting covers the process and the findings, including their policy relevance. Findings are integrated into decision-making, and their implications for future decisions and for future studies are catalogued.
Steps (5) to (7) are not covered in detail in this post, given that the framework was primarily meant to address the technical aspects of these projects. They are nonetheless crucial steps, and will be treated more fully in an upcoming post on project management structures for data mining in a policy context.
There are also several feedback loops depicted in the diagram, illustrating how iterative these projects can be. For instance, sometimes in developing a research question, indicators are chosen for which no appropriate data source is available, forcing the team to go back to define the question in an alternate way. In other cases, once indicators and data are chosen, no appropriate analysis methods are available, pushing the team to circle back to the question design or data source selection. Similarly, sometimes the indicators, data source and analytical methods are just fine, but would require far more data cleaning and preparation than can be accommodated in the timeline and budget, calling for a re-adjustment of some parameter set earlier in the process.
While this framework is already a helpful tool to facilitate collaboration between policymakers and data scientists, there are two main challenges it struggles to respond to in its present form. First, while the framework is general enough to apply in any policy context, it’s also too vague to apply in any particular context without further specification. Second, while the feedback loops in place provide the means for the process to be iterative, the framework structure suggests that the process is still primarily linear, underselling just how iterative the process really is. In next week’s post, we’ll start to outline these challenges more concretely, and present the revisions that they brought about in our process.
Read the next post in this series.
Science-Metrix’s final report for this data mining project is available from the Publications Office of the European Union.
Data Mining. Knowledge and technology flows in priority domains within the private sector and between the public and private sectors. (2017). Prepared by Science-Metrix for the European Commission. ISBN 978-92-79-68029-8; DOI 10.2777/089
Note: All views expressed are those of the individual author and are not necessarily those of Science-Metrix or 1science.