In our ongoing series on data mining to inform policy, we are giving the topic of data access its own post because of the implications it had for the success or failure of our case studies. The simple reality is that you can’t mine data that don’t exist (or that may as well not exist when they are functionally or realistically impossible to access). As a result, access is particularly important since it underpins the rest of the work in a data mining project. Let’s tease this topic out a little, shall we?
In October 2016, I attended an excellent keynote presentation at the 25th colloquium of the Société québécoise d’évaluation de programme, delivered by Yves-Alexandre de Montjoye of Imperial College London and the MIT Media Lab. His talk was on the use of metadata for the evaluation of public policy and focused specifically on the limits of anonymization for big data. After presenting a number of fascinating (albeit somewhat disturbing) examples of efforts to anonymize data that had later been reverse engineered (leading to serious breaches in data privacy), one of his takeaway messages was that “anonymization does not work for big data (i.e., modern high-dimensional behavioral data sets).” He presented a simple but powerful figure depicting the relationship between data privacy and utility, which showed that higher privacy = lower utility.
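To make the point about high-dimensional data concrete, here is a minimal sketch (not taken from the keynote) that counts how many records become unique as more attributes are combined. The records, field names and the `fraction_unique` helper are all hypothetical and chosen purely for illustration.

```python
from collections import Counter


def fraction_unique(records, fields):
    """Share of records whose combination of values on `fields` is shared with no other record."""
    if not records:
        return 0.0
    keys = [tuple(record[field] for field in fields) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Hypothetical toy records with no names attached; not real data.
records = [
    {"city": "Montreal", "birth_year": 1980, "sector": "biotech"},
    {"city": "Montreal", "birth_year": 1980, "sector": "software"},
    {"city": "Montreal", "birth_year": 1975, "sector": "biotech"},
    {"city": "Quebec", "birth_year": 1980, "sector": "biotech"},
]

one_field = fraction_unique(records, ["city"])
two_fields = fraction_unique(records, ["city", "birth_year"])
three_fields = fraction_unique(records, ["city", "birth_year", "sector"])
```

In this toy example, each added attribute singles out more records, even though no names are present. Coarsening or dropping attributes would reduce that share, but it would also reduce what an analyst can learn from the data, which is the privacy versus utility trade-off described above.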
His talk was well timed for our project team, given that for several months leading up to that point we had encountered a series of acute data access challenges, many of which were precipitated by privacy concerns. Large data sets that could be of value to our research were only available at a highly aggregated level and/or prohibitive cost, often accompanied by lengthy request processes that sometimes only permitted access by researchers in academia. In other cases, user terms prevented access to data that would be technically feasible to obtain.
For example, while LinkedIn contains a wealth of data that could in theory be used to explore innovation clusters and other economic phenomena of great interest to R&I policymakers, the site’s user terms prohibit scraping of the data. Hold on, though, you may be saying—you’ve certainly seen studies that have used LinkedIn data, so there must be a way around this. The answer is that there is, but the process is neither straightforward nor codified in such a way as to enable equal, open access to all. As it turns out, one of the most frustrating aspects of the case studies we carried out was that although some studies are technically possible to conduct (i.e., the data exist and we have the necessary skill set to extract and analyze them), they turn out to be impossible to conduct in practice.
Creative approaches to overcoming data access challenges
Let’s consider the case of Prasanna Tambe—a researcher who conducted a study on employment dynamics using LinkedIn data not normally available outside of the company’s Mountain View, California, headquarters. In what began as a self-funded summer research project, Tambe was able to analyze LinkedIn’s highly restricted data set by leveraging his programming skills and connections in California to spend time in the company’s offices.
In a publication describing emerging practices in big data analysis in the field of economics, Tambe described the process of gaining access to the LinkedIn data as having no real “blueprint,” noting that he “knew somebody who knew somebody […] it wasn’t that easy or direct.” According to the authors of that paper, Tambe’s experience highlights a geographical and generational divide between economists conducting research with traditional survey-based data sets versus those using non-traditional data from big internet firms in Silicon Valley.
Another example we found to be highly illustrative of inequalities in data access (also described in the paper cited above) comes from the work of Raj Chetty from Harvard University and Emmanuel Saez of the University of California, Berkeley, who carried out ground-breaking research on intergenerational mobility in the early 2010s. The researchers analyzed millions of US tax records after convincing the Treasury Department’s Office of Tax Policy of the merits of conducting such a study, and that doing so required a higher level of access than protocol generally allows.
While other researchers working with IRS data must rely on dummy data sets and a cumbersome back and forth between the researchers and tax office, in this case, Chetty and Saez’s team sidestepped the process by essentially becoming staff of the IRS. They were required to undergo fingerprinting, background checks and training in the handling of administrative data. They were also subject to the same rules and penalties that apply to any IRS employee, and all their work needed to be pre-approved by the IRS in order to be published.
At the level of our data mining framework, the acute impact of data access constraints on project outcomes led to the addition of the scoping phase where the finer details of how data can be accessed—including by whom and for what cost—should be explored before making a full-fledged commitment to a given study design. We have learned that just because a study is possible in theory does not necessarily mean that it’s possible in practice.
We also suggest undertaking a two-phase data characterization process to determine whether the data sources available are good candidates for use. The first phase should be used to derive some fundamentals about the data (e.g., data fields included, access requirements, confidentiality clauses, data file format, structure, size, and coverage), while the second phase requires a more in-depth assessment of data quality (timeliness, punctuality, accuracy, reliability, coherence, etc.) that can only be conducted when the data are in hand.
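The two-phase process can be sketched as a simple checklist. What follows is only an illustration of the idea, not the tooling used in the project: the `PhaseOneProfile` class, the helper functions and the example values are all assumptions introduced here.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class PhaseOneProfile:
    """Fundamentals that can be recorded before the data are in hand (None = not yet known)."""
    data_fields: Optional[str] = None
    access_requirements: Optional[str] = None
    confidentiality_clauses: Optional[str] = None
    file_format: Optional[str] = None
    structure: Optional[str] = None
    size: Optional[str] = None
    coverage: Optional[str] = None


# Quality dimensions that can only be assessed once the data are in hand.
PHASE_TWO_DIMENSIONS = ("timeliness", "punctuality", "accuracy", "reliability", "coherence")


def unknown_fundamentals(profile):
    """List the phase-one items that have not been documented yet."""
    return [f.name for f in fields(profile) if getattr(profile, f.name) is None]


def next_step(profile, data_in_hand):
    """Suggest what to do next for a candidate data source."""
    missing = unknown_fundamentals(profile)
    if missing:
        return "phase 1 incomplete: " + ", ".join(missing)
    if not data_in_hand:
        return "phase 1 complete: secure access before assessing quality"
    return "phase 2: assess " + ", ".join(PHASE_TWO_DIMENSIONS)


# Hypothetical candidate source for which only two fundamentals are known so far.
candidate = PhaseOneProfile(data_fields="name, employer, dates", file_format="CSV")
status = next_step(candidate, data_in_hand=False)
```

Recording unknowns explicitly, rather than leaving them blank, makes it easier to see during scoping which sources cannot even be characterized without first paying for or formally requesting access.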
In some cases, access restrictions are so acute that even the first phase of characterization cannot be undertaken; the basic structure of the information, variable types and coverage sometimes cannot be characterized before securing access to the data, meaning that even a preliminary assessment of suitability can require the purchase of data (for private data), or the completion of a lengthy bureaucratic process (for public data). In such cases, the cost just to assess suitability is usually a sufficient deterrent to considering these potential data sources. A number of commercial providers were quite accommodating about allowing us to preview the data to figure out whether they would be suitable for our purposes, which we greatly appreciated. All the same, these accommodations still required one-off requests and exceptions; the default way of operating (the path of least resistance) is at odds with the actual workflow of data mining projects.
On a broader level, the need for political clout and/or connections, along with the long time frames required to access data, is frequently ill-suited to individual project timelines or budgets. Also, the mandates of firms carrying out data mining projects do not necessarily involve advancing the data access discourse. The finding that data access considerations are not solely technical in nature (that is, that barriers to access can be created and broken down by power dynamics and relationships) underpinned our recommendation that the European Commission develop a Community of Practice for the use of big data in R&I policymaking.
Science-Metrix’s final report for this data mining project is available from the Publications Office of the European Union.
Data Mining. Knowledge and technology flows in priority domains within the private sector and between the public and private sectors. (2017). Prepared by Science-Metrix for the European Commission. ISBN 978-92-79-68029-8; DOI 10.2777/089
All views expressed are those of the individual author and are not necessarily those of Science-Metrix or 1science.