In a blog post from earlier this year, Neil Lawrence describes some challenges to data mining projects that are familiar to many working in the domain (our team definitely included!). Chief among them are the availability and quality of the data for the project. Data scientists are often faced with very detailed expectations of budgets and timelines but are given very little information at the outset about what data they will have to work with, making it difficult to determine whether a project’s outline is realistic. To begin addressing this problem, Lawrence lays out a very general taxonomy of “data readiness levels,” which provides useful language to help us identify and ultimately overcome these challenges that currently hinder many data science projects.
Lawrence’s taxonomy is divided into three “bands”: C, B and A.
- Band C: Accessibility. In this stage, the team is trying to sort out what data sets exist, where they are (consolidated in one place or distributed across locations), who owns them, what restrictions exist on their usage (including legal or privacy concerns), and what format they’re provided in, among other similar questions. A data set is stamped “Band C” when all these questions have been answered.
- Band B: Faithfulness. In this stage, the team is trying to sort out how accurately the data represent what they are intended to represent, covering such issues as missing values, outliers, noise, and how representative the data are of the whole population (a random sample as opposed to a sample of convenience). A data set is stamped “Band B” once these issues have been addressed.
- Band A: Applying data in context. In this stage, the fit of the data with various use cases is assessed, determining their relevance to the problem at hand, the types of analyses that can be conducted on data of this type, and the like. A data set is stamped “Band A” when it has been greenlit for application in the context specified.
Along with these bands, Lawrence hints at the notion of breaking them down into more granular “levels” within each band. For instance (and this is a personal favourite of mine), a data set that someone has heard of but no one in the organization has actually seen or even located yet would be labeled C4, the lowest level for Band C.
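To make the bands and levels concrete, here is a minimal sketch of how a readiness label might be represented and compared in Python. The names `Band` and `ReadinessLevel` are hypothetical, and the sketch assumes that 1 is the most ready level within each band and that Bands B and A also have numbered levels; the only level named above is C4, the lowest level of Band C.

```python
from dataclasses import dataclass
from enum import IntEnum
from functools import total_ordering


class Band(IntEnum):
    """The three bands, ordered from least to most ready."""
    C = 1  # accessibility
    B = 2  # faithfulness
    A = 3  # applying data in context


@total_ordering
@dataclass(frozen=True)
class ReadinessLevel:
    band: Band
    level: int  # assumption: 1 is the most ready level within a band

    def _key(self):
        # A higher band always wins; within a band, a smaller number is more ready.
        return (self.band, -self.level)

    def __lt__(self, other):
        return self._key() < other._key()

    def __str__(self):
        return f"{self.band.name}{self.level}"

    @classmethod
    def parse(cls, label):
        """Turn a label such as 'C4' into a ReadinessLevel."""
        return cls(Band[label[0]], int(label[1:]))


heard_of_only = ReadinessLevel.parse("C4")
accessible = ReadinessLevel.parse("C1")
print(heard_of_only < accessible)  # a data set nobody has seen is less ready
```

With an ordering like this, a project plan can state a starting level and a target level for each data set and check whether the target has been reached.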
One source of value for these bands and levels is that they would raise awareness of the difficulty and time-intensiveness of starting from scratch when it comes to locating data, collecting/extracting them, processing them, and finally analyzing them. Having a well-structured architecture to delineate each of these steps could be an important salve to a rather ubiquitous and enthusiastic hysteria: the belief that for basically no money and in no time, a completely new data set can be created and refined to solve the world’s most wicked problems. Much as we take great pains to acknowledge that such expectations are unrealistic, our practices still seem more informed by our starry-eyed, optimistic vision than by our hard-nosed, realistic caveats.
Furthermore, the adoption of such a taxonomy would provide us a language in which to couch our discussions of data readiness. Budgets, timelines and expectations of projects could be delineated in the following kinds of ways, for instance: “The first X months of the project will be spent identifying three data sets and bringing them to C1 level, consuming Y budget.” Advancing the readiness of data from one specified level to another specified level could even be the entire goal of a project! Certainly, having a language like this at our disposal at least gives us an opportunity to ask what the current data readiness level is, what the expected endpoint is for the data set under this project, and how we intend to traverse each step along the way. These questions don’t currently get much attention, much to the chagrin of data science teams and at the cost of an increased risk of project failure.
While I enthusiastically support this proposed taxonomy, there are two concerns that I have about the process Lawrence describes and that I think warrant consideration. In particular, I worry that the determination of the use case comes at the end of this process, and that the process suggests that data sets as a whole would be walked through the levels and bands, in sequence. The threads of these concerns are intertwined.
The concern about the use case can, I feel, be clarified with a simple question: Why would we bring a data set through Bands C and B if we didn’t already know what we were going to do with it? Or, why would we spend all that time and money without knowing whether there is anything we would do with it? Much time and effort could be expended on a solution that ultimately can’t find a worthwhile problem to solve.
The concern about bringing data sets through the bands as a whole pertains to estimating costs and timelines. Until one actually jumps into the data set and starts to hunt around, it is often very difficult to predict how much time and budget would be required to upgrade the data set from Band B to Band A. For instance, until we know roughly how clean the data set already is and how thorny its bad bits are, it’s hard to predict how long it will take to complete and validate this preparation process. What we have proposed in our blog post on this topic is that projects should go through an initial scoping phase.
The following steps would be included in such a scoping phase: take a reasonable sample of the data; walk that sample through the entire process to upgrade it from C to B and finally to A; and run some analyses to determine whether this process, applied at scale, would indeed be viable to address the problem you’re facing. Once that scoping is done, you’ll have a lot more information about the viability of the project as a whole, as well as a fistful of valuable information with which to assess the likely budgets and timelines needed for the project. At that point, an informed decision can be made as to whether applying the project at scale is worth the investment. Bringing the discussion back to the topic of readiness levels, the lesson I wish to extract here is that in the scoping phase we walk a subset of the data through the various levels and bands, not necessarily the data set as a whole in one shot.
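The scoping steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `scope_project`, the three stage functions and the toy records are all hypothetical, and the share of records surviving each stage stands in for the much richer information a real scoping phase would gather about effort, cost and viability.

```python
import random


def scope_project(records, stages, sample_size, seed=0):
    """Walk a random sample through each stage and report how much survives.

    `stages` is a list of (name, function) pairs. Each function returns the
    processed record, or None when the record cannot be brought up to standard.
    """
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    if not sample:
        raise ValueError("cannot scope a project without any records")
    surviving = sample
    report = []
    for name, stage in stages:
        surviving = [out for out in (stage(r) for r in surviving) if out is not None]
        report.append((name, len(surviving) / len(sample)))
    # Extrapolate the final survival rate to the full data set.
    final_rate = report[-1][1] if report else 1.0
    return report, round(final_rate * len(records))


# Toy data: every fourth record is missing its value (hypothetical).
records = [{"id": i, "value": None if i % 4 == 0 else i} for i in range(1000)]
stages = [
    ("C: accessible", lambda r: r),
    ("B: faithful", lambda r: r if r["value"] is not None else None),
    ("A: fit for the use case", lambda r: r if r["value"] % 2 == 1 else None),
]
report, projected_usable = scope_project(records, stages, sample_size=100)
for name, share in report:
    print(f"{name}: {share:.0%} of the sample survives")
print(f"projected usable records at full scale: {projected_usable}")
```

The point of the design is that the expensive question (is the full project worth it?) gets answered from a cheap sample, before any commitment to process the whole data set.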
These two points (determining the use case early on, and walking a subset of the data through the bands in a scoping phase) are intertwined: until we’ve figured out the use case, we actually cannot define very closely what it means for a data set to be stamped Band C, Band B or Band A. This is because until we’ve determined the problem we need to address, we cannot determine what kinds of privacy or legal concerns are relevant to the project at hand; we also cannot determine what kinds of holes, coverage issues or outliers would be problematic in this given context, amongst other things. In short, I contend that we shouldn’t ask about data readiness tout court; we can only really ask about data readiness for a specified purpose.
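A small sketch may help show why readiness depends on the purpose. Here the same records pass a faithfulness check for one use case and fail it for another. Everything in it is a hypothetical illustration: the function names, the toy records, and the idea of expressing a use case as a tolerated share of missing values per field.

```python
def missing_rate(records, field):
    """Share of records in which `field` is absent or None."""
    if not records:
        raise ValueError("no records to assess")
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)


def ready_for(records, use_case):
    """True when every field the use case needs is complete enough for it.

    `use_case` maps each required field to the share of missing values
    that the analysis can tolerate.
    """
    return all(
        missing_rate(records, field) <= tolerance
        for field, tolerance in use_case.items()
    )


records = [
    {"country": "CA", "year": 2015},
    {"country": "FR", "year": None},
    {"country": "DE", "year": 2016},
    {"country": "JP", "year": 2016},
]
counts_by_country = {"country": 0.0}
trend_over_time = {"country": 0.0, "year": 0.1}

print(ready_for(records, counts_by_country))  # the missing year does not matter here
print(ready_for(records, trend_over_time))    # the missing year matters here
```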
All views expressed are those of the individual author and are not necessarily those of Science-Metrix or 1science.