According to the leaders of business and government, every decision they make is based on rock-solid “data”. That may reassure people, but those of us in the data trenches know that while “data-driven decision-making” is always the goal, the devil is in the nature of the data. And if “data is the new oil”, then a lot of folks are putting diesel fuel into their electric Porsches and expecting positive results.
The nature of any mission-critical dataset needs to be _fully_ understood if real-world decisions with serious human and financial impacts are to be based on it. Here are the types of metadata such datasets need:
Field-Level Data Provenance
- The source of the field of data (URL, phone survey, news article, public filing, etc.)
- The date and time the data in that field was retrieved.
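One way to picture field-level provenance is to store the source and retrieval time next to every value instead of once per record. The sketch below is only an illustration: the `FieldValue` class, its attribute names, and the sample record (including the `filing-0001` identifier) are hypothetical, not a description of any particular product's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class FieldValue:
    """One field value plus where and when it was obtained (hypothetical schema)."""
    value: str
    source_type: Optional[str]         # e.g. "url", "phone_survey", "news_article", "public_filing"
    source_ref: Optional[str]          # made-up reference: a URL, call ID, or filing number
    retrieved_at: Optional[datetime]   # when this specific field was retrieved


def fields_missing_provenance(record: dict) -> list:
    """Return the names of fields that lack a source or a retrieval timestamp."""
    return sorted(
        name
        for name, field in record.items()
        if not field.source_type or not field.source_ref or field.retrieved_at is None
    )


# Illustrative record: one fully sourced field, one with gaps.
record = {
    "company_name": FieldValue(
        value="Example Corp",
        source_type="public_filing",
        source_ref="filing-0001",
        retrieved_at=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    ),
    "employee_count": FieldValue(
        value="250",
        source_type="phone_survey",
        source_ref=None,
        retrieved_at=None,
    ),
}

print(fields_missing_provenance(record))
```

Keeping provenance at the field level means a single record can mix sources, and a later audit can tell which values have no traceable origin.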
Minimal Viability of Core Data
- Are fields of data that are critical to the performance of your software services often missing?
- Are geographic areas, job functions, product types, and/or other significant parts of the ‘universe’ in your product’s scope conspicuously absent or sparsely populated?
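Both questions above can be turned into simple numbers. As a minimal sketch, assuming records are plain dictionaries and that you already know which fields are critical and which segments you expect to cover, the following computes a missing rate per critical field and lists under-populated segments. The field names, regions, and minimum count are made up for illustration.

```python
from collections import Counter

# Hypothetical sample data; None or "" counts as missing.
records = [
    {"email": "a@example.com", "job_function": "sales", "region": "US"},
    {"email": None, "job_function": "engineering", "region": "US"},
    {"email": "c@example.com", "job_function": "", "region": "EU"},
    {"email": None, "job_function": "sales", "region": "US"},
]


def missing_rates(rows, fields):
    """Fraction of rows in which each critical field is absent or empty."""
    total = len(rows)
    if total == 0:
        return {f: 0.0 for f in fields}
    return {f: sum(1 for r in rows if not r.get(f)) / total for f in fields}


def sparse_segments(rows, key, expected, min_count):
    """Expected segments (e.g. regions) with fewer than min_count rows."""
    counts = Counter(r.get(key) for r in rows)
    return sorted(seg for seg in expected if counts[seg] < min_count)


print(missing_rates(records, ["email", "job_function"]))
print(sparse_segments(records, "region", ["US", "EU", "APAC"], min_count=2))
```

What counts as "too sparse" depends on the product's scope, so the threshold here is a placeholder and not a recommendation.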
Defensible/Legal Processing Methodologies
- Is user-supplied data included without independent verification?
- Did AI make the decision about a record or a specific field of data (categorizations, sentiment analyses, assumptions)?
- Do fields have ‘confidence values’ appended to indicate the need for heightened QA scrutiny and/or to give end-users a sense of how much they can rely on the given value?
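To make the processing questions concrete, here is a rough sketch of how a field might carry its decision-maker, verification status, and confidence value, and how those could route it to QA. The `ScoredField` class, the `decided_by` labels, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

QA_THRESHOLD = 0.8  # assumed cut-off; tune it to your own tolerance for error


@dataclass
class ScoredField:
    """A field value with processing metadata (hypothetical schema)."""
    value: str
    decided_by: str      # "human", "ai", or "user_supplied"
    verified: bool       # has the value been independently verified?
    confidence: float    # 0.0 (no trust) to 1.0 (full trust)


def needs_review(field, threshold=QA_THRESHOLD):
    """Flag unverified user-supplied values and anything below the threshold."""
    if field.decided_by == "user_supplied" and not field.verified:
        return True
    return field.confidence < threshold


industry = ScoredField("Software", decided_by="ai", verified=False, confidence=0.62)
revenue = ScoredField("12M", decided_by="human", verified=True, confidence=0.95)
title = ScoredField("CEO", decided_by="user_supplied", verified=False, confidence=0.90)

for name, field in [("industry", industry), ("revenue", revenue), ("title", title)]:
    print(name, needs_review(field))
```

The same confidence value that triggers QA can also be shown to end users, so they can judge how much weight to put on a given field.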
Accuracy
- Do you know exactly how old each record in your dataset is? Each field of each record?
- What are your quality assurance processes? When did you last audit them?
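If each field stores its own retrieval time, the age question becomes a subtraction. The sketch below assumes a mapping from field name to a timezone-aware retrieval timestamp and a 90-day freshness limit. The field names, dates, and limit are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-field retrieval timestamps for one record.
retrieved_at = {
    "phone": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "email": datetime(2024, 5, 20, tzinfo=timezone.utc),
}


def field_ages_in_days(timestamps, now):
    """Whole days elapsed since each field was retrieved."""
    return {name: (now - ts).days for name, ts in timestamps.items()}


def stale_fields(timestamps, now, max_age):
    """Names of fields older than max_age."""
    return sorted(name for name, ts in timestamps.items() if now - ts > max_age)


# A fixed "now" keeps the example reproducible; in practice use datetime.now(timezone.utc).
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(field_ages_in_days(retrieved_at, now))
print(stale_fields(retrieved_at, now, max_age=timedelta(days=90)))
```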
At Information Evolution, our R&D department is open and transparent at every stage of building the data supply chain, so you can be confident about the decisions you make, and your customers can be confident about the decisions they make based on your data. That’s the difference between real data science and paying lip service to its value without doing your homework.