We manage a lot of data at IEI and, more often than not, I find that the key to cleaning up and improving the data we get lies in where the data came from in the first place – in other words, its “provenance.”
Data “sources” roughly fall into these groups:
- Proprietary structured databases
- Internal customer or prospect data
  - CRM data
  - Circulation files
  - One-off customer purchases
- Public data
  - Government filings
  - Public transaction data (shipping manifests, bills of lading)
  - User-generated content (reviews, rankings)
  - News (press releases, news articles, blog posts)
  - Web information (addresses, bios, products)
A typical data project involves deduping, normalizing, appending missing information, and direct verification, using a combination of in-house researchers, trained crowdsourced workers, and software tools. The choice of tools depends not only on the desired end result of the project (a publishable database, a list clean enough to use for marketing) but also on where the data came from.
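As a rough illustration of the normalizing and deduping steps, here is a minimal Python sketch. The field names (`name`, `postal_code`) and the matching rule (same normalized name plus same postal code) are assumptions made for the example, not a description of any particular toolset; real projects usually need fuzzier matching and human review.

```python
import re

def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so near-identical names compare equal."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each (normalized name, postal code) key.

    The key fields are hypothetical; pick whatever identifies a record in your data.
    """
    seen = set()
    unique = []
    for rec in records:
        key = (normalize_name(rec.get("name", "")), rec.get("postal_code", "").strip())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

records = [
    {"name": "Acme, Inc.", "postal_code": "10001"},
    {"name": "ACME  Inc", "postal_code": "10001 "},   # same company, messy formatting
    {"name": "Acme Inc", "postal_code": "94105"},     # same name, different location
]
unique = dedupe(records)
```

Here the first two records collapse into one, while the third survives because its postal code differs.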
Typical red flags in data’s provenance:
- Was the data harvested? (old, miscategorized, unverified)
- Was the data entered by hand? (misspellings, transposed fields, missing required fields)
- If internal data, when was it last used? (old)
- Were multiple sources combined? (mixed formatting conventions, truncation)
Based on these indicators, we define a process to address each of the issues (re-categorizing, fixing spelling errors, targeting key missing fields) in a logical sequence.
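One way to picture this is as a simple mapping from provenance red flags to an ordered list of cleanup steps. The sketch below is illustrative only: the flag names, the step labels, and the one-year staleness threshold are all assumptions chosen for the example.

```python
from datetime import date

def plan_cleanup(source: dict, today: date) -> list[str]:
    """Turn provenance red flags into an ordered list of cleanup steps.

    All keys and step names are hypothetical; the 365-day cutoff is an assumed default.
    """
    steps = []
    if source.get("harvested"):
        steps += ["re-categorize", "verify against primary source"]
    if source.get("hand_entered"):
        steps += ["fix spelling", "check transposed fields", "fill required fields"]
    last_used = source.get("last_used")
    if last_used is not None and (today - last_used).days > 365:
        steps.append("refresh stale records")
    if source.get("combined_sources"):
        steps += ["unify formatting", "check for truncation"]
    return steps

plan = plan_cleanup(
    {"hand_entered": True, "last_used": date(2020, 1, 1)},
    today=date(2022, 1, 1),
)
```

For a hand-entered internal list that has not been used in two years, the plan contains the three hand-entry fixes followed by a refresh of stale records.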
The one indispensable step in all data projects is direct verification via primary sources. These sources can include recent government filings, official websites, or direct communication with a person at the company in question. Without this step at the end of the process, there is a significant risk of introducing old or incorrect data into the deliverable. This final verification also adds value as a citation for the data’s accuracy, much like a “sell by” date. Increasingly, data customers expect this piece of metadata as a “certificate of authenticity,” and for good reason: their own customers, whether paying subscribers or internal sales teams, need to know where the data came from, too.
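To make the “sell by” idea concrete, verification can be stored as metadata alongside each record. The following is a minimal sketch; the `Verification` class, its fields, and the 180-day freshness window are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Verification:
    """Provenance metadata attached to a record (hypothetical structure)."""
    source_type: str    # e.g. "government filing", "official website", "direct contact"
    verified_on: date   # the date the primary source was checked

def is_fresh(v: Verification, today: date, max_age_days: int = 180) -> bool:
    """Return True if the verification is within the assumed freshness window."""
    return (today - v.verified_on).days <= max_age_days

v = Verification(source_type="official website", verified_on=date(2024, 1, 1))
fresh_in_march = is_fresh(v, today=date(2024, 3, 1))
fresh_next_year = is_fresh(v, today=date(2025, 1, 1))
```

A record verified on 1 January 2024 still counts as fresh in March of that year but not a year later, which is the signal a downstream user needs before relying on it.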