News & Insights

De-duplication and the Crowd

Recently, CrowdFlower blogged about de-duplicating records in a merged CRM database using crowdsourcing. The article suggested that this kind of data is often too wildly divergent to be de-duped automatically.

At IEI, we’ve found:

Most CRM data is similar (first name, last name, email, phone, etc.) and easy to standardize automatically.
Exact matches are simple, but are only the first step in potentially dozens of other matching scenarios that resolve “fuzzy” matches automatically.
The crowd is best used for de-duplication after all automated de-duping processes have been exhausted.

Identical matches can be marked as resolved right away, with no further work required. Then, the dataset needs to be standardized and normalized, again through automation, which reveals even more identical matches which can be automatically resolved.

A few examples of standardization and normalization results:

Sending non-exact matches to the crowd without performing thorough fuzzy matching comes at a high cost and slows down the entire operation. Compare the automated process to the crowdsourcing process:

The task looping required to get accurate de-duping via crowdsourcing can quickly add up to three or more times what an automated process would cost. Limiting crowdsourcing de-duping efforts saves money that can instead go toward crowd research to provide up-to-date and reliable datapoints, the goal of an effective CRM database.

Keep on top of the information industry
with our ‘Data Content Best Practices’ newsletter:

Keep on top of the information industry with our ‘Data Content Best Practices’ newsletter:

Join our team!

News & Insights

News & Insights

De-duplication and the Crowd