News & Insights


Is Big Data Good Data?

Taking information at face value is a risky proposition, as Gary Hoover (formerly of Hoover’s and four other companies launched over the course of his career to date) pointed out last week in a Fundamentals of Business Research presentation, part of the University of Texas’s Information Institute’s “Boiling The Ocean: 21st Century Business Research Tactics And Sources” workshop. Hoover used D&B as his example, explaining that in the past, the company’s researchers visited listees and gathered information by interviewing business owners. The system was flawed, but it had one advantage: interviewers could temper what the owner told them with what they observed during the interview. A million-dollar company in a six-person back-alley office? Not likely. Today, without the check of an in-person interview, the owner-provided information goes unchallenged, whether it is inflated or deflated.

For individual researchers, the calculus of extracting a good answer from this kind of data is never simple. As Hoover pointed out, “The reality is it’s a human endeavor.” Serious researchers develop estimates in multiple ways, discarding outliers and using the numbers that come close to each other. As an example, Hoover explained how he calculated average revenue for a set of restaurants by estimating median revenue per seat and per square foot, and then backing those numbers into average sales.
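The cross-check Hoover describes can be sketched in a few lines. This is only an illustration under stated assumptions: the seat count, square footage, per-seat and per-square-foot figures, and the 20% agreement threshold are all hypothetical placeholders, not numbers from his presentation.

```python
def triangulate(estimates, max_spread=0.2):
    """Average several independent estimates and report whether they agree.

    estimates: dict mapping a method name to its revenue estimate.
    max_spread: hypothetical threshold; the estimates "agree" when the gap
    between the highest and lowest is at most this fraction of the lowest.
    """
    values = list(estimates.values())
    low, high = min(values), max(values)
    agree = (high - low) <= max_spread * low
    return sum(values) / len(values), agree


# Hypothetical restaurant: 80 seats, 2,400 square feet.
# The per-seat and per-square-foot medians below are made-up placeholders.
seats, square_feet = 80, 2400
estimates = {
    "per_seat": seats * 10_000,        # assumed median revenue per seat
    "per_sqft": square_feet * 350,     # assumed median revenue per sq. ft.
}
average_sales, estimates_agree = triangulate(estimates)
```

When the two routes land close together, the average is a defensible figure; when they do not, that is the signal to go back and question the inputs.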

In 2013, no database publisher has the budget to send teams of human interviewers out to gather information firsthand, but it’s increasingly easy to build double-checks into the information-gathering process. In a managed crowdsourced data collection campaign, for example, multiple answers are collected for each question and outliers are discarded. For Internet researchers, electronic checks can ensure that all data in a given field is formatted the same way. Call center calls can be monitored to be sure that the same data points are collected across the board.
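As a rough sketch of the crowdsourcing double-check, the snippet below collects several answers to one numeric question, drops those far from the median, and takes the median of what remains. The function name and the 25% cutoff are assumptions made for illustration; a real campaign would tune the rule to the question being asked.

```python
from statistics import median


def consensus(answers, max_rel_dev=0.25):
    """Return (consensus value, answers kept) for one numeric question.

    An answer is treated as an outlier when it differs from the median of
    all answers by more than max_rel_dev (a hypothetical cutoff) times
    that median. Raises statistics.StatisticsError if answers is empty.
    """
    mid = median(answers)
    kept = [a for a in answers if abs(a - mid) <= max_rel_dev * abs(mid)]
    return median(kept), kept


# Five hypothetical contributors report a company's headcount.
value, kept = consensus([12, 11, 13, 12, 120])
```

Here the stray answer of 120 is discarded and the remaining four answers settle on a headcount of 12.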

One QC method we at IEI use for our customers is a “checksum” that compares a company’s revenue to its number of full-time employees. For instance, if the SIC code indicates that the company is a restaurant, the annual revenue figure should come to about $85K times the number of FTEs; if it doesn’t, the underlying data (SIC code, headcount) needs to be directly reconfirmed. The ratio of revenue to headcount differs from industry to industry and from company to company, with Amazon at $750K/head and Wal-Mart at $125K/head. It’s an easy-to-set-up confirmation with human follow-up that helps ensure the quality of our clients’ data.
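A minimal sketch of such a revenue-per-head check follows. It is not IEI’s actual implementation: the lookup table, the function name, and the 50% tolerance are assumptions, and only the $85K restaurant figure comes from the text above. SIC code 5812 (“Eating Places”) stands in for “restaurant.”

```python
# Hypothetical benchmark table: expected annual revenue per FTE by SIC code.
# Only the restaurant figure ($85K) comes from the article.
REVENUE_PER_FTE = {
    "5812": 85_000,  # SIC 5812: Eating Places
}


def needs_reconfirmation(sic_code, fte_count, annual_revenue, tolerance=0.5):
    """Flag a record for human follow-up.

    Returns True when reported revenue differs from benchmark * headcount
    by more than `tolerance` (an assumed 50% band) of the expected value,
    or when there is no benchmark or no usable headcount to check against.
    """
    benchmark = REVENUE_PER_FTE.get(sic_code)
    if benchmark is None or fte_count <= 0:
        return True
    expected = benchmark * fte_count
    return abs(annual_revenue - expected) > tolerance * expected


# A ten-person restaurant reporting $900K passes the check.
ok_record = needs_reconfirmation("5812", 10, 900_000)
# A six-person restaurant reporting $1M gets flagged for a call back.
suspect_record = needs_reconfirmation("5812", 6, 1_000_000)
```

The check only decides which records go to a person; the reconfirmation itself is still done by a human.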

Keep on top of the information industry with our ‘Data Content Best Practices’ newsletter: