News & Insights

Crowdsourcing 102: Q&A

We received such a positive response to our February Crowdsourcing 102 post that we decided to follow it up with some of the questions we fielded about it, all answered by our resident managed crowdsourcing expert, Kevin Dodds.

Q. Have you considered using known answers (KAs) throughout production in addition to using KAs as a qualification method? If so/not, why?

A. BOTH! Absolutely both. Qualification tests (QTs) are excellent when two conditions exist:

  1. Time is not of the essence, and
  2. the scope of the project is not too wide.

QTs let the project manager determine specifically who gets to work on the project. However, there are other ways to gather a solid worker pool quickly, such as requiring workers to have a high number of accepted HITs and a high acceptance rate. Using KAs throughout the project is a must whether you use a qualification test or not. I have seen clear cases of workers passing a qualification test only to perform poorly against known answers later. It is important to keep campaign runs down to manageable sizes so that you can review individual workers’ performance against known answers.
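As a rough illustration of reviewing individual workers against known answers, here is a minimal Python sketch. The data layout (tuples of worker ID, item ID, and answer), the function names, and the 80% accuracy threshold are all assumptions made for this example; they do not come from any particular crowdsourcing platform.

```python
from collections import defaultdict


def accuracy_by_worker(submissions, known_answers):
    """Return {worker_id: (correct, scored, accuracy)} for known-answer items only."""
    tally = defaultdict(lambda: [0, 0])  # worker_id -> [correct, scored]
    for worker_id, item_id, answer in submissions:
        if item_id not in known_answers:
            continue  # regular production item, nothing to score against
        tally[worker_id][1] += 1
        # Compare case-insensitively and ignore surrounding whitespace.
        if answer.strip().lower() == known_answers[item_id].strip().lower():
            tally[worker_id][0] += 1
    return {w: (c, n, c / n) for w, (c, n) in tally.items()}


def flag_workers(report, min_accuracy=0.8, min_scored=3):
    """List workers with enough scored items whose accuracy is below the threshold."""
    return sorted(
        w for w, (_, n, acc) in report.items()
        if n >= min_scored and acc < min_accuracy
    )


# Hypothetical sample data: three known answers seeded into a small run.
known_answers = {"item-1": "yes", "item-2": "no", "item-3": "yes"}
submissions = [
    ("worker-a", "item-1", "yes"),
    ("worker-a", "item-2", "no"),
    ("worker-a", "item-3", "yes"),
    ("worker-a", "item-9", "no"),   # not a known answer, so it is skipped
    ("worker-b", "item-1", "no"),
    ("worker-b", "item-2", "no"),
    ("worker-b", "item-3", "no"),
]

report = accuracy_by_worker(submissions, known_answers)
print(flag_workers(report))
```

Keeping runs small, as suggested above, keeps a report like this short enough to read worker by worker.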

Q. Could you elaborate a little bit more on the types of data collection use cases that are an ideal fit for using consensus and semi-private crowds and why? Do you find that this strategy also works well for questions where there is a binary result?

A. Projects that are an ideal fit for consensus have binary answers, meaning there is one and only one correct answer that is not subjective in any way. Preferably, all the answers are also gathered from a single location (for example, by parsing a publicly available government website). Semi-private crowds are filled with workers of a project manager’s own choosing and are preferred when the data is not exactly the same every time. The more binary an answer is, the more it lends itself to consensus, that is, to gathering the data by finding agreement through looping.
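One way to picture consensus through looping is the sketch below. It assumes that "looping" means handing the same item to one more worker at a time until enough identical answers come back. The `ask_worker` callback, the agreement count of two, and the cap of five assignments are hypothetical choices for this example.

```python
from collections import Counter


def gather_by_consensus(ask_worker, required_agreement=2, max_assignments=5):
    """Loop an item through workers until enough of them give the same answer.

    ask_worker(n) is a hypothetical callback returning the n-th worker's answer.
    Returns (answer, assignments_used), or (None, max_assignments) if the
    workers never reach the required agreement.
    """
    votes = Counter()
    for attempt in range(max_assignments):
        answer = ask_worker(attempt)
        votes[answer] += 1
        if votes[answer] >= required_agreement:
            return answer, attempt + 1
    return None, max_assignments


# Hypothetical answers from three successive workers for one item.
answers = ["10203", "10230", "10203"]
result = gather_by_consensus(lambda n: answers[n])
print(result)
```

When every worker gives a different answer, the function returns `None`, which is a hint that the item may not be binary enough for consensus.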

Q. How do you manage data collection HITs when the data is difficult to find/verify? Do you do anything differently? (For instance, with email collection tasks, sometimes an individual’s email can be hard to find online.)

A. I would say that projects typically cover the full spectrum of difficulty, but, clearly, some data points are far more challenging to find. I have specifically run several crowd projects that sought email addresses exactly as you mentioned. There are plenty of challenges there, including finding a company on the web, navigating their online corporate directory successfully, and copying and pasting accurately.

Also, some sites, in order to reduce spam, require a person to “assemble” an email address themselves (something like “Take the first letter of the first name, the full last name, and add @company.com”). With any hard-to-find data, good project managers spend hours, days, even a week researching and performing the project objectives themselves. This way, they know exactly what to expect, can alert workers to potential challenges by capturing screenshots of examples, and can build up a valuable set of known answers by which to gauge worker performance.

Q. If you’re seeing higher disagreement than you would expect across HITs, what might this mean? How do you respond?

A. Something is fundamentally wrong. Either the instructions are not clear, not enough research and planning was done up front, or something is technologically wrong with the template. This requires immediate action and suspension of the campaign, and may warrant taking a completely different approach. Often, a lot of disagreement just means the data is not as binary as you hoped it would be and simply does not lend itself to consensus data gathering. In those cases, you cut out the looping, raise the pay rate, and seek out either masters or other highly qualified people and monitor their performance against known answers.
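To make "higher disagreement than you would expect" concrete, the sketch below measures the share of items on which workers did not all give the same answer. The data layout and the 30% cutoff are assumptions for illustration; a sensible threshold depends on the project.

```python
def disagreement_rate(answers_by_item):
    """Share of items whose collected answers are not unanimous."""
    if not answers_by_item:
        return 0.0
    contested = sum(
        1 for answers in answers_by_item.values() if len(set(answers)) > 1
    )
    return contested / len(answers_by_item)


def should_suspend(answers_by_item, max_rate=0.3):
    """True when disagreement exceeds the (hypothetical) acceptable rate."""
    return disagreement_rate(answers_by_item) > max_rate


# Hypothetical results: each item was answered by two workers.
answers_by_item = {
    "item-1": ["yes", "yes"],
    "item-2": ["yes", "no"],
    "item-3": ["no", "no"],
    "item-4": ["no", "yes"],
}

rate = disagreement_rate(answers_by_item)
print(rate, should_suspend(answers_by_item))
```

A check like this only signals that something is off; finding out whether the cause is the instructions, the planning, or the template still takes a manual review.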

Q. Are there any “out of the box” data validators available that you recommend?

A. There are two types of data validation. For some projects, it’s just ensuring that data is gathered in a consistent format. For others, it’s necessary to use technology to verify data.

For the first, data often needs to be massaged to ensure consistent formatting. This can mean something as simple as ensuring that phone numbers are delivered in one consistent format, such as (XXX) XXX-XXXX. Your phone number data field may come back with phone numbers such as 212-555-1111, 2125551111, 1-212-555-2111, or (212) 555-2111. In these cases, Word or Excel can often convert these just fine. Some cases may warrant using an off-the-shelf database product such as FileMaker Pro. For the second type of data validation, technology-driven validation tools are typically proprietary and can include checking websites for status codes and pinging phone numbers.
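For the first type of validation, a few lines of script can do the same job as a spreadsheet. The sketch below is one possible approach, assuming ten-digit US numbers with an optional leading 1. The function name is hypothetical, and inputs that do not fit the pattern are returned as `None` for manual review rather than guessed at.

```python
import re


def normalize_us_phone(raw):
    """Return the number as (XXX) XXX-XXXX, or None if it is not 10 digits."""
    digits = re.sub(r"\D", "", raw)  # drop everything that is not a digit
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # strip the US country code
    if len(digits) != 10:
        return None  # leave unusual values for a person to check
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


# The sample values from the paragraph above.
samples = ["212-555-1111", "2125551111", "1-212-555-2111", "(212) 555-2111"]
normalized = [normalize_us_phone(s) for s in samples]
for original, clean in zip(samples, normalized):
    print(f"{original} -> {clean}")
```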
