Run description: Aggregation of crowdsourcing results by first filtering out suspected random votes, then choosing the vote made by the least random worker and ranking by the probability that it is the correct outcome.
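
A minimal Python sketch of this aggregation idea, assuming worker "randomness" is estimated as disagreement with the per-item majority; the threshold, data layout, and function names below are illustrative assumptions, not the submitted code:

    from collections import defaultdict

    def worker_agreement(votes_by_worker):
        # votes_by_worker: {worker: {item: label}}
        # A worker's agreement with the per-item majority vote is used as a
        # proxy for non-randomness; low agreement suggests random voting.
        counts = defaultdict(lambda: defaultdict(int))
        for votes in votes_by_worker.values():
            for item, label in votes.items():
                counts[item][label] += 1
        majority = {item: max(labels, key=labels.get) for item, labels in counts.items()}
        return {worker: sum(label == majority[item] for item, label in votes.items()) / len(votes)
                for worker, votes in votes_by_worker.items() if votes}

    def aggregate(votes_by_worker, random_threshold=0.55):
        agreement = worker_agreement(votes_by_worker)
        # 1) Filter out suspected random voters.
        kept = {w: v for w, v in votes_by_worker.items() if agreement.get(w, 0.0) >= random_threshold}
        # 2) For each item keep the vote of the least random (highest-agreement) worker:
        #    iterating workers in increasing agreement lets later workers overwrite earlier ones.
        results = {}
        for worker in sorted(kept, key=agreement.get):
            for item, label in kept[worker].items():
                results[item] = (label, agreement[worker])
        # 3) Rank items by the probability that the chosen label is the correct outcome.
        return sorted(results.items(), key=lambda kv: kv[1][1], reverse=True)
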
Run description: Aggregation of crowdsourcing results by first filtering out suspected random votes, then using a simple MLE to determine which label is supported by the most evidence from ethical workers, and ranking by the probability that it is the correct outcome.
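
A minimal sketch of a simple MLE-style aggregation, assuming each retained ("ethical") worker has an estimated accuracy and each vote is weighted by its log-odds; the accuracy estimates, default values, and names are illustrative assumptions:

    import math
    from collections import defaultdict

    def mle_aggregate(votes, worker_accuracy):
        # votes: list of (worker, item, label); worker_accuracy: {worker: accuracy in (0, 1)}.
        # Each vote contributes the log-odds of its worker's accuracy to that label.
        scores = defaultdict(lambda: defaultdict(float))
        for worker, item, label in votes:
            p = min(max(worker_accuracy.get(worker, 0.6), 1e-3), 1.0 - 1e-3)
            scores[item][label] += math.log(p / (1.0 - p))
        ranked = []
        for item, label_scores in scores.items():
            best = max(label_scores, key=label_scores.get)
            # Normalise the label weights as a proxy for P(best label is correct).
            z = sum(math.exp(s) for s in label_scores.values())
            ranked.append((item, best, math.exp(label_scores[best]) / z))
        # Rank items by the probability that the chosen label is the correct outcome.
        return sorted(ranked, key=lambda r: r[2], reverse=True)
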
Run description: We used a game to collect relevance judgements between paragraphs of the web pages and the topics. CrowdFlower was only used to direct worker attention to our off-site game.
Run description: Total cost: $133. A total of 6875 image judgments. Total time taken to gather the judgments: ~3 hours. 23 different gold standards were used. 35% of the HITs presented to workers were gold standards.
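
Back-of-the-envelope figures implied by the numbers above (treating "~3 hours" as exactly 3; values are rounded):

    total_cost_usd = 133.0
    judgments = 6875
    hours = 3.0
    print(f"cost per judgment: ${total_cost_usd / judgments:.3f}")  # ~$0.019
    print(f"judgments per hour: {judgments / hours:.0f}")           # ~2292
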
Run description: Mechanical Turk, with 5 documents per HIT. One quality-control question per document, asking workers to choose which of 2 sets of keywords best describes the document, in addition to enforcing a minimum time spent per document. Relevance was asked on a 3-point graded scale.
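
A minimal sketch of the per-document quality controls described in this run, assuming a hypothetical answer record per document; the minimum-time value and field names are illustrative assumptions:

    MIN_SECONDS_PER_DOC = 10  # assumed value; the actual minimum time is not stated

    def passes_quality_control(answer, correct_keyword_set):
        # answer: {'chosen_keywords': ..., 'seconds_spent': ...} for one document.
        # Accept the judgment only if the worker picked the keyword set that
        # actually describes the document and spent at least the minimum time on it.
        return (answer["chosen_keywords"] == correct_keyword_set
                and answer["seconds_spent"] >= MIN_SECONDS_PER_DOC)
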
Run description: Mechanical Turk, with 5 documents per HIT and topic terms highlighted inside the documents. A permissive quality-control question per document, asking workers to choose which of 2 sets of keywords best describes the document, in addition to enforcing a minimum time spent per document. Relevance was asked with an unbiased slider from bad to good document, providing a direct ranking from which binary labels can be computed.
Run description: Mechanical Turk, with 5 documents per HIT. A restrictive quality-control question per document, asking workers to choose which of 2 sets of keywords best describes the document, in addition to enforcing a minimum time spent per document. Relevance was asked with a biased slider from bad to good document, providing a direct ranking from which binary labels can be computed.
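
A minimal sketch of how the slider judgments in the two runs above could yield both a direct ranking and binary labels; the 0-100 scale and the cut-off value are assumptions, not taken from the run descriptions:

    def slider_to_outputs(slider_values, cutoff=50):
        # slider_values: {doc_id: position on the bad-to-good slider, assumed 0-100}.
        # The slider positions give a direct ranking; thresholding them gives binary labels.
        ranking = sorted(slider_values, key=slider_values.get, reverse=True)
        binary = {doc: int(value >= cutoff) for doc, value in slider_values.items()}
        return ranking, binary
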
Run description: GetAnotherLabel software, splitting the topic set into hard and easy topics (according to WordNet) and using only the best workers per topic subset.
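
A minimal sketch of the "best workers per topic subset" step, assuming worker-quality estimates come from GetAnotherLabel's output; the selection and voting code below is illustrative and is not the GetAnotherLabel API:

    from collections import Counter

    def label_with_best_workers(votes, worker_quality, top_k=3):
        # votes: {item: {worker: label}} for one topic subset (hard or easy);
        # worker_quality: {worker: estimated quality in [0, 1]}, e.g. from GetAnotherLabel.
        best = set(sorted(worker_quality, key=worker_quality.get, reverse=True)[:top_k])
        labels = {}
        for item, item_votes in votes.items():
            kept = [label for worker, label in item_votes.items() if worker in best]
            if kept:
                # Majority vote among the selected best workers only.
                labels[item] = Counter(kept).most_common(1)[0][0]
        return labels
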
Run description: All labels were collected by a single human individual with a home-grown relevance-assessment platform designed for the TREC Crowdsourcing track. Because of bugs and topic misunderstandings, some topics were judged more than once, and the last collected judgments were submitted. (This description is for run UWatCS1Human.)
Run description: First, we designed a job on CrowdFlower as the qualification test. Workers were notified by email if the quality of their assignments on CrowdFlower met our requirement. We ran the HITs on Amazon Mechanical Turk, using an external webpage to load the HITs. Workers were asked to give a binary label and a ranking over a set. Each HIT contained 6 sets (30 documents). We also took several quality-control measures. We used additional gold sets to compute two kinds of scores for each assignment; workers were notified automatically if they reached the threshold, and otherwise were prevented from submitting the result. Workers could also review the reference answers for the gold set as instruction for further HITs. Assignments were automatically approved or rejected by our system. We collected 12000 labels in about 10 days, and workers received $0.42 for every approved assignment.
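
A minimal sketch of the automatic approval step described above; the score definition (accuracy on gold documents) and the pass threshold are assumptions, and only the general mechanism of gold-set scoring with a threshold comes from the run description:

    def gold_score(answers, gold):
        # answers, gold: {doc_id: binary label}; the score is accuracy on the gold
        # documents that appear in this assignment.
        scored = [doc for doc in gold if doc in answers]
        if not scored:
            return 0.0
        return sum(answers[doc] == gold[doc] for doc in scored) / len(scored)

    def review_assignment(answers, gold, threshold=0.8):
        # Approve (and pay $0.42) if the worker meets the threshold on the gold set;
        # otherwise the assignment is rejected.
        return "approve" if gold_score(answers, gold) >= threshold else "reject"
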