

Quality assessment in crowdsourced classification tasks

Purpose – Ensuring quality is one of the most significant challenges in microtask crowdsourcing. Aggregation of the data collected from the crowd is an important step in inferring the correct answer, but existing studies are largely limited to single-step tasks. This study looks at multiple-step classification tasks and aims to understand aggregation in such cases; hence, it is useful for assessing classification quality.

Design/methodology/approach – The authors present a model to capture the information of the workflow, questions and answers for both single- and multiple-question classification tasks. They propose an adapted approach on top of the classic approach so that the model can handle tasks with several multiple-choice questions in general, instead of a specific domain or any specific hierarchical classification. They evaluate their approach with three representative tasks from existing citizen science projects for which they have a gold standard created by experts.

Findings – The results show that the approach can provide significant improvements to the overall classification accuracy. The authors' analysis also demonstrates that all algorithms achieve higher accuracy on the volunteer-generated data sets than on the paid-generated data sets for the same task. Furthermore, the authors observed interesting patterns in the relationship between the performance of different algorithms and workflow-specific factors, including the number of steps and the number of available options in each step.

Originality/value – Due to the nature of crowdsourcing, aggregating the collected data is an important process for understanding the quality of crowdsourcing results. Different inference algorithms have been studied for simple microtasks consisting of single questions with two or more answers. However, as classification tasks typically contain many questions, the proposed method can be applied to a wide range of tasks including both single- and multiple-question classification tasks.

Keywords: Aggregation, Classification, Task-oriented crowdsourcing, Quality assessment, Human computation
Paper type: Research paper

© Qiong Bu, Elena Simperl, Adriane Chapman and Eddy Maddalena. Published in International Journal of Crowd Science (Vol. 3 No. 3, pp. 222-248, DOI 10.1108/IJCS-06-2019-0017) by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode

1. Introduction

Microtask crowdsourcing has attracted interest from researchers, businesses and government as a means to leverage human computation into their activities in a fast, accurate and affordable way. In the last ten years, we have seen it applied to anything from spotting sarcasm on social media to discovering new galaxies and helping digitise large cultural heritage collections. The underlying model is relatively straightforward: a problem is decomposed into smaller chunks that can be tackled independently by several people. Their individual outputs are then compared and consolidated into a final solution (Shahaf and Horvitz, 2010).
However, none of these steps is actually easy: some problems are less amenable to microtasking and need to be turned into bespoke microtask workflows (Bernstein et al., 2010; Kulkarni et al., 2011; Kittur et al., 2011); the performance of the crowd varies across tasks (Mao et al., 2013; Redi and Povoa, 2014); and determining which answers are the most useful ones can be both complex and computationally expensive (Kittur et al., 2008; Snow et al., 2008; Vickrey et al., 2008; Demartini et al., 2012; Wiggins et al., 2011). It is on this last aspect, determining the correct answers, that we focus in this paper. The aggregation method proposed in this paper is able to infer the correct answer for a range of tasks involving either single-step or multiple-step classifications when gold answers are not available. It also serves as a proxy to help task requesters assess the quality of the crowdsourced results when they already have some gold answers, for example when piloting a specific multiple-step task design before putting it online at a larger scale.

Quality assessment in microtask crowdsourcing refers to the evaluation of the quality of the workers' work. First, quality can be assessed based on different criteria, as it has many dimensions (Kahn et al., 2002; Batini et al., 2009). In the crowdsourcing context, it depends on the type of the data, which is determined by the task type (Malone et al., 2010; Gadiraju et al., 2014, 2015). The most common quality metric we have seen is accuracy (Bernstein et al., 2010; Gelas et al., 2011; Hung et al., 2013; Zhang et al., 2017a, 2017b), calculated against available gold standards. However, in many cases a gold standard is not available. This is where different inference algorithms come into the picture, which help to infer or predict the correct (gold) answer. Second, quality assessment can be done either on the fly (Ipeirotis et al., 2014) while the task is running, which can be used to optimise task assignment and hence reduce cost, or in post-hoc aggregation (Whitehill et al., 2009; Ipeirotis et al., 2010; Bachrach et al., 2012; Difallah et al., 2015a) to assess the overall quality of the classification. This work focuses on aggregating the results after the crowdsourcing task has been completed, so that accuracy can be calculated based on the gold standards we have.

There are many different types of tasks to which microtask crowdsourcing is applied (Eickhoff and de Vries, 2011; Difallah et al., 2015b; Yang et al., 2016; Zheng et al., 2017a). We focus on inferring the correct answer for a classification task, which is one of the most popular types of crowdsourcing tasks. We are by no means the first to do so; previous research has proposed a range of methods to infer and predict the quality of crowd answers (Bachrach et al., 2012; Dawid and Skene, 1979; Difallah et al., 2015a; Hare et al., 2013; Ipeirotis et al., 2010; Karger et al., 2011; Loni et al., 2014; Paulheim and Bizer, 2014; Hung et al., 2013; Rosenthal and Dey, 2010; Simpson et al., 2013; Whitehill et al., 2009). Whilst all these methods have their benefits, they work on relatively simple task models that consist of single questions with one or more answers (Sheshadri and Lease, 2013; Hung et al., 2013; Zhang et al., 2017a; Zheng et al., 2017b). The scenario we are targeting is different. We take a close look at existing classification tasks from Zooniverse and notice that a large percentage of these tasks are multiple-step tasks, as shown in Figure 1.
In fact, in a random sample of 20 tasks, only 20 per cent have a single question. Consider the example in Figure 2, which is taken from a labelled citizen science project in which pictures taken in the Serengeti national park in Tanzania are analysed online by thousands of volunteers[1]. The crowd is asked to answer a series of related, independent questions about what they see in the image, including the types and number of animals.

Figure 1. Classification tasks from Zooniverse
Figure 2. Example classification paths collected from 20 workers for a given photo

Our work is motivated by a range of online crowd science classification projects. Each of them uses a slightly different type of task to classify an object, for example an image, according to a number of criteria. A relatively complex task is split into several steps, typically in the form of multiple-choice answers. Sometimes there are dependencies between steps, as the answer chosen for one question prompts other questions to be displayed. For instance, in the Cities at Night project, which uses microtask crowdsourcing to analyse night-time photographs taken by astronauts onboard the ISS[2], seven different options are provided for the first question to identify what the given image contains (a city, stars, aurora, astronaut, black image, no photo or none of these), and only when "city" is identified, two more independent questions will be asked to classify cloudiness (three options: cloudy, some clouds, clear) and sharpness (two options: sharp, blurry). In the GalaxyZoo[3] project, several different questions are asked in sequence depending on the answers to previous questions, and questions and answers are arranged in a decision tree. It has a more complex workflow in which more questions are involved, and questions vary based on what has been chosen in the previous classification step. For instance, the first question is "Is the galaxy simply smooth and rounded, with no sign of a disk?" and three options are provided: "Smooth", "Features or disk" and "Star or artifact". When choosing "Smooth", a new question will be asked, "How rounded is it?", with available options "Completely round", "In between" and "Cigar shaped". If "Features or disk" is chosen as the answer to the first question, a different set of subsequent questions will be asked. Other times, workflows are rather sequences of independent, though related, questions, such as what we see in Snapshot Serengeti[1] (Figure 2).

Determining the correct answer for such complex classification tasks can be tricky and has not been fully studied yet. Existing research also does not investigate how inference methods could affect the classification accuracy when using different crowd types for complex classification tasks. As a result, there is a need to understand whether different algorithms and aggregation strategies are required for different crowd contexts. To tackle the issue of determining the correct answer from crowd-produced annotations for classification tasks with multiple questions, we model the problem of complex classification tasks that span multiple, related questions as a graph. To the best of our knowledge, we are the first to propose using the structure of a microtask crowdsourcing workflow as an additional feature to support inference algorithms in making decisions about correct labels, using output data produced by the crowd.
We look at three inference algorithms (majority voting [MV] [Paulheim and Bizer, 2014; Hung et al., 2013], message passing [MP] [Karger et al., 2011] and expectation maximisation [EM] [Dawid and Skene, 1979; Whitehill et al., 2009]), which have been commonly used for answer inference in microtask crowdsourcing. We adapt these algorithms to work on the graph modelled from crowdsourcing tasks with multiple steps. We perform a large-scale evaluation of the performance of these algorithms on six data sets across two crowd contexts from three image classification tasks: Darkskies[2], GalaxyZoo[3] and Snapshot Serengeti[1]. The rationale behind choosing data sets from both volunteer and paid crowd contexts is that algorithms may perform differently in these contexts. The experiments show that our aggregation strategy achieves significantly better performance than the current approach of naively applying individual algorithms at each node level. The results also indicate that MV, despite its simplicity, compares well with more sophisticated approaches that consider additional factors such as user performance and hence need more computation time. Sophisticated algorithms such as expectation maximisation, however, can complement MV for relatively complex tasks. We also show that each algorithm obtains better inference accuracy in the volunteer context compared to the paid crowdsourcing context.

The rest of this paper is structured as follows: Section 2 provides the foundations of the existing algorithms which we have adapted to handle answer inference in classification tasks with multiple questions, and illustrates how this aggregation fits into the quality assessment process. In Section 3, we explain our graph model and the notations used in the graph, formalise the classification problem and elaborate our aggregation approach. In Section 4, we perform a large-scale evaluation and demonstrate the performance of the different algorithms. Section 5 discusses our findings. Section 6 reviews existing work which has inspired our research, and Section 7 summarises our results and future work.

2. Foundations

A classification task generally has one single question and a few options to choose from, such as the one shown in Figure 3. It looks like a simple tree structure, where the classification starts with a root node which refers to the object to be classified and has a few branches which represent the available options. In this section, we present three existing algorithms, MV, MP and EM, that have been used in inferring the true label for a single-step multiple-choice classification task. These are the foundations needed to understand our proposed adapted approach. Notations used in elaborating these algorithms are defined in Table I. For the sake of explaining the individual algorithms and our method, we use the following notations throughout this paper.

2.1 Majority voting

Due to its simplicity, MV has been used in many microtask projects (Hung et al., 2015; Liu et al., 2012) and is the standard aggregation method in some existing crowdsourcing platforms[4]. Given the list of options for a labelling task and an object, the MV algorithm chooses the option with the highest number of votes from the crowd. Formally, it takes as input an object o and the crowd labels L_o and outputs the resulting candidate label l̃_o that received the most votes from the users.
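Algorithm 1 below gives the formal procedure. As an illustration only, the following is a minimal Python sketch of the same majority-vote rule; the list-of-labels input format is an assumption made for this sketch, not the implementation used in the paper.

```python
from collections import Counter

def majority_vote(labels_for_object):
    """Return the label with the most votes for a single object.

    labels_for_object: list of labels submitted by the crowd for one object,
    e.g. ["zebra", "zebra", "wildebeest"]. Ties are broken by count order.
    """
    if not labels_for_object:
        return None
    counts = Counter(labels_for_object)
    # most_common(1) returns [(label, count)] for the top-voted label
    return counts.most_common(1)[0][0]

# Example: 20 crowd labels for one Snapshot Serengeti-style photo
print(majority_vote(["zebra"] * 12 + ["wildebeest"] * 8))  # -> "zebra"
```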
Algorithm 1 MV
procedure FindUniqueLabel(L_o)
    L_unique ← {l_o^u}, where L_unique ⊆ A, l_o^u ∈ A and u ∈ U_o
    l̃_o ← ""
    num_max ← 0
    for i ∈ {1, ..., |L_unique|} do
        if count(L_unique(i)) ≥ num_max then
            num_max ← count(L_unique(i))
            l̃_o ← L_unique(i)
    return l̃_o

Figure 3. Representation of a task with a single question

Table I. Notations
Notation   Definition
o          The current object being classified
O          The set of all objects in a data set
A          All available options
u          User u
U          The set of all users who contributed to the current data set
U_o        All users who have classified object o
L          All labels received from the crowd, and L ⊆ A
L_o        The set of all labels from the crowd for object o
L_u        The set of all labels from user u
l_o^u      The label for object o from user u
l̃_o        The inferred label for object o

2.2 Expectation maximisation

EM is another algorithm that has been widely used and involves two steps to infer the true label for a given object. In the first step, the true label for the current object is estimated using simple MV, where the input of all users is considered equally. Then, in the next step, the error rate of each user is estimated based on this result and used in turn to calculate a new estimate for the first step. The steps alternate iteratively until the algorithm converges and a maximum is found. It takes as input an object o and all labels L. It starts by estimating the true label for each object and each user's error rate by comparing their answers (using an indicator function I() to check whether the user classifies an object into a certain category/class) for all objects they have looked at. The error rate is used subsequently to update the confusion matrix for each user. The output is a set of candidate labels for o with the probability (indicated by p) of the corresponding candidate label being correct.

Algorithm 2 EM
procedure Initialise(p_l)
    p_l ← count(l) / |L_o|   ▷ probability of l being the true label for object o (l ∈ A)
    while not converged do
        Estimate the error rate for user u:
            n_{l,l'}^u ← Σ_{o ∈ L_u} p_l · I(l_o^u = l')
        Estimate the confusion matrix:
            e_{l,l'}^u ← n_{l,l'}^u / Σ_q n_{l,q}^u   ▷ q iterates over the classes; e^u captures the accuracy of user u
        Estimate the class priors:
            pr_l ← Σ_o p_l / |O|
        Calculate the class probability for object o:
            p_l ← pr_l · Π_{u ∈ U_o} Π_m (e_{l,m}^u)^{I(l_o^u = m)} / Σ_q pr_q · Π_{u ∈ U_o} Π_m (e_{q,m}^u)^{I(l_o^u = m)}
    l̃_o ← ""
    p_max ← 0
    for l ∈ A do
        if p_l ≥ p_max then
            p_max ← p_l
            l̃_o ← l
    return l̃_o
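To make the EM idea concrete, here is a minimal Python sketch of a simplified variant that estimates a single accuracy value per user rather than the full confusion matrix of Algorithm 2; the data layout (a dict mapping (object, user) pairs to labels) and the fixed iteration count are assumptions made for illustration only.

```python
def em_one_coin(labels, options, n_iter=20):
    """Simplified EM truth inference (one accuracy value per user).

    labels: dict mapping (object_id, user_id) -> chosen label
    options: list of all available labels (the option set A)
    Returns: dict mapping object_id -> inferred label
    """
    objects = {o for o, _ in labels}
    users = {u for _, u in labels}

    # Initialise per-object label probabilities from plain vote shares (the MV step)
    probs = {}
    for o in objects:
        votes = [l for (oo, _), l in labels.items() if oo == o]
        probs[o] = {a: votes.count(a) / len(votes) for a in options}

    accuracy = {u: 0.8 for u in users}  # initial guess of each user's accuracy

    for _ in range(n_iter):
        # M-step: re-estimate each user's accuracy against current label estimates
        for u in users:
            weight, total = 0.0, 0.0
            for (o, uu), l in labels.items():
                if uu == u:
                    weight += probs[o][l]
                    total += 1.0
            accuracy[u] = weight / total if total else 0.8

        # E-step: re-estimate label probabilities, weighting users by accuracy
        for o in objects:
            scores = {a: 1.0 for a in options}
            for (oo, u), l in labels.items():
                if oo == o:
                    for a in options:
                        p_wrong = (1 - accuracy[u]) / max(len(options) - 1, 1)
                        scores[a] *= accuracy[u] if l == a else p_wrong
            norm = sum(scores.values()) or 1.0
            probs[o] = {a: s / norm for a, s in scores.items()}

    return {o: max(probs[o], key=probs[o].get) for o in objects}

# Example usage with toy data
labels = {("img1", "u1"): "city", ("img1", "u2"): "city", ("img1", "u3"): "stars",
          ("img2", "u1"): "stars", ("img2", "u2"): "stars", ("img2", "u3"): "stars"}
print(em_one_coin(labels, ["city", "stars", "aurora"]))  # -> img1: city, img2: stars
```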
2.3 Message passing

MP is an algorithm that takes into account both the labels and the performance of the users. MP constructs object- and user-specific messages to represent the reliability of a particular user, and iteratively updates the object and user messages. More specifically, at each object update it gives more weight to labels that come from more trustworthy parts of the crowd, and at each user update it adds more trust (a confidence value) to a user if the labels they give for other objects are in line with the current estimates of the object labels. The iterative updates continue until the algorithm converges or a specified threshold is hit. The threshold for the stopping condition is a parameter that has to be determined empirically. It takes as input an object o, a label a ∈ A, all labels received from the crowd L and a threshold k_max. MP computes the object message by first iterating over all previous labels from the users who have been assigned the object o and then checking whether each label is the same as the given one. In the next step, it uses the object message x_{o→u} to update the user message y_{u→o}, which is computed by iterating over the labels they have submitted. Until convergence, the object message for object o is aggregated by weighing the user messages (confidence) for that object, and the computed sign is stored in E_{ou}. MP outputs the candidate label l̃_o for o and the sign of whether the label applies or not. A detailed description of the algorithm can be found in Karger et al.'s (2011) study. Whilst providing accurate estimations, MP is also known for its high computational cost as the number of labels and users increases.

Algorithm 3 MP
procedure Initialisation(y_{u→o})
    for (o, u) ∈ L do
        Initialise y_{u→o} with N(1, 1)
procedure Iteration(k_max)
    for k ∈ {1, ..., k_max} do
        for (o, u) ∈ L do
            x_{o→u}^k ← Σ_{u' ∈ U_o, u' ≠ u} E_{ou'} · y_{u'→o}^{k-1}
        for (o, u) ∈ L do
            y_{u→o}^k ← Σ_{o' ∈ O_u, o' ≠ o} E_{o'u} · x_{o'→u}^k
    x_o ← Σ_{u ∈ U_o} E_{ou} · y_{u→o}^{k_max - 1}
    if sign(x_o) == 1 then
        l̃_o ← x_o
    return l̃_o

2.4 Quality assessment

In the microtask crowdsourcing context, achieving a good quality result is one of the major goals, and when we talk about quality, it generally means the quality of the data collected from the crowd. For classification microtasks, existing work in quality assessment mostly uses the accuracy metric (Khattak and Salleb-Aouissi, 2011; Hung et al., 2013; Zhang et al., 2017a). Some research also uses precision/recall (Hung et al., 2015; Zhang et al., 2017) or the F1 score (Zheng et al., 2017a), while other work uses ROC (Zheng et al., 2017b) or RMSE (Bachrach et al., 2012). For classification, the quality of the result refers to how good the overall collected classifications are, which is a data-value-centric dimension reflecting how accurate the classifications are. In this work, unless stated otherwise, quality of the input/answer/data/result means accuracy: "The degree to which data values correctly represent the real-world facts" (Zaveri et al., 2013), defined in science (JCGM, 2008) as "closeness of agreement between a measured quantity value and a true quantity value of a measurand". We can look at an individual crowd worker's work to evaluate whether it is of good quality, or we can look at the overall result from all the workers to see how accurately they classify the given objects. The latter, which involves aggregating the input from different crowd workers in a multiple-step classification task, is the focus of this paper.

In the crowdsourcing context, the ground truth is not usually available. To assess the quality of the result, we need to understand what algorithms or mechanisms can be used to infer or predict the correct answer based on all the input from the crowd workers. Correspondingly, each existing algorithm has been studied by researchers and its performance evaluated in various contexts (Section 6.2). This work mainly looks at the three popular existing algorithms elaborated above and investigates how adaptations of these algorithms can be used to aggregate the crowdsourced data and help assess the quality of the classification result. The whole process, in a nutshell, includes three major phases: data collection (microtask design and task execution) from the crowd, which is available to this study; aggregation to infer the correct answer/label; and evaluation of the quality (in this work, the accuracy metric) by comparing the inferred result to the gold standards we have. This research focuses on the aggregation step and evaluates the accuracy accordingly.
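As a concrete illustration of these three phases, the following sketch wires an aggregation function (here a plain majority-vote stand-in) to a gold-standard comparison; the data structures and names are assumptions made for illustration and do not reflect the projects' actual formats.

```python
def assess_quality(crowd_labels, gold, aggregate):
    """Aggregate crowd labels per object and score them against a gold standard.

    crowd_labels: dict object_id -> list of labels from the crowd (phase 1: collected data)
    gold: dict object_id -> correct label, available only for the gold-standard subset
    aggregate: function mapping a list of labels to one inferred label (phase 2: aggregation)
    Returns the accuracy over the gold-standard objects (phase 3: evaluation).
    """
    inferred = {o: aggregate(labels) for o, labels in crowd_labels.items()}
    scored = [o for o in gold if o in inferred]
    correct = sum(1 for o in scored if inferred[o] == gold[o])
    return correct / len(scored) if scored else 0.0

# Example run with a simple majority-vote aggregator as a stand-in
majority = lambda labels: max(set(labels), key=labels.count)
crowd = {"img1": ["city", "city", "stars"], "img2": ["aurora", "stars", "stars"]}
gold = {"img1": "city", "img2": "stars"}
print(assess_quality(crowd, gold, majority))  # -> 1.0
```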
3. Our approach

In this section, we first illustrate the range of classification tasks we address via a set of examples: classification tasks with a single question and with multiple questions. We then introduce a set of notations and formalise the classification problem as a path-searching problem in a graph. Following that, we present our aggregation method by illustrating how existing established algorithms can be adapted to handle more complex cases.

3.1 Multi-level workflow model and problem formalisation

A classification task, as shown in Figure 3, is generally considered a simple task as it contains only one question. A relatively complex task normally involves more than one question and hence more options. It will be more like a tree with branches which have further branches and leaves. If we draw such a 'tree' for the three tasks we are exploring in this paper, we can see that each of them uses a different type of workflow consisting of several independent/interdependent steps. Each step in the workflow is associated with a question to classify an object according to a criterion. To answer the question the crowd needs to choose among a set of options. The Dark Skies workflow (Figure 4) involves a minimum of one step and a maximum of three steps for the classification task. The Snapshot Serengeti workflow (Figure 5) has a fixed two steps to complete a classification task and each step has more than ten options. The GalaxyZoo[5] task can involve a minimum of one step and a maximum of nine steps to complete a classification, as shown in Figure 6.

Figure 4. Representation of the Dark Skies workflow from Cities at Night
Figure 5. Representation of the Snapshot Serengeti workflow from Zooniverse
Figure 6. Representation of the GalaxyZoo workflow from Zooniverse

It is notable that these different tasks do present a tree-like structure, each with a number of questions and various numbers of available options; however, there are indeed cases where some nodes have more than one parent node, which means the workflow cannot be considered a tree. As a result, the workflow can be modelled as a directed acyclic graph (DAG), where the root node is the object under consideration and all other nodes are classification options. Each node can be reached via multiple paths from the root, which prompts the first question of the workflow[6]. For a given object o, the crowd is asked to carry out a labelling task, which implies answering a series of (independent or dependent) classification questions with a set of labels which identify the outstanding features of the object being classified. We define this task as a path search problem in a workflow W modelled as a directed acyclic graph (DAG) with a root entry point and levels (similar to tree levels, representing the number of questions in the task), each corresponding to a set of options, as depicted in Figure 7. Each node in such a graph represents a particular labelling option. The labelling finishes when a leaf in the graph is reached, that is, a label that does not lead to any further questions. In our definition, a level corresponds to classification question(s), and the level of a node is serialised and counted at the lowest level. We use level interchangeably with the depth of a node, which is indicated by the number of edges from the node to the root node. A directed edge represents a label chosen for the corresponding question related to that node level. Table II has a summary of the definitions we use.
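Before the formal notation, to make the workflow-as-graph view concrete, the following sketch encodes a simplified version of the Dark Skies workflow described above as a dictionary of questions, options and follow-up questions, and enumerates the valid label paths. The structure and names are illustrative assumptions, not the project's actual configuration.

```python
# Each node level is a question; each option may point to follow-up questions.
# A simplified Dark Skies-style workflow: only "city" leads to further questions.
darkskies_workflow = {
    "q1_content": {
        "options": ["city", "stars", "aurora", "astronaut",
                    "black image", "no photo", "none of these"],
        "next": {"city": ["q2_cloudiness", "q3_sharpness"]},  # all other options are leaves
    },
    "q2_cloudiness": {"options": ["cloudy", "some clouds", "clear"], "next": {}},
    "q3_sharpness": {"options": ["sharp", "blurry"], "next": {}},
}

def valid_paths(workflow, question="q1_content"):
    """Enumerate all valid label paths through the workflow graph."""
    paths = []
    for option in workflow[question]["options"]:
        followups = workflow[question]["next"].get(option, [])
        if not followups:
            paths.append([option])
            continue
        # Combine this option with every combination of answers to its follow-ups
        combos = [[]]
        for fq in followups:
            combos = [c + p for c in combos for p in valid_paths(workflow, fq)]
        paths.extend([[option] + c for c in combos])
    return paths

print(len(valid_paths(darkskies_workflow)))  # 6 "city" paths + 6 leaf options = 12
```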
On top of the notations we defined in Section 2, we also define the notations which are specific to our workflow graph model in Table III. The problem we are solving in this paper can then be defined as Definition 3.1 below.

Figure 7. Graph representation of an example classification workflow W vs the corresponding classic way of looking at the classification with multiple questions

Table II. Definitions
Term                 Definition
Task                 A general term referring to an action or a series of actions that need to be executed
Classification task  Task classifying objects into given categories; it could be a simple task (one question) or a relatively complex task (more than one question)
Microtask            A task is decomposed into smaller units, making it easier for the crowd. One microtask is equivalent to one question in a classification task
Workflow             Microtasks are arranged/chained in a way to automatically complete the task
Question             Classification task asked of the user to elicit/assign a label to an attribute of the object to be classified
Option               The set of possible labels
Chosen option        An option the user chooses per question
Correct label        The correct label for a question
Chosen path          The set of labels a user chooses for the entire workflow
Correct path         The correct set of labels for the entire workflow
Workflow graph       The workflow can be modelled as a directed acyclic graph (DAG), in which the root node represents the object under consideration and all other nodes are classification options
Node                 A representation of an option in our model
Node level           The sequence in which the question is presented to the user within a workflow

Table III. Notations specific to our model
Notation        Definition
W_o             Represents the graph based on the workflow for classifying object o; it has node levels to indicate the questions classifying the corresponding attributes of the given object, and nodes to represent the options available for each attribute
A_(n)           Represents the available options at node level n
a_(n)j          Represents an individual option at node level n, where j ∈ {1, ..., |A_(n)|}
l_o(n)^u        Represents the label chosen by user u at node level n for object o. Thus, the labelling result (l_o(1)^u, l_o(2)^u, ..., l_o(n)^u) represents the ordered list of nodes (the traversal path) visited by user u when classifying o, which is called a label path
L_o^u           The label path chosen by user u for object o
L_o(n)          All labels for object o at node level n
L_o(n)unique    Unique labels for object o at node level n, L_o(n)unique ⊆ A_(n)
L̃_o             Represents the inferred label path for object o. It is a set of inferred labels for each node level, described as (l̃_o(1), ..., l̃_o(n))
L_o^gold        True label path for object o

Definition 3.1 (The Correct Labelling Problem). Given a particular object o, a workflow-based graph W_o, a set of labels L_o for object o, and (optionally) a set of previous labels from all users on all objects L, our aim is to infer the correct label path L̃_o in W_o for object o.

3.2 Adapted aggregation

The classic approaches do not look at the dependency between node levels; hence, naively putting the inferred results from each node level together does not guarantee a valid result. Producing a valid path from the possible choices should therefore improve the accuracy. As such, a basic adaptation of the classic algorithms should show some improvement over multi-level workflows. We show such a basic adaptation in Algorithm 4.
Algorithm 4 Our Adapted Approach
procedure Predict_By_NodeLevel(L_o)
    num_levels ← n
    for level ∈ range(n) do
        if method == mv then
            procedure FindUniqueLabel(L_o)
                L_unique ← {l_o^u}, where L_unique ⊆ A, l_o^u ∈ A and u ∈ U_o
                for l ∈ L_unique do
                    p_l ← count(l_i) / |L|   ▷ percentage of l being voted as the label for object o
                return LC_n ← {(l, p_l)}   ▷ list of candidate labels and their percentages for o
        if method == em then
            procedure Initialise(p_l)
                p_l ← count(l) / |L_o|   ▷ percentage of l being the true label for object o (l ∈ A)
                while not converged do
                    Estimate the error rate for user u:
                        n_{l,l'}^u ← Σ_{o ∈ L_u} p_l · I(l_o^u = l')
                    Estimate the confusion matrix:
                        e_{l,l'}^u ← n_{l,l'}^u / Σ_q n_{l,q}^u   ▷ q iterates over the classes; e^u captures the accuracy of user u
                    Estimate the class priors:
                        pr_l ← Σ_o p_l / |O|
                    Calculate the class probability for object o:
                        p_l ← pr_l · Π_{u ∈ U_o} Π_m (e_{l,m}^u)^{I(l_o^u = m)} / Σ_q pr_q · Π_{u ∈ U_o} Π_m (e_{q,m}^u)^{I(l_o^u = m)}
                return LC_n ← {(l, p_l)}   ▷ list of label candidates and their corresponding probabilities for o
        if method == mp then
            procedure Initialisation(y_{u→o})
                for (o, u) ∈ L do
                    Initialise y_{u→o} with N(1, 1)
            procedure Iteration(k_max)
                for k ∈ {1, ..., k_max} do
                    for (o, u) ∈ L do
                        x_{o→u}^k ← Σ_{u' ∈ U_o, u' ≠ u} E_{ou'} · y_{u'→o}^{k-1}
                    for (o, u) ∈ L do
                        y_{u→o}^k ← Σ_{o' ∈ O_u, o' ≠ o} E_{o'u} · x_{o'→u}^k
                x_o ← Σ_{u ∈ U_o} E_{ou} · y_{u→o}^{k_max - 1}
                if sign(x_o) == 1 then
                    LC_n.append((x_o, 1.0))
procedure Assemble_MostPossiblePath(L_o)
    num_levels ← n
    LC ← {}
    for z_1 ∈ LC_1 do
        for z_2 ∈ LC_2 do
            ...
            for z_n ∈ LC_n do
                LC.append(((z_1, z_2, ..., z_n), p_{z_1} · p_{z_2} · ... · p_{z_n}))
    L̃_o ← ∅
    p_max ← 0
    for Z ∈ LC do
        if p_Z ≥ p_max then
            p_max ← p_Z
            L̃_o ← Z
    return L̃_o

Our adapted approach assumes that labels at different levels in the workflow are independent and then assembles the label path from each node level based on the workflow graph. In the adapted approach, we not only reward partially correct answers from the crowd by applying each of the algorithms at each node level in the graph and computing scores for the individual labels, but also consider path validity when inferring the correct path. We also deliberately choose two algorithms that take into account the performance of the crowd in their computations, EM and MP. The EM algorithm sums up the node probabilities along each path to determine the ranking score. The MP algorithm returns true if a particular label at a node level is relevant and false otherwise; this means we assign the score for the candidate paths correspondingly as either 1.0 or 0.0. By studying this, we want to allow MP and EM to better identify those users who, while not doing so well overall, are very skilled at a particular sub-task (question) in the workflow.
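A minimal Python sketch of the path-assembly step at the heart of the adapted approach: given per-level candidate labels with scores (as produced by MV, EM or MP at each node level) and the set of valid paths from the workflow graph, it ranks valid paths by the product of their per-level scores. The data structures are illustrative assumptions, and for simplicity the sketch assumes a fixed number of levels.

```python
from itertools import product

def assemble_most_probable_path(level_candidates, valid_paths):
    """Pick the highest-scoring valid label path.

    level_candidates: list (one entry per node level) of dicts label -> score,
        e.g. [{"city": 0.7, "stars": 0.3}, {"cloudy": 0.6, "clear": 0.4}]
    valid_paths: set of label tuples that form valid paths in the workflow graph
    Returns (best_path, score), or (None, 0.0) if no valid path is covered.
    """
    best_path, best_score = None, 0.0
    for combo in product(*[c.items() for c in level_candidates]):
        labels = tuple(label for label, _ in combo)
        if labels not in valid_paths:
            continue  # discard assemblies that do not exist in the workflow graph
        score = 1.0
        for _, s in combo:
            score *= s
        if score >= best_score:
            best_path, best_score = labels, score
    return best_path, best_score

# Example with two levels and two valid paths
candidates = [{"city": 0.7, "stars": 0.3}, {"cloudy": 0.6, "clear": 0.4}]
valid = {("city", "cloudy"), ("city", "clear")}
print(assemble_most_probable_path(candidates, valid))  # -> (('city', 'cloudy'), ~0.42)
```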
4. Evaluation

To evaluate the three algorithms and our adapted approach, we compare the classic approach, where algorithms are applied at each node level and the results simply put together (we call it the "naive approach" here), with our "adapted approach", which uses the classic approach while striving to infer a valid correct path by considering the workflow graph. Thus, we have six different approaches: mv_adapted, mv_naive, mp_adapted, mp_naive, em_adapted, em_naive. Each inference algorithm was applied to six data sets with different microtask crowdsourcing workflows. We start with the evaluation setup of the data in Section 4.1 and the evaluation metrics in Section 4.2. Then we present the evaluation of the inferred results in Section 4.3.

4.1 Data

We used three existing data sets. The first one is from the Snapshot Serengeti[1] project and consists of all crowd classifications within the time span from 10 December 2012 until 17 July 2013. It contains 7,800,896 labels from 890,280 volunteers for a total of 66,892 objects. For our evaluation, we used a gold standard with curated labels for 4,149 objects, which was created by professional scientists working on the Snapshot Serengeti project. To evaluate our approach we took all labels received from the crowd for the 4,149 objects, which comprise 112,027 labels submitted by 8,304 volunteers. The second data set is from the Dark Skies app within the Cities at Night[2] project. It consists of 1,275,354 classifications by 19,818 volunteers submitted in a time span from April 27th, 2014 until December 5th, 2016. The gold standard consisted of 200 objects whose labels were manually validated by the science team in Cities at Night. These 200 objects received 1,341 labels from 692 users on CrowdCrafting[7]. The third one is from the GalaxyZoo[3] project, where we randomly chose 500 objects with classifications from 16 February 2009 to 21 May 2009. The workflows for the three data sets are depicted in Figures 4, 5 and 6, respectively. To explore the effect of the volunteer/paid context on the results, the tasks were also set up on a paid crowdsourcing platform to mimic the tasks done by volunteers.

4.2 Metric

To measure the performance of our aggregation approach, we employ the accuracy metric, which has been commonly used in classification evaluation in previous work (Khattak and Salleb-Aouissi, 2011; Kamar et al., 2012; Sheshadri and Lease, 2013; Hung et al., 2013; Zhang et al., 2017a; Zheng et al., 2017b). Accuracy is a measure allowing us to understand the percentage of correct answers (inferred by the algorithms). The accuracy is defined as the percentage of objects that have been correctly inferred. Higher accuracy indicates better performance.

Accuracy = ( Σ_{o ∈ O} Bernoulli(L̃_o == L_o^gold) ) / |O|

The above equation is by default for calculating the accuracy of the inferred label path. Bernoulli(L̃_o == L_o^gold) indicates the outcome (either 0 or 1) of comparing the gold category with the category predicted by a given predictor. As we use the adapted node-level-based implementation, it makes sense to also evaluate how accurate the inferred label is at each node level. In that context, L_o^gold[n] represents the ground truth for object o at node level n and L̃_o[n] represents the inferred true label at node level n. Hence, the accuracy at node level n for the top answer can be calculated by:

Accuracy_level = ( Σ_{o ∈ O} Bernoulli(L̃_o[n] == L_o^gold[n]) ) / |O|

To understand whether our adapted approach is significantly better, we also run significance testing for all chosen algorithms. We use the standard 5 per cent significance level. For each data set, we randomly select 100 objects, repeating the selection 50 times. The accuracy for each selection is calculated for MV, MP and EM for both the naive and the adapted approach. We use the function scipy.stats.ttest_ind from Python[8] to perform the two-sided test for the naive and adapted samples in all six cases (three workflows, each with two contexts: volunteer and paid).
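A short sketch of how the two accuracy measures and the significance test can be computed; scipy.stats.ttest_ind is the function cited above, while the data layout and the sample values are illustrative assumptions.

```python
from scipy import stats

def path_accuracy(inferred, gold):
    """Fraction of objects whose full inferred label path matches the gold path."""
    objects = list(gold)
    hits = sum(1 for o in objects if tuple(inferred[o]) == tuple(gold[o]))
    return hits / len(objects)

def level_accuracy(inferred, gold, level):
    """Fraction of objects whose inferred label at a given node level matches the gold label."""
    objects = [o for o in gold if level < len(gold[o]) and level < len(inferred[o])]
    hits = sum(1 for o in objects if inferred[o][level] == gold[o][level])
    return hits / len(objects) if objects else 0.0

# Two-sided t-test over repeated 100-object samples of naive vs adapted accuracy
naive_samples = [0.31, 0.29, 0.33, 0.28, 0.30]      # assumed accuracies per random sample
adapted_samples = [0.45, 0.47, 0.44, 0.46, 0.48]    # assumed accuracies per random sample
t_stat, p_value = stats.ttest_ind(naive_samples, adapted_samples)
print(p_value < 0.05)  # True -> difference significant at the 5 per cent level
```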
4.3 Results

Table IV shows the accuracy of each algorithm on each data set for the inferred answer.

Table IV. Accuracy (by path) of each algorithm
Data set    Graph depth/size         Crowd type  Algorithm    Accuracy
serengeti   54-11                    volunteer   mv_naive     0.590
                                                 mv_adapted   0.776
                                                 em_naive     0.572
                                                 em_adapted   0.655
                                                 mp_naive     0.755
                                                 mp_adapted   0.755
                                     paid        mv_naive     0.299
                                                 mv_adapted   0.459
                                                 em_naive     0.244
                                                 em_adapted   0.337
                                                 mp_naive     0.083
                                                 mp_adapted   0.207
darkskies   8-3-2                    volunteer   mv_naive     0.690
                                                 mv_adapted   0.785
                                                 em_naive     0.040
                                                 em_adapted   0.450
                                                 mp_naive     0.340
                                                 mp_adapted   0.495
                                     paid        mv_naive     0.405
                                                 mv_adapted   0.530
                                                 em_naive     0.020
                                                 em_adapted   0.385
                                                 mp_naive     0.335
                                                 mp_adapted   0.305
galaxyzoo   3-3-2-3-2-2-3-6-4-2-7    volunteer   mv_naive     0.554
                                                 mv_adapted   0.631
                                                 em_naive     0.470
                                                 em_adapted   0.564
                                                 mp_naive     0.002
                                                 mp_adapted   0.562
                                     paid        mv_naive     0.371
                                                 mv_adapted   0.579
                                                 em_naive     0.000
                                                 em_adapted   0.331
                                                 mp_naive     0.002
                                                 mp_adapted   0.367

Considering the overall classification accuracy (by path), our adapted methods perform better than the naive approach in both the volunteer and the paid crowd context; at the same time, each algorithm generally has higher accuracy in the volunteer context compared to the paid crowd. Note that the best accuracy achieved increases as the depth of the workflow increases in the paid crowd context: Serengeti with two questions achieves 45.9 per cent, darkskies with three questions achieves 53.0 per cent and galaxyzoo with a maximum of nine questions achieves 57.9 per cent. A similar pattern is not observed for the volunteer context.

Looking at the accuracy breakdown by node level (Figures 8, 9 and 10), it is notable that for multiple-question tasks with more steps, the adapted versions of MP and EM generally show better accuracy at most of the node levels. For data sets from a task with fewer steps in its workflow (fewer levels in the graph), such as the Serengeti task in Figure 8, MV performs better. Meanwhile, from Table IV we can see that MV shows acceptable accuracy for most of the volunteer data sets (mostly over 75 per cent, except for the GalaxyZoo data set), but poor accuracy (less than 60 per cent) in the paid crowd context, even though it performs better than the other individual algorithms we tested. This suggests it needs to be complemented by other methods which might be good at specific objects where MV cannot perform well. Looking at the accuracy-by-level results, there is no tendency for accuracy to consistently increase or decrease as the depth of the task (number of levels) increases. The accuracy at each level is more related to its intrinsic character (e.g. the number of options at that level, and the ambiguity or subjectivity of the corresponding object). For instance, the darkskies task asks the user to evaluate the sharpness and cloudiness of the image, which can be subjective to some degree. This is also why the result by node level shows an interesting picture: at different node levels of different workflows, sometimes EM has the best result (such as levels 4 and 5 of GalaxyZoo), sometimes MP has the best result (such as level 1 of Serengeti in the volunteer case), and other times MV has the best result (levels 1, 2 and 3 of Darkskies in both the volunteer and paid context).

Figure 8. Accuracy by node level (Serengeti)
Figure 9. Accuracy by node level (Darkskies)
Figure 10. Accuracy by node level (Galaxyzoo)
Notice that MP in the darkskies paid crowd context is the only case we observe where the naive approach has higher overall accuracy (by path) than the adapted one. This is due to the fact that both level 2 and level 3 of the darkskies workflow (determining the cloudiness and sharpness of the image) are in essence questions independent of the first node level (whether it is a city, stars or anything else), even though the task workflow makes them subsequent questions only when "city" is chosen as the label for the first node level. Similarly, the accuracy-by-level result from mp_adapted is lower than mp_naive on a few other occasions at different node levels, but on those occasions there is always one node level where mp_naive has considerably poor accuracy, such as GalaxyZoo node level 2, which subsequently leads to very low overall accuracy when considering the whole path. The reason the mp_adapted approach can have lower accuracy at a certain level is that the MP approach only returns 1.0 or 0.0 to indicate whether that is the predicted label, whereas our adapted approach tries to assemble/infer the most probable valid label path (as shown in Algorithm 4) based on the candidate predicted labels from the individual node levels. Therefore, in the MP case, the randomness in ranking the combinations might not do well for the corresponding node level; however, the overall accuracy has been shown to be better than the naive approach, which completely neglects the validity of a label path. Notice that, although our adapted approaches achieve higher accuracy for the first node level in most cases, mv_adapted has slightly lower accuracy compared to mv_naive for the GalaxyZoo workflow in the volunteer context. This is because we assemble the result based on the overall probability of a path (the percentages of votes at each node level multiplied together) instead of assuming the top-voted label at node level 1 is correct (and then traversing subsequent nodes based on that assumption). Our main purpose is to obtain the most probable valid label path, which has been shown to be effective in Table IV.

We have run significance testing for all chosen algorithms. The result is statistically significant for all our adapted approaches, as the p-value is smaller than the pre-defined significance level (5 per cent) in all cases.

5. Discussion

In this section, we expand on the key findings of the evaluation results introduced earlier.

5.1 Crowd context matters

We have deliberately chosen three representative tasks, each providing two data sets produced by volunteers and the paid crowd. Based on our results, there is a distinctive difference in performance for the same algorithm applied in these two different contexts. For all algorithms, the accuracy achieved in the volunteer context is evidently higher than in the paid crowd context, without exception. For the same workflow, the overall accuracy (by path) achieved in the volunteer context is normally around 30 per cent higher than in the paid crowd context for workflows with two to three questions. However, this does not seem to be the case when the workflow involves more questions, such as in the galaxyzoo case, where the best accuracy all the algorithms can achieve is only around 5 per cent higher in the volunteer context compared to the paid crowd context.
5.2 Workflow counts

From the representative tasks we have shown so far, there are two main factors that need to be taken into account when designing a classification crowdsourcing workflow, especially when classification steps are interdependent: the number of questions (determining the depth of the graph) and the number of answer options per question (the width of the corresponding node level, affecting the cognitive effort required to pass that node level with the correct chosen options). In our evaluation, we found evidence that both depth and width impact the overall performance of the inference algorithms. One visible pattern is for the paid crowd data sets. In this setting, overall accuracy (by path) increases as the depth of the graph increases (for both mv_adapted and mp_adapted), which suggests that it might be a good idea to have more classification questions, each with fewer options, rather than having fewer questions with many options to choose from, particularly in the case where the crowd's skill level is uncertain. The other notable aspect is that, in the volunteer context, the MP algorithm has comparable performance to MV in the Serengeti workflow, but not in the other two workflows with more levels.

5.3 Heuristics-based aggregation as an addition

On observing the results in Section 4.3, it seems promising to combine the output from these algorithms using a heuristic strategy to perform better inference. We want to use the results from mv_adapted, em_adapted and mp_adapted in combination, to exploit their respective strengths and weaknesses for complex classification tasks. To do so, we could build an aggregator based on the following intuitions:

(1) The number of unique classifications of an object (denoted u) shows the degree to which the crowd workers agree or disagree on the classification, where a higher number indicates a higher degree of disagreement and normally implies the object is either somewhat difficult or ambiguous to classify.

(2) The ratio (denoted r) between the number of unique classifications/answers collected from the crowd and the total number of classifications/judgments likewise demonstrates how diverse the answers are for the corresponding object.

(3) The three-sigma rule (Pukelsheim, 1994) in the empirical sciences suggests that almost all values should lie within three standard deviations of the mean in a normal distribution; theoretically, the mean plus or minus one, two or three standard deviation(s) covers approximately 68, 95 and 99.7 per cent of the data, respectively. In the cases where MV might potentially fail (where workers tend to disagree), the number of unique classifications or the ratio of unique to total classifications for an object falls within the higher range of the distribution.

Thus, a heuristic aggregation strategy we could consider is the following: look at the intrinsic characteristics of the collected classifications for each object, such as the number of unique classifications and the ratio of that number against the total number of classifications. Then, based on the third intuition above, we can use the skewness (denoted s) of the distributions for the number of unique classifications (U ~ N(u_m, u_s)) and the ratio (R ~ N(r_m, r_s)), respectively, to heuristically choose a bound beyond which MV can potentially be complemented by other approaches. However, choosing an optimal threshold is not straightforward and needs to be explored in future work.
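A minimal sketch of the two disagreement signals described above (number of unique classifications u and the unique-to-total ratio r) and of a simple mean-plus-k-standard-deviations cut-off for flagging objects where MV may need to be complemented; the choice of k is deliberately left open, matching the discussion above, and the data layout is an illustrative assumption.

```python
from statistics import mean, stdev

def disagreement_signals(labels_per_object):
    """Return (u, r) per object: unique label count and unique/total ratio."""
    signals = {}
    for obj, labels in labels_per_object.items():
        u = len(set(labels))
        r = u / len(labels)
        signals[obj] = (u, r)
    return signals

def flag_for_complementing(labels_per_object, k=1.0):
    """Flag objects whose unique-label ratio lies above mean + k standard deviations."""
    signals = disagreement_signals(labels_per_object)
    ratios = [r for _, r in signals.values()]
    cutoff = mean(ratios) + k * stdev(ratios)
    return [obj for obj, (_, r) in signals.items() if r > cutoff]

labels = {
    "img1": ["zebra"] * 18 + ["wildebeest"] * 2,            # high agreement
    "img2": ["zebra", "wildebeest", "gazelle", "buffalo"],  # high disagreement
    "img3": ["lion"] * 10,
}
print(flag_for_complementing(labels, k=1.0))  # -> ['img2']
```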
6. Related work

Our approach is informed by existing work on microtask crowdsourcing and quality assurance in crowdsourcing, which we review in this section.

6.1 Microtask crowdsourcing and workflows

In crowdsourcing, a problem sometimes needs to be decomposed into smaller, fine-grained microtasks and then arranged in a workflow for more effective processing. In general, a workflow consists of a set of microtasks; the microtasks are sometimes of different types and can be dependent on or independent of each other. For instance, the find-fix-verify workflow proposed by Bernstein et al. (2010) uses microtask crowdsourcing to proofread and shorten text in three steps: finding areas of improvement in the text; fixing or improving them; and verifying the quality of the changes. In each step, the crowd is asked to carry out the same type of microtask, sometimes iteratively. In Kittur et al.'s (2008, 2013) and Acosta et al.'s (2013) studies, researchers have proposed grouping the same or similar microtasks into batches as a means to facilitate learning effects. Previous studies have also shown that task performance can be improved as a function of several factors, including the design of tasks and workflows, motivation and incentives, and training (Bernstein et al., 2010; Demartini et al., 2012; Kittur et al., 2008; Wiggins et al., 2011).

In citizen science platforms such as Zooniverse[9], most classification projects are not simple one-question tasks; instead, multiple questions are chained together. Zooniverse uses a workflow to "group a collection of tasks into a logic unit"[10], which is, in essence, referring to the relatively complex multiple-question task which needs to be finished in several steps. In Snapshot Serengeti[1], classifying an image means answering a set of independent questions, sometimes several times when more than one animal is present in the image. In Cities at Night[2] and Galaxy Zoo[3], questions are inter-related and the answers given in one step determine the questions in the subsequent steps. In the context of such classification tasks, a workflow is used to refer to the logical organisation of the classification questions and their corresponding options.

Most previous studies of crowdsourcing workflows have focussed on the design of the workflows and have shown that a particular type of workflow can be crowdsourced effectively (in terms of the accuracy of outputs, budget, time, etc.) (Little et al., 2009; Bernstein et al., 2010; Tran-Thanh et al., 2015). In some cases, researchers have proposed bespoke quality assurance methods for their workflows (Lintott et al., 2011; Willett et al., 2013). Our work proposes a strategy which can be applied to determine the correct label path for a whole range of classification tasks spanning several steps with independent or dependent multiple-choice questions, which differs from existing research that mainly focuses on the result of the final step (no matter how many previous steps exist in the workflow).

6.2 Inference algorithms

Researchers have proposed inference algorithms, mathematical models that can automatically infer the correct solution to a given problem from a solution space defined by the crowd. For example, Ipeirotis et al. presented an algorithm that assesses the performance of crowd workers and exploits this information to estimate the quality of answers on Mechanical Turk (Ipeirotis et al., 2010). Karger et al. proposed using MP to infer correct answers from workers' answers (Karger et al., 2011).
Bachrach et al. (2012) used a Bayesian graphical model to grade test answers in scenarios where the ground truth cannot be made available. Whitehill et al. (2009) followed an expectation maximisation approach to identify correct classifications, depending on the expertise of the workers and the level of difficulty of the task. In the citizen science project Galaxy Zoo Supernovae, crowd answers were analysed using a Bayesian generalisation of the same expectation maximisation idea (Simpson et al., 2011). More recently, Difallah et al. (2015b) compiled a set of features that can be used to predict answer quality, based on an analysis of Mechanical Turk logs. Several studies have shown that it is possible to combine automatic prediction methods (such as Bayesian or generative probabilistic models) with additional input from the crowd to further improve the accuracy of the predictions (dos Reis et al., 2015; Hare et al., 2013; Ipeirotis et al., 2010; Loni et al., 2014; Simpson et al., 2013). Other studies have analysed and compared different algorithms (Zheng et al., 2017a; et al., 2015; Sheshadri and Lease, 2013), emphasising the need for more research to understand the interplay among different sets of design parameters on the overall performance.

All these existing methods have considerably advanced the state of the art. However, they cannot be applied to every type of microtask crowdsourcing workflow without restrictions. Moreover, most of the research carried out so far in this space has looked at rather simple binary or multiple-choice classification tasks with the aim of identifying a single, correct answer. This class of microtasks, albeit important and widely used, is not always the norm. As we have seen in the examples from the previous section, there are cases where a classification problem cannot be easily decomposed into independent microtasks, or where different, related microtasks should be grouped into more complex workflows for efficiency reasons. Although there are a few recent works looking into relatively complex multiple-step classification tasks, each of them has a domain-specific or problem-specific focus (Parameswaran et al., 2011; Kim et al., 2002; Wu et al., 2012; Bragg et al., 2013; Kamar and Horvitz, 2015; Otani et al., 2016). Bragg et al. (2013) and Otani et al. (2016) both researched entity classification, which normally involves categorising the given entity into parent-child classes in different steps, but have very different perspectives. Bragg et al. (2013) focus on improving the workflow for generating a taxonomy, as well as inference methods to induce the parent-child relationship, while Otani et al. (2016) focus on the task where a parent-child relationship exists between two adjacent classification steps, and propose label aggregation methods adapted from the existing GLAD method (Whitehill et al., 2009) by considering the hierarchical class-subclass structure. In addition, Wu et al. (2012) investigate the sequential data labelling scenario and present Sembler to ensemble crowd sequential labellings by leveraging the statistical correlation and dependency among multiple instances/sentences, which is domain specific and not applicable to other multiple-step classifications where no such statistics can be exploited. Parameswaran et al.
(2011) and Kamar and Horvitz (2015) particularly look at multiple-step image classification tasks, while both took approaches that are not easy to generalise to other multiple-step classifications. Parameswaran et al. (2011) explicitly formulate the classification task as a human-assisted graph search problem, presenting the dimensions characterising the different types of classification and developing algorithms to optimise the questions to be asked (at the different nodes), which are evaluated with simulations. On the other hand, Kamar and Horvitz (2015) focus on optimising worker allocation in the hierarchical classification task (HCT) and develop answer models and evidence models for HCT consensus, while both models are constructed with supervised learning, assisted by the Sloan Digital Sky Survey (SDSS) features identified by machine vision available for the GalaxyZoo data set. There is also some research dedicated particularly to automatic hierarchical classification, where a taxonomy is given and a parent-child relationship among classes exists, but all of it is bound to a certain domain. For instance, Dumais (2000) investigates automatic hierarchical classification using a Support Vector Machine with existing web pages whose categories are known as training data. Su et al. (2006) present an automatic method to classify structured web databases by leveraging probing queries, the returned counts of query results and the SVM classifier. Such automatic hierarchical classification not only needs existing labelled data as training data but also focuses on classifications where the answers to further classification steps down the line (child classes) are always a sufficient condition to confirm the answer to the previous classification step (parent classes).

Our approach differs from existing work mainly in that it is not restricted to a specific type of multiple-step classification and does not need additional information such as machine-identified features of the image or frequency/correlation among word usage, nor does it rely on parent-child relationships between classification steps. Our method is general and intuitively easy to apply to any multi-step classification. We discussed the three main individual algorithms in Section 2 and noted that, whilst all three algorithms can be used to infer the correct answer for a multiple-choice question, they differ in terms of their inputs and outputs. In our approach, we devised a new strategy to use the existing algorithms to achieve higher classification accuracy.

7. Conclusion

Ensuring quality is one of the grand challenges of microtask crowdsourcing. While previous research has looked at inferring correct answers for microtasks consisting of single binary or multiple-choice questions, our research proposes a model that can be applied to both single-question and multiple-question scenarios, filling the gap in understanding how to aggregate in the multiple-question scenario. We propose a graph model and an "adapted" aggregation method that can improve the accuracy of inferring the true label path in complex workflows with several interdependent questions. Though a few previous works have tried to address similar multiple-step classification, they either limit it to hierarchical classification scenarios where a parent-child relationship exists between classification steps or restrict the method by having to involve additional information.
We propose using the graph to model a microtask crowdsourcing workflow and to support inference algorithms in making decisions about correct labels for classification tasks with multiple questions, where the answer to one question does not have to be a sufficient condition for, or imply, that the answer to the previous question is correct. We believe this is the first work that investigates aggregation in a multiple-step classification task with interdependent questions to infer the correct label path and assess the classification accuracy accordingly. To this end, we explored three inference algorithms, MV, MP and EM, each with proven benefits in quality assurance in crowdsourcing. We compared the performance of our adapted approach and the existing naive approach, using six representative data sets. We evaluated the performance of the individual algorithms for overall accuracy, where a full labelling path is considered as an atomic, correct answer, and with a more refined measure which looks at accuracy at individual node levels of the workflow graph. The results have shown that our adapted approach significantly improves accuracy compared with the naive approach. The results also demonstrate that, while MV does well in overall accuracy, a deeper analysis of the accuracy at each node level reveals a more interesting picture. Hence, a heuristics-based aggregation approach might be a potentially better solution, combining results from multiple algorithms to leverage each other's strengths. This suggests the need for more dynamic inference approaches that can adapt to the complexity of the crowdsourcing workflow.

In future work, we plan to devise inference methods that take other, more workflow-specific factors into account. Our current method assumes independence between labels from different levels when inferring the answer for each level. It can potentially be improved to consider the possible correlation between labels at different node levels. For instance, it could give different weights to labels based on the inferred result from the previous level. Such a method requires a top-down traversal process, which might bring side effects since it relies heavily on the inferred result from the previous level and carries the effect (weight) on to subsequent levels even when the choice at the previous levels may be incorrect. As the correlation between labels at different node levels is complicated, the feasibility of incorporating such correlation information into the aggregation process needs further investigation. Meanwhile, the number of options and the length of possible paths in a workflow deserve more in-depth experiments. One promising direction will be to employ other machine learning approaches for truth inference; for instance, using the workflow properties along with the crowdsourcing-generated data to learn and explore features automatically (Huynh et al., 2013) and produce a decision tree to help choose the proper inference algorithm. Alternatively, certain properties of the crowd-collected data could be further exploited to train machine learning algorithm(s) with selective labels to directly infer the true label path.

Notes
1. https://www.snapshotserengeti.org/
2. http://citiesatnight.org/
3. https://www.galaxyzoo.org/
4. https://success.crowdflower.com/hc/en-us/articles/203527635-CML-Attribute-Aggregation
5. https://data.galaxyzoo.org/gz_trees/gz_trees.html
6. In a lot of cases, the workflows are tree-shaped, but some are not, such as the three tasks presented above.
7. https://crowdcrafting.org/
8. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
9. https://www.zooniverse.org/
10. https://blog.zooniverse.org/2013/06/20/how-the-zooniverse-works-the-domain-model/

References
Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D., Auer, S. and Lehmann, J. (2013), "Crowdsourcing linked data quality assessment", The Semantic Web – ISWC 2013, pp. 260-276.
Bachrach, Y., Minka, T. and Guiver, J. (2012), "How to grade a test without knowing the answers – a Bayesian graphical model for adaptive crowdsourcing and aptitude testing".
Batini, C., Cappiello, C., Francalanci, C. and Maurino, A. (2009), "Methodologies for data quality assessment and improvement", ACM Computing Surveys, Vol. 41 No. 3, pp. 1-52.
Bernstein, M.S., Little, G., Miller, R.C., Hartmann, B., Ackerman, M.S., Karger, D.R., Crowell, D. and Panovich, K. (2010), "Soylent: a word processor with a crowd inside", Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, ACM, pp. 313-322.
Bragg, J., Mausam and Weld, D.S. (2013), "Crowdsourcing multi-label classification for taxonomy creation", in HCOMP 2013, First AAAI Conference on Human Computation and Crowdsourcing.
Dawid, A.P. and Skene, A.M. (1979), "Maximum likelihood estimation of observer error-rates using the EM algorithm", Applied Statistics, Vol. 28 No. 1, p. 20.
Demartini, G., Difallah, D.E. and Cudré-Mauroux, P. (2012), "ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking", Proceedings of the 21st International Conference on World Wide Web, ACM, pp. 469-478.
Difallah, D.E., Catasta, M., Demartini, G., Ipeirotis, P.G. and Cudré-Mauroux, P. (2015a), "The dynamics of micro-task crowdsourcing: the case of Amazon MTurk", pp. 238-247.
Difallah, D.E., Catasta, M., Demartini, G., Ipeirotis, P.G. and Cudré-Mauroux, P. (2015b), "The dynamics of micro-task crowdsourcing: the case of Amazon MTurk", pp. 238-247.
dos Reis, F.J.C., Lynn, S., Ali, H.R., Eccles, D., Hanby, A., Provenzano, E., Caldas, C., Howat, W.J., McDuffus, L.-A. and Liu, B. (2015), "Crowdsourcing the general public for large scale molecular pathology studies in cancer", EBioMedicine, Vol. 2 No. 7, pp. 679-687.
Dumais, S. (2000), "Hierarchical classification of web content", pp. 256-263.
Eickhoff, C. and de Vries, A. (2011), "How crowdsourcable is your task?", in Proceedings of the Workshop on Crowdsourcing for Search and Data Mining (CSDM) at the Fourth ACM International Conference on Web Search and Data Mining (WSDM), pp. 11-14.
Gadiraju, U., Demartini, G., Kawase, R. and Dietze, S. (2015), "Human beyond the machine: challenges and opportunities of microtask crowdsourcing", IEEE Intelligent Systems, Vol. 30 No. 4, pp. 81-85.
Gadiraju, U., Kawase, R. and Dietze, S. (2014), "A taxonomy of microtasks on the web", Proceedings of the 25th ACM Conference on Hypertext and Social Media, ACM, pp. 218-223.
Gelas, H., Abate, S.T. and Besacier, L. (2011), "Quality assessment of crowdsourcing transcriptions for African languages", pp. 3065-3068.
Hare, J.S., Acosta, M., Weston, A., Simperl, E., Samangooei, S., Dupplaw, D. and Lewis, P.H. (2013), "An investigation of techniques that aim to improve the quality of labels provided by the crowd", in Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, Barcelona, Spain, October 18-19, 2013, Vol. 1043 of CEUR Workshop Proceedings, available at: CEUR-WS.org
Hung, Q.V.N., Tam, N.T., Tran, L.N. and Aberer, K. (2013), "An evaluation of aggregation techniques in crowdsourcing", Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 8181 LNCS, No. PART 2, pp. 1-15.
Hung, N.Q.V., Thang, D.C., Weidlich, M. and Aberer, K. (2015), "Minimizing efforts in validating crowd answers", Proceedings of the ACM SIGMOD International Conference on Management of Data, Vol. 2015-May, pp. 999-1014.
Huynh, T.D., Ebden, M., Venanzi, M., Ramchurn, S., Roberts, S. and Moreau, L. (2013), "Interpretation of crowdsourced activities using provenance network analysis", The First AAAI Conference on Human Computation and Crowdsourcing, pp. 78-85.
Ipeirotis, P.G., Provost, F., Sheng, V.S. and Wang, J. (2014), "Repeated labeling using multiple noisy labelers", Data Mining and Knowledge Discovery, Vol. 28 No. 2, pp. 402-441.
Ipeirotis, P.G., Provost, F. and Wang, J. (2010), "Quality management on Amazon Mechanical Turk", Proceedings of the ACM SIGKDD Workshop on Human Computation – HCOMP '10, p. 64.
JCGM (2008), JCGM 200: "International vocabulary of metrology – basic and general concepts and associated terms (VIM) / Vocabulaire international de métrologie – concepts fondamentaux et généraux et termes associés (VIM)", International Organization for Standardization, Geneva, Vol. 3.
Kahn, B.K., Strong, D.M. and Wang, R.Y. (2002), "Information quality benchmarks: product and service performance", Communications of the ACM, Vol. 45 No. 4, pp. 184-192.
Kamar, E., Hacker, S. and Horvitz, E. (2012), "Combining human and machine intelligence in large-scale crowdsourcing", Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, International Foundation for Autonomous Agents and Multiagent Systems, Vol. 1, pp. 467-474.
Kamar, E. and Horvitz, E. (2015), "Planning for crowdsourcing hierarchical tasks", Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, p. 2030.
Karger, D.R., Oh, S. and Shah, D. (2011), "Iterative learning for reliable crowdsourcing systems", Advances in Neural Information Processing Systems, pp. 1953-1961.
Khattak, F.K. and Salleb-Aouissi, A. (2011), "Quality control of crowd labeling through expert evaluation", Second Workshop on Computational Social Science and the Wisdom of Crowds (NIPS 2011), pp. 1-5.
Kim, J.-H., Kang, I.-H. and Choi, K.-S. (2002), "Unsupervised named entity classification models and their ensembles", Proceedings of the 19th International Conference on Computational Linguistics, Vol. 1, pp. 1-7.
Kittur, A., Chi, E.H. and Suh, B. (2008), "Crowdsourcing user studies with Mechanical Turk", Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, pp. 453-456.
Kittur, A., Nickerson, J.V., Bernstein, M., Gerber, E., Shaw, A., Zimmerman, J., Lease, M. and Horton, J. (2013), "The future of crowd work", Proceedings of the 2013 Conference on Computer Supported Cooperative Work – CSCW '13, ACM Press, New York, NY, p. 1301.
Kittur, A., Smus, B., Khamkar, S. and Kraut, R.E. (2011), "CrowdForge: crowdsourcing complex work", Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology – UIST '11, pp. 43-52.
Kulkarni, A.P., Can, M. and Hartmann, B. (2011), "Turkomatic", Proceedings of the 2011 Annual Conference Extended Abstracts on Human Factors in Computing Systems – CHI EA '11, p. 2053.
Lintott, C., Schawinski, K., Bamford, S., Slosar, A., Land, K., Thomas, D., Edmondson, E., Masters, K., Nichol, R.C. and Raddick, M.J. (2011), "Galaxy Zoo 1: data release of morphological classifications for nearly 900 000 galaxies", Monthly Notices of the Royal Astronomical Society, Vol. 410 No. 1, pp. 166-178.
Little, G., Chilton, L.B., Goldman, M. and Miller, R.C. (2009), "TurKit: tools for iterative tasks on Mechanical Turk", in Proceedings of the ACM SIGKDD Workshop on Human Computation, ACM, pp. 29-30.
Liu, X., Lu, M., Ooi, C., Shen, Y., Wu, S. and Zhang, M. (2012), "CDAS: a crowdsourcing data analytics system", Proceedings of the VLDB Endowment, Vol. 5 No. 10, pp. 1040-1051.
Loni, B., Hare, J., Georgescu, M., Riegler, M., Zhu, X., Morchid, M., Dufour, R. and Larson, M. (2014), "Getting by with a little help from the crowd: practical approaches to social image labeling", Proceedings of the 2014 International ACM Workshop on Crowdsourcing for Multimedia, pp. 69-74.
Malone, T.W., Laubacher, R. and Dellarocas, C. (2010), "The collective intelligence genome", IEEE Engineering Management Review, Vol. 38 No. 3.
Mao, A., Kamar, E., Chen, Y., Horvitz, E., Schwamb, M.E., Lintott, C.J. and Smith, A.M. (2013), "Volunteering versus work for pay: incentives and tradeoffs in crowdsourcing", First AAAI Conference on Human Computation and Crowdsourcing, pp. 94-102.
Otani, N., Baba, Y. and Kashima, H. (2016), "Quality control for crowdsourced hierarchical classification", Proceedings – IEEE International Conference on Data Mining, ICDM, Vol. 2016-January, pp. 937-942.
Parameswaran, A., Sarma, A.D., Garcia-Molina, H., Polyzotis, N. and Widom, J. (2011), "Human-assisted graph search: it's okay to ask questions", Proceedings of the VLDB Endowment, Vol. 4 No. 5, pp. 267-278.
Paulheim, H. and Bizer, C. (2014), "Improving the quality of linked data using statistical distributions", International Journal on Semantic Web and Information Systems, Vol. 10 No. 2, pp. 63-86.
Pukelsheim, F. (1994), "The three sigma rule", The American Statistician, Vol. 48 No. 2, pp. 88-91.
Redi, J. and Povoa, I. (2014), "Crowdsourcing for rating image aesthetic appeal: better a paid or a volunteer crowd?", Proceedings of the 2014 International ACM Workshop on Crowdsourcing for Multimedia – CrowdMM '14, pp. 25-30.
Rosenthal, S.L. and Dey, A.K. (2010), "Towards maximizing the accuracy of human-labeled sensor data", in Proceedings of the 15th International Conference on Intelligent User Interfaces – IUI '10, ACM Press, New York, NY, p. 259.
Shahaf, D. and Horvitz, E. (2010), "Generalized task markets for human and machine computation", in AAAI.
Sheshadri, A. and Lease, M. (2013), "SQUARE: a benchmark for research on computing crowd consensus", First AAAI Conference on Human Computation and Crowdsourcing, pp. 156-164.
Simpson, E., Roberts, S., Psorakis, I. and Smith, A. (2013), "Dynamic Bayesian combination of multiple imperfect classifiers", Studies in Computational Intelligence, Vol. 474, pp. 1-35.
Simpson, E., Roberts, S.J., Smith, A. and Lintott, C. (2011), "Bayesian combination of multiple, imperfect classifiers", in Proceedings of the 25th Conference on Neural Information Processing Systems, Granada.
Snow, R., O'Connor, B., Jurafsky, D. and Ng, A.Y. (2008), "Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks", Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 254-263.
Su, W., Wang, J. and Lochovsky, F. (2006), "Automatic hierarchical classification of structured deep web databases", Web Information Systems – WISE 2006, Springer, Berlin Heidelberg, pp. 210-221.
Tran-Thanh, L., Huynh, T.D. and Rosenfeld, A. (2015), "Crowdsourcing complex workflows under budget constraints", Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), pp. 1298-1304.
Vickrey, D., Bronzan, A., Choi, W., Kumar, A., Turner-Maier, J., Wang, A. and Koller, D. (2008), "Online word games for semantic data collection", Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 533-542.
Whitehill, J., Ruvolo, P., Wu, T., Bergsma, J. and Movellan, J. (2009), "Whose vote should count more: optimal integration of labels from labelers of unknown expertise", Advances in Neural Information Processing Systems, Vol. 22 No. 1, pp. 1-9.
Wiggins, A., Newman, G., Stevenson, R.D. and Crowston, K. (2011), "Mechanisms for data quality and validation in citizen science", e-Science Workshops (eScienceW), 2011 IEEE Seventh International Conference on, IEEE, pp. 14-19.
Willett, K.W., Lintott, C.J., Bamford, S.P., Masters, K.L., Simmons, B.D., Casteels, K.R.V., Edmondson, E.M., Fortson, L.F., Kaviraj, S., Keel, W.C., Melvin, T., Nichol, R.C., Raddick, M.J., Schawinski, K., Simpson, R.J., Skibba, R.A., Smith, A.M. and Thomas, D. (2013), "Galaxy Zoo 2: detailed morphological classifications for 304 122 galaxies from the Sloan Digital Sky Survey", Monthly Notices of the Royal Astronomical Society, Vol. 435 No. 4, pp. 2835-2860.
Wu, X., Fan, W. and Yu, Y. (2012), "Sembler: ensembling crowd sequential labeling for improved quality", Proceedings of the National Conference on Artificial Intelligence, Vol. 2, pp. 1713-1719.
Yang, J., Redi, J., Demartini, G. and Bozzon, A. (2016), "Modeling task complexity in crowdsourcing".
Zaveri, A., Rula, A., Maurino, A., Pietrobon, R., Lehmann, J., Auer, S. and Hitzler, P. (2013), "Quality assessment methodologies for linked open data", Semantic Web.
Zhang, J., Sheng, V.S., Li, Q., Wu, J. and Wu, X. (2017a), "Consensus algorithms for biased labeling in crowdsourcing", Information Sciences, Vol. 382-383, pp. 254-273.
Zheng, Y., Li, G., Li, Y., Shan, C. and Cheng, R. (2017b), "Truth inference in crowdsourcing: is the problem solved?", Proceedings of the VLDB Endowment, Vol. 10 No. 5.

Further reading
Wang, J., Ipeirotis, P.G. and Provost, F. (2015), "Cost-effective quality assurance in crowd labeling".
Yoram, B., Tom, M. and John, G. (2012), "How to grade a test without knowing the answers – a Bayesian graphical model for adaptive crowdsourcing and aptitude testing".
Corresponding author
Qiong Bu can be contacted at: qb1g13@soton.ac.uk

However, none of these steps is actually easy: some problems are less classification amenable to microtasking and need to be turned into bespoke microtask workflows tasks (Bernstein et al.,2010; Kulkarni et al.,2011; Kittur et al., 2011); the performance of the crowd varies across tasks (Mao et al.,2013; Redi and Povoa, 2014); and determining which answers are the most useful ones can be both complex and computationally expensive (Kittur et al., 2008; Snow et al.,2008; Vickrey et al., 2008; Demartini et al., 2012; Wiggins et al., 2011).Itis on this last aspect, determining the correct answers, that we focus on in this paper. The aggregation method proposed in this paper is able to infer the correct answer for a range of tasks involving either single-step or multiple-step classifications when gold answers are not available. It also serves as a proxy to help task requesters to assess the quality of the crowdsourced results when they already have some gold answers, such as piloting specific multiple-step task design before putting it online for a larger scale. Quality assessment in microtask crowdsourcing refers to the evaluation of quality of the workers’ work. First, quality can be assessed based on different criteria, as it has many dimensions (Kahn et al.,2002; Batini et al., 2009). Under the crowdsourcing context, it depends on the type of the data, which is decided by the task type (Malone et al.,2010; Gadiraju et al.,2014, 2015). The most common quality metric we have seen is to calculate the accuracy (Bernstein et al., 2010; Gelas et al., 2011; Hung et al.,2013; Zhang et al., 2017a, 2017b) with available gold standards. However, in lots of the cases the gold standard is not available. This is where different inference algorithms come into picture, which helps to infer or predict the correct (gold) answer. Second, quality assessment can be done either on the fly(Ipeirotis et al., 2014) during the task running that can be used to optimise task assignment hence reduce cost, or in the post aggregation (Whitehill et al.,2009; Ipeirotis et al., 2010; Bachrach et al.,2012; Difallah et al., 2015a) to assess the overall quality of the classification. This work focus on aggregating the result after the crowdsourcing task has been completed, so that accuracy can be calculated based on the gold standards we have. There are many different types of tasks where microtask crowdsourcing are applied (Eickhoff and de Vries, 2011; Difallah et al., 2015b; Yang et al.,2016; Zheng et al., 2017a).We focus on inferring the correct answer for a classification task which is one of the most popular type of crowdsourcing tasks. We are by no means the first to do so; previous research has proposed a range of methods to infer and predict the quality of crowd answers (Bachrach et al.,2012; Dawid and Skene, 1979; Difallah et al., 2015a; Hare et al.,2013; Ipeirotis et al.,2010; Karger et al.,2011; Loni et al., 2014; Paulheim and Bizer, 2014; Hung et al., 2013; Rosenthal and Dey, 2010; Simpson et al.,2013; Whitehill et al., 2009). Whilst all methods have their benefits, they work on relatively simple task models that consist of single questions with one or more answers (Sheshadri and Lease, 2013; Hung et al.,2013; Zhang et al.,2017a; Zheng et al., 2017b). The scenario we are targeting is different. We take a close look at existing classification tasks from Zooniverse, and notice a large percentage of these tasks are multiple-step tasks, as shown in Figure 1. 
In fact, in a random sampling of 20 tasks, only 20 per cent has a single question. Consider the example in Figure 2, which is taken from a labelled citizen science project in which pictures taken in the Serengeti national park in Tanzania are analysed online by thousands of volunteers[1]. The crowd is asked to answer a series of related, independent questions about what they see in the image, including the types and number of animals. Our work is motivated by a range of online crowd science classification projects. Each of them uses a slightly different type of task to classify an object, for example, an image, according to a number of criteria. For a relatively complex task, it is split into several steps, typically in the form of multiple-choice answers. Sometimes there are dependencies between IJCS 3,3 Figure 1. Classification tasks from Zooniverse Figure 2. Example classification paths collected from 20 workers for a given photo steps as the answer chosen for one questions prompts other questions to be displayed. For instance, in the Cities at Night project, which uses microtask crowdsourcing to analyse night-time photographs taken by astronauts onboard the ISS[2], seven different Options are provided for the first question to identify what the given image contains, a city, stars, aurora, astronaut, black image, no photo or none of these, and only when “city” is identified, two more independent questions will be asked to classify cloudiness (three Options: cloudy, someclouds, clear) and sharpness (two Options: sharp, blurry). In the GalaxyZoo[3] project, several different questions were asked in sequence depending on the answers to previous questions, and questions and answers are arranged in a decision tree. It has a more complex workflow in which more questions are involved, and questions vary based on what has been chosen in previous classification step. For instance, the first question is “Is the galaxy Crowdsourced simply smooth and rounded, with no sign of a disk?” and three options are provided: classification “Smooth”, “Features or disk”, and “Star or artifact”. When choosing “Smooth”, a new tasks question will be asked “How rounded is it?” and available options are “Completely round”, “In between” and “Cigar shaped”.If “Features or disk” is chosen as the answer to the first question, a different set of subsequent questions will be asked. Other times, workflows are rather sequences of independent, though related questions, such as what we see in Snapshot[1](Figure 2). Determining the correct answer for such complex classification task can be tricky and has not been fully studied yet. Existing research also does not investigate how inference methods could affect the classification accuracy when using different crowd types for complex classification tasks. As a result, there is the need to understand whether different algorithms and aggregation strategies are required for different crowd contexts. To tackle the issue of determining the correct answer from crowd produced annotations for the classification task with multiple questions, we model the problem of complex classification tasks that span over multiple, related questions as a graph. To the best of our knowledge, we are the first to propose using the structure of a microtask crowdsourcing workflow as an additional feature to support inference algorithms in making decisions about correct labels, using output data produced by the crowd. 
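To make the workflow structure concrete, the sketch below shows one way such a multiple-step workflow could be encoded in Python, using the Dark Skies questions and options described above. The dictionary layout, the question identifiers (Q1, cloudiness, sharpness) and the helper function are illustrative assumptions of ours, not part of the projects' actual implementations.

# A minimal sketch (not the projects' implementation) of how a multiple-step
# classification workflow such as Dark Skies could be encoded in Python.
DARK_SKIES_WORKFLOW = {
    "Q1": {
        "question": "What does the image contain?",
        "options": ["city", "stars", "aurora", "astronaut",
                    "black image", "no photo", "none of these"],
        # only "city" leads to follow-up questions; any other option ends the path
        "next": {"city": ["cloudiness", "sharpness"]},
    },
    "cloudiness": {
        "question": "How cloudy is the image?",
        "options": ["cloudy", "someclouds", "clear"],
        "next": {},
    },
    "sharpness": {
        "question": "How sharp is the image?",
        "options": ["sharp", "blurry"],
        "next": {},
    },
}

def questions_for_path(workflow, answers, start="Q1"):
    """Return the ordered list of questions a worker visits for a chosen answer path."""
    visited, queue = [], [start]
    for answer in answers:
        if not queue:
            break
        node = queue.pop(0)
        visited.append(node)
        queue = workflow[node]["next"].get(answer, []) + queue
    return visited

if __name__ == "__main__":
    # A worker answering "city", then "clear", then "sharp" visits all three questions.
    print(questions_for_path(DARK_SKIES_WORKFLOW, ["city", "clear", "sharp"]))
    # A worker answering "stars" stops after the first question.
    print(questions_for_path(DARK_SKIES_WORKFLOW, ["stars"]))

The same structure can be drawn for Snapshot Serengeti or GalaxyZoo by adding more question nodes and branch-specific follow-ups.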
We look at three inference algorithms (majority voting [MV] [Paulheim and Bizer, 2014; Hung et al.,2013], message passing [MP] [Karger et al., 2011] and expectation maximisation [EM] [Dawid and Skene, 1979; Whitehill et al., 2009]), which have been commonly used in answer inference in microtask crowdsourcing previously. We adapt these algorithms to work on the graph modelled from crowdsourcing tasks with multiple steps. We perform a large-scale evaluation of the performance of these algorithms on six data sets across two crowd contexts from three image classification tasks: Darkskies[2], GalaxyZoo[3] and Snapshot Serrengeti[1]. The rationale behind choosing data sets from both volunteer and paid crowd context is that algorithms may perform differently in these contexts. The experiments show that our aggregation strategy achieves significantly better performance than the current approach of naively applying individual algorithms on each node level. The result also indicates that MV, despite its simplicity, compares well with more sophisticated approaches that consider additional factors such as user performance and hence need more computation time. Sophisticated algorithms such as expectation maximisation, however, can complement MV for relatively complex tasks. We also prove that each algorithm obtains better inference accuracy in the volunteer context compared to paid crowdsourcing context. This rest of this paper is structured as follows: Section 2 provides the foundations of existing algorithms which we have adapted to handle answer inference in classification tasks with multiple questions, and illustrate how this aggregation fits in the quality assessment process. In Section 3, we explain our graph model and notations used in the graph, formalise the classification problem, and elaborate our aggregation approach. In Section 4, we perform large-scale evaluation and demonstrate the performance of different algorithms. Section 5 discusses our findings. Section 6 reviews existing work which has inspired our research, and Section 7 summarises our result and future work. 2. Foundations A classification task generally has one single question and a few options to choose from, such as the one shown in Figure 3. It looks like a simple tree structure where the classification starts with a root node which refers to the object to be classified and has a few branches which represent the available options. In this section, we present three existing algorithms, MV, MP and EM,that have been used in inferring the true label for a single-step multiple-choice classification task. These are the foundations to understand our proposed adapted approach. Notations used in IJCS elaborating these algorithms are defined in Table I. For the sake of explaining the individual 3,3 algorithms and our method, we use following notations throughout this paper. 2.1 Majority voting Due to its simplicity, MV has been used in many microtask projects (Hung et al.,2015; Liu et al., 2012) and is the standard aggregation method in some existing crowdsourcing platforms[4]. Given the list of options for a labelling task and an object, the MV algorithm chooses those options with the highest number of votes from the crowd. Formally, it takes as input an object o and the crowd labels L and outputs the resulting candidate label l that received the most votes from the users. 
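The formal listing, Algorithm 1, follows; as a runnable companion, the snippet below sketches the same per-question majority-voting step in Python. The function name and the example labels are our own.

from collections import Counter

def majority_vote(labels):
    """Plain majority voting for one object and one question.

    `labels` is the multiset of crowd answers for that object (L_o in the
    paper's notation); the option with the most votes is returned, ties
    broken by the first-seen label.
    """
    if not labels:
        return None
    counts = Counter(labels)
    return counts.most_common(1)[0][0]

# Example: 20 workers classify one Snapshot Serengeti photo.
crowd_labels = ["wildebeest"] * 11 + ["buffalo"] * 6 + ["zebra"] * 3
print(majority_vote(crowd_labels))  # -> "wildebeest"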
Algorithm 1 MV 1: procedure FINDUNIQUELABEL(L ) u u 2: L fl g, where L  A and l 2 A and u# U ; unique unique o o o 3: l =““; 4: num ¼ 0; max 5: for i2jL j do unique 6: if count l  num then ðÞ max unique i 7: num count l ; max ðÞ unique i 8: l l ; o ðÞ unique i 9: return l ; Figure 3. Representation of a task with a single question Notation Definition o The current object being classified O The set of all objects in a data set A All available options u User u U The set of all users who contributed to the current data set U All users who have classified object o L All labels received from the crowd, and L  A L The set of all labels from the crowd for object o L The set of all labels from user u Table I. l The label for object o from user u Notations l The inferred label for object o o 2.2 Expectation maximisation Crowdsourced EM is another algorithm that has been widely used and involves two steps to infer the true classification label for a given object. In the first step, the true label for the current object is estimated tasks using simple MV, where the input of all users is considered equally. Then, in the next step, the error rate of each user is estimated based on this result and used in turn to calculate the new estimation for the first step. The steps are alternating iteratively until the algorithm converges and a maximum is found. It takes as input an object o and all labels L. It starts by estimating the true label for each object and each user’s error rate by comparing their answers (using an indicator function I() to check whether the user classifies object to a certain category/class) for all objects they have looked at. The error rate is used subsequently to update the confusion matrix for each user. The output is candidate labels for o with the probability (indicated by p) of the corresponding candidate label to be correct. Algorithm 2 EM 1: procedure INITIALISE(p ) ðÞ 2: p count l jL j ⊳ probability of l being the true label for l o object o (l2 A); 3: while not converged do 4: Estimate error rate for user u u u 5: u l þ p Il ¼ l ll ll o o2L 6: Estimate confusion matrix: u u 7: e u  u ⊳ q is the accuracy of user u ll ll lq 8: Estimate class priors: 9: pr p jOj 10: Calculate class probability for object o: YY X Y j u u u ðÞ ðÞ 11: p pr e Il ¼ m  pr e Il ¼ m l l q am qm u2U m q m 12: l =““; 13: p ¼ 0; max 14: for l 2 A do 15: if p  p then l max 16: p p ; max l 17: l l; 18: return l ; 2.3 Message passing MP is an algorithm that takes into account both the labels and the performance of the users. MP constructs object and user-specific messages to represent the reliability of the particular user, and iteratively updates the object and the user messages. More specifically, at each object update, it adds up more weight to labels that come from more trustworthy parts of the crowd, and at each user update, it adds more trust (a confidence value) to the user if the labels they give for other objects are in line with the current estimates of object labels. The iterative updates continue until the algorithm converges or a specified threshold is hit. The threshold for the stopping condition is a parameter that has to be empirically determined. It takes as input an object o,alabel a2 A,all IJCS labels received from the crowd L and a threshold k . MP computes the object message by max 3,3 firstly iterating all previous labels from the users who have been assigned the object o and then looking at whether each label is the same as the given one. 
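Before continuing with the message-passing details, the following is a compact, runnable sketch of the EM procedure of Section 2.2 in the spirit of Dawid and Skene (1979): class estimates are initialised from vote proportions, then per-user confusion matrices, class priors and per-object class probabilities are re-estimated in alternation. Variable names, the smoothing constants and the fixed iteration count are our own simplifications rather than the authors' exact implementation.

import numpy as np

def dawid_skene_em(labels, n_classes, n_iter=50):
    """EM in the spirit of Dawid and Skene (1979) for one multiple-choice question.

    `labels` is a list of (object_id, user_id, class_index) votes.
    Returns the object ids and an (n_objects, n_classes) array of class probabilities.
    """
    objects = sorted({o for o, _, _ in labels})
    users = sorted({u for _, u, _ in labels})
    o_idx = {o: i for i, o in enumerate(objects)}
    u_idx = {u: i for i, u in enumerate(users)}

    # Initialisation: per-object class estimates from vote proportions (soft MV).
    T = np.zeros((len(objects), n_classes))
    for o, _, c in labels:
        T[o_idx[o], c] += 1.0
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and one confusion matrix per user (with smoothing).
        priors = T.mean(axis=0) + 1e-9
        priors /= priors.sum()
        conf = np.full((len(users), n_classes, n_classes), 1e-6)
        for o, u, c in labels:
            conf[u_idx[u], :, c] += T[o_idx[o]]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: recompute class probabilities for every object.
        logT = np.tile(np.log(priors), (len(objects), 1))
        for o, u, c in labels:
            logT[o_idx[o]] += np.log(conf[u_idx[u], :, c])
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)
    return objects, T

# Example: three workers label two objects with classes 0/1.
votes = [("o1", "u1", 0), ("o1", "u2", 0), ("o1", "u3", 1),
         ("o2", "u1", 1), ("o2", "u2", 1), ("o2", "u3", 1)]
objs, probs = dawid_skene_em(votes, n_classes=2)
print(dict(zip(objs, probs.round(3))))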
In a next step, it uses the object message x (2 L) to update the user message y (2 L), whichiscomputed byiterating o>u u>o over the labels they have submitted. Until convergence, the object message for object o is aggregated by weighing the user messages (confidence) for that object and the computed sign is stored in E . MP outputs the candidate label l for o and the sign of whether the label applies or ou not. A detailed description of the algorithm can be found in Karger et al.’s (2011) study. Whilst providing accurate estimations, MP is also known for its high computational costs as the number of labels and users increase. Algorithm 3 MP 1: procedure INITIALISATION(y ) u>u 2: for (o, u)2 L do 3: Initialise y (NðÞ 1; 1 ); u>o 4: procedure ITERATION(k ) max 5: for k2f1; .. . ; k g do max 6: for (o, u)2 L do k k1 7: x E  y (u 6¼u) ou o>u u>o u 2U 8: for (o, u)2 L do k k 9: y / E  x (o 6¼o) o u u>o o>u o 2O k 1 max 10: x / E  y o u2U ou u>o 11: if sign(x )==1 then 12: l ¼ x o o 13: return l 2.4 Quality assessment In the microtask crowdsourcing context, achieving a good quality result is one of the major goals, and when we talk about quality, it generally means the quality of the data collected from the crowd. For the classification microtasks, existing work in quality assessment mostly use the accuracy metric (Khattak and Salleb-Aouissi, 2011; Hung et al., 2013; Zhang et al., 2017a). Some research also uses precision/recall (Hung et al.,2015; Zhang et al., 2017) or F1 score (Zheng et al., 2017a), while other work use ROC (Zheng et al.,2017b) or RMSE (Bachrach et al., 2012). For classification, the quality of the result refers to how good the overall collected classifications are, which is a data-value centric dimension to reflect how accurate the classifications are. In this work, if not specially specified, when referring to quality of the input/answer/data/result, it means Accuracy –“The degree to which data values correctly represent the real-world facts” (Zaveri et al., 2013);definition in science (JCGM, 2008) as “closeness of agreement between a measured quantity value and a true quantity value of a measurand”. We can look at individual crowd worker’s work to evaluate whether its work is of good quality, or we can look at the overall result from all the workers to see how accurate they classify the given objects. The later one which involves aggregating the input from different crowd workers in a multiple-step classification task is the focus of this paper. In the crowdsourcing context, the ground truth is not usually available. To assess the Crowdsourced quality of the result, we need to understand what algorithms or mechanisms can be used to classification infer or predict the correct answer based on all the input from the crowd workers. tasks Correspondingly, each existing different algorithm has been studied by researchers and evaluated its performance in various contexts (Section 6.2). This work mainly takes a look at three popular existing algorithms elaborated above and investigates how the adaptation of these algorithms can be used for aggregating the crowdsourced data and help to assess the quality of the classification result. The whole process, in a nutshell, includes three major phases, data collection (microtask design and task execution) from the crowd which is available to this study, aggregation to infer the correct answer/label, and evaluation of the quality (in this work is the Accuracy metric) by comparing the inferred result to the gold standards we have. 
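As a companion to Algorithm 3, here is a small one-vs-rest sketch of the Karger et al. (2011) message-passing updates described above: object messages aggregate trust-weighted votes from other users, user messages accumulate agreement with the current object estimates, and the sign of the final aggregated score indicates whether the candidate label applies. The one-vs-rest reduction, the fixed number of iterations and all names are our own illustrative choices, not the authors' implementation.

import numpy as np

def message_passing_scores(labels, option, k_max=10, seed=0):
    """Karger-et-al.-style message passing, run one-vs-rest for one candidate option.

    `labels` is a list of (object_id, user_id, chosen_option).  E[o, u] is +1 if
    user u chose `option` for object o and -1 otherwise.  Returns a dict mapping
    object_id -> aggregated score x_o; a positive sign suggests `option` applies.
    """
    rng = np.random.default_rng(seed)
    edges = [(o, u, 1.0 if c == option else -1.0) for o, u, c in labels]
    y = {(o, u): rng.normal(1.0, 1.0) for o, u, _ in edges}   # user -> object messages
    x = {(o, u): 0.0 for o, u, _ in edges}                    # object -> user messages

    for _ in range(k_max):
        for o, u, _ in edges:   # object update: trust-weighted votes from other users
            x[(o, u)] = sum(e * y[(o, u2)] for o2, u2, e in edges if o2 == o and u2 != u)
        for o, u, _ in edges:   # user update: agreement with current object estimates
            y[(o, u)] = sum(e * x[(o2, u)] for o2, u2, e in edges if u2 == u and o2 != o)

    scores = {}
    for o, u, e in edges:       # final aggregation over all users of each object
        scores[o] = scores.get(o, 0.0) + e * y[(o, u)]
    return scores

# Example: the score for "city" on each photo, given three workers' answers.
votes = [("p1", "u1", "city"), ("p1", "u2", "city"), ("p1", "u3", "stars"),
         ("p2", "u1", "stars"), ("p2", "u2", "stars"), ("p2", "u3", "stars")]
print(message_passing_scores(votes, option="city"))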
This research focuses on the aggregation and evaluates the accuracy accordingly. 3. Our approach In this section, we first illustrate the range of classification tasks we address via a set of examples: classification tasks with a single question and multiple-questions. We then introduce a set of notations and formalise the classification problem as a path searching problem in a graph. Following that, we present our aggregation method by illustrating how existing established algorithms can be adapted to handle more complex cases. 3.1 Multi-level workflow model and problem formalisation A classification task, as shown in Figure 3, is generally considered as a simple task as it contains only one question. A relatively complex task normally involves more than one question and hence more options. It will be more like a tree with branches which has further branches and leaves. If we draw such a ‘tree’ for the three tasks we are exploring in this paper, we can see each of them uses a different type of workflow consisting of several independent/interdependent steps. Each step in the workflow is associated with a Question to classify an object according to a criterion. To answer the question the crowd needs to choose among a set of Options. Figure 4 involves minimum one step and maximum three steps for the classification task. Figure 5 has Figure 4. Representation of dark skies workflow from cities at night IJCS 3,3 Figure 5. Representation of snapshot Sergenti workflow from Zooniverse Figure 6. Representation of GalaxyZoo workflow from Zooniverse a fixed two steps to complete a classification task and each step has more than ten Crowdsourced options. For the GalaxyZoo[5] task, it can involve minimum one step and a maximum classification of nine steps to complete a classification, as shown in Figure 6. It is notable that these tasks different tasks do present a tree-like structure each of which has a number of questions,with various number of available options, however, there are indeed cases where some nodes have more than one parent node which means it can not be considered as a tree. As a result, the workflow can be modelled as a directed acyclic graph (DAG), where the root node is the object under consideration and all other nodes are classification options. Each node can be reached via multiple paths from the root, which prompts the first question of the workflow[6]. For a given object o, the crowd is asked to carry out a labelling task, which implies answering a series of (independent or dependent) classification questions with a set of labels which identify the outstanding features of the object being classified. We define this task as a path search problem in a workflow W modelled as directed acyclic graph (DAG) with a root entry point and levels (similar to tree levels, representing the number of questions in the task), each corresponding to a set of options as depicted in Figure 7.Each node in such a graph represents a particular labelling option. The labelling finishes when a leaf in the graph is reached, that is a label that does not lead to any further questions. In our definition, the level corresponds to classification question(s) and the level of a node is serialised and counted at the lowest level. We use level exchangeably with depth of a node which is indicated by the number of edges from the node to the root node. A directed edge represents a label chosen for the corresponding question related with that node level. Table II has a summary of the definitionsweuse. 
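To make the path-search formulation concrete, the snippet below enumerates the valid label paths of a small workflow graph, given the options at each node level and the allowed transitions between consecutive levels. The exhaustive enumeration and the toy edge set (restricted to the "Smooth" branch of the GalaxyZoo example) are illustrative assumptions; real workflows may also have paths of different lengths, which this sketch does not handle.

from itertools import product

def valid_label_paths(options_per_level, allowed_edges):
    """Enumerate the valid label paths of a workflow graph.

    `options_per_level` is a list of option lists, one per node level;
    `allowed_edges` is a set of (label_at_level_n, label_at_level_n+1)
    pairs encoding which consecutive choices the workflow permits.
    """
    paths = []
    for combo in product(*options_per_level):
        if all((a, b) in allowed_edges for a, b in zip(combo, combo[1:])):
            paths.append(combo)
    return paths

# Toy two-level example based on the GalaxyZoo questions quoted earlier:
# only "smooth" leads to the roundedness question.
levels = [["smooth", "features or disk", "star or artifact"],
          ["completely round", "in between", "cigar shaped"]]
edges = {("smooth", "completely round"), ("smooth", "in between"), ("smooth", "cigar shaped")}
for p in valid_label_paths(levels, edges):
    print(p)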
On top of the notations we defined in Section 2, we also define the notations which are specific to our workflow graph model in Table III. The problem we are solving in the paper can be defined as follows: Figure 7. Graph representation of an example classification workflow W vs the corresponding classic way of looking at the classification with multiple questions IJCS Term Definition 3,3 Task A general term referring to an action or a series of action need to be executed Classification Task classifying objects into given categories, it could be a simple task (one question) or a task relatively complex task (more than one question) Microtask A task is decomposed into smaller unit making it easier for the crowd. One microtask is equivalent to one question in classification task Workflow Microtasks are arranged/chained in a way to automatically complete the task Question Classification task asked of the user to elicit/assign a label to an attribute of the object to be classified Option The set of possible labels Chosen option An option user chooses per question Correct label The correct label for a question Chosen path A user chooses a set of labels for entire workflow Correct path The correct set of labels for entire workflow Workflow The workflow can be modelled as a directed acyclic graph (DAG), in which the root node graph represents the object under consideration and all other nodes are classification options Table II. Node A representation of an option in our model Definitions Node level The sequence that the question is presented to the user within a workflow Notation Definition W Represents the graph based on the workflow of classifying object o, it has node levels to indicate the questions to classify the corresponding attributes of the given object, and nodes to represent the options available for each attribute A Represents the available options at node level n (n) a Represents the individual option at node level n, where j2f1; .. . ;jA jg njðÞ ðÞ n l Represents the label chosen by user u at node level n for object o. Thus, the labelling result ðÞ n 1 1 1 (l ; l ,.. ., l ) will represent the ordered list of nodes (the traversal path) visited by user 1 oðÞ oðÞ o 1 2 ðÞ n when classifying o, which is called as a label path L The label path chosen by user u for object o L All labels for object o at node level n oðÞ n L Unique labels for object o at node level n, L  A o ðÞ o ðÞ ðÞ n ðÞ n unique ðÞ n unique Table III. ~ L Represents the inferred label path for object o. It is a set of inferred labels for each node level ~ ~ Notations specificto described as (l ; ... ; l ) o o ðÞ ðÞ n our model L True label path for object o gold Definition 3.1 The Correct Labelling Problem: Given a particular object o, a workflow-based graph W , a set of labels L for object o, and f o (optionally) a set of previous labels from all users on all objects L, our aim is to infer the correct label path L in W for object o. o f 3.2 Adapted aggregation In the classic approaches, it does not look at the dependency between node levels hence naively putting inferred result from each node level together does not guarantee a valid result. It is obvious that producing a valid path with possible choices should improve the accuracy of the users. As such, a basic adaptation of the classic algorithms should show some improvement over multiple level workflows. We show such a basic adaptation in Algorithm 4. 
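Before the formal listing of Algorithm 4 (next), the following minimal Python sketch conveys the same two ideas: per-level inference followed by assembly of the most probable valid path. It uses plain vote shares as the per-level scores and is our own simplified illustration, not the authors' exact algorithm.

from collections import Counter
from itertools import product

def level_candidates(level_labels):
    """Per-level vote shares: {label: fraction of votes} for one object at one node level."""
    counts = Counter(level_labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def most_probable_valid_path(labels_by_level, is_valid_path):
    """Assemble the highest-scoring valid label path from per-level candidates.

    `labels_by_level` is a list (one entry per node level) of the raw crowd labels
    for the object at that level; `is_valid_path` is a predicate over a full
    candidate path, encoding the workflow graph.  Scores multiply across levels.
    """
    candidates = [level_candidates(lbls) for lbls in labels_by_level]
    best_path, best_score = None, -1.0
    for combo in product(*[c.items() for c in candidates]):
        path = tuple(label for label, _ in combo)
        if not is_valid_path(path):
            continue
        score = 1.0
        for _, share in combo:
            score *= share
        if score > best_score:
            best_path, best_score = path, score
    return best_path, best_score

# Toy example: level-1 votes favour "city", level-2 votes favour "clear";
# the (illustrative) workflow only allows a level-2 answer when level 1 is "city".
labels_by_level = [["city", "city", "stars", "city"], ["clear", "cloudy", "clear", "clear"]]
valid = lambda path: path[0] == "city"
print(most_probable_valid_path(labels_by_level, valid))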
Algorithm 4 Our Adapted Approach Crowdsourced 1: procedure PREDICT_BY_NODELEVEL(L ) classification 2: num_levels ¼ n; tasks 3: for level 2 range(n) do 4: if method ¼¼ mv then 5: procedure FINDUNIQUELABEL(L ) u u 6: L fl g, where L  A and l 2 A and u# U ; unique unique o o 7: for l 2 L do 233 unique ðÞ 8: p count l jLj ⊳ percentage of l being voted as the l i label for object o; 9: return LC fðÞ l; p g ⊳ list of candidate labels and their n l percentage for o; 10: if method ¼¼ em then 11: procedure INITIALISE(p ) ðÞ 12: p count l jL j ⊳ percentage of l being the true label l o for object o (l2 A); 13: while not converged do 14: Estimate error rate for user u: u u u 15: u l þ p Il ¼ l ll ll o o2L 16: Estimate confusion matrix: u u u 17: e u  u ⊳ q is the accuracy of user u ll ll lq 18: Estimate class priors: 19: pr p jOj 20: Calculate class probability for object o: YY X Y j u u u ðÞ ðÞ 21: p pr e Il ¼ m  pr e Il ¼ m l l q am qm m q m u2U ðÞ 22: return LC f l; p g ⊳ list of label candidates and n l corresponding probability for o; 23: if method ¼¼ mp then 24: procedure INITIALISATION(y ) u>u 25: for (o, u)2 L do 26: Initialise y (NðÞ 1; 1 ); u>o 27: procedure ITERATION(k ) max 28: for k2f1; .. . ; k g do max 29: for (o, u)2 L do k k1 30: x / E  y ðu 6¼u); ou o>u u>o u 2U 31: for (o, u)2 L do k k 32: y / E  x ðo 6¼o); o u u>o o>u o 2O k 1 max 33: x / E  y o ou u>o u2U 34: if sign(x )==1 then 35: LC :appendðÞ ðÞ x ; 1:0 n o 36: procedure ASSEMBLE_MOSTPOSSIBLEPATH(L ) IJCS o 37: num_levels ¼ n; 3,3 38: LC¼fg; 39: for z 2 LC do 1 1 40: for z 2 LC do 2 2 41: .. . 42: for z 2 LC do n n 43: LC:append z ; z ; .. . ; z ; p  p .. . p ; ðÞðÞ 1 2 n z z z 1 2 n 44: L =; 45: p =0; max 46: for Z 2 LC do 47: if p  p then Z max 48: p p ; max Z 49: L Z; 50: return L ; Our adapted approach assumes that labels at different levels in the workflow are independent, then assemble the label path from each node level based on the workflow graph. In the adapted approach, not only we reward partially correct answers from the crowd by applying each of the algorithms at each node level in the graph and compute scores for each individual labels, but also we consider the valid path when inferring the correct path. We also specially choose two algorithms that take into account the performance of the crowd in their computations, EM and MP.The EM algorithm sums up all node probabilities along each path to determine the ranking score. The MP algorithm returns true if that particular label at the node level is relevant or false otherwise. This means that we assign the score for the candidate paths correspondingly either as 1.0 or 0.0. By studying it, we want to allow MP and EM to be able to better identify those users who, while not doing so well overall, are very skilled at a particular sub-task (question) in the workflow. 4. Evaluation To evaluate the three algorithms and our adapted approach, we compare the classic approach where algorithms are applied on each node level and simply put together (we call it “naive- approach” here) with our “adapted-approach” which uses classic approach while strives to infer a valid correct path by considering the workflow graph. Thus, we have six different approaches: mv_adapted, mv_naive, mp_adapted, mp_naive, em_adapted, em_naive. Each inference algorithm was applied to six data sets with different microtask crowdsourcing workflows. We start with the evaluation setup of the data in Section 4.1 and the evaluation metrics in Section 4.2. 
Then we present the evaluation of inferred result in Section 4.3. 4.1 Data First, we used three existing data sets. The first one is from the Snapshot Serengeti[1] project and consists of all crowd classifications within the time span from 10 December 2012 until 17 July 2013. It contains 7,800,896 labels from 890,280 volunteers for a total of 66,892 objects. For our evaluation, we used a gold standard with curated labels for 4,149 objects, which was created by professional scientists working on the Snapshot Serengeti project. To evaluate our approach we took all labels received from the crowd for the 4,149 objects which contains 112,027 labels submitted by 8, 304 volunteers. The second data set is from the Dark Skies app within the Cities at Night[2] project. It consists of 1,275,354 classifications by 19,818 volunteers submitted in a time span from April 27th, 2014 until December 5th, 2016. Crowdsourced The gold standard consisted of 200 objects whose labels were manually validated by the classification science team in Cities at Night. These 200 objects received 1,341 labels from 692 users from tasks CrowdCrafting[7]. The third one is from the GalaxyZoo[3] project where we randomly choose 500 objects consisting classifications from 16 February 2009 to 21 May 2009. The workflows for the three data sets are depicted in Figures 4, 5 and 6, respectively. To explore the effects of volunteers/paid context on the results, the tasks are also setup on paid crowdsourcing platform to mimic the tasks done by volunteers. 4.2 Metric To measure the performance of our aggregation approach, we employ the Accuracy metric which has been commonly used in classification evaluation in previous work (Khattak and Salleb- Aouissi, 2011; Kamar et al., 2012; Sheshadri and Lease, 2013; Hung et al.,2013; Zhang et al., 2017a; Zheng et al.,2017b). Accuracy is a measure allowing us to understand the percentage of correct answers (inferred by algorithms). The accuracy is defined as the percentage of objects that have been correctly inferred. Higher accuracy indicates better performance. jOj Bernoulli L ¼¼ L gold Accuracy ¼ jOj The above equation is by default for calculating the accuracy for the inferred label path. Bernoulli L ¼¼ L indicates the outcome (either 0 or 1) of comparing gold category gold o with the category predicted by different predictor. As we use the adapted node-level based implementation, it makes sense to also evaluate how accurate the inferred label is on each node level. In such context, L ½ n represents the ground truth for object o at node level n gold and L ½ n represents the inferred true label at node level n. Hence, the accuracy at node level n for the top answer can be calculated by: jOj Bernoulli L ½ n ¼¼ L ½ n gold Accuracy ¼ _level jOj To understand whether our adapted approach is significantly better, we will also run significant testing for all algorithms chosen. We will use standard 5 per cent significance level. For each data set, we will randomly select 100 objects and select 50 times. The accuracy for each selection is calculated for MV, MP and EM for both naive and adapted approach. We will use the function scipy.stats.ttest_ind from Python[8] to perform the two-sided test for naive and adapted samples in all six cases (three workflows, each has two contexts: volunteer and paid). 4.3 Results Table IV shows the accuracy of each algorithm on each data set for the inferred answer. 
4.3 Results
Table IV shows the accuracy of each algorithm on each data set for the inferred answer. Considering the overall classification accuracy (by path), our adapted methods perform better than the naive approach in both the volunteer and the paid crowd context; at the same time, each algorithm generally has higher accuracy in the volunteer context than with the paid crowd. Note that, for the paid crowd context, the best accuracy achieved increases as the depth of the workflow increases: Serengeti, with two questions, achieves 45.9 per cent; Darkskies, with three questions, achieves 53.0 per cent; and GalaxyZoo, with a maximum of nine questions, achieves 57.9 per cent. A similar pattern is not observed for the volunteer context.

Table IV. Accuracy (by path) of each algorithm

Data set    Graph depth/size         Crowd type  mv_naive  mv_adapted  em_naive  em_adapted  mp_naive  mp_adapted
serengeti   54-11                    volunteer   0.590     0.776       0.572     0.655       0.755     0.755
serengeti   54-11                    paid        0.299     0.459       0.244     0.337       0.083     0.207
darkskies   8-3-2                    volunteer   0.690     0.785       0.040     0.450       0.340     0.495
darkskies   8-3-2                    paid        0.405     0.530       0.020     0.385       0.335     0.305
galaxyzoo   3-3-2-3-2-2-3-6-4-2-7    volunteer   0.554     0.631       0.470     0.564       0.002     0.562
galaxyzoo   3-3-2-3-2-2-3-6-4-2-7    paid        0.371     0.579       0.000     0.331       0.002     0.367

Figure 8. Accuracy by node level (Serengeti)
Figure 9. Accuracy by node level (Darkskies)
Figure 10. Accuracy by node level (Galaxyzoo)

Looking at the accuracy breakdown by node level (Figures 8, 9 and 10), it is notable that, for multiple-question tasks with more steps, the adapted versions of MP and EM generally show better accuracy at most node levels. For data sets from a task with fewer steps in its workflow (fewer levels in the graph), such as the Serengeti task in Figure 8, MV performs better. Meanwhile, Table IV shows that MV has acceptable accuracy for most of the volunteer data sets (mostly over 75 per cent, except for the GalaxyZoo data set) but poor accuracy (less than 60 per cent) in the paid crowd context, even though it still outperforms the other individual algorithms we tested; this suggests it needs to be complemented by other methods that might do well on the specific objects where MV does not. Looking at the accuracy-by-level results, there is no indication that accuracy consistently increases or decreases as the depth of the task (number of levels) increases. The accuracy at each level is more related to its intrinsic character (e.g. the number of options at that level, and the ambiguity or subjectivity of the corresponding object). For instance, the Darkskies task asks the user to evaluate the sharpness and cloudiness of the image, which can be subjective to some degree. This is also why the results by node level show an interesting picture: at different node levels of different workflows, sometimes EM has the best result (such as levels 4 and 5 of GalaxyZoo), sometimes MP has the best result (such as level 1 of Serengeti in the volunteer case), and other times MV has the best result (levels 1, 2 and 3 of Darkskies in both the volunteer and the paid context).
Notice that MP in the Darkskies paid crowd context is the only case in which we observe the naive approach achieving higher overall accuracy (by path) than the adapted one. This is due to the fact that levels 2 and 3 of the Darkskies workflow (determining the cloudiness and sharpness of the image) are in essence questions independent of the first node level (whether the image shows a city, stars or anything else), even though the task workflow makes them subsequent questions only when "city" is chosen as the label for the first node level. Similarly, the accuracy-by-level result for mp_adapted is lower than for mp_naive on a few other occasions at different node levels, but on those occasions there is always one node level at which mp_naive has considerably poor accuracy, such as node level 2 of GalaxyZoo, which subsequently leads to a very low overall accuracy when the whole path is considered. The reason the mp_adapted approach can have lower accuracy at a certain level is that MP only returns 1.0 or 0.0 to indicate whether a label is the predicted one, whereas our adapted approach tries to assemble the most probable valid label path (as shown in Algorithm 4) from the candidate labels predicted at each node level. Therefore, in the MP case, the randomness in ranking the combinations might not do well at a particular node level; the overall accuracy is nevertheless better than with the naive approach, which completely neglects the validity of a label path. Notice also that, although our adapted approaches achieve higher accuracy for the first node level in most cases, mv_adapted has slightly lower accuracy than mv_naive for the GalaxyZoo workflow in the volunteer context. This is because we assemble the result based on the overall probability of a path (the vote percentages at each node level multiplied together) instead of assuming that the top-voted label at node level 1 is correct and then traversing subsequent nodes based on that assumption. Our main purpose is to obtain the most probable valid label path, which Table IV shows to be effective. We have run the significance testing for all the chosen algorithms. The result is statistically significant for all of our adapted approaches, as the p-value is smaller than the predefined significance level (5 per cent) in all cases.

5. Discussion
In this section, we expand on the key findings of the evaluation results introduced earlier.

5.1 Crowd context matters
We have deliberately chosen three representative tasks, each with two data sets, produced by volunteers and by a paid crowd. Based on our results, there is a distinctive difference in performance for the same algorithm applied in these two contexts. For all algorithms, without exception, the accuracy achieved in the volunteer context is evidently higher than in the paid crowd context. For the same workflow, the overall accuracy (by path) achievable in the volunteer context is normally around 30 per cent higher than in the paid crowd context for workflows with two to three questions. However, this does not seem to hold when the workflow involves more questions: in the GalaxyZoo case, the best accuracy any of the algorithms can achieve is only around 5 per cent higher in the volunteer context than in the paid crowd context.
5.2 Workflow counts
From the representative tasks we have shown so far, there are two main factors that need to be taken into account when designing a classification crowdsourcing workflow, especially when the classification steps are interdependent: the number of questions (which determines the depth of the graph) and the number of answer options per question (the width of the corresponding node level, which affects the cognitive effort required to pass that level with the correct option chosen). In our evaluation, we found evidence that both depth and width impact the overall performance of the inference algorithms. One visible pattern concerns the paid crowd data sets. In this setting, overall accuracy (by path) increases as the depth of the graph increases (for both mv_adapted and mp_adapted), which suggests that it might be better to have more classification questions, each with fewer options, rather than fewer questions with many options to choose from, particularly where the crowd's skill level is uncertain. The other notable aspect is that, in the volunteer context, the MP algorithm performs comparably with MV in the Serengeti workflow, but not in the other two workflows with more levels.

5.3 Heuristics-based aggregation as an addition
On observing the results in Section 4.3, combining the output of these algorithms with a heuristic strategy appears to be a promising way to perform better inference. We want to use the results from mv_adapted, em_adapted and mp_adapted in combination, exploiting their respective strengths and weaknesses for complex classification tasks. To do so, we could build an aggregator based on the following intuitions: the number of unique classifications of an object (denoted u) shows the degree to which the crowd workers agree or disagree on the classification, where a higher number indicates more disagreement and normally implies that the object is difficult or ambiguous to classify; the ratio (denoted r) between the number of unique classifications and the total number of classifications collected for the object similarly indicates how diverse the answers are; and, as the three-sigma rule (Pukelsheim, 1994) in the empirical sciences suggests, almost all values should lie within three standard deviations of the mean of a normal distribution, with the mean plus one, two or three standard deviations covering roughly 68, 95 and 99.7 per cent of the data. In the cases where MV might potentially fail (where workers tend to disagree), the number of unique classifications, or the ratio of unique to total classifications for an object, falls within the higher range of the distribution. Thus, a heuristic aggregation strategy we could consider is the following: look at the intrinsic characteristics of the collected classifications for each object, such as the number of unique classifications and its ratio against the total number of classifications; then, based on the third intuition above, use the skewness (denoted s) of the distributions of the number of unique classifications, U ~ N(u_m, u_s), and of the ratio, R ~ N(r_m, r_s), to heuristically choose a bound beyond which MV should be complemented by other approaches, as sketched below. Choosing an optimal threshold, however, is not straightforward and needs to be explored in future work.
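A minimal sketch of the heuristic just described: compute, for each object, the number of unique label paths and the unique-to-total ratio, and flag the objects whose ratio lies unusually far above the mean (the three-sigma-rule intuition) as candidates where MV should be complemented by the EM- or MP-based result. The one- or two-standard-deviation threshold and all names are illustrative assumptions; the paper leaves the choice of an optimal bound to future work.

import statistics

def disagreement_stats(labels_per_object):
    """For each object return (u, r): the number of unique label paths and the unique/total ratio."""
    stats = {}
    for obj, labels in labels_per_object.items():
        u = len(set(labels))
        stats[obj] = (u, u / len(labels))
    return stats

def flag_hard_objects(labels_per_object, n_sigma=2):
    """Flag objects whose unique-to-total ratio lies more than n_sigma standard
    deviations above the mean ratio, i.e. where the crowd disagrees unusually much
    and MV is most likely to fail."""
    ratios = {o: r for o, (_, r) in disagreement_stats(labels_per_object).items()}
    mean = statistics.mean(ratios.values())
    std = statistics.pstdev(ratios.values())
    return {o for o, r in ratios.items() if r > mean + n_sigma * std}

# Placeholder crowd answers (label paths serialised as strings).
labels_per_object = {
    "p1": ["city/clear"] * 9 + ["stars"],
    "p2": ["city/clear", "city/cloudy", "stars", "aurora", "astronaut", "black image"],
    "p3": ["stars"] * 8,
}
hard = flag_hard_objects(labels_per_object, n_sigma=1)
print(hard)  # objects where an EM- or MP-based result could complement MV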
6. Related work
Our approach is informed by existing work on microtask crowdsourcing and quality assurance in crowdsourcing, which we review in this section.

6.1 Microtask crowdsourcing and workflows
In crowdsourcing, a problem sometimes needs to be decomposed into smaller, fine-grained microtasks and then arranged in a workflow for more effective processing. In general, a workflow consists of a set of microtasks; the microtasks are sometimes of different types and can be dependent on or independent of each other. For instance, the find-fix-verify workflow proposed by Bernstein et al. (2010) uses microtask crowdsourcing to proofread and shorten text in three steps: finding areas of improvement in the text; fixing or improving them; and verifying the quality of the changes. In each step, the crowd is asked to carry out the same type of microtask, sometimes iteratively. In Kittur et al. (2008, 2013) and Acosta et al.’s (2013) studies, researchers have proposed to group the same or similar microtasks into batches as a means to facilitate learning effects. Previous studies have also shown that task performance can be improved as a function of several factors, including the design of tasks and workflows, motivation and incentives, and training (Bernstein et al., 2010; Demartini et al., 2012; Kittur et al., 2008; Wiggins et al., 2011). In citizen science platforms such as Zooniverse[9], most of the classification projects are not simple one-question tasks; instead, multiple questions are chained together. Zooniverse uses a workflow to “group a collection of tasks into a logic unit”[10], which in essence refers to a multiple-question task that needs to be finished in several steps. In Snapshot Serengeti[1], classifying an image means answering a set of independent questions, sometimes several times when more than one animal is present in the image. In Cities at Night[2] and Galaxy Zoo[3], questions are inter-related and the answers given in one step determine the questions in the subsequent steps. In the context of such classification tasks, a workflow refers to the logical organisation of the classification questions and their corresponding options. Most previous studies of crowdsourcing workflows have focussed on the design of the workflows and have shown that a particular type of workflow can be crowdsourced effectively (in terms of the accuracy of outputs, budget, time, etc.) (Little et al., 2009; Bernstein et al., 2010; Tran-Thanh et al., 2015). In some cases, researchers have proposed bespoke quality assurance methods for their workflows (Lintott et al., 2011; Willett et al., 2013). Our work proposes a strategy that can be applied to determine the correct label path for a whole range of classification tasks spanning several steps with independent or dependent multiple-choice questions, which differs from existing research that mainly focuses on the result of the final step (no matter how many previous steps exist in the workflow).

6.2 Inference algorithms
Researchers have proposed inference algorithms, mathematical models that can automatically infer the correct solution to a given problem from a solution space defined by the crowd. For example, Ipeirotis et al. presented an algorithm that assesses the performance of crowd workers and exploits this information to estimate the quality of answers on Mechanical Turk (Ipeirotis et al., 2010). Karger et al. proposed to use MP to infer correct answers from workers’ answers (Karger et al., 2011).
Bachrach et al. (2012) used a Bayesian graphical model to grade test answers in scenarios where the ground truth cannot be made available. Whitehill et al. (2009) followed an expectation maximisation approach to identify correct classifications, depending on the expertise of the workers and the level of difficulty of the task. In the citizen science project Galaxy Zoo Supernovae, crowd answers were analysed using a Bayesian generalisation of the same expectation maximisation idea (Simpson et al., 2011). More recently, Difallah et al. (2015b) compiled a set of features that can be used to predict answer quality, based on an analysis of Mechanical Turk logs. Several studies have shown that it is possible to combine automatic prediction methods (such as Bayesian or generative probabilistic models) with additional input from the crowd to further improve the accuracy of the predictions (dos Reis et al., 2015; Hare et al., 2013; Ipeirotis et al., 2010; Loni et al., 2014; Simpson et al., 2013). Other studies have analysed and compared different algorithms (Zheng et al., 2017a; et al., 2015; Sheshadri and Lease, 2013), emphasising the need for more research to understand the interplay among different sets of design parameters on the overall performance. All these existing methods have considerably advanced the state of the art. However, they cannot be applied to every type of microtask crowdsourcing workflow without restrictions. Moreover, most of the research carried out so far in this space has looked at rather simple binary or multiple-choice classification tasks with the aim of identifying a single, correct answer. This class of microtasks, albeit important and widely used, is not always the norm. As we have seen in the examples from the previous section, there are cases where a classification problem cannot be easily decomposed into independent microtasks, or where different, related microtasks should be grouped into more complex workflows for efficiency reasons. Although a few recent works look into relatively complex multiple-step classification tasks, each of them has a domain-specific or problem-specific focus (Parameswaran et al., 2011; Kim et al., 2002; Wu et al., 2012; Bragg et al., 2013; Kamar and Horvitz, 2015; Otani et al., 2016). Bragg et al. (2013) and Otani et al. (2016) both researched entity classification, which normally involves categorising a given entity into parent-child classes in different steps, but from very different perspectives. Bragg et al. (2013) focus on improving the workflow for generating a taxonomy, as well as on inference methods to induce the parent-child relationship, while Otani et al. (2016) focus on tasks where a parent-child relationship exists between two adjacent classification steps and propose label aggregation methods adapted from the existing GLAD method (Whitehill et al., 2009) by considering the hierarchical class-subclass structure. In addition, Wu et al. (2012) investigate the sequential data labelling scenario and present Sembler to ensemble crowd sequential labellings by leveraging the statistical correlation and dependency among multiple instances/sentences, which is domain specific and not applicable to other multiple-step classifications where no such statistics can be exploited. Parameswaran et al.
(2011) and Kamar and Horvitz (2015) look specifically at multiple-step image classification tasks, but both took approaches that are not easily generalised to other multiple-step classifications. Parameswaran et al. (2011) explicitly formulate the classification task as a human-assisted graph search problem, presenting the dimensions that characterise different types of classification and developing algorithms to optimise the questions to be asked (at the different nodes), which are evaluated with simulations. On the other hand, Kamar and Horvitz (2015) focus on optimising worker allocation in the hierarchical classification task (HCT) and develop answer models and evidence models for HCT consensus; both models are constructed with supervised learning, assisted by the Sloan Digital Sky Survey (SDSS) features identified by machine vision that are available for the GalaxyZoo data set. There are also a few studies dedicated particularly to automatic hierarchical classification, where a taxonomy is given and a parent-child relationship among classes exists, but all are bound to a certain domain. For instance, Dumais (2000) investigates automatic hierarchical classification using a Support Vector Machine, with existing web pages whose categories are known used as training data. Su et al. (2006) present an automatic method to classify structured web databases by leveraging probing queries, the returned query result counts and an SVM classifier. Such automatic hierarchical classification not only needs existing labelled data as training data but also focuses on classifications where the answers to further classification steps down the line (child classes) are always a sufficient condition to confirm the answer to the previous classification step (parent classes). Our approach differs from existing work mainly in that it is not restricted to a specific type of multiple-step classification and does not need additional information such as machine-identified features of the image or frequency/correlation statistics of word usage, nor does it rely on parent-child relationships between classification steps. Our method is general and intuitively easy to apply to any multi-step classification. We discussed the three main individual algorithms in Section 2 and noted that, whilst all three algorithms can be used to infer the correct answer for a multiple-choice question, they differ in terms of their inputs and outputs. In our approach, we devised a new strategy that uses existing algorithms to achieve higher classification accuracy.

7. Conclusion
Ensuring quality is one of the grand challenges of microtask crowdsourcing. While previous research has looked at inferring correct answers for microtasks consisting of single binary or multiple-choice questions, our research proposes a model that can be applied to both single-question and multiple-question scenarios, filling the gap in understanding how to aggregate in multiple-question scenarios. We propose a graph model and an “adapted” aggregation method that can improve the accuracy of inferring the true label path in complex workflows with several interdependent questions. Though a few previous works have tried to address similar multiple-step classification, they either limit it to hierarchical classification scenarios, where a parent-child relationship exists between classification steps, or restrict the method by requiring additional information.
We propose using the graph to model a microtask crowdsourcing workflow and to support inference algorithms in making decisions about correct labels for classification tasks with multiple questions, where the answer to one question does not have to be a sufficient condition for, or imply the correctness of, the answer to the previous question. We believe this is the first work that investigates aggregation in a multiple-step classification task with interdependent questions to infer the correct label path and assess the classification accuracy accordingly. To this end, we explored three inference algorithms, MV, MP and EM, each with proven benefits for quality assurance in crowdsourcing. We compared the performance of our adapted approach and the existing naive approach using six representative data sets. We evaluated the performance of the individual algorithms both for overall accuracy, where a full labelling path is considered as an atomic, correct answer, and with a more refined measure that looks at the accuracy at individual node levels of the workflow graph. The results show that our adapted approach significantly improves accuracy compared with the naive approach. They also demonstrate that, while MV does well in overall accuracy, a deeper analysis of the accuracy at each node level reveals a more interesting picture. Hence, a heuristics-based aggregation approach that combines results from multiple algorithms, leveraging their respective strengths, might be a better solution. This suggests the need for more dynamic inference approaches that can adapt to the complexity of the crowdsourcing workflow.

In future work, we plan to devise inference methods that take other, more workflow-specific factors into account. Our current method assumes independence between labels from different levels when inferring the answer for each level. It could potentially be improved by considering possible correlations between labels at different node levels, for instance by giving different weights to labels based on the inferred result from the previous level. Such a method requires a top-down traversal process, which might bring side effects since it relies heavily on the inferred result from the previous level and carries that effect (weight) forward to subsequent levels even when the choice at a previous level is incorrect. As the correlation between labels at different node levels is complicated, the feasibility of incorporating such correlation information into the aggregation process needs further investigation. Meanwhile, the number of options and the length of possible paths in a workflow deserve more in-depth experiments. One promising direction is to employ other machine learning approaches for truth inference, for instance using the workflow properties along with the crowd-generated data to learn and explore features automatically (Huynh et al., 2013) and producing a decision tree to help choose the proper inference algorithm. Alternatively, certain properties of the crowd-collected data could be further exploited to train machine learning algorithm(s) with selective labels to directly infer the true label path.

Notes
1. https://www.snapshotserengeti.org/
2. http://citiesatnight.org/
3. https://www.galaxyzoo.org/
4. https://success.crowdflower.com/hc/en-us/articles/203527635-CML-Attribute-Aggregation
5. https://data.galaxyzoo.org/gz_trees/gz_trees.html
6. In many cases the workflows are tree-shaped, but some, such as the three tasks presented above, are not trees.
7. https://crowdcrafting.org/
8. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
9. https://www.zooniverse.org/
10. https://blog.zooniverse.org/2013/06/20/how-the-zooniverse-works-the-domain-model/

References
Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D., Auer, S. and Lehmann, J. (2013), “Crowdsourcing linked data quality assessment”, The Semantic Web – ISWC 2013, pp. 260-276.
Bachrach, Y., Minka, T. and Guiver, J. (2012), “How to grade a test without knowing the answers – a Bayesian graphical model for adaptive crowdsourcing and aptitude testing”.
Batini, C., Cappiello, C., Francalanci, C. and Maurino, A. (2009), “Methodologies for data quality assessment and improvement”, ACM Computing Surveys, Vol. 41 No. 3, pp. 1-52.
Bernstein, M.S., Little, G., Miller, R.C., Hartmann, B., Ackerman, M.S., Karger, D.R., Crowell, D. and Panovich, K. (2010), “Soylent: a word processor with a crowd inside”, Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, ACM, pp. 313-322.
Bragg, J., Mausam and Weld, D.S. (2013), “Crowdsourcing multi-label classification for taxonomy creation”, in HCOMP 2013, First AAAI Conference on Human Computation and Crowdsourcing.
Dawid, A.P. and Skene, A.M. (1979), “Maximum likelihood estimation of observer error-rates using the EM algorithm”, Applied Statistics, Vol. 28 No. 1, p. 20.
Demartini, G., Difallah, D.E. and Cudré-Mauroux, P. (2012), “ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking”, Proceedings of the 21st International Conference on World Wide Web, ACM, pp. 469-478.
Difallah, D.E., Catasta, M., Demartini, G., Ipeirotis, P.G. and Cudré-Mauroux, P. (2015a), “The dynamics of micro-task crowdsourcing: the case of Amazon MTurk”, pp. 238-247.
Difallah, D.E., Catasta, M., Demartini, G., Ipeirotis, P.G. and Cudré-Mauroux, P. (2015b), “The dynamics of micro-task crowdsourcing: the case of Amazon MTurk”, pp. 238-247.
dos Reis, F.J.C.S., Lynn, H.R., Ali, D., Eccles, A., Hanby, E., Provenzano, C., Caldas, W.J., Howat, L.-A., McDuffus, B. and Liu (2015), “Crowdsourcing the general public for large scale molecular pathology studies in cancer”, EBioMedicine, Vol. 2 No. 7, pp. 679-687.
Dumais, S. (2000), “Hierarchical classification of web content”, pp. 256-263.
Eickhoff, C. and de Vries, A. (2011), “How crowdsourcable is your task”, in Proceedings of the Workshop on Crowdsourcing for Search and Data Mining (CSDM) at the Fourth ACM International Conference on Web Search and Data Mining (WSDM), pp. 11-14.
Gadiraju, U., Demartini, G., Kawase, R. and Dietze, S. (2015), “Human beyond the machine: challenges and opportunities of microtask crowdsourcing”, IEEE Intelligent Systems, Vol. 30 No. 4, pp. 81-85.
Gadiraju, U., Kawase, R. and Dietze, S. (2014), “A taxonomy of microtasks on the web”, Proceedings of the 25th ACM Conference on Hypertext and Social Media, ACM, pp. 218-223.
Gelas, H., Abate, S.T. and Besacier, L.
(2011), “Quality assessment of crowdsourcing transcriptions for African languages”, (August), pp. 3065-3068.
Hare, J.S., Acosta, M., Weston, A., Simperl, E., Samangooei, S., Dupplaw, D. and Lewis, P.H. (2013), “An investigation of techniques that aim to improve the quality of labels provided by the crowd”, in Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, Barcelona, Spain, October 18-19, 2013, Vol. 1043 of CEUR Workshop Proceedings, available at: CEUR-WS.org
Hung, Q.V.N., Tam, N.T., Tran, L.N. and Aberer, K. (2013), “An evaluation of aggregation techniques in crowdsourcing”, Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 8181 LNCS, No. PART 2, pp. 1-15.
Hung, N.Q.V., Thang, D.C., Weidlich, M. and Aberer, K. (2015), “Minimizing efforts in validating crowd answers”, Proceedings of the ACM SIGMOD International Conference on Management of Data, Vol. 2015-May, pp. 999-1014.
Huynh, T.D., Ebden, M., Venanzi, M., Ramchurn, S., Roberts, S. and Moreau, L. (2013), “Interpretation of crowdsourced activities using provenance network analysis”, The First AAAI Conference on Human Computation and Crowdsourcing, pp. 78-85.
Ipeirotis, P.G., Provost, F., Sheng, V.S. and Wang, J. (2014), “Repeated labeling using multiple noisy labelers”, Data Mining and Knowledge Discovery, Vol. 28 No. 2, pp. 402-441.
Ipeirotis, P.G., Provost, F. and Wang, J. (2010), “Quality management on Amazon Mechanical Turk”, Proceedings of the ACM SIGKDD Workshop on Human Computation – HCOMP ’10, p. 64.
JCGM 200 (2008), “International vocabulary of metrology – basic and general concepts and associated terms (VIM) / Vocabulaire international de métrologie – concepts fondamentaux et généraux et termes associés (VIM)”, International Organization for Standardization, Geneva.
Kahn, B.K., Strong, D.M. and Wang, R.Y. (2002), “Information quality benchmarks: product and service performance”, Communications of the ACM, Vol. 45 No. 4, pp. 184-192.
Kamar, E., Hacker, S. and Horvitz, E. (2012), “Combining human and machine intelligence in large-scale crowdsourcing”, Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, International Foundation for Autonomous Agents and Multiagent Systems, Vol. 1, pp. 467-474.
Kamar, E. and Horvitz, E. (2015), “Planning for crowdsourcing hierarchical tasks”, Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, p. 2030.
Karger, D.R., Oh, S. and Shah, D. (2011), “Iterative learning for reliable crowdsourcing systems”, Advances in Neural Information Processing Systems, pp. 1953-1961.
Khattak, F.K. and Salleb-Aouissi, A. (2011), “Quality control of crowd labeling through expert evaluation”, Second Workshop on Computational Social Science and the Wisdom of Crowds (NIPS 2011), pp. 1-5.
Kim, J.-H., Kang, I.-H. and Choi, K.-S. (2002), “Unsupervised named entity classification models and their ensembles”, Proceedings of the 19th International Conference on Computational Linguistics, Vol. 1, pp. 1-7.
Kittur, A., Chi, E.H. and Suh, B.
(2008), “Crowdsourcing user studies with Mechanical Turk”, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, pp. 453-456.
Kittur, A., Nickerson, J.V., Bernstein, M., Gerber, E., Shaw, A., Zimmerman, J., Lease, M. and Horton, J. (2013), “The future of crowd work”, Proceedings of the 2013 Conference on Computer Supported Cooperative Work – CSCW ’13, ACM Press, New York, NY, p. 1301.
Kittur, A., Smus, B., Khamkar, S. and Kraut, R.E. (2011), “CrowdForge: crowdsourcing complex work”, Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology – UIST ’11, pp. 43-52.
Kulkarni, A.P., Can, M. and Hartmann, B. (2011), “Turkomatic”, Proceedings of the 2011 Annual Conference Extended Abstracts on Human Factors in Computing Systems – CHI EA ’11, p. 2053.
Lintott, C., Schawinski, K., Bamford, S., Slosar, A., Land, K., Thomas, D., Edmondson, E., Masters, K., Nichol, R.C. and Raddick, M.J. (2011), “Galaxy Zoo 1: data release of morphological classifications for nearly 900 000 galaxies”, Monthly Notices of the Royal Astronomical Society, Vol. 410 No. 1, pp. 166-178.
Little, G., Chilton, L.B., Goldman, M. and Miller, R.C. (2009), “TurKit: tools for iterative tasks on Mechanical Turk”, in Proceedings of the ACM SIGKDD Workshop on Human Computation, ACM, pp. 29-30.
Liu, X., Lu, M., Ooi, C., Shen, Y., Wu, S. and Zhang, M. (2012), “CDAS: a crowdsourcing data analytics system”, Proceedings of the VLDB Endowment, Vol. 5 No. 10, pp. 1040-1051.
Loni, B., Hare, J., Georgescu, M., Riegler, M., Zhu, X., Morchid, M., Dufour, R. and Larson, M. (2014), “Getting by with a little help from the crowd: practical approaches to social image labeling”, Proceedings of the 2014 International ACM Workshop on Crowdsourcing for Multimedia, pp. 69-74.
Malone, T.W., Laubacher, R. and Dellarocas, C. (2010), “The collective intelligence genome”, IEEE Engineering Management Review, Vol. 38 No. 3.
Mao, A., Kamar, E., Chen, Y., Horvitz, E., Schwamb, M.E., Lintott, C.J. and Smith, A.M. (2013), “Volunteering versus work for pay: incentives and tradeoffs in crowdsourcing”, First AAAI Conference on Human Computation and Crowdsourcing, pp. 94-102.
Otani, N., Baba, Y. and Kashima, H. (2016), “Quality control for crowdsourced hierarchical classification”, Proceedings – IEEE International Conference on Data Mining, ICDM, pp. 937-942.
Parameswaran, A., Sarma, A.D., Garcia-Molina, H., Polyzotis, N. and Widom, J. (2011), “Human-assisted graph search: it’s okay to ask questions”, Proceedings of the VLDB Endowment, Vol. 4 No. 5, pp. 267-278.
Paulheim, H. and Bizer, C. (2014), “Improving the quality of linked data using statistical distributions”, International Journal on Semantic Web and Information Systems, Vol. 10 No. 2, pp. 63-86.
Pukelsheim, F. (1994), “The three sigma rule”, The American Statistician, Vol. 48 No. 2, pp. 88-91.
Redi, J. and Povoa, I. (2014), “Crowdsourcing for rating image aesthetic appeal: better a paid or a volunteer crowd?”, Proceedings of the 2014 International ACM Workshop on Crowdsourcing for Multimedia – CrowdMM ’14, pp. 25-30.
Rosenthal, S.L. and Dey, A.K. (2010), “Towards maximizing the accuracy of human-labeled sensor data”, in Proceedings of the 15th International Conference on Intelligent User Interfaces – IUI ’10, ACM Press, New York, NY, p. 259.
Shahaf, D. and Horvitz, E. (2010), “Generalized task markets for human and machine computation”, in AAAI.
Sheshadri, A. and Lease, M.
(2013), “SQUARE: a benchmark for research on computing crowd consensus”, First AAAI Conference on Human Computation and ..., pp. 156-164.
Simpson, E., Roberts, S., Psorakis, I. and Smith, A. (2013), “Dynamic Bayesian combination of multiple imperfect classifiers”, Studies in Computational Intelligence, Vol. 474, pp. 1-35.
Simpson, E., Roberts, S.J., Smith, A. and Lintott, C. (2011), “Bayesian combination of multiple, imperfect classifiers”, in Proceedings of the 25th Conference on Neural Information Processing Systems, Granada.
Snow, R., O’Connor, B., Jurafsky, D. and Ng, A.Y. (2008), “Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks”, Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 254-263.
Su, W., Wang, J. and Lochovsky, F. (2006), “Automatic hierarchical classification of structured deep web databases”, Web Information Systems – WISE 2006, Springer, Berlin Heidelberg, pp. 210-221.
Tran-Thanh, L., Huynh, T.D. and Rosenfeld, A. (2015), “Crowdsourcing complex workflows under budget constraints”, Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), pp. 1298-1304.
Vickrey, D., Bronzan, A., Choi, W., Kumar, A., Turner-Maier, J., Wang, A. and Koller, D. (2008), “Online word games for semantic data collection”, Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 533-542.
Whitehill, J., Ruvolo, P., Wu, T., Bergsma, J. and Movellan, J. (2009), “Whose vote should count more: optimal integration of labels from labelers of unknown expertise”, Advances in Neural Information Processing Systems, Vol. 22 No. 1, pp. 1-9.
Wiggins, A., Newman, G., Stevenson, R.D. and Crowston, K. (2011), “Mechanisms for data quality and validation in citizen science”, e-Science Workshops (eScienceW), 2011 IEEE Seventh International Conference on, IEEE, pp. 14-19.
Willett, K.W., Lintott, C.J., Bamford, S.P., Masters, K.L., Simmons, B.D., Casteels, K.R.V., Edmondson, E.M., Fortson, L.F., Kaviraj, S., Keel, W.C., Melvin, T., Nichol, R.C., Raddick, M.J., Schawinski, K., Simpson, R.J., Skibba, R.A., Smith, A.M. and Thomas, D. (2013), “Galaxy Zoo 2: detailed morphological classifications for 304 122 galaxies from the Sloan Digital Sky Survey”, Monthly Notices of the Royal Astronomical Society, Vol. 435 No. 4, pp. 2835-2860.
Wu, X., Fan, W. and Yu, Y. (2012), “Sembler: ensembling crowd sequential labeling for improved quality”, Proceedings of the National Conference on Artificial Intelligence, Vol. 2, pp. 1713-1719.
Yang, J., Redi, J., Demartini, G. and Bozzon, A. (2016), “Modeling task complexity in crowdsourcing”.
Zaveri, A., Rula, A., Maurino, A., Pietrobon, R., Lehmann, J., Auer, S. and Hitzler, P. (2013), “Quality assessment methodologies for linked open data”, Semantic Web.
Zhang, J., Sheng, V.S., Li, Q., Wu, J. and Wu, X. (2017a), “Consensus algorithms for biased labeling in crowdsourcing”, Information Sciences, Vol. 382-383, pp. 254-273.
Zheng, Y., Li, G., Li, Y., Shan, C. and Cheng, R. (2017b), “Truth inference in crowdsourcing: is the problem solved?”, Proceedings of the VLDB Endowment, Vol. 10 No. 5.

Further reading
Wang, J., Ipeirotis, P.G. and Provost, F. (2015), “Cost-effective quality assurance in crowd labeling”.
Corresponding author
Qiong Bu can be contacted at: qb1g13@soton.ac.uk
