The online labor market Amazon Mechanical Turk (MTurk) is an increasingly popular source of respondents for social science research. A growing body of research has examined the demographic composition of MTurk workers as compared with that of other populations. While these comparisons have revealed the ways in which MTurk workers are and are not representative of the general population, variations among samples drawn from MTurk have received less attention. This article focuses on whether MTurk sample composition varies as a function of time. Specifically, we examine whether demographic characteristics vary by (a) time of day, (b) day of week, and serial position (i.e., earlier or later in data collection), both (c) across the entire data collection and (d) within specific batches. We find that day of week differences are minimal, but that time of day and serial position are associated with small but important variations in demographic composition. This demonstrates that MTurk samples cannot be presumed identical across different studies, potentially affecting reliability, validity, and efforts to reproduce findings.

Keywords: political methodology, political science, social sciences, politics and social sciences, research methods, data collection, research methodology and design, reliability and validity, political behavior/psychology, psychology

Harvard T.H. Chan School of Public Health, Boston, MA, USA
University of Michigan, Ann Arbor, USA
Mathematica Policy Research, Ann Arbor, MI, USA
Cornell University, Ithaca, NY, USA
Princeton University, NJ, USA

Corresponding Author:
Jesse Chandler, Research Center for Group Dynamics, Institute for Social Research, University of Michigan, 426 Thompson St., Ann Arbor, MI 48105, USA.
Email: email@example.com

Creative Commons CC BY: This article is distributed under the terms of the Creative Commons Attribution 4.0 License (http://www.creativecommons.org/licenses/by/4.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).

SAGE Open
Casey et al.

Background

Amazon Mechanical Turk (MTurk) is an online labor market in which people (“requesters”) requiring the completion of small tasks (“Human Intelligence Tasks” [HITs]) are matched with people willing to do them (“workers”). MTurk has become a popular data collection tool among social science researchers: In 2015, the 300 most influential social science journals (with impact factors greater than 2.5, according to Thomson-Reuters InCites) published more than 500 articles that relied on MTurk data in full or in part (Chandler & Shapiro, 2016).

Reflecting the popularity of MTurk, considerable effort has been invested in evaluating data collected from it, with particular emphasis on documenting the demographic and psychological characteristics of its population, the quality of respondent data, and the methodological limitations of the platform. As a result, MTurk workers have become one of the most thoroughly studied convenience samples currently available to researchers (for a review, see Chandler & Shapiro, 2016), and researchers have learned a great deal about the ways in which MTurk respondents are and are not similar to the general population. There are reasons to suspect, however, that there are also important variations between different samples drawn from MTurk, and these variations have received far less attention. This article addresses this question, using data from a study of approximately 10,000 MTurk workers to examine whether sample composition varies as a function of the time that it is collected.

We begin by reviewing what extant research reveals about the demographic composition of the MTurk worker pool. Then, we describe the methods and measures that we use in our study, after which we present the results of our analyses, which include a demographic description of the largest sample of MTurk workers we are aware of and an exploration of whether the demographic characteristics of MTurk respondent samples vary across day and time and earlier versus later in the data collection. We conclude with a discussion about the implications of the temporal variations we uncover for researchers using MTurk (and online data collection more generally).

How Representative of the General Population Are Samples of MTurk Workers?

The demographic characteristics of samples drawn from MTurk populations have been extensively studied. These studies show that most MTurk workers live in the United States and India (Paolacci, Chandler, & Ipeirotis, 2010), that U.S. MTurk workers are more diverse than many other convenience samples, and that they are not representative of the population as a whole (Paolacci & Chandler, 2014). However, while scholars caution that MTurk samples are typically less representative than commercial web panels that make explicit efforts to provide representative samples (Berinsky, Huber, & Lenz, 2012; Mullinix, Leeper, Druckman, & Freese, 2015; Weinberg, Freese, & McElhattan, 2014), they also agree that MTurk samples are more diverse than student samples or community samples recruited from college towns (Berinsky et al., 2012; Krupnikov & Levine, 2014).

Differences between the U.S. MTurk population and the U.S. general population parallel differences between samples recruited through other online methods and the U.S. population (Casler, Bickel, & Hackett, 2013; Hillygus, Jackson, & Young, 2014; Paolacci & Chandler, 2014). Most significantly, MTurk workers are typically younger than the general population (Berinsky et al., 2012; Paolacci et al., 2010), have more years of formal education, and are more liberal (Berinsky et al., 2012; Mullinix et al., 2015). MTurk workers are less likely to be married (Berinsky et al., 2012; Shapiro, Chandler, & Mueller, 2013), and more likely to identify as lesbian, gay, or bisexual (LGB; Corrigan, Bink, Fokuo, & Schmidt, 2015; Reidy, Berke, Gentile, & Zeichner, 2014; Shapiro et al., 2013). MTurk workers also tend to report lower personal incomes and are more likely to be unemployed or underemployed than members of the general population (Corrigan et al., 2015; Shapiro et al., 2013). Whites and Asian Americans are overrepresented within MTurk samples, while Latinos and African Americans are underrepresented (Berinsky et al., 2012).

Are Samples of MTurk Workers Representative of MTurk Workers?

While the foregoing research makes clear that the U.S. MTurk population is not representative of the U.S. population as a whole, there are also reasons to suspect that samples recruited from MTurk are themselves not representative of the MTurk population as a whole. Different studies occasionally observe substantially different demographic characteristics. For example, the proportion of female respondents differed by about 10% across two studies that each recruited several thousand participants (Chandler & Shapiro, 2016).

There are many potential causes for sampling variation across studies. Anecdotal evidence suggests that MTurk sample composition might be influenced by the fact that workers share information about available studies and that reputation effects might lead workers to gravitate toward (and to avoid) particular requesters (Chandler, Mueller, & Paolacci, 2014). Some of this variation is also surely the result of MTurk workers self-selecting into the studies that interest them (for a discussion, see Couper, 2000). Design choices that are exogenous to a study design may also inadvertently influence sample composition. The effects of such exogenous choices are of particular interest to researchers because they are both within their control and typically irrelevant to the substance of the studies themselves.

The present study focuses on the impact of intertemporal variation on sample composition across (a) time of day, (b) day of week, and serial position (i.e., earlier or later in data collection), both (c) across the entire data collection and (d) within specific batches. Extant evidence about sample differences across time and day is suggestive but limited by small sample sizes. Comparing samples of about 100 participants obtained within two different studies, Komarov, Reinecke, and Gajos (2013) observed that compared with workers recruited later in the evening, workers recruited during the daytime were older, more likely to be female, and less likely to use a computer mouse to complete the survey (suggesting that they were using mobile devices). Lakkaraju (2015) compared the gender, income, education, and age of 700 workers across different times and days, finding that only gender varied as a function of the day a given HIT was posted.

Variation among participants who complete a research study early or later in the data collection process (referred to here as serial position effects) has been observed in other modes of data collection, but has not been examined on MTurk. Changes in sample composition between “early” and “late” responders have been observed in mail and email surveys, in part because the easiest-to-contact participants tend to complete surveys earlier (for a review, see Sigman, Lewis, Yount, & Lee, 2014). In general, people of color are underrepresented among early respondents, as are men (Gannon, Nothern, & Carroll, 1971; Sigman et al., 2014; Voigt, Koepsell, & Daling, 2003), younger people, and people with fewer years of formal education (Voigt et al., 2003; for a discussion, see Sigman et al., 2014).

Examinations of lab studies of college students have also shown that sample compositions can vary over time. For example, women (Ebersole et al., 2016) and students with high GPAs (Aviv, Zelenski, Rallo, & Larsen, 2002; Cooper, Baumgardner, & Strathman, 1991) are more likely than men and students with lower GPAs to participate in lab studies at the beginning of the semester. Personality variables also influence when students complete lab studies, with participants who report that they are less extraverted, less open to experience, and more conscientious more likely to respond at the beginning of the semester.

Investigating whether samples vary over the course of a survey fielding period is critical, because researchers tend to recruit small samples for their research (Fraley & Vazire, 2014). In fact, most of the existing studies of the characteristics of MTurk workers rely on relatively small samples (N < 500) that capture only a small proportion of the approximately 16,000 active MTurk workers (Stewart et al., 2015). If researchers use only small samples, the samples they recruit may differ systematically from the worker pool as a whole. In addition, if researchers recruit unique workers to participate in a series of related experiments (as they should; see Chandler et al., 2014; Chandler, Paolacci, Peer, Mueller, & Ratliff, 2015), sample composition may vary systematically across the experiments, compromising both the reliability and validity of their studies, and possibly complicating efforts to reproduce findings.

A second potential serial position effect on MTurk is differences between people who complete HITs shortly after they are posted and those who complete them later. This factor is independent of early versus late responding to the study because study data can be collected through any number of batch postings. In practice, researchers often collect data from MTurk by posting more than one batch of HITs, either to speed up data collection (data collection is faster immediately after a HIT is posted; Peer, Brandimarte, Samat, & Acquisti, 2017) or to circumvent the fee Amazon charges for a batch that recruits more than nine participants. When more (but smaller) batches are posted, the average batch will, by default, be closer to the front of the queue, which could affect sample composition for at least three reasons. First, placing a batch closer to the front of the queue reduces the amount of work it takes to find it, especially for workers who rely on the default sort order. Second, smaller batches might limit the number of workers who discover the survey through links on worker forums, because the link will be valid for a shorter period of time. Third, some workers use automated scripts or other tools to be alerted about the availability of new work. In this study, we post multiple batches that allow us to disentangle serial position effects within batches of posted HITs from serial position effects across the data collection as a whole.

Method

To explore whether MTurk worker demographics vary intertemporally, we crafted a brief HIT (average completion time was approximately 5 min) that contained demographic questions that are of interest to scholars across an array of disciplines.

We first posted our HIT on March 19, 2015, and data collection concluded on May 14, 2015, so it was active for a total of 56 days (or 8 weeks). We began by posting the HIT twice daily, at 3 p.m. and 10 p.m. Eastern Time (ET). After the first week, we added a third posting at 10 a.m. ET.

Only U.S.-based workers with a HIT acceptance ratio (HAR) greater than 95% and who had completed at least 100 HITs were eligible to participate. We selected workers with a 95% HAR because this subsample of workers has been shown to result in higher quality data (Peer, Vosgerau, & Acquisti, 2014) and, in our experience, to be favored by researchers. We prevented workers from completing this survey more than once across the entire fielding period.

For the first 3 weeks, workers were paid US$0.25 to complete the survey. After learning that the average time to completion was roughly 5 min, we increased the pay rate to US$0.50 for the remainder of the fielding period to comply with recommended pay norms of US$0.10 per minute (see “Guidelines for Academic Requesters,” 2014). By the end of the study, we had posted the HIT 162 times and sampled 9,770 unique respondents.

Measures

At the beginning of the study, we collected measures of age and the U.S. state in which respondents lived. Participants were then asked to report demographic information including their highest level of education, current employment status, and current occupation. We also asked a series of questions about their current relationship status, sexual orientation, sex assigned at birth, and current gender identity. In addition, we asked questions about household size, race and ethnicity, household income, religious denomination, how often they attend religious services, and self-perceived socioeconomic status (see Howe, Hargreaves, Ploubidis, De Stavola, & Huttly, 2011; Ravallion & Lokshin, 1999).

We also included a 10-item measure of the “Big Five” personality factors (Ten Item Personality Measure or TIPI; Gosling, Rentfrow, & Swann, 2003). The “Big Five” is among the most widely accepted taxonomies of personality traits within psychology (for a review, see John & Srivastava, 1999) and conceptualizes personality as consisting of five bipolar dimensions: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. The questionnaire and other materials are available online on the Open Science Framework (osf.io/tg7h3).

Prior to completing the survey, participants were asked whether they learned about the survey on MTurk or somewhere else. Those who indicated somewhere else were asked to specify where they learned about it.

Finally, using a database of more than 100,000 HITs submitted over 3 years immediately prior to the present study (reported in Stewart et al., 2015), we were able to estimate individual workers’ relative experience completing MTurk tasks. Workers with no recorded experience during Stewart et al.’s study (N = 4,746) were assigned a value of one and all other workers were assigned a value equal to their total number of previously completed HITs plus one. Although this measure does not capture total HITs a respondent has completed, it does allow us to analyze temporal variations in workers’ relative levels of experience (see Chandler et al., 2015).

Results

Data Cleaning and Survey Metadata

Data collection resulted in 10,121 survey attempts, of which 169 attempts (generated by 147 workers) were identified as duplicate responses. Duplicate responses were defined as any submission from a WorkerID in excess of one. For workers with duplicate responses, the most complete response was taken. When both responses were of equal length (typically complete), the first response was taken. An additional 182 responses that came from non-U.S. IP addresses and one respondent without a WorkerID were also identified and deleted, resulting in 9,770 valid survey attempts.

Of the valid attempts, 780 (8%) were identified by Qualtrics as incomplete. A visual inspection of these responses found that 724 of these respondents answered the last question in the survey and were functionally complete. Only 56 respondents (0.6%) dropped out of the survey after providing only partial data. These partial responses were included for analysis.

Of all valid attempts, 518 (5.3%) came from an IP address shared by at least one other response. The majority of IP addresses (n = 196) contributed two responses, with 10 contributing three responses, three contributing four responses, two contributing 10 responses, one contributing 26 responses, and another 39 responses. All responses from duplicate IP addresses were left in for this analysis, as shared IP addresses do not necessarily indicate the same worker repeating a task.

For example, the 433 responses from IP addresses that contributed four or fewer responses were examined. Of these, 233 were almost certainly unique respondents from the same household: They came from people who listed the exact same household size and the same age of household members (±2 years in aggregate), and who reported an age matching that of a person the other respondent reported living with. An additional 49 respondents were likely from the same household, reporting approximately the same total age of members (±5 years in aggregate), or who appeared to have neglected to report a household member (usually a child or much older adult).

Three of the four IP addresses that generated the most responses were servers registered to Amazon. It is likely that participants from these addresses are using either a proxy server, or an ISP hosted on Amazon Web Services. These responses varied in the time they were attempted, the specific browser and operating system configuration used, and the content of the survey responses.

Characteristics of the MTurk Sample

Tables 1 to 4 present summary data about the entire sample, about participants in the first two batches only, and for national estimates when available. The entire sample represents the largest sample of MTurk workers we are aware of, and likely measures about two thirds of the available worker population (Stewart et al., 2015). The sample size of the first two batches (N = 438) approximates a sample slightly larger than those typically used in behavioral science research (Fraley & Vazire, 2014) and is presented to enable comparisons of this study to other, typically sized data collections.

The demographic data are reported in Table 1, including information about worker experience and where they learned about the survey. Differences between this sample and the U.S. population as a whole are generally consistent with those reported in previous analyses of smaller surveys (Berinsky et al., 2012; Krupnikov & Levine, 2014; Paolacci et al., 2010; Shapiro et al., 2013). For example, the workers in our sample are younger and more likely to be White than the U.S. population as a whole. Workers residing in the Eastern Time Zone are overrepresented compared with those in other parts of the United States. This variation is likely because the times that HITs were posted aligned most closely with the times that workers in that time zone were likely to be active.

Almost all (90.9%) workers reported finding the survey on MTurk. Of the 868 workers who found the survey elsewhere, most (n = 671) named HitsWorthTurkingFor (a Reddit forum), 29 listed Hit Scraper (an automatic alerting service), and virtually all other respondents listed other MTurk discussion forums (e.g., TurkerNation).

Table 1. Demographic Characteristics of Workers.

Characteristic | Total sample (N = 9,770) | First respondents (N = 438) | National estimates
Mean age | 33.51 [32.3, 33.7] | 33.59 [32.6, 34.58] | 47.01
Female | 51.7% [50.8, 52.7] | 46.8% [42.2, 51.6] | 50.8%
Transgender | 0.5% [0.3, 0.6] | 0.2% [0.0, 0.6] | 0.6%
Gender queer | 0.9% [0.7, 1.1] | 0.2% [0.0, 0.6] | –
Mean worker experience (prior HITs completed) | 3.67 [3.5, 3.9] | 6.94 [5.82, 8.06] | –
Found HIT outside of MTurk | 9% | 14.8% | –
U.S. time zone
Eastern | 52.2% [51.2, 53.2] | 56.2% [51.6, 60.9] | 47.3%
Central | 25.3% [24.4, 26.2] | 23.3% [19.3, 27.3] | 29.0%
Mountain | 5.9% [5.4, 6.4] | 3.9% [2.1, 5.7] | 6.5%
Pacific | 15.9% [15.2, 16.6] | 16.4% [12.9, 19.9] | 16.6%
Other | 0.6% [0.5, 0.8] | 0.2% [0.0, 0.6] | 0.6%
Race and ethnicity
White/Caucasian | 82.9% [82.2, 83.7] | 79.5% [75.7, 83.3] | 73.6%
African American | 8.6% [8.0, 9.2] | 7.8% [5.3, 10.3] | 12.6%
Asian American | 7.7% [7.2, 8.2] | 11.2% [8.3, 14.1] | 5.1%
American Indian or Alaskan Native | 2.1% [1.8, 2.4] | 3.2% [1.6, 4.9] | 0.8%
Native Hawaiian or Pacific Islander | 0.6% [0.4, 0.8] | 1.6% [0.4, 2.8] | 0.2%
Other | 1.3% [1.1, 1.5] | 0.9% [0.0, 1.8] | 4.7%
Latino | 5.5% [5.1, 6.0] | 6.4% [4.1, 8.7] | 17.1%

Note. 95% CI indicated in brackets. HITs = Human Intelligence Tasks; MTurk = Amazon Mechanical Turk; CI = confidence interval. U.S. Census Bureau (2016; mean age of adult population). U.S. Census Bureau (2011-2015). Flores, Herman, Gates, and Brown (2016). Stewart et al. (2015). U.S. Census Bureau (2016) population estimates. U.S. Census Bureau (2011-2015).

Table 2 summarizes the socioeconomic characteristics of our sample. Respondents to our survey generally reported more years of formal education than the population as a whole. Although Americans residing in the wealthiest households are underrepresented in our data, household income was much closer to the median U.S. income than would be expected from previous measurements of individual worker income (Berinsky et al., 2012; Paolacci et al., 2010). A portion of this difference is likely due to the fact that 16.5% of the respondents in our sample are under 30 and living with someone at least 18 years older than they are, suggesting that our sample includes a substantial number of millennials with low individual income but who are living with their higher-income parents.

Table 2. Socioeconomic Characteristics of Workers.

Characteristic | Total sample (N = 9,770) | First respondents (N = 438) | National estimates
Household income (US$)
<14,999 | 11.7% [11.1, 12.3] | 11% [8.1, 13.9] | 12.5%
15,000-29,999 | 17.5% [16.8, 18.3] | 17.4% [13.9, 21.0] | 15.6%
30,000-49,999 | 24.6% [23.8, 25.5] | 25.8% [21.7, 29.9] | 18.5%
50,000-74,999 | 20.7% [19.9, 21.5] | 21.5% [17.7, 25.3] | 17.8%
75,000-99,999 | 12.2% [11.6, 12.9] | 12.6% [9.5, 15.7] | 12.1%
>100,000 | 12.9% [12.2, 13.6] | 11.7% [8.7, 14.7] | 23.5%
Household size | 2.82 [2.79, 2.85] | 2.76 [2.62, 2.90] | 2.64
Living with parents | 16.5% [15.8, 17.2] | 15.1% [11.8, 18.5] | –
Employment status
Employed full-time | 48.5% [47.5, 49.5] | 55.3% [50.6, 60.0] | 48.4%
Working part-time | 15.7% [15.0, 16.4] | 14.2% [10.9, 17.5] | 10.8%
Homemaker | 8.6% [8.0, 9.2] | 8% [5.5, 10.5] | 5.4%
Unemployed | 9.4% [8.8, 10.0] | 9.1% [6.4, 11.8] | 4.8%
Retired | 2.2% [1.9, 2.5] | 1.4% [0.3, 2.5] | 15.4%
Student | 11.9% [11.3, 12.5] | 7.5% [5.0, 10.0] | 6.4%
Permanent disability | 1.9% [1.6, 2.2] | 2.3% [0.9, 3.7] | 6.5%
Other | 1.7% [1.4, 2.0] | 2.3% [0.9, 3.7] | 1.2%
Education
Less than high school | 0.7% [0.5, 0.9] | 1.4% [0.3, 2.5] | 16.1%
High school or equivalent | 10.2% [9.6, 10.8] | 10.5% [7.6, 13.4] | 27.6%
Some college | 31.4% [30.5, 32.3] | 22.4% [18.5, 26.3] | 18.1%
2-year college degree | 11.7% [11.1, 12.3] | 9.6% [6.8, 12.4] | 9.1%
4-year college degree | 34.8% [33.9, 35.7] | 44.5% [39.9, 49.2] | 18.5%
Postgraduate degree | 11.1% [10.5, 11.7] | 11.6% [8.6, 14.6] | 10.6%

Note. 95% CI indicated in brackets. CI = confidence interval. U.S. Census Bureau (2011-2015). Bureau of Labor Statistics, U.S. Department of Labor (2016). U.S. Census Bureau (2016).

Table 3 summarizes the relationship status and characteristics of respondents, revealing that approximately a third of respondents are married and another third are single. In addition, we find that 1.5% of our sample report being currently engaged in a consensually nonmonogamous relationship (see Haupert, Gesselman, Moors, Fisher, & Garcia, 2016). As has been observed in other studies of other MTurk workers (Corrigan et al., 2015; Reidy et al., 2014; Shapiro et al., 2013), the proportion of lesbian, gay, and particularly bisexual respondents is higher than it is in the U.S. population as a whole. This is likely because online populations are disproportionately young, and younger people are also more likely to identify as LGB (Gates & Newport, 2012; Moore, 2015).

Table 3. Relationship Characteristics of Workers.

Characteristic | Total sample (N = 9,770) | First respondents (N = 438) | National estimates
Relationship status
Single | 32.3% [31.4, 33.2] | 36.5% [32.0, 41.0] | –
Casually dating | 5% [4.6, 5.4] | 5.7% [3.5, 7.9] | –
Monogamous | 60.6% [59.6, 61.6] | 56.8% [52.2, 61.4] | –
Consensually nonmonogamous | 1.5% [1.3, 1.7] | 0.7% [0.0, 1.5] | –
Other/refused | 0.3% [0.2, 0.4] | 0.0% [0.0, 0.3] | –
Marital status
Never married | 42.8% [41.8, 43.8] | 46.1% [41.4, 50.8] | 32.8%
Married | 34.9% [34.0, 35.9] | 29.2% [24.9, 33.5] | 48.2%
Partnered | 14.2% [13.5, 14.9] | 16.4% [12.9, 19.9] | –
Separated | 1.2% [1.0, 1.4] | 0.5% [0.0, 1.2] | 2.1%
Divorced | 6% [5.5, 6.5] | 7.3% [4.9, 9.7] | 11.0%
Widowed | 0.8% [0.6, 1.0] | 0.5% [0.0, 1.2] | 5.9%
Sexual orientation
Lesbian or gay | 3.8% [3.4, 4.2] | 2.3% [0.9, 3.7] | 1.7%
Bisexual | 6.9% [6.4, 7.4] | 6.6% [4.3, 8.9] | 1.8%
Straight | 86.8% [86.1, 87.5] | 88.8% [85.9, 91.8] | 96.5%
Other | 2.2% [1.9, 2.5] | 2.1% [0.8, 3.4] | –

Note. 95% CI indicated in brackets. CI = confidence interval. U.S. Census Bureau (2011-2015). General Social Survey (as reported and summarized in Gates, 2014).

Finally, summary statistics for the attitudinal and personality measures are reported in Table 4. Consistent with earlier research, workers were more likely to identify as Democrats than are members of the general population (Berinsky et al., 2012; Mullinix et al., 2015). Relatively few workers identified as religious, a disproportionate number identified as atheists, and reported rates of church attendance were generally low. Relative to normed data obtained from a large convenience sample of Internet users (Gosling, Rentfrow, & Potter, 2014), MTurk workers reported being about two thirds of a standard deviation less extraverted, about a third of a standard deviation less open to new experiences, and only slightly less agreeable, conscientious, or emotionally stable.

Table 4. Attitudinal and Personality Characteristics of Workers.

Characteristic | Total sample (N = 9,770) | First respondents (N = 438) | National estimates
Political affiliation
Identifies as Republican | 17.9% [17.1, 18.7] | 18.3% [14.7, 21.9] | 28.8% [27.0, 30.5]
Identifies as Democrat | 41.3% [40.3, 42.3] | 47% [42.3, 51.7] | 34.9% [33.1, 36.7]
Ideology (1 = extremely liberal, 7 = extremely conservative) | 3.39 [3.36, 3.42] | 3.31 [3.16, 3.46] | 4.26 [4.20, 4.31]
Religion
Christian (Mainline Protestant) | 16% [15.3, 16.7] | 13.3% [10.1, 16.5] | 11.7% [10.5, 12.8]
Christian (Evangelical) | 8.5% [8.0, 9.1] | 8.6% [6.0, 11.3] | 21.3% [19.7, 22.8]
Christian (Catholic) | 11.4% [10.8, 12.0] | 14.3% [11.0, 17.6] | 22.4% [20.9, 24.0]
Christian (Other/not specified) | 10% [9.4, 10.6] | 7.3% [4.9, 9.7] | 13.8% [12.5, 15.1]
Jewish | 1.2% [1.0, 1.4] | 0.5% [0.0, 1.2] | 2.2% [1.7, 2.8]
Muslim | 0.6% [0.5, 0.8] | 1.4% [0.3, 2.5] | 1.0%
Atheist | 20.4% [19.6, 21.2] | 25.5% [21.4, 29.6] | 3.1%
Nothing in particular | 24.6% [23.8, 25.5] | 23.6% [19.6, 27.6] | 24.0% [22.3, 25.6]
Other | 7% [6.5, 7.5] | 5.3% [3.2, 7.4] | 4.7% [3.9, 5.5]
Religiosity
Attends at least weekly | 9.2% [8.6, 9.8] | 6.9% [4.5, 9.3] | 21.4% [19.8, 22.9]
Attends at least monthly | 12.1% [11.5, 12.8] | 13.1% [9.9, 16.3] | 11.3% [10.1, 12.5]
Attends a few times per year | 24.2% [23.4, 25.1] | 22.6% [18.7, 26.5] | 24.1% [22.5, 25.8]
Never attends | 54.1% [53.1, 55.1] | 57.4% [52.8, 62.0] | 43.2% [41.3, 45.1]
Big Five personality traits (1 = low, 7 = high)
Extraversion | 3.58 [3.55, 3.61] | 3.48 [3.33, 3.63] | 4.13 [4.09, 4.18]
Agreeableness | 5.11 [5.09, 5.13] | 5.18 [5.06, 5.30] | 5.11 [5.07, 5.15]
Conscientiousness | 5.24 [5.21, 5.27] | 5.40 [5.28, 5.52] | 5.63 [5.59, 5.67]
Emotional stability | 4.70 [4.67, 4.73] | 4.90 [4.76, 5.04] | 4.92 [4.87, 4.97]
Openness | 5.09 [5.07, 5.11] | 4.86 [4.74, 4.98] | 4.81 [4.77, 4.85]

Note. 95% CI indicated in brackets. CI = confidence interval. Population estimates derived from American National Election Studies 2012 time series unless otherwise noted. Pew Research Center (2016a). Pew Research Center (2016b).

The vast majority (92.5%) of participants in our study completed the survey on a computer. Of the remaining participants, 2% completed the survey using a tablet, 4.5% using a phone, and the rest using other devices (e.g., game consoles) or devices that could not be identified. Rates of mobile device use are somewhat lower than have been noted in other online panels (de Bruijne & Wijnant, 2014a, 2014b).

Sample Differences by Time of Completion

The focus of our investigation is how the composition of the MTurk worker pool varied across days of the week, across time of day, and across the serial order in which they participated. Main findings of these analyses are summarized in Table 5. We looked for variations within the following variables: age, gender identity, education, employment, household income, household size, race, Latino ethnicity, socioeconomic status, sexual orientation, relationship status, party identification, religion, and religiosity. Our survey design allowed respondents to identify as more than one race, so we treated each racial category (White, Black or African American, Asian American, American Indian or Alaskan Native, Native Hawaiian or Pacific Islander, or Other) as its own binary dependent variable. We also looked for differences in the Big Five personality traits: extraversion, agreeableness, conscientiousness, emotional stability, and openness. Finally, we examined workers’ prior experience and where they reported finding the survey.

In two instances, similar and highly correlated variables were collected for purposes irrelevant to the present study. In each case, only one variable was selected for analysis. The first instance was marital status and relationship status. We selected marital status for analysis because this variable is more typically recorded in national surveys and therefore more relevant for this demographic analysis. The second instance was political ideology and party affiliation. We conducted the analyses using political ideology, but results are identical when party identification is used instead.

To limit the number of comparisons, some response options were collapsed into broader categories (e.g., specific denominations of Christianity were collapsed into a single category). In total, given the coding, our final analysis included 31 different demographic variables.

For all continuous, ordinal, and binomial variables, generalized linear modeling (GZLM) was used to regress each dependent variable on (a) the day of the week (categorical), (b) the time of day the batch was posted (categorical), (c) the serial position of the batch within the data collection run (continuous), (d) the serial position of the individual response within the batch (continuous), and (e) a dichotomous variable representing the amount of compensation (categorical) to control for possible effects of increasing payment part way through the study. Interval dependent measures were treated as linear effects, except for worker experience (i.e., the total number of MTurk HITs already completed), which was modeled using a negative binomial distribution. This approach was adapted to multinomial regression to evaluate differences in religion, as SPSS’ implementation of GZLM cannot be used for multinomial variables.

Including so many independent and dependent variables brings with it the risk of false positives. To mitigate this risk, we limited the number of comparisons by not including interactions in the model. We also limited the comparisons of each time or day to the grand mean for all times and days (rather than individual comparisons against all other times or days). For example, we compared the mean percentage of college graduates in batches posted on Tuesdays with the mean percentage of college graduates in all batches (including Tuesdays). This approach led to a total of 13 significance tests for each of the 29 demographic variables and two MTurk behavior variables (worker experience and where they found the study), for a total of 403 comparisons (13 × 31).

To further reduce the potential for false positives, we set the alpha criterion at .01, rather than the more typical .05, and used the Benjamini–Hochberg adjustment (Benjamini & Hochberg, 1995) to hold the false discovery rate across all

Table 5. Significant Results by Time of Day, Day of Week, Serial Position, and Pay Rate.

Outcome | Contrast | Wald | p | d | Interpretation
Time of day effects
Time zone | 10 a.m. vs. mean | 71.93 | <.00001 | 0.17 | More workers from eastern time zones at 10 a.m. ET
Time zone | 10 p.m. vs. mean | 68.12 | <.00001 | 0.17 | More workers from western time zones at 10 p.m. ET
Worker experience | 10 p.m. vs. mean | 43.67 | <.00001 | 0.13 | Workers are less experienced at 10 p.m. ET
Worker experience | 10 a.m. vs. mean | 27.78 | <.00001 | 0.11 | Workers are more experienced at 10 a.m. ET
% completed by smartphone | 10 p.m. vs. mean | 18.01 | <.00001 | 0.09 | Workers more likely to use phones at 10 p.m. ET
Relationship status | 10 a.m. vs. mean | 16.91 | <.00001 | 0.08 | Workers less likely to be single at 10 a.m. ET
Relationship status | 10 p.m. vs. mean | 16.63 | <.00001 | 0.08 | Workers more likely to be single at 10 p.m. ET
Found HIT outside of MTurk | 10 a.m. vs. mean | 16.01 | <.00001 | 0.08 | Workers less likely to find the HIT outside of MTurk at 10 a.m. ET
% Asian American | 10 p.m. vs. mean | 15.51 | <.00001 | 0.08 | Workers more likely to be Asian American at 10 p.m. ET
% Asian American | 10 a.m. vs. mean | 15.24 | <.00001 | 0.08 | Workers less likely to be Asian American at 10 a.m. ET
Found HIT outside of MTurk | 3 p.m. vs. mean | 13.07 | .0003 | 0.07 | Workers more likely to find the HIT outside of MTurk at 3 p.m. ET
Conscientiousness | 10 p.m. vs. mean | 11.53 | .0007 | 0.07 | Workers are less conscientious at 10 p.m. ET
Day of week effects
Found HIT outside of MTurk | Sat vs. mean | 35.87 | <.00001 | 0.12 | Workers less likely to find the HIT outside MTurk on Saturday
Found HIT outside of MTurk | Thurs vs. mean | 35.52 | <.00001 | 0.12 | Workers more likely to find the HIT outside MTurk on Thursday
Age | Sat vs. mean | 35.08 | <.00001 | 0.12 | Workers were older on Saturdays
Age | Thurs vs. mean | 32.14 | <.00001 | 0.11 | Workers were younger on Thursdays
Employment status | Sun vs. mean | 14.01 | .0002 | 0.08 | Workers more likely to have full-time jobs; less likely to lack formal employment altogether (no change in part-time status)
Age | Wed vs. mean | 12.47 | .0004 | 0.07 | Workers were younger on Wednesdays
Found HIT outside of MTurk | Sun vs.
ET % Asian American 10 a.m. vs. mean 15.24 <.00001 0.08 Workers less likely to be Asian American at 10 a.m. ET Found HIT outside of MTurk 3 p.m. vs. mean 13.07 .0003 0.07 Workers more likely to find the HIT outside of MTurk at 3 p.m. ET Conscientiousness 10 p.m. vs. mean 11.53 .0007 0.07 Workers are less conscientious at 10 p.m. ET Day of week effects Found HIT outside of MTurk Sat vs. mean 35.87 <.00001 0.12 Workers less likely to find the HIT outside MTurk on Saturday Found HIT outside of MTurk Thurs vs. mean 35.52 <.00001 0.12 Workers more likely to find the HIT outside MTurk on Thursday Age Sat vs. mean 35.08 <.00001 0.12 Workers were older on Saturdays Age Thurs vs. mean 32.14 <.00001 0.11 Workers were younger on Thursdays Employment status Sun vs. mean 14.01 0.0002 0.08 Workers more likely to have full-time jobs; less likely to lack formal employment altogether (no change in part- time status) Age Wed vs. mean 12.47 0.0004 0.07 Workers were younger on Wednesdays Found HIT outside of MTurk Sun vs. 
mean 12.12 0.0005 0.07 Workers less likely to find the HIT outside MTurk on Sunday Overall serial position effects Worker experience Linear effect 460.68 <.00001 0.44 Workers more experienced earlier in the data collection Emotional stability Linear effect 38.20 <.00001 0.13 Workers more emotionally stable earlier in the data collection Age Linear effect 26.67 <.00001 0.1 Workers were older earlier in the data collection Conscientiousness Linear effect 23.96 <.00001 0.1 Workers more conscientious earlier in the data collection Agreeableness Linear effect 23.44 <.00001 0.1 Workers more agreeable earlier in the data collection Employment status Linear effect 12.55 .0004 0.07 Workers more likely to have full-time jobs earlier in the data collection Household size Linear effect 12.36 .0004 0.07 Workers come from smaller households earlier in the data collection Within-batch serial position effects Worker experience Linear effect 35.27 <.00001 0.12 More experienced workers respond to an available HIT faster Sex Linear effect 26.99 <.00001 0.09 Female workers respond to an available HIT faster % Asian American Linear effect 18.52 <.00001 0.08 Asian workers respond to an available HIT slower Age Linear effect 14.06 .0002 0.08 Younger workers respond to an available HIT slower Found HIT outside of MTurk Linear effect 159.38 <.0001 0.26 Workers who completed the HIT sooner were less likely to have found it outside MTurk Pay effects Worker experience High vs. low pay 78.69 <.00001 0.18 Workers more experienced once pay was increased Emotional stability High vs. Low Pay 13.35 .0003 0.07 Workers more emotionally stable once pay was increased Note. This table includes the 33 comparisons that revealed statistically significant differences. We only report effect sizes for statistically significant results. The entries in the table are sorted by type of temporal variation, and then by ascending order of effect size. 
As noted in the text, we used the Benjamini–Hochberg adjustment for multiple comparisons and consider all p-values less than .0007 to be statistically significant (this ensures that the false discovery rate across all comparisons is held constant at .01). ET = Eastern Time; HITs = Human Intelligence Tasks; MTurk = Amazon Mechanical Turk. Casey et al. 9 comparisons constant at .01 across all tests. Following these 48.6% of workers at 10 p.m. Eastern Time reside in the U.S. adjustments, no results with an unadjusted p value above Eastern time zone, while 18.9% of workers were from the .0007 are reported as statistically significant, and of the sig- U.S. Pacific time zone. nificant results that we report, only four are expected to be The proportion of Asian American respondents also false positives observed by chance alone. Table 5 includes increased over the course of the day, growing from 5.9% at the 33 statistically significant differences among the 403 10 a.m. to 7.6% at 3 p.m. to 9% at 10 p.m. The proportion of comparisons. Asian Americans was significantly lower than average at 10 a.m. (β = −.016, Wald χ = 15.24, p < .0001, d = .08) and Day of week effects. Of our 217 day-of-week comparisons, significantly higher than average at 10 p.m. (β = .016, Wald we found seven instances in which the attributes of partici- χ = 15.49, ps < .0001, d = .08). This effect was no longer pants recruited on a particular day of the week significantly significant, however, when controlling for time zone, sug- differed from the sample as a whole. These findings are sum- gesting that this difference reflects that more Asian American marized in Table 5. workers live on the west coast. The average age of respondents varied as a function of the Other differences were observed that were not an artifact day of the week. Participants on Wednesday (M = 32.4, SD = of time zone. 
The proportion of single workers increased lin- 10.78) and Thursday (M = 32.46 SD = 10.67; β = −1.04, early throughout the day from 29.1% at 10 a.m. to 32.2% at 2 2 Wald χ = 12.47, p < .001, d = .07 and β = −1.44, Wald χ = 3 p.m. to 34.9% at 10 p.m. The proportion of workers who 32.14, p < .0001, d = .11, respectively) were somewhat are single was significantly lower than average at 10 a.m. (β younger than the sample as a whole (M = 33.51, SD = 11.31). = −.03, Wald χ = 16.91, p < .0001, d = .08) and significantly Respondents completing the survey on Saturday were some- higher than average at 10 p.m. (β = .03, Wald χ = 16.62, p < what older than average (M = 35.84, SD = 12.47; β = 1.88, .0001, d = .08). Wald χ = 35.09, p < .0001, d = .12). More workers who completed the survey at 10 p.m. used People completing HITs on Sundays were more likely to smartphones (5.8%) than across the sample as a whole be employed full-time (52%) than the sample as a whole (3.7%; β = .014, Wald χ = 18.01, p < .0001 d = .09). Workers (48.5%; β = .21, Wald χ = 14.01 p = .0002, d = .08), with a recruited at 10 p.m. also reported being less conscientious (M corresponding decrease in the proportion of individuals = 5.18, SD = 1.31) than the sample as a whole (M = 5.24, SD without any formal employment (31.2% as compared with = 1.27; β = −.06, Wald χ = 11.53, p = .0007, d = .07). 35.7%). The proportion of workers employed part-time was Workers who completed the HIT at 10 a.m. were less roughly the same across all days of the week. likely to report having found the HIT outside of the MTurk Workers were less likely to find the survey outside of interface (8.5%) than the sample as a whole (9%; β = −.014, MTurk on Saturday (3.4%) or Sunday (6%) than the sample Wald χ = 16.01, p < .0001, d = .08). Workers who completed as a whole (9%; β = −.04, Wald χ = 35.87 p < .0001, d = .12 the HIT at 3 p.m. 
were more likely (9.7%) to have found the 2 2 and β = −.02, Wald χ = 12.12 p = .0005, d = .07, respec- HIT outside of the MTurk interface (β = .013, Wald χ = tively). Workers who completed the survey on a Thursday 13.07, p = .0003, d = .07). were much more likely to have found it on a source outside of Finally, relative to the sample as a whole (M = 4.67, SD = MTurk (15.3%; β = .04, Wald χ = 35.52 p < .0001, d = .12). 10.04), more experienced workers tended to participate in the morning (M = 5.02, SD = 10.57; β = .43, Wald χ = 27.77, Time of day effects. Of our 93 time-of-day compari- p < .0001, d = .11) and less likely to do so at night (M = 4.75, sons, we found 12 instances in which attributes of par- SD = 8.29; β = −.43, Wald χ = 43.67, p < .0001, d = .13). ticipants recruited at a particular time of day differed significantly from the grand mean. These differences Overall serial position effects. Of our 31 positional com- generally reflected linear trends in the composition of the parisons, we found seven instances in which the attributes MTurk workforce throughout the day, and are summarized of participants differed over time. Workers who com- in Table 5. pleted HITs earlier in the data collection process reported As might be expected, one of the most pronounced conse- higher levels of emotional stability, conscientiousness, quences of posting at different times was variation in the pro- and agreeableness. Participants who completed earlier portion of workers from different time zones. People in batches of HITs also tended to be older were more likely earlier time zones were more likely than average to complete to have a full-time job and live in smaller households. HITs posted at 10 a.m. (β = −.15, Wald χ = 71.92, p < .0001, Workers who completed HITs earlier were also substan- d = .17). Conversely, people in later time zones were more tially more experienced than workers recruited later in the likely to complete HITs posted at 10 p.m. 
(β = .13, Wald χ = study (Table 6). 68.11, p < .0001, d = .17). As an illustration of the conse- quences of this shift, 56.8% of respondents at 10 a.m. Eastern Within-batch serial position effects. Of our 31 positional Time were from the U.S. Eastern time zone while only 10.9% comparisons within batch, we found five instances in which of workers were from the Pacific Time zone. In contrast, the attributes of participants recruited earlier in a given batch 10 SAGE Open Table 6. Worker Characteristics as a Function of Serial Position Across Study. Respondent 2065 (−1 SD) Respondent 7706 (+1 SD) Linear trend Age 34.79 [34.41, 3.16] 32.62 [32.01, 33.22] β = −.00038, Wald χ = 26.67, p < .001, d = .10 Household size 2.73 [2.69, 2.78] 2.92 [2.85, 2.99] β = .00003, Wald χ = 12.36, p < .001, d = .07 Employed full-time 50% [48, 52] 45% [42, 47] β = −.00004, Wald χ = 12.55, p < .001, d = .07 Conscientiousness 5.34 [5.29, 5.38] 5.10 [5.03, 5.17] β = −.00004, Wald χ = 23.96 p < .001, d = .10 Agreeableness 5.20 [5.15, 5.24] 4.97 [4.91, 5.04] β = −.00004, Wald χ = 23.44 p < .001, d = .10 Emotional stability 4.83 [4.78, 4.88] 4.49 [4.41, 4.57] β = −.00006, Wald χ = 38.20 p < .001, d = .13 Worker experience 6.52 [6.29, 5.6.76] 2.66 [2.51, 2.83] β = −.00016, Wald χ = 460.68, p < .0001, d = .44 Table 7. Worker Characteristics as a Function of Serial Position Within Batches. First respondent in batch 100th responder in batch (−1 SD) (+1 SD) Linear trend Age 34.14 [33.81, 34.48] 33.26 [32.85, 33.67] β = −.0089, Wald χ = 14.06, p < .0001, d = .08 Female 56% [44%, 57%] 50% [48%, 52%] β = .0022, Wald χ = 26.99, p < .0001, d = .11 Asian American 7% [6%, 8%] 9% [8%, 10%] β = .0031, Wald χ = 18.52, p < .0001, d = .09 Found survey outside of 5% [4%, 6%] 10% [9%, 11%] β =.0074, Wald χ = 159.38, p < .0001, d = .26 Mechanical Turk Worker experience 4.46 [4.31, 4.61] 3.89 [3.74, 4.06] β = −.0013, Wald χ = 35.27, p < .0001, d = .12 differed from the attributes recruited later in the same batch. 
Workers who completed an available HIT earlier in a given batch were on average older, more likely to be female, and less likely to be Asian American. Workers who completed HITs sooner were also less likely to have found the survey on a source outside of MTurk but tended to be more experienced than workers recruited later in the study (Table 7).

Pay effects. Pay effects were included primarily to control for a change in design part way through data collection. Of the 31 payment comparisons, we found evidence of only two characteristics that changed once we offered to pay more. Controlling for other variables, workers in the high-pay condition reported higher emotional stability (M = 4.77, SD = 1.90) than workers paid less (M = 4.56, SD = 2.30; β = .32, Wald χ² = 13.35, p = .0003, d = .07). Workers were also more experienced when pay was higher (M = 4.77, SD = 1.90) than when pay was lower (M = 4.77, SD = 1.90; β = .47, Wald χ² = 78.69, p < .0001, d = .18). These results and all other significant intertemporal differences are summarized in Table 5.

Discussion

In this article, we have described demographic characteristics of a large sample of MTurk workers and examined differences across time, day, and serial position. Of our 403 demographic comparisons, we found 33 differences (8.2% of tested effects), and significant effects had an average effect size of d = 0.11. These findings provide evidence that MTurk samples vary intertemporally, but that in general these differences are small. An important caveat to these findings is that we recruited workers without allowing for replacement; that is, workers could only participate once. Differences between samples may be larger or smaller if workers are not restricted from participating more than once.

Demographic Differences by Day and Time

Day of the week influenced few (2%, or 4/203) demographic characteristics, and these effects were small (mean d = 0.09). To the extent that these effects were detectable, they suggest that samples collected over the weekend are more likely to include older and more fully employed respondents. These differences seem plausible, but the lack of differences across other characteristics suggests that potential day of week effects can be safely ignored.

Time of day resulted in similarly small effects (mean d = 0.10) but within a larger proportion (9%, or 8/87) of measured variables. In almost all cases, these differences represented linear trends in sample composition across the day, and thus when considering the potential impact of recruiting in the morning or in the evening, the combined impact of both effect size estimates should be considered. Of particular note, contrary to previous research (Komarov et al., 2013), we found that workers were more likely to use mobile devices late at night (5.8% of HITs posted at 10 p.m. were submitted from mobile phones, compared with 3.7% of HITs submitted during the rest of the day). Mobile device use can have adverse effects on data quality, including increased rates of attrition (Mavletova, 2013; Sommer, Diedenhofen, & Musch, 2016; Wells, Bailey, & Link, 2013) and shorter and fewer open-ended responses (Mavletova, 2013; Struminskaya, Weyandt, & Bosnjak, 2015). As a result, researchers might consider adjusting the time of day at which they post research studies or collect data if they hope to optimize mobile completion or collect open-ended responses. The large proportion of observed differences suggests that time of day effects might be a fruitful area of future research, both through expanding the range of variables that are examined and with a particular effort to understand how regional differences, differences in the active user population across time within regions, and changes in individual responses throughout the day combine to produce these differences.

Demographic Differences by Serial Position

The effects of serial position were more extensive than time-of-day and day-of-week effects; 21% (6/29) of across-sample serial position effects were significant, with an average effect size of d = .10, and 10% (3/29) of within-batch serial position effects were significant, with an average effect size of d = .09. Many of these across-sample findings are compatible with earlier studies of serial position effects. As observed in university subject pools, early respondents report higher levels of conscientiousness (Aviv et al., 2002; Ebersole et al., 2016). In general population samples, those who responded to surveys first tended to be older (Filion, 1975; Sigman et al., 2014). We observe similar results both across our entire sample and within individual batches of HITs. While other studies find that women are more likely to respond to requests to complete both mail surveys (Gannon et al., 1971) and web surveys (Sigman et al., 2014) quickly (Cooper et al., 1991; Ebersole et al., 2016), we find that women respond more quickly within batches, but not across the sample as a whole. Contrary to studies of race and serial position effects in other modes (Gannon et al., 1971; Sigman et al., 2014; Voigt et al., 2003), we found little evidence that racial diversity increased over time. Typically, later survey respondents belong to groups that are possible but difficult to contact. Only those who register with MTurk can take part in surveys posted on the platform. African American and Latino populations are underrepresented on MTurk, and so it may be that those individuals who are possible but difficult to contact through other modes of survey data collection are simply impossible to reach on MTurk.

When sampling error is unsystematic, larger samples more closely approximate the population. This is not so in the presence of systematic bias. As our sample increased, some biases (e.g., the Democratic tendencies of respondents) remained the same. In other cases, biases actually increased (e.g., age, employment, conscientiousness, and emotional stability). Thus, it is not a given that making a sample more representative of the U.S. MTurk worker population will also make it more representative of the U.S. population as a whole. Variations in demographic characteristics across the entire sample are also relevant to researchers who recruit workers from the available pool without replacement (e.g., to prevent workers from completing the same study twice). Of particular relevance, we found variations in the "Big Five" personality factors as a function of serial position. Workers who completed HITs earlier in the data collection process reported being slightly more emotionally stable, more conscientious, and more agreeable. These traits are associated with and may moderate other important variables, including respondent data quality and political behaviors and attitudes that might bias samples (for an excellent review, see Gerber, Huber, Doherty, & Dowling, 2011).

Variations in demographic characteristics associated with serial position within batches of HITs are important when considering whether to recruit respondents in large or small batches. It is particularly important to understand potential within-batch serial position effects because several third-party solutions (e.g., TurkGate, Goldin & Darlow, 2013; and TurkPrime, Litman, Robinson, & Abberbock, 2017) make it easy to divide data collection efforts into a large number of very small batches. By and large, we find that smaller batches will lead samples to be older and have more women, but will attenuate the overrepresentation of Asian American workers.

Differences in Worker Experience and Forum Use

Time of day and serial position were strongly related to how much MTurk experience respondents had and how workers found the survey. More experienced workers completed the survey earlier in data collection (both within and across batches). Variations in worker experience may be associated with greater exposure to survey tactics and experimental manipulations, which can have various effects on data quality. On one hand, more experienced workers are more familiar with common research questions, leading to practice effects (Chandler et al., 2014), potentially smaller effect sizes on commonly used experimental paradigms (Chandler et al., 2015), and potentially more extreme and less malleable attitudes toward topics that respondents are frequently asked about (Sturgis, Allum, & Brunton-Smith, 2009). On the other hand, more experienced workers may be more attentive and therefore may provide higher quality responses.

We also observed substantial intertemporal variation in workers using forums, with more referrals from links shared outside of MTurk happening in the afternoon and on Thursdays and fewer in evenings and on weekends. These differences may be relevant if researchers are concerned about respondents who have potentially seen information about a study prior to completing it. The longer a HIT is available, the more opportunity workers have to find it on an outside forum.

Although we did not vary pay rates experimentally, we nonetheless found that when we increased pay, there was a concomitant increase in the experience of survey participants. Together, we thus observed two separate patterns: (a) early responders to the survey tended to be more experienced workers and (b) when we increased the pay, the proportion of more experienced workers increased even further. If researchers are concerned that worker savviness might affect their findings (Krupnikov & Levine, 2014), they should be attentive to these possibilities when they post their studies.

Conclusion

This study is the largest and most comprehensive description of MTurk demographics that we are aware of and the first large-scale effort to examine intertemporal differences in sample composition (however, for a similar project, see Arechar, Kraft-Todd, & Rand, 2016). Data from our study of approximately 10,000 MTurk workers have allowed us to examine three key possible sources of temporal variation in MTurk sample composition: (a) time of day, (b) day of week, and serial position both (c) across the entire data collection and (d) within specific batches.

Taken as a whole, our results should serve as a source of both comfort and caution to scholars who use MTurk to recruit participants for their research. On one hand, we found only minimal day-of-week differences. However, we also showed that there are small but significant time-of-day variations in demographic composition, variations that bear closer scrutiny. The effects of serial position also warrant further study, as they emerged as persistent influences across multiple variables, including characteristics known to affect political and psychological attitudes (e.g., Big Five personality traits; Dietrich, Lasley, Mondak, Remmel, & Turner, 2012; Gerber et al., 2011). Differences in sample composition can compromise claims to generalizability and might lead to challenges with reproducing research findings as well (Peterson & Merunka, 2014). As is often the case, larger samples (and/or those recruited in such a way as to be more representative) are especially critical when researchers are concerned that heterogeneous treatment effects may reduce the external validity of a given sample.

Researchers should bear our findings in mind as they consider how best to recruit samples from MTurk. The intertemporal dynamics we have detailed are likely to be most relevant to researchers attempting to collect representative samples of the MTurk worker population, such as studies of MTurk worker behavior and attitudes that attempt to understand the dynamics of contract labor and piece-work in the "gig economy" (Aguinis & Lawal, 2013; Brawley & Pury, 2016). But researchers interested in other topics should pay attention to relationships such as those between serial position and psychological characteristics and consider including information about when and how many times they posted their HIT when reporting results. Perhaps most importantly, these findings demonstrate that the number of workers recruited and the size of batches used to recruit them can have a large effect on the average experience of sample respondents.

As MTurk and other similar online convenience samples become more widely used, it is increasingly important that we better understand who participates in these subject pools and when certain kinds of respondents are more likely to opt in relative to others. Such examinations will help researchers assess published results, especially (though not limited to) their generalizability across populations and over time.

This project suggests several directions for future research. Beyond extending the analysis of temporal effects to new variables, or examining intertemporal variation in other sources of data, future work could examine how other design choices affect sample composition, including whether researchers with poor ratings or tasks with low pay get substantively different samples than researchers with better ratings or tasks with higher pay. This is an important area for future research to examine, particularly as researchers continue or increase reliance on online data collection.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Notes

1. People of color is a commonly used umbrella term denoting racial and ethnic minorities in America, including African Americans, Latinos, Asians, and others.
2. We followed this general procedure when it was time to repost the HIT: first, close the existing HIT; second, prevent the workers who participated in the existing HIT from participating in future postings (using qualifications; see Chandler, Mueller, & Paolacci, 2014); third, post the new HIT.
3. The Benjamini–Hochberg adjustment does not identify specific false positives, but rather holds the number of false positives constant across many tests to a specified level.
4. Seven days by 31 variables produces 217 comparisons.
5. Three times of day (10 a.m., 3 p.m., 10 p.m.) by 31 demographic variables produces 93 comparisons.
6. Thirty-one demographic variables, treating time as a linear effect by batch number.
7. The size of these effects will depend on both the magnitude of difference between the samples on a given variable and the magnitude of the moderating effect this variable has on the theoretical relationship of interest (Ho, Imai, King, & Stuart, 2007).

References

Aguinis, H., & Lawal, S. O. (2013). eLancing: A review and research agenda for bridging the science–practice gap. Human Resource Management Review, 23, 6-17. doi:10.1016/j.hrmr.2012.06.003

American National Election Studies. (2012). The ANES 2012 Time Series Study [dataset]. Stanford University and University of Michigan [producers]. Available from www.electionstudies.org

Arechar, A. A., Kraft-Todd, G. T., & Rand, D. G. (2016). Turking overtime: How participant characteristics and behavior vary over time and day on Amazon Mechanical Turk. Retrieved from https://ssrn.com/abstract=2836946

Aviv, A. L., Zelenski, J. M., Rallo, L., & Larsen, R. J. (2002). Who comes when: Personality differences in early and later participation in a university subject pool. Personality and Individual Differences, 33, 487-496. doi:10.1016/S0191-8869(01)00199-4

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B: Methodological, 57, 289-300. doi:10.2307/2346101

Berinsky, A. J., Huber, G. A., & Lenz, G. S. (2012). Evaluating online labor markets for experimental research: Amazon.com's Mechanical Turk. Political Analysis, 20, 351-368. doi:10.1093/pan/mpr057

Brawley, A. M., & Pury, C. L. (2016). Work experiences on MTurk: Job satisfaction, turnover, and information sharing. Computers in Human Behavior, 54, 531-546.

Bureau of Labor Statistics, U.S. Department of Labor. (2016). Reasons people give for not being in the labor force, 2004 and 2014 on the Internet. The Economics Daily. Retrieved from https://www.bls.gov/opub/ted/2016/reasons-people-give-for-not-being-in-the-labor-force-2004-and-2014.htm

Casler, K., Bickel, L., & Hackett, E. (2013). Separate but equal? A comparison of participants and data gathered via Amazon's MTurk, social media, and face-to-face behavioral testing. Computers in Human Behavior, 29, 2156-2160. doi:10.1016/j.chb.2013.05.009

Chandler, J., Mueller, P., & Paolacci, G. (2014). Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers. Behavior Research Methods, 46, 112-130.

Chandler, J., Paolacci, G., Peer, E., Mueller, P., & Ratliff, K. A. (2015). Using nonnaive participants can reduce effect sizes. Psychological Science, 26, 1131-1139. doi:10.1177/0956797615585115

Chandler, J., & Shapiro, D. (2016). Conducting clinical research using crowdsourced convenience samples. Annual Review of Clinical Psychology, 12, 53-81. doi:10.1146/annurev-clinpsy-021815-093623

Cooper, H., Baumgardner, A. H., & Strathman, A. (1991). Do students with different characteristics take part in psychology experiments at different times of the semester? Journal of Personality, 59, 109-127. doi:10.1111/j.1467-6494.1991.tb00770.x

Corrigan, P. W., Bink, A. B., Fokuo, J. K., & Schmidt, A. (2015). The public stigma of mental illness means a difference between you and me. Psychiatry Research, 226, 186-191. doi:10.1016/j.psychres.2014.12.047

Couper, M. P. (2000). Review: Web surveys: A review of issues and approaches. Public Opinion Quarterly, 64, 464-494. doi:10.1086/318641

de Bruijne, M., & Wijnant, A. (2014a). Improving response rates and questionnaire design for mobile web surveys. Public Opinion Quarterly, 78, 951-962. doi:10.1093/poq/nfu046

de Bruijne, M., & Wijnant, A. (2014b). Mobile response in web panels. Social Science Computer Review, 32, 728-742. doi:10.1177/0894439314525918

Dietrich, B. J., Lasley, S., Mondak, J. J., Remmel, M. L., & Turner, J. (2012). Personality and legislative politics: The Big Five trait dimensions among U.S. state legislators. Political Psychology, 33, 195-210.

Ebersole, C. R., Atherton, O. E., Belanger, A. L., Skulborstad, H. M., Allen, J., Banks, J. B., . . . Nosek, B. A. (2016, June 28). Many labs 3: Evaluating participant pool quality across the academic semester via replication. Retrieved from osf.io/ct89g

Filion, F. L. (1975). Estimating bias due to nonresponse in mail surveys. Public Opinion Quarterly, 39, 482-492. doi:10.1086/268245

Flores, A., Herman, J. L., Gates, G. J., & Brown, T. N. T. (2016). How many adults identify as transgender in the United States? The Williams Institute. Retrieved from https://williamsinstitute.law.ucla.edu/research/how-many-adults-identify-as-transgender-in-the-united-states/

Fraley, R. C., & Vazire, S. (2014). The N-pact factor: Evaluating the quality of empirical journals with respect to sample size and statistical power. PLoS ONE, 9(10), e109019. doi:10.1371/journal.pone.0109019

Gannon, M. J., Nothern, J. C., & Carroll, S. J. (1971). Characteristics of nonrespondents among workers. Journal of Applied Psychology, 55, 586-588. doi:10.1037/h0031907

Gates, G. (2014). LGBT demographics: Comparisons among population-based surveys. The Williams Institute. Retrieved from https://williamsinstitute.law.ucla.edu/wp-content/uploads/lgbt-demogs-sep-2014.pdf

Gates, G., & Newport, F. (2012, October 18). Special report: 3.4% of U.S. adults identify as LGBT. Retrieved from http://opiniontoday.com/2012/10/18/special-report-3-4-of-u-s-adults-identify-as-lgbt/

Gerber, A. S., Huber, G. A., Doherty, D., & Dowling, C. M. (2011). The Big Five personality traits in the political arena. Annual Review of Political Science, 14, 265-287. doi:10.1146/annurev-polisci-051010-111659

Goldin, G., & Darlow, A. (2013). TurkGate (Version 0.4.0) [Software]. Retrieved from http://gideongoldin.github.io/TurkGate/

Gosling, S. D., Rentfrow, P. J., & Potter, J. (2014). Norms for the Ten Item Personality Inventory (Unpublished data). Retrieved from http://gosling.psy.utexas.edu/scales-weve-developed/ten-item-personality-measure-tipi/

Gosling, S. D., Rentfrow, P. J., & Swann, W. B. (2003). A very brief measure of the Big-Five personality domains. Journal of Research in Personality, 37, 504-528.

Guidelines for academic requesters. (2014). Version 1.1. Retrieved from https://irb.northwestern.edu/sites/irb/files/documents/guidelinesforacademicrequesters.pdf

Haupert, M. L., Gesselman, A. N., Moors, A. C., Fisher, H. E., & Garcia, J. R. (2016). Prevalence of experiences with consensual nonmonogamous relationships: Findings from two national samples of single Americans. Journal of Sex & Marital Therapy, 22, 1-17.

Hillygus, D. S., Jackson, N., & Young, M. (2014). Professional respondents in nonprobability online panels. In M. Callegaro, R. Baker, J. Bethlehem, A. S. Göritz, J. A. Krosnick, & P. J. Lavrakas (Eds.), Online panel research (pp. 219-237). John Wiley. Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/9781118763520.ch10/summary

Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis, 15, 199-236. doi:10.1093/pan/mpl013

Howe, L. D., Hargreaves, J. R., Ploubidis, G. B., De Stavola, B. L., & Huttly, S. R. A. (2011). Subjective measures of socio-economic position and the wealth index: A comparative analysis. Health Policy and Planning, 26, 223-232. doi:10.1093/heapol/czq043

John, O. P., & Srivastava, S. (1999). The Big Five trait taxonomy: History, measurement, and theoretical perspectives. In Handbook of personality: Theory and research (Vol. 2, pp. 102-138). New York: The Guilford Press.

Komarov, S., Reinecke, K., & Gajos, K. Z. (2013). Crowdsourcing performance evaluations of user interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 207-216). New York, NY: ACM. doi:10.1145/2470654.2470684

Krupnikov, Y., & Levine, A. S. (2014). Cross-sample comparisons and external validity. Journal of Experimental Political Science, 1, 59-80. doi:10.1017/xps.2014.7

Lakkaraju, K. (2015). A study of daily sample composition on Amazon Mechanical Turk. In N. Agarwal, K. Xu, & N. Osgood (Eds.), Social computing, behavioral-cultural modeling, and prediction (pp. 333-338). Springer. Retrieved from http://link.springer.com/chapter/10.1007/978-3-319-16268-3_39

Litman, L., Robinson, J., & Abberbock, T. (2017). TurkPrime.com: A versatile crowdsourcing data acquisition platform for the behavioral sciences. Behavior Research Methods, 49, 433-442. doi:10.3758/s13428-016-0727-z

Mavletova, A. (2013). Data quality in PC and mobile web surveys. Social Science Computer Review, 31(6), 725-743. doi:10.1177/0894439313485201

Moore, P. (2015). A third of young Americans say they aren't 100% heterosexual. YouGov. Retrieved from https://today.yougov.com/news/2015/08/20/third-young-americans-exclusively-

Pew Research Center. (2016a). A new estimate of the U.S. Muslim population. Retrieved from http://www.pewresearch.org/fact-tank/2016/01/06/a-new-estimate-of-the-u-s-muslim-population/

Pew Research Center. (2016b). 10 facts about atheists. Retrieved from http://www.pewresearch.org/fact-tank/2016/06/01/10-facts-about-atheists/

Ravallion, M., & Lokshin, M. (1999). Subjective economic welfare. The World Bank. Retrieved from http://elibrary.worldbank.org/doi/abs/10.1596/1813-9450-2106

Reidy, D. E., Berke, D. S., Gentile, B., & Zeichner, A. (2014). Man enough? Masculine discrepancy stress and intimate partner violence. Personality and Individual Differences, 68, 160-164. doi:10.1016/j.paid.2014.04.021

Shapiro, D. N., Chandler, J., & Mueller, P. A. (2013). Using Mechanical Turk to study clinical populations. Clinical Psychological Science, 1(2), 213-220. doi:10.1177/2167702612469015

Sigman, R., Lewis, T., Yount, N. D., & Lee, K. (2014). Does the length of fielding period matter? Examining response scores of early versus late responders. Journal of Official Statistics, 30, 651-674. doi:10.2478/jos-2014-0042

Sommer, J., Diedenhofen, B., & Musch, J. (2016). Not to be considered harmful: Mobile-device users do not spoil data quality in web surveys. Social Science Computer Review, 35(3), 378-387. doi:10.1177/0894439316633452

Stewart, N., Ungemach, C., Harris, A. J. L., Bartels, D. M., Newell, B. R., Paolacci, G., & Chandler, J. (2015). The average laboratory samples a population of 7,300 Amazon Mechanical Turk workers. Judgment and Decision Making, 10, 479-491.

Struminskaya, B., Weyandt, K., & Bosnjak, M. (2015). The effects of questionnaire completion using mobile devices on data quality: Evidence from a probability-based general population panel.
Methods, Data, Analyses: A Journal for Quantitative heterosexual/ Methods and Survey Methodology, 9, 261-292. doi:10.12758/ Mullinix, K. J., Leeper, T. J., Druckman, J. N., & Freese, J. mda.2015.014 (2015). The generalizability of survey experiments. Journal Sturgis, P., Allum, N., & Brunton-Smith, I. (2009). Attitudes of Experimental Political Science, 2, 109-138. doi:10.1017/ over time: The psychology of panel conditioning. In P. Lynn XPS.2015.19 (Ed.), Methodology of longitudinal surveys (pp. 113-126). Paolacci, G., & Chandler, J. (2014). Inside the turk: Understanding John Wiley. Retrieved from http://onlinelibrary.wiley.com/ Mechanical Turk as a participant pool. Current Directions doi/10.1002/9780470743874.ch7/summary in Psychological Science, 23, 184-188. doi:10.1177/ U.S. Census Bureau. (2011-2015). American community survey 0963721414531598 5-year estimates. Retrieved from https://www.census.gov/data/ Paolacci, G., Chandler, J., & Ipeirotis, P. G. (2010). Running developers/data-sets/acs-5year.html experiments on Amazon Mechanical Turk (SSRN Scholarly U.S. Census Bureau. (2016). Current population survey, annual Paper No. ID 1626226). Rochester, NY: Social Science social and economic supplement. Retrieved from https://www. Research Network. Retrieved from http://papers.ssrn.com/ census.gov/cps/data/cpstablecreator.html abstract=1626226 Voigt, L. F., Koepsell, T. D., & Daling, J. R. (2003). Characteristics Peer, E., Brandimarte, L., Samat, S., & Acquisti, A. (2017). Beyond of telephone survey respondents according to willingness to the turk: Alternative platforms for crowdsourcing behavioral participate. American Journal of Epidemiology, 157, 66-73. research. Journal of Experimental Social Psychology, 70, doi:10.1093/aje/kwf185 153-163. Weinberg, J., Freese, J., & McElhattan, D. (2014). Comparing data Peer, E., Vosgerau, J., & Acquisti, A. (2014). 
Reputation as a suf- characteristics and results of an online factorial survey between ficient condition for data quality on Amazon Mechanical Turk. a population-based and a crowdsource-recruited sample. Behavior Research Methods, 46, 1023-1031. doi:10.3758/ Sociological Science, 1, 292-310. doi:10.15195/v1.a19 s13428-013-0434-y Wells, T., Bailey, J., & Link, M. (2013). Filling the void: Gaining a Peterson, R. A., & Merunka, D. R. (2014). Convenience samples of better understanding of tablet-based surveys. Survey Practice, college students and research reproducibility. Journal of Business 6(1). Retrieved from http://www.surveypractice.org/index. Research, 67, 1035-1041. doi:10.1016/j.jbusres.2013.08.010 php/SurveyPractice/article/view/25 Casey et al. 15 Author Biographies Andrew Proctor is a doctoral candidate in the Department of Politics at Princeton University. His primary research interests are Logan S. Casey received his PhD from the University of Michigan. in LGBT politics, identity and political mobilization. He is a research analyst in Public Opinion at the Harvard Opinion Research Program in the Harvard T.H. Chan School of Public Dara Z. Strolovitch is an associate professor at Princeton Health. His research examines political psychology, emotion, and University, where she holds appointments in Gender and Sexuality public opinion, particularly in the context of LGBTQ politics. Studies, African American Studies, and the Department of Politics. Jesse Chandler received his PhD from the University of Michigan. Her teaching and research focus on interest groups and social He is a researcher at Mathematica Policy Research and adjunct fac- movements, political representation, and the intersecting politics ulty at the Institute for Social Research. He is interested in survey of race, class, gender, and sexuality. 
Her book, Affirmative methodology, online research studies, decision-making, and human Advocacy, addressed these issues through an examination of the computation. ways in which advocates for women, people of colour, and low- income people represent intersectionally marginalized subgroups Adam Seth Levine is an assistant professor of Government at Cornell University. of their constituencies.
SAGE Open – SAGE
Published: Jun 14, 2017
Keywords: political methodology; political science; social sciences; politics and social sciences; research methods; data collection; research methodology and design; reliability and validity; political behavior/psychology; psychology