The past decade has seen data collection in survey research migrate from paper-and-pencil measures to online surveys. Conducting survey research using online media is often more convenient and flexible, permitting the researcher to quickly and easily obtain data from a large number of participants (Truell, Bartlett, & Alexander, 2002). Other benefits include lower cost (Kraut, Olson, Banaji, Bruckman, Cohen, & Couper, 2004), fewer physical resources, simplified logistics, and the elimination of data entry errors. Initially, there was some concern over the equivalence of online data collection, as compared with in-person paper-and-pencil methods. However, there appears to be some consensus that these two approaches are largely equivalent in terms of the psychometric properties that can be expected (Cole, Bedeian, & Field, 2006; De Beuckalaer & Lievens, 2009; Meade, Michels, & Lautenschlager, 2007; Meyerson & Tryon, 2003; Stanton, 1998), as well as impression management/social desirability (Booth-Kewley, Edwards, & Rosenfeld, 1992) and data completeness (Stanton, 1998).

One limitation of this previous work, however, is that some studies have conflated differences in administration medium with differences in the population reached by using the medium. For example, Booth-Kewley et al. (1992) found differences between administration formats on an attitudes survey when a college population was used, with no differences found in a sample of professional Navy recruits. Moreover, Manfreda, Bosnjak, Berzelak, Haas, and Vehovar (2008) reported meta-analytic results of 45 studies comparing Web-based and other survey media, finding that differences in criteria such as response rate or dropout rate were dependent on the sample recruitment base in question. Online panels (i.e., pools of respondents who have agreed to be contacted for multiple survey opportunities) behaved differently than one-time respondents, generally showing a smaller difference across media. Other authors have also noted that the effects of administration medium may depend on the purpose and significance of the questionnaire (e.g., Ployhart, Weekley, Holtz, & Kemp, 2003; Richman, Kiesler, Weisband, & Drasgow, 1999).

These studies underscore an important and often unrealized benefit of using online media for organizational research: the potential to reach a wider and more diverse population (Barchard & Williams, 2008; Dandurand, Shultz, & Onishi, 2008). Often, survey research relies on a homogeneous sample of undergraduates from Western, educated, industrialized, rich, and democratic societies (WEIRD; Henrich, Heine, & Norenzayan, 2010a, 2010b). This is of special concern when the survey in question relates to job- or career-related constructs of interest to organizational researchers (Anderson, 2003; Landy, 2008; Locke, 1986; Ward, 1993). Some past work has shown that even among “particularistic” research (e.g., research that is concerned with narrowly defined independent and dependent variables), differences exist between students and working adults, and these differences can severely limit the generalizability of findings (Ward, 1993).

The recent mainstream introduction of globally reaching Internet technologies such as crowdsourcing may be a solution to the limited participant pool with which researchers must sometimes work (Gosling, Sandy, John, & Potter, 2010). Previous research has begun to shed light on the general demographic makeup of Mechanical Turk workers (Ipeirotis, 2010). Additionally, research has compared the quality of data from acceptability judgment experiments between Mechanical Turk workers and in-lab participants (Sprouse, 2011). Unfortunately, few studies have examined the types of people from crowdsourcing communities who participate in organizational psychology research. Thus, the goal of the present study is to provide a primer on the use of crowdsourcing for organizational survey research. In this article, we provide an overview of crowdsourcing, examine the demographic makeup of a crowdsourced sample, and systematically investigate the viability of crowdsourcing for providing quality data for survey research. Specifically, we compare group means from crowdsourced and university samples with respect to several commonly used measures in organizational research. We assess the quality of the data garnered from both samples by examining social desirability, reliability of scales, completion time, length of open-ended responses, and data consistency and completeness. We also compare the psychometric functioning of the measures across samples via invariance tests.

Crowdsourcing

The etymology of the term crowdsourcing can be traced to a Wired magazine article where the term outsourcing was modified to describe the recruitment of a global online workforce without the need for a traditional outsourcing company (Howe, 2006).

Although Howe (2006) did not clearly define crowdsourcing when he coined the term, he indicated that it was limited to for-profit businesses leveraging the Internet workforce. The term has been recently defined as “the intentional mobilization for commercial exploitation of creative ideas and other forms of work performed by consumers” (Kleemann, Voß, & Rieder, 2008, p. 22). On the basis of the current and emerging uses of crowdsourcing technologies, these definitions have become too narrow and should be expanded to include other uses for leveraging an independent global workforce. For example, when adventurer Steve Fossett was reported missing in 2007 after failing to return from a solo plane ride over the Sierra Nevada Mountains, an Internet-based initiative was employed in which thousands of independent, unpaid workers were tasked with using recently uploaded satellite images to search for any signs of a crash site ("Turk and Rescue," 2007). Recently, academic uses of crowdsourcing such as online research studies have become increasingly popular (e.g., Kittur, Chi, & Suh, 2008; Little, Chilton, Goldman, & Miller, 2009). For example, Heilman and Smith (2010) recruited 178 participants who produced 6,000 ratings on the quality of computer-generated questions on subject matter sourced from Wikipedia articles. In less than 24 h, Sharek (2010) used a crowdsourcing service to recruit 169 people to participate in an online study that used a video game to help measure user engagement. In another study, designed to investigate how people interpret line drawings and shaded images, 560 crowdsourced participants were asked to orient 250,000 gauges onto 3-D objects (F. Cole, Sanik, DeCarlo, Finkelstein, Funkhouser, Rusinkiewicz, & Singh, 2009). In the past, similar but noncrowdsourced studies were limited to a small number of motivated participants who were willing to spend up to 12 h in order to place the large number of gauges.

For the purposes of this article, crowdsourcing is operationally defined as the paid recruitment of an online, independent global workforce for the objective of working on a specifically defined task or set of tasks. The key features of this definition are that (1) workers are paid, (2) they can be recruited online from any geographic location, and (3) they are hired only to complete a defined task or set of tasks. In some ways, workers mirror undergraduate research pools in that the interaction between researcher and participant is of short duration and participants are assumed to be motivated primarily by extrinsic factors (e.g., financial compensation for workers or course credit for undergraduates).

The mechanisms by which individuals are recruited and the examples described in the literature suggest that crowdsourcing can be a vehicle for recruiting respondents who are more representative of the working adult population than is a university participant pool. Not only are large samples of participants readily available to complete surveys at a relatively low cost, but it is also likely that many of these participants will have more relevant work experience in career-oriented jobs than a typical sample composed largely of college freshmen and sophomores. A crowdsourced pool may also be more ethnically and educationally diverse. In sum, the use of crowdsourcing for organizational psychology research is a promising approach to collecting more representative samples, as compared with the commonly used undergraduate participant pool. However, the use of crowdsourcing raises a number of important questions that need to be empirically addressed. Specifically, we investigate the following questions:

  1. Research question 1

    What are the demographic characteristics of respondents from a crowdsourcing pool? Do they differ from the characteristics of a university participant pool?

  2. Research question 2

    How does the quality of the data obtained using crowdsourcing compare with that of a university participant pool?

  3. Research question 3

    Do the psychometric properties of commonly used organizational research surveys differ across undergraduate and crowdsourcing samples?

  4. Research question 4

    Do mean differences across undergraduate and crowdsourcing samples exist with respect to personality traits and attitudes of interest to organizational researchers?

  5. Research question 5

    Why do users participate in crowdsourcing?

Mechanical Turk

The most well-known crowdsourcing Website is Amazon’s Mechanical Turk (Amazon.com, 2010). It was chosen for the present study due to its growing popularity as a viable means for recruiting participants in academic research. Mechanical Turk began as an internal tool; Amazon’s impetus for releasing the service to the public in 2005 was the idea that there are many tasks that people can do better than computers, such as identifying and listing objects in a photograph. Traditionally, those tasks required large-scale, costly outsourcing initiatives. Mechanical Turk provides a means for businesses or individuals (known as requesters) to outsource small tasks referred to as human intelligence tasks (HITs) to a global workforce. For example, a business launching an online shopping Website may need to provide descriptive tags for potentially millions of product images, a task difficult for computer algorithms. Rather than hire temporary employees, the business could source individuals through Mechanical Turk and pay them a few cents per image description. Recent data from the Mechanical Turk Website revealed that there were over 270,000 available HITs, with pay ranging from US $0.01 to US $13.00 (Amazon.com, 2010).

Once a business or individual signs up for a requester account, a job request (HIT) can be created either from a blank template or by adjusting one of the supplied example HIT templates (e.g., Standard Survey) to suit the task. Requesters then enter a title, a task description, relevant keywords, and how much money they will pay for each assignment. All of this information is supplied to the workers when they search for HITs to complete. Additionally, requesters must specify how many assignments (unique workers) they need for the HIT, the expiration date for the HIT, and the length of time before a worker’s submission is automatically approved. Requesters can also filter workers by location (at the country level) and HIT approval rate, which reflects the typical quality of the worker’s submissions as indicated by previous requesters.
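
For researchers who prefer to script this setup rather than use the Web interface, the sketch below illustrates how a HIT resembling the one described here might be created programmatically. It is a minimal example assuming the boto3 Python SDK (a newer interface than the Web-only workflow described above); the survey URL, reward amount, assignment count, and qualification thresholds are illustrative placeholders rather than the settings used in this study.

    import boto3

    # Connect to the production requester endpoint (assumes AWS credentials are
    # already configured; a sandbox endpoint is available for testing).
    mturk = boto3.client(
        "mturk",
        region_name="us-east-1",
        endpoint_url="https://mturk-requester.us-east-1.amazonaws.com",
    )

    # An ExternalQuestion pointing workers to a (hypothetical) survey URL.
    question_xml = """<ExternalQuestion
      xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
      <ExternalURL>https://example.org/survey</ExternalURL>
      <FrameHeight>600</FrameHeight>
    </ExternalQuestion>"""

    hit = mturk.create_hit(
        Title="30-minute research survey on work attitudes",
        Description="Complete an online questionnaire; a completion code is shown at the end.",
        Keywords="survey, research, questionnaire",
        Reward="0.80",                      # payment per approved assignment, in USD
        MaxAssignments=270,                 # number of unique workers requested
        LifetimeInSeconds=7 * 24 * 3600,    # HIT expires after one week
        AssignmentDurationInSeconds=3600,   # time allotted to each worker
        AutoApprovalDelayInSeconds=3 * 24 * 3600,
        Question=question_xml,
        QualificationRequirements=[
            {   # filter workers by location, at the country level
                "QualificationTypeId": "00000000000000000071",
                "Comparator": "EqualTo",
                "LocaleValues": [{"Country": "US"}],
            },
            {   # filter workers by prior HIT approval rate
                "QualificationTypeId": "000000000000000000L0",
                "Comparator": "GreaterThanOrEqualTo",
                "IntegerValues": [95],
            },
        ],
    )
    print(hit["HIT"]["HITId"])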

Individuals 18 years and older can sign up for a free worker account that allows them to view and participate in HITs. A worker is allowed only one account, and each worker is assigned an alphanumeric worker ID that is used to track his or her performance and payment records. Workers can search for HITs by keyword, date, compensation amount, and time allotted to complete the HIT. Additionally, workers can browse all available HITs and read task descriptions before deciding to participate. Some HITs cannot be accessed until a related qualification exam has been taken. Requesters may design these qualification exams to ensure that workers possess a certain degree of skill or proficiency before they are allowed to participate in a HIT. Generally, if a worker produces low-quality work, it is at the discretion of the requester to reject the work and not pay the worker. If this happens, the worker’s HIT approval rate is lowered, and the transaction is reflected in the requester’s statistics.

Method

Participants

Two samples were collected; the first was from a traditional psychology participant pool, and the second was from Mechanical Turk. The undergraduate sample consisted of 270 undergraduate students enrolled in an entry-level psychology course at a large Southeastern research university in the U.S. Participants were compensated with course credit, as per standard university practice. The Mechanical Turk sample contained 270 adults, who were paid US $0.80 each for their participation. This level of compensation was chosen in an attempt to be close to the median pay rate for HITs requiring similar time commitments available at the time of data collection, although no centralized database exists to identify the true distribution of HIT compensation levels. Demographic information for each sample is presented in Table 1, including age, gender, nationality, location, employment status, tenure, education, and profession.

Table 1 Sample characteristics

Procedure

Mechanical Turk

A HIT was created that contained a brief description of the study and a link to an online informed consent form and questionnaire. After the questionnaire was completed, a completion code was presented to the participant. In order to receive compensation, the participant had to enter the completion code on the Mechanical Turk Website. A useful feature of Mechanical Turk is that it provides an administrative page that displays real-time submission statistics and completion codes. Once a completion code had been entered, the experimenter reviewed and approved the code, which automatically sent compensation to the participant’s account. This procedure ensured that identifying information associated with a participant’s worker ID was never linked to his or her responses.

Undergraduates

The study description was posted on a university Website managed by the psychology department, using language identical to that for the Mechanical Turk HIT. As with the Mechanical Turk sample, the undergraduates completed the questionnaire online, asynchronously, from a location of their choosing. Participants who chose to sign up after viewing the study description were given an HTML link to the questionnaire and informed consent form. To receive course credit for participation, participants entered their e-mail address using a separate questionnaire link (i.e., e-mail addresses were not connected to their responses).

Measures

Several measures were included because of their widespread use among organizational researchers; others were included for their relevance to previous research on online survey behavior. Reliability estimates were calculated separately for each sample; this information is presented in Table 2. Unless indicated otherwise, responses were given on a 5-point scale with anchors from strongly disagree to strongly agree.

Table 2 Scale reliabilities, mean comparisons, and descriptive statistics by sample

Internet knowledge

Internet knowledge was measured with a 13-item scale from Potosky (2007). An example item is, “I am familiar with html.”

Computer attitudes

Attitudes toward computers were measured with a 19-item scale from Garland and Noyes (2004). An example item is, “People who like computers are often not very sociable” (reverse coded).

Computer knowledge and experience

Computer knowledge and experience was measured with a 12-item scale from Potosky and Bobko (1998). An example item is, “I know how to recover deleted or ‘lost data’ on a computer or PC.”

Goal orientation

Learning goal orientation, performance-prove goal orientation, and performance-avoid goal orientation (four items each) were measured with VandeWalle’s (1997) scale.

Personality

Extraversion, agreeableness, neuroticism, openness, and conscientiousness (i.e., the Big 5) were measured with 20 items each from the International Personality Item Pool (Goldberg, 1999) version of the NEO-PI-R.

Open-ended questions

A number of open-ended questions were included at the end of the survey. These questions included the following: “Why did you take this survey?” “What was the best/worst thing about this survey?” “Would you be interested in participating in future studies on this topic? Why/why not?”

Mechanical Turk experience

For the Mechanical Turk sample only, a number of open-ended questions were included to assess experience, motivation, and usage patterns for the Mechanical Turk Website. These questions were the following: “How did you first hear of Mechanical Turk?” “How many HITs have you completed?” “How long have you been using Mechanical Turk?” “How many hours per month do you spend using Mechanical Turk?” “Why do you use Mechanical Turk?” Responses were content-coded by two raters; after discussion, there were no discrepancies in the coding.

Demographic measures

We also asked several demographic questions, such as age, gender, ethnicity, nationality, education level, profession, years of work experience, and current employment status.

Response behavior

Several measures of survey-taking behavior were collected to gain additional information about the quality of each sample’s responses. Response time was assessed as the number of minutes spent completing the survey, calculated from time stamps recorded at survey initiation and completion. The length of open-ended comments was assessed as the total word count across all open-ended responses. Finally, responses were flagged for deletion if respondents exited the survey before completion or if the total time spent working on the survey was less than 10 min. The proportion of cases that were flagged was then calculated.
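
As a concrete illustration, these response-behavior indices can be computed directly from a raw survey export. The sketch below assumes a pandas data frame with hypothetical column names (start_time, end_time, a completed indicator, and a few open-ended response columns); it is a generic reconstruction, not the script used in this study.

    import pandas as pd

    df = pd.read_csv("survey_export.csv", parse_dates=["start_time", "end_time"])
    open_ended = ["why_take_survey", "best_worst_thing", "future_interest"]  # hypothetical columns

    # Completion time in minutes, from the initiation and completion time stamps.
    df["minutes"] = (df["end_time"] - df["start_time"]).dt.total_seconds() / 60

    # Total word count across all open-ended responses.
    df["word_count"] = (
        df[open_ended]
        .fillna("")
        .apply(lambda row: sum(len(str(text).split()) for text in row), axis=1)
    )

    # Flag cases that exited before completion or spent less than 10 minutes.
    df["flag_behavior"] = (~df["completed"].astype(bool)) | (df["minutes"] < 10)
    print(df["flag_behavior"].mean())  # proportion of flagged cases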

Data quality

Excessive response consistency was assessed by selecting a pair of Likert-scale items that should have opposite responses (i.e., psychometric antonyms; Goldberg & Kilkowski, 1985): “seldom feel blue” and “often feel blue.” Cases were flagged if their responses to these two items were identical. Next, random responders were identified by selecting a pair of Likert-scale items that should have similar responses: “do things according to a plan” and “make a plan and stick to it.” Cases were flagged if their responses to these two items were more than 2 points apart. The total proportion of cases flagged under either rule was then computed. Finally, the Long String Index (Johnson, 2005) was calculated; this index measures the longest continuous string of identical responses for a given participant (e.g., selecting “strongly agree” for 15 consecutive items), providing an additional measure of inattentive responding.
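
The consistency screens and the Long String Index are straightforward to compute once item responses are in a data frame. The following minimal sketch makes the same assumptions as the previous one (1-5 Likert coding, hypothetical column names for the item pairs and for the block of items used to compute the longest run).

    import pandas as pd

    df = pd.read_csv("survey_export.csv")

    # Psychometric antonyms: identical responses to opposite-keyed items are suspect.
    antonym_flag = df["seldom_feel_blue"] == df["often_feel_blue"]

    # Psychometric synonyms: responses more than 2 points apart are suspect.
    synonym_flag = (df["plan_things"] - df["stick_to_plan"]).abs() > 2

    df["flag_consistency"] = antonym_flag | synonym_flag
    print(df["flag_consistency"].mean())  # proportion flagged under either rule

    # Long String Index: longest run of identical consecutive responses per person.
    def long_string_index(responses):
        longest = run = 1
        for previous, current in zip(responses, responses[1:]):
            run = run + 1 if current == previous else 1
            longest = max(longest, run)
        return longest

    item_columns = [c for c in df.columns if c.startswith("item_")]  # hypothetical naming
    df["long_string"] = df[item_columns].apply(lambda r: long_string_index(list(r)), axis=1)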

Social desirability

Socially desirable responding was measured with the 33-item scale from Crowne and Marlowe (1960), with a true–false response format. Example items are, “I never hesitate to go out of my way to help someone in trouble” and “There have been times when I have been quite jealous of the good fortune of others” (reversed). High scores on this scale indicate a desire to “fake good” and respond to survey items in a socially desirable manner; low scores indicate more honest responding.

Scale reliability

Cronbach’s coefficient alpha was calculated for each sample as a measure of internal consistency. Coefficient alpha values were then compared across samples via a chi-square statistic described by Feldt, Woodruff, and Salih (1987), using the AlphaTest program (Lautenschlager & Meade, 2008).
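
For readers who wish to reproduce this step without the AlphaTest program, the sketch below computes coefficient alpha for each sample and compares the two values with Feldt’s (1969) simpler two-independent-samples F ratio rather than the Feldt, Woodruff, and Salih (1987) chi-square statistic used here, so it approximates rather than replicates the reported test.

    import numpy as np
    from scipy import stats

    def cronbach_alpha(items):
        """items: 2-D array (respondents x scale items), complete cases only."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1)
        total_variance = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

    def compare_alphas(items_group_1, items_group_2):
        """Approximate test of equal alphas in two independent samples:
        W = (1 - alpha_1) / (1 - alpha_2), referred to an F distribution."""
        a1 = cronbach_alpha(items_group_1)
        a2 = cronbach_alpha(items_group_2)
        w = (1 - a1) / (1 - a2)
        df1, df2 = len(items_group_1) - 1, len(items_group_2) - 1
        p = 2 * min(stats.f.cdf(w, df1, df2), stats.f.sf(w, df1, df2))  # two-tailed
        return a1, a2, w, p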

Measurement invariance analysis

The measurement invariance of each of the personality and goal orientation scales was investigated using an item response theory (IRT)-based likelihood ratio test. Prior to IRT analysis, items were reverse coded where appropriate, and flagged cases (as described above) were deleted. Cases were also deleted on a scale-by-scale basis if they contained any missing data. In order to perform invariance analyses using IRT, it was necessary to first establish the dimensionality of each scale. A principal components analysis (PCA) was performed separately for each scale, and for seven of the eight scales, scree plots and eigenvalues suggested that the scales were clearly unidimensional (e.g., first eigenvalues well above 1.0, second eigenvalues near or below 1.0, with scree plots showing a clear drop). For the conscientiousness scale, however, the PCA suggested that more than one factor might underlie the data. A follow-up exploratory factor analysis with principal factors extraction and oblique rotation revealed factors that were not conceptually distinct. In the end, for all eight scales, we opted to use the PCA results, retaining only one factor and including in the IRT analyses only those items that loaded more than .40 on that factor. The total number of items retained for subsequent analyses for each scale is reported in Table 3.

Table 3 Differential item functioning
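
As an illustration of the dimensionality screening and item-retention rule described above, the following sketch performs a PCA on the item correlation matrix for one scale and applies the .40 loading cutoff. It is a generic reconstruction of the procedure, not the original analysis script.

    import numpy as np

    def retain_items(items, cutoff=0.40):
        """items: 2-D array (respondents x items) for one scale, complete cases only.
        Returns the eigenvalues of the item correlation matrix and the indices of
        items whose loading on the first principal component exceeds the cutoff."""
        corr = np.corrcoef(np.asarray(items, dtype=float), rowvar=False)
        eigenvalues, eigenvectors = np.linalg.eigh(corr)
        order = np.argsort(eigenvalues)[::-1]                    # largest eigenvalue first
        eigenvalues = eigenvalues[order]
        eigenvectors = eigenvectors[:, order]
        loadings = eigenvectors[:, 0] * np.sqrt(eigenvalues[0])  # first-component loadings
        retained = np.where(np.abs(loadings) > cutoff)[0]
        return eigenvalues, retained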

Data analysis used the IRT graded response model (Samejima, 1969), in which one a (discrimination) parameter and one fewer b (threshold) parameters than response options are estimated for each item (see Embretson & Reise, 2000, for a description). Each b parameter marks the boundary between the probability of responding with a given option or lower (e.g., 1) and the probability of responding with any higher option (e.g., 2 through 5). Preliminary analysis suggested that when fewer than 20 respondents in either group endorsed a given response option, standard errors for the associated b parameter were quite large. As a result, prior to the analyses, such sparse response categories were collapsed into an adjacent category. For instance, if 15 persons responded with a “1” for a given item, those persons’ responses were recoded as “2.” As a result, different items had different numbers of (recoded) response options. All items were then recoded such that the lowest response option was “0,” as required by the software used for these tests.
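
The category-collapsing rule can be sketched as follows. This is a generic reconstruction assuming, as in the example above, that sparse categories are merged upward into the next response option; the handling of a sparse highest category is not described in the article and is not covered here.

    import numpy as np

    def collapse_sparse_categories(responses_ref, responses_focal, min_count=20):
        """responses_ref, responses_focal: 1-D arrays of one item's responses (e.g., 1-5)
        for the two groups. Any category endorsed by fewer than min_count respondents
        in either group is merged into the next higher category (bottom up), and the
        remaining categories are recoded to consecutive integers starting at 0."""
        responses_ref = np.asarray(responses_ref)
        responses_focal = np.asarray(responses_focal)
        categories = sorted(set(responses_ref) | set(responses_focal))
        for low, high in zip(categories, categories[1:]):
            count_ref = int((responses_ref == low).sum())
            count_focal = int((responses_focal == low).sum())
            if min(count_ref, count_focal) < min_count:
                responses_ref = np.where(responses_ref == low, high, responses_ref)
                responses_focal = np.where(responses_focal == low, high, responses_focal)
        remaining = sorted(set(responses_ref) | set(responses_focal))
        recode = {old: new for new, old in enumerate(remaining)}
        recode_all = np.vectorize(recode.get)
        return recode_all(responses_ref), recode_all(responses_focal)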

The invariance of the eight scales was examined, one at a time, with the likelihood ratio test (LRT; Thissen, Steinberg, & Wainer, 1993), using the IRTLRDIF program (Thissen, 2001). The IRTLRDIF program first estimates a baseline model in which all item parameters are constrained to be equal across groups for a given scale. This baseline is then compared with a series of augmented models in which the parameters for a single item are free to vary across groups. The improvement in model fit associated with freeing these constraints is distributed as chi-square with degrees of freedom corresponding to the number of freed parameters. As with all chi-square-based statistics, the LRT is sensitive to sample size and has very high power when samples are large (Rivas, Gabriel, Stark, & Chernyshenko, 2009), potentially detecting even trivial noninvariance (Meade, 2010). As such, we also computed invariance effect sizes to indicate the practical importance of a lack of invariance (or differential functioning [DF]). Meade recently developed a taxonomy of potential invariance effect size measures based on item and scale expected scores. The most basic of these is the signed test difference in the sample (STDS), which can be interpreted as simply the difference in the two groups’ mean scores expected because of DF alone. A second index is the unsigned expected test score difference in the sample (UETSDS), which can be interpreted as the difference in scale scores due to DF alone, had the differences in scale scores uniformly “favored” one of the groups. The UETSDS is equivalent to the square root of Raju et al.’s (1995) NCDIF index. Additionally, the expected test score standardized difference (ETSSD) is reported as a test-level DF version of Cohen’s d (Meade, 2010).
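
For reference, these effect size indices can be written compactly. The following is a sketch in our own notation, consistent with the verbal definitions above and with the stated equivalence of the UETSDS to the square root of NCDIF; here T_F(theta_i) and T_R(theta_i) are the expected test scores for focal-group member i computed from the focal-group and reference-group item parameters, n_F is the focal-group sample size, and s_pooled is the pooled standard deviation of expected test scores (used to express the signed difference as a Cohen's d):

    \mathrm{STDS} = \frac{1}{n_F}\sum_{i=1}^{n_F}\bigl[T_F(\theta_i) - T_R(\theta_i)\bigr]

    \mathrm{UETSDS} = \sqrt{\frac{1}{n_F}\sum_{i=1}^{n_F}\bigl[T_F(\theta_i) - T_R(\theta_i)\bigr]^{2}} = \sqrt{\mathrm{NCDIF}}

    \mathrm{ETSSD} = \frac{\mathrm{STDS}}{s_{\mathrm{pooled}}}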

Results

Research question 1 concerned the demographic makeup of the crowdsourcing sample, as compared with a university sample. Table 1 contains a comparison of the two samples for demographic characteristics, including age, gender, ethnicity, nationality, education completed, employment status, and profession. The samples were similar in terms of gender and ethnicity; both samples were predominantly female and Caucasian. The crowdsourced sample was markedly more diverse in terms of education, employment status, and profession, with a wide range of professions and education levels represented. There were significant mean differences in age, such that the crowdsourcing sample (M = 32.93, SD = 10.68) was significantly older than the undergraduate sample (M = 18.68, SD = 1.35), t(527) = 21.48, p < .001, d = 1.87, consistent with expectations. A dramatically higher percentage of the crowdsourced sample was employed, either full-time or part-time. Additionally, for respondents from either sample who were employed, tenure in their current job was considerably longer in the crowdsourced sample (M = 5.11 years, SD = 5.33) than in the university sample (M = 1.54, SD = 1.51), t(291) = 5.96, p < .001, d = .78. As a whole, this information suggests that the crowdsourced sample was more attractive in terms of generalizability for organizational researchers.

Research question 2 concerned differences in response quality, as measured by differences in social desirability, reliability of scales, completion time, length of open-ended responses, and data consistency and completeness. The crowdsourced sample was significantly higher in social desirability, as demonstrated by t tests (see Table 4). However, internal consistency estimates tended to be higher in the Mechanical Turk sample than in the undergraduate sample (see Table 2), with the exception of the Internet knowledge measure. No significant differences were found with respect to completion time or word count. The Long String Index showed no differences between samples. Finally, a similar proportion of cases in each sample was flagged for incompleteness or inconsistent responding. As a whole, this information indicates that the data were of equal or perhaps better quality in the crowdsourcing sample, although slightly more susceptible to socially desirable responding.

Table 4 Data quality

Research question 3 concerned the measurement invariance of commonly used scales: the Big 5 personality measures and the goal orientation measures. As can be seen in Table 3, on the whole, most items tended to function equivalently across samples, with only one or two DF items per scale. Exceptions to these general findings were the openness (four DF items) and conscientiousness (three DF items) scales. Items in these scales displaying DF were as follows: “am full of ideas,” “have a rich vocabulary,” “love to read challenging reading material,” “have difficulty imagining things,” “get chores done right away,” “am exacting in my work,” and “shirk my duties.” An examination of these items suggests that individuals in the crowdsourced sample with more work experience might reasonably interpret them differently than those with little work experience. In contrast, items such as “tend to vote for liberal political candidates” would not be expected to vary in their interpretation on the basis of work experience, and indeed, items like these did not display DF in the present study.

Despite these statistically significant differences, across all scales, DF effect sizes were quite small. For instance, for the conscientiousness scale, the potential scale score range was from 17 to 56 (range = 39). However, the expected mean difference in observed scores between the two groups due to DF alone was .151, less than one fifth of one scale point out of a potential 39. Effect sizes were slightly higher for the openness and agreeableness scales, although still not especially large. For instance, the ETSSD indices for the openness scale indicated that group mean differences would be expected to be 0.065 SD higher in the Mechanical Turk sample due to DF alone.

Given the minimal role of DF in the observed data, we were able to examine research question 4, which concerned mean differences with respect to individual differences, including personality, attitudes, and computer knowledge/experience. The Mechanical Turk sample was significantly higher in computer and Internet knowledge. The Mechanical Turk sample was also higher in openness to experience and learning goal orientation and was lower in extraversion (see Table 2). Effect sizes (d) were typically small, according to Cohen’s (1969) criteria. Bivariate correlations by sample are presented in Fig. 1.

Fig. 1 Bivariate correlations

Finally, research question 5 concerned the primary motivations for persons’ participating in crowdsourcing. Most respondents indicated that financial incentives were the primary reason for using Mechanical Turk, though educational and entertainment benefits were also listed (see Table 5). The majority of respondents self-identified as casual users, although a very small subset of users indicated having completed over 1,000 individual HITs and having spent over 100 h per month on the site, making their experience equivalent to part-time employment. Thus, the financial element of participating in crowdsourcing is important to users, although participation is still voluntary and the attractiveness of a given study may carry more weight in participation decisions than does the precise dollar amount of the compensation. For long, involved, or repetitive studies, one may need to compensate participants at a higher rate, while for engaging and interesting studies, one may be able to pay participants slightly less.

Table 5 Mechanical Turk usage patterns and motivations

Discussion

In this study, we sought to identify whether crowdsourcing is a viable alternative to the use of university subject pools, which are often criticized for their homogeneous makeup and limited work experience (Anderson, 2003; Landy, 2008; Locke, 1986; Ward, 1993). We administered a survey to samples drawn from both crowdsourcing and university participant pools to examine the quality of data gathered from each source, as well as to understand more about the crowdsourcing participants. Overall, the crowdsourcing sample behaved similarly to participants from a traditional psychology participant pool, a finding that is consistent with Sprouse’s (2011) comparison of the quality of acceptability judgment data between Mechanical Turk workers and in-lab participants. Where differences existed, effect sizes were typically small. A few noticeable differences were found in terms of data quality, with the slightly higher levels of social desirability in the crowdsourcing sample offset by the slightly better reliability of the data from that sample. Additionally, there were some clear advantages gained from using crowdsourcing; namely, the resulting sample was more diverse, was older, and had more relevant experience, making it an attractive pool for organizational researchers. Thus, it would seem that crowdsourcing tools are a viable option for organizational researchers.

Given the relatively small monetary compensation offered in this study, we were also interested in exploring the motivation of crowdsourcing participants. Approximately 70% of respondents indicated that their primary motivation for using Mechanical Turk, in general, was financial, although the remainder of respondents listed other benefits, such as entertainment or education. It is worth noting that when asked why they volunteered to complete this particular survey, almost all respondents from the undergraduate sample listed course credit as their primary reason, significantly more than in the Mechanical Turk sample, in which respondents gave a number of other reasons, such as an interest in taking surveys and a general interest in personality and related topics.

Implications

Practical

Our findings suggest that crowdsourcing can be a viable resource for researchers wishing to collect survey data on many types of organizational phenomena. This is especially relevant to researchers who commonly recruit participants from undergraduate populations. Specifically, the ability to select participants at the country level may reduce the reliance on exclusively WEIRD populations. It is important to point out, however, that Amazon’s Mechanical Turk can disburse earnings only to U.S. or Indian bank accounts; workers from other countries are paid through Amazon.com gift certificates. The respondents to this survey were, on average, employed, had several years of full-time work experience, and came from a wide range of organizations and occupations, making them a more representative sample for many types of organizational studies. In addition, such portals provide access to participants for researchers who have only small participant pools or no access to an undergraduate psychology pool or its equivalent. Other practical benefits should also be noted; primary among these is the significant time savings that can be realized by using crowdsourcing. In the present study, over 250 surveys were completed in 2 days, whereas a similar sample size drawn from the undergraduate psychology pool took several weeks to achieve. Moreover, the availability of crowdsourcing participants is not limited by semester schedules or the size of a university’s undergraduate population.

Ethical

Mechanical Turk users are evaluated by the requesters, and their rating is visible to others who post work on the site. Thus, there is the potential that Mechanical Turk users will feel undue pressure to complete the survey even if they want to exit. This makes the use of informed consent forms that much more important. However, there is some evidence that online informed consent documentation is not always read carefully (Stanton & Rogelberg, 2001), leaving open the possibility that Mechanical Turk workers will not fully understand that their rating and compensation are not connected to their survey responses. Institutional review boards may or may not be familiar with crowdsourcing as a participant source, and researchers will need to work carefully to make sure participants are treated ethically. A cursory review of psychology surveys posted on Mechanical Turk discovered that almost one third had no informed consent information posted at all.

Additionally, conducting survey or other research is not the intended purpose of labor portals such as Mechanical Turk. The observed increase in the social desirability of Mechanical Turk responses could be due to perceptions that compensation will be based on the nature of the responses given; that is, Mechanical Turk workers may respond in socially desirable ways to avoid being docked pay. This type of subtle coercion has long been cited as a concern in the use of university participant pools (e.g., Rosnow & Rosenthal, 1976), although the problems may be exacerbated in an online labor portal setting.

Considerations for researchers wishing to use Mechanical Turk

Technical

Although most features of Mechanical Turk can be accessed through a Web browser interface, some features are accessible only via text-based programming commands or Amazon Web Services’ application programming interface (API). To fully utilize Mechanical Turk, some degree of computer programming proficiency is therefore required. This is especially true when using Mechanical Turk as a medium for implementing psychology experiments that go beyond simple input field data collection.

Financial

The benefits of using Mechanical Turk or other crowdsourcing options are not free. Participants recruited in this way must be offered financial compensation. In the present study, participants were paid US $0.80 to complete a survey that took approximately 30 min. Using this level of compensation, we were able to recruit several hundred participants in less than 48 h. However, more involved surveys or experimental studies may require higher levels of compensation. Mechanical Turk workers are generally aware of what a fair wage for a given task looks like and respond favorably to requesters who match or exceed this rate.

Oversurveying

Given the relative ease and short timeframe promised by the use of crowdsourcing, the possibility of oversurveying becomes a concern (Thompson & Surface, 2007; Tippins, 2002). While members of a given labor portal are never obligated to accept a particular survey, it is possible that an excessive number of surveys will make the marketplace less attractive. However, anecdotal comments from Mechanical Turk participants in this study indicated that surveys were often welcome jobs and were relatively more engaging than other tasks available on the site.

Limitations and future research

The use of Internet-based psychological research has been discussed for well over a decade (Reips, 2002; Stanton, 1998), and many studies have highlighted the advantages and disadvantages of implementing online research. The next step in this discussion needs to be an investigation into how the online recruitment process can be improved both from a technological standpoint and, perhaps more importantly, from a psychometric standpoint, where the generalizability of the resulting samples is given greater weight.

It is important to acknowledge the limitations of this form of data collection. As with any method involving monetary compensation, the motives of participants may be called into question. Indeed, an important avenue for future study is the effect of varying levels of compensation. In the present study, we selected a compensation level slightly above the median for the timeframe required. It is possible that paying much less would result in fewer participants signing up, while paying much more would attract participants who were not truly interested in completing the survey. Alternatively, paying less may also impart to participants a feeling that they have less of an obligation to provide thoughtful responses, potentially resulting in low-quality data. Future research is required to explore the potential relationship between compensation and data quality, as well as methods for embedding internal checks for undesirable motivations on the part of participants.

It will also be important to identify the degree to which these results are idiosyncratic to the specific community of Mechanical Turk users and to what degree they are generalizable across users of all crowdsourcing marketplaces. Additionally, the current lack of random sampling techniques for Internet users needs to be addressed so that Internet-sourced data can be more confidently generalized across Internet users as a whole (Kraut et al., 2004). Along the same lines, it will be important to discover the utility of this tool for investigating research questions that do not rely on survey methodology (e.g., experimental designs, diary studies).

In conclusion, the promise of crowdsourcing tools will go unrealized if researchers cannot be confident in the quality of the data they will obtain. This study provides initial evidence that data quality is as good as that from undergraduate pools and that diverse samples can be obtained using these tools.