Introduction

Despite their prominence in many areas of social science research, self-reports are prone to various distortions (cf. Chan, 2009), particularly in the assessment of socially undesirable topics such as stigmatized behaviors or illegal activities. Individuals frequently under-report behaviors that conflict with prevalent social norms and regulations, even when interviewed in anonymous surveys where respondents do not have to fear negative consequences. For example, typical self-report surveys estimated prevalence rates for smoking that were up to 9 percentage points lower than respective rates based on objective biomarkers (Gorber, Schofield-Hurwitz, Hardt, Levasseur, & Tremblay, 2009). To increase the validity of self-reports on sensitive behaviors, survey researchers have proposed several solutions (see Tourangeau & Yan, 2007, for a review): among others, the introduction of computerized survey modes has been suggested to increase respondents’ anonymity (Buchanan, 2000; Joinson, 1999; Trau, Härtel, & Härtel, 2013) and, as a consequence, to result in more truthful responding. This assumption was examined in a meta-analysis of mode experiments comparing self-administered paper-and-pencil and computerized surveys for several behaviors conventionally viewed as socially undesirable (e.g., illegal drug use). Moreover, several procedural characteristics of the survey process were examined to identify conditions under which computerized surveys are particularly effective in increasing self-disclosure.

Self-disclosure of sensitive behaviors

Sensitive questions address highly personal and sometimes even distressing topics which are often in conflict with social norms and frequently result in socially desirable answers or even non-response. Three aspects can make a question sensitive (Tourangeau, Rips, & Rasinski, 2000): First, a question can be seen as intrusive when it addresses a taboo topic, independent of what the respondent’s answers might actually be. Second, fears that answers to a question might be disclosed to a third party can make it sensitive, particularly if there are concerns about potentially negative consequences associated with a response. Third, questions evoking answers that conflict with the prevalent social norm can be perceived as sensitive. Prototypical examples of sensitive topics in many Western societies are the consumption of alcohol and illicit substances (Tourangeau & Yan, 2007), sexual activities (Langhaug, Sherr, & Cowan, 2010; McCallum & Peterson, 2012), or delinquency (Kleck & Roberts, 2012). Due to the private nature of these behaviors, researchers interested in studying them usually have to rely on individuals’ self-reports; objective measurements are typically rare (see van der Pol et al., 2013, for an example on drug use) or nearly impossible (e.g., in the context of sexual research). However, people are frequently reluctant to answer questions they consider sensitive, and even if they do provide the information, the validity of their responses is sometimes in question. Data quality depends not only on the accurate recall of facts but also on the degree of people’s self-disclosure, that is, the amount of personal information an individual is willing to provide to others, for example to an interviewer (Jourard, 1971). Self-disclosure is commonly threatened by an individual’s inherent need to create and maintain favorable impressions of oneself in the eyes of others (Paulhus, 2002) or, occasionally, by the tendency to feign symptoms (factitious disorders) in order to elicit compassion or interest (Maldonado, 2002). Therefore, respondents tend to misrepresent their true attitudes and behaviors if they believe them to be in conflict with prevalent social norms.

Survey mode effects on self-disclosure

For a long time, survey researchers have scrutinized factors that might increase self-disclosure of sensitive behaviors (for qualitative reviews see Kleck & Roberts, 2012; Langhaug et al., 2010; McCallum & Peterson, 2012; for quantitative reviews see Richman et al., 1999; Tourangeau & Yan, 2007; Ye, Fulton, & Tourangeau, 2011). Among the studied features, the survey mode was identified as a key variable. A large body of studies demonstrated that motivated misrepresentation tends to decline for more anonymous surveys that limit personal interactions with an interviewer (e.g., in telephone surveys) or remove the interviewer entirely from the survey process (e.g., postal surveys). Moreover, computer-administered self-interviews have been suggested to produce even greater self-disclosure than self-administered paper-and-pencil questionnaires because they are presumably perceived as more anonymous (Buchanan, 2000; Joinson, 1999; Trau et al., 2013). Computerized administration frequently evokes an experience of being immersed in another, virtual, world (cf. also the concept of transportation; Gnambs, Appel, Schreiner, Richter, & Isberner, 2014), letting people forget their immediate surroundings and, thus, creating an illusion of privacy; responses seemingly “‘disappear’ into the computer” (Weisband & Kiesler, 1996, p. 3). Therefore, computers are frequently perceived as impartial counterparts that reduce respondents’ fear of negative evaluations. The more respondents believe that their responses are not currently being observed by others, the more likely they are to answer candidly on sensitive issues. Indeed, it is the mere belief that computerized responses will not be observed by a human interviewer that affects responding, not whether they are actually observed (Lucas, Gratch, King, & Morency, 2014).

Several qualitative reviews supported this assertion and highlighted the advantages of computerized surveys on sexual practices (Langhaug et al., 2010) or delinquent behaviors (Kleck & Roberts, 2012). Two meta-analyses (Richman et al., 1999; Tourangeau & Yan, 2007) even identified small (but generally nonsignificant) advantages of computer-assisted as compared to paper-and-pencil formats. However, conclusions from the latter are not readily transferable to the assessment of behavioral outcomes: Richman and colleagues (1999) did not examine sensitive behaviors but focused on the social desirability of personality traits, whereas the analyses by Tourangeau and Yan (2007) were based on a rather limited database of only ten samples combining attitudinal, personality, and behavioral scales. Research on survey mode effects received yet another impetus with the advent of web-based testing, a variant of computerized surveys administered over the Internet. According to the ‘candor’ hypothesis (Buchanan, 2000), web-based surveys were assumed to elicit higher self-disclosure because they are perceived to be more anonymous. However, existing empirical support for this assumption is inconclusive. Some studies identified the hypothesized effect (e.g., Kays, Gathercoal, & Burow, 2012; Wang, Lee, Lew-Ting, Hsiao, Chen, & Chen, 2005), whereas others did not (e.g., Lucia, Herrmann, & Killias, 2007; McCabe, Boyd, Young, Crawford, & Pope, 2005). Thus, hidden moderators might determine the effectiveness of computerized surveys for the disclosure of sensitive information.

Potential moderators of mode effects

Computerized surveys come in many forms (see Couper, 2011, for an overview). For example, some surveys extended traditional computer-assisted formats to audio-enhanced variants in which questions and response options are presented on the computer screen while respondents listen to spoken recordings of the presented item over a headset. Similarly, web-based testing represents a form of unproctored computerized surveying (Gnambs, Batinic, & Hertel, 2011) characterized by specific procedural features (e.g., no direct interaction with an interviewer and no standardized survey setting). Previous research (cf. Aquilino, Wright, & Supple, 2000; Brener et al., 2006; Richman et al., 1999; Tourangeau & Yan, 2007) indicated that a set of mode-specific conditions associated with the different forms of computerized surveys could moderate the disclosure of sensitive behaviors across survey modes. In addition, mode effects might also depend on specifics of the item content and on individual differences between respondents. Therefore, we examined three groups of moderators referring to item, procedural, or sample characteristics:

Item sensitivity

Survey respondents are frequently reluctant to discuss sensitive issues with others, particularly people they do not know well (e.g., an interviewer), and refuse to provide answers that might invade their privacy or violate social norms. As a consequence, response rates to personal questions tend to decrease as the level of sensitivity increases (Bosnjak & Tuten, 2001; Krumpal, 2013; Shoemaker, Eichholz, & Skewes, 2002). Issue sensitivity might also interact with characteristics of the survey process because self-disclosure is strongly connected to the perceived anonymity of the assessment procedure (Joinson, 1999; Joinson, Reips, Buchanan, & Schofield, 2010; Stiglbauer, Gnambs, & Gamsjäger, 2011). Computerized, particularly web-based, surveys are frequently considered more anonymous than personal interviews or paper-and-pencil surveys and presumably increase respondents’ feelings of privacy. As a consequence, they yield higher self-disclosure on sensitive topics (Booth-Kewley et al., 2007; Kays et al., 2012). Thus, stronger survey mode differences are expected for the disclosure of highly sensitive behaviors because under-reporting of moderately sensitive issues is generally less severe.

Procedural characteristics

Interviewer presence

Survey mode experiments repeatedly showed that eliminating the interviewer from the survey process increases self-disclosure of sensitive behaviors (e.g., Chang & Krosnick, 2009, 2010; Ye et al., 2011). Accordingly, Tourangeau and Yan (2007) estimated a median increase in self-reported illicit drug use by a factor of 1.3 across seven studies when the survey was self- rather than interviewer-administered. It might be speculated that similar effects also manifest in self-administered surveys: the presence of an interviewer might inhibit self-disclosure to some degree if respondents fear that their answers might be accidentally divulged to someone standing nearby. Indeed, there is evidence (Richman et al., 1999) that social desirability effects tend to diminish when respondents are completely alone during test taking (i.e., when no interviewer is present and test taking is conducted alone instead of in group settings). Thus, survey mode differences in self-disclosure are expected to be higher when no interviewer is present during test taking.

Group administration

Bystander effects might contribute to under-reporting of sensitive behaviors (Aquilino et al., 2000). If significant others (e.g., parents or spouses) who might be suspected to notice the recorded responses are present during an interview, under-reporting is more likely. For example, experimental studies showed that adolescents under-report their alcohol consumption and marijuana use when their parents are present during the interview (cf. the meta-analysis in Tourangeau and Yan, 2007). Moreover, this effect was qualified by an interaction with the survey mode (cf. Aquilino et al., 2000): the bystander effect was observed in paper-and-pencil surveys, whereas computerized forms showed no effect (presumably because the computer form was perceived as more anonymous). Moreover, the mere presence of others, even if they do not directly interact with a respondent, unconsciously activates goals and perceived norms associated with these individuals (Parks-Stamm, Oettingen, & Gollwitzer, 2010). As a consequence, responses are more likely to reflect prevalent social norms when assessed in group settings. Therefore, surveys administered individually without other test takers being present should result in larger mode differences in the disclosure of sensitive topics than comparable group-administered surveys.

Standardization of setting

Standardized settings create comparable, highly controlled conditions for all respondents, for example by testing in a dedicated laboratory or a room at school. Some authors suggested that standardized survey settings should yield higher prevalence estimates than unstandardized settings such as testing in respondents’ homes (Brener et al., 2006). Fendrich and Johnson (2001) observed in three national surveys on drug abuse that the two surveys conducted at school resulted in significantly higher prevalence rates of the same behaviors than a household survey. This effect was also replicated in respective mode experiments (e.g., Brener et al., 2006; Gfroerer, Wright, & Kopstein, 1997): adolescents’ self-reports of sensitive behaviors resulted in significantly lower prevalence rates when collected at home as compared to school settings. However, the pattern of effects is not without dispute because some contradictory evidence has also been found. For example, the hypothesized effect of standardization did not emerge in an experimental study in which respondents were either interviewed at home or in a neutral setting outside the home (Tourangeau, Rasinski, Jobe, Smith, & Pratt, 1997). Moreover, the putative effect of standardization is also at odds with evidence from web-based assessments: unstandardized surveys administered over the Internet are supposed to increase perceived anonymity and, thus, facilitate disclosure of sensitive information (e.g., Booth-Kewley et al., 2007; Kays et al., 2012). However, previous research confounded the effects of standardization in web-based research with effects of interviewer presence. To disentangle both effects, the present study examines the two variables as independent moderators.

Audio-enhancements

In audio-enhanced computerized surveys, questions and response options are presented on the computer screen while respondents listen to spoken recordings of the presented item over a headset. Audio-enhancement seems especially useful to overcome literacy problems in populations with poor reading ability while maintaining levels of anonymity comparable to traditional computer-assisted surveys (Turner et al., 1998). Existing evidence on the inclusion of an audio component in computerized surveying is mixed. Some studies that compared audio-enhanced computer surveys to interviewer-administered surveys found higher prevalence rates of sensitive behaviors in computerized interviews (e.g., Des Jarlais et al., 1999; Gorbach et al., 2013; Kelly, Soler-Hampejsek, Mensch, & Hewett, 2013; Turner et al., 1998; Yeganeh et al., 2013). However, these studies confounded the effects of audio-enhancement with self-administration. Other experimental work comparing different self-administration modes was less clear. Whereas some studies (e.g., Couper, Tourangeau, & Marvin, 2009; Langhaug, Cheung, Pascoe, Hayes, & Cowan, 2009; Tourangeau & Smith, 1996) identified modest benefits of including audio recordings in computer surveys, others did not (e.g., Couper, Singer, & Tourangeau, 2003; Nass, Robles, Heenan, Bienstock, & Treinen, 2003). Although experimental research was unable to identify a clear pattern of effects for audio-enhancements, a recent qualitative review on self-reported sexual behaviors (Langhaug et al., 2010) concluded that audio-enhanced computer surveys increased self-reports of sexual activities as compared to other self-administered survey modes. These results led us to expect larger mode differences in self-disclosure for audio-enhanced computer surveys as compared to traditional computer-assisted survey formats.

Sample characteristics

Sex of respondents

Although early research on self-disclosure across different survey modes failed to identify significant gender differences (e.g., Miles & Wesley, 1998), more recent studies suggested that male respondents exhibit increased self-disclosure in computerized assessments (Booth-Kewley et al., 2007; Kays et al., 2012). These sex differences might be a consequence of computer familiarity, which tends to be higher among men: men report using the Internet more often (Joiner et al., 2005, 2012) and engaging in more computer-related activities than women (Epstein, 2012). Women, in turn, report more negative attitudes toward computers and the Internet, less computer-related self-efficacy, and more computer-related anxiety (Appel, 2012; Broos, 2005; Hu, Zhang, Dai, & Zhang, 2012). Therefore, it is expected that the greater familiarity with computerized surveys results in an increased likelihood of self-disclosure on sensitive topics for male respondents.

Age of respondents

Compared to adolescents, who frequently give less consideration to privacy-related risks, many adults report being more cautious and are reluctant to divulge personal information they consider sensitive (e.g., Earp & Baumer, 2003). For example, teenagers are more inclined to provide personal information to businesses (e.g., for marketing purposes) in exchange for minor incentives such as free gifts (Walrave & Heirman, 2013). The increase in privacy concerns with age becomes particularly evident on the Internet, where children and young adults are less concerned about online privacy (Hoofnagle, King, Li, & Turow, 2010). For example, teenagers share more sensitive information such as sexual preferences or political views on social networking sites such as Facebook (Christofides, Muise, & Desmarais, 2009, 2012; Walrave, Vanweesenbeck, & Heirman, 2012). These age-related differences have been attributed to computer-related insecurities that have been shown to increase with age (Laguna & Babcock, 1997). Older individuals tend to report less experience and a lack of confidence with computers (Hawthorn, 2007; Marquie, Jourdan-Boddaert, & Huet, 2002), although this effect seems to have decreased over the last decades (Smith & Oosthuizen, 2006). Thus, it is expected that survey mode effects on self-disclosure are more pronounced for adolescents and young adults than for older age groups.

Present review

Prevalence rates of sensitive behaviors are examined in a meta-analysis of published mode experiments comparing paper-and-pencil and computer-assisted survey modes. This meta-analysis complements two related reviews in several important respects. Whereas Richman and colleagues (1999) primarily studied mode effects with respect to personality and social desirability scales, the present meta-analysis focuses on self-reported behaviors. In addition, technological advancements made available to survey researchers during the last two decades are taken into account by also including audio-enhanced and web-based surveys, two survey modes that were excluded in Richman et al. (1999). The results in Tourangeau and Yan (2007) are extended by including more than five times as many samples and, more importantly, by examining several moderator hypotheses not previously addressed. Thus, the present meta-analysis provides a more exhaustive understanding of mode effects for computerized surveys than available so far.

The specific hypotheses derived for this meta-analysis are summarized in Table 1. The research focus pertains to computerized survey formats that are expected to yield higher prevalence estimates of self-reported, sensitive behaviors than paper-and-pencil surveys (proposition 1). The difference between survey modes is hypothesized to be contingent on several moderators: survey mode effects are expected to be more pronounced for highly sensitive behaviors (proposition 2) in standardized settings (proposition 3a), when neither an interviewer (proposition 3b) nor other test takers are present during the interview (proposition 3c), and when using computerized surveys including an audio component (proposition 3d). With regard to characteristics of the respondents, these differences are hypothesized to be most pronounced for adolescent men (propositions 4a and 4b).

Table 1 Overview of study propositions

Method

Literature search

Primary studies comparing disclosure of sensitive behaviors in paper-and-pencil and computerized surveys were identified from multiple sources: first, several bibliographic databases (PsycINFO, Psyndex, Psychology & Behavioral Sciences Collection, and EconLit) were searched using the keywords sensitive questions, self-disclosure, candor, alcohol, substance use, sexual behavior, or delinquency in combination with computer-based, computerized, web-based, CASI, or ACASI. Second, the respective search was repeated in Google Scholar. Because it seemed infeasible to inspect each of the over 300,000 hits, the search was limited to the first 1,000 results. Because the Google search algorithm ranks results by importance (Brin & Page, 1998), we are confident that most of the relevant publications from this source were identified. Third, additional studies were taken from the references of previous reviews on social desirability effects in computerized testing (Kleck & Roberts, 2012; Langhaug et al., 2010; McCallum & Peterson, 2012; Richman et al., 1999; Tourangeau & Yan, 2007).

Selection of sensitive behaviors

Four rationales guided the selection of sensitive behaviors: first, we focused on socially undesirable practices (e.g., drug use) and did not consider socially desirable behaviors (e.g., voting) because previous research suggested that context factors might differentially affect approach and avoidance behaviors (e.g., Meier, D’Agostino, Elliot, Maier, & Wilkowski, 2012). Second, the behavior should be similarly undesirable across diverse groups of respondents (e.g., being pregnant might be socially undesirable for teenage girls, but seems less undesirable for adult women). Third, because our moderator hypotheses also addressed potential differences between men and women, sex-specific behaviors (e.g., abortion) were not considered. Finally, we only considered sensitive behaviors that have been routinely examined in previous research (cf. Eaton et al., 2010; Tourangeau & Yan, 2007) and for which relevant effect sizes could be retrieved from published research reports. As a consequence, the meta-analysis focused on four topics conventionally viewed as sensitive (see Table 2): (a) substance use, including the consumption of alcohol, tobacco, or illicit drugs (e.g., marijuana, cocaine), (b) sexuality, referring to questions about homosexual intercourse, specific sexual practices (e.g., masturbation, oral sex), or sexual activities in exchange for money (e.g., prostitution), (c) delinquency, inquiring about carrying a weapon, impersonal offenses (e.g., shoplifting, driving under the influence), or crimes involving physical harm of others (e.g., assault), and (d) victimizations, asking about being a victim of physical or sexual abuse, or having attempted suicide.

Table 2 Examples of sensitive questions with sensitivity indices

Inclusion criteria

A study was included in the meta-analysis when it met the following criteria: (a) The study included a question on at least one of the sensitive behaviors presented in Table 2. (b) The question was administered in self-administered, written form both on paper and on computer. Studies that compared computerized assessments to personal or telephone interviews were not included; mode effects for the latter have been reviewed recently by Ye and colleagues (2011; see also De Leeuw & Van der Zouwen, 1988). (c) Participants were either randomly allocated to the two administration modes or provided measures for both modes in a within-subject design. Studies that allowed participants to choose their preferred mode of administration were not included. (d) The assessment procedure was anonymous. Studies that made respondents personally identifiable and linked responses to sensitive questions to specific individuals were excluded, because previous research (e.g., Brown & Vanable, 2009; Richman et al., 1999) indicated that mode effects of computerized surveys are limited to anonymous assessment scenarios. (e) Studies on psychiatric patients with severe mental illness were not considered in order to exclude individuals with impaired cognitive capacity. (f) The study reported the relevant statistics to compute an effect size. This search resulted in 39 primary articles including 48 independent samples (see Table 3).

Table 3 Summary of samples included in the meta-analysis

Moderators

Coded moderators

Several moderators were extracted from the primary studies, including four variables that describe features of the assessment procedure (a–d), two sample characteristics (e and f), and the survey year (g): (a) Group administrations were coded as 1 when surveys were administered to groups of test takers (e.g., in a classroom); when respondents were alone, or could choose their company during the assessment as in web-based testing, they were coded as −1. (b) Proctored administrations (coded as 1), in which a test administrator supervised the whole testing process and remained present during test taking, were contrasted with unproctored administrations (coded as −1), in which participants remained alone and unsupervised. (c) Assessment settings that were standardized for all participants (coded as 1) – for example, by testing in a dedicated laboratory, test center, or room at school – were compared to unstandardized settings with varying assessment locations (coded as −1) where each respondent could choose the place to take the survey (e.g., at home or the workplace). (d) The interview type was coded as 1 if the computerized assessment procedure included an audio component and −1 if not. Moreover, two sample characteristics that are typically reported in research reports were recorded: (e) the proportion of female participants and (f) the mean age (in years) of the sample. (g) Finally, because the perceived sensitivity of a given topic might change over time (e.g., see Ruel & Campbell, 2006, for the changing stigmatization of HIV), the survey year was extracted as a control variable to examine potential cohort effects. About 29 % of studies did not report the year of data collection. Because the median difference between the survey year and the publication year was 3 years for studies reporting both pieces of information, missing survey years were imputed as the publication year minus 3. The correlations between all moderators are summarized in Table 4.
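To make the coding scheme concrete, the following minimal R sketch illustrates the −1/1 contrast coding and the survey-year imputation described above; the data frame dat and its column names are hypothetical illustrations, not taken from the original coding materials.

```r
# Minimal sketch of the moderator coding (hypothetical data frame and columns).
# Categorical moderators are contrast coded as -1/1 rather than 0/1.
dat$group_admin <- ifelse(dat$administered_in_groups,     1, -1)
dat$proctored   <- ifelse(dat$test_administrator_present, 1, -1)
dat$standard    <- ifelse(dat$standardized_setting,       1, -1)
dat$audio       <- ifelse(dat$audio_component,            1, -1)

# Missing survey years are imputed as publication year minus 3, the median lag
# between survey and publication among studies reporting both years.
dat$survey_year <- ifelse(is.na(dat$survey_year),
                          dat$publication_year - 3,
                          dat$survey_year)
```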

Table 4 Correlations between moderators

Sensitivity of behavior

Previous research showed that response rates to personal questions reflect the perceived sensitivity of an item (Bosnjak & Tuten, 2001; Krumpal, 2013; Shoemaker et al., 2002). For example, in an unpublished study by Tourangeau et al. (1997, cited in Tourangeau et al., 2000), demographic items received more valid responses than questions on sexual behaviors. Moreover, non-response to sensitive questions was also a significant predictor of unit non-response, that is, complete study attrition, in panel studies (Loosveldt, Pickery, & Billiet, 2002). Therefore, an objective index reflecting the degree of item sensitivity was derived by examining item non-response in the Youth Risk Behavior Survey (YRBS; Brener et al., 2013), a biannual representative survey (N ≈ 15,000) on adolescent risk behaviors in the United States. For each sensitive behavior in the YRBS, the percentage of item non-response was estimated. To account for normative differences in behaviors, item sensitivity was calculated as the odds ratio of missing responses to the number of affirmative responses. The median of this index from the years 2001 to 2011 was used to guard against potential outliers in a given year. The survey allowed the calculation of sensitivity indices for 15 sensitive behaviors (see Table 2): sensitivity indices were available for substance use and most items on delinquency and victimizations; for sexual behaviors respective indices could not be obtained. Because the index calculated for LSD use fell three standard deviations above the mean and represented an outlier, the presented analyses were limited to the rank information of the sensitivity index. To cross-validate the index, we derived a comparable index for ten behaviors on substance use from the Monitoring the Future studies (MTF; Johnston, Bachman, O’Malley, & Schulenberg, 2011) and the National Surveys on Drug Use and Health (NSDUH; Center for Behavioral Health Statistics and Quality, 2013), annual representative surveys on drug abuse among American youths (MTF; N ≈ 15,000) and adults (NSDUH; N ≈ 55,000). The sensitivity rank from the YRBS correlated with the respective values from the MTF and NSDUH at r = .94 and r = .77, respectively. Thus, the derived index showed considerable convergent validity across three independent representative surveys. Consequently, the sensitivity ranks from the YRBS, which provided sensitivity information for the largest number of behaviors, were used (see Table 2).
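The computation of the sensitivity index can be illustrated with a minimal R sketch; the data frame yrbs and its columns are hypothetical stand-ins for the YRBS microdata, and the index is shown exactly as defined above (missing relative to affirmative responses, median across years).

```r
# Minimal sketch of the item sensitivity index (hypothetical data frame 'yrbs'
# with one row per respondent and year; behavior columns coded 1 = affirmative,
# 0 = negative, NA = item non-response).
sensitivity_index <- function(x) {
  n_missing     <- sum(is.na(x))
  n_affirmative <- sum(x == 1, na.rm = TRUE)
  n_missing / n_affirmative      # odds of non-response relative to affirmation
}

# Yearly indices for one behavior, summarized by their median (2001-2011) to
# guard against outliers in single years; the analyses then used ranks across
# behaviors because the raw index for LSD use was an outlier.
yearly  <- tapply(yrbs$cocaine_use, yrbs$year, sensitivity_index)
med_idx <- median(yearly, na.rm = TRUE)
```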

Meta-analytic procedure

The meta-analysis focused on differences in prevalence rates of risk behaviors; therefore, the odds ratio (OR) was adopted as effect size. The effect sizes were computed as OR = p_C / p_P, with p_C as the proportion of respondents agreeing to an item in the computerized survey and p_P as the respective proportion in the paper-and-pencil survey. Therefore, ORs greater than 1 indicated higher prevalence rates and, as such, higher self-disclosure in computerized surveys. Using the studentized deleted residual (Viechtbauer & Cheung, 2010), three effects were identified as outliers (α = .01), less than 1 % of all available ORs. To reduce the impact of these outliers, we followed the approach in Gnambs (2013) and truncated the respective effect sizes to the lower or upper bound of the 90 % credibility interval of the true effect calculated from a dataset from which the outliers had been removed.
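As an illustration, the following R sketch computes the effect size as defined above for a single, hypothetical mode comparison; the log-scale variance formula is the usual large-sample approximation for a log ratio of proportions and is an assumption, since the original variance estimator is not spelled out here.

```r
# Minimal sketch of the effect size OR = p_C / p_P for one (hypothetical)
# mode comparison.
n_comp  <- 400; yes_comp  <- 60   # computerized survey: n and affirmative answers
n_paper <- 400; yes_paper <- 45   # paper-and-pencil survey
p_c <- yes_comp  / n_comp
p_p <- yes_paper / n_paper
or  <- p_c / p_p                  # > 1: higher self-disclosure on the computer

# Meta-analytic computations are usually done on the log scale; the variance
# below is the standard large-sample approximation for a log ratio of
# proportions (an assumption; the original estimator may differ).
log_or <- log(or)
v      <- (1 - p_c) / (n_comp * p_c) + (1 - p_p) / (n_paper * p_p)
w      <- 1 / v                   # inverse-variance weight (see next paragraph)
```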

The effect sizes were aggregated using a random-effects meta-analysis (cf. Cheung, 2014a). Following recommendations by Marín-Martínez and Sánchez-Meca (2010), each effect was weighted by the inverse of its variance to account for sampling error. Before calculating these variances, the sample sizes of the 10 % largest studies were truncated to the largest sample size of the remaining studies (cf. Gnambs, 2014); otherwise, the aggregated effect would primarily reflect the effect of these large-sample studies and give hardly any weight to the other studies. Because several studies reported multiple mode comparisons (e.g., obtained for different sensitive behaviors), the meta-analysis was specified as a multilevel model (see Cheung, 2014a). This approach acknowledges the dependencies between the individual effects and models the data on three hierarchical levels: (a) Level 1 refers to the individual effect sizes. (b) Level 2 refers to the effect sizes for the different types of sensitive behaviors within a sample; thus, the random level 2 variance τ²(2) reflects the heterogeneity of effects due to differences in sensitive behaviors. (c) Level 3 refers to the different samples; thus, the random level 3 variance τ²(3) indicates the heterogeneity of effect sizes across samples after controlling for the different types of sensitive behaviors at level 2. The influence of various covariates on the aggregated effect was examined using weighted, mixed-effects regression analyses (Kalaian & Raudenbush, 1996). All analyses were conducted in R using the metaSEM package (Cheung, 2014b).
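A minimal sketch of such a three-level model, assuming metaSEM's meta3() interface (Cheung, 2014b) and a hypothetical data frame dat with one row per effect size, might look as follows; the exact argument names should be checked against the installed package version.

```r
# Minimal sketch of the three-level random-effects model (assumption: metaSEM's
# meta3() interface; 'dat' is a hypothetical data frame with the log odds ratio
# yi, its sampling variance vi, and a sample identifier sample_id per row).
library(metaSEM)

# Intercept-only model: tau^2(2) = heterogeneity across behaviors within samples,
# tau^2(3) = heterogeneity across samples.
fit0 <- meta3(y = yi, v = vi, cluster = sample_id, data = dat)
summary(fit0)

# Mixed-effects model with a contrast-coded moderator (e.g., group administration),
# analogous to the moderator analyses reported in the Results.
fit1 <- meta3(y = yi, v = vi, cluster = sample_id, x = group_admin, data = dat)
summary(fit1)
```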

Results

Sample characteristics

This meta-analysis included 48 independent samples (see Table 3) with a total of 125,672 participants (range of the individual studies’ Ns: 27 to 80,515) reporting 460 effect sizes. These samples included, on average, more women than men (median percentage of female respondents: 59 %) and consisted primarily of adolescents and young adults (median age: 19 years). On average, each sample contributed four to five effect sizes. Most effect sizes were available for the comparison of prevalence rates in substance use (65 %), whereas the rest focused on victimizations (12 %), delinquent behaviors (12 %), or sexual behaviors (11 %). Over two-thirds of the studies were conducted in the United States (67 %), 15 % in Asia, and about 10 % in European countries. The surveys were administered between the years 1991 and 2010.

Overall effect of computerized assessments

The results of the meta-analysis are summarized in Table 5. The observed, uncorrected odds ratio for all available effect sizes was OR = 1.24, which hardly changed after correcting for sampling error, Ω = 1.19. Because the effect sizes were computed in such a way that ORs greater than 1 indicate higher prevalence rates of sensitive behaviors on the computer, these results demonstrated that computerized assessments resulted in significantly (p < .05) higher self-disclosure than respective paper-and-pencil modes. This overall effect was also replicated for several subgroups of different types of sensitive behaviors. Various forms of substance use, Ω = 1.17, and sexual behaviors, Ω = 1.29, showed significantly (p < .05) higher prevalence rates in computerized as compared to paper-and-pencil surveys. Self-reported delinquent behaviors, Ω = 1.14, and victimizations, Ω = 1.07, revealed a similar trend; however, these effects did not reach statistical significance, p = .09 and p = .22, respectively. Detailed cross-cultural examinations did not seem feasible because very few effects were available from geographical regions outside the United States (see Table 5). However, exploratory comparisons of the mean effect sizes calculated for several geographical regions revealed highly similar trends in American, European, African, and Asian samples, with computerized assessments eliciting higher self-disclosure.

Table 5 Meta-analysis of sensitive questions in computerized assessments

Overall, these results support the hypothesized survey mode effect on self-disclosure of sensitive behaviors. However, the significant (p < .05) random variances of Ω also pointed to unexplained heterogeneity that might be accounted for by various moderators.

Moderator analyses

The random variance of the aggregated effect was inspected more closely by meta-regression analyses that used the coded moderators (see Method section) as predictors of the individual effect sizes. In these analyses the categorical moderators were contrast-coded (−1 and 1) instead of dummy-coded (0 and 1). As a consequence, the intercept in these regression models reflects the mean population effect after controlling for the moderators. Moreover, the continuous moderators (survey year, item sensitivity, sex ratio, and age) were recoded (as deviations from 2008, 8, 50, and 15, respectively) so that the intercept reflects the true mode effect for a behavior of median sensitivity in the year 2008 for samples with a balanced sex ratio and a mean age of 15 years. To guard against potential confounds resulting from cross-cultural differences in self-disclosure (cf. Chen, 1995; Johnson & van de Vijver, 2002) and in the perceived sensitivity of the studied behaviors (Roster, Albaum, & Smith, 2014), all moderator analyses were limited to the American samples. However, sensitivity analyses including all samples identified highly similar effects.
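The recoding of the continuous moderators can be illustrated with a short R sketch; the column names are hypothetical, and the reference values are those stated above (2008, rank 8, 50 % female, age 15).

```r
# Minimal sketch of the centering of continuous moderators (hypothetical column
# names). After recoding, the meta-regression intercept reflects the mode effect
# for a behavior of median sensitivity (rank 8) surveyed in 2008 in a sample with
# a balanced sex ratio (50 % female) and a mean age of 15 years.
dat$year_c   <- dat$survey_year      - 2008
dat$sens_c   <- dat$sensitivity_rank - 8
dat$female_c <- dat$percent_female   - 50
dat$age_c    <- dat$mean_age         - 15
```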

Survey year

Potential changes across time were examined by modeling the effect sizes dependent on the survey year (see Model 1 in Table 6). Initially, several regression models including higher-order polynomials were also inspected; but only the linear and quadratic terms remained significant, both p < .06, and, thus, were retained for the analyses. The effect of computerized assessments on self-disclosure of sensitive behaviors was subject to a moderate time trend (see Fig. 1). During the 1990s mode effects slightly declined, dropping from a predicted Ω = 1.25 to a predicted Ω = 1.08 in the year 2000; the last decade registered a new increase, with a predicted Ω = 1.19 in the year 2005. The survey year accounted for about 13 % of the between-sample heterogeneity τ²(3).

Table 7 Tests for publication bias
Fig. 1 Effect of computerized assessment on self-disclosure across time. Odds ratios greater than 1 indicate higher prevalence rates of self-reported sensitive behaviors in computerized than in paper-and-pencil surveys. The solid line represents the model-implied change trajectory from regression 1 in Table 6; dots represent the aggregated true effects for the respective year (dot sizes correspond to the number of included effects).

Sensitivity of behavior

Sensitivity information was available for a subsample of 283 out of all 343 effect sizes. Regressing these effects on the sensitivity rank, γ = 0.02, SE = 0.00, p < .01, highlighted an increase of survey mode differences for more sensitive behaviors (see Fig. 2). This effect was rather robust and remained significant after controlling for the previously identified time trend (see Model 2 in Table 6). Highly sensitive behaviors (predicted Ω = 1.63), such as the use of heroin or cocaine, resulted in larger differences in prevalence rates across survey modes than less sensitive behaviors (predicted Ω = 1.43), such as smoking or the consumption of alcoholic beverages. The sensitivity rank accounted for about 22 % of the random level 2 variance τ²(2). Although the sensitivity of the studied behaviors significantly moderated the survey mode differences, it was not equally predictive for all types of behaviors. For example, as depicted in Fig. 2, sexual abuse was classified as a highly sensitive topic, but the empirical, aggregated mode effect was considerably smaller than the effect predicted from the regression model. Thus, additional moderators related to specific types of sensitive behaviors might not be captured by the chosen sensitivity index.

Fig. 2 Effect of computerized assessment on self-disclosure by sensitivity of behavior. Odds ratios greater than 1 indicate higher prevalence rates of self-reported sensitive behaviors in computerized than in paper-and-pencil surveys. The solid line represents the regression line. Letters indicate the mean effects for different types of sensitive behavior (for abbreviations see Table 2); font sizes correspond to the number of included effects.

Procedural characteristics

Survey mode differences were examined in relation to four procedural characteristics: group administration, interviewer presence, standardization of the survey setting, and inclusion of an audio component. Although some moderators were moderately correlated (see Table 4), variance inflation factors (VIFs) did not indicate serious multicollinearity (all VIFs < 2). Moreover, sensitivity analyses that removed moderators from the regression models one at a time identified the same effects as the full model (Model 3a in Table 6). Among the procedural characteristics, only group administration emerged as a significant moderator: mode differences were more pronounced when respondents were alone without other test takers present (predicted Ω = 1.61) as opposed to settings where other test takers were nearby (predicted Ω = 1.18). Group administration explained ΔR² = .50 of the random between-study variance τ²(3) in addition to the time trend. The remaining procedural characteristics explained the heterogeneity of effect sizes across studies insufficiently. To examine the robustness of this moderator effect, the respective analyses were also repeated controlling for item sensitivity. Within the subsample of effects with sensitivity indices available, the respective moderation effect remained significant, p < .05 (see Model 3b in Table 6).

Sample characteristics

For the examination of individual differences between respondents, rather few samples were available (about half of all coded samples) because many studies neglected to report relevant sociodemographic information (see Table 3). Moreover, the age range of the available samples was very limited: most studies reported on adolescent samples, whereas only two adult samples were available that included respondents with a mean age of 40 years or older. Therefore, the respective analyses should be interpreted with due caution. Moderation analyses (see Model 4 in Table 6) that included the percentage of female participants and the mean age of the studied samples did not identify differences between men and women. However, a marginally significant (p = .07) age-related effect emerged. Age explained about ΔR² = .33 of the random between-study variance τ²(3) in addition to the time trend. Contrary to our expectations, samples predominantly including adult respondents, predicted Ω = 1.45 at age 30, exhibited stronger self-disclosure in computerized surveys than adolescent samples, predicted Ω = 1.29 at age 15. Because the ages of the two adult samples might be considered outliers, we repeated these analyses using the logarithmized age of the respondents as moderator. However, this robustness check failed to replicate the age trend, p = .12. Therefore, this result should be regarded as preliminary until a larger body of effects from older respondents is available.

Publication bias

To determine whether systematically missing studies might have distorted the accuracy of the synthesized effects, Rosenberg’s (2005) Fail-Safe N was calculated, which indicates the number of studies with null results that would have to be added for the estimated Ω to become non-significant. As a rough rule of thumb, Rosenthal (1979) recommended Fail-Safe Ns that are about five times larger than the number of included effects; such values indicate robust effects that are unlikely to be distorted by publication bias. As summarized in Table 7, the estimated Ω for the overall effect can be considered robust. Because some authors (e.g., Kepes, Banks, McDaniel, & Whetzel, 2012) have evaluated the Fail-Safe N approach to the analysis of publication bias rather critically, we also examined the contour-enhanced funnel plot (Peters, Sutton, Jones, Abrams, & Rushton, 2008) of the odds ratios and their standard errors. A visual inspection of the funnel plot (Fig. 3) did not indicate publication bias but revealed a largely symmetric distribution around the population effect. Moreover, we also tested the funnel plot for asymmetry statistically by regressing the individual effect sizes on the inverse of their respective sample sizes (cf. Moreno et al., 2009; Peters, Sutton, Jones, Abrams, & Rushton, 2006). A significant effect would indicate funnel plot asymmetry and, thus, a potential publication bias. However, the test failed to identify a significant effect, B = 7.01, SE = 6.43, p = .28 (cf. Table 7), and thus provided no indication of publication bias.
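For illustration, the asymmetry test described above can be sketched in R as a regression of the effect sizes on the inverse sample sizes; the column names and the inverse-variance weighting are assumptions, so the sketch is not guaranteed to reproduce the reported coefficients exactly.

```r
# Minimal sketch of the funnel plot asymmetry test described above: effect sizes
# regressed on the inverse of the study sample sizes (cf. Peters et al., 2006).
# Column names and the inverse-variance weighting are assumptions; a significant
# slope would indicate asymmetry and, thus, potential publication bias.
asym <- lm(yi ~ I(1 / n_total), weights = 1 / vi, data = dat)
summary(asym)
```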

Table 6 Moderator analyses for sensitive behaviors in computerized assessments
Fig. 3 Contour-enhanced funnel plots with 90 % (white), 95 % (light gray), and 99 % (dark gray) confidence intervals around the aggregated true effect (horizontal line)

Discussion

Motivated misreporting remains a pervasive problem in survey research, particularly for questions involving behaviors that are contrary to prevalent social norms and, as a consequence, are perceived as embarrassing or even threatening. In these cases, self-reports are more prone to distortions the more the specific survey mode requires interpersonal contact with others. Therefore, modes removing the interviewer from the survey process have been shown to elicit higher self-disclosure of sensitive behaviors than, for example, telephone or personal interviews (cf. Chang & Krosnick, 2009, 2010; Richman et al., 1999; Ye et al., 2011). In addition, it has been suggested that computerization of self-administered surveys would add another level of abstraction, leading to even more self-disclosure. Because computers are viewed as impartial communicators that are perceived as more anonymous (e.g., Buchanan, 2000; Joinson, 1999; Richman et al., 1999; Trau et al., 2013), respondents should feel less social pressure to answer in line with prevalent social norms and give more honest answers. In line with this premise, the presented meta-analysis identified significantly higher prevalence rates of sensitive behaviors in computerized as compared to paper-and-pencil surveys. The respective effect was quite robust and replicated across different types of sensitive behaviors (i.e., substance use, sexuality, delinquency, victimizations) and also across different geographical regions. Although the identified mode effect might be considered small, Ω = 1.51 after correcting for several moderators (see Table 6), it was considerably larger than previous research (Tourangeau & Yan, 2007) indicated, Ω = 1.08. However, when point estimates of rare events are of central importance – as in epidemiological research on sensitive topics such as illicit drug use – even the identified small mode effect can be of practical importance, for example when facing costly decisions on the design and implementation of prevention and counseling programs for substance abuse patients.

Interestingly, the studied mode effect showed a marked time trend following a U-shaped function (see Fig. 1) that might reflect changes in respondents’ familiarity with the survey technology. Tourangeau and colleagues (2000) suggested that the novelty of using computers for interviewing – which was still rather rare in the 1990s – might have signaled a form of importance and legitimacy to most respondents; in turn, computers might also have increased the disclosure of sensitive behaviors. Respondents’ increasing exposure to computers might explain the downward trend of this effect in Fig. 1. Similarly, the rise of web-based survey modes, which gradually gained broader acceptance in psychological research only in the last decade (Gosling, Vazire, Srivastava, & John, 2004), might account for the slight increase in subsequent years.

With regard to the hypothesized moderators (see Table 1), the meta-analysis reached three main conclusions: first, computerization seemed to be particularly advantageous for highly sensitive behaviors such as cocaine use, whereas respective effects were less pronounced for moderately sensitive behaviors, for example smoking or alcohol consumption. Thus, computerized surveying is most effective for the most controversial issues that are strongly in contrast to social norms and regulations. Second, among the studied procedural survey characteristics, the presence of co-test takers was most predictive of mode differences. Computerized surveys that were administered alone resulted in significantly higher prevalence estimates of sensitive behaviors than surveys presented to groups of respondents. Thus, traditional web-based surveys seem particularly effective for the collection of sensitive behaviors because test takers can respond alone, without fearing that others might see their responses to sensitive items. Contrary to previous experiments on inter-racial bias (Evans et al., 2003), other features of the unproctored computer mode, such as the absence of an interviewer, did not emerge as additional moderators. Third, in contrast to some previous findings (e.g., Couper et al., 2009; Langhaug et al., 2009; Tourangeau & Smith, 1996; Turner et al., 1998), computerized surveys with audio enhancements did not show an additional advantage with regard to self-disclosure. This is somewhat at odds with a recent qualitative review of mode effects in developing countries that reported minor advantages for audio-enhanced computer surveys (Langhaug et al., 2010). The different conclusions from these studies might hint at additional moderators not included in the present meta-analysis; indeed, the included moderators accounted for only about half of the between-study heterogeneity (see Table 6). Thus, sample characteristics, for example related to educational level, might explain the discrepant findings. It could be speculated that audio-enhancements would be more effective for specific subgroups with low literacy that were underrepresented in the current meta-analysis.

Overall, the presented results demonstrated that the seemingly minor switch from paper to computer tends to result in higher self-disclosure rates of sensitive behaviors in self-administered surveys.

Accuracy of self-reported sensitive behaviors

Generally, it is assumed that higher prevalence rates of self-reported sensitive behaviors are also more accurate indicators of respondents’ real behaviors. However, this “more is better” assumption (Tourangeau & Yan, 2007, p. 863) remains largely untested. So far, few studies have explicitly focused on the accuracy of self-reported behaviors across survey modes by validating respondents’ answers against objective criteria. The available evidence suggests that the identified increase in prevalence rates is also accompanied by an increase in accuracy (e.g., Hewett et al., 2008; Kreuter, Presser, & Tourangeau, 2008; Langhaug et al., 2010; van Griensven et al., 2006). For example, in a mode experiment Kreuter and colleagues (2008; see also Sakshaug, Yan, & Tourangeau, 2010) validated self-reported academic performance of students against available university records. For socially undesirable questions (e.g., receiving bad grades or having a low grade point average), web-based surveys resulted in significantly less under-reporting of true performance than telephone interviews. Similarly, self-reported sexual risk behaviors predicted actual sexually transmitted infections better when respondents were interviewed via audio-enhanced computer surveys as compared to personal interviews (Hewett et al., 2008). Finally, van Griensven and colleagues (2006) validated self-reported substance use, including several illicit drugs, against objective biomarkers; descriptive analyses revealed a higher accuracy for computerized assessments than for questionnaires administered on paper. Overall, these studies support the assumption that the higher prevalence rates identified in computerized survey modes also reflect more accurate self-reports.

A matter of anonymity?

Increased self-disclosure in computerized as compared to paper-and-pencil surveys has frequently been attributed to increased anonymity perceptions (e.g., Buchanan, 2000; Joinson, 1999; Richman et al., 1999; Trau et al., 2013). However, recent research has cast doubt on anonymity as the mediating process because an increase in anonymity can sometimes decrease accountability (Lelkes et al., 2012): although people tend to report more undesirable behaviors under anonymity conditions, the accuracy of the reported behavior decreases. Moreover, many people, when given the opportunity to behave unethically, also do so (Zhong, Bohns, & Gino, 2010). This is also reflected in the online disinhibition effect, resulting in, for example, a decreased willingness to cooperate with others (Cress & Kimmerle, 2008) or increased inflammatory behavior (i.e., hostility towards others in web-based communication; Alonzo & Aiken, 2004). Thus, other explanations might account for differences in self-disclosure across self-administered survey modes.

On the one hand, survey mode effects could be a result of increases in confidentiality and privacy (Joinson & Paine, 2006; Joinson et al., 2010). Some survey mode experiments tend to support this notion (DiLillo, DeGue, Kras, DiLoreto-Colgan, & Nash, 2006). Whereas self-administered computerized and paper-and-pencil surveys do not differ with regard to perceived anonymity (i.e., whether respondents are personally identifiable and answers to sensitive questions can be linked to specific individuals), the former are perceived as more confidential: computerized modes are attributed greater privacy, that is, respondents expect that significant others are less likely to see their responses to sensitive questions. Thus, privacy perceptions, particularly when respondents have control over who does and does not get access to their responses, seem to increase the willingness to disclose sensitive information (Brandimarte, Acquisti, & Loewenstein, 2012). However, empirical evidence on this point is far from conclusive: it is also conceivable that under certain conditions computerized surveys might be perceived as less private, for example when several respondents sitting close to each other might glance at one another’s computer screens (Beebe, Harrison, McRae, Anderson, & Fulkerson, 1998; Brener et al., 2006). Moreover, given the ongoing debate on data security and privacy on the Internet, future research that scrutinizes the implied mediating role of privacy perceptions between survey modes and self-disclosure is highly warranted.

On the other hand, survey mode effects might be attributed to cognitive distortions in risk perceptions because people tend to underestimate objective risks of events presented on the computer. For example, many individuals exhibit greater confidence in their abilities (Ackerman & Goldsmith, 2011) and are more likely to hold an illusion of control (i.e., the belief that they can influence even random events; MacKay & Hodgins, 2012) when identical problems are presented on the computer as compared to other media. Following social-exchange theory (cf. Dillman, Smyth, & Christian, 2014), respondents weigh the potential risks of answering a sensitive question against the potential benefits: if the perceived risk outweighs the benefits, respondents are more likely to lie or refuse to answer. However, if computerization evokes cognitive distortions that decrease the perceived risk associated with an honest answer, respondents are more likely to disclose a sensitive behavior. As a consequence, prevalence rates of socially undesirable behaviors should be higher in computerized as compared to paper-and-pencil surveys. However, so far, this mediation process has not been examined in the context of survey research and, thus, remains speculative.

Limitations and outlook

Some limitations might impair the generalization of the presented findings: First, despite showing convergent validity across three large-scale representative surveys, the sensitivity index adopted for this study was not equally capable of predicting survey mode differences for all types of behaviors (e.g., sexual abuse; see Fig. 2). Unaccounted-for confounding factors might have biased the chosen indicator to some degree. For example, Beatty and Herrmann (2002) argued that item non-response is not a pure indicator of item sensitivity: although it reflects the anticipated psychological and social costs of an honest response (i.e., item sensitivity), non-response also reflects respondents’ cognitive effort due to item complexity or simply motivational constraints (e.g., a lack of interest). Future research should further scrutinize the domain effect of self-disclosure across survey modes by adopting more elaborate methods, for example the randomized response or unmatched count technique (cf. Coutts & Jann, 2011; Lensvelt-Mulders, Hox, van der Heijden, & Maas, 2005).

Second, respondent characteristics might account for some between-study heterogeneity in the aggregated effect sizes. Sociodemographic characteristics and even personality traits, such as an individual’s propensity to trust or willingness to take risks, could represent further characteristics differentially affecting reactions to survey computerization. In the present meta-analysis, sociodemographic differences were largely unable to explain survey mode differences. Although age exhibited a marginally significant effect, this result should be considered with due caution because it is based on rather few samples including predominantly adolescent respondents. Thus, future research should consider systematically examining the sample composition to identify subgroups of respondents for whom computerized survey modes might be particularly effective.

Third, anecdotal evidence also hints at potential mode differences across cultures. For example, North Americans tend to disclose more than Chinese (Chen, 1995), Japanese (Schug, Yuki, & Maddux, 2010), or East Europeans (Maier, Zhang, & Clark, 2013) under face-to-face conditions. However, in computer-mediated environments self-disclosure increases for Asians, which has been attributed to the fact that members of collectivistic cultures are more reserved in face-to-face interactions to avoid violating social norms (Zhao, Hinds, & Gao, 2012). The descriptive results of the current meta-analysis could not corroborate these findings (see Table 5) because few effects were available from outside the United States. Therefore, future studies are encouraged to explicitly address cultural effects on self-disclosure in computerized surveys.

Finally, the present study was limited to a selection of sensitive behaviors (see Table 2) that have been frequently scrutinized in previous research. We do not want to imply that these are the most important or even the only behaviors affected by survey modes. Rather, future research should extend this line of research to other content domains that might be considered sensitive, such as political participation (e.g., voting) or self-reported wealth (e.g., income). Indeed, there is evidence that respondents’ willingness to report a lower socio-economic status is differentially affected by the survey mode (Pascoe, Hargreaves, Langhaug, Hayes, & Cowan, 2013). Moreover, it might also be worthwhile to extend research on survey mode effects and its moderators to the attitudinal questions that dominate public opinion research.

Implications for survey research

What are the practical implications of these results? On the one hand, it might be argued that with the widespread availability of web-based and mobile devices (cf. Mavletova & Couper, 2013; Van Heerden, Norris, Tollman, Stein, & Richter, 2014; Wells, Bailey, & Link, 2014), paper-and-pencil surveys will soon become outdated and mode differences should be of no major concern to survey specialists. For example, data from Germany show that in the year 2000 market research firms administered paper-and-pencil surveys about four times more often than computerized formats, whereas this ratio reversed during the subsequent decade; today computerized surveys are administered over four times more often than paper-and-pencil formats (ADM, 2014). Thus, in the near future paper-and-pencil questionnaires might be negligible in survey research. On the other hand, an increasing number of researchers adopt mixed-mode designs that assign respondents to different survey modes to maximize response rates (De Leeuw & Hox, 2011). For example, a study might be designed as a web-based survey; however, to also reach respondents with no or limited Internet access, this web-based survey might be supplemented by a postal survey – as, for example, in the nationally representative GESIS panel, a mixed-mode survey of the general population in Germany (cf. Struminskaya, Kaczmirek, Schaurer, & Bandilla, 2014). Given the present results, the assessment of sensitive behaviors might be biased in mixed-mode surveys when individuals systematically under-report socially undesirable behaviors in paper-and-pencil as compared to computer-assisted survey modes.

Conclusions

During the past decades, various forms of computerization have been introduced to the survey process, considerably enlarging researchers’ degrees of freedom in how to appropriately collect their data (cf. Couper, 2011): from simple paper questionnaires adapted for presentation on computer screens, to more sophisticated variants including multimedia components such as audio or video recordings, up to surveys administered over the Internet. In particular, web-based surveys have received considerable attention in recent years (e.g., Kays et al., 2012; McCabe et al., 2005), partly because they have been credited with greater anonymity that supposedly should lead to higher self-disclosure of respondents (Buchanan, 2000; Joinson, 1999; Richman et al., 1999; Trau et al., 2013). The presented meta-analysis took up this assertion and empirically confirmed the effect of survey computerization on the disclosure of sensitive behaviors. Computer-assisted surveys resulted in prevalence rates of sensitive behaviors that were about 1.51 times higher than comparable reports obtained via paper-and-pencil questionnaires; for highly sensitive issues this mode effect was even larger. Thus, surveys on issues conventionally perceived as sensitive tend to benefit from a switch to modern technologies, particularly when respondents are interviewed alone without other test takers present, such as in web-based surveys.