Introduction
Many questionnaires have been developed and validated as paper-and-pencil (P&P) questionnaires. However, over the past few decades, many of these questionnaires have increasingly been administered in electronic formats, in particular as Web-based questionnaires [1]. Advantages of data collection over the Internet include reduced administrative burden, prevention of item nonresponse, avoidance of data entry and coding errors, automatic application of skip patterns, and, in many cases, cost savings [1]. A Web-based questionnaire that has been adapted from a P&P instrument ought to produce data that are equivalent to the original P&P version [2]. Measurement equivalence means that a Web-based questionnaire measures the same construct in the same way as the original P&P questionnaire and that, consequently, results obtained with the Web-based questionnaire can be interpreted in the same way as those obtained using the original P&P questionnaire. However, migrating a well-established P&P questionnaire to a Web-based platform does not guarantee that the Web-based instrument preserves the measurement properties of the original. Necessary modifications in layout, instructions, and sometimes item wording and response options might alter item response behavior. Therefore, it is recommended that measurement equivalence between a Web-based questionnaire and the original P&P questionnaire be supported by appropriate evidence [2]. Four reviews of such equivalence studies suggested that, in most instances, electronic questionnaires and P&P questionnaires produce equivalent results [3‐6]. However, this is not always the case [4]. In this paper, we demonstrate the use of modern psychometric methods to assess the equivalence of a questionnaire across two modes of administration. This is illustrated by analyzing the Web-based and P&P versions of the Four-Dimensional Symptom Questionnaire (4DSQ), a self-report questionnaire measuring distress, depression, anxiety, and somatization.
In 2009, the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) electronic patient-reported outcomes (ePRO) Good Research Practices Task Force published recommendations on the evidence needed to support measurement equivalence between electronic and paper-based patient-reported outcome (PRO) measures [2]. The task force specifically recommended two types of study designs: the randomized parallel groups design and the randomized crossover design. In the former design, participants are randomly assigned to one of two study arms in which they complete either the P&P PRO or the corresponding ePRO. Mean scores can then be compared between groups. This is a fairly weak design for assessing measurement equivalence [7]. In the latter design, participants are randomly assigned to one of two study arms in which they either first complete the P&P PRO and then the ePRO, or the other way around. Then, in addition to comparing mean scores, the correlation between the P&P score and the ePRO score can be calculated. The correlation, however, provides little information about the true extent of equivalence because of measurement error and retest effects. Measurement error attenuates the correlation, making it difficult to assess the true equivalence. Retest effects may further aggravate the problem. Retest effects are thought to be due to memory effects and to specific item features eliciting the same response in repeated measurements [8]. Retest effects are assumed to diminish with longer intervals between measurements. However, longer intervals carry the risk of the construct of interest changing between measurements, leading to underestimation of the true correlation.
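To make the attenuation argument concrete (this numerical illustration is ours and not part of the task force recommendations), the classical Spearman correction for attenuation relates the observed correlation between the P&P score and the ePRO score to the correlation between the underlying true scores and the reliabilities of the two measurements:

```latex
% Spearman's correction for attenuation: the observed cross-mode correlation
% is bounded by the reliabilities of the two scores.
r_{\mathrm{obs}} = r_{\mathrm{true}}\,\sqrt{\rho_{XX'}\,\rho_{YY'}}
\qquad\Longleftrightarrow\qquad
r_{\mathrm{true}} = \frac{r_{\mathrm{obs}}}{\sqrt{\rho_{XX'}\,\rho_{YY'}}}
```

For example, with reliabilities of 0.85 in both modes, even perfectly equivalent versions (a true correlation of 1) cannot produce an observed correlation above 0.85, so a moderate observed correlation need not signal non-equivalence.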
The research designs discussed above assess only a small aspect of true measurement equivalence because they fail to address the equivalence of item-level responses [7, 8]. Contemporary approaches to measurement equivalence employ differential item functioning (DIF) analysis [9]. Addressing the equivalence of item-level information, DIF analysis has been used extensively to assess measurement equivalence across different age, gender, education, or ethnicity groups (e.g., [10]), or to assess the equivalence of different translations of a questionnaire (e.g., [11]). Whereas DIF analysis dates back to at least the 1980s [12], the method is relatively new in mode of administration equivalence research. In a non-systematic search, we identified only a dozen such studies (e.g., [7, 8, 13‐15]). The ISPOR ePRO Good Research Practices Task Force report briefly mentioned DIF analysis as ‘another approach’ without giving the method much attention [2]. The recent meta-analysis by Rutherford et al. did not include any studies using DIF analysis [6].
The idea behind DIF analysis is that responses to the items of a questionnaire reflect the underlying dimension (or latent trait) that the questionnaire intends to measure, and that two versions of a questionnaire are equivalent when corresponding items demonstrate the same item-trait relationships. There are various approaches to DIF analysis, including non-parametric approaches (Mantel–Haenszel and standardization) [16, 17] and parametric approaches (ordinal logistic regression, item response theory, and structural equation modeling) [18‐20]. In the present paper, we demonstrate the use of DIF analysis within the item response theory (IRT) framework by assessing measurement equivalence across the Web-based and original P&P versions of the 4DSQ.
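As a minimal sketch of the logistic-regression approach listed above (our illustration, not the analysis used in this study; a dichotomized item is used for simplicity, and all variable names are hypothetical), the following shows how each item can be screened for DIF by comparing nested models that predict the item response from the matching variable (e.g., a rest score or an IRT trait estimate), the administration mode, and their interaction:

```python
# Hypothetical sketch of logistic-regression DIF screening (not the exact
# analysis reported in this paper). For each dichotomized item, nested models
# are compared with likelihood-ratio tests: trait only; trait + mode
# (uniform DIF); trait + mode + interaction (non-uniform DIF).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2


def lr_test(ll_restricted, ll_full, df):
    """Likelihood-ratio test for nested logistic models."""
    stat = 2 * (ll_full - ll_restricted)
    return stat, chi2.sf(stat, df)


def dif_screen(item, trait, mode):
    """item: 0/1 responses; trait: matching variable (rest score or IRT
    theta estimate); mode: 0 = P&P, 1 = Web."""
    base = pd.DataFrame({"const": 1.0, "trait": trait})
    uniform = base.assign(mode=mode)
    nonuniform = uniform.assign(trait_x_mode=trait * mode)

    ll0 = sm.Logit(item, base).fit(disp=0).llf
    ll1 = sm.Logit(item, uniform).fit(disp=0).llf
    ll2 = sm.Logit(item, nonuniform).fit(disp=0).llf

    return {
        "uniform DIF": lr_test(ll0, ll1, df=1),      # mode main effect
        "non-uniform DIF": lr_test(ll1, ll2, df=1),  # trait-by-mode interaction
    }


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 2000
    mode = rng.integers(0, 2, n)            # 0 = P&P, 1 = Web
    theta = rng.normal(size=n)              # latent trait level
    # Simulated item with a small uniform DIF effect favouring the Web mode
    p = 1 / (1 + np.exp(-(1.5 * theta - 0.5 + 0.3 * mode)))
    item = rng.binomial(1, p)
    for label, (stat, pval) in dif_screen(item, theta, mode).items():
        print(f"{label}: LR = {stat:.2f}, p = {pval:.4f}")
```

A significant mode effect indicates uniform DIF (a constant difficulty shift between modes), whereas a significant trait-by-mode interaction indicates non-uniform DIF (different discrimination across modes); in the IRT framework used in this paper, analogous hypotheses are evaluated by comparing item parameters across the two mode groups.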
Discussion
We examined the measurement equivalence of the 4DSQ across the traditional P&P version and a modern Web-based version using DIF and differential test functioning (DTF) analysis. We identified DIF in five items from two scales. In terms of effect size, the DIF was small. The impact of DIF at the scale level (DTF) was negligible.
We employed a rigorous method to assess the dimensionality of the 4DSQ scales, using Yen’s Q3 [40]. In combination with Christensen’s method for determining critical Q3 values [42], this approach turned out to be more sensitive to multidimensionality than more traditional fit statistics such as the RMSEA. Our results indicate that the 4DSQ scales are essentially unidimensional, that is, sufficiently unidimensional to be treated as unidimensional in the context of IRT. Interestingly, the 4DSQ scales appeared to be slightly more unidimensional in the Web group than in the P&P group, as evidenced by the slightly greater variance explained by the general factors (Online Resource 2). Apparently, the Web-based 4DSQ performs somewhat better than, and certainly not worse than, the original P&P version.
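To make the Q3 procedure concrete, the following sketch (our illustration; the fitting of the unidimensional IRT model is assumed to have been done elsewhere, and the cutoff shown is only a placeholder for a simulation-based critical value in the spirit of Christensen’s method) computes the matrix of residual correlations and flags item pairs whose Q3 exceeds the average Q3 by more than a critical margin:

```python
# Minimal sketch of Yen's Q3 residual correlations (assuming that observed
# item responses and model-implied expected item scores, given each person's
# theta, have already been obtained from a fitted unidimensional IRT model).
import numpy as np


def q3_matrix(observed, expected):
    """observed, expected: arrays of shape (n_persons, n_items).
    Returns the matrix of pairwise residual correlations (Yen's Q3)."""
    residuals = observed - expected
    return np.corrcoef(residuals, rowvar=False)


def flag_pairs(q3, margin=0.20):
    """Compare each Q3 value with the average Q3; the critical margin should
    be derived from simulations under the unidimensional model, so the 0.20
    default here is only a placeholder."""
    iu = np.triu_indices(q3.shape[0], k=1)
    q3_star = q3[iu] - q3[iu].mean()
    return [(i, j, q3[i, j])
            for (i, j), s in zip(zip(*iu), q3_star) if s > margin]
```

Flagged pairs indicate local dependence between items and hence possible violations of essential unidimensionality.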
DIF analysis is often concerned with inherently different groups (e.g., gender groups), in which case randomization is not feasible. In theory, DIF analysis is not hindered by group differences in trait levels, because comparisons are matched at the trait level, so that DIF emerges only when there is measurement bias rather than genuine trait differences. However, when groups differ in more respects than the trait level and the factor of interest (e.g., gender), interpreting the source of possible DIF may become problematic: any aspect (other than trait level) in which the groups differ can potentially be the source of that DIF. Applied to mode of administration equivalence research, the fact that randomization is not required may be an advantage of DIF analysis, but potential problems in the interpretation of DIF constitute a disadvantage. To avoid such interpretation problems in this particular field, subjects can be randomly allocated to the different mode of administration groups. In that case, however, data must be collected specifically for the evaluation of measurement equivalence, whereas data from different groups are often available ‘on the shelf,’ which is much cheaper.
In conclusion, IRT-based DIF and DTF analysis of measurement equivalence across the Web-based and P&P versions of the 4DSQ identified only a few items with DIF, and the impact of this DIF was negligible. Results obtained with the Web-based 4DSQ are equivalent to results obtained using the original P&P version of the questionnaire.