Social scientists have long been interested in studying group differences on key outcome measures. For example, gender differences in internalizing symptoms (Hankin and Abramson
2002; Sanders et al.
1999), racial differences in externalizing symptoms (Krueger et al.
2003; Ruchkin et al.
2006), and differences in growth trajectories between children from different family structures (Curran
2000; Beyers and Loeber
2003) are all areas in which researchers have reported differences between groups. However, whether group differences are real or the result of measurement bias is not always clear. When groups are compared on a given construct, the impacts of real group differences and bias should be recognized and differentiated (Dorans and Holland
1993; Millsap and Everson
1993). Detecting differential item functioning (DIF) and correcting items that exhibit it are therefore essential if researchers are to make valid comparisons between groups.
Measuring Antisocial Behavior
The self-report antisocial behavior scale created by Elliott, Ageton, and Huizinga (
1985) for the original National Youth Survey (NYS) is perhaps the most widely known measure of antisocial behavior, and its pool of items is arguably the most widely used. In fact, items from this self-report antisocial behavior scale are so well known and widely used that three major longitudinal studies funded by the Office of Juvenile Justice and Delinquency Prevention—studies centered in Rochester (Thornberry et al.
1993), Denver (Huizinga et al.
1991), and Pittsburgh (Loeber et al.
1998)—and other major studies, including the Dunedin Multidisciplinary Health and Development Study (Moffitt et al.
1996), all use items from the original Elliott et al. scale or modified versions of them to measure antisocial behavior (Piquero et al.
2002).
The NYS began in 1976 as a national probability sample of over 1700 youth (Elliott et al.
1985), and the study protocol included an antisocial behavior inventory with 47 items designed to capture a wide range of behaviors, including Uniform Crime Report offenses. However, most researchers use or adapt items only from the general delinquency scale, a subset of 24 items, ranging in severity from serious and violent offenses (e.g., “had [or tried to have] sexual relations with someone against their will;” “used force to get money or things from other people [not students or teachers]”), to more common offenses such as theft and vandalism, to relatively minor offenses such as skipping classes. In the NYS, 177 respondents were randomly selected and reinterviewed approximately 4 weeks after their initial assessment during the 5th wave of the study. Test-retest correlations were 0.84 and 0.75 for the general antisocial behavior frequency and variety scores, respectively, and 0.52 to 0.93 for the general delinquency subset, with a mean of 0.74 across 22 estimates of test-retest reliability (Huizinga and Elliott
1986).
Measurement Equivalence Across Groups
When investigating measurement equivalence across groups, the developmental appropriateness of items is a concern. One strength of the antisocial items from the NYS (Elliott et al.
1985,
1989) is the fact that item content was adjusted to the developmental level of respondents. In our study, we investigate measurement equivalence in two different sets of items: one set that is developmentally appropriate for adolescents, and the other designed for young adults. Ideally, these forms should be linked so that scores from the adolescent and adult versions fall on the same scale, making it possible to make comparisons across adolescence and young adulthood. Although a detailed explanation of linking procedures is beyond the scope of the current study, linking scales and investigating DIF requires a data collection design in which the two measures share common items. The antisocial behavior measures used in this study have six items that are common across the adolescent and young adult forms. Hence, the scales for these two forms may be linked and placed on a common metric, provided that there are no method (age) effects, that thresholds are well spread, and that a number of other conditions hold (Reise and Waller
2009). In the current study, these common items are the focus of testing DIF.
Methods for detecting measurement bias may identify either “an observed conditional invariance” or “an unobserved conditional invariance” (Millsap and Everson
1993). In particular, the second category contains likelihood ratio (LR) tests based on either IRT or confirmatory factor analysis (CFA) (see Reise et al.
1993, for comparison of LR tests based on IRT and CFA; Kim et al.
2007, for application of DIF detection methods). For short tests, Finch (
2005) claimed that LR tests based on IRT more accurately detected uniform bias—that all categories in an item consistently behave in a fashion favoring one group over another group—than did a variety of other methods. In this study, we used the LR test to detect measurement bias on an antisocial behavior scale for adolescents and young adults.
Parameter estimation in item response theory (IRT) can be categorized into parametric and nonparametric methods, both of which can scale dichotomous as well as polytomous items. Specifically, parametric IRT models are based on either normal ogive or logistic functions, whereas nonparametric IRT models do not assume any specific parametric function (see Embretson and Reise
2000; Meijer and Baneke
2004; Sijtsma
1998, for further description of parametric and nonparametric IRT models). When parametric IRT models are employed, likelihood ratio tests may be implemented by BILOG-MG (Zimowski et al.
2002) and MULTILOG-MG (Thissen
2003) for dichotomous and polytomous items, respectively. In contrast, when nonparametric IRT models are used, DIF tests may be conducted using TESTGraf (Ramsay
2001).
Thissen, Steinberg, and Gerrard (
1986) discussed the LR test in parametric IRT (IRT-LR) in the context of analyses of dichotomous items. Later, Thissen and Steinberg (
1988) extended the original parametric IRT-LR method to tests composed of polytomous items. The basic idea of this method is to compare the log-likelihood values from two models, where the more restricted model is nested within the less restricted one (i.e., the one with more freely estimated parameters).
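As a minimal sketch of the nested-model comparison described above, the snippet below computes the likelihood ratio statistic G² = −2(ln L_restricted − ln L_free) and refers it to a chi-square distribution whose degrees of freedom equal the number of additional parameters in the less restricted model. The function names and the log-likelihood values are hypothetical illustrations; they are not output from this study or from any of the programs named above. The chi-square tail probability is computed with a standard recurrence for integer degrees of freedom so that the example needs only the Python standard library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Closed-form starting points: df = 2 (even case) or df = 1 (odd case).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    # Recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1).
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q


def lr_test(loglik_restricted, loglik_free, extra_params, alpha=0.05):
    """Return (G2, p-value, significant?) for two nested models."""
    g2 = -2.0 * (loglik_restricted - loglik_free)
    p_value = chi2_sf(g2, extra_params)
    return g2, p_value, p_value < alpha


# Hypothetical log-likelihoods. Freeing one four-category item across groups
# adds four parameters under a graded model (one slope, three thresholds).
g2, p_value, flagged = lr_test(-2452.3, -2447.1, extra_params=4)
print(f"G2 = {g2:.2f}, p = {p_value:.4f}, DIF flagged: {flagged}")
```

In practice the two log-likelihoods would come from fitting the constrained and unconstrained multiple-group models in IRT software; only the final comparison step is shown here.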
In certain situations, the selection of items that serve as anchor items is obvious; in such situations, the computation of
LR and investigation of item DIF is relatively simple. More commonly, which items should be considered anchor items and which should be studied items is less clear, and the assessment of item bias is therefore more complicated. Based on a simulation study, Candell and Drasgow (
1988) recommended an iterative procedure for linking metrics from two separate groups and detecting which items exhibited DIF when anchor and studied items are mixed or unknown. Using an iterative method, Segall (
1983) proposed four steps to detect items exhibiting DIF. First, item parameter metrics from two independently calibrated groups are initially linked. Second, after equating item parameters from the two groups by employing linking coefficients from the previous step, all items from a test are examined for item bias. Third, the item with the most extreme DIF is identified, and linking coefficients are recalculated after excluding this item. Fourth, with linking coefficients that are calculated without the item exhibiting most extreme DIF,
G² is recomputed for all remaining items in a scale and evaluated with the corresponding critical value based on the alpha level.
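The loop structure of the four steps above can be sketched as follows. This is a deliberately simplified stand-in and not the procedure as published: it assumes a shift-only linking constant (the mean difference in item locations over the current anchor set) and replaces G² with a squared standardized difference that is compared with a one-degree-of-freedom chi-square critical value. The function name, item labels, parameter values, and standard errors are all hypothetical.

```python
def iterative_dif_screen(b_ref, b_foc, se_ref, se_foc, critical=3.841):
    """Link, test every item, drop the most extreme item, relink, and retest.

    b_ref / b_foc: item location estimates from two separately calibrated groups.
    se_ref / se_foc: their standard errors. All inputs are dicts keyed by item.
    Returns the flagged items and the final linking shift.
    """
    flagged = []
    shift = 0.0
    while True:
        anchors = [item for item in b_ref if item not in flagged]
        if len(anchors) < 2:
            break
        # Steps 1 and 3: (re)compute the linking constant from the anchors.
        shift = sum(b_foc[i] - b_ref[i] for i in anchors) / len(anchors)
        # Steps 2 and 4: test each remaining item on the linked metric.
        stats = {
            i: (b_foc[i] - shift - b_ref[i]) ** 2
            / (se_ref[i] ** 2 + se_foc[i] ** 2)
            for i in anchors
        }
        worst = max(stats, key=stats.get)
        if stats[worst] <= critical:
            break  # no remaining item exceeds the critical value
        flagged.append(worst)
    return flagged, shift


# Hypothetical estimates: item "d" is displaced relative to the others.
b_ref = {"a": -1.0, "b": 0.0, "c": 1.0, "d": 0.5}
b_foc = {"a": -0.8, "b": 0.2, "c": 1.2, "d": 1.9}
se = {item: 0.15 for item in b_ref}

flagged_items, final_shift = iterative_dif_screen(b_ref, b_foc, se, se)
print(flagged_items, round(final_shift, 3))
```

The point of the iteration is visible in the example: the displaced item contaminates the first linking constant, and removing it before relinking gives a cleaner metric on which to judge the remaining items.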
Our aim in the present study was to investigate DIF for two scales of antisocial behavior derived from commonly used items from the NYS, one scale for adolescents and the second for young adults. Both scales were assumed to fit the graded response model (GRM; Samejima
1969). Children from two-parent families typically exhibit lower levels of antisocial behavior relative to children from single-parent families (Dawson
1991; Hoffmann
2006). However, a direct way of comparing levels of antisocial behavior across these two groups is possible only if antisocial behavior items do not show DIF. Therefore, we conducted DIF analyses on the adolescent and young adult forms of the antisocial behavior scale to determine whether items exhibited DIF across two groups of participants, one from two-parent families and the second group from single-parent families.
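For readers less familiar with the GRM, the sketch below shows how category probabilities are obtained for a single item in its usual logistic form: the probability of responding in category k or higher is a logistic function of a(θ − b_k), and the probability of each category is the difference between adjacent cumulative probabilities. The function name and the parameter values are hypothetical and are not estimates from this study.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Return P(X = 0), ..., P(X = m) for one item under the graded response model.

    theta: latent trait level; a: discrimination; thresholds: ascending b_1..b_m.
    """
    # Cumulative probabilities P(X >= k), with P(X >= 0) = 1 and P(X >= m + 1) = 0.
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    # Each category probability is the difference of adjacent cumulative terms.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical four-category item (scored 0 to 3).
probs = grm_category_probs(theta=0.0, a=1.5, thresholds=[0.0, 1.0, 2.0])
print([round(p, 3) for p in probs])
```

Under this parameterization, DIF in the discrimination parameter changes how steeply the cumulative curves rise, whereas DIF in the thresholds moves them along the θ axis.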
Discussion
Our analyses revealed different types of DIF—uniform and nonuniform—which must be treated in different manners, as they imply different types of bias. A common approach to managing item
bias is to delete any item showing DIF from a measure. However, retaining all items is usually preferable, because developing a scale and confirming its factor structure is expensive and time-consuming. Hence, one way to handle scales that exhibit DIF is to correct the
bias by retaining matching items with opposite
biases that cancel out the DIF at a scale level (Teresi
2006). In order to match up the appropriate corresponding items and cancel the item DIF at a scale level, the direction and type of DIF should be recognized properly.
In our study, we found evidence of uniform bias in item del12. However, this bias appeared to be counteracted at the scale level: the bias by which an adolescent from SPP has a higher probability of selecting a higher category than an adolescent from IYFP was cancelled out by slight bias favoring IYFP over SPP on the remaining items in the adolescent scale, even though the bias on those items was not statistically significant. Because the significant uniform bias on this one item was offset by small, nonsignificant bias in the opposite direction on the remaining items, across-group differential test functioning (DTF) on the total scale was negligible. Similarly, the non-significant DTF for the young adult antisocial behavior scale suggests that any bias in items del16 and del17 was also counteracted by bias of the opposite type and direction on the remaining items.
Despite the item DIF revealed in our analyses, comparisons between participants from single- and two-parent families using these scales appear sound, given our non-significant results with regard to scale DTF on these instruments. In general, when analyses reveal significant DTF, a researcher should scrutinize the magnitude and direction of DIF for all items in a scale and ensure that the test includes items whose DIF is opposite in direction and comparable in size, so that DTF becomes non-significant. Because our analyses revealed non-significant DTF even in the presence of some significant differential functioning at the item level, this extra step of balancing bias across items was not necessary in the current study.
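The cancellation argument can be illustrated numerically. The sketch below assumes GRM items and compares expected item scores for two groups at a single trait level: one item has thresholds shifted in favor of the focal group, and three items have smaller shifts in the opposite direction. All parameter values and names are hypothetical, and the example shows only the arithmetic of cancellation; it is not a significance test of DTF.

```python
import math


def expected_score(theta, a, thresholds):
    """Expected item score under the GRM: the sum of P(X >= k) over k."""
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds)


def item_and_scale_differences(theta, reference, focal):
    """Focal-minus-reference expected scores per item, plus their sum."""
    diffs = [
        expected_score(theta, *foc) - expected_score(theta, *ref)
        for ref, foc in zip(reference, focal)
    ]
    return diffs, sum(diffs)


# Hypothetical (discrimination, thresholds) pairs for four items.
reference = [(1.5, [0.5, 1.5, 2.5])] * 4
# Item 1: thresholds 0.3 lower for the focal group (higher expected score).
# Items 2 to 4: thresholds 0.1 higher for the focal group (lower expected score).
focal = [(1.5, [0.2, 1.2, 2.2])] + [(1.5, [0.6, 1.6, 2.6])] * 3

item_diffs, scale_diff = item_and_scale_differences(0.0, reference, focal)
print([round(d, 3) for d in item_diffs], round(scale_diff, 3))
```

Each small opposite-direction difference would be hard to detect on its own, yet together the three items offset most of the larger difference on the first item, which is the pattern described above for the adolescent scale.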
Similar to Dunifon and Kowaleski-Jones (
2002), we found that adolescents from single-parent families exhibit more antisocial behavior than adolescents from two-parent families. Specifically, for 1994, the average estimated delinquency levels were −.04 (
SD = .67) and .18 (
SD = .82) in IYFP and SPP, respectively. For 1995, the average values were −.10 (
SD = .63) and .11 (
SD = .76), respectively. Two additional findings were noteworthy. First, items exhibiting DIF in 1994 were items representing more extreme forms of antisocial behavior, whereas items exhibiting DIF in 1995 reflected milder forms of antisocial behavior. In 1994, for a given level of theta, adolescents from one-parent families had a higher probability of endorsing a given option on the extreme item del11 (beat up somebody because they made you angry) than adolescents from two-parent families. Conversely, in 1995, again for a given level of theta, adolescents from one-parent families had a lower probability of endorsing a given option on the relatively mild items del16 (drive a car recklessly) and del17 (cheat at school or other places) compared to adolescents from two-parent families. Second, the two items showing DIF in 1995 indicate a different relationship with the construct of antisocial behavior depending on the family structure (i.e., single-parent and two-parent family). That is, the two DIF items representing relatively mild antisocial behaviors are better indicators of antisocial behavior in two-parent families than in one-parent families:
a_j parameters of 1.75 vs. .93 and 1.90 vs. .49 for del16 and del17, respectively.
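To make the distinction between the two types of DIF concrete, the sketch below compares expected item scores for two groups under the GRM. In the nonuniform case the groups share thresholds but differ in discrimination, using the 1.75 and .93 values reported above for del16; in the uniform case they share a discrimination but differ by a constant threshold shift. The thresholds, the 0.4 shift, the 1.2 discrimination, and the function names are hypothetical, so the numbers illustrate the pattern rather than reproduce the study's results.

```python
import math


def expected_score(theta, a, thresholds):
    """Expected item score under the GRM: the sum of P(X >= k) over k."""
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds)


def group_gap(theta, group_1, group_2):
    """Expected-score difference (group 1 minus group 2) at a given theta."""
    return expected_score(theta, *group_1) - expected_score(theta, *group_2)


THRESHOLDS = [1.0, 2.0, 3.0]  # hypothetical, not estimates from the study
LOW_THETA, HIGH_THETA = 0.0, 4.0

# Nonuniform DIF: equal thresholds, unequal discriminations (1.75 vs .93).
nonuniform = [
    group_gap(t, (1.75, THRESHOLDS), (0.93, THRESHOLDS))
    for t in (LOW_THETA, HIGH_THETA)
]

# Uniform DIF: equal discriminations, thresholds shifted by a constant.
shifted = [b - 0.4 for b in THRESHOLDS]
uniform = [
    group_gap(t, (1.2, shifted), (1.2, THRESHOLDS))
    for t in (LOW_THETA, HIGH_THETA)
]

print("nonuniform:", [round(g, 3) for g in nonuniform])
print("uniform:   ", [round(g, 3) for g in uniform])
```

With these values the nonuniform gap changes sign between the low and high trait levels, so which group appears favored depends on theta, whereas the uniform gap keeps the same sign at both levels.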
The present study has several strengths and is of practical importance for longitudinal research on antisocial behavior in adolescents and young adults. First, this study highlights the importance of testing for DIF in the adolescent and young adult antisocial behavior scales, which are widely used in studies of problem behavior. To reveal the actual magnitude of the mean difference on the two scales between groups such as single-parent and two-parent families, DIF tests at the item and scale levels were implemented. Second, having established the importance of detecting both item and scale bias, our study illustrated methods for making these comparisons. To conduct DIF tests on the two scales, we employed an iterative procedure for detecting DIF using the likelihood ratio method. The iterative approach enabled us to identify bias arising from DIF on certain items and to isolate it from any bias arising from DIF on the remaining items. Third, we showed how the two types of DIF, uniform and nonuniform, can be differentiated and treated. Uniform DIF was identified for one item on the adolescent scale, and nonuniform DIF was detected for two items on the young adult scale. Fourth, we suggested that scale-level tests, such as DTF, are important and may support a conclusion that the overall scale is not biased even though significant item-level DIF is found.
Furthermore, the present study has another practical implication for longitudinal studies on antisocial behavior in adolescents and young adults. The two scales employed in this study have three common items with which the two scales can be linked onto a common scaling metric. By eliminating or controlling for DIF on the three common items, a researcher in a longitudinal study on antisocial behavior can compare trajectories of individuals from adolescence through young adulthood for individuals from both single-parent and two-parent families. That is, one could estimate scores on the underlying antisocial behavior latent variable that were on a comparable metric across adolescence and young adulthood, enabling one to study change in antisocial behavior tendencies across these age periods even though only a relatively small number of items are in use across age levels.
The current study also has limitations that should be noted. First, the sample size (n = 556) of the present study was not especially large, and the number of individuals from single-parent families was relatively small (n = 109). Given the sample size, we had lower power to detect DIF than we would have had with a larger sample. Thus, it is possible that important levels of item DIF existed but went undetected in our analyses. Additionally, item del17 was not completely investigated for DIF because no individuals from single-parent families responded using the highest category on del17. Not all items in the two antisocial behavior scales were investigated for DIF and DTF, because some had insufficient responses above the lowest category on the response scale. Future research with larger samples that exhibit higher levels of antisocial behavior could therefore extend our research by examining DIF on the items we had to delete due to low frequencies of response. Finally, even for items with sufficient numbers of responses in higher categories, the frequency of these responses was not large. In such situations, a researcher might re-score item responses from a 0-to-3 scale to 0-1 scoring. Future research could then investigate whether use of dichotomous IRT models, including 2PLMs or 3PLMs, to model these responses leads to approximately the same levels of measurement precision as does the use of polytomous IRT models such as the GRM.
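The re-scoring mentioned above amounts to collapsing the upper response categories. A minimal sketch follows, assuming that any response above the lowest category is recoded as 1; the cut point is an illustrative choice rather than one made in this study, and the function name and response vector are hypothetical.

```python
def dichotomize(responses, cut=1):
    """Recode ordered responses (e.g., 0-3) to 0/1: 1 if response >= cut, else 0."""
    return [1 if r >= cut else 0 for r in responses]


# Hypothetical responses to one item scored 0 to 3.
original = [0, 0, 1, 3, 2, 0]
binary = dichotomize(original)
print(binary)
```

Collapsing categories discards information about how often a behavior occurred, which is why the comparison of measurement precision between dichotomous and polytomous models is left as a question for future research.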
Despite these limitations, the present study has practical implications for applied researchers: non-significant DTF can exist at the scale level, despite significant levels of item DIF on common items in two scales of antisocial behavior for adolescents and young adults. Using these results, we have a useful way of linking scores on the two antisocial behavior scales across age groups. Our study also illustrated how the two types of DIF—uniform and nonuniform—differ in terms of item and category parameters in GRMs. Given the non-significant levels of DTF on the total scale scores, comparing respondents from single- and two-parent families on raw scale scores appears justified.