Gender bias and construct validity in vocational interest measurement: Differential item functioning in the Strong Interest Inventory

https://doi.org/10.1016/j.jvb.2009.01.003

Abstract

Item response theory was used to address gender bias in interest measurement. A differential item functioning (DIF) technique, SIBTEST, together with DIMTEST for dimensionality, was applied to the items of the six General Occupational Theme (GOT) scales and 25 Basic Interest (BI) scales in the Strong Interest Inventory. A sample of 1860 women and 1105 men was used. The scales were not unidimensional, containing both primary and minor dimensions. Gender-related DIF was detected in two-thirds of the items. Item types (i.e., occupations, activities, school subjects, types of people) did not differ in DIF. A sex-type dimension was found to influence the responses of men and women differently. When the biased items were removed from the GOT scales, gender differences favoring men were reduced in the R and I scales, but gender differences favoring women remained in the A and S scales. Implications for the development, validation, and use of interest measures are discussed.

Introduction

Since the pioneering work of Strong (1943), researchers have reported large differences in the vocational interests of men and women. Women tend to express interests that fit their traditional gender role, whereas men express more interests in domains that have been considered masculine (Betz and Fitzgerald, 1987, Hackett and Lonborg, 1994). Research and debate on the issue of gender differences and possible bias in interest measurement reached a peak in the 1970s (then referred to as sex-bias and fairness see Diamond, 1975, Tittle and Zytowski, 1978). Much of the debate centered on the Strong Interest Inventory, one of the oldest and most widely used interest measures. The debate resulted in new perspectives and guidelines to reduce bias in interest inventories based on the psychometric knowledge and techniques of the time. After 1980, the sex bias debate seemed to fade away, but as is evident in the major interest inventories used today, a common agreement on how to best resolve gender bias in interest measurement has not been reached (cf. Donnay et al., 2005, Harmon et al., 1994, Holland et al., 1994, Swaney, 1995).

Recent expansion of sophisticated psychometric modeling grounded in item response theory (IRT) has provided new methods to address the issue of bias and fairness (Bolt & Rounds, 2000). These methods, along with developments in validity theory, may also offer new insights into the nature of the construct of vocational interest, especially those factors that differentially affect the responses of men and women (Smith, 2002). The purpose of this study is to apply differential item functioning (DIF) techniques to examine gender bias in items, explore its sources, and assess its influence on the gender differences detected in the General Occupational Theme (GOT) scales and the Basic Interest (BI) scales of the Strong Interest Inventory (SII).

Gender differences in the responses to interest inventories have been observed at both the scale and item level. Women tend to score higher on Holland’s Artistic, Social and Conventional types, and men score higher on the Realistic, Investigative and Enterprising types (Betz and Fitzgerald, 1987, Hackett and Lonborg, 1994). One of the main concerns in the sex-bias and fairness debate of the seventies was that differences between men and women in vocational interest assessment can have consequences for individuals seeking career counseling and for society as a whole. In particular, scale level differences can lead to sex-restrictive career options being suggested to students (Cole and Hanson, 1975, Prediger and Hanson, 1974). Interest inventories may thus serve to maintain and perpetuate the limited range of occupations considered appropriate for men and women.

Two main positions were taken on the issue. Prediger and Cole (1975) stated that the primary purpose of using an interest inventory is occupational exploration (also, see Prediger, 1977). Since differences between men and women are extraneous to the goal of occupational exploration, these differences should be removed from interest measures. In contrast, Gottfredson and Holland (1978) argued that because the constructs measured are dependent on differential experiences of men and women, the removal of sex differences from interest scores would decrease the predictive validity of the measure. These positions foreshadowed the wider debate in psychology on construct validity, measurement bias and its social consequences (Cole and Moss, 1989, Linn, 1997, Messick, 1989, Messick, 1995, Shepard, 1997).

A consensus on how to define sex-bias, or what would now be termed gender-bias, in interest measures has yet to be reached. Nevertheless, several strategies have been used to eliminate bias and sex-restrictiveness. In the 1974 revision of the Strong and the construction of one form for both women and men, Campbell (1974) changed the wording of items (e.g., policeman to police officer) and used a variety of norms for reporting standard scores (i.e., both same and combined sex-norms). Each revision of the Strong since 1974 has focused on removing sex-role bias in items and norming the scales with both female and male samples (Hansen and Campbell, 1985, Harmon et al., 1994). The most recent revision uses only combined norms (Donnay et al., 2005). Another strategy has been to remove items showing large gender differences during test development. For the Strong, items showing large gender differences in endorsement were eliminated during the 1994 revision (Harmon et al., 1994). These strategies, indicative of a classical test theory approach to reducing bias, are necessary but not sufficient to optimally reduce gender bias in interest measures. The removal of items showing gender differences can be confounded by real group differences in the trait being measured.

The lack of consensus about how to deal with gender differences is not surprising because it has not yet been adequately explained why measured interests are different for men and women. It is possible that these differences may be partly explained by item bias in interest inventories and the influence of construct-irrelevant factors (Messick, 1989) on the scales used in counseling. Fouad and Walker (2005) suggested that perceived barriers and opportunities may be such a factor influencing the assessment of interests of ethnically diverse clients. They examined racial/ethnic group differences in the SII using differential bundle functioning (DBF). Large racial/ethnic DBF was detected, implying that the items were influenced by other constructs in addition to the traits the Holland scales were designed to measure. This is also likely to be the case for men and women who work in occupations that are largely sex-segregated. Numerous barriers for entering certain types of jobs have been identified for women (Betz, 1994). It is possible that gendered opportunity structure and stereotyping of the job market differently influences the interest trait being measured for men and women.

Aros, Henly, and Curtis (1998) showed that occupational stereotypes influence responses to items in interest inventories. They used DIF, specifically Mantel-Haenszel log-odds ratios, to explore gender differences in responses to 28 occupational title items from the GOT scales measuring the six RIASEC interest types in the SII. Gender-related DIF was detected on most of the items. However, they explored DIF in only a few items and focused their investigation on one item type. Occupational titles, for example, may be more susceptible to stereotyping than activities (Crites, 1969, Kuder, 1977, Osipow, 1983). The present study examines the full range of item types used in interest inventories that may influence the gender-related differences found at the scale level.

Item response theory differs from classical test theory by modeling the interaction of the person with individual items in terms of a latent trait. By modeling responses in terms of their relations to a common underlying trait, IRT models allow us to determine whether people from two groups respond differently to the same item given that they have the same level of the trait (Bolt and Rounds, 2000, Embretson and Reise, 2000). For example, DIF techniques can determine whether women and men who are equally realistic in their interests (the trait being measured) are equally likely to endorse a highly sex-stereotyped occupation like “auto racer” or “nurse.”
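The conditional comparison described here is what the Mantel-Haenszel procedure formalizes: within each stratum of matched trait scores, endorsement odds are compared between groups and then pooled. The sketch below uses invented counts for a hypothetical "auto racer" item; the data and the three-level stratification are illustrative only.

```python
import math

# Illustrative only: invented counts for a hypothetical "auto racer" item,
# with respondents stratified by their total Realistic-scale score.
# Each stratum holds (men_yes, men_no, women_yes, women_no).
strata = {
    "low":    (30, 70, 10, 90),
    "medium": (55, 45, 25, 75),
    "high":   (80, 20, 50, 50),
}

def mantel_haenszel_log_odds(strata):
    """Log of the Mantel-Haenszel common odds ratio across matched strata.

    A value near 0 means men and women at the same trait level endorse
    the item at similar rates; values far from 0 signal DIF.
    """
    num = 0.0  # sum over strata of (men_yes * women_no) / n
    den = 0.0  # sum over strata of (men_no * women_yes) / n
    for a, b, c, d in strata.values():
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return math.log(num / den)

print(round(mantel_haenszel_log_odds(strata), 2))  # → 1.34 for these counts
```

A positive pooled log-odds ratio here indicates that, at every matched trait level, the invented men's group endorses the item more often than the women's group.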

A theoretical framework called multidimensional item response theory has been developed to account for how item bias, as defined by DIF, relates to item and test validity (Ackerman, 1992, Bolt and Stout, 1996, Kok, 1988, Shealy and Stout, 1993). The underlying mechanism producing the DIF is addressed by making a distinction between the main trait that the researcher intends to measure, alternately called the target trait or primary dimension, and other factors influencing test performance that are not intended to be measured, such as nuisance determinants and secondary dimensions (Roussos and Stout, 1996a, Shealy and Stout, 1993). The construct validity of a test is threatened if it contains items that capture traits or dimensions the test developer does not intend to measure (construct-irrelevant variance). Item bias can arise if two groups differ in their underlying distribution of an extraneous trait or secondary dimension that the scale is not intended to capture (Ackerman, 1992, Bolt and Stout, 1996). In interest measurement, items may function differently for men and women when the two groups differ, for example, in their distribution of a sex-type dimension (a secondary dimension) found to underlie responses to the interest items in the Strong (Aros et al., 1998, Einarsdóttir and Rounds, 2000).
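This mechanism, DIF arising from a group difference on a secondary dimension, can be illustrated with a small simulation. The model below is a generic two-dimensional compensatory logistic model with invented parameter values, not the model estimated in this study; it shows that two groups with identical distributions on the target trait can still differ in item endorsement at matched trait levels when they differ on the secondary (sex-type) dimension.

```python
import math
import random

random.seed(0)

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate_group(n, eta_mean, a1=1.0, a2=0.8, b=0.0):
    """Responses to one item under a two-dimensional compensatory logistic model.

    theta: the intended interest trait (same distribution in both groups).
    eta:   a secondary "sex-type" dimension on which the groups differ.
    All parameter values are invented for illustration.
    """
    data = []
    for _ in range(n):
        theta = random.gauss(0.0, 1.0)
        eta = random.gauss(eta_mean, 1.0)
        p = logistic(a1 * theta + a2 * eta - b)
        data.append((theta, 1 if random.random() < p else 0))
    return data

def endorsement_at_matched_theta(data, lo=-0.25, hi=0.25):
    """Endorsement rate among respondents matched on the target trait."""
    matched = [r for t, r in data if lo <= t <= hi]
    return sum(matched) / len(matched)

men = simulate_group(20000, eta_mean=0.5)     # higher mean on secondary dimension
women = simulate_group(20000, eta_mean=-0.5)  # lower mean on secondary dimension

# Both groups share the same theta distribution, yet endorsement differs
# among respondents matched on theta -- the signature of DIF:
print(endorsement_at_matched_theta(men), endorsement_at_matched_theta(women))
```

Because theta is identically distributed in both simulated groups, the gap in endorsement at matched theta is driven entirely by the secondary dimension, mirroring the bias mechanism described above.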

When applying an IRT framework and the multidimensional model of DIF to vocational interest measures, questions arise about the primary traits being measured and the dimensionality of the scales. Item bias is determined in reference to a criterion internal to the test. The collection of items defining this internal criterion is referred to as a valid subtest, and determining one is an empirical decision based on expert opinion or external data (Shealy & Stout, 1993). In the Strong Interest Inventory, the GOT and BI scales are established scales considered valid and useful for counseling purposes, especially occupational exploration (Donnay et al., 2005, Harmon et al., 1994). These two types of scales therefore served as valid subtests in the DIF analysis.

Dimensionality of a scale is a psychometric issue that is also conceptually important but neglected in the domain of vocational interest assessment. The six General Occupational Theme scales measure six broad interest types (Donnay et al., 2005, Holland, 1997). The dimensionality of the GOT and BI scales in the Strong has not been directly evaluated. In the present study, Stout's (1987, 1990) conception of essential unidimensionality is applied because it is less restrictive and more realistic for most measurement practices than the traditional conception of local independence in IRT-based models. Dimensionality and DIF analyses can give valuable insights into the concept being measured and the construct validity of the scales (Smith, 2002, Smith and Reise, 1998). Application of DIF analysis with the multidimensional model provides information on whether scale level differences are the result of impact or bias in interest assessment. Impact is defined as a between-group difference on a construct-valid trait (Ackerman, 1992).

In the present study, an IRT-based approach is used to evaluate the dimensionality of the SII scales and to test whether the SII items function the same way for men and women. Our study differs from the Aros et al. (1998) study in three important ways. First, SIBTEST, a DIF detection method similar to but more recent than the Mantel-Haenszel statistic, was applied. Simulation studies (Roussos & Stout, 1996b) have shown that the SIBTEST procedure detects DIF better than the Mantel-Haenszel statistic. Second, all the items defining the GOT and BI scales in the SII were tested, allowing an exploration of how the different item types (e.g., occupational titles, activities) function. Third, the overall influence of DIF on GOT scale scores was estimated, and the construct validity of the scales purged of DIF items was evaluated. We expected gender differences to be related to the sex-typing of occupations, as was the case in the Aros et al. study. Because prestige has also been shown to influence responses to interest inventory items (Einarsdóttir and Rounds, 2000, Tracey and Rounds, 1996), both sex-type and prestige were examined as possible secondary dimensions contributing to DIF.
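The core of the SIBTEST statistic is a weighted difference in studied-item scores between groups matched on a valid-subtest score. The following sketch uses invented data and omits SIBTEST's regression correction for measurement error in the matching score, so it illustrates the idea rather than reproducing the full procedure.

```python
# Simplified sketch of the SIBTEST beta statistic: examinees are matched on
# their valid-subtest score, and the between-group difference in studied-item
# means is averaged over score levels, weighted by how many examinees fall at
# each level. The regression correction used by the real procedure is omitted.

def sibtest_beta(ref, focal):
    """ref/focal map a subtest score level to a list of studied-item scores.

    Positive values mean the reference group outscores matched focal-group
    members on the studied item (potential DIF against the focal group).
    """
    total = sum(len(v) for v in ref.values()) + sum(len(v) for v in focal.values())
    beta = 0.0
    for k in set(ref) & set(focal):  # score levels occupied by both groups
        weight = (len(ref[k]) + len(focal[k])) / total
        diff = sum(ref[k]) / len(ref[k]) - sum(focal[k]) / len(focal[k])
        beta += weight * diff
    return beta

# Invented 0/1 endorsements at three matched subtest score levels:
men   = {3: [1, 1, 0, 1], 4: [1, 1, 1, 0], 5: [1, 1, 1, 1]}
women = {3: [0, 1, 0, 0], 4: [1, 0, 1, 0], 5: [1, 1, 0, 1]}
print(round(sibtest_beta(men, women), 3))  # → 0.333 for these data
```

In the full SIBTEST procedure, beta-hat is further divided by its standard error to yield a test statistic; the weighted-difference core shown here is the part that defines the direction and size of the DIF.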


Participants

The responses of 2965 college students to the Strong Interest Inventory (SII) were sampled using a simple random sampling strategy from the test publisher’s database (Consulting Psychologists Press) in 2000. The SII had been administered to a college student population and sent to the publisher for scoring. The sample consisted of 1860 (62.7%) women and 1105 (37.3%) men. Information about the ethnicity of the participants showed that 2.3% identified as American Indian or Alaskan Native, 4.5% as

GOT and BI mean scores

The mean gender differences for the General Occupational Theme (GOT) scales and the Basic Interest (BI) scales were examined to determine whether the differences in the present sample are similar to those detected in previous research. Table 1 shows the mean gender differences on the GOT scales, five of which were statistically significant. The differences are expressed in standard deviation units, shown as Cohen’s d effect sizes. Women tended to score higher on the Social, Artistic and
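Cohen's d expresses a mean difference in pooled-standard-deviation units. With invented scale scores (not the study's data), the computation can be sketched as:

```python
import math

# Cohen's d: mean difference divided by the pooled standard deviation.
# The scale scores below are invented for illustration only.
def cohens_d(x, y):
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)  # unbiased sample variances
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    pooled_sd = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled_sd

women_social = [55, 60, 52, 58, 61, 57]  # hypothetical Social-scale scores
men_social   = [50, 48, 53, 49, 52, 46]
print(round(cohens_d(women_social, men_social), 2))  # → 2.53 for these scores
```

A positive d under this ordering of arguments means women score higher; the magnitude for these invented scores is far larger than typical scale-level gender differences and is not representative of the study's results.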

Discussion

The results of this study showed that about two-thirds of the General Occupational Themes and Basic Interest items function differently for women and men, indicating that there is extensive gender-related item bias in the Strong. The six GOT scales were not essentially unidimensional, but contained dimensionally distinct sets of items. A major dimension was detected in the BI scales, but these were also found to contain some minor dimensions. The relationship of DIF with the sex-type ratings

References (66)

  • N.E. Betz et al.

    The career psychology of women

    (1987)
  • D. Bolt et al.

    Advances in psychometric theory and methods

  • D. Bolt et al.

    Differential item functioning: Its multidimensional model and resulting sibtest procedure

    Behaviormetrika

    (1996)
  • D.P. Campbell

    Manual for the Strong–Campbell interest inventory

    (1974)
  • N.S. Cole et al.

    Bias in test use

  • N.S. Cole et al.

    Impact of interest inventories on career choice

  • E.A. Cooper et al.

    Comparison of different methods of determining sex type of an occupation

    Psychological Reports

    (1985)
  • H-H. Chang et al.

    Detecting DIF for polytomously scored items: An adaption of the SIBTEST procedure

    Journal of Educational Measurement

    (1996)
  • J.O. Crites

    Vocational psychology: The study of vocational behavior and development

    (1969)
  • R.V. Dawis

    Vocational interests, values and preferences

  • D.A.C. Donnay et al.

    Strong Interest Inventory manual: Research, development, and strategies for interpretation

    (2005)
  • J.A. Douglas et al.

    Item-bundle DIF hypothesis testing: Identifying suspect bundles and assessing their differential functioning

    Journal of Educational Measurement

    (1996)
  • S.E. Embretson et al.

    Item response theory for psychologists

    (2000)
  • DIF procedures (1994 supplement)

  • H. Finch et al.

    Performance of DIMTEST- and NOHARM-based statistics for testing unidimensionality

    Applied Psychological Measurement

    (2007)
  • N.A. Fouad

    Cross-cultural differences in vocational interests: Between-groups difference on the Strong Interest Inventory

    Journal of Counseling Psychology

    (2002)
  • G.D. Gottfredson et al.

    Toward beneficial resolution of the interest inventory controversy

  • J.P. Guilford et al.

    A factor analytic study of human interests

    Psychological Monographs

    (1954)
  • G. Hackett et al.

    Career assessment and counseling for women

  • J.C. Hansen et al.

    Manual for the SVIB-SCII

    (1985)
  • L.W. Harmon et al.

    Strong Interest Inventory: Applications and technical guide

    (1994)
  • J. Hattie

    Methodological review: Assessing unidimensionality of tests and items

    Applied Psychological Measurement

    (1985)

We thank David A. Donnay and Consulting Psychologists Press for providing the archived data and Daniel Bolt and Terry A. Ackerman for technical assistance. We also thank Patrick I. Armstrong and, especially, Christopher A. Moyer for their close reading of an earlier draft of the manuscript. This article is based on the first author’s dissertation submitted to the University of Illinois at Urbana-Champaign. This research was supported in part by a grant from Iceland University of Education.
