Editorial

Getting Entangled in the Nomological Net

Thoughts on Validity and Conceptual Overlap

Published Online: https://doi.org/10.1027/1015-5759/a000173

Psychological research relies heavily on tests and questionnaires to measure constructs and traits. Tests and questionnaires are thus not only in high demand but are also constantly being developed anew. Likewise, researchers regularly propose new traits or constructs, and each such proposal triggers a wave of test and questionnaire development. This investment of time and research resources is necessary to ensure high-quality measurement tools that researchers and practitioners alike can trust to assess the intended trait or construct.

For this reason, journals such as the European Journal of Psychological Assessment publish studies evaluating the psychometric properties of such new measurement tools. Typically, such evaluation studies include some estimate of reliability (Schweizer, 2011) and concentrate on demonstrating the validity of the scores derived from the new measure. A look at the history of this journal reveals that the proportion of publications applying some form of factor analysis has risen from around 28% in the 1990s to 40% since the year 2000 (Alonso-Arbiol & van de Vijver, 2010). From this it can be inferred that the factorial validity of the published measurement tools has been a central theme of published research. Factorial validity is of course an extremely important issue and provides information necessary to many scoring procedures. Since it is trait scores (in their various forms) which are most commonly used in applied studies, factorial validity should not be neglected. However, while a newly devised measurement tool may demonstrate factorial validity and produce reliable test scores, its utility in the field is far from assured. Construct validity-related evidence is still necessary to ensure that the new measure truly captures the trait it was intended to capture. Campbell and Fiske (1959) asserted this as follows:

We believe that before one can test the relationships between a specific trait and other traits, one must have some confidence in one’s measures of that trait. Such confidence can be supported by evidence of convergent and discriminant validation. (p. 100)

Within this Editorial we would like to outline some problems related to construct validity and suggest lines of research that might solve them. Schweizer (2012) already discussed problems with convergent validity at length in an editorial in this journal. We will therefore broaden the focus and include some additional issues we deem important.

The Idea of Convergent and Discriminant Validity as Proposed by Campbell and Fiske (1959)

Campbell and Fiske (1959) started their seminal paper by pointing out four aspects important to a validation process. First, convergent validity necessarily requires independent measurement procedures, that is, different measurement approaches (e.g., paper-pencil and observation) must be applied. Second, besides convergent validity-related evidence, discriminant validity-related evidence is also required; only then does a full picture of validity emerge. Third, each measure includes variance due to a trait and variance due to method. Without disentangling these different variance sources, validity estimates for a test score might be inflated. Finally, in order to achieve these goals, it is necessary to employ more than one method and to assess more than one trait. The approach they suggested – the multitrait-multimethod matrix (MTMM) – allows all of these aspects to be part of a single analysis. Such a matrix summarizes the correlations between several traits, each assessed with each of several methods – importantly, more than one method must be used. Within the matrix, Campbell and Fiske differentiated reliability diagonals, validity diagonals, heterotrait-monomethod triangles, and heterotrait-heteromethod triangles. Moreover, by specifying the relationships between validity diagonals and triangles as well as between the correlational patterns within the triangles, Campbell and Fiske defined what evidence is needed to speak of convergent and discriminant validity. Schweizer (2012) outlined some of the problems around this approach, which we need not repeat here.
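To make this terminology concrete, the following is a minimal sketch in R of a toy MTMM matrix for two traits measured with two methods; all values are purely illustrative and not taken from Campbell and Fiske.

```r
# A toy MTMM matrix for two traits (T1, T2), each measured with two
# methods (M1, M2). All values are illustrative.
labels <- c("T1M1", "T2M1", "T1M2", "T2M2")
mtmm <- matrix(c(
  .85, .30, .60, .20,   # T1M1
  .30, .80, .25, .55,   # T2M1
  .60, .25, .82, .28,   # T1M2
  .20, .55, .28, .78),  # T2M2
  nrow = 4, byrow = TRUE, dimnames = list(labels, labels))
# Main diagonal (.85, .80, .82, .78): reliability diagonal.
# .60 and .55: validity diagonal (same trait, different methods).
# .30 and .28: heterotrait-monomethod triangles.
# .25 and .20: heterotrait-heteromethod triangles.
```

In this toy example the validity diagonal values exceed all heterotrait values, the pattern Campbell and Fiske require before speaking of convergent and discriminant validity.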

Two Implications From Campbell and Fiske’s MTMM Approach

Issue 1: Selecting Traits for the Study of Convergent and Discriminant Validity

We want to focus on two important issues. The first concerns discriminant validity-related evidence. In selecting discriminant traits, Campbell and Fiske emphasized the importance of providing a definition as well as positioning the discriminant trait within a nomological net of the trait to be measured by the new instrument. Such a framework provides the depth of information necessary to select appropriate discriminant traits. Validity studies often include correlations with numerous different operationalizations of the same trait as convergent validity-related evidence. When it comes to discriminant validity-related evidence, however, there is often no clear rationale for why exactly these traits were selected, making the choice look arbitrary. Yet, as Campbell and Fiske already pointed out:

When a dimension of personality is hypothesized, when a construct is proposed, the proponent invariably has in mind distinctions between the new dimension and other constructs already in use. One cannot define without implying distinctions, and the verification of these distinctions is an important part of the validational process. (p. 84)

Following this statement and the demands for selecting discriminant traits, it seems necessary to recall the requirement to clearly define the trait to be measured, embed it in a nomological net, and base the selection of discriminant traits on this network. Discriminant validity-related evidence obtained this way is more difficult to produce but more informative: It must be shown that a new measure of a trait can be distinguished from an existing measure of a (closely) related trait. Such findings tell us far more about discriminant validity than do correlations with measures assessing very distant traits.

Visualizing the Nomological Net of Personality

Pace and Brannick (2010) conducted a bare-bones meta-analysis (corrections only for sampling error) of different Big Five questionnaires. The underlying assumption was that questionnaires targeting the same Big Five domain should capture essentially the same trait. Pace and Brannick concluded:

Convergent validities were lower than expected, indicating substantial differences among tests. Such a result begs for an explanation of the differences among tests as well as a consideration of the implications of such differences for theory and practice. (p. 674)

In fact, the largest overall convergent correlation was found for Extraversion at .56, whereas the estimated overall reliability of Extraversion measures was .83. Thus, even in the best case, about 50% of the instruments' reliable variance is not shared but rather unique to the specific questionnaires.
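One way to arrive at the roughly 50% figure, assuming the standard correction for attenuation underlies the claim (the computation is not spelled out in the text), is sketched below.

```r
# Sketch of the arithmetic behind the "about 50%" claim, assuming the
# standard correction for attenuation (our reading; the computation is
# not spelled out in the text).
r_xy <- .56  # largest overall convergent correlation (Extraversion)
r_xx <- .83  # estimated overall reliability of Extraversion measures
rho  <- r_xy / sqrt(r_xx * r_xx)  # disattenuated correlation, ~.67
rho^2                             # shared reliable variance, ~.46
```

Under this reading, only about 46% of the reliable variance is shared, leaving roughly half unique to the specific questionnaires.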

What the Pace and Brannick (2010) study highlights is what has become known as the “jingle-jangle” fallacy, namely, that scales with the same name may measure different things, and that scales with a different name may measure the same thing. Here we demonstrate the utility of network diagrams (see Epskamp, Cramer, Waldorp, Schmittmann, & Borsboom, 2012) in representing cross-sectional association matrices to visualize the “jingle-jangle” within personality inventories and to highlight the challenge of selecting traits in discriminant and convergent validity studies.

Figure 1 shows a network representation of the correlation matrix between 113 personality facet scale scores from the NEO-PI-R, HEXACO, 6FPQ, 16PF, MPQ, and JPI, taken from the Eugene-Springfield Community Sample (Goldberg, 2005). Correlations are based on a sample of 459 participants for whom complete data were available. Within the figure, each facet scale score is represented as a node (circle), and the correlations between scores are depicted as edges (lines), with the thickness of an edge representing the magnitude of the association. For clarity, associations smaller than r = .35 have been suppressed.

Figure 1. A network diagram of the correlations between 113 facet scales from the NEO-PI-R, HEXACO, 6FPQ, 16PF, MPQ, and JPI derived from the Eugene-Springfield Community Sample (n = 459). The diagram was constructed using the qgraph package in R (Epskamp, Cramer, Waldorp, Schmittmann, & Borsboom, 2012). The graph uses the "spring" option, which makes the length of the edges dependent on the weight (correlation) between nodes. This has the visual effect of drawing more closely associated nodes together in the graph layout.
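A minimal sketch of how such a diagram can be produced with qgraph is given below; the data frame facet_scores, holding one column per facet scale score, is a hypothetical name.

```r
# Minimal sketch for a Figure-1-style network diagram with qgraph.
# 'facet_scores' is a hypothetical data frame with one column per
# facet scale score (complete cases only).
library(qgraph)

facet_cors <- cor(facet_scores, use = "complete.obs")

qgraph(facet_cors,
       graph   = "cor",     # input is a correlation matrix
       layout  = "spring",  # edge length depends on correlation weight
       minimum = 0.35)      # suppress associations below r = .35
```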

Marked in gray are three clusters and two pairs of facets that represent a series of situations with respect to the jingle-jangle fallacy and new test construction. First, consider the two pairs of scales on the right-hand side of Figure 1. The two nodes labeled TR are the facet scales of Traditionalism from the MPQ and JPI. As might be expected from two scales that share a label, they are highly associated (r = .77). However, the two nodes labeled CR and IN are the Creativity facet of the HEXACO and the Innovation scale of the JPI, respectively. Despite being labeled differently, the pairwise association between these scales (r = .76) is nearly identical to that of the Traditionalism scales. Thus, the question arises: Is this correlation evidence of convergent or of discriminant validity?

The same question arises when we consider broader clusters of traits within the nomological net, including traits that are among the most highly researched in the field. For example, consider the cluster of nodes at the bottom center of Figure 1. Two nodes are the Social Boldness (SB; r = .70) facets of the 16PF and HEXACO. The remaining three nodes, which share equivalent associations (mean r = .72; range = .61 to .79) with the other nodes in the cluster, are the Exhibition facet of the 6FPQ (EX), the Social Potency facet of the MPQ (SP), and the Social Confidence facet of the JPI (SC). If we regard the magnitude of associations between the two Social Boldness scales as being indicative of their convergent validity, then we have four different labels for the same construct within this single cluster. Next, consider the cluster of four nodes to the left of Figure 1, which represent the Anxiety (AX) facet from the NEO-PI-R, JPI, and HEXACO, and the Stress Reaction (SR) facet of the MPQ. The situation within this cluster is the same as that found for the sociability scales: The Anxiety scales have a mean correlation of .68, whereas the average correlation of the Stress Reaction scale with the three Anxiety scales is .72.

Finally, consider the cluster of nodes at the top of Figure 1. Two are the Order (OR) facets of the 6FPQ and NEO-PI-R, two are the Perfectionism (PF) facets of the 16PF and HEXACO, and two are the Organization (OG) facets of the JPI and HEXACO. Of note here is that, while most nodes are quite highly related – something we may expect as they can be argued to all cluster under some (perhaps higher-order) Conscientiousness factor – the two Perfectionism scales have a notably different pattern of associations with other related scales, despite sharing a facet label. As such, when selecting a Perfectionism scale from an extant inventory to study the convergent or discriminant validity of a new measure, our choice of comparison Perfectionism scale may have profound implications for whether we consider our new scale to be distinct or not.

A cursory glance at the rest of the network graph shown in Figure 1 highlights many other areas of local clustering not emphasized here. Thus, when researchers follow Campbell and Fiske’s guidelines and select measures that purportedly capture the same trait to ascertain discriminant validity, they might be in for a surprise: The pattern of convergent and discriminant correlations may not be as expected. Test constructors have to be careful when selecting convergent measures and ensure the highest possible conceptual and statistical overlap. Again, this judgment requires a clearly defined construct embedded in a clearly defined nomological net. Network visualizations of the nomological net of personality facets may greatly aid such decisions during scale development.

Possible Reasons for the Jingle-Jangle

The reasons Pace and Brannick provided for the low convergent validities were item context (e.g., general vs. work-specific context), breadth of the instrument, and test family. The latter refers to the differences between instruments from the NEO family and those from the BFI family (see also Miller, Gaughan, Maples, & Price, 2011). However, the first and the second reason bear further implications for assessment-oriented research. It is well documented that changing the context of an item, for example, by adding "in school," changes – and mostly improves – test-criterion correlations. Reasons for this might be found within the ideas of Brunswik's lens model (see also Miller et al., 2011). More important here, though, is the question of how this added piece of information might change the construct validity of the measurement tool used. Thus, we need empirical research to investigate these effects.

The second reason for the low convergent correlations was breadth of the measurement tool. It is no new insight that most traits can be described as being hierarchically organized: Below a rather abstract domain there are narrower facets. As before, there is evidence suggesting that such facets improve test-criterion correlations (Brunswik, 1955). However, for most traits there is no common agreement about the number and nature of such facets. Pace and Brannick stated:

Recognition of the facets measured by tests may lead toward understanding similarities and differences among personality tests, and perhaps the nature of any differential prediction by tests. (p. 675)

Issue 2: The Issue of Method Variance

The second issue we want to raise with regard to Campbell and Fiske is method variance. Campbell and Fiske (1959) wrote:

The interpretation of the validity diagonal in an absolute fashion requires the fortunate coincidence of both an independence of traits and an independence of methods, represented by zero values in the heterotrait-heteromethod triangles. ... In practice, perhaps all that can be hoped for is evidence for relative validity, that is, for common variance specific to a trait, above and beyond shared method variance. (p. 84)

This pessimistic conclusion can be mitigated today. There are different methodological approaches to modeling all kinds of method effects (e.g., Eid, Lischetzke, Nussbeck, & Trierweiler, 2003; Podsakoff, MacKenzie, Lee, & Podsakoff, 2003). Despite these new modeling techniques, Campbell and Fiske's remark should call our attention to the fact that we still do not know enough about the nature of method variance. Oftentimes method variance is perceived as variance due to the administration mode (e.g., paper-pencil). However, method variance could also stem from social desirability (Ziegler & Bühner, 2009), response sets or styles (Wetzel, Carstensen, & Böhnke, 2013), or acquiescence (Rammstedt & Kemper, 2011), to name just three examples. All of these terms are well known. However, with the possible exception of social desirability (Paulhus, 2002; Ziegler, MacCann, & Roberts, 2011), elaborated theories of such method-variance-producing phenomena are scarce. Thus, researchers with an interest in psychological assessment should strengthen their efforts to shed light on method-variance-producing phenomena like social desirability, response sets and styles, or acquiescence.
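As one hedged illustration of such modeling, the following lavaan sketch specifies a trait-method model in the spirit of the CT-C(M-1) approach of Eid et al. (2003). The indicator names and the data set mtmm_data are hypothetical, and the model is a simplified sketch rather than a full implementation.

```r
# Sketch: separating trait and method variance in lavaan, in the spirit
# of CT-C(M-1) models (Eid et al., 2003). Indicator names (t1_sr, ...)
# and the data set 'mtmm_data' are hypothetical; observation (ob) serves
# as the reference method, so only self-report (sr) gets a method factor.
library(lavaan)

model <- '
  # trait factors, each measured by self-report (sr) and observation (ob)
  T1 =~ t1_sr + t1_ob
  T2 =~ t2_sr + t2_ob
  T3 =~ t3_sr + t3_ob

  # method factor for the non-reference method (self-report)
  M_sr =~ t1_sr + t2_sr + t3_sr

  # method factor is orthogonal to the trait factors
  M_sr ~~ 0*T1 + 0*T2 + 0*T3
'
fit <- sem(model, data = mtmm_data)
summary(fit, standardized = TRUE)
```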

Conclusion

In this Editorial we wanted to raise awareness of some problems we see in convergent and discriminant validation efforts. Summarizing the thoughts outlined above, let us stress three aspects that papers reporting validation studies should follow: (1) The trait to be measured should be clearly defined and embedded within a nomological network. (2) Besides convergent validity, discriminant validity is important in order to gain a more complete picture of the validity of an instrument. To this end, different traits have to be assessed with different methods, and the nomological network should guide the selection of the discriminant trait(s). (3) Effects of method variance should be modeled.

Moreover, this Editorial also suggests the need for more research on method-variance-producing phenomena (e.g., acquiescence, response sets and styles, and social desirability), on effects of item context (e.g., items specifically phrased for a school or work context), and on the facet structure underlying and defining broad domains.

We want to end this Editorial with a quote from Campbell and Fiske (1959), which in our opinion is as true today as it was in 1959:

The test constructor is asked to generate from his literary conception or private construct not one operational embodiment, but two or more, each as different in research vehicle as possible. Furthermore, he is asked to make explicit the distinction between his new variable and other variables, distinctions which are almost certainly implied in his literary definition. In his very first validational efforts, before he ever rushes into print, he is asked to apply the several methods and several traits jointly. His literary definition, his conception, is now best represented in what his independent measures of the trait hold distinctively in common. (p. 101)

References

  • Alonso-Arbiol, I., & van de Vijver, F. J. R. (2010). A historical analysis of the European Journal of Psychological Assessment. European Journal of Psychological Assessment, 26, 238–247. doi:10.1027/1015-5759/a000032

  • Brunswik, E. (1955). Representative design and probabilistic theory in a functional psychology. Psychological Review, 62, 193–217.

  • Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.

  • Eid, M., Lischetzke, T., Nussbeck, F. W., & Trierweiler, L. I. (2003). Separating trait effects from trait-specific method effects in multitrait-multimethod models: A multiple-indicator CT-C(M-1) model. Psychological Methods, 8, 38–60. doi:10.1037/1082-989X.8.1.38

  • Epskamp, S., Cramer, A. O. J., Waldorp, L. J., Schmittmann, V. D., & Borsboom, D. (2012). qgraph: Network visualizations of relationships in psychometric data. Journal of Statistical Software, 48, 1–18.

  • Goldberg, L. R. (2005). The Eugene-Springfield community sample: Information available from the research participants (Technical Report Vol. 45, No. 1). Eugene, OR: Oregon Research Institute.

  • Miller, J. D., Gaughan, E. T., Maples, J., & Price, J. (2011). A comparison of agreeableness scores from the Big Five Inventory and the NEO PI-R: Consequences for the study of narcissism and psychopathy. Assessment, 18, 335–339. doi:10.1177/1073191111411671

  • Pace, V. L., & Brannick, M. T. (2010). How similar are personality scales of the "same" construct? A meta-analytic investigation. Personality and Individual Differences, 49, 669–676. doi:10.1016/j.paid.2010.06.014

  • Paulhus, D. L. (2002). Socially desirable responding: The evolution of a construct. In H. I. Braun, D. N. Jackson, & D. E. Wiley (Eds.), The role of constructs in psychological and educational measurement (pp. 49–69). Mahwah, NJ: Erlbaum.

  • Podsakoff, P. M., MacKenzie, S. B., Lee, J. Y., & Podsakoff, N. P. (2003). Common method biases in behavioral research: A critical review of the literature and recommended remedies. Journal of Applied Psychology, 88, 879–903.

  • Rammstedt, B., & Kemper, C. J. (2011). Measurement equivalence of the Big Five: Shedding further light on potential causes of the educational bias. Journal of Research in Personality, 45, 121–125.

  • Schweizer, K. (2011). On the changing role of Cronbach's α in the evaluation of the quality of a measure. European Journal of Psychological Assessment, 27, 143–144. doi:10.1027/1015-5759/a000069

  • Schweizer, K. (2012). On issues of validity and especially on the misery of convergent validity. European Journal of Psychological Assessment, 28, 249–254. doi:10.1027/1015-5759/a000156

  • Wetzel, E., Carstensen, C. H., & Böhnke, J. R. (2013). Consistency of extreme response style and nonextreme response style across traits. Journal of Research in Personality, 47, 178–189. doi:10.1016/j.jrp.2012.10.010

  • Ziegler, M., & Bühner, M. (2009). Modeling socially desirable responding and its effects. Educational and Psychological Measurement, 69, 548–565.

  • Ziegler, M., MacCann, C., & Roberts, R. D. (Eds.). (2011). New perspectives on faking in personality assessments. New York, NY: Oxford University Press.

Matthias Ziegler, Institut für Psychologie, Humboldt University Berlin, Rudower Chaussee 18, 12489 Berlin, Germany, +49 30 2093-9447, +49 30 2093-9361
Tom Booth, Centre for Cognitive Ageing and Cognitive Epidemiology, Department of Psychology, The University of Edinburgh, Edinburgh EH8 9AD, United Kingdom, +44 131 650-8405
Doreen Bensch, Institut für Psychologie, Humboldt University Berlin, Rudower Chaussee 18, 12489 Berlin, Germany, +49 30 2093-9447, +49 30 2093-9361