Background
Validity testing theory and methodology
Validity testing theory
Validity testing methodology
The Health Literacy Questionnaire (HLQ)
An example of an interpretive argument for a translated PROM
The interpretive argument (interpretation and use of scores)
Assumptions underlying the interpretive argument
Constructing a validity argument for a translated PROM
- Supports sound initial HLQ construction and validity testing.
- The HLQ items and response options are appropriate for and understood as intended in the target culture.
- There is replication of the factor structure and measurement equivalence across sociodemographic groups and agencies in the target culture.
- The HLQ scales relate to external variables in anticipated ways in the target culture, both to known predictor groups (e.g. age, gender, number of comorbidities) and to anticipated outcomes (e.g. change after effective interventions).
- There is conceptual and measurement equivalence of the translated HLQ with the English HLQ, which is necessary to transfer the meaning of the constructs for interpretation in the target language [76] such that intended benefits of testing are more likely attained.
Components of the interpretive argument and assumptions | Evidence required for a validity argument | Examples of methods to obtain validity data, including relevant studies on the Health Literacy Questionnaire (HLQ) |
---|---|---|
1. Test content [11]—analysis of test development (relationship of item themes, wording and format with the intended construct) and administration (including scoring) | ||
1.1 The content of the source language constructs and items is appropriate for the target culture | Evidence that the PROM constructs are appropriate and relevant for members of the target culture and that item content will be understood as intended by members of the target culture | Pre-translation qualitative evaluation by an in-context PROM user of the cultural appropriateness of the items and constructs for the target language and culture. For example, an ethnographer who lived with Roma populations assessed each of the HLQ scales according to how Roma people might understand them [79]. The outcome was a decision that the HLQ scales, specifically as used in needs assessment for the Ophelia process [73], resonate with elements of successful grassroots health mediation programs in Roma communities |
1.2 Application of a systematic translation protocol that provides evidence that linguistic equivalence and cultural appropriateness are highly likely to be achieved, thus supporting the argument for the successful transfer of the intended meaning of the source language constructs while maximising understanding of items, response options and administration methods in the target language and culture. Construct irrelevance and construct underrepresentation in the target culture are considered | Evidence that a structured translation method, with detailed descriptions of the item intents, was appropriately implemented. Evidence for the effective engagement of people from the target language/culture in the translation process. Evidence of the confirmation of congruence of items and constructs between the source language and culture and the target language and culture to ensure construct relevance and avoid possible construct underrepresentation | Formal and documented translation method and process to manage translations to different languages, including documentation of participants in translation consensus meetings. A developer or other person deeply familiar with the PROM’s content and purpose oversees the concordance between the intents of original items and target language items. For example, two papers that describe the translation of the HLQ to other languages (Slovak [62] and Danish [60]) discuss the method of translation including the use of HLQ item intents, and they list the participants who took part in the detailed analysis of the translated items to collectively decide on the final items. Systematic recording of steps in the translation and in the analysis of the recorded data, including changes made through the process. In-depth cognitive interviews with target audience in the target culture about the items on the translated PROM, and content analysis to compare narrative data from interviews with the source language PROM item intents and construct definitions |
2. Response processes [11]—cognitive processes, and interpretation of test items by respondents and users, as measured against the intended interpretation or construct definition | ||
2.1 Respondents to both the translated PROM and source language PROM engage in the same or similar cognitive (response) processes when responding to items, and these processes align with the source language construct criteria, thus indicating that similar respondents across cultures are formulating responses to the same items in the same way | Evidence that linguistic equivalence and cultural appropriateness have been achieved in the PROM translation. Evidence that the cognitive processes of respondents match construct criteria (i.e. respondents to the translated PROM are engaging with the criteria of the source language construct when answering items) and that the intended understanding of items has been achieved | Analysis of the documented translation process to determine how difficulties in translation were resolved such that each translated item retained the intent of its corresponding source language item, while accommodating linguistic and cultural nuances. For example, the German HLQ publication describes the translation method, the linguistic and cultural adaptation difficulties encountered, and how these were resolved [61]. Cognitive interviews with target audience in the target culture to determine how respondents formulate their answers to items (i.e. assessment of the alignment of cognitive processes with source construct criteria). For example, Maindal et al. [60] conducted interviews to test the cognitive processes of target respondents when answering the Danish HLQ items, which was a way to test if the translated items were understood as intended. Another example of cognitive testing, although not a study of translated HLQ items, is that done by Hawkins et al. [56] to determine concordances/discordances between patient and clinician HLQ responses about the patient, and to match these with the HLQ item intents. This could be a useful method to determine if the intent of translated items and scales is understood in the same way by both patients and clinicians in a new cultural setting. This is important because clinician understanding of differences in perspectives about the construct being measured (in this case health literacy) may facilitate discussions that help clinicians better understand a patient’s perspective about their health, and therefore make better decisions about the patient’s care. These qualitative interview data can then be integrated with statistical analysis of the measurement equivalence (or Differential Item Functioning (DIF) analysis) of data between target and source language PROMs (Chap. 11, pp. 193–209) [5] |
2.2 PROM users (e.g. health professionals or researchers who administer the PROM and interpret the PROM scores) of both the translated and source language PROMs engage in the same or similar cognitive processes (i.e. apply source language construct-relevant criteria) when interpreting scores, and these processes match the intended interpretation of scores | Evidence that the cognitive processes of PROM users when evaluating respondents’ scores from a translated PROM are consistent with the source language construct criteria and with the interpretation of scores as intended by developers of the source language PROM | In-depth cognitive interviews with target PROM users, and content analysis to compare narrative data from interviews with source language PROM item intents and construct definitions, and to compare narrative data with the data from cognitive interviews with source language PROM users. For example, the Hawkins et al. study [56] shows how important it is that clinicians apply the same/similar construct criteria when interpreting HLQ scores as patients have applied when answering the items. Cognitive interviews to test response processes of the target audience of the translated PROM should also be conducted with the users of the PROM data who will interpret and use the scores to make decisions about those target audience members |
3. Internal structure [11]—the extent to which item interrelationships conform to the constructs on which claims about the score interpretations are based | ||
3.1 Item interrelationships and measurement structure of scales of the translated PROM conform to the constructs of the source language PROM. The constructs of translated PROMs in diverse cultures are thus conceptually comparable, and interpretations based on statistical comparisons of scale scores are unbiased | Evidence that the translated PROM scales are homogeneous and distinct and thus items are uniquely related to the hypothesised target constructs. Evidence to confirm that the measurement structure of the source language PROM has been maintained through the translation process. Evidence of measurement equivalence between the constructs of translated and source language PROMs | Confirmatory factor analysis (CFA) of data in the target language culture and comparisons with CFAs from data in the source language culture. Reliability of the hypothesised PROM scales in the target culture. For example, studies of the HLQ translated into Danish [60], German [61] and Slovak [62] present and discuss psychometric results that confirm the nine-factor structure of the HLQs and present fit statistics, factor loading patterns, inter-factor correlations and reliability estimates of the individual scales that are comparable to those found in the original development and replication studies [55, 58]. DIF or, equivalently, multi-group factor analysis studies (MGFA) to establish configural, metric and scalar measurement equivalence of source and translated scales |
4. Relations to other variables [11]—the pattern of relationships of test scores to external variables as predicted by the construct operationalised for a specific context and proposed use | ||
4.1 Convergent-discriminant validity is established for the translated PROM | Evidence that the relationships between the translated PROM and similar constructs in other tools are substantial and congruent with patterns observed in the source PROM (i.e. convergent evidence) such that score interpretation of the translated PROM is consistent with the score interpretation of the source language PROM and other PROMs measuring similar constructs. Evidence that relationships between items and scales in the translated PROM and items and scales measuring unrelated constructs of tools in the broader domain of interest (e.g. health-related beliefs and attitudes more generally) are low (i.e. discriminant evidence) | Use of CFA to examine Fornell and Larcker’s [80, 81] criteria for convergent-discriminant validity to compare translated PROM scales with measures of similar and contrasting constructs from other tools within the domain of interest; also comparison of these scale relationships across source and target cultures. Elsworth et al. used these criteria to assess convergent/discriminant validity of the HLQ within this multi-scale PROM [58]. This method could similarly be applied to a study of the relationships of a single or multi-scale health literacy PROM across similar health literacy assessments and PROMs assessing divergent health-related constructs. Possible multitrait-multimethod (MTMM) studies of the translated PROM scales and other measures in the relevant domain [82] |
4.2 Test–criterion relationships are robust for translated PROMs | Evidence that test–criterion relationships are concordant with expectations from theory to provide general support for construct meaning and information to support decisions about score interpretation and use for specific population groups and purposes. Evidence that supports theoretically indicated equalities and differences in the distribution of scale scores across cultures to further support scale interpretation and use in target cultures | Correlation and group differences, e.g. analysis of variance of translated PROM summed scores by sub-groups (e.g. gender, age, education). Multi-group CFA (MGCFA) by sub-groups. For example, Maindal et al. [60] investigated relationships between the scales of the newly translated Danish HLQ and a range of sociodemographic variables (e.g. gender, age, education) and compared the results with those of an Australian study that used the source HLQ. Similarities and differences in the observed relationships were discussed |
4.3 Validity generalisation is established for a PROM that is translated across two or more cultures | Evidence of validity generalisation information to support valid score interpretation and use of translated PROMs in other cultures similar to those already studied. Validity generalisation relates the PROM constructs within a nomological net [18]. To the extent that the net is well established, coherent and in accord with theory, the PROM can be used cautiously in new contexts (settings, cultures) without full validation for each proposed interpretation and/or use in that new context | Systematic review of results of validity studies of translated PROM scales across the five categories of validity evidence in the Standards against argued criteria in related and unrelated settings and cultures. Specific targeted studies to investigate/establish validity generalisability. This has not yet been systematically conducted for the HLQ. As a start to this sort of research, the English HLQ [55, 73], as well as the Danish [60] and German [61] studies, could be considered mainstream populations, whereas the Slovak [62, 77, 79] translation is an example of a targeted study of validity generalisation in a specific population |
5. Validity and consequences of testing [11]—the intended and unintended consequences of test use, and as traced to a source of invalidity such as construct underrepresentation or construct-irrelevant components | ||
5.1 PROM users (e.g. health professionals, researchers) interpret and use respondents’ scores from a translated PROM as intended by the developers of the source language PROM and for the intended benefit | Evidence that the intended benefit from testing with the translated PROM has been realised | In-depth interviews with users of a translated PROM to assess the outcomes that arose from testing with the translated PROM (i.e. predicted or actual actions taken from score interpretation and use) and if these align with the intended benefits, as stipulated by the developers of the source language PROM. For example, the OPtimising HEalth LIteracy and Access (Ophelia) process [73] supports healthcare services to interpret and apply data from the nine HLQ scales (supported by co-design research practices) to design health literacy interventions. Evaluation cycles integrated into the Ophelia process aim to keep the interventions directed towards client, health professional, service and community health literacy. A follow-up evaluation (yet to be conducted) could determine the degree to which the users’ HLQ score interpretations generated interventions that were implemented and that delivered the intended health literacy benefits |
5.2 Claims for benefits of testing that are not based directly on the developers’ intended score interpretations and uses | Evidence to determine if there are potential testing benefits that go beyond the intended interpretation and use of the translated PROM scores. For example, the HLQ was not designed to measure the broad concept of patient experience. However, data from the HLQ could be used for this purpose because the constructs and items include information about this concept. Consequently, a hospital that sought to measure patients’ health literacy might also make claims about patients’ hospital experiences | A companion or follow-up study or a critical review by an external evaluator could identify and evaluate benefits that are directly based on intended score interpretation and use (as based on the source language PROM) and benefits that are based on grounds other than intended score interpretation and use. For example, a companion study could consist of co-administering the HLQ with specific patient experience questionnaires, auditing patient complaints records, and undertaking in-depth interviews with patients and hospital staff to determine the validity of HLQ score interpretation for measuring patient experiences |
5.3 Awareness of and mitigation of unintended consequences of testing due to construct underrepresentation and/or construct irrelevance to prevent inappropriate decisions or claims about an individual or group, i.e. to take action/intervention when not warranted, or to take no action/intervention when an action is warranted, or to falsely claim an action/intervention is a success or a failure | Evidence of sound translation method to help minimise unintended consequences related to errors in score interpretation for a given use that are due to poor equivalence between the source language and translated PROM constructs. Evidence of appropriate methods to calculate and interpret scale scores, including judgements about what are large, small or null differences between populations | Collection and analysis of translation process data that verify that a structured translation method is implemented such that congruence between source language and translated PROM constructs, and potential construct underrepresentation and/or construct irrelevance, is continually addressed. For example, an as yet unpublished study has been conducted to analyse field notes from translations of the HLQ into nine languages to determine aspects of the translation method that improve congruence of item intent between the source and translated items and thus constructs. Cognitive interview testing of the translated items verifies the appropriateness of the translation for target respondents [60]. Analysis of process data to verify that a PROM output interpretation guide, which prescribes data analysis methods to mitigate inappropriate data claims, has been effectively used |
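
To make the Fornell and Larcker criteria named in row 4.1 more concrete, the following minimal sketch shows one common way the check is computed once a CFA has been fitted. It assumes that standardised factor loadings and inter-factor correlations are already available from the CFA output. The scale names, loadings and correlation are hypothetical values for illustration only and are not HLQ results. The 0.5 threshold for average variance extracted (AVE) is a widely used convention and is also an assumption here, not a value stated in this table.

```python
def average_variance_extracted(loadings):
    """AVE for one scale: mean of the squared standardised loadings."""
    return sum(l * l for l in loadings) / len(loadings)


def fornell_larcker(loadings_by_scale, factor_correlations, ave_threshold=0.5):
    """Apply the Fornell-Larcker checks to CFA output.

    loadings_by_scale: dict mapping scale name -> list of standardised loadings
    factor_correlations: dict mapping (scale_a, scale_b) -> inter-factor correlation
    Returns (ave, convergent, discriminant).
    """
    ave = {name: average_variance_extracted(values)
           for name, values in loadings_by_scale.items()}

    # Convergent evidence: a scale explains at least half of its items' variance.
    convergent = {name: value >= ave_threshold for name, value in ave.items()}

    # Discriminant evidence: each scale's AVE exceeds the variance it shares
    # with the other scale (the squared inter-factor correlation).
    discriminant = {}
    for (a, b), r in factor_correlations.items():
        shared_variance = r * r
        discriminant[(a, b)] = ave[a] > shared_variance and ave[b] > shared_variance

    return ave, convergent, discriminant


# Hypothetical inputs, for illustration only (not taken from any HLQ study).
loadings = {
    "scale_A": [0.80, 0.75, 0.70, 0.85],
    "scale_B": [0.65, 0.70, 0.60, 0.72],
}
correlations = {("scale_A", "scale_B"): 0.55}

ave, convergent, discriminant = fornell_larcker(loadings, correlations)
for name in ave:
    print(f"{name}: AVE = {ave[name]:.3f}, convergent = {convergent[name]}")
for pair, ok in discriminant.items():
    print(f"{pair}: discriminant = {ok}")
```

In this made-up case the two scales are distinct from each other, but the second scale falls below the conventional AVE threshold. In a translation study, a result like this would prompt a closer look at the wording of that scale's items in the target language and a comparison with the same statistic in the source language data.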