Part B. Evaluate the measurement properties
Part B consists of steps 5–7 and concerns the evaluation of the measurement properties of the included PROMs. The evaluation of each measurement property consists of three sub-steps (Fig. 1). First, the methodological quality of each single study on a measurement property is assessed using the COSMIN Risk of Bias checklist [28]. Each study is rated as of very good, adequate, doubtful, or inadequate quality. Second, the result of each single study on a measurement property is rated against the updated criteria for good measurement properties [29], on which consensus was achieved [19] and which were slightly modified based on recent new insights (Table 1). Each result is rated as either sufficient (+), insufficient (−), or indeterminate (?). Third, the evidence is summarized and the quality of the evidence is graded using the GRADE approach. The results of all available studies on a measurement property are quantitatively pooled or qualitatively summarized and compared against the criteria for good measurement properties to determine whether, overall, the measurement property of the PROM is sufficient (+), insufficient (−), inconsistent (±), or indeterminate (?). Here the focus is on the PROM, whereas in the previous sub-steps the focus was on the single studies. If the ratings per study are all sufficient (or all insufficient), the results can be statistically pooled and the overall rating will be sufficient (+) (or insufficient (−)), based on the criteria for good measurement properties. If the results are inconsistent, explanations for the inconsistency (e.g., different study populations or methods) should be explored. If an explanation is found, overall ratings should be provided for relevant subgroups with consistent results (e.g., adults versus children, patients with acute versus chronic disease, different (language) versions of a PROM, etc.). If no explanation is found, the overall rating will be inconsistent (±). If not enough information is available, the overall rating will be indeterminate (?). Detailed information on how the pooled or summarized results on a measurement property can be rated against the criteria for good measurement properties is given in the COSMIN user manual [9].
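To make this summary logic concrete, the following is a minimal sketch (in Python; the function name and the list representation of the per-study ratings are our own assumptions, not COSMIN notation) of how the ratings of the single studies on one measurement property might be combined into an overall rating:

```python
# Sketch of the COSMIN summary logic described above: per-study ratings on
# one measurement property are combined into an overall rating of
# sufficient (+), insufficient (-), inconsistent (+/-), or indeterminate (?).

def overall_rating(study_ratings):
    """study_ratings: list of '+', '-', or '?' (one per included study)."""
    informative = [r for r in study_ratings if r in ("+", "-")]
    if not informative:                      # not enough information available
        return "?"
    if all(r == "+" for r in informative):   # consistent results can be pooled
        return "+"
    if all(r == "-" for r in informative):
        return "-"
    # Inconsistent results: explanations (e.g., subgroups, language versions)
    # should be explored first; only if none is found is the rating '+/-'.
    return "+/-"

print(overall_rating(["+", "+", "?"]))  # '+'
print(overall_rating(["+", "-"]))       # '+/-'
```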
Table 1
Updated criteria for good measurement properties
Measurement property | Rating | Criteria
Structural validity | + | CTT: CFA: CFI or TLI or comparable measure > 0.95, OR RMSEA < 0.06, OR SRMR < 0.08ᵃ
 | | IRT/Rasch: no violation of unidimensionalityᵇ (CFI or TLI or comparable measure > 0.95, OR RMSEA < 0.06, OR SRMR < 0.08); AND no violation of local independence (residual correlations among the items after controlling for the dominant factor < 0.20, OR Q3's < 0.37); AND no violation of monotonicity (adequate looking graphs, OR item scalability > 0.30); AND adequate model fit (IRT: χ² > 0.001; Rasch: infit and outfit mean squares ≥ 0.5 and ≤ 1.5, OR Z-standardized values > −2 and < 2)
 | ? | CTT: not all information for '+' reported; IRT/Rasch: model fit not reported
 | − | Criteria for '+' not met
Internal consistency | + | At least low evidenceᶜ for sufficient structural validityᵈ AND Cronbach's alpha(s) ≥ 0.70 for each unidimensional scale or subscaleᵉ
 | ? | Criteria for "at least low evidenceᶜ for sufficient structural validityᵈ" not met
 | − | At least low evidenceᶜ for sufficient structural validityᵈ AND Cronbach's alpha(s) < 0.70 for each unidimensional scale or subscaleᵉ
Reliability | + | ICC or weighted Kappa ≥ 0.70
 | ? | ICC or weighted Kappa not reported
 | − | ICC or weighted Kappa < 0.70
Measurement error | + | SDC or LoA < MICᵈ
 | ? | MIC not defined
 | − | SDC or LoA > MICᵈ
Hypotheses testing for construct validity | + | The result is in accordance with the hypothesisᶠ
 | ? | No hypothesis defined (by the review team)
 | − | The result is not in accordance with the hypothesisᶠ
Cross-cultural validity\measurement invariance | + | No important differences found between group factors (such as age, gender, language) in multiple group factor analysis, OR no important DIF for group factors (McFadden's R² < 0.02)
 | ? | No multiple group factor analysis OR DIF analysis performed
 | − | Important differences between group factors, OR DIF was found
Criterion validity | + | Correlation with gold standard ≥ 0.70, OR AUC ≥ 0.70
 | ? | Not all information for '+' reported
 | − | Correlation with gold standard < 0.70, OR AUC < 0.70
Responsiveness | + | The result is in accordance with the hypothesisᶠ, OR AUC ≥ 0.70
 | ? | No hypothesis defined (by the review team)
 | − | The result is not in accordance with the hypothesisᶠ, OR AUC < 0.70
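As an illustration of how the Table 1 criteria are applied to a single study result, the following sketch covers two of the properties (reliability and measurement error). The thresholds come from Table 1; the function names and the use of None for unreported statistics are assumptions made for this example:

```python
# Rating a single study result against the Table 1 criteria.

def rate_reliability(icc_or_kappa):
    if icc_or_kappa is None:          # ICC or weighted kappa not reported
        return "?"
    return "+" if icc_or_kappa >= 0.70 else "-"

def rate_measurement_error(sdc_or_loa, mic):
    if mic is None:                   # MIC not defined
        return "?"
    return "+" if sdc_or_loa < mic else "-"

print(rate_reliability(0.82))              # '+'
print(rate_measurement_error(4.1, 3.5))    # '-' (SDC exceeds the MIC)
```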
The overall ratings for each measurement property [i.e., sufficient (+), insufficient (−), inconsistent (±)] are accompanied by a grading of the quality of the evidence, which indicates how confident we are that the pooled results or overall ratings are trustworthy. Note that if the overall rating for a specific measurement property is indeterminate (?), the quality of the PROM cannot be judged and the quality of the evidence is therefore not graded. The GRADE approach for systematic reviews of intervention studies specifies four levels of evidence quality (i.e., high, moderate, low, or very low quality evidence), depending on the presence of five factors: risk of bias, indirectness, inconsistency, imprecision, and publication bias [24]. Here, we introduce a modified GRADE approach for grading the quality of the evidence in systematic reviews of PROMs. The GRADE approach is used to downgrade the quality of evidence when there are concerns about the trustworthiness of the results. Four of the five GRADE factors have been adopted in the COSMIN methodology: risk of bias (i.e., the methodological quality of the studies), inconsistency (i.e., unexplained inconsistency of results across studies), imprecision (i.e., total sample size of the available studies), and indirectness (i.e., evidence from populations other than the population of interest in the review) (Table 2). The quality of the evidence is graded for each measurement property and for each PROM separately. The starting point is always the assumption that the pooled or overall result is of high quality; the quality of the evidence is subsequently downgraded by one or more levels per factor to moderate, low, or very low (for definitions, see Table 3) when there is risk of bias, (unexplained) inconsistency, imprecision, or indirectness of results. Specific details on how to downgrade are explained in the COSMIN user manual [9]. We recommend that the quality assessment is done by two reviewers independently and that consensus among the reviewers is reached, if necessary with the help of a third reviewer.
Table 2
Modified GRADE approach for grading the quality of evidence
Quality of evidence: High, Moderate, Low, or Very low
Lower if:
- Risk of bias: −1 serious; −2 very serious; −3 extremely serious
- Inconsistency: −1 serious; −2 very serious
- Imprecision: −1 total n = 50–100; −2 total n < 50
- Indirectness: −1 serious; −2 very serious
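The downgrading logic of Table 2 can be sketched as follows: the starting point is high quality, levels are subtracted per factor, and very low is the floor. The dictionary representation of the downgrading decisions is our own assumption for this illustration:

```python
# Sketch of the modified GRADE downgrading of Table 2.

LEVELS = ["very low", "low", "moderate", "high"]

def grade_quality(downgrades):
    """downgrades: dict mapping factor name to levels subtracted, e.g.
    {'risk of bias': 1, 'imprecision': 2} for serious risk of bias plus
    a total sample size below 50."""
    score = len(LEVELS) - 1 - sum(downgrades.values())
    return LEVELS[max(score, 0)]          # never below 'very low'

print(grade_quality({"risk of bias": 1}))                     # 'moderate'
print(grade_quality({"risk of bias": 2, "imprecision": 1}))   # 'very low'
```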
Table 3
Definitions of quality levels
Quality of evidence | Definition
High | We are very confident that the true measurement property lies close to that of the estimate of the measurement property |
Moderate | We are moderately confident in the measurement property estimate: the true measurement property is likely to be close to the estimate of the measurement property, but there is a possibility that it is substantially different |
Low | Our confidence in the measurement property estimate is limited: the true measurement property may be substantially different from the estimate of the measurement property |
Very low | We have very little confidence in the measurement property estimate: the true measurement property is likely to be substantially different from the estimate of the measurement property |
Note that each version of the PROM should be considered separately in the review (i.e., different versions for subgroups of patients, different language versions, etc.).
Step 5. Evaluate content validity
Content validity refers to the degree to which the content of a PROM is an adequate reflection of the construct to be measured [30]. Content validity is considered the most important measurement property, because it should be clear that the items of the PROM are relevant, comprehensive, and comprehensible with respect to the construct of interest and the study population. The evaluation of content validity requires a subjective judgment by the reviewers. In this judgment, the PROM development study, the quality and results of additional content validity studies on the PROMs (if available), and a subjective rating of the content of the PROMs by the reviewers are taken into account. Guidance on how to evaluate the content validity of PROMs can be found elsewhere [10].
If there is high quality evidence that the content validity of a PROM is insufficient, the PROM is not considered further in steps 6–8 of the systematic review, and a recommendation for this PROM can be drawn directly in step 9.
Step 6. Evaluate internal structure
The internal structure refers to how the different items in a PROM are related, which is important to know when deciding how items might be combined into a scale or subscale. This step concerns an evaluation of structural validity (including unidimensionality), internal consistency, and cross-cultural validity and other forms of measurement invariance. Here we refer to the testing of existing PROMs, not to further refinement or the development of new PROMs. These three measurement properties focus on the quality of the individual items and the relationships between the items, in contrast to the remaining measurement properties evaluated in step 7. We recommend evaluating these measurement properties directly after evaluating the content validity of a PROM. As evidence for the structural validity (or unidimensionality) of a scale or subscale is a prerequisite for the interpretation of internal consistency analyses (i.e., Cronbach's alphas), we recommend first evaluating structural validity (step 6.1), followed by internal consistency (step 6.2) and cross-cultural validity\measurement invariance (step 6.3).
Step 6 is only relevant for PROMs that are based on a reflective model, which assumes that all items in a scale or subscale are manifestations of one underlying construct and are therefore expected to be correlated. An example of a reflective model is the measurement of anxiety; anxiety manifests itself in specific characteristics, such as worrying thoughts, panic, and restlessness. By asking patients about these characteristics, we can assess the degree of anxiety (i.e., the items are a reflection of the construct) [31]. If the items in a scale or subscale are not supposed to be correlated (i.e., a formative model), these analyses are not relevant and step 6 can be omitted. If it is not reported whether a PROM is based on a reflective or a formative model, the reviewers need to decide, based on the content of the PROM, whether it is likely a reflective or a formative model [32].
Step 6.1. Evaluate structural validity
Structural validity refers to the degree to which the scores of a PROM are an adequate reflection of the dimensionality of the construct to be measured [30] and is usually assessed by factor analysis or IRT/Rasch analysis. In a systematic review, it is helpful to distinguish between studies in which factor analysis is performed to assess structural validity and studies in which it is performed to assess the unidimensionality of each subscale separately. To assess structural validity, factor analysis is performed on all items of a PROM to evaluate the (hypothesized) number of subscales and the clustering of items within subscales (i.e., structural validity studies). To assess unidimensionality per subscale, separate factor analyses are performed on the items of each subscale to assess whether each subscale on its own measures a single construct (i.e., unidimensionality studies). The latter analyses are sufficient for the interpretation of internal consistency analyses (step 6.2) and for IRT/Rasch analysis, but they do not provide evidence for structural validity as part of construct validity.
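As an illustration, the following sketch checks reported CFA fit indices against the Table 1 criteria for structural validity under CTT (CFI or TLI > 0.95, OR RMSEA < 0.06, OR SRMR < 0.08); the dict input format is our own assumption, and the handling of partially reported indices is simplified relative to the COSMIN user manual:

```python
# Rating CFA fit indices from a structural validity study against Table 1.

def rate_structural_validity(fit):
    """fit: dict with any of 'CFI', 'TLI', 'RMSEA', 'SRMR'; absent = not reported."""
    cfi, tli = fit.get("CFI"), fit.get("TLI")
    rmsea, srmr = fit.get("RMSEA"), fit.get("SRMR")
    checks = [
        (cfi is not None and cfi > 0.95) or (tli is not None and tli > 0.95),
        rmsea is not None and rmsea < 0.06,
        srmr is not None and srmr < 0.08,
    ]
    if any(checks):                       # OR criterion of Table 1
        return "+"
    if all(v is None for v in (cfi, tli, rmsea, srmr)):
        return "?"                        # not all information reported
    return "-"

print(rate_structural_validity({"CFI": 0.97, "RMSEA": 0.05}))  # '+'
```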
The evaluation of structural validity consists of the three sub-steps described under Part B: (1) evaluating the methodological quality of the included studies; (2) applying the criteria for good measurement properties; and (3) summarizing the evidence and grading its quality.
If there is high quality evidence that the structural validity of a PROM is insufficient, one should reconsider further evaluation of this PROM in the subsequent steps.
Step 6.2. Evaluate internal consistency
Internal consistency refers to the degree of interrelatedness among the items and is often assessed by Cronbach's alpha [30, 33]. Similar to the evaluation of structural validity, the evaluation of internal consistency consists of the three sub-steps described above.
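For illustration, Cronbach's alpha for one unidimensional (sub)scale can be computed from an item-score matrix as follows (the toy data are invented; note that, per Table 1, a sufficient rating also presupposes at least low evidence for sufficient structural validity):

```python
import numpy as np

def cronbach_alpha(scores):
    """scores: 2D array, rows = respondents, columns = items of one (sub)scale."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                               # number of items
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of scale score
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Toy data: 5 respondents answering a 3-item subscale.
items = [[3, 4, 3], [2, 2, 3], [5, 4, 4], [1, 2, 1], [4, 5, 4]]
alpha = cronbach_alpha(items)
print(f"alpha = {alpha:.2f}, rating:", "+" if alpha >= 0.70 else "-")
```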
Step 6.3. Evaluate cross-cultural validity\measurement invariance
Cross-cultural validity\measurement invariance refers to the degree to which the performance of the items on a translated or culturally adapted PROM is an adequate reflection of the performance of the items of the original version of the PROM [30]. It should be evaluated when a PROM is or will be used in different 'cultural' populations, i.e., populations that differ in ethnicity, language, gender, or age group; different patient populations are also considered here [9]. Cross-cultural validity\measurement invariance is evaluated by assessing whether differential item functioning (DIF) occurs, using, e.g., logistic regression analyses, or whether the factor structure and factor loadings are equivalent across groups, using multigroup confirmatory factor analysis (MGCFA). Measurement invariance and the absence of DIF refer to whether respondents from different groups with the same latent trait level (allowing for group differences) respond similarly to a particular item [34]. The evaluation of cross-cultural validity\measurement invariance also consists of the three sub-steps described above.
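As one possible illustration of the logistic regression approach to DIF mentioned above, the following sketch compares a model predicting a dichotomous item from the total (trait) score with a model that adds group membership, and inspects the change in McFadden's pseudo-R² against the 0.02 threshold of Table 1. The simulated data and variable names are our own, for illustration only:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 400
group = rng.integers(0, 2, n)                   # e.g., two language versions
trait = rng.normal(size=n)
# The item response depends only on the trait here, so no DIF is simulated.
item = (rng.random(n) < 1 / (1 + np.exp(-trait))).astype(int)

base = sm.Logit(item, sm.add_constant(trait)).fit(disp=0)
full = sm.Logit(item, sm.add_constant(np.column_stack([trait, group]))).fit(disp=0)

r2_base = 1 - base.llf / base.llnull            # McFadden's pseudo-R2
r2_full = 1 - full.llf / full.llnull
delta = r2_full - r2_base
print(f"delta R2 = {delta:.4f}")                # < 0.02 -> no important DIF
```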
Step 7. Evaluate the remaining measurement properties
Subsequently, the remaining measurement properties (reliability, measurement error, criterion validity, hypotheses testing for construct validity, and responsiveness) should be evaluated, again following the three sub-steps described above. Unlike content validity and internal structure, the evaluation of these measurement properties provides information on the quality of the scale or subscale as a whole, rather than at the item level.
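For example, test-retest reliability is often expressed as a two-way random-effects ICC for absolute agreement (single measurement), which can then be rated against the ≥ 0.70 criterion of Table 1. The following sketch uses the standard two-way ANOVA decomposition; the toy data are invented:

```python
import numpy as np

def icc_agreement(x):
    """x: 2D array, rows = subjects, columns = measurement occasions."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means, col_means = x.mean(axis=1), x.mean(axis=0)
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)       # subjects
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)       # occasions
    resid = x - row_means[:, None] - col_means[None, :] + grand
    mse = (resid ** 2).sum() / ((n - 1) * (k - 1))             # error
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

scores = [[7, 8], [5, 5], [9, 8], [4, 5], [6, 7]]   # test and retest scores
icc = icc_agreement(scores)
print(f"ICC = {icc:.2f}, rating:", "+" if icc >= 0.70 else "-")
```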
In the evaluation of the measurement properties of the included PROMs, a few important issues should be taken into consideration. To apply the criteria for good measurement error, information is needed on the smallest detectable change (SDC) or the limits of agreement (LoA), as well as on the MIC; this information may come from different studies. The MIC should have been determined using an anchor-based longitudinal approach [35–38] and is best calculated from multiple studies and by using multiple anchors [39, 40]. If not enough information is available to judge whether the SDC or LoA is smaller than the MIC, we recommend simply reporting the available information on the SDC or LoA without grading the quality of the evidence (note that information on the MIC alone provides information on the interpretability of a PROM).
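To illustrate the comparison, the SDC can be derived from the standard error of measurement using the standard formulas SEM = SD × √(1 − ICC) and SDC = 1.96 × √2 × SEM, and then set against the MIC; the numbers below are invented:

```python
import math

sd, icc, mic = 8.0, 0.85, 5.0          # hypothetical values from studies
sem = sd * math.sqrt(1 - icc)          # standard error of measurement
sdc = 1.96 * math.sqrt(2) * sem        # smallest detectable individual change
print(f"SDC = {sdc:.1f}")              # 8.6 here: SDC > MIC
print("rating:", "+" if sdc < mic else "-")
```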
With regard to hypotheses testing for construct validity and responsiveness, we recommend that reviewers formulate hypotheses themselves against which to evaluate the results [9, 28]. These hypotheses are formulated in line with the review aim and include the expected relationships, for example, between the PROM(s) under review and the comparison instrument(s) against which the PROM(s) is/are compared, as well as the expected direction and magnitude of the correlation. Examples of generic hypotheses can be found in Table 4. In this way, all results found in the included studies can be compared against the same set of hypotheses. When at least 75% of the results are in accordance with the hypotheses, the summary result is rated as sufficient. This allows more robust conclusions to be drawn about the construct validity of the PROM.
Table 4
Generic hypotheses to evaluate construct validity and responsiveness
No. | Generic hypothesis
1 | Correlations with (changes in) instruments measuring similar constructs should be ≥ 0.50
2 | Correlations with (changes in) instruments measuring related but dissimilar constructs should be lower, i.e., 0.30–0.50
3 | Correlations with (changes in) instruments measuring unrelated constructs should be < 0.30
4 | Correlations with (changes in) instruments measuring similar constructs should differ by a minimum of 0.10 from correlations with (changes in) instruments measuring related but dissimilar constructs; the latter should in turn differ by a minimum of 0.10 from correlations with (changes in) instruments measuring unrelated constructs
5 | Meaningful changes between relevant (sub)groups (e.g., patients with expected high versus low levels of the construct of interest)
6 | For responsiveness, the AUC should be ≥ 0.70
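To illustrate how the generic hypotheses of Table 4 (rows 1–3) and the 75% rule described above could be operationalized, consider the following sketch; the construct-relation labels and the input format are our own assumptions:

```python
# Tests derived from Table 4, rows 1-3 (thresholds on absolute correlations).
EXPECTED = {
    "similar":   lambda r: abs(r) >= 0.50,
    "related":   lambda r: 0.30 <= abs(r) < 0.50,
    "unrelated": lambda r: abs(r) < 0.30,
}

def rate_construct_validity(results):
    """results: list of (construct relation, observed correlation) pairs."""
    hits = [EXPECTED[rel](r) for rel, r in results]
    proportion = sum(hits) / len(hits)
    return "+" if proportion >= 0.75 else "-"   # the 75% rule

results = [("similar", 0.62), ("similar", 0.45),
           ("related", 0.38), ("unrelated", 0.10)]
print(rate_construct_validity(results))         # 3 of 4 confirmed -> '+'
```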