13-12-2018 | Brief Communication | Issue 4/2019 | Open Access
Assessing test–retest reliability of patient-reported outcome measures using intraclass correlation coefficients: recommendations for selecting and documenting the analytical formula
Journal: Quality of Life Research, Issue 4/2019
Important notes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The US Food and Drug Administration (FDA) 2009 guidance for industry on patient-reported outcome (PRO) measures describes how the agency will review and evaluate the development and psychometric properties of measures intended to support medical product labeling claims [1]. Within the psychometric measurement section of the guidance, a key property for review is test–retest reliability, defined as the “stability of scores over time when no change is expected in the concept of interest.” The guidance also lists intraclass correlation coefficients (ICCs) and the time period of assessment as key considerations in FDA review of test–retest reliability evaluations. While the guidance describes a number of factors to consider when identifying the time period most appropriate for assessments (e.g., variability of the disease state, reference period of the measure), it does not provide specific recommendations regarding the computation of ICCs.
Within the measurement literature, a variety of computational methods have been used to calculate ICCs, a situation further complicated by the use of different notation systems for documenting the selected ICC formula [2–7]. The lack of a consistent approach has resulted in confusion regarding which ICC formula is appropriate for assessing test–retest reliability and in an inability to compare ICC results across PRO measures when different formulas are used. This absence of consensus regarding the most appropriate ICC formula for assessing test–retest reliability in PRO measurement, and the lack of a uniform naming convention for the ICC formulas, emerged as an issue within Critical Path Institute’s (C-Path’s) PRO Consortium [8]. C-Path is an independent, non-profit organization established in 2005 with public and private support to bring together scientists and others from regulatory agencies, industry, patient groups, and academia to collaborate on improving the medical product development process. C-Path, in cooperation with FDA and the pharmaceutical industry, formed the PRO Consortium in 2008 to facilitate pre-competitive collaboration aimed at advancing the assessment of patient-reported and other patient-focused clinical outcomes in drug treatment trials. There was a realization that different, often unidentified, ICC formulas were being used by the PRO Consortium’s working groups to evaluate the test–retest reliability of its developmental PRO measures without a clear rationale. This made comparison of test–retest reliability among the measures problematic and, ultimately, complicated regulatory submissions due to the absence of a coherent and consistent approach to ICC formula selection.
To address these issues, the authors reviewed the literature and developed recommendations, with rationale, for the most appropriate ICC formula for the test–retest reliability objective. The draft of this document was provided for review and comment to a group of twelve experts, including psychometricians, biostatisticians, regulators, and other scientists representing the PRO Consortium, the pharmaceutical industry, clinical research organizations, and consulting firms. Feedback was received in written form, followed by discussion with some of the experts for further input and clarification. The authors considered the group’s input in generating the final recommendations presented in this manuscript for selecting the most appropriate ICC formula within the context of assessing the test–retest reliability of PRO measures to support regulatory review.
In the measurement literature, Shrout and Fleiss [5] and McGraw and Wong [6] appear to be the two most cited references for evaluating test–retest reliability. The seminal work of Shrout and Fleiss [5] presented six computational formulas for ICCs. McGraw and Wong [6] expanded the number from 6 to 10 by incorporating additional model assumptions, various study designs, and the corresponding analysis of variance (ANOVA) models into the considerations for selecting an ICC formula. Because McGraw and Wong [6] offered a more comprehensive treatment of ICC formula selection and a clearer statement of model assumptions, we recommend using their notational system for clarity. However, a key limitation of the general ICC literature is the use of “raters” in the formulas and examples, which does not translate easily to the PRO measurement situation, where different “time points” rather than different “raters” form the context for the evaluation.
McGraw and Wong present 10 ICC formulas [6, pp 34–35] from which researchers may select based on factors that include the study design (e.g., multiple ratings per subject or multiple subjects per rater), the number of time points, and the intended generalizability of the findings. To assess test–retest reliability for PRO measures, we recommend the two-way mixed-effect ANOVA model with interaction for the absolute agreement between single scores as the preferred ICC formula based on typical study designs (Table 1).
Table 1 Two-way mixed-effect analysis of variance (ANOVA) model

Case 3 model of McGraw and Wong [6, p 34]: two-way mixed model with interaction

$x_{ij} = \mu + p_i + t_j + pt_{ij} + e_{ij}$, where

$\mu$: grand mean
$p_i$: difference due to patient $i$ ($i = 1, \ldots, n$), $p_i \sim \mathrm{Normal}(0, \sigma_p^2)$
$t_j$: difference due to time point $j$ ($j = 1, \ldots, k$), $\sum_{j=1}^{k} t_j = 0$
$pt_{ij}$: interaction between patient $i$ and time point $j$, $\sum_{j=1}^{k} pt_{ij} = 0$ and $pt_{ij} \sim \mathrm{Normal}(0, \sigma_{pt}^2)$
$e_{ij}$: random error, $e_{ij} \sim \mathrm{Normal}(0, \sigma_e^2)$

Source of variance          df                MS      Expected components in MS
Between patients            n − 1             MS_P    $k\sigma_p^2 + \sigma_e^2$
Within patients
  Between time points       k − 1             MS_T    $n\sum_{j=1}^{k} t_j^2/(k-1) + \frac{k}{k-1}\sigma_{pt}^2 + \sigma_e^2$
  Error (p × t)             (n − 1)(k − 1)    MS_E    $\frac{k}{k-1}\sigma_{pt}^2 + \sigma_e^2$

ICC(A,1) of McGraw and Wong [6, p 35]:

$\mathrm{ICC}(A,1) = \dfrac{\sigma_p^2 - \sigma_{pt}^2/(k-1)}{\sigma_p^2 + \sigma_{pt}^2 + \sigma_e^2 + \sum_{j=1}^{k} t_j^2/(k-1)} = \dfrac{MS_P - MS_E}{MS_P + (k-1)MS_E + (k/n)(MS_T - MS_E)}$
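For the typical balanced design (one score per patient at each of k time points, no missing data), ICC(A,1) can be computed directly from the mean squares in Table 1. The following sketch is an illustrative implementation of our own; the function name and the use of NumPy are choices made here, not part of the cited sources:

```python
import numpy as np

def icc_a1(scores: np.ndarray) -> float:
    """ICC(A,1): two-way model, absolute agreement, single scores.

    `scores` is an n-by-k matrix (n patients, k time points) with no
    missing values; the mean squares follow Table 1.
    """
    n, k = scores.shape
    grand_mean = scores.mean()
    patient_means = scores.mean(axis=1)  # row (patient) means
    time_means = scores.mean(axis=0)     # column (time point) means

    ms_p = k * np.sum((patient_means - grand_mean) ** 2) / (n - 1)
    ms_t = n * np.sum((time_means - grand_mean) ** 2) / (k - 1)
    resid = scores - patient_means[:, None] - time_means[None, :] + grand_mean
    ms_e = np.sum(resid ** 2) / ((n - 1) * (k - 1))

    # Mean-square form of ICC(A,1) from McGraw and Wong [6, p 35]
    return (ms_p - ms_e) / (ms_p + (k - 1) * ms_e + (k / n) * (ms_t - ms_e))
```

Identical test and retest scores yield an ICC of exactly 1; measurement error and systematic time differences pull the value below 1.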
This recommendation is based on the following considerations:
1. The two-way model is recommended over the one-way model because time is a design factor in a typical test–retest assessment and the two time points are not interchangeable (i.e., the chronology is important for detecting systematic differences such as learning). An ICC computed using the one-way model would underestimate reliability because it does not partition the within-patient variability into time variability and the error term.
2. A mixed-effect model is recommended over a random-effect model because, in the former, the test and retest time points are pre-specified and identical across all study subjects rather than randomly selected from the population of all possible pairs of time points. In this case, the time effect is considered fixed.
3. The time-by-subject interaction is assumed to be included in the error term because the interaction cannot be estimated when only one measurement per subject is collected at each time point.
4. Absolute agreement is recommended over consistency because subjects are assumed to be stable on the construct of interest across the two time points. Therefore, systematic differences in individual scores on the PRO measure over time are of interest.
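The distinction in point 4 can be made concrete with a small sketch of our own (illustrative helper names; the consistency coefficient ICC(C,1) is shown only for contrast): a uniform shift at retest, such as a learning effect, leaves ICC(C,1) at 1 but pulls ICC(A,1) below 1.

```python
import numpy as np

def _mean_squares(scores):
    # Two-way ANOVA mean squares for an n-by-k score matrix (Table 1)
    n, k = scores.shape
    gm = scores.mean()
    pm = scores.mean(axis=1)
    tm = scores.mean(axis=0)
    ms_p = k * np.sum((pm - gm) ** 2) / (n - 1)
    ms_t = n * np.sum((tm - gm) ** 2) / (k - 1)
    resid = scores - pm[:, None] - tm[None, :] + gm
    ms_e = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return n, k, ms_p, ms_t, ms_e

def icc_consistency(scores):
    # ICC(C,1): systematic time differences are ignored
    n, k, ms_p, ms_t, ms_e = _mean_squares(scores)
    return (ms_p - ms_e) / (ms_p + (k - 1) * ms_e)

def icc_agreement(scores):
    # ICC(A,1): systematic time differences count against reliability
    n, k, ms_p, ms_t, ms_e = _mean_squares(scores)
    return (ms_p - ms_e) / (ms_p + (k - 1) * ms_e + (k / n) * (ms_t - ms_e))

test_scores = np.array([2., 4., 6., 8., 10.])
shifted = np.column_stack([test_scores, test_scores + 2.0])  # uniform +2 at retest
```

For this toy example, ICC(C,1) equals 1 while ICC(A,1) equals 5/6 ≈ 0.83, illustrating why absolute agreement is the appropriate target when stability of the scores themselves is claimed.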
Note that the ICC(A,1) values remain the same no matter which two-way ANOVA model is constructed. However, we advocate articulating the model choice because of the different conceptual considerations being implied. There are many statistical models whose assumptions and interpretations are conceptually different even though some statistics or test results may be identical (e.g., univariate repeated measures ANOVA vs. multivariate ANOVA, and the Rasch model vs. the one-parameter logistic item response theory model). We believe that making a clear conceptual distinction among models is important because the chosen model informs the context and the study design. As Schuck [10] noted, “The most important conclusion of the foregoing discussion is not to report ‘the’ ICC, but to describe which ICC has been used, and for what reason.” Whatever the circumstances, we recommend including details that describe the exact model used to estimate the ICC and the rationale for the choice. To facilitate the selection of ICC formulas for different study designs (particularly those that are not typical for test–retest reliability evaluation), a decision tree adapted from McGraw and Wong’s published decision tree is provided (Fig. 1).
[Fig. 1: Decision tree for selecting an ICC formula, adapted from McGraw and Wong [6]]
Test–retest ICC values obtained from specific data sets are only point estimates of the true ICC, and they are affected by sample size, data variability, measurement error, and correlation strength, as well as by systematic differences between time points [2, 4, 6, 11]. In addition to observed ICC values, we recommend always reporting the corresponding confidence intervals to convey the precision of the estimate [6, 12, 13]. When unexpected ICC values occur, additional investigations should be conducted to identify potential reasons for them. Investigations to consider include generating scatter plots and ANOVA tables and/or conducting additional correlation assessments, t-tests, or subgroup analyses.
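McGraw and Wong [6] provide closed-form confidence intervals for each ICC. As a generic illustrative alternative, not the cited closed-form interval, the sketch below (function names and bootstrap choice are our own) attaches a percentile bootstrap interval by resampling patients:

```python
import numpy as np

def icc_a1(scores):
    # ICC(A,1) via the mean-square identity (see Table 1)
    n, k = scores.shape
    gm = scores.mean()
    pm = scores.mean(axis=1)
    tm = scores.mean(axis=0)
    ms_p = k * np.sum((pm - gm) ** 2) / (n - 1)
    ms_t = n * np.sum((tm - gm) ** 2) / (k - 1)
    resid = scores - pm[:, None] - tm[None, :] + gm
    ms_e = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return (ms_p - ms_e) / (ms_p + (k - 1) * ms_e + (k / n) * (ms_t - ms_e))

def bootstrap_icc_ci(scores, icc_fn, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for an ICC, resampling patients (rows)."""
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    estimates = []
    for _ in range(n_boot):
        rows = rng.integers(0, n, size=n)  # sample patients with replacement
        estimates.append(icc_fn(scores[rows]))
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Hypothetical test–retest data: 20 patients, 2 time points
rng = np.random.default_rng(1)
truth = np.linspace(10, 50, 20)
scores = np.column_stack([truth + rng.normal(0, 3, 20),
                          truth + rng.normal(0, 3, 20)])
lo, hi = bootstrap_icc_ci(scores, icc_a1, n_boot=1000)
```

A wide interval signals an imprecise ICC estimate even when the point estimate looks acceptable, which is why reporting the interval alongside the estimate matters.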
Finally, as ratios of variance components, ICCs for the same model and sample that are calculated using different statistical software may vary slightly due to differences in the handling of missing values and in the estimation algorithms for variance parameters. Also, because between-subject variability is incorporated into the ICC ratio, an ICC value is not independent of the study design or the specific sample utilized [2]. Low ICC values may therefore indicate issues with the study design rather than with the measurement properties of the assessment tool being evaluated. For example, the study population may be restricted to a very narrow subset of the PRO measure’s full score range, thereby restricting between-subject variability. For these reasons and many others, ICC values should be considered only one part of the total evidence needed to support the reproducibility of a PRO measure.
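The range-restriction point can be illustrated with a hypothetical simulation (all parameter values chosen purely for illustration): with identical measurement error, a sample spanning only a narrow slice of the score range yields a much lower ICC than a sample spanning the full range.

```python
import numpy as np

def icc_a1(scores):
    # ICC(A,1) via the mean-square identity (see Table 1)
    n, k = scores.shape
    gm = scores.mean()
    pm = scores.mean(axis=1)
    tm = scores.mean(axis=0)
    ms_p = k * np.sum((pm - gm) ** 2) / (n - 1)
    ms_t = n * np.sum((tm - gm) ** 2) / (k - 1)
    resid = scores - pm[:, None] - tm[None, :] + gm
    ms_e = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return (ms_p - ms_e) / (ms_p + (k - 1) * ms_e + (k / n) * (ms_t - ms_e))

rng = np.random.default_rng(0)
n = 500
error = lambda: rng.normal(0, 5, size=n)      # identical measurement error

full_range = rng.uniform(0, 100, size=n)      # trait spans the full scale
wide = np.column_stack([full_range + error(), full_range + error()])

narrow_range = rng.uniform(45, 55, size=n)    # restricted trait range
narrow = np.column_stack([narrow_range + error(), narrow_range + error()])
# icc_a1(wide) is high; icc_a1(narrow) is much lower, despite identical error
```

The drop reflects the shrunken between-subject variance in the ICC numerator, not worse measurement, which is exactly why a low ICC alone cannot condemn an instrument.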
Acknowledgements
The authors sincerely thank Donald Bushnell, Cheryl Coon, Amylou Dueck, Adam Gater, Stacie Hudgens, R. J. Wirth, and Kathleen Wyrwich for their review and feedback on the first draft of the recommendations made in this paper.
Compliance with ethical standards
Conflict of interest
Qin S, Nelson L, and McLeod L are researchers employed by RTI Health Solutions and provide clinical outcome assessment development and psychometric evaluation support for pharmaceutical companies. Eremenco S and Coons S are researchers employed by Critical Path Institute and support patient-focused drug development in cooperation with the U.S. Food and Drug Administration’s Center for Drug Evaluation and Research and the pharmaceutical industry.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.