Introduction
The use of patient-reported outcome measures (PROMs) has become standard practice in clinical research and daily clinical practice due to the growing emphasis on patient-centered and value-based care. PROMs typically consist of multi-item questionnaires used to measure constructs (or “traits”), such as “depression” or “pain.” However, because PROM scores are often continuous values without intrinsic meaning, there is a need for (clinically) meaningful thresholds or cutoff points to facilitate interpretation. Examples of meaningful thresholds include a diagnostic cutoff point for depression, and a patient acceptable symptom state (PASS) threshold for pain. Determining a meaningful threshold on a questionnaire requires comparison with an external criterion indicating the presence or absence of a meaningful trait level to define an interpretable “state” (e.g., clinical depression, or an acceptable symptom state). For clarity, we provide some terminology in Box 1.
Box 1. Terminology

Trait: The construct of interest (e.g., depression or pain) that is intended to be measured by a PROM, usually a multi-item questionnaire. The construct itself is not directly observable, hence “latent.” The latent trait is usually continuous. The PROM score provides an approximation of the true trait level. PROM scores are observed (i.e., manifest).

Perceived trait: The level of the latent trait as perceived by the patient or by an observer (e.g., a clinician). The perceived trait is equal to the latent trait plus some random (measurement) error.

State (of interest): A clinically meaningful condition that is characterized by a minimum level of a trait of interest. Examples of meaningful states are clinical depression and acceptable symptom state.

Meaningful threshold: The minimum trait level above which a meaningful state is assumed to exist. The meaningful threshold can be thought of as a location on the latent trait (in which case the threshold is latent), or as a particular PROM score (in which case the threshold is manifest, and an approximation of the latent threshold). The term “cutoff point” can be used to indicate a manifest threshold of a PROM.

State assessment: The procedure used to determine whether or not a state of interest is present. The procedure is independent of the PROM of interest. Examples of state assessments are the making of a diagnosis of clinical depression by a trained professional, and the patient response to a targeted question (often called an “anchor” question).

State scores: The results of state assessment. Typically, state scores are dichotomous: “1” if the state of interest is present, and “0” if the state is absent.

State difficulty: The level of the trait (defining a state of interest) at which the probability that a state assessment establishes the presence of the state of interest is 50%.
Given that depression represents a continuous trait in the general population [
1], the state clinical depression can be conceptualized as a level of depression above a certain threshold on this trait. Then, making a diagnosis of clinical depression can be seen as estimating a patient’s level of depression, based on their history, and to determine whether this level is above or below the threshold of clinical depression [
2]. In this example, the threshold is agreed upon by the psychiatric professional community.
The PASS represents a threshold of clinical importance beyond which patients consider their level of symptoms (e.g., pain) as acceptable [
3]. A PASS threshold is typically determined using an “anchor” question like “Do you consider your current level of pain acceptable, yes or no?”. The question assumes that patients compare their perceived level of pain to a personal threshold (or benchmark) of acceptability. This PASS threshold probably differs across individuals. Thus, the best group-level PASS estimate would be the mean of the individual PASS thresholds in a group of patients.
Given a continuous “test” variable (i.e., a variable holding the PROM scores) and a dichotomous “state” variable (i.e., a variable holding the state scores), the traditional method to determine a meaningful threshold or cutoff point is receiver operating characteristic (ROC) analysis. ROC analysis examines the sensitivity and specificity of all possible test scores with respect to their ability to classify subjects with respect to the meaningful state [
4]. As a cutoff point, a test score can be selected based on its desired sensitivity and/or specificity, controlling the type and amount of misclassification. Often a so-called “best” or “optimal” cutoff point is chosen, for which the difference between sensitivity and specificity is minimized (top-left criterion) or the sum of sensitivity and specificity is maximized (Youden criterion [
5]; in large samples with normally distributed test scores, both criteria identify the same threshold [
6]). An optimal ROC threshold serves to classify subjects with the least amount of misclassification.
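The Youden criterion can be sketched in a few lines of Python. This is a hypothetical illustration with made-up data and variable names of our own choosing, not code from the paper; the top-left criterion would instead minimize the absolute difference between sensitivity and specificity.

```python
def youden_cutoff(scores, states):
    """Return (cutoff, J): the observed test score maximizing Youden's
    J = sensitivity + specificity - 1, where a subject is classified
    state-positive when score >= cutoff."""
    n_pos = sum(states)
    n_neg = len(states) - n_pos
    best_cut, best_j = None, float("-inf")
    for c in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, states) if s >= c and y == 1)
        tn = sum(1 for s, y in zip(scores, states) if s < c and y == 0)
        j = tp / n_pos + tn / n_neg - 1  # Youden's J at this cutoff
        if j > best_j:
            best_cut, best_j = c, j
    return best_cut, best_j

# Made-up data: scores 4-6 belong to state-positive subjects.
youden_cutoff([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])  # → (4, 1.0)
```

In this toy dataset the groups separate perfectly, so the optimal cutoff (4) attains J = 1; with overlapping score distributions, J falls below 1 and the chosen cutoff trades sensitivity against specificity.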
A problem with using ROC analysis for identifying meaningful thresholds is that an optimal ROC cutoff point depends on the prevalence of the state. For any given cutoff point, an increase in the state prevalence results in an increase of the cutoff point’s sensitivity and a decrease of its specificity, whereas a decrease in the prevalence has the opposite effect [
7]. An optimal ROC-based cutoff point with a balanced sensitivity and specificity in one particular situation (with a certain prevalence) will, therefore, not be the optimal cutoff point with the same sensitivity–specificity balance in another situation. In other words, an optimal ROC cutoff point is context specific [
8]. As a meaningful threshold is principally independent of the state prevalence, the optimal ROC cutoff point may not identify the meaningful threshold on a continuous construct [
2]. Only when the state prevalence is 50% will the optimal ROC cutoff point correspond to the meaningful threshold [
2]. In other words, whereas the optimal ROC cutoff point performs excellently at classifying cases and non-cases with minimal misclassification in specific situations, it is not suitable for identifying the (mean) threshold on a continuous trait, as defined by clinical or patient criteria.
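The prevalence dependence described above can be demonstrated with a small simulation. The setup below is our own toy model, not the paper's simulation design: a normal latent trait, a state defined by a fixed latent threshold, and a noisy test score. Shifting the population mean changes the prevalence and, with it, the sensitivity and specificity of a fixed cutoff.

```python
import random

def sens_spec(mu, n=100_000, latent_cut=0.5, test_cut=0.5, seed=1):
    """Sensitivity and specificity of a FIXED test cutoff in a toy model:
    latent trait T ~ N(mu, 1); state present iff T >= latent_cut;
    observed test score X = T + noise, noise ~ N(0, 0.5)."""
    rng = random.Random(seed)
    tp = fn = tn = fp = 0
    for _ in range(n):
        t = rng.gauss(mu, 1.0)       # latent trait
        x = t + rng.gauss(0.0, 0.5)  # observed test score
        state = t >= latent_cut      # true state
        positive = x >= test_cut     # classification by the fixed cutoff
        if state and positive:
            tp += 1
        elif state:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Low prevalence (mu = 0, about 31% state-positive) versus
# high prevalence (mu = 1, about 69% state-positive):
# sensitivity rises and specificity falls, with the cutoff unchanged.
sens_lo, spec_lo = sens_spec(0.0)
sens_hi, spec_hi = sens_spec(1.0)
```

The cutoff itself never moves; only the composition of the sample does, which is exactly why a sensitivity-specificity balance tuned in one setting does not transfer to another.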
An alternative to ROC analysis is predictive modeling, which involves logistic regression analysis using the state variable as the outcome and the test variable as the predictor variable [
9]. The optimal cutoff point is the test score that is equally likely to occur in the state-positive group as in the state-negative group (i.e., the likelihood ratio is 1). Predictive modeling identifies about the same cutoff point as ROC analysis, but with greater precision [
9]. However, like the optimal ROC cutoff point, the predictive modeling cutoff point depends on the state prevalence [
10]. The prevalence-related bias in the predictive modeling cutoff point depends on the reliability of the state variable, the standard deviation (SD) of the test variable, and the point-biserial correlation between the test variable and the state variable. These parameters can be used to adjust the prevalence-related bias and recover the proper threshold across a wide range of state prevalences [
11].
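The likelihood-ratio-of-1 criterion has a simple closed form once the logistic model is fitted. The sketch below is our illustration, with hypothetical coefficient names: given fitted coefficients b0 and b1 from logit P(state = 1 | x) = b0 + b1·x, Bayes' theorem implies that the likelihood ratio equals 1 exactly where the fitted probability equals the prevalence, so the cutoff solves b0 + b1·x = logit(prevalence).

```python
import math

def lr1_cutoff(b0, b1, prevalence):
    """Test score at which the likelihood ratio equals 1.

    Posterior odds = prior odds * likelihood ratio, so LR = 1 exactly
    where the fitted probability equals the prevalence, i.e. where
    b0 + b1 * x = logit(prevalence)."""
    logit_prev = math.log(prevalence / (1.0 - prevalence))
    return (logit_prev - b0) / b1

# Hypothetical fit: with b0 = -5, b1 = 1 and prevalence 0.5,
# logit(0.5) = 0, so the cutoff is 5.0.
lr1_cutoff(-5.0, 1.0, 0.5)  # → 5.0
```

The formula also makes the prevalence dependence explicit: holding b0 and b1 fixed, changing the prevalence shifts the cutoff, which is the bias that the adjustment in [11] corrects.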
A third method, recently introduced, is based on item response theory (IRT) [
2]. This method uses the state prevalence to estimate a meaningful threshold on the latent trait scale and subsequently determines the corresponding test score threshold. However, this method assumes perfect validity and reliability of the state scores, which is arguably questionable. It is currently unknown to what extent the reliability of the state scores affects the threshold estimate.
This paper presents an improved IRT-based method to estimate meaningful thresholds, which is based on the work of Bjorner et al. [
12] in estimating meaningful within-individual change thresholds using longitudinal IRT. Like Bjorner et al. [
12], the new method uses the IRT difficulty parameter of the state scores, rather than the state prevalence, to find the latent trait threshold of interest. We will demonstrate the performance of this method using simulation studies and two real datasets. We will compare the results with the ROC method, the predictive modeling method [
9], the adjusted predictive modeling method [
11], and the “old” state prevalence IRT method [
2].
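The core idea of the new method can be sketched as follows. This is our simplified illustration, assuming dichotomous PROM items under a 2PL model (the paper's analyses rely on polytomous models such as the GRM, and all item parameters below are hypothetical): the state indicator is included as an extra item in the IRT model, its estimated difficulty is taken as the latent meaningful threshold, and the test characteristic curve maps that latent location to a cutoff on the sum-score scale.

```python
import math

def expected_sum_score(theta, items):
    """Expected PROM sum score at latent level theta (the test
    characteristic curve), for dichotomous 2PL items given as
    (discrimination a, difficulty b) pairs."""
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for a, b in items)

def cutoff_from_state_difficulty(state_difficulty, items):
    """Map the state item's estimated difficulty, interpreted as the
    latent meaningful threshold, to a cutoff on the sum-score scale."""
    return expected_sum_score(state_difficulty, items)

# Hypothetical parameters for a 4-item PROM, and a state item whose
# estimated difficulty (latent threshold) is 0.4:
items = [(1.2, -1.0), (1.0, 0.0), (0.8, 0.5), (1.5, 1.0)]
cutoff = cutoff_from_state_difficulty(0.4, items)  # ≈ 2.21 out of 0-4
```

Because the difficulty parameter is estimated jointly with the discrimination of the state item, unreliability of the state scores is absorbed by the discrimination parameter rather than biasing the threshold, which is the key difference from the state prevalence method.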
Discussion
As the use of PROMs has become standard in clinical research and daily clinical practice, there is an increased incentive to develop meaningful thresholds to accurately interpret questionnaire scores and facilitate clinical decision making. In this article, we have introduced a new IRT approach to estimate meaningful thresholds. The method perfectly recovered the true (as simulated) meaningful threshold as a fixed value on the latent trait with practically no bias and high precision, regardless of the state prevalence or the reliability of the state scores. In contrast, most of the other methods examined produced biased threshold estimates when the state prevalence differed from 0.5.
Importantly, meaningful thresholds or cutoff points are used for two goals that are principally incompatible with each other: interpretation and classification. The first goal, the interpretation of test scores, relates to questions such as the cutoff point for clinical depression on a depression scale, or the minimum level of acceptability on a pain scale. Interpretational thresholds, especially if they are based on relatively subjective criteria, may depend on specific sample characteristics. For instance, more severe patients may be willing to accept higher levels of knee pain and dysfunction as acceptable than less severe patients [
31]. If the thresholds vary, they do so at the patient level, affecting the mean threshold in the sample. The thresholds do not vary with the prevalence of the state of interest. Our new state difficulty IRT method identifies these interpretational thresholds.
The second goal is classification of individual patients. For instance, for screening we often want thresholds that ensure the best balance between sensitivity and specificity, in order to minimize misclassification. To that end, classificational thresholds must be prevalence specific, because a cutoff point’s sensitivity and specificity change with prevalence [
7]. ROC analysis identifies a test’s optimal cutoff point in a particular situation, which cannot be generalized to situations with differing prevalence and disease spectrum. Therefore, the ROC cutoff point does not identify the interpretational threshold on the latent trait (unless the prevalence is 0.5) [
2].
Apart from the new state difficulty IRT method, the adjusted predictive modeling method also accurately identified the interpretational threshold with high precision, although some bias occurred with state prevalences smaller than 0.3 or greater than 0.7. This bias is at least partly due to the low or high state prevalence [
11], but skewness of the test scores might also play a role. However, the observation of highly similar threshold estimates obtained by the adjusted predictive modeling method and the new IRT method, despite profound skewness and ceiling effects in our second dataset, is a promising finding. Nevertheless, future (simulation) studies should determine to what extent non-normality of the test scores affects the results of the adjusted predictive modeling approach.
The new state difficulty IRT method assumes that the state of interest can be regarded as an effect indicator [
32] of the latent trait and, therefore, can be included as an additional item in the IRT model. In some cases, states may alternatively be conceptualized as having a causal effect on the latent trait. Use of such causal indicators [
32] is beyond the current paper but can be handled by fitting explanatory IRT models [
33].
Both the new state difficulty IRT method and the adjusted predictive modeling method can be used to estimate meaningful thresholds, but the methods come with different assumptions. For the new IRT method, the data should be unidimensional enough to allow IRT analysis [
34], and the questionnaire should fit an IRT model. Although any IRT model may be employed, the GRM usually provides a good fit to PROM data. Furthermore, the IRT method assumes that the latent trait is normally distributed. Skewness of the observed test scores is not a problem as long as the latent trait can be assumed to be normal. On the other hand, the adjusted predictive modeling method assumes normality of the test scores [
11].
Taking these assumptions into account, the choice of method may depend on the questionnaire’s dimensionality, the distribution of the test scores, and the fit of an IRT model. In case of normally distributed test scores, both the adjusted predictive modeling method and the new IRT method may be used. If the data show profound ceiling or floor effects, we recommend using the new state difficulty IRT method. The old state prevalence IRT method [
2] is clearly inferior to the new IRT method because the state prevalence is affected by the (un)reliability of the state scores. We therefore recommend against any further use of the old state prevalence IRT method [2]. Similarly, ROC analysis should no longer be used to identify interpretational thresholds.