Evaluating the use of prior information to individualise start item selection for the EORTC CAT core
- Open Access
- 01-01-2026
Introduction
Computerized adaptive tests (CATs) are assessment tools that dynamically select items in real time from a bank of pre-calibrated items designed to measure a specific construct. CATs select and present items sequentially based on responses given so far. In each step, a score estimate is computed from the answered items and the next item is the one judged most informative based on the individual’s current score estimate. In this way, CAT tailors the set of asked items (the ‘questionnaire’) to the individual, thereby asking more relevant items and increasing the measurement precision compared to traditional, static questionnaires [1, 2]. Here and throughout the text, ‘relevant’ refers to the relevance of an item’s difficulty—that is, how well the item’s location on the scale matches the individual’s level of the construct being measured. Therefore, when we use the term ‘relevant’ it does not refer to the content relevance of the item.
As score estimates prior to assessment start are usually not available, it is customary to initiate CAT-assessments with the same start item for all individuals in a study, e.g., the item most informative at the expected mean of the target population. A concern, particularly for brief CATs, is that this may result in inefficient assessments or even biased scores for individuals deviating from the hypothesized mean (i.e., having poorer or better health), because the CAT requires more items to fully adapt to this ‘deviation’ from the mean. If a score estimate were available prior to the start of a CAT-assessment, then the choice of start item could be adapted to the individual, similar to what is done for the subsequent items of a CAT-assessment. As with individually adapted item selection in CAT generally, an individualised start item selection would be expected to make the start item more relevant and hence increase the efficiency and measurement precision of the CAT-assessment. To our knowledge, few studies have investigated the use of prior information in CAT for start item selection and/or score estimation [3‐8]. Within the field of health-related quality of life (HRQoL), we identified only one study, which focused on using prior information to improve Bayesian score estimation [8]. These studies generally confirmed that use of prior information may improve CAT measurement efficiency and precision.
The EORTC CAT Core is a CAT instrument developed by the European Organisation for Research and Treatment of Cancer (EORTC) Quality of Life Group covering the same HRQoL domains as the EORTC QLQ-C30 questionnaire [9‐12]. To explore the use of prior information to optimise assessment with the EORTC CAT Core, we evaluated the use of ‘cross-domain prediction’ to obtain initial score estimates that could be used to individualise start item selection. Cross-domain prediction is the use of the score estimate for one domain to predict/estimate the score for another domain. Cross-domain prediction can be conducted within an EORTC CAT Core assessment, i.e., no external information is required to obtain an initial score estimate. In a CAT-assessment covering multiple HRQoL domains, cross-domain prediction can be applied to obtain initial score estimates and hence, to individualise start items for all but the first domain assessed. This may be a simple and efficient way to optimise start item selection. We considered implementation of such individual start item selection in the EORTC CAT Core to be advantageous if reasonably precise initial score estimates could be obtained with cross-domain prediction, and the individual start items enhanced measurement precision.
To assess the potential for using cross-domain prediction to individualise start item selection for the EORTC CAT Core, we piloted the approach for the physical functioning domain [13]. The evaluation indicated that physical functioning scores can often be predicted with reasonable precision (within 1SD) from another domain and that individualised start items are likely to increase the precision of physical functioning score estimates in the initial steps of a CAT. Here, we report the evaluations of using cross-domain prediction to individualise start item selection for the 14 EORTC CAT Core symptom and functional domains (including a summary of the previously evaluated physical functioning domain [13]). Hence, the aims were to evaluate: (1) the precision of predicting scores on one HRQoL domain from scores on another domain (cross-domain prediction), and (2) whether individually selected start items can improve measurement precision with the EORTC CAT Core.
Methods
The analyses presented here are similar to those conducted for the physical functioning domain [13]. The following provides a summary of the data and analyses; further details can be found in the previous publication [13].
The EORTC CAT Core
The EORTC CAT Core includes 14 CAT item banks for the five functional and nine symptom domains of the QLQ-C30 questionnaire [11]. In addition, the EORTC CAT Core includes the two QLQ-C30 items on global health/quality of life [14]. The CAT item banks comprise between 7 and 34 items (260 items in total) and include the items of the QLQ-C30 (see Table 1). The EORTC CAT Core measures apply so-called T-score metrics, scaled so that the European general population has a mean of 50 and a standard deviation of 10 for all domains [15]. Hence, scores > 50 reflect better functioning or worse symptoms than the European general population average. This T-scoring is applied and reported in the current study.
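The T-score metric described above is a simple linear rescaling of a latent-trait estimate on a standard normal metric. As an illustrative sketch (the function name is ours, not part of the EORTC software):

```python
def t_score(theta: float) -> float:
    """Convert a latent-trait estimate on a standard normal metric
    (European general-population mean 0, SD 1) to the EORTC T-score
    metric (mean 50, SD 10)."""
    return 50.0 + 10.0 * theta

# One SD better functioning than the general-population average:
t_score(1.0)  # 60.0
```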
Table 1
EORTC CAT core item banks and number of items in each bank
| Item bank domain | # items in bank |
|---|---|
| Physical functioning | 31 |
| Role functioning | 10 |
| Emotional functioning | 24 |
| Cognitive functioning | 34 |
| Social functioning | 13 |
| Fatigue | 34 |
| Nausea & vomiting | 19 |
| Pain | 16 |
| Dyspnoea | 32 |
| Insomnia | 8 |
| Lack of appetite | 7 |
| Constipation | 10 |
| Diarrhoea | 13 |
| Financial difficulties | 9 |
| Global health/quality of life | 2* |
Domain score prediction
Sample
Data from the development and validation of the EORTC CAT Core were combined into a single dataset comprising 10,084 cancer patient assessments, with T-score estimates for all 15 domains [11, 12, 15]. To reduce the risk of overestimating the performance of the prediction, the sample was randomly split into a training dataset (80%) for fitting the prediction models and a testing dataset (20%) for evaluating the prediction [16].
Prediction of domain score
For each of the 14 domains applicable to CAT, prediction models were estimated using linear regression in the training dataset—one model for each of the other domains, including global health/quality of life. We used prediction based on the observed score for one domain at a time, as this could be applied in a CAT assessment as soon as one EORTC CAT Core domain had been assessed. As a simple summary of the pairwise relationships between domains, we calculated Pearson correlations for all domain pairs.
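As a hedged sketch of this per-pair approach (the actual models were fitted in SAS on the training dataset; the data and domain names below are synthetic stand-ins), simple linear regression of one domain's T-score on another's could look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the training dataset: correlated T-scores
# (mean 50, SD 10) for two hypothetical domains.
fatigue = rng.normal(50.0, 10.0, 1000)
pain = 50.0 + 0.6 * (fatigue - 50.0) + rng.normal(0.0, 8.0, 1000)

# One simple linear regression per (predictor, target) domain pair.
slope, intercept = np.polyfit(fatigue, pain, deg=1)

def predict_pain_from_fatigue(score: float) -> float:
    return intercept + slope * score

# Initial pain estimate for a respondent whose fatigue T-score is 65:
predicted = predict_pain_from_fatigue(65.0)
```

In a multi-domain CAT assessment, the model for a given target domain would be chosen according to which predictor domain has already been assessed.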
Evaluations of domain score predictions
All evaluations were conducted on the testing dataset. For each domain, the 14 predicted scores, one for each of the other domains, were calculated and compared to the observed score for that domain. For each domain we estimated the mean difference, the Pearson correlation, and the percentage of predictions deviating < 5 points (= ½SD) and < 10 points (= 1SD), respectively, between each of the 14 cross-domain predicted scores and the observed score.
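The per-domain comparison statistics can be sketched as follows (an illustrative helper, not the authors' SAS code):

```python
import numpy as np

def prediction_summary(predicted, observed):
    """Mean difference, Pearson correlation, and percentages of
    predictions deviating < 5 points (half SD) and < 10 points (1 SD)
    from the observed T-scores."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    diff = predicted - observed
    return {
        "mean_difference": float(diff.mean()),
        "correlation": float(np.corrcoef(predicted, observed)[0, 1]),
        "pct_within_half_sd": 100.0 * float(np.mean(np.abs(diff) < 5)),
        "pct_within_one_sd": 100.0 * float(np.mean(np.abs(diff) < 10)),
    }

# Toy example with four patients:
stats = prediction_summary([52, 48, 61, 40], [50, 50, 55, 49])
```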
Impact of individualised start item selection
Simulation sample
We conducted a series of Monte Carlo CAT simulations to evaluate the impact of individualised start items versus the standard use of fixed start items. For each HRQoL domain and for each score in the range M ± 30 (= 3SD) with increments of 0.5, 200 sets of responses to all items in the CAT item bank were simulated. Here M was the mean of the domain in the sample from the original development [11]. For example, the fatigue item bank includes 34 items and the mean fatigue score of the development sample was 51. Therefore, we simulated 200 sets of responses to the 34 fatigue items for each of the fatigue scores 21, 21.5, …, 81. Based on these response sets, CAT-assessments asking 1 to 10 items (or all items for banks with < 10 items), respectively, were simulated.
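The true-score grid for one domain can be sketched as follows (assuming the T-score SD of 10 described earlier; the function name is ours):

```python
import numpy as np

def true_score_grid(domain_mean: float, sd: float = 10.0, step: float = 0.5):
    """Every score in the range M +/- 3SD in increments of `step`, as
    used for the Monte Carlo simulations (200 response sets per score)."""
    half_width = 3.0 * sd
    n_points = int(round(2.0 * half_width / step)) + 1
    return np.linspace(domain_mean - half_width, domain_mean + half_width, n_points)

# Fatigue: development-sample mean 51 gives scores 21, 21.5, ..., 81.
grid = true_score_grid(51.0)
```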
The following four CAT start item variants were evaluated: one fixed start item and three individualised start items.

- ‘Fixed’: This fixed start item design initiated all CATs with the item most informative at the mean of the development sample (M). This imitates the current EORTC CAT approach, where all respondents are presented with the same start item based on knowledge about a target population.
- ‘True’: This individualised start item design initiated the CATs with the item most informative at the individual’s ‘true’, i.e., sampled, score.
- ‘Diff ½SD’: This individualised start item design initiated the CATs with the item most informative ½SD either below or above (half of the simulations each) the individual’s true score.
- ‘Diff 1SD’: Similar to ‘Diff ½SD’ but starting with the item most informative 1SD from the individual’s true score.

Among the three individualised designs, ‘True’ reflected the ideal situation where the score has been predicted perfectly, while ‘Diff ½SD’ and ‘Diff 1SD’ reflected situations where the predicted score deviates from the true score.
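Start item selection under these designs can be sketched as below. This is a deliberate simplification (item locations and values are hypothetical): each item is reduced to the score at which it is most informative, and a full IRT implementation would instead evaluate each item's information function at the score estimate.

```python
import numpy as np

def select_start_item(item_locations, score_estimate):
    """Pick the start item whose peak-information location (T-score
    metric) lies closest to an initial score estimate. Simplification
    of maximum-information selection."""
    locations = np.asarray(item_locations, dtype=float)
    return int(np.argmin(np.abs(locations - score_estimate)))

# Hypothetical peak-information locations for a small bank:
locations = [35.0, 45.0, 50.0, 58.0, 66.0]

fixed_start = select_start_item(locations, 50.0)       # 'Fixed': population mean
individual_start = select_start_item(locations, 33.0)  # low cross-domain prediction
```

The ‘Diff ½SD’ and ‘Diff 1SD’ designs correspond to calling the selector with the true score shifted by 5 or 10 T-score points.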
The above evaluations are of fixed-length CATs, i.e., CATs stopping after a predefined number of items. This is the most common application of the EORTC CAT Core instrument. An alternative approach, widely used in CAT more broadly, is variable-length CATs aiming for a specific level of precision (standard error or, equivalently, reliability). The impact of individualised start items for variable-length CATs may be inferred approximately from these evaluations: simply find the average number of items required to obtain a specific level of precision. For example, if the evaluations above showed that four items resulted in an average reliability of 0.87 and five items in 0.92, then the average number of items asked in a variable-length CAT aiming for reliability 0.90 can be expected to be between four and five. For a more elaborate illustration related to variable-length CATs, we evaluated the impact of start item for CAT administration stopping once the reliability exceeded 0.90 (corresponding to SE ≤ 0.33 on a standard normal metric) or after a maximum of 10 items.
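On a standard normal metric, the reliability criterion and the standard-error criterion are linked by rel = 1 − SE²; a quick check of the stopping rule above:

```python
def reliability_from_se(se: float) -> float:
    """Reliability of a score estimate with standard error `se` on a
    standard normal (mean 0, SD 1) metric: rel = 1 - SE^2."""
    return 1.0 - se ** 2

# SE <= 0.33 corresponds to reliability >= ~0.89, matching the 0.90
# stopping criterion used in the variable-length simulations.
rel = reliability_from_se(0.33)  # 0.8911
```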
Evaluations of individualised start items
To assess the potential impact of individualised start items on measurement precision, we compared CAT score estimates, using each of the four types of start items, to the true score. We anticipated that the ‘fixed’ CATs would perform well near the mean (at which the start item was most informative) but less so for more extreme scores. Therefore, the evaluations were divided into three sections: low scores (3SD to 1SD below the mean), middle scores (1SD below to 1SD above the mean), and high scores (1SD to 3SD above the mean). For each of the three score sections we summarised the measurement precision of the CATs by plotting the following:

- The mean difference between the CAT-based scores and the true scores.
- The percentage of differences < 5 points (½SD) between CAT scores and the true scores.
All analyses and simulations were conducted using SAS Enterprise Guide software version 7.15.
Results
Domain score prediction
Clinical and sociodemographic characteristics of the N = 10,084 patients in the total sample are shown in Table 2. This was a mixed sample of cancer patients representing 12 countries and a variety of cancer diagnoses and treatments. The training dataset included a random subsample of N = 8068 and the testing dataset included the remaining N = 2016.
Table 2
Clinical and sociodemographic characteristics of the N = 10,084 patients in the sample for evaluating domain score prediction (previously published in [13])
| | | N | Percentage |
|---|---|---|---|
| Age, mean years (SD) | | 10,025 | 60.4 (13.1) |
| | Missing | 59 | |
| Sex | Female | 5437 | 54.2% |
| | Male | 4599 | 45.8% |
| | Missing | 48 | |
| Country | Australia | 114 | 1.1% |
| | Austria | 374 | 3.7% |
| | Denmark | 2043 | 20.3% |
| | France | 1010 | 10.0% |
| | Germany | 300 | 3.0% |
| | Italy | 397 | 3.9% |
| | Poland | 824 | 8.2% |
| | Spain | 407 | 4.0% |
| | Sweden | 282 | 2.8% |
| | Taiwan | 403 | 4.0% |
| | The Netherlands | 129 | 1.3% |
| | UK | 3801 | 37.7% |
| | Missing | 0 | |
| Cancer stage | I-II | 4676 | 50.3% |
| | III-IV | 4616 | 49.7% |
| | Missing | 792 | |
| Cancer site | Breast | 2088 | 21.4% |
| | Gastrointestinal | 1370 | 14.1% |
| | Gynaecological | 1164 | 12.0% |
| | Head and neck | 1097 | 11.3% |
| | Lung | 577 | 5.9% |
| | Urogenital | 1523 | 15.6% |
| | Other | 1911 | 19.6% |
| | Missing | 347 | |
| Current treatment | Chemotherapy | 4268 | 43.1% |
| | Other treatment | 2016 | 20.4% |
| | No current treatment | 3612 | 36.5% |
| | Missing | 188 | |
| Employment status | Retired | 4788 | 48.4% |
| | Working | 3500 | 35.4% |
| | Other | 1601 | 16.2% |
| | Missing | 195 | |
| Cohabitation | Living with a partner | 7368 | 74.5% |
| | Living alone | 2517 | 25.5% |
| | Missing | 199 | |
Correlations between the 15 HRQoL domains ranged from 0.03 to 0.71 in absolute value, with a median of 0.37 (see Table 3). Fatigue showed the strongest correlation with most domains, although quality of life, physical functioning, and role functioning were also highly correlated with many domains.
Table 3
Between-domain correlations
| | PF | RF | EF | CF | SF | FA | NV | PA | DY | SL | AP | CO | DI | FI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RF | 0.70 | | | | | | | | | | | | | |
| EF | 0.45 | 0.50 | | | | | | | | | | | | |
| CF | 0.46 | 0.46 | 0.54 | | | | | | | | | | | |
| SF | 0.56 | 0.67 | 0.56 | 0.47 | | | | | | | | | | |
| FA | −0.70 | −0.71 | −0.58 | −0.55 | −0.63 | | | | | | | | | |
| NV | −0.37 | −0.42 | −0.39 | −0.34 | −0.41 | 0.49 | | | | | | | | |
| PA | −0.56 | −0.57 | −0.48 | −0.43 | −0.50 | 0.59 | 0.40 | | | | | | | |
| DY | −0.53 | −0.45 | −0.35 | −0.35 | −0.38 | 0.50 | 0.28 | 0.36 | | | | | | |
| SL | −0.36 | −0.37 | −0.44 | −0.37 | −0.37 | 0.46 | 0.30 | 0.40 | 0.29 | | | | | |
| AP | −0.45 | −0.49 | −0.43 | −0.36 | −0.47 | 0.56 | 0.53 | 0.44 | 0.31 | 0.34 | | | | |
| CO | −0.30 | −0.28 | −0.26 | −0.26 | −0.29 | 0.34 | 0.29 | 0.30 | 0.22 | 0.24 | 0.32 | | | |
| DI | −0.20 | −0.23 | −0.20 | −0.19 | −0.23 | 0.27 | 0.25 | 0.22 | 0.14 | 0.18 | 0.21 | 0.03 | | |
| FI | −0.30 | −0.34 | −0.36 | −0.34 | −0.41 | 0.35 | 0.27 | 0.33 | 0.25 | 0.27 | 0.24 | 0.17 | 0.14 | |
| QL | 0.61 | 0.64 | 0.57 | 0.47 | 0.62 | −0.69 | −0.43 | −0.55 | −0.42 | −0.38 | −0.52 | −0.30 | −0.23 | −0.32 |
Table 4 summarises the comparisons between cross-domain predicted T-scores and observed T-scores, highlighting both the accuracy and variability of cross-domain predictions. For each domain, the comparisons are summarised across the 14 predictions based on each of the other domains. The table reports the minimum, maximum, and mean of the following summary statistics when comparing predicted and observed scores: mean difference, correlation, and the percentage of cases with absolute differences < 5 and < 10 points, respectively. For each summary statistic the table also shows the domain resulting in the minimum and maximum values, i.e., the domains providing the best and worst cross-domain prediction according to the specific statistic.
Table 4
Summary of the evaluations comparing predicted domain scores with observed scores
| Predicted domain | Mean difference: Min | Mean difference: Max | Mean difference: Mean | Correlation: Min | Correlation: Max | Correlation: Mean | % dev. < 5 points: Min | % dev. < 5 points: Max | % dev. < 5 points: Mean | % dev. < 10 points: Min | % dev. < 10 points: Max | % dev. < 10 points: Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PF | 0.1 (RF) | 0.5 (QL) | 0.3 | 0.25 (DI) | 0.71 (RF) | 0.48 | 33 (CF) | 50 (FA) | 38 | 57 (NV) | 85 (RF) | 69 |
| RF | 0.1 (PF) | 0.5 (QL) | 0.2 | 0.27 (DI) | 0.72 (FA) | 0.50 | 21 (DI) | 57 (SF) | 34 | 65 (SL) | 82 (FA) | 75 |
| EF | −0.3 (RF) | 0.1 (QL) | −0.1 | 0.22 (DI) | 0.60 (FA) | 0.44 | 34 (DI) | 54 (FA) | 41 | 60 (NV) | 82 (FA) | 74 |
| CF | −0.1 (PF) | 0.1 (QL) | 0.0 | 0.23 (DI) | 0.55 (FA) | 0.40 | 33 (CO) | 57 (PA) | 42 | 77 (DY) | 86 (FA) | 81 |
| SF | −0.1 (RF) | 0.3 (QL) | 0.1 | 0.26 (DI) | 0.67 (RF) | 0.47 | 26 (CO) | 61 (RF) | 38 | 71 (SL) | 87 (FA) | 80 |
| FA | −0.2 (QL) | 0.2 (PF) | 0.1 | 0.31 (DI) | 0.72 (RF) | 0.53 | 41 (CO) | 54 (QL) | 47 | 67 (DI) | 89 (RF) | 75 |
| NV | 0.1 (QL) | 0.4 (RF) | 0.3 | 0.25 (DY) | 0.54 (AP) | 0.36 | 38 (CF) | 59 (AP) | 45 | 68 (AP) | 80 (DI) | 74 |
| PA | −0.2 (QL) | 0.2 (PF) | 0.0 | 0.24 (DI) | 0.60 (RF) | 0.44 | 23 (CO) | 57 (RF) | 37 | 73 (CO) | 84 (FA) | 77 |
| DY | −0.4 (QL) | −0.1 (PF) | −0.3 | 0.16 (DI) | 0.53 (PF) | 0.34 | 6 (DI) | 41 (PF) | 25 | 67 (SL) | 80 (AP) | 75 |
| SL | −0.1 (QL) | 0.1 (PF) | 0.0 | 0.22 (CO) | 0.46 (FA) | 0.34 | 33 (DI) | 57 (PA) | 42 | 73 (FI) | 87 (DI) | 81 |
| AP | 0.1 (QL) | 0.4 (RF) | 0.3 | 0.24 (FI) | 0.56 (FA) | 0.41 | 4 (FI) | 44 (PA) | 27 | 53 (DY) | 72 (QL) | 66 |
| CO | −0.1 (QL) | 0.1 (PF) | 0.0 | 0.08 (DI) | 0.33 (FA) | 0.26 | 3 (DI) | 52 (NV) | 36 | 78 (FA) | 82 (NV) | 80 |
| DI | −0.2 (QL) | 0.0 (RF) | −0.1 | 0.08 (CO) | 0.31 (FA) | 0.23 | 1 (CO) | 49 (NV) | 32 | 74 (CO) | 82 (SL) | 78 |
| FI | 0.1 (QL) | 0.3 (RF) | 0.2 | 0.15 (CO) | 0.41 (SF) | 0.29 | 35 (SL) | 54 (NV) | 43 | 76 (AP) | 87 (DI) | 81 |
| Mean | −0.1 | 0.2 | 0.1 | 0.21 | 0.55 | 0.39 | 24 | 53 | 38 | 69 | 83 | 76 |
The mean difference between predicted and observed scores was trivial (≤ 0.5) for all HRQoL domains regardless of which domain was used to predict the score, suggesting that cross-domain prediction resulted in unbiased estimation of scores. The three other summary statistics showed considerably more variation, both across the domains being predicted and across the domains used to predict. Across the domains, the highest obtained correlation between predicted and observed scores ranged from 0.31 (Diarrhoea) to 0.72 (Fatigue and Role functioning), with an average of 0.55.
Predicted scores were within 5 points of the observed score for 41% (Dyspnoea) to 61% (Social functioning) of the patients across the domains when using the best predictor. For all domains except lack of appetite it was possible to predict scores within 10 points of the observed value for at least 80% of patients, and on average across domains for 83%. Even when using the ‘poorest’ predictor for each domain, more than two-thirds of the predicted scores (69%) were within 10 points of the observed score.
In summary, while prediction accuracy varied across domains, the overall performance was satisfactory, with generally unbiased estimates and acceptable precision for most HRQoL domains.
Impact of individualised start item selection
The simulation results summarising the effects of different start items are presented in Supplementary material 1 and Table 5. Plots for each of the 14 domains (Supplementary material 1) show the mean deviations of CAT-estimated domain scores from true scores, the percentage of CAT scores deviating < 5 points, and the mean reliability for each of the four types of start item evaluated. Except for low pain scores, there were only trivial differences across the types of start items when asking four or more items in the CATs. For low pain scores, a fixed start item resulted in markedly lower reliability (0.2–0.4 lower) than individualised start items for CATs asking 1–4 items; when asking five or more items, differences were trivial. Given the trivial impact of the start item for longer CATs, the following focuses on CATs asking 1–3 items. As a summary, Table 5 shows the mean differences for CATs asking 1–3 items between CATs started with a fixed start item and CATs started with the item most informative five points from the individual’s true score (‘Diff ½SD’). The ‘Diff ½SD’ design was chosen for this summary as it reflects intermediate (‘average’) performance among the three types of individualised start items evaluated. Differences in mean deviations from the true score were generally trivial: except for low pain scores, all differences in mean deviations were ≤ 1 and ranged from −0.4 to 0.1 on average across domains. As expected, differences were generally small for the middle interval close to the sample mean score, where the fixed start item provided maximum information. For the middle interval the fixed start item resulted, on average across domains, in the same reliability as the individualised start item and marginally more (2%) estimated scores within 5 points of the true score.
For high scores the largest difference was observed for emotional functioning, for which the individualised start item resulted in 4% more estimated scores within five points of the true score and an increase in reliability of 0.11 compared to the fixed start item. On average, the proportion of scores deviating < 5 points was similar across domains (difference = 0.8%), but the individualised start item resulted in 0.05 higher reliability. The largest differences between fixed and individualised start items were observed for low scores, i.e., scores > 1SD below the mean. Here, on average across domains, 4% more score estimates were within 5 points of the true score when using the individualised start item, with the largest difference for insomnia (12% more). The reliability was up to 0.33 higher when using the individualised start item compared to the fixed start item (pain), with an average increase in reliability across domains of 0.11.
Table 5
Summary of the simulations of individualised start items vs. fixed start item for CATs asking 1–3 items
| Domain | Differences in…* | Low scores | Middle scores | High scores |
|---|---|---|---|---|
| Physical functioning | Mean deviation | −1.0 | 0.5 | −0.2 |
| | % dev. < 5 points | 11.0 | −1.2 | 1.1 |
| | Reliability | 0.26 | 0.00 | 0.05 |
| Role functioning | Mean deviation | −0.6 | 0.0 | 0.0 |
| | % dev. < 5 points | 5.5 | 0.4 | 0.1 |
| | Reliability | 0.10 | 0.00 | 0.03 |
| Emotional functioning | Mean deviation | 0.1 | 0.0 | −1.0 |
| | % dev. < 5 points | 2.5 | −5.5 | 4.2 |
| | Reliability | 0.05 | 0.01 | 0.11 |
| Cognitive functioning | Mean deviation | −0.5 | 0.1 | 0.0 |
| | % dev. < 5 points | 5.3 | −2.4 | 0.2 |
| | Reliability | 0.04 | 0.00 | 0.04 |
| Social functioning | Mean deviation | −0.4 | 0.2 | 0.5 |
| | % dev. < 5 points | 5.1 | −1.5 | 0.0 |
| | Reliability | 0.10 | 0.00 | 0.00 |
| Fatigue | Mean deviation | −0.6 | −0.1 | −0.2 |
| | % dev. < 5 points | 2.6 | −0.2 | 0.6 |
| | Reliability | 0.11 | 0.00 | 0.09 |
| Nausea & vomiting | Mean deviation | −0.7 | 0.0 | −0.1 |
| | % dev. < 5 points | 4.4 | −0.3 | 2.1 |
| | Reliability | 0.19 | 0.00 | 0.09 |
| Pain | Mean deviation | −2.5 | 0.0 | −0.2 |
| | % dev. < 5 points | 6.6 | −2.5 | 1.7 |
| | Reliability | 0.33 | −0.01 | 0.09 |
| Dyspnoea (DY) | Mean deviation | 0.2 | 0.3 | 0.5 |
| | % dev. < 5 points | 0.0 | −2.4 | 2.6 |
| | Reliability | 0.00 | −0.01 | 0.04 |
| Insomnia (SL) | Mean deviation | −0.8 | −0.2 | 0.6 |
| | % dev. < 5 points | 11.9 | −0.6 | −2.9 |
| | Reliability | 0.27 | 0.00 | 0.01 |
| Lack of appetite (AP) | Mean deviation | 0.1 | 0.1 | 0.3 |
| | % dev. < 5 points | −1.7 | −1.3 | −2.5 |
| | Reliability | 0.01 | 0.00 | 0.01 |
| Constipation (CO) | Mean deviation | 0.5 | 0.1 | 0.0 |
| | % dev. < 5 points | −1.0 | −2.1 | 1.3 |
| | Reliability | 0.06 | 0.00 | 0.06 |
| Diarrhoea (DI) | Mean deviation | 0.2 | 0.4 | 0.4 |
| | % dev. < 5 points | 0.0 | −3.7 | 0.4 |
| | Reliability | 0.00 | −0.01 | 0.04 |
| Financial difficulties (FI) | Mean deviation | 0.5 | 0.0 | 0.2 |
| | % dev. < 5 points | −0.6 | −5.4 | 2.9 |
| | Reliability | 0.04 | −0.01 | 0.07 |
| Mean across domains | Mean deviation | −0.4 | 0.1 | 0.1 |
| | % dev. < 5 points | 4.1 | −2.1 | 0.8 |
| | Reliability | 0.11 | −0.01 | 0.05 |
In summary, the benefits of individualised start items were most apparent for short CATs (1–3 items) and for patients with low scores, while differences were generally negligible for longer CATs and mid-range scores.
The evaluations of start item impact for variable-length CATs stopping once reliability ≥ 0.90 are summarised in Table S1 in Supplementary material 1. For low physical functioning scores, using individualised start items reduced the average number of items administered by up to 1.4. Otherwise, the impact of the start item was minimal, with differences in the average number of items not exceeding 0.7.
Discussion
This study explored the potential for individualised start items in the EORTC CAT Core. In brief, we found that cross-domain predictions can generate initial score estimates for individualised start item selection, improving measurement precision for short CATs (up to three items). When implemented in the EORTC CAT Core tool, users may consider applying this option in multi-domain assessments to improve early-stage measurement efficiency.
For all HRQoL domains, unbiased score estimates predicted from any other domain of the EORTC CAT Core could be obtained using simple linear regression. This is essential, as biased predictions can lead to start items that are either too ‘easy’ or too ‘difficult’ (i.e., informative for lower or higher domain levels, respectively), reducing relevance and measurement precision. While all predictions yielded unbiased score estimates, the accuracy varied considerably across domains, especially in terms of the proportion of predicted scores deviating < 5 points. However, the extremely low proportions observed, e.g., for diarrhoea predicted by constipation (1%) and constipation by diarrhoea (3%), mainly resulted from the specific (but arbitrary) choice of a 5-point cut to summarise performance. Had we used, e.g., a 7-point cut, the two aforementioned predictions would have had 64% and 69% of deviations below the cut, respectively. Across all domains, more than half of the scores were predicted within 10 points of the observed domain score regardless of the domain used for prediction, and for 10 of the 14 domains at least two-thirds of all predicted scores were within 10 points, i.e., within 1SD. This suggests that in most cases relevant and informative start items can be selected using cross-domain predictions.
The accuracy of cross-domain score prediction varied across HRQoL domains. In general, multifaceted domains like fatigue and role functioning performed well and predicted other domains more accurately than specific symptoms like diarrhoea and constipation. This likely reflects the underlying clinical interrelationships among the HRQoL domains. Fatigue and role functioning are central indicators of overall health and daily activity, both of which affect and/or are influenced by a wide range of physical and psychological factors and hence, tend to covary with multiple other HRQoL domains. In contrast, specific symptoms such as gastrointestinal complaints are more condition- or treatment-dependent and less consistently related to other symptoms and problems. Therefore, it seems likely that general domains provide stronger predictions of other HRQoL domains, a finding in agreement with inter-domain correlations both observed here and previously reported [19].
To optimise the practical use of individualised start items, our findings show that the sequence of domain assessments is important. Multi-domain CAT-assessments should begin with a general domain to optimise subsequent start item selection. Fatigue often represents a good starting point, as this will increase the likelihood of selecting informative start items for most other domains. Moreover, within the fatigue domain itself there appears to be limited benefit from using individualised start items, making it particularly suitable as a starting point. If fatigue is the primary outcome, one could, e.g., start with role functioning to allow for individualised start items for fatigue. An alternative would be to first assess global health/quality of life, as this only includes two items and hence does not benefit from individualised start items. Thereafter, all other domains for which CAT is available may apply individualised start items. Note that the more domains have been assessed, the more likely it is that an appropriate predictor for subsequent domains is available.
The CAT simulations showed that the choice of start item, individualised or fixed, had minimal impact on measurement precision when asking more than three items. This suggests that longer EORTC CAT assessments (i.e., four or more items) are robust to suboptimal (even poor) choices of start item; for such CATs, one does not need to be overly concerned about the choice of start item. For shorter CATs, however, the choice of start item often had an impact on measurement precision. Given the large number of domains covered by the EORTC CAT Core, such brief CATs are likely common in multi-domain assessments where limiting the assessment burden is essential. In such cases, individually selected start items appeared to enhance measurement precision, especially when the fixed start item did not align well with the domain score level. As most patient populations include individuals with both high and low scores, any fixed start item will typically match poorly to some patients. For these patients, our evaluations suggest that individualised start items selected based on cross-domain predicted scores may improve measurement precision. However, for most domains the improvement with individualised start items was limited. Across domains the largest improvement was observed for low scores, for which the average reliability was improved by 0.11. Still, when brevity and efficiency are key, e.g., when assessing multiple domains or in studies involving frail patient populations (e.g., palliative care), even small improvements may be valuable. Further, our evaluations did not indicate that individualised start items generally result in reduced measurement precision for any of the EORTC CAT Core domains. Hence, given the simplicity of the applied cross-domain prediction to individualise start items, it seems a relevant addition to the EORTC CAT Core tool. When implemented and released in a coming update of the tool, users may apply this new option.
The clinical significance of the observed improvements in measurement precision for short CATs remains to be determined. The relevance of estimated improvements likely depends on how CAT scores are used in practice, e.g., whether for group-level comparisons, individual monitoring, or clinical decision-making. Future studies could examine the extent to which enhanced start item information translates into clinically meaningful improvements in the estimation of symptoms and problems and thereby in improved patient management.
The large international and mixed sample of cancer patients used to evaluate cross-domain prediction is a clear strength of the study. It maximises the likelihood that the findings are applicable to assessment in cancer patients in general. Further, it allowed for splitting the data into a training and a testing set, thereby reducing the risk of overestimating the ability to predict scores. As such, the evaluations likely provide a realistic ‘average’ indication of how cross-domain score prediction works for the EORTC CAT Core. A potential limitation is that correlations among HRQoL domains may vary across patient subgroups, which could influence prediction accuracy in specific contexts. Given the diversity of the present sample, the sample correlations may reflect an overall average pattern, suggesting that the proposed approach should perform reasonably well in most situations. Future studies could, nonetheless, explore whether using subgroup-specific score prediction might further improve accuracy and the selection of individualised start items. It should also be noted that these findings may not necessarily transfer to other CAT tools, as this depends on several factors, not least the interrelationships among the domains of the specific tool.
Another area for possible future research concerns the method used for predicting scores. We applied simple linear regression models as we wanted an approach that can be applied as soon as one HRQoL domain has been assessed and that is straightforward to implement in practice. Whether more complex models, e.g., multiple or nonlinear regression, could further improve cross-domain prediction and, in turn, the selection of start items remains an open question.
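To make the general idea concrete, the approach described above can be sketched in a few lines of Python. Note that this is an illustrative sketch only, not the study's actual implementation: the training data, item locations, and the nearest-location selection rule are all hypothetical assumptions introduced for illustration.

```python
import numpy as np

# Hypothetical training data: scores on an already-assessed domain (x)
# and on the domain about to be assessed (y), simulated for illustration.
rng = np.random.default_rng(0)
x = rng.normal(50, 10, 500)
y = 0.6 * x + 20 + rng.normal(0, 8, 500)

# Fit a simple linear regression y = a + b*x (closed-form estimates).
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a = y.mean() - b * x.mean()

def predict(score_other_domain):
    """Cross-domain predicted score for the not-yet-assessed domain."""
    return a + b * score_other_domain

# Hypothetical item bank: each item's location on the score scale.
item_locations = np.array([30.0, 40.0, 50.0, 60.0, 70.0])

def start_item(predicted_score):
    """Select the item whose location is closest to the predicted score
    (one plausible selection rule; the actual tool may differ)."""
    return int(np.argmin(np.abs(item_locations - predicted_score)))
```

A patient's score on the first assessed domain is fed to `predict`, and the resulting estimate determines the individualised start item via `start_item`, after which the CAT proceeds with its usual adaptive item selection.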
We used simulations to assess the impact of individualised start items for CATs of varying lengths. Simulations have the clear advantage of offering complete control over settings and allowing variables of interest to be manipulated. For example, we had access to known true scores, which are not otherwise available; these were used to select individualised start items of varying relevance and served as the benchmark against which CAT-estimated scores were compared when evaluating start item impact. However, simulated data are not real-world data, and although they were generated from models estimated on real-world data, this does not guarantee that they mirror real-world behaviour. Therefore, our findings and conclusions regarding the use of cross-domain predicted scores to select individualised start items would be strengthened if confirmed in independent real-world data. Such validation studies are recommended.
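The evaluation logic described above, comparing estimates against known true scores, can be illustrated with a minimal sketch. The data below are entirely hypothetical: the two sets of 'CAT estimates' are simply true scores plus noise of different magnitudes, standing in for the estimation error under a fixed versus an individualised start item; the squared-correlation reliability measure is one common choice, not necessarily the study's exact metric.

```python
import numpy as np

rng = np.random.default_rng(1)

# Known true scores: available only in simulation, never in practice.
true_scores = rng.normal(50, 10, 1000)

# Hypothetical CAT estimates under two start-item strategies; the added
# noise stands in for estimation error (less noise = more precise CAT).
est_fixed = true_scores + rng.normal(0, 6.0, 1000)
est_individualised = true_scores + rng.normal(0, 5.0, 1000)

def reliability(est, true):
    """Squared correlation between estimates and known true scores."""
    return np.corrcoef(est, true)[0, 1] ** 2
```

With the known true scores as benchmark, `reliability(est_individualised, true_scores)` exceeds `reliability(est_fixed, true_scores)` in this toy setup, mirroring how a precision gain from individualised start items would surface in such a simulation.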
Using both real-world data and simulations, we assessed whether cross-domain predicted scores obtained through linear regression could be utilised to individualise start item selection in the EORTC CAT Core. The study demonstrated that scores from one domain can be used to obtain initial score estimates for other domains, and that individualised start items can enhance measurement precision for short CATs, particularly when assessing a domain with at most three items. For longer CATs, the impact of individualised start items appeared negligible. Nevertheless, given the simplicity and potential benefits for short CATs, cross-domain predicted start item selection is planned for implementation in an upcoming update of the EORTC CAT Core tool.
Declarations
Ethics approval and consent to participate
The present study is based on secondary analyses of data collected between 2008 and 2018 as part of the development and validation of the EORTC CAT Core instrument. Local ethics committees of the participating countries approved data collection, and written informed consent was obtained before study participation. All procedures involving human participants were in accordance with local ethical standards and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. The handling of data from the validation study was approved by the Danish Data Protection Agency (Ref. No. 2012-58-0004). Data from the original collections may be available from the EORTC upon reasonable request.
Competing interests
The authors have no conflicts of interest; specifically, the authors do not have any financial interests that are directly or indirectly related to the results of this work. The EORTC CAT Core and any CAT based on this are copyrighted, with all rights reserved by the EORTC Quality of Life Group. Academic use of EORTC instruments requires no fee. For commercial use, the EORTC requests a compensation fee.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.