
Evaluating the use of prior information to individualise start item selection for the EORTC CAT core

Abstract

Purpose

Computerized adaptive tests (CATs) provide individualised measurement, using score estimates based on the patient’s prior responses to select the next most informative item. However, as no score estimate is available at the outset, the start item is typically not individualised. The European Organisation for Research and Treatment of Cancer (EORTC) CAT Core covers 15 health-related quality of life (HRQoL) domains. We explored whether scores from one domain could be used to obtain initial score estimates and hence, individualised start items for another domain, thereby improving measurement.

Methods

For each HRQoL domain, we evaluated the ability to predict scores using each of the 14 other domains in a large international sample of cancer patients (N = 10,084). Using simulations, we compared the impact of individualised versus standard fixed start item on CAT measurement precision.

Results

Across domains, predicted scores were within one standard deviation of the observed score in 72–89% of the assessments (mean = 83%), with predicted–observed correlations ranging from 0.31 to 0.72 (mean = 0.55). The impact of individualised start items varied by domain and score level but typically improved reliability in the initial steps of CATs (first 3 items), particularly for patients with extreme scores.

Conclusion

Cross-domain predictions can be used to generate initial score estimates for individualised start item selection. Simulations suggested that individualised start items lead to improvements in measurement precision, particularly for short CATs (up to 3 items) and patients with extreme scores. Individualised start item selection based on cross-domain predicted scores is planned to be incorporated into the EORTC CAT Core toolbox.

Supplementary Information

The online version contains supplementary material available at https://doi.org/10.1007/s11136-025-04101-y.


Introduction

Computerized adaptive tests (CATs) are assessment tools that dynamically select items in real time from a bank of pre-calibrated items designed to measure a specific construct. CATs select and present items sequentially based on responses given so far. In each step, a score estimate is computed from the answered items and the next item is the one judged most informative based on the individual’s current score estimate. In this way, CAT tailors the set of asked items (the ‘questionnaire’) to the individual, thereby asking more relevant items and increasing the measurement precision compared to traditional, static questionnaires [1, 2]. Here and throughout the text, ‘relevant’ refers to the relevance of an item’s difficulty—that is, how well the item’s location on the scale matches the individual’s level of the construct being measured. Therefore, when we use the term ‘relevant’ it does not refer to the content relevance of the item.
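As an illustration of the item-selection step just described, the following minimal Python sketch selects the unanswered item with maximum Fisher information at the current score estimate. It uses a dichotomous two-parameter logistic (2PL) model and made-up item parameters purely for brevity; the EORTC CAT Core item banks are polytomous and calibrated separately, so this is a sketch of the general principle, not of the tool's actual algorithm.

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item with discrimination a and difficulty b at score theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

def select_next_item(theta_hat, bank, asked):
    """Return the index of the unanswered item that is most informative at theta_hat."""
    candidates = [i for i in range(len(bank)) if i not in asked]
    return max(candidates, key=lambda i: item_information_2pl(theta_hat, *bank[i]))

# Illustrative item bank: (discrimination, difficulty) pairs on a standard-normal metric.
bank = [(1.2, -1.0), (1.5, 0.0), (0.9, 0.5), (1.8, 1.2)]
print(select_next_item(theta_hat=0.3, bank=bank, asked={1}))  # next most informative item
```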
As score estimates are usually not available prior to the start of an assessment, it is customary to initiate CAT-assessments with the same start item for all individuals in a study, e.g., the item most informative at the expected mean of the target population. A concern, particularly for brief CATs, is that this may result in inefficient assessments or even biased scores for individuals deviating from the hypothesized mean (i.e., having poorer or better health), because the CAT requires more items to fully adapt to this ‘deviation’ from the mean. If a score estimate were available prior to the start of a CAT-assessment, the choice of start item could be adapted to the individual, as is done for the subsequent items of a CAT-assessment. As with individually adapted item selection in CAT in general, an individualised start item selection would be expected to make the start item more relevant and hence increase the efficiency and measurement precision of the CAT-assessment. To our knowledge, few studies have investigated the use of prior information in CAT for start item selection and/or score estimation [3–8]. Within the field of health-related quality of life (HRQoL), we identified only one study, which focused on using prior information to improve Bayesian score estimation [8]. These studies generally confirmed that use of prior information may improve CAT measurement efficiency and precision.
The EORTC CAT Core is a CAT instrument developed by the European Organisation for Research and Treatment of Cancer (EORTC) Quality of Life Group covering the same HRQoL domains as the EORTC QLQ-C30 questionnaire [9–12]. To explore the use of prior information to optimise assessment with the EORTC CAT Core, we evaluated the use of ‘cross-domain prediction’ to obtain initial score estimates that could be used to individualise start item selection. Cross-domain prediction is the use of the score estimate for one domain to predict/estimate the score for another domain. Cross-domain prediction can be conducted within an EORTC CAT Core assessment, i.e., no external information is required to obtain an initial score estimate. In a CAT-assessment covering multiple HRQoL domains, cross-domain prediction can be applied to obtain initial score estimates and hence, to individualise start items for all but the first domain assessed. This may be a simple and efficient way to optimise start item selection. We considered implementation of such individualised start item selection in the EORTC CAT Core to be advantageous if reasonably precise initial score estimates could be obtained with cross-domain prediction, and the individualised start items enhanced measurement precision.
To assess the potential for using cross-domain prediction to individualise start item selection for the EORTC CAT Core, we piloted the approach for the physical functioning domain [13]. The evaluation indicated that physical functioning scores often can be predicted with reasonable precision (within 1SD) from another domain and that individualised start items likely increase precision of physical functioning score estimates for the initial steps of a CAT. Here, we report the evaluations of using cross-domain prediction to individualise start item selection for the 14 EORTC CAT Core symptom and functional domains (including a summary of the previously evaluated physical functioning [13]). Hence, the aims were to evaluate: (1) the precision of predicting scores on one HRQoL domain from scores on another domain (cross-domain prediction), and (2) whether individually selected start items can improve measurement precision with the EORTC CAT Core.

Methods

The analyses presented here are similar to those conducted for the physical functioning domain [13]. The following provides a summary of the data and analyses; further details can be found in the previous publication [13].

The EORTC CAT core

The EORTC CAT Core includes 14 CAT item banks for the five functional and nine symptom domains of the QLQ-C30 questionnaire [11]. In addition, the EORTC CAT Core includes the two QLQ-C30 items on global health/quality of life [14]. The CAT item banks comprise between 7 and 34 items (260 items in total) and include the items of the QLQ-C30 (see Table 1). The EORTC CAT Core measures apply so-called T-score metrics, scaled so that the European general population has a mean of 50 and a standard deviation of 10 for all domains [15]. Hence, scores > 50 reflect better functioning or worse symptoms than the European general population average. This T-scoring is applied and reported in the current study.
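As a worked illustration of this metric (an assumption about the exact form of the transformation, not a quote of the published scoring algorithm): if a latent score estimate is expressed on a scale on which the European general population is standard normal, the corresponding T-score is simply 50 + 10 × theta.

```python
def to_t_score(theta):
    """Convert a latent score on the general-population standard-normal metric to a T-score."""
    return 50.0 + 10.0 * theta

print(to_t_score(-0.5))  # 45.0, i.e. half a standard deviation below the general population mean
```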
Table 1
EORTC CAT core item banks and number of items in each bank
Item bank domain: # items in bank
Physical functioning: 31
Role functioning: 10
Emotional functioning: 24
Cognitive functioning: 34
Social functioning: 13
Fatigue: 34
Nausea & vomiting: 19
Pain: 16
Dyspnoea: 32
Insomnia: 8
Lack of appetite: 7
Constipation: 10
Diarrhoea: 13
Financial difficulties: 9
Global health/quality of life: 2*
*Additional items have not been developed for the global health/quality of life domain and CAT is not applied for this domain [14]. Here the domain is used for predicting other domains only

Domain score prediction

Sample

Data from the development and validation of the EORTC CAT Core were combined into a single dataset comprising 10,084 cancer patient assessments, with T-score estimates for all 15 domains [11, 12, 15]. To reduce the risk of overestimating the performance of the prediction, the sample was randomly split into a training dataset (80%) for fitting the prediction models and a testing dataset (20%) for evaluating the prediction [16].
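A minimal sketch of such an 80/20 split; the file name and column layout below are hypothetical, and in practice any reproducible random split serves the same purpose.

```python
import pandas as pd

# Hypothetical file: one row per assessment, one column of T-scores per domain.
scores = pd.read_csv("eortc_cat_core_t_scores.csv")

train = scores.sample(frac=0.8, random_state=1)  # 80% training set for fitting the prediction models
test = scores.drop(train.index)                  # remaining 20% for evaluating the predictions
```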

Prediction of domain score

For each of the 14 domains applicable to CAT, prediction models were estimated using linear regression in the training dataset—one model for each of the other domains, including global health/quality of life. We used prediction based on the observed score for one domain at a time, as this could be applied in a CAT assessment as soon as one EORTC CAT Core domain had been assessed. As a simple summary of the pairwise relationships between domains, we calculated Pearson correlations for all domain pairs.
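A sketch of one such pairwise model, here predicting fatigue from physical functioning with ordinary least squares; the synthetic data stand in for the training dataset, and the strength and direction of the relationship shown are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data: physical functioning (predictor) and fatigue (target) T-scores.
pf = rng.normal(50.0, 10.0, size=1000)
fa = 100.0 - 0.6 * pf + rng.normal(0.0, 8.0, size=1000)

model = LinearRegression().fit(pf.reshape(-1, 1), fa)

# Initial fatigue estimate for a patient whose physical functioning assessment gave a T-score of 42.
initial_fa_estimate = model.predict(np.array([[42.0]]))[0]
print(round(initial_fa_estimate, 1))
```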

Evaluations of domain score predictions

All evaluations were conducted on the testing dataset. For each domain, the 14 predicted scores, one for each of the other domains, were calculated and compared to the observed score for that domain. For each of these comparisons we estimated the mean difference and the Pearson correlation between predicted and observed scores, as well as the percentage of predictions deviating < 5 points (= ½SD) and < 10 points (= 1SD), respectively, from the observed score.
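These comparison statistics reduce to a few lines; a sketch of how they could be computed for one pair of predicted and observed score vectors:

```python
import numpy as np

def prediction_summary(predicted, observed):
    """Mean difference, Pearson correlation, and percentage within 1/2 SD (5 points) and 1 SD (10 points)."""
    predicted, observed = np.asarray(predicted, float), np.asarray(observed, float)
    diff = predicted - observed
    return {
        "mean_difference": diff.mean(),
        "correlation": np.corrcoef(predicted, observed)[0, 1],
        "pct_within_5": 100.0 * np.mean(np.abs(diff) < 5.0),
        "pct_within_10": 100.0 * np.mean(np.abs(diff) < 10.0),
    }

print(prediction_summary([48.0, 55.0, 61.0], [50.0, 52.0, 70.0]))
```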

Impact of individualised start item selection

Simulation sample

We conducted a series of Monte Carlo CAT simulations to evaluate the impact of individualised start items versus the standard use of fixed start items. For each HRQoL domain and for each score in the range M ± 30 (= 3SD) with increments of 0.5, 200 sets of responses to all items in the CAT item bank were simulated. Here M was the mean of the domain in the sample from the original development [11]. For example, the fatigue item bank includes 34 items and the mean fatigue score of the development sample was 51. Therefore, we simulated 200 sets of responses to the 34 fatigue items for each of the fatigue scores 21, 21.5, …, 81. Based on these response sets, CAT-assessments asking 1 to 10 items (or all items for banks with < 10 items), respectively, were simulated.
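A sketch of this response-generation step, again simplified to dichotomous 2PL items (the actual banks are polytomous); the true scores run over the T-score grid described above and are mapped to a standard-normal latent metric via theta = (T − 50)/10, an assumption made here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_response_sets(theta, bank, n_sets=200):
    """Simulate n_sets dichotomous response vectors to every item in the bank at true score theta."""
    a = np.array([item[0] for item in bank])
    b = np.array([item[1] for item in bank])
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))           # 2PL endorsement probabilities
    return (rng.random((n_sets, len(bank))) < p).astype(int)

# Illustrative bank and grid of true scores: sample mean 51 +/- 30 T-score points in 0.5 steps.
bank = [(1.2, -1.0), (1.5, 0.0), (0.9, 0.5), (1.8, 1.2)]
t_scores = np.arange(21.0, 81.5, 0.5)
simulated = {t: simulate_response_sets((t - 50.0) / 10.0, bank) for t in t_scores}
```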
The following four CAT start item variants were evaluated: one fixed start item and three individualised start items.
  • ‘Fixed’: This fixed start item design initiated all CATs with the item most informative at the mean of the development sample (M). This imitates the current EORTC CAT approach, where all respondents are presented with the same start item based on knowledge about a target population.
  • ‘True’: This individualised start item design initiated the CATs with the item most informative at the individual’s ‘true’, i.e., sampled, score.
  • ‘Diff ½SD’: This individualised start item design initiated the CATs with the item most informative ½SD either below or above (half of the simulations each) the individual’s true score.
  • ‘Diff 1SD’: similar to ‘Diff ½SD’ but starting with the item most informative 1SD from the individual’s true score.
Among the three individual designs, ‘true’ reflected the ideal situation where the score has been predicted perfectly while ‘Diff ½SD’ and ‘Diff 1SD’ reflected situations where the predicted score deviates from the true score.
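The four variants amount to choosing the score at which the start item should be maximally informative; a sketch under the same simplified 2PL setup as above (½SD and 1SD correspond to 0.5 and 1.0 on the standard-normal latent metric used here):

```python
import numpy as np

def most_informative_item(score, bank):
    """Index of the bank item with maximum 2PL information at the given latent score."""
    a = np.array([item[0] for item in bank])
    b = np.array([item[1] for item in bank])
    p = 1.0 / (1.0 + np.exp(-a * (score - b)))
    return int(np.argmax(a ** 2 * p * (1.0 - p)))

def start_item(variant, true_theta, sample_mean_theta, bank, rng):
    """Start item under the 'Fixed', 'True', 'Diff 1/2 SD', and 'Diff 1 SD' designs."""
    if variant == "fixed":
        return most_informative_item(sample_mean_theta, bank)
    offset = {"true": 0.0, "diff_half_sd": 0.5, "diff_1sd": 1.0}[variant]
    sign = rng.choice([-1.0, 1.0])        # offset below or above the true score, half of the runs each
    return most_informative_item(true_theta + sign * offset, bank)

bank = [(1.2, -1.0), (1.5, 0.0), (0.9, 0.5), (1.8, 1.2)]
print(start_item("diff_half_sd", true_theta=1.0, sample_mean_theta=0.1, bank=bank, rng=np.random.default_rng(7)))
```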
The above evaluations are of fixed-length CATs, i.e., CATs stopping after a predefined number of items. This is the most common application of the EORTC CAT Core instrument. An alternative approach, widely used in CAT more broadly, is variable-length CATs aiming for a specific level of precision (standard error or, equivalently, reliability). The impact of individualised start items for variable-length CATs may be inferred approximately from these evaluations, simply by finding the average number of items required to obtain a specific level of precision. For example, if the evaluations above showed that four items resulted in an average reliability of 0.87 and five items in 0.92, then the average number of items asked in a variable-length CAT aiming for a reliability of 0.90 can be expected to be between four and five items. For a more elaborate illustration related to variable-length CATs, we evaluated the impact of the start item for CAT administration stopping once the reliability exceeded 0.90 (corresponding to SE ≤ 0.33 on a standard normal metric) or after a maximum of 10 items.
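For completeness, this variable-length stopping rule can be sketched as a check applied after each administered item; the reliability criterion follows the formula given in the next subsection, with the standard error assumed to be on a standard-normal latent metric.

```python
def should_stop(standard_error, items_asked, max_items=10, target_reliability=0.90, sd=1.0):
    """Stop once estimated reliability (1 - SE^2 / SD^2) reaches the target or max_items is reached."""
    reliability = 1.0 - standard_error ** 2 / sd ** 2
    return reliability >= target_reliability or items_asked >= max_items

print(should_stop(standard_error=0.30, items_asked=4))  # True: reliability 0.91 exceeds the 0.90 target
```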

Evaluations of individualised start items

To assess the potential impact of individualised start items for measurement precision we compared CAT score estimates, using each of the four types of start items, to the true score. We anticipated the ‘fixed’ CATs would perform well near the mean (at which the start item was most informative) but less so for more extreme scores. Therefore, the evaluations were divided into three sections: Low scores (3SD to 1SD below the mean), middle scores (1SD below to 1SD above the mean), and high scores (1SD to 3SD above the mean). For each of the three score sections we summarised measurement precision of the CATs by plotting the following:
  • The mean difference between the CAT based scores and the true scores.
  • The percentage of differences < 5 points (½SD) between the CAT scores and the true scores.
  • The mean reliability obtained with the CATs with the four types of start item. Following [17], the reliability was estimated as 1 − SE²/SD², where the standard error SE was estimated from the information function as SE² = 1/information [18] (sketched below).
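An illustrative Python sketch of this reliability computation (the actual analyses were run in SAS, see below); test information is taken as the sum of item information over the administered items, again using 2PL items for brevity while the EORTC banks use a polytomous model [18].

```python
import numpy as np

def cat_reliability(theta_hat, asked, bank, sd=1.0):
    """Reliability 1 - SE^2/SD^2 with SE^2 = 1/information at the estimated score."""
    a = np.array([bank[i][0] for i in asked])
    b = np.array([bank[i][1] for i in asked])
    p = 1.0 / (1.0 + np.exp(-a * (theta_hat - b)))
    information = np.sum(a ** 2 * p * (1.0 - p))   # test information = sum of item information
    return 1.0 - (1.0 / information) / sd ** 2

bank = [(1.2, -1.0), (1.5, 0.0), (0.9, 0.5), (1.8, 1.2)]
print(round(cat_reliability(theta_hat=0.2, asked=[0, 1, 3], bank=bank), 2))
```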
All analyses and simulations were conducted using SAS Enterprise Guide software version 7.15.

Results

Domain score prediction

Clinical and sociodemographic characteristics of the N = 10,084 patients in the total sample are shown in Table 2. This was a mixed sample of cancer patients representing 12 countries and a variety of cancer diagnoses and treatments. The training dataset included a random subsample of N = 8068 and the testing dataset included the remaining N = 2016.
Table 2
Clinical and sociodemographic characteristics of the N = 10,084 patients in the sample for evaluating domain score prediction (previously published in [13])
Age, mean years (SD): 60.4 (13.1) (n = 10,025; missing: 59)
Sex: Female 5437 (54.2%); Male 4599 (45.8%); missing 48
Country: Australia 114 (1.1%); Austria 374 (3.7%); Denmark 2043 (20.3%); France 1010 (10.0%); Germany 300 (3.0%); Italy 397 (3.9%); Poland 824 (8.2%); Spain 407 (4.0%); Sweden 282 (2.8%); Taiwan 403 (4.0%); The Netherlands 129 (1.3%); UK 3801 (37.7%); missing 0
Cancer stage: I-II 4676 (50.3%); III-IV 4616 (49.7%); missing 792
Cancer site: Breast 2088 (21.4%); Gastrointestinal 1370 (14.1%); Gynaecological 1164 (12.0%); Head and neck 1097 (11.3%); Lung 577 (5.9%); Urogenital 1523 (15.6%); Other 1911 (19.6%); missing 347
Current treatment: Chemotherapy 4268 (43.1%); Other treatment 2016 (20.4%); No current treatment 3612 (36.5%); missing 188
Employment status: Retired 4788 (48.4%); Working 3500 (35.4%); Other 1601 (16.2%); missing 195
Cohabitation: Living with a partner 7368 (74.5%); Living alone 2517 (25.5%); missing 199
Correlations between the 15 HRQoL domains ranged from 0.03 to 0.71 in absolute value, with a median of 0.37 (see Table 3). Fatigue showed the strongest correlation with most domains, although quality of life, physical functioning, and role functioning were also highly correlated with many domains.
Table 3
Between-domain correlations
      PF     RF     EF     CF     SF     FA     NV     PA     DY     SL     AP     CO     DI     FI
RF    0.70
EF    0.45   0.50
CF    0.46   0.46   0.54
SF    0.56   0.67   0.56   0.47
FA   −0.70  −0.71  −0.58  −0.55  −0.63
NV   −0.37  −0.42  −0.39  −0.34  −0.41   0.49
PA   −0.56  −0.57  −0.48  −0.43  −0.50   0.59   0.40
DY   −0.53  −0.45  −0.35  −0.35  −0.38   0.50   0.28   0.36
SL   −0.36  −0.37  −0.44  −0.37  −0.37   0.46   0.30   0.40   0.29
AP   −0.45  −0.49  −0.43  −0.36  −0.47   0.56   0.53   0.44   0.31   0.34
CO   −0.30  −0.28  −0.26  −0.26  −0.29   0.34   0.29   0.30   0.22   0.24   0.32
DI   −0.20  −0.23  −0.20  −0.19  −0.23   0.27   0.25   0.22   0.14   0.18   0.21   0.03
FI   −0.30  −0.34  −0.36  −0.34  −0.41   0.35   0.27   0.33   0.25   0.27   0.24   0.17   0.14
QL    0.61   0.64   0.57   0.47   0.62  −0.69  −0.43  −0.55  −0.42  −0.38  −0.52  −0.30  −0.23  −0.32
Domain abbreviations: AP: lack of appetite; CF: Cognitive functioning; CO: constipation; DI: diarrhoea; DY: dyspnoea; EF: emotional functioning; FA: fatigue; FI: financial difficulties; NV: nausea & vomiting; PA: pain; QL: global health/quality of life; RF: role functioning; SF: social functioning; SL: insomnia/sleeplessness
All correlations are significant at p < 0.001
Table 4 summarises the comparisons between cross-domain predicted T-scores and observed T-scores, highlighting both the accuracy and variability of cross-domain predictions. For each domain, the comparisons are summarised across the 14 predictions based on each of the other domains. The table reports the minimum, maximum, and mean of the following summary statistics when comparing predicted and observed scores: mean difference, correlation, and the percentage of cases with absolute differences < 5 and < 10 points, respectively. For each summary statistic the table also shows the domain resulting in the minimum and maximum values, i.e., the domains providing the best and worst cross-domain prediction according to the specific statistic.
Table 4
Summary of the evaluations comparing predicted domain scores with observed scores
Each entry gives min / max / mean across the 14 predictor domains, in the order: mean difference; correlation; percent deviating < 5 points; percent deviating < 10 points.
PF: 0.1 (RF) / 0.5 (QL) / 0.3; 0.25 (DI) / 0.71 (RF) / 0.48; 33 (CF) / 50 (FA) / 38; 57 (NV) / 85 (RF) / 69
RF: 0.1 (PF) / 0.5 (QL) / 0.2; 0.27 (DI) / 0.72 (FA) / 0.50; 21 (DI) / 57 (SF) / 34; 65 (SL) / 82 (FA) / 75
EF: −0.3 (RF) / 0.1 (QL) / −0.1; 0.22 (DI) / 0.60 (FA) / 0.44; 34 (DI) / 54 (FA) / 41; 60 (NV) / 82 (FA) / 74
CF: −0.1 (PF) / 0.1 (QL) / 0.0; 0.23 (DI) / 0.55 (FA) / 0.40; 33 (CO) / 57 (PA) / 42; 77 (DY) / 86 (FA) / 81
SF: −0.1 (RF) / 0.3 (QL) / 0.1; 0.26 (DI) / 0.67 (RF) / 0.47; 26 (CO) / 61 (RF) / 38; 71 (SL) / 87 (FA) / 80
FA: −0.2 (QL) / 0.2 (PF) / 0.1; 0.31 (DI) / 0.72 (RF) / 0.53; 41 (CO) / 54 (QL) / 47; 67 (DI) / 89 (RF) / 75
NV: 0.1 (QL) / 0.4 (RF) / 0.3; 0.25 (DY) / 0.54 (AP) / 0.36; 38 (CF) / 59 (AP) / 45; 68 (AP) / 80 (DI) / 74
PA: −0.2 (QL) / 0.2 (PF) / 0.0; 0.24 (DI) / 0.60 (RF) / 0.44; 23 (CO) / 57 (RF) / 37; 73 (CO) / 84 (FA) / 77
DY: −0.4 (QL) / −0.1 (PF) / −0.3; 0.16 (DI) / 0.53 (PF) / 0.34; 6 (DI) / 41 (PF) / 25; 67 (SL) / 80 (AP) / 75
SL: −0.1 (QL) / 0.1 (PF) / 0.0; 0.22 (CO) / 0.46 (FA) / 0.34; 33 (DI) / 57 (PA) / 42; 73 (FI) / 87 (DI) / 81
AP: 0.1 (QL) / 0.4 (RF) / 0.3; 0.24 (FI) / 0.56 (FA) / 0.41; 4 (FI) / 44 (PA) / 27; 53 (DY) / 72 (QL) / 66
CO: −0.1 (QL) / 0.1 (PF) / 0.0; 0.08 (DI) / 0.33 (FA) / 0.26; 3 (DI) / 52 (NV) / 36; 78 (FA) / 82 (NV) / 80
DI: −0.2 (QL) / 0.0 (RF) / −0.1; 0.08 (CO) / 0.31 (FA) / 0.23; 1 (CO) / 49 (NV) / 32; 74 (CO) / 82 (SL) / 78
FI: 0.1 (QL) / 0.3 (RF) / 0.2; 0.15 (CO) / 0.41 (SF) / 0.29; 35 (SL) / 54 (NV) / 43; 76 (AP) / 87 (DI) / 81
Mean: −0.1 / 0.2 / 0.1; 0.21 / 0.55 / 0.39; 24 / 53 / 38; 69 / 83 / 76
The table presents mean differences, correlations, and percentage of differences < 5 points and < 10 points, respectively, between the predicted and observed scores. Domain abbreviations in brackets are the domains with min/max value
Domain abbreviations: AP: lack of appetite; CF: Cognitive functioning; CO: constipation; DI: diarrhoea; DY: dyspnoea; EF: emotional functioning; FA: fatigue; FI: financial difficulties; NV: nausea & vomiting; PA: pain; QL: overall health/quality of life; RF: role functioning; SF: social functioning; SL: insomnia/sleeplessness
The mean difference between predicted and observed scores was trivial (≤ 0.5 points) for all HRQoL domains, regardless of which domain was used to predict the score, suggesting that cross-domain prediction resulted in unbiased estimation of scores. The three other summary statistics showed considerably more variation, both across the domains being predicted and across the domains used to predict. Across the domains, the highest obtained correlation between predicted and observed scores ranged from 0.31 (Diarrhoea) to 0.72 (Fatigue and Role functioning), with an average of 0.55.
Predicted scores were within 5 points of the observed score for 41% (Dyspnoea) to 61% (Social functioning) of the patients across the domains when using the best predictor. For all domains except lack of appetite, it was possible to predict scores within 10 points of the observed value for at least 80% of the patients, and on average across domains for 83%. Even when using the ‘poorest’ predictor for each domain, more than two-thirds of the predicted scores (69%) were within 10 points of the observed score.
In summary, while prediction accuracy varied across domains, the overall performance was satisfactory, with generally unbiased estimates and acceptable precision for most HRQoL domains.

Impact of individualised start item selection

The simulation results summarising the effects of the different start items are presented in Supplementary material 1 and Table 5. Plots for each of the 14 domains (Supplementary material 1) show mean deviations of CAT-estimated domain scores from true scores, the percentage of CAT scores deviating < 5 points, and the mean reliability for each of the four types of start item evaluated. Except for low pain scores, there were only trivial differences across the types of start item when asking four or more items in the CATs. For low pain scores, a fixed start item resulted in markedly lower reliability (0.2–0.4 lower) than individualised start items for CATs asking 1–4 items; when asking five or more items, differences were trivial. Given the trivial impact of the start item for longer CATs, the following focuses on CATs asking 1–3 items.
As a summary, Table 5 shows the mean differences for CATs asking 1–3 items between CATs started with a fixed start item and CATs started with the item most informative five points from the individual’s true score (‘Diff ½SD’). The ‘Diff ½SD’ design was chosen for this summary as it reflects intermediate (‘average’) performance among the three types of individualised start items evaluated. Differences in mean deviations from the true score were generally trivial: except for low pain scores, all differences in mean deviations were ≤ 1 and ranged from −0.4 to 0.1 on average across domains. As expected, differences were generally small for the middle interval close to the sample mean score, where the fixed start item provided maximum information. For the middle interval, the fixed start item resulted, on average across domains, in the same reliability as the individualised start item and marginally more (2%) estimated scores within 5 points of the true score. For high scores, the largest difference was observed for emotional functioning, for which the individualised start item resulted in 4% more estimated scores within five points of the true score and an increase in reliability of 0.11 compared to the fixed start item. On average across domains, the proportion of scores deviating < 5 points was similar (difference = 0.8%), but the individualised start item resulted in 0.05 higher reliability. The largest differences between fixed and individualised start items were observed for low scores, i.e., scores > 1SD below the mean. Here, on average across domains, 4% more score estimates were within 5 points of the true score when using the individualised start item, with the largest difference for insomnia (12% more). The reliability was up to 0.33 higher when using the individualised start item compared to the fixed start item (pain), with an average increase in reliability across domains of 0.11.
Table 5
Summary of the simulations of individualised start items vs. fixed start item for CATs asking 1–3 items
Each entry gives the value for low / middle / high scores.
Physical functioning: mean deviation −1.0 / 0.5 / −0.2; % dev. < 5 points 11.0 / −1.2 / 1.1; reliability 0.26 / 0.00 / 0.05
Role functioning: mean deviation −0.6 / 0.0 / 0.0; % dev. < 5 points 5.5 / 0.4 / 0.1; reliability 0.10 / 0.00 / 0.03
Emotional functioning: mean deviation 0.1 / 0.0 / −1.0; % dev. < 5 points 2.5 / −5.5 / 4.2; reliability 0.05 / 0.01 / 0.11
Cognitive functioning: mean deviation −0.5 / 0.1 / 0.0; % dev. < 5 points 5.3 / −2.4 / 0.2; reliability 0.04 / 0.00 / 0.04
Social functioning: mean deviation −0.4 / 0.2 / 0.5; % dev. < 5 points 5.1 / −1.5 / 0.0; reliability 0.10 / 0.00 / 0.00
Fatigue: mean deviation −0.6 / −0.1 / −0.2; % dev. < 5 points 2.6 / −0.2 / 0.6; reliability 0.11 / 0.00 / 0.09
Nausea & vomiting: mean deviation −0.7 / 0.0 / −0.1; % dev. < 5 points 4.4 / −0.3 / 2.1; reliability 0.19 / 0.00 / 0.09
Pain: mean deviation −2.5 / 0.0 / −0.2; % dev. < 5 points 6.6 / −2.5 / 1.7; reliability 0.33 / −0.01 / 0.09
Dyspnoea (DY): mean deviation 0.2 / 0.3 / 0.5; % dev. < 5 points 0.0 / −2.4 / 2.6; reliability 0.00 / −0.01 / 0.04
Insomnia (SL): mean deviation −0.8 / −0.2 / 0.6; % dev. < 5 points 11.9 / −0.6 / −2.9; reliability 0.27 / 0.00 / 0.01
Lack of appetite (AP): mean deviation 0.1 / 0.1 / 0.3; % dev. < 5 points −1.7 / −1.3 / −2.5; reliability 0.01 / 0.00 / 0.01
Constipation (CO): mean deviation 0.5 / 0.1 / 0.0; % dev. < 5 points −1.0 / −2.1 / 1.3; reliability 0.06 / 0.00 / 0.06
Diarrhoea (DI): mean deviation 0.2 / 0.4 / 0.4; % dev. < 5 points 0.0 / −3.7 / 0.4; reliability 0.00 / −0.01 / 0.04
Financial difficulties (FI): mean deviation 0.5 / 0.0 / 0.2; % dev. < 5 points −0.6 / −5.4 / 2.9; reliability 0.04 / −0.01 / 0.07
Mean across domains: mean deviation −0.4 / 0.1 / 0.1; % dev. < 5 points 4.1 / −2.1 / 0.8; reliability 0.11 / −0.01 / 0.05
The table presents differences in mean deviations from the true score, percent deviating < 5 from the true score, and reliability between CATs started with the item most informative five points from the individual’s true score and CATs started with a fixed start item
‘Mean deviation’ <0 reflects that CATs applying individualised start items were on average closer to the true score; ‘% dev. <5 points’ >0 reflects that CATs applying individualised start items resulted in more scores within 5 points of the true score; ‘reliability’ >0 reflects that CATs applying individualised start items resulted in higher reliability than CATs applying fixed start item
In summary, the benefits of individualised start items were most apparent for short CATs (1–3 items) and for patients with low scores, while differences were generally negligible for longer CATs and mid-range scores.
The evaluations of start item impact for variable-length CATs stopping once reliability ≥ 0.90 are summarised in Table S1 in Supplementary material 1. For low physical functioning scores, using individualised start items reduced the average number of items administered by up to 1.4. Otherwise, the impact of the start item was minimal, with differences in the average number of items not exceeding 0.7.

Discussion

This study explored the potential for individualised start items in the EORTC CAT Core. In brief, we found that cross-domain predictions can generate initial score estimates for individualised start item selection, improving measurement precision for short CATs (up to three items). When implemented in the EORTC CAT Core tool, users may consider applying this option in multi-domain assessments to improve early-stage measurement efficiency.
For all HRQoL domains, unbiased score estimates predicted from any other domain of the EORTC CAT Core could be obtained using simple linear regression. This is essential, as biased predictions can lead to start items that are either too ‘easy’ or too ‘difficult’ (i.e., informative for lower or higher domain levels, respectively), reducing relevance and measurement precision. While all predictions yielded unbiased score estimates, the accuracy varied considerably across domains, especially in terms of the proportion of predicted scores deviating < 5 points. However, the extremely low proportions observed, e.g., for diarrhoea predicted by constipation (1%) and constipation predicted by diarrhoea (3%), mainly resulted from the specific (but arbitrary) choice of a 5-point cut to summarise performance. Had we used, e.g., a 7-point cut, 64% and 69% of the two aforementioned predictions, respectively, would have fallen below the cut. Across all domains, more than half of the scores were predicted within 10 points of the observed domain score, regardless of the domain used for prediction, and for 10 of the 14 domains at least two-thirds of all predicted scores were within 10 points, i.e., within 1SD. This suggests that in most cases relevant and informative start items can be selected using cross-domain predictions.
The accuracy of cross-domain score prediction varied across HRQoL domains. In general, multifaceted domains like fatigue and role functioning performed well and predicted other domains more accurately than specific symptoms like diarrhoea and constipation. This likely reflects the underlying clinical interrelationships among the HRQoL domains. Fatigue and role functioning are central indicators of overall health and daily activity, both of which affect and/or are influenced by a wide range of physical and psychological factors and hence, tend to covary with multiple other HRQoL domains. In contrast, specific symptoms such as gastrointestinal complaints are more condition- or treatment-dependent and less consistently related to other symptoms and problems. Therefore, it seems likely that general domains provide stronger predictions of other HRQoL domains, a finding in agreement with inter-domain correlations both observed here and previously reported [19].
To optimise the practical use of individualised start items, our findings show that the sequence of domain assessments is important. Multi-domain CAT-assessments should begin with a general domain to optimise subsequent start item selection. Fatigue often represents a good starting point, as this will increase the likelihood of selecting informative start items for most other domains. Moreover, within the fatigue domain itself there appears to be limited benefit from using individualised start items, making it particularly suitable as a starting point. If fatigue is the primary outcome, one could, e.g., start with role functioning to allow for individualised start items for fatigue. An alternative would be to first assess global health/quality of life, as this domain only includes two items and hence does not benefit from individualised start items. Then all other domains for which CAT is available may apply individualised start items. Note that the more domains have been assessed, the more likely it is that an appropriate predictor for subsequent domains is available.
The CAT simulations showed that the choice of start item, individualised or fixed, had minimal impact on measurement precision when asking more than three items. This suggests that longer EORTC CAT assessments (i.e., four or more items) are robust to suboptimal (even poor) choices of start item – for such CATs, one does not need to be overly concerned about the choice of start item. For shorter CATs, however, the choice of start item often had an impact on the measurement precision. Given the large number of domains covered by the EORTC CAT Core, such brief CATs are likely common in multi-domain assessments where limiting the assessment burden is essential. In such cases, individually selected start items appeared to enhance measurement precision, especially when the fixed start item did not align well with the domain score level. As most patient populations include individuals with both high and low scores, any fixed start item will typically match poorly to some patients. For these patients, our evaluations suggest that individualised start items selected based on cross-domain predicted scores may improve measurement precision. However, for most domains the improvement with individualised start items was limited. Across domains, the largest improvement was observed for low scores, for which the average reliability was improved by 0.11. Still, when brevity and efficiency are key, e.g., when assessing multiple domains or in studies involving frail patient populations (e.g., palliative care), even small improvements may be valuable. Further, our evaluations did not indicate that individualised start items generally result in reduced measurement precision for any of the EORTC CAT Core domains. Hence, given the simplicity of the applied cross-domain prediction to individualise start items, it seems a relevant addition to the EORTC CAT Core tool. When implemented and released in a coming update of the tool, users may apply this new option.
The clinical significance of the observed improvements in measurement precision for short CATs remains to be determined. The relevance of estimated improvements likely depends on how CAT scores are used in practice, e.g., whether for group-level comparisons, individual monitoring, or clinical decision-making. Future studies could examine the extent to which enhanced start item information translates into clinically meaningful improvements in the estimation of symptoms and problems and thereby in improved patient management.
The large international and mixed sample of cancer patients used to evaluate cross-domain prediction is a clear strength of the study. It maximises the likelihood that the findings are applicable to assessment in cancer patients in general. Further, it allowed for splitting the data into a training and a testing set, thereby reducing the risk of overestimating the ability to predict scores. As such, the evaluations likely provide a realistic ‘average’ indication of how cross-domain score prediction works for the EORTC CAT Core. A potential limitation is that correlations among HRQoL domains may vary across patient subgroups, which could influence prediction accuracy in specific contexts. Given the diversity of the present sample, the sample correlations may reflect an overall average pattern, suggesting that the proposed approach should perform reasonably well in most situations. Future studies could, nonetheless, explore whether using subgroup-specific score prediction might further improve accuracy and the selection of individualised start items. It should also be noted that these findings may not necessarily transfer to other CAT tools, as this depends on several factors, not least the interrelationship among the domains of the specific tool.
Another area for possible future research concerns the method used for predicting scores. We applied simple linear regression models as we wanted an approach that can be applied as soon as one HRQoL domain has been assessed and which is straightforward to implement in practice. Whether more complex models, e.g., multiple or nonlinear regression, could further improve cross-domain prediction and, in turn, the selection of start items remains an open question.
We used simulations to assess the impact of individualised start items for CATs of varying lengths. Simulations have the clear advantage that they offer complete control over variables and settings and allow for manipulating the variables of interest. For example, we had a known true score, which is not otherwise available; this was used to select individualised start items of varying relevance and served as the benchmark against which the CAT-estimated scores were compared when evaluating start item impact. However, simulated data are not real-world data, and although the data were generated from models estimated on real-world data, this does not guarantee that the simulated data mirror real-world behaviour. Therefore, our findings and conclusions on using cross-domain predicted scores to select individualised start items would be strengthened if they were confirmed in independent real-world data. Such validation studies are recommended.
Using both real-world data and simulations, we assessed whether cross-domain predicted scores obtained through linear regression could be utilised to individualise start item selection in the EORTC CAT Core. The study demonstrated that scores from one domain can be used to obtain initial score estimates for other domains, and that individual start items can enhance measurement precision for short CATs, particularly when assessing a domain with at most three items. For longer CATs the impact of individual start items appeared negligible. Nevertheless, given the simplicity and potential benefits for short CATs, cross-domain predicted start item selection is planned to be implemented in a coming update of the EORTC CAT Core tool.

Declarations

The present study is based on secondary analyses of data collected between 2008 and 2018 as part of the development and validation of the EORTC CAT Core instrument. Local ethics committees of the participating countries approved the data collection, and written informed consent was obtained before study participation. All procedures performed involving human participants were in accordance with the local ethical standards and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. Data collection for the validation study was approved by the Danish Data Protection Agency, Ref. No.: 2012-58-0004. Data from the original collections may be available from the EORTC upon reasonable request.

Competing interests

The authors have no conflicts of interest; specifically, the authors do not have any financial interests that are directly or indirectly related to the results of this work. The EORTC CAT Core and any CAT based on this are copyrighted, with all rights reserved by the EORTC Quality of Life Group. Academic use of EORTC instruments requires no fee. For commercial use, the EORTC requests a compensation fee.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Title
Evaluating the use of prior information to individualise start item selection for the EORTC CAT core
Authors
Morten Aagaard Petersen
Hugo Vachon
Johannes M. Giesinger
Mogens Groenvold
On Behalf of the European Organisation for Research and Treatment of Cancer (EORTC) Quality of Life Group
Publication date
01-01-2026
Publisher
Springer International Publishing
Published in
Quality of Life Research / Issue 1/2026
Print ISSN: 0962-9343
Electronic ISSN: 1573-2649
DOI
https://doi.org/10.1007/s11136-025-04101-y

References
1. van der Linden, W. J., & Glas, C. A. W. (2010). Elements of adaptive testing. Springer.
2. Wainer, H. (2000). Computerized adaptive testing: A primer (2nd ed.). Lawrence Erlbaum Associates, Inc.
3. Gialluca, K. A., & Weiss, D. J. (1979). Efficiency of an adaptive inter-subtest branching strategy in the measurement of classroom achievement. Research report 79-6. University of Minnesota, Department of Psychology, Psychometric Methods Program, Minneapolis.
4. Maurelli, V. A., & Weiss, D. J. (1981). Factors influencing the psychometric characteristics of an adaptive testing strategy for test batteries.
5. van der Linden, W. J. (1999). Empirical initialization of the trait estimator in adaptive testing. Applied Psychological Measurement, 23(1), 21–29.
6. Matteucci, M., & Veldkamp, B. P. (2013). On the use of MCMC computerized adaptive testing with empirical prior information to improve efficiency. Statistical Methods & Applications, 22(2), 243–267.
7. Xie, Q. (2019). The impact of collateral information on ability estimation in an adaptive test battery. The University of Iowa.
8. Frans, N., Braeken, J., Veldkamp, B. P., & Paap, M. C. (2023). Empirical priors in polytomous computerized adaptive tests: Risks and rewards in clinical settings. Applied Psychological Measurement, 47(1), 48–63.
9. Aaronson, N. K., Ahmedzai, S., Bergman, B., Bullinger, M., Cull, A., Duez, N. J., Filiberti, A., Flechtner, H., Fleishman, S. B., & de Haes, J. C. (1993). The European Organization for Research and Treatment of Cancer QLQ-C30: A quality-of-life instrument for use in international clinical trials in oncology. Journal of the National Cancer Institute, 85(5), 365–376.
10. Fayers, P., Bottomley, A., & On Behalf of the EORTC Quality of Life Group and of the Quality of Life Unit. (2002). Quality of life research within the EORTC - the EORTC QLQ-C30. European Journal of Cancer, 38(Suppl 4), S125–S133.
11. Petersen, M. A., Aaronson, N. K., Arraras, J. I., Chie, W. C., Conroy, T., Costantini, A., Dirven, L., Fayers, P. M., Gamper, E. M., Giesinger, J. M., Habets, E. J. J., Hammerlid, E., Helbostad, J. L., Hjermstad, M. J., Holzner, B., Johnson, C., Kemmler, G., King, M. T., Kaasa, S., et al. (2018). The EORTC CAT Core: The computer adaptive version of the EORTC QLQ-C30 questionnaire. European Journal of Cancer, 100, 8–16.
12. Petersen, M. A., Aaronson, N. K., Conroy, T., Costantini, A., Giesinger, J. M., Hammerlid, E., Holzner, B., Johnson, C. D., Kieffer, J. M., van Leeuwen, M., Nolte, S., Ramage, J., Tomaszewski, K. A., Waldmann, A., Young, T., Zotti, P., & Groenvold, M. (2020). International validation of the EORTC CAT Core: A new adaptive instrument for measuring core quality of life domains in cancer. Quality of Life Research, 29(5), 1405–1417.
13. Petersen, M. A., Vachon, H., Giesinger, J. M., Groenvold, M., & On Behalf of the EORTC Quality of Life Group. (2025). Using prior information to individualize start item selection when assessing physical functioning with the EORTC CAT core. Health and Quality of Life Outcomes, 23, 21.
14. Petersen, M. A., Groenvold, M., Aaronson, N. K., Chie, W. C., Conroy, T., Costantini, A., Fayers, P., Helbostad, J., Holzner, B., Kaasa, S., Singer, S., Velikova, G., & Young, T. (2010). Development of computerised adaptive testing (CAT) for the EORTC QLQ-C30 dimensions: General approach and initial results for physical functioning. European Journal of Cancer, 46, 1352–1358.
15. Liegl, G., Petersen, M. A., Groenvold, M., Aaronson, N. K., Costantini, A., Fayers, P. M., Holzner, B., Johnson, C., Kemmler, G., Tomaszewski, K. A., Waldmann, A., Young, T., Rose, M., & Nolte, S. (2019). Establishing the European norm for the health-related quality of life domains of the computer-adaptive test EORTC CAT core. European Journal of Cancer, 107, 133–141.
16. Gholamy, A., Kreinovich, V., & Kosheleva, O. (2018). Why 70/30 or 80/20 relation between training and testing sets: A pedagogical explanation.
17. Kim, S., & Feldt, L. S. (2010). The estimation of the IRT reliability coefficient and its lower and upper bounds, with comparisons to CTT reliability statistics. Asia Pacific Education Review, 11(2), 179–188.
18. Muraki, E. (1993). Information functions of the generalized partial credit model. Applied Psychological Measurement, 17(4), 351–363.
19. Machingura, A., Taye, M., Musoro, J., Ringash, J., Pe, M., Coens, C., Martinelli, F., Tu, D., Basch, E., Brandberg, Y., Groenvold, M., Eggermont, A., Cardoso, F., Meerbeeck, J. V., van der Graaf, T. A. W., Taphoorn, M. J., Reijneveld, J. C., Soffietti, R., Sloan, J., et al. (2022). Clustering of EORTC QLQ-C30 health-related quality of life scales across several cancer types: Validation study. European Journal of Cancer, 170, 1–9.