Introduction
Osteoarthritis is a highly prevalent chronic disease and a major cause of activity limitations in affected patients [1, 2]. With the hip and knee being two of the most affected joints [1], patients are often particularly limited in lower-body functions and associated activities of daily living (ADL). Consequently, limitations in performing physical activities are an important outcome in the field of hip and knee osteoarthritis [3].
Many different measures of activity limitations have been developed [4], with patient-reported outcome (PRO) and performance-based outcome (PerfO) measures being the most frequently used assessment types [4, 5]. In PRO measures, respondents rate their perceived level of activity limitations by answering self-report items in a questionnaire, while PerfO measures assess a patient's performance of physical tasks in a standardized test environment [4]. PRO measures are easy to use and inexpensive, but subjective with regard to the interpretation of the terms used in the questionnaire (e.g., 'difficulty') and to the reference frame adopted by the respondent (i.e., the situation or status to which the subject relates) [6]. It has been shown that PRO measures are more influenced by subjective patient variables than PerfO assessments [6-9]. In contrast, PerfO measures yield more objective assessments, but are more resource-intensive and burdensome to patients [10]. Moreover, while PRO measures can capture a broad range of different activities, PerfO measures usually focus on a very specific activity and are often used as single-task measures [11]. In sum, several previous research findings indicate that PRO and PerfO measures may assess related yet distinct constructs, and that the respective results should only be compared with caution [12, 13].
To combine the advantages of PRO and PerfO measures within one instrument, the Animated Activity Questionnaire (AAQ) was developed [14, 15]. The AAQ is an online measure of activity limitations for patients with hip and knee osteoarthritis. Each of its 17 items consists of several videos of an animated avatar performing a specific ADL task. To answer an item, patients choose the animation that best matches their own level of activity limitation. The resource intensity of the AAQ is comparable to that of computer-based PRO measures. At the same time, by showing animations of activities in a standardized real-life situation and environment, the influence of the patient's reference frame is expected to be minimized [14, 16]. Moreover, the AAQ is almost non-verbal, potentially reducing validity problems due to differences in literacy across patients, and allows for cross-language application with little translation effort [16, 17]. Thus, the AAQ is being discussed as a suitable alternative to PerfO measures in large-scale studies [14].
Several studies have indicated good psychometric characteristics of the AAQ [14, 16-19]. However, item response theory (IRT) methods, which allow an instrument to be validated at the item level [20], have not yet been applied to psychometrically evaluate the AAQ. Using IRT, ability estimates can be obtained on an interval scale, and statistical precision and power can be improved [20]. Estimating an IRT model provides individual parameters for each item of a measure [21-23]. A major advantage of using IRT item parameters for scoring is that item sets can be optimized by administering only the most relevant and precise items for a given ability level. One method of item set optimization is the application of computerized adaptive tests (CAT) [23]. In a CAT, a computer algorithm automatically selects the most informative items for an individual respondent, based on his or her responses to previous items [23]. These algorithms rely on item parameters reflecting the statistical relationship between the latent construct of a measure and the responses to a given item. The application of CATs usually leads to a significant reduction in the number of items to be answered and in the time required to complete a questionnaire [21, 23, 24]. This seems particularly useful even for relatively short instruments, as study participants often have to complete not just a single questionnaire but entire batteries of questionnaires, which can be very burdensome for patients and resource-intensive for those conducting the studies [24]. Moreover, to answer an AAQ item, patients must watch and compare multiple videos, which requires a certain level of concentration and attention. Therefore, administration as a CAT could substantially reduce patient burden.
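The item-selection step described above can be illustrated with a minimal sketch. This is not the actual AAQ-CAT implementation: the item parameters (slope `a`, thresholds `bs`) are invented for illustration. Category probabilities follow Samejima's graded response model, and the next item is the not-yet-administered item with maximum Fisher information at the current ability estimate:

```python
import numpy as np

def grm_category_probs(theta, a, bs):
    """Category response probabilities under the graded response model.
    a: discrimination (slope); bs: ordered threshold parameters."""
    cum = np.concatenate(([1.0],
                          1.0 / (1.0 + np.exp(-a * (theta - np.asarray(bs)))),
                          [0.0]))
    return cum[:-1] - cum[1:]  # P(X = k) for k = 0..K-1

def item_information(theta, a, bs, eps=1e-6):
    """Fisher information at theta, via a central numerical derivative."""
    p = grm_category_probs(theta, a, bs)
    dp = (grm_category_probs(theta + eps, a, bs)
          - grm_category_probs(theta - eps, a, bs)) / (2 * eps)
    return float(np.sum(dp ** 2 / p))

def select_next_item(theta, items, administered):
    """Pick the not-yet-administered item with maximum information at theta.
    items: list of (a, bs) tuples; administered: set of item indices."""
    candidates = [(i, item_information(theta, a, bs))
                  for i, (a, bs) in enumerate(items) if i not in administered]
    return max(candidates, key=lambda c: c[1])[0]
```

After each response, the ability estimate is updated (e.g., by expected-a-posteriori scoring) and `select_next_item` is called again until a stopping rule (precision reached or maximum test length) applies.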
Given its computer-based nature, the AAQ seems well-suited for use as a CAT. Thus, the aims of this study were (1) to investigate whether the items of the AAQ fulfill psychometric criteria for IRT modelling, (2) to establish IRT item parameters that can be used for application as a CAT, and (3) to evaluate the performance of different CAT versions in terms of test length, precision, and construct validity based on post-hoc simulations.
Discussion
Based on a large international sample of patients with hip and/or knee osteoarthritis, the findings of our study indicate that all items of the AAQ are well-suited to be calibrated on a unidimensional IRT-based scale. Item parameters were established using graded response modelling; these parameters can be used to administer the AAQ as a CAT. Using post-hoc simulations, good psychometric properties were found for three different CAT versions (without length restriction, with a maximum of 10 items, and with a maximum of 5 items).
Statistical analyses indicated that all items of the AAQ fulfill psychometric criteria for IRT modelling. A core assumption of unidimensional IRT is that all items can be used for the assessment of a common underlying construct (i.e., activity limitations in the case of the AAQ). Although the results of a traditional unidimensional CFA were inconsistent, bifactor analysis as well as individual item analyses supported a unidimensional structure of the AAQ items. It has been discussed before that bifactor models might be better suited than traditional CFA criteria for evaluating sufficient unidimensionality of self-reported data [37].
Individual item parameters indicated that the AAQ items are generally best suited for measuring more severe activity limitations. Moreover, items 16 ('putting on shoes') and 17 ('taking off shoes') showed comparatively low slopes, indicating weak associations of these two items with the underlying construct. In the context of CAT, items with low slopes are generally less informative and, consequently, less likely to be administered by the automated CAT algorithm. Nonetheless, items 16 and 17 appeared to be useful for scoring individuals with below-average activity limitations and were in fact selected by the CAT algorithm for some participants.
Important to consider in relation to CAT administration is that the individual items of the AAQ cover somewhat different aspects of physical activity limitations, i.e., climbing stairs (items 1 and 2), walking (items 3 to 7), rising and sitting down (items 9 to 15), but also activities that require fine motor skills in addition to hip joint mobility, such as picking up an object from the floor and putting on and taking off shoes (items 8, 16, and 17, respectively). Thus, when applied as a CAT, the content validity of the AAQ (i.e., the degree to which the measure represents all aspects of the construct of interest) might be reduced if one or more of these aspects are skipped by the automated CAT algorithm [39]. This issue appeared to be particularly relevant for the 5-item CAT, where as many as five AAQ items were never used for scoring any participant (items 2, 8, 9, 15, and 17). In the 10-item CAT, only one item (item 8) was never used; in the 17-item CAT, all items were used (the lowest exposure rate was 17%, for item 15). Content balancing has been suggested as a potential solution when reduced content validity causes systematic bias in CAT assessments, i.e., when the items of a scale appear to measure distinct sub-constructs [23, 39]. Nevertheless, our analyses did not indicate any systematic bias when administering the AAQ as a CAT. Each CAT version was highly correlated with the full AAQ, and the differences from the original AAQ scores were negligible. Moreover, scores of the full AAQ and of each CAT version were similarly associated with other PRO and PerfO measures of activity limitations. In sum, based on the results of this study, content balancing does not seem necessary for any version of the AAQ-CAT. Moreover, the correlations between AAQ scores and each of the other measures of activity limitations tended to be higher than the correlations between PRO and PerfO measures. This finding might empirically reflect the original purpose of developing the AAQ as an innovative assessment tool combining the characteristics of PRO and PerfO measures [15].
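Should content balancing nevertheless become necessary, e.g., for an extended item bank, one simple constrained-selection scheme restricts maximum-information selection to the content area that currently lags furthest behind its target proportion. The sketch below is illustrative only; the content areas, target proportions, and information values are hypothetical and not part of the AAQ-CAT:

```python
def pick_area(targets, counts):
    """Content area whose administered share lags furthest behind its target.
    targets: {area: target proportion}; counts: {area: items given so far}."""
    total = max(sum(counts.values()), 1)
    return max(targets, key=lambda ar: targets[ar] - counts.get(ar, 0) / total)

def select_balanced(info_by_item, areas, administered, targets, counts):
    """Maximum-information selection restricted to the most under-represented
    content area; falls back to the full pool if that area is exhausted."""
    area = pick_area(targets, counts)
    pool = [i for i, ar in enumerate(areas)
            if ar == area and i not in administered]
    if not pool:
        pool = [i for i in range(len(areas)) if i not in administered]
    return max(pool, key=lambda i: info_by_item[i])
```

After each administered item, the count for its content area is incremented and the information values are recomputed at the updated ability estimate.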
This study has some limitations. First, the evaluation of CAT performance was based on post-hoc simulations. The performance of actual AAQ-CAT administrations must be examined in future studies. Nevertheless, findings of previous studies comparing simulated and real CAT data indicate that results are likely to be similar [49]. Moreover, using post-hoc simulations had the advantage that the anticipated performance of different CAT versions could be compared directly to each other and to the full measure. Second, while 11 language versions of the AAQ already exist, data from only 7 countries were used for the psychometric evaluations and for establishing IRT parameters in the present study. Three languages (German, Swedish, and Turkish) could not be considered because sufficient data have not yet been collected. Moreover, Italian data were not considered for establishing CAT parameters because considerable differential item functioning had been identified before [17]. However, it is not known whether these problems were caused by a lack of cross-cultural validity or by an issue with the specific set of Italian data collected for the AAQ cross-cultural validity study [17]. As long as cross-cultural invariance has not been shown, CAT results should be interpreted with caution for languages not included in the present study. Third, the current AAQ metric ranging from 0 to 100 is arbitrary, and it is yet to be decided whether a different metric should be used. For instance, linearly transforming the IRT-based theta metric to a T-score metric with a mean of 50 and a standard deviation of 10, based on a representative sample of a meaningful reference population, might increase the interpretability of AAQ scores [33]. Original AAQ scores could also be linked to such a metric, allowing for comparisons with AAQ-CAT scores.
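The T-score transformation mentioned here is a simple linear rescaling. A minimal sketch, assuming the mean and standard deviation of the reference population on the theta metric are known (the default values below are placeholders, not estimates from any AAQ sample):

```python
def theta_to_t_score(theta, ref_mean=0.0, ref_sd=1.0):
    """Rescale an IRT theta estimate to a T-score metric (mean 50, SD 10)
    relative to a reference population with the given theta mean and SD."""
    return 50.0 + 10.0 * (theta - ref_mean) / ref_sd
```

On this metric, a patient scoring one reference-population standard deviation above the mean would receive a T-score of 60.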
With regard to the comparison of different CAT versions, the CAT-10 (with a maximum of 10 items) appeared to be the most efficient version in our sample, with an average of fewer than 7 administered items but precision and validity comparable to the full AAQ. Nevertheless, for samples of highly impaired patients, the CAT-5 might also be well-suited.
The AAQ was originally developed to combine the benefits of patient-reported and performance-based measures of activity limitations. In addition, since the AAQ is almost non-verbal, it is applicable to patients with low literacy, and its items are easy to translate into other languages, allowing for cross-cultural application. Our study clearly supports the suitability of the AAQ for application as a CAT, measuring activity limitations with lower respondent burden but similar precision and construct validity compared to the full AAQ. Moreover, the calibration of the AAQ on an IRT-based scale provides the basis for expanding the measurement range in future developments by adding new items, e.g., items specifically assessing the extremes of the underlying construct. To make the CAT accessible to users, integrating the AAQ into existing CAT platforms, e.g., the Dutch-Flemish PROMIS Assessment Center, is being considered.