Study design and participants
Data from patients with active axSpA enrolled in a 24-week Phase III, multicenter, randomized, double-blind, placebo-controlled trial (RAPID-axSpA, NCT01087762) [21] were used in this analysis. Eligible patients were ≥18 years of age with a documented diagnosis of adult-onset axSpA of at least 3 months’ duration as defined by the Assessment of SpondyloArthritis international Society (ASAS) axSpA classification criteria, a Bath Ankylosing Spondylitis Disease Activity Index (BASDAI) [22] score ≥4, spinal pain ≥4 on a 0–10 numeric rating scale, and C-reactive protein (CRP) greater than the upper limit of normal and/or evidence of sacroiliitis, within 3 months of screening, on magnetic resonance imaging (MRI) or X-ray as defined by ASAS/Outcome Measures in Rheumatology (OMERACT) scoring [23]. All pelvic radiographs and MRI scans were assessed and confirmed by two central readers and, if necessary, an adjudicator. Patients were also required to be intolerant of non-steroidal anti-inflammatory drugs (NSAIDs) or have had an inadequate response to at least one NSAID after at least 30 days of treatment or to two NSAIDs after at least two weeks of treatment with each. The RAPID-axSpA study, from which the patient data used in this study were derived, was approved by the independent ethics committee or institutional review board at participating sites, and written informed consent was obtained from all patients.
Patients were classified as having AS [fulfilling ASAS axSpA classification criteria and modified New York (mNY) classification criteria [6]] or nr-axSpA (fulfilling ASAS axSpA classification criteria but not mNY classification criteria). Patients with nr-axSpA were further classified using the more stringent objective signs of inflammation criteria [24], defined as a Spondyloarthritis Research Consortium of Canada (SPARCC) [25] score ≥2 on MRI scans of the sacroiliac (SI) joint and/or serum C-reactive protein levels exceeding the upper limit of normal.
The following eight PROs were used in various stages of this evaluation of the ASQoL psychometric properties:
Statistical analysis
Data were described using standard descriptive statistics to characterize the overall patient population and subpopulations. Response pattern evaluations were also performed to assess inter-item tetrachoric correlations. Cross-sectional analyses, including the modern psychometric methods (MPMs), were conducted on baseline data. This approach enables evaluation of the psychometric properties of the ASQoL prior to any experimental and/or pharmacogenic interventions that could alter the underlying disease evaluated by the ASQoL. These analyses were complemented by sensitivity cross-sectional analyses at later time points.
The total axSpA intent-to-treat patient population was used for all MPMs, with further analyses performed on subpopulations (patients diagnosed with nr-axSpA overall as well as in the subgroups with and without objective signs of inflammation). Readers interested in sample size considerations and the estimation of these models are directed to the Discussion, which includes a section on the finite-sample properties of estimability and bias.
MPMs employed a combination of full information item exploratory factor analysis (EFA) and item response theory (IRT) [32]. These methods generated evidence guiding domain specification, item performance evaluation, assessments of item bias, and scoring.
Following current best practice, the number of domains (factors) was determined from model fit indices [33]. These included the C2 χ² test of absolute fit [33] and the C2-based root mean squared error of approximation (RMSEA) goodness of fit test [34], and standard metrics were used for interpreting the estimates [35]. Interpretability of the final domain solutions was achieved through oblique Quartimax rotation.
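As an illustration of how an RMSEA point estimate is commonly derived from a χ²-type fit statistic, a minimal sketch follows. The function name, the example values, and the use of N − 1 in the denominator are assumptions made for illustration; the exact C2-based computation performed by the estimation software may differ.

```python
import math

def rmsea(chi2: float, df: int, n: int) -> float:
    """Common RMSEA point estimate from a chi-square-type statistic.

    chi2: fit statistic, df: its degrees of freedom, n: sample size.
    The N - 1 denominator is one widely used convention (assumption).
    """
    if df <= 0 or n <= 1:
        raise ValueError("df must be positive and n must exceed 1")
    # Values of chi2 below df are truncated to zero misfit.
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# Hypothetical inputs, not results from the study.
print(round(rmsea(chi2=250.0, df=135, n=325), 3))
```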
Four alternative confirmatory IRT structures were assessed to evaluate item performance and bias and to guide scoring empirically. IRT models considered included a two-parameter logistic (2PL) model [36, 37], a Rasch analog of the 2PL model [38], a bifactor model [39], and a multidimensional item response theory (MIRT) model. In addition to the model fit assessment, item parameter quality and Chen’s local dependence statistic [40] were used to evaluate which IRT model best characterized the performance of the ASQoL items. Items with χ² values exceeding 3 were considered to show potentially serious local dependence. Items with IRT slopes exceeding 4 were considered to be potentially unstable [41].
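The 2PL item response function and the two screening thresholds described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the item labels, and the input layout (one slope and one largest local dependence χ² value per item) are hypothetical and are not taken from the study software.

```python
import math

def p_2pl(theta: float, a: float, b: float) -> float:
    """Probability of endorsing an item under the 2PL model.

    theta: latent trait level, a: slope (discrimination), b: location.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def flag_items(slopes, ld_chi2, slope_cut=4.0, ld_cut=3.0):
    """Return {item: [reasons]} for items exceeding either threshold.

    slopes:  {item: estimated slope}
    ld_chi2: {item: largest local dependence chi-square} (assumed layout)
    """
    flags = {}
    for item, slope in slopes.items():
        reasons = []
        if slope > slope_cut:
            reasons.append("slope")
        if ld_chi2.get(item, 0.0) > ld_cut:
            reasons.append("local dependence")
        if reasons:
            flags[item] = reasons
    return flags

# Hypothetical estimates for two items.
print(flag_items({"item1": 1.2, "item2": 4.5}, {"item1": 3.4, "item2": 0.5}))
```

In the Rasch analog of the 2PL model, the slope `a` is constrained to be equal across items, so only the location `b` varies by item.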
Differential item functioning (DIF) was used to assess whether ASQoL items functioned identically between axSpA subpopulations within the final ASQoL domain solution. This was performed using the Wald-2 DIF χ² sweep procedure, with p-values adjusted for the false discovery rate using the Benjamini–Hochberg procedure [42]. Items identified as having significant DIF were further evaluated via a DIF severity assessment [43] to assess whether detected significant DIF would severely impact scores. For an item to be declared biased between axSpA subpopulations it had to demonstrate both significant and severe DIF.
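The Benjamini–Hochberg adjustment and the two-part bias rule (significant and severe) can be sketched as below. The function names, the 0.05 significance level, and the representation of the severity assessment as a Boolean per item are assumptions for illustration only.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def biased_items(pvalues, severe, alpha=0.05):
    """Indices of items with both significant (adjusted) and severe DIF.

    severe: list of booleans from a separate severity assessment (assumed).
    """
    adj = benjamini_hochberg(pvalues)
    return [i for i, (p, s) in enumerate(zip(adj, severe)) if p < alpha and s]

# Hypothetical DIF p-values for four items.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```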
The ASQoL score is presented as either the sum score (sum of the scores for each ASQoL item; scale of 0–18) or the mean score [sum score divided by 18 (total number of ASQoL items); scale of 0–1]. For both, a lower score indicates better quality of life. As there were no item-level missing data, results were identical for the two scores in any correlation-based analysis. The optimal scoring procedure was determined based on scoring statistics [44]. Four possible scores were considered and scoring statistics characterized the relative merits of each: unit-weighted (with each item given equal weighting) domain scores, unit-weighted total scores, and empirically weighted (each item weighted by its reliability) versions of domain and total scores.
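A minimal sketch of the unit-weighted sum and mean scores, together with a generic weighted alternative, is given below. It assumes each of the 18 items is coded 0 or 1, consistent with the 0–18 range of the sum score; the function names and the normalization of the weighted score are hypothetical.

```python
N_ITEMS = 18  # total number of ASQoL items

def asqol_scores(responses):
    """Return (sum score, mean score) for one complete response vector."""
    if len(responses) != N_ITEMS or any(r not in (0, 1) for r in responses):
        raise ValueError("expected 18 item responses coded 0 or 1")
    total = sum(responses)
    return total, total / N_ITEMS

def weighted_score(responses, weights):
    """Weighted mean of item responses (weights are illustrative inputs)."""
    if len(responses) != len(weights):
        raise ValueError("responses and weights must have equal length")
    return sum(w * r for w, r in zip(weights, responses)) / sum(weights)

# Hypothetical patient endorsing 9 of 18 items.
print(asqol_scores([1] * 9 + [0] * 9))
```

Because the mean score is the sum score divided by a constant, rank-based and linear correlations are unchanged between the two when no item responses are missing.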
Note that a Supplemental Web Appendix contains the complete tabular and graphical output of the modern psychometric results, and interested readers are directed there for additional evidence.
Internal consistency was assessed to characterize the performance of the ASQoL in addition to guiding scoring decisions. Four possible scores were evaluated via the ω-based statistics: unit-weighted domain scores, unit-weighted total scores, and empirically weighted versions of domain and total scores. Internal consistency was measured by McDonald’s ω statistic and the corresponding bifactor analog, ωH [44]. These statistics are the least biased internal consistency estimators [42]. Subdomain scores would be supported if ω exceeded ωH, and total scores would be supported if ωH exceeded ω. Further, as the ωH/ω ratio approaches 1, a total domain is favored. Low values (<0.7) on both ω and ωH indicate a need for empirically weighted scores.
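For orientation, one common way of computing ω and ωH from standardized bifactor loadings is sketched below. The formulas are the standard textbook definitions, and the function name, input layout, and loading values are assumptions for illustration; they are not the output of the study models.

```python
def omega_stats(general, groups):
    """Return (omega, omega_h) from standardized bifactor loadings.

    general: general-factor loading for each item
    groups:  one list per group factor, aligned with items
             (0.0 where an item does not load on that factor)
    """
    general_part = sum(general) ** 2
    group_part = sum(sum(g) ** 2 for g in groups)
    # Item uniqueness: variance not explained by any factor.
    unique = sum(
        1.0 - general[i] ** 2 - sum(g[i] ** 2 for g in groups)
        for i in range(len(general))
    )
    total_var = general_part + group_part + unique
    omega = (general_part + group_part) / total_var
    omega_h = general_part / total_var
    return omega, omega_h

# Hypothetical loadings: four items, two group factors.
omega, omega_h = omega_stats(
    [0.6, 0.6, 0.6, 0.6],
    [[0.4, 0.4, 0.0, 0.0], [0.0, 0.0, 0.4, 0.4]],
)
print(round(omega, 3), round(omega_h, 3), round(omega_h / omega, 3))
```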
ASQoL score performance was evaluated in terms of the test characteristic curve (TCC) and the precision of score measurement via the test information function (TIF). Additional assessments included estimates of test–retest reliability, validity, ability to detect change (responsiveness) and meaningful within-patient change (MWPC).
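For a set of 2PL items, the TCC is the sum of the item response probabilities and the TIF is the sum of the item information functions. A minimal sketch follows, assuming a unidimensional 2PL model; the item parameters are hypothetical and the function names are illustrative.

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of endorsing an item with slope a and location b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected sum score at trait level theta."""
    return sum(p_2pl(theta, a, b) for a, b in items)

def tif(theta, items):
    """Test information: sum of a^2 * P * (1 - P) over items."""
    total = 0.0
    for a, b in items:
        p = p_2pl(theta, a, b)
        total += a * a * p * (1.0 - p)
    return total

def standard_error(theta, items):
    """Conditional standard error of measurement, 1 / sqrt(information)."""
    return 1.0 / math.sqrt(tif(theta, items))

# Hypothetical (slope, location) pairs for three items.
items = [(1.0, -0.5), (2.0, 0.0), (1.5, 0.8)]
for theta in (-1.0, 0.0, 1.0):
    print(theta, round(tcc(theta, items), 3), round(tif(theta, items), 3))
```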
Test–retest reliability of the ASQoL responses was estimated by correlating Baseline with Week 12 and Week 24 follow-up data. Test–retest reliability was estimated in a group of stable patients, defined as those patients with no change in PGIC, PtGADA (defined as a change in score between ±1 point), or PhGADA (defined as a change in score between ±15 points). The analysis was based on the two-way random intraclass correlation coefficient (ICC[2, 1]) [45] with estimates of at least 0.7 prespecified as indicating acceptable reproducibility of scores. Given the length of the retest interval and the fact that the retest interval spanned the interventional period of the randomized trial, the evidence presented for test–retest reliability could better be described as long-term stability. To remain consistent with regulatory review and interaction, in this manuscript we retain the description of this evidence as test–retest reliability.
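The ICC(2, 1) can be computed from the mean squares of a two-way layout (patients by time points). The sketch below follows the usual Shrout and Fleiss form of the coefficient; the function name and the example scores are hypothetical, and the input is assumed to contain complete data for stable patients only.

```python
def icc_2_1(data):
    """Two-way random, single-measure ICC(2, 1).

    data: one row per patient, each row holding k scores
          (for example, Baseline and Week 12).
    """
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    bms = ss_rows / (n - 1)               # between-patient mean square
    jms = ss_cols / (k - 1)               # between-occasion mean square
    ems = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)

# Hypothetical sum scores at two visits for five stable patients.
scores = [[10, 11], [4, 5], [15, 14], [8, 8], [12, 13]]
print(icc_2_1(scores) >= 0.7)  # prespecified acceptability threshold
```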
Concurrent validity (both convergent and divergent) was estimated at Baseline via Spearman correlations within the nr-axSpA population with objective signs of inflammation. Sensitivity analyses were conducted at Weeks 12 and 24. Convergent validity estimates were obtained by correlating ASQoL total scores with those of the BASDAI, BASFI, PtGADA, PhGADA, total and nocturnal spinal pain numeric rating scales, and the ASDAS composite score. Divergent validity estimates were obtained by correlating the ASQoL total scores with the physical functioning and physical component scores of the SF-36v2. The ability of the ASQoL to detect change was also assessed using the Spearman correlation coefficient for change in scores from Baseline to Week 12 and to Week 24 for the ASQoL versus other PRO measures. In all cases, correlations with an absolute value ≥0.4 met the prespecified criterion for acceptable validity.
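A Spearman correlation is a Pearson correlation computed on ranks, with tied values given their average rank. The sketch below shows this together with the absolute-value criterion of 0.4; the function names and example data are hypothetical.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with ties assigned their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

def meets_criterion(rho, threshold=0.4):
    """True when the absolute correlation reaches the threshold."""
    return abs(rho) >= threshold

# Hypothetical ASQoL and comparator scores for six patients.
rho = spearman([12, 5, 9, 15, 3, 8], [6.1, 3.0, 5.5, 7.2, 2.4, 3.9])
print(round(rho, 3), meets_criterion(rho))
```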
Known-groups validity evidence was generated at Baseline, Week 12 and Week 24. Scores from the PhGADA and ASDAS were dichotomized (median split and cut at 2.1, respectively) to define known groups. The mean differences in ASQoL score between the known groups for each measure were analyzed using analysis of variance (ANOVA).
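The known-groups comparison can be sketched as a dichotomization followed by a one-way ANOVA F statistic. In the sketch below, the handling of values exactly at the cut point (assigned to the lower group for the median split and to the higher group for the fixed cut) is an assumption, as are the function names and data; the p-value would be obtained from the F distribution with the returned degrees of freedom.

```python
import statistics

def median_split(values):
    """True for values above the sample median (assumed tie handling)."""
    cut = statistics.median(values)
    return [v > cut for v in values]

def fixed_cut(values, cut=2.1):
    """True for values at or above a fixed cut point (assumed tie handling)."""
    return [v >= cut for v in values]

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_values = [v for g in groups for v in g]
    grand = statistics.fmean(all_values)
    means = [statistics.fmean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical ASQoL and ASDAS scores for six patients.
asqol = [4, 6, 5, 12, 14, 11]
high = fixed_cut([1.4, 1.9, 2.0, 3.1, 3.5, 2.8])
low_group = [s for s, h in zip(asqol, high) if not h]
high_group = [s for s, h in zip(asqol, high) if h]
print(one_way_anova_f([low_group, high_group]))
```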
MWPC was estimated by both distribution-based and anchor-based methods. Given regulatory emphasis on anchor-based methods, only anchor-based evidence is reported. The anchor-based method for MWPC estimation was based on patients whose change in ASQoL score between Baseline and Week 12 and Week 24 was equal to or greater than the estimated median change in ASQoL score in patients with a PGIC of 6 (moderate improvement) or 5 (minimal improvement). In addition, the MWPC point estimate was validated via empirical cumulative distribution functions (eCDFs) and 95% Clopper-Pearson confidence bands for change in ASQoL score from Baseline, stratified by PGIC category (no change, minimal improvement, moderate improvement, marked improvement).
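The anchor-based estimate and the eCDF evaluation can be sketched as follows. The sketch assumes change is computed as follow-up minus Baseline, so that improvement is a negative change (lower ASQoL scores indicate better quality of life), and that a patient meets the threshold when the change is at least as large in the direction of improvement. The function names and data are hypothetical, and the Clopper-Pearson bands are omitted.

```python
import statistics

def anchor_mwpc(changes, pgic, anchor_levels=(5, 6)):
    """Median ASQoL change among patients in the anchor PGIC categories."""
    anchored = [c for c, g in zip(changes, pgic) if g in anchor_levels]
    if not anchored:
        raise ValueError("no patients in the anchor categories")
    return statistics.median(anchored)

def ecdf_at(values, x):
    """Empirical CDF: proportion of values less than or equal to x."""
    return sum(1 for v in values if v <= x) / len(values)

def meets_threshold(change, threshold):
    """True when improvement is at least as large as the threshold.

    Assumes improvement is a negative change (assumption stated above).
    """
    return change <= threshold

# Hypothetical changes from Baseline and PGIC ratings for five patients.
changes = [-4, -2, 0, -3, 1]
pgic = [6, 5, 4, 6, 3]
threshold = anchor_mwpc(changes, pgic)
print(threshold, ecdf_at(changes, threshold))
```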
All analyses used observed case data only; no imputation of missing values was undertaken.
MPMs were conducted using FlexMIRT version 3.5 (Vector Psychometric Group). All other analyses were performed using a combination of Statistical Analysis Software version 9.4 (SAS® Institute Inc., Cary, NC) and R statistical software version 3.4.3 (R Development Core Team).