Introduction
Several disease-specific questionnaires have been developed to measure pain and/or disability in patients with neck pain (e.g., Neck Disability Index (NDI) and Neck Pain and Disability Scale (NPDS)) [1, 2]. In order to make a rational choice for the use of these questionnaires in clinical research and practice, it is important to assess and compare their measurement properties (e.g., reliability, validity, and responsiveness) [3].
A systematic review published in 2002 evaluated the measurement properties of several neck-specific questionnaires and showed that, except for the NDI, all questionnaires lacked psychometric information and that comparison was therefore not possible [4]. Recent reviews show that the number of studies evaluating measurement properties of neck-specific questionnaires has increased considerably in the past years [5-7]. However, all these reviews lack an adequate instrument to critically appraise the methodological quality of the included studies. Studies of high methodological quality are needed to guarantee appropriate conclusions about the measurement properties. Recently, the "COnsensus-based Standards for the selection of health status Measurement INstruments" (COSMIN) checklist, an instrument to evaluate the methodological quality of studies on measurement properties of health status questionnaires, has become available [8]. Using the COSMIN checklist, it is now possible to critically appraise and compare the quality of these studies.
A recent review of the cross-cultural adaptations of the McGill Pain Questionnaire showed that pooling the measurement properties of different language versions leads to inconsistent findings, caused by differences in cultural context [9]. Since the same is likely to apply to the translated versions of the questionnaires in our review, we decided to evaluate them in a separate systematic review [10].
The purpose of this study is to critically appraise and compare the measurement properties of the original versions of neck-specific questionnaires.
Methods
Search strategy
We searched the following computerized bibliographic databases: Medline (1966 to July 2010), EMbase (1974 to July 2010), CINAHL (1981 to July 2010), and PsycINFO (1806 to July 2010). We used the index terms “neck”, “neck pain”, and “neck injuries/injury” in combination with “research measurement”, “questionnaire”, “outcome assessment”, “psychometry”, “reliability”, “validity” and derivatives of these terms. The full search strategy used in each database is available upon request from the authors. Reference lists were screened to identify additional relevant studies.
Selection criteria
A study was included if it was a full-text original article (i.e., not an abstract, review, or editorial), published in English, concerning the development or evaluation of the measurement properties of an original version of a neck-specific questionnaire. The questionnaire had to be self-reported, evaluate pain and/or disability, and be specifically developed or adapted for patients with neck pain.
For inclusion, neck pain had to be the main complaint of the study population. Accompanying complaints (e.g., low back pain or shoulder pain) were no reason for exclusion, as long as the main focus was neck pain. Studies considering study populations with a specific neck disorder (e.g., neurologic disorder, rheumatologic disorder, malignancy, infection, or fracture) were excluded, except for patients with cervical radiculopathy or whiplash-associated disorder (WAD).
Two reviewers (JMS, APV) independently assessed the titles, abstracts, and reference lists of studies retrieved by the literature search. In case of disagreement between the two reviewers, there was discussion to reach consensus. If necessary, a third reviewer (HCV) made the decision regarding inclusion of the article.
Measurement properties
The measurement properties are divided into three domains: reliability, validity, and responsiveness [11]. In addition, interpretability is described.
Reliability
Reliability is defined as the extent to which scores for patients who have not changed are the same for repeated measurement under several conditions: e.g., using different sets of items from the same questionnaire (internal consistency); over time (test-retest); by different persons on the same occasion (inter-rater); or by the same persons on different occasions (intra-rater) [11].
Reliability contains the following measurement properties:
- Internal consistency: The interrelatedness among the items in a questionnaire, expressed by Cronbach’s α or the Kuder-Richardson Formula 20 (KR-20) [8, 11].
- Measurement error: The systematic and random error of a patient’s score that is not attributed to true changes in the construct to be measured, expressed by the standard error of measurement (SEM) [11, 12]. The SEM can be converted into the smallest detectable change (SDC) [12]. Changes exceeding the SDC can be labeled as change beyond measurement error [12]. Another approach is to calculate the limits of agreement (LoA) [13]. To determine the adequacy of measurement error, the SDC and/or LoA is related to the minimal important change (MIC) [14]. As measurement error is expressed in the units of measurement, it is impossible to give one value for adequacy. However, it is important that the measurement error (i.e., the noise, expressed as SDC or LoA) is not larger than the signal (i.e., the MIC) that one wants to assess (a numerical sketch follows this list).
- Reliability: The proportion of the total variance in the measurements that is due to “true” differences between patients [11]. This aspect is reflected by the intraclass correlation coefficient (ICC) or Cohen’s kappa [3, 11].
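To make the relation between measurement error and the MIC concrete, the sketch below (in Python, with invented numbers that are not taken from any included study) derives the SEM from a hypothetical ICC and baseline standard deviation, converts it into the SDC, and compares the SDC with a hypothetical MIC.

```python
import math

# Hypothetical numbers, chosen only to illustrate the calculation;
# they are not taken from any included study.
icc = 0.85          # test-retest intraclass correlation coefficient
sd_baseline = 8.0   # standard deviation of baseline scores (questionnaire points)
mic = 7.0           # minimal important change assumed for the questionnaire

# Standard error of measurement derived from the ICC and the baseline SD
sem = sd_baseline * math.sqrt(1 - icc)

# Smallest detectable change at the individual level
sdc = 1.96 * math.sqrt(2) * sem

print(f"SEM = {sem:.1f}, SDC = {sdc:.1f}, MIC = {mic:.1f}")
# Measurement error is judged adequate only when the MIC exceeds the SDC
# (the signal is larger than the noise).
print("adequate" if mic > sdc else "not adequate")
```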
Validity
Validity is the extent to which a questionnaire measures the construct it is supposed to measure and contains the following measurement properties [11]:
- Structural validity: The degree to which the scores of an instrument are an adequate reflection of the dimensionality of the construct to be measured [11]. Factor analysis should be performed to confirm the number of subscales present in a questionnaire [15] (see the sketch after this list).
- Hypothesis testing: The degree to which a particular measure relates to other measures in a way one would expect if it is validly measuring the supposed construct, i.e., in accordance with predefined hypotheses about the correlations or differences between the measures [11].
-
Cross-
cultural validity: The degree to which the performance of the items on a translated or culturally adapted instrument is an adequate reflection of the performance of the items of the original version of the instrument [
11]. The cross-cultural validity of neck specificity questionnaire is addressed in a separate systematic review [
10].
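As an illustration of how structural validity is typically quantified (and of the criterion in Table 2 that retained factors should explain at least 50% of the variance), the sketch below computes explained variance from a principal-component solution; the data, the number of factors, and the threshold check are purely illustrative.

```python
import numpy as np

# Illustration of the structural validity criterion in Table 2:
# retained factors should explain at least 50% of the variance.
rng = np.random.default_rng(0)
item_scores = rng.normal(size=(200, 10))  # placeholder (patients x items) responses

# Eigenvalues of the item correlation matrix (principal component solution)
corr = np.corrcoef(item_scores, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

n_factors = 1  # number of subscales hypothesised for the questionnaire
explained = eigenvalues[:n_factors].sum() / eigenvalues.sum()
print(f"Variance explained by {n_factors} factor(s): {explained:.0%}")
print("positive" if explained >= 0.50 else "negative")
```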
Responsiveness
Responsiveness is the ability of an instrument to detect change over time in the construct to be measured [11]. Responsiveness is considered an aspect of validity in a longitudinal context [15]. Therefore, the same standards apply as for validity: the correlation between the change scores of two measures should be in accordance with predefined hypotheses [15]. Another approach is to determine the area under the receiver operating characteristic curve (AUC) [15].
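A minimal sketch of the AUC approach, assuming an external anchor that classifies patients as improved or not improved; the anchor, the change scores, and the use of scikit-learn's roc_auc_score are illustrative assumptions, not elements of any included study.

```python
from sklearn.metrics import roc_auc_score

# Illustration of the AUC approach to responsiveness (criterion in Table 2: AUC >= 0.70).
# 'improved' is a hypothetical external anchor (1 = improved, 0 = not improved);
# 'change_scores' are hypothetical change scores on the questionnaire.
improved = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
change_scores = [12, 9, 2, 15, 4, 1, 8, 3, 11, 5]

auc = roc_auc_score(improved, change_scores)
print(f"AUC = {auc:.2f} ->", "positive" if auc >= 0.70 else "indeterminate/negative")
```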
Interpretability
Interpretability is the degree to which one can assign qualitative meaning to quantitative scores [11]. This means that investigators should provide information about clinically meaningful differences in scores between subgroups, floor and ceiling effects, and the MIC [15]. Interpretability is not a measurement property, but an important characteristic of a measurement instrument [11].
Quality assessment
To determine whether the results of the included studies can be trusted, the methodological quality of the studies was assessed using the COSMIN checklist [8]. The COSMIN checklist consists of nine boxes, each containing 5–18 items concerning methodological standards for how a measurement property should be assessed. Each item was scored on a 4-point rating scale (i.e., “poor”, “fair”, “good”, or “excellent”), which is an additional feature of the COSMIN checklist (see http://www.cosmin.nl). The methodological quality of a study was evaluated per measurement property: an overall quality score was determined for each measurement property separately by taking the lowest rating of any item in the corresponding box.
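The "lowest rating counts" rule can be expressed compactly; the sketch below only illustrates that rule, using a hypothetical helper function (the COSMIN checklist itself is a paper instrument, not software).

```python
# Sketch of the "lowest rating counts" rule for one COSMIN box: the overall
# methodological quality rating equals the lowest item rating in that box.
RATING_ORDER = ["poor", "fair", "good", "excellent"]

def box_rating(item_ratings):
    # Hypothetical helper: return the rating with the lowest rank in RATING_ORDER.
    return min(item_ratings, key=RATING_ORDER.index)

print(box_rating(["excellent", "good", "fair", "good"]))  # -> fair
```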
Data extraction and assessment of (methodological) quality were performed by two reviewers (JMS, CBT) independently. In case of disagreement between the two reviewers, there was discussion in order to reach consensus. If necessary, a third reviewer (HCV) made the decision.
Best evidence synthesis: levels of evidence
To summarize all the evidence on the measurement properties of the different questionnaires, we synthesized the results of the different studies, taking into account the number and methodological quality of the studies and the consistency of their results. The possible overall rating for a measurement property is “positive”, “indeterminate”, or “negative”, accompanied by a level of evidence, similar to the approach proposed by the Cochrane Back Review Group (see Table 1) [16, 17].
Table 1
Levels of evidence for the overall quality of the measurement property [17]

| Level of evidence | Rating | Criteria |
| Strong | +++ or −−− | Consistent findings in multiple studies of good methodological quality OR in one study of excellent methodological quality |
| Moderate | ++ or −− | Consistent findings in multiple studies of fair methodological quality OR in one study of good methodological quality |
| Limited | + or − | One study of fair methodological quality |
| Conflicting | ± | Conflicting findings |
| Unknown | ? | Only studies of poor methodological quality |
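The synthesis rules of Table 1 can be summarized as a small decision function. The sketch below is a simplification that assumes the per-study quality ratings and the consistency judgement are already available; it is illustrative and does not replace reviewer judgement.

```python
# Simplified encoding of the synthesis rules in Table 1. Studies of poor
# methodological quality are not counted towards the level of evidence.
def level_of_evidence(qualities, consistent):
    usable = [q for q in qualities if q != "poor"]
    if not usable:
        return "unknown"           # only studies of poor quality
    if not consistent:
        return "conflicting"
    if "excellent" in usable or usable.count("good") >= 2:
        return "strong"
    if "good" in usable or usable.count("fair") >= 2:
        return "moderate"
    return "limited"               # one study of fair quality

print(level_of_evidence(["good", "good"], consistent=True))  # -> strong
print(level_of_evidence(["fair"], consistent=True))          # -> limited
```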
To assess whether the results for the measurement properties were positive, negative, or indeterminate, we used criteria based on Terwee et al. (see Table 2) [18].
Table 2
Quality criteria for measurement properties [18]

| Measurement property | Rating | Quality criteria |
| Reliability | | |
| Internal consistency | + | (Sub)scale unidimensional AND Cronbach’s alpha(s) ≥ 0.70 |
| | ? | Dimensionality not known OR Cronbach’s alpha not determined |
| | − | (Sub)scale not unidimensional OR Cronbach’s alpha(s) < 0.70 |
| Measurement error | + | MIC > SDC OR MIC outside the LoA |
| | ? | MIC not defined |
| | − | MIC ≤ SDC OR MIC equals or inside LoA |
| Reliability | + | ICC/weighted kappa ≥ 0.70 OR Pearson’s r ≥ 0.80 |
| | ? | Neither ICC/weighted kappa nor Pearson’s r determined |
| | − | ICC/weighted kappa < 0.70 OR Pearson’s r < 0.80 |
| Validity | | |
| Content validity | + | The target population considers all items in the questionnaire to be relevant AND considers the questionnaire to be complete |
| | ? | No target population involvement |
| | − | The target population considers items in the questionnaire to be irrelevant OR considers the questionnaire to be incomplete |
| Construct validity | | |
| Structural validity | + | Factors should explain at least 50% of the variance |
| | ? | Explained variance not mentioned |
| | − | Factors explain < 50% of the variance |
| Hypothesis testing | + | (Correlation with an instrument measuring the same construct ≥ 0.50 OR at least 75% of the results are in accordance with the hypotheses) AND correlation with related constructs is higher than with unrelated constructs |
| | ? | Solely correlations determined with unrelated constructs |
| | − | Correlation with an instrument measuring the same construct < 0.50 OR < 75% of the results are in accordance with the hypotheses OR correlation with related constructs is lower than with unrelated constructs |
| Responsiveness | | |
| Responsiveness | + | (Correlation with an instrument measuring the same construct ≥ 0.50 OR at least 75% of the results are in accordance with the hypotheses OR AUC ≥ 0.70) AND correlation with related constructs is higher than with unrelated constructs |
| | ? | Solely correlations determined with unrelated constructs |
| | − | Correlation with an instrument measuring the same construct < 0.50 OR < 75% of the results are in accordance with the hypotheses OR AUC < 0.70 OR correlation with related constructs is lower than with unrelated constructs |
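As a worked example of how such criteria translate into ratings, the sketch below applies the internal consistency row of Table 2; the function and its inputs are hypothetical and mirror only that single criterion.

```python
# Worked example of the internal consistency row of Table 2.
def rate_internal_consistency(unidimensional, cronbach_alpha):
    if unidimensional is None or cronbach_alpha is None:
        return "?"   # dimensionality not known OR alpha not determined
    if unidimensional and cronbach_alpha >= 0.70:
        return "+"
    return "-"       # not unidimensional OR alpha < 0.70

print(rate_internal_consistency(True, 0.86))   # -> +
print(rate_internal_consistency(None, 0.91))   # -> ?
```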
Discussion
Eight different questionnaires have been developed to measure pain and/or disability in patients with neck pain. All original versions are in English, except for the CNFDS, which was developed in Danish. The NDI is the most frequently evaluated questionnaire and its measurement properties seem adequate, except for reliability. The other questionnaires show positive results, but the evidence is mostly limited and at least half of the information on measurement properties per questionnaire is lacking. Therefore, the results should be treated with caution.
The COSMIN checklist has recently been developed and is based on consensus among experts in the field of health status questionnaires [8]. The COSMIN checklist facilitates a separate judgment of the methodological quality of the included studies and their results, which is in line with the methodology of systematic reviews of clinical trials [16]. The inter-rater agreement of the COSMIN checklist is adequate, but the inter-rater reliability of many COSMIN items is poor, which is suggested to be due to differences in the interpretation of the checklist items [46]. To minimize such differences between the reviewers (JMS, CBT, and HCV), decisions were made in advance on how to score the different items.
The criteria in Table 1 are based on the levels of evidence previously proposed by the Cochrane Back Review Group [17]. These criteria were originally meant for systematic reviews of clinical trials, but we believe that they are also applicable to reviews of measurement properties of health status questionnaires.
Exclusion of non-English papers may introduce selection bias. However, the leading journals, and consequently the most important studies, are published in English, so research performed in populations with a different native language is generally still published in English. This is illustrated by the large number of articles we retrieved regarding translations of neck-specific questionnaires (see Fig. 1). In these papers, we did not find a reference to an original version of a neck-specific questionnaire that was not included in our systematic review. This makes us confident that we are unlikely to have missed any original versions of neck-specific questionnaires.
The different studies showed similar methodological shortcomings. A small sample size, for example, frequently led to indeterminate results. We do not discuss these flaws in detail here but elaborate on this subject in a separate publication [47].
A problem we encountered during the rating of “hypothesis testing” and “responsiveness” was that most studies did not formulate hypotheses regarding expected correlations in advance. Moreover, none of the development studies specified the supposed underlying constructs of the questionnaire. Therefore, it is difficult to judge content validity, which is one of the most important measurement properties. We dealt with this problem by reaching agreement on what we considered the supposed underlying constructs, based on the items in the questionnaire, before rating the studies.
The assumption that pooling results from original and translated versions could lead to inconsistent findings regarding measurement properties is confirmed in our systematic review of translated versions of neck-specific questionnaires [10]. A poor translation process and/or a lack of cross-cultural validation seem to affect the measurement properties of a questionnaire, particularly its validity (i.e., structural validity and hypothesis testing) [10]. This is not surprising, as the importance and/or meaning of questionnaire items (e.g., driving, depressed mood) may depend on setting and context. A simple translation of the original questionnaire is therefore not sufficient and might affect the measurement of the underlying constructs [10].
Since the 2002 review, 17 of the 25 studies included in our review have been published, and four new neck-specific questionnaires have been developed [4, 36, 40, 43, 45]. These studies added new information, but due to their poor to fair methodological quality, a substantial amount of uncertainty about the quality of the measurement properties remains.
The quality of the measurement properties of several neck-specific questionnaires was recently evaluated in a best evidence synthesis, which showed positive results for the NDI, NPDS, NBQ, NPQ, CNFDS, and WDQ [5]. However, these results were partially based on methodologically flawed studies, and that synthesis covered only a small part of the manuscripts included in our study.
A state-of-the-art review evaluating the NDI reported that its reliability, internal consistency, factor structure (i.e., unidimensional scale), construct validity, and responsiveness are well described and of very high quality [7], which is not completely in agreement with our findings. Possible explanations for these discrepancies are that the study reporting the negative result for reliability was published after the search of the state-of-the-art review had ended, and that its authors did not critically appraise the methodological quality or results of the included studies [7, 29]. A more recent systematic review evaluating the NDI reports good internal consistency, acceptable reliability, good construct validity and responsiveness, and inconsistent results regarding its structural validity [6]. The differences with our findings are probably attributable to the fact that this review did not take the methodological quality of the included studies into account [6].
It is difficult to determine the content validity of the different neck-specific questionnaires, because almost all retrieved studies on this subject were of poor methodological quality. Furthermore, the underlying constructs were not clear. However, a recent content analysis showed that the correspondence between the symptoms expressed by neck pain patients and the content of the questionnaires was low, mainly due to a lack of patient involvement in the development of the questionnaires [48]. The importance of content validity makes it desirable that this measurement property is evaluated in a high-quality study for each questionnaire. The results of such studies will show which questionnaires are suitable for neck pain patients and whether the development of a new neck-specific questionnaire is necessary.
The most frequently studied measurement property is responsiveness. This is not surprising, since these questionnaires are often used as an outcome measure. However, except for the NDI and NPQ, there is only limited positive evidence for responsiveness.
For clinical practice and research, we advise using the original versions of neck-specific questionnaires with caution: the majority of the results are positive, but the evidence is mostly limited and, for each questionnaire except the NDI, at least half of the information regarding measurement properties is lacking. Provisionally, we recommend using the NDI, because it is the questionnaire for which the most information is available and the results are mostly positive. However, research is needed to clarify its underlying constructs, measurement error, and reliability, and to improve the interpretation of its scores.
No clinician should make decisions regarding the management of neck pain patients solely on the basis of unvalidated instruments. However, neck-specific questionnaires can provide a broader and deeper understanding of the impact of neck pain on individual patients.
For future research, we recommend performing high-quality studies to evaluate the unknown measurement properties, especially content validity, and to provide strong evidence for the other measurement properties. It seems advisable to refrain from developing new neck-specific questionnaires until high-quality studies evaluating the measurement properties of the current questionnaires reveal shortcomings that make a new questionnaire necessary.