Introduction

Patient-reported outcome measures (PROMs) are completed by patients (or proxies) to summarise patients’ evaluation of their health or health-related quality of life. PROMs are used to assess the impact of conditions and/or interventions in effectiveness studies and cost-effectiveness analyses, or to track changes in individual health in routine care [1, 2]. Evidence on their validity and reliability, for example as a function of mode of administration [3], relationship with other outcomes [4], and linguistic content [5, 6], is therefore of interest to those working in health care decision making.

PROMs consist of questions covering different domains (e.g. pain, mental health or wellbeing, physical, social, role functioning, etc.). Patients report their level of symptoms or functioning using numeric rating scales (NRS), visual analogue scales (VAS), or verbal rating scales (VRS) [7, 8]. Response options can be based on how frequently patients experience a symptom or have problems with functioning, how severe the symptom is, or on levels of difficulty (e.g. with functioning) [9]. Finally, PROM questions can also be phrased as agreement scales, for example, from strongly agree to strongly disagree [10].

The type of response option used is related to the concept being measured. For example, experience of symptoms is usually linked to either frequency or severity scales. Summary scores or weighted index values can be generated from the PROM, which can be used to assess health [11, 12]. However, verbal response options are considered to be vague quantifiers, as they rely on respondents’ interpretation of terms such as ‘sometimes’ or ‘occasionally’ [13, 14]. The vagueness of verbal quantifiers is potentially problematic in the assessment of health-related quality of life, which often relies heavily on them.

Vague quantifiers are also problematic when PROMs are used by health economists to elicit health state utility values, which play an important practical role in cost-effectiveness analysis to determine health care resource allocation [15]. As response levels are displayed independently of the other response options in utility elicitation, it is important for PROMs to have response choices that are clear and can be interpreted consistently over time, across contexts, and between people [16]. Within each type of response option, there are variations in the number of options and in the qualitative labels used to distinguish between them. There are ongoing methodological uncertainties around potential differences in the interpretation of response options [13] and concerns as to whether respondents can clearly distinguish between different numbers of response categories [17, 18].

For frequency response options, the relationship between participants’ numerical estimates and corresponding linguistic terms (e.g. ‘often’, ‘some of the time’, ‘seldom’) has been explored to understand the order and the degree of difference between displayed options [13, 19,20,21,22]. While similar analysis has also been conducted on severity response options (e.g. ‘very much’, ‘quite a bit’, ‘some’) [23,24,25], little has been investigated regarding response options that quantify difficulties (e.g. ‘a little difficulty’, ‘moderate difficulty’) [26].

How participants assign a quantitative value to qualitative response options (e.g. how many times an event has to have happened to match the label ‘often’ or ‘sometimes’) is not clearly understood. Additionally, evidence on whether such interpretation varies across different contexts or domains remains scarce. The interpretation of response options could also be heterogeneous across subpopulations, for example, with regard to health, language, and cultural background. Whether a uniform questionnaire can provide a generally consistent measure across different groups requires exploration [27, 28].

This project aimed to explore how respondents quantitatively interpret common PROM response options. This was in part to support ongoing instrument development work of a new preference-based measure of health and wellbeing (the EQ-HWB; https://scharr.dept.shef.ac.uk/e-qaly/) and the choice of questions and response options investigated in this study was linked to those being considered for inclusion within this broader project. Three key questions were addressed: 1) whether the quantification of different response options reflects their intuitive or linguistic ordering; 2) whether response options to questions assessing different domains are interpreted differently; and 3) whether individual characteristics, such as age and having English as a second language (ESL), influence the way response options are interpreted.

Methods

Sample

Adult residents in the UK were recruited via the Prolific online panel [29] in January 2019 (n = 1401), pre-screened to cover a spread of age (18–47 versus 48+ years old), gender, and ESL. No formal a priori sample size estimation was undertaken. Participants received £1.20 for completing the online survey.

Survey

Ethics approval was granted by the host institution’s ethics committee. Figure 1 illustrates the survey flow. Following consent, participants provided background characteristics, including their age, gender, ethnic group, highest educational qualification, health, and any chronic mental or physical health problems (see the full survey in Online Resource 1). Participants were randomised to one of three domains (loneliness, happiness, or activities) and asked a series of questions about the quantitative interpretation of commonly used PROM response options. Each participant thus received a question stem corresponding to one of these domains: either (1) a negatively phrased question related to social functioning, ‘I felt lonely’; (2) a positively phrased mental health/wellbeing question, ‘I felt happy’; or (3) an activity/role functioning question, ‘I was able to do the things I wanted to do’.

Fig. 1
figure 1

Schematic of survey flow. Respondents were randomised to answer questions on one of three health related quality of life domains (happiness, loneliness, or activities), and then to either the ‘only occasionally’ or ‘occasionally’ frequency response option. Frequency slider response options included: (‘none of the time’), ‘only occasionally’/’occasionally’, ‘sometimes’, ‘often’, ‘most of the time’, (‘all of the time’). Severity slider response options included: (‘not at all’), ‘a little bit’, ‘some’, ‘somewhat’, ‘quite a bit’, ‘very much’. Difficulty slider response options included: (‘no difficulty’), ‘slight difficulty’, ‘some difficulty’, ‘a lot of difficulty’, (‘unable’)

Following stratification by one of the three questions, participants were further randomised such that half were given the response option ‘occasionally’ to interpret and half were given ‘only occasionally’. Given that ‘occasionally’ and ‘only occasionally’ always fell in the same position within the response options, this randomisation enabled testing of whether the actual wording of the response option made a difference beyond its ranked order. ‘Occasionally’ and ‘only occasionally’ were tested in separate arms in part to inform selection of response options for the new measure.

First, participants were asked to respond to the question ‘Thinking about how things have been over the last 7 days…’ with the stem dependent on question randomisation (e.g. ‘I felt lonely’), using a 5-point scale (ranging from ‘none of the time’ to ‘most or all of the time’; for the full response scales to all survey questions see Online Resource 1). Participants were then asked to provide a quantitative interpretation of their own response to this question, based on the number of days over a 7-day period they thought the response best referred to, on an 8-point scale (ranging from ‘not even once in the last 7 days’ to ‘seven or more times in the last 7 days’). The same quantification question was then asked for one other, randomly selected response option in addition to their own answer.

In order to observe what quantitative values participants assigned to each response option, relative to one another on the same scale, participants then completed three slider tasks. First, for each of a set of randomly ordered frequency response options (‘occasionally’ or ‘only occasionally’, ‘sometimes’, ‘often’, ‘most of the time’), participants were asked to assign a numeric value between 0 and 100 using a slider (0 = none of the time, 100 = all of the time).

Second, they were asked to assign values to a set of randomly ordered severity response options (‘a little bit’, ‘somewhat’, ‘some’, ‘quite a bit’, ‘very much’) on a similar 0 to 100 slider scale (0 = not at all; 100 was undefined). In both cases, these questions related to the domain to which the participant had been randomised (e.g. loneliness). For the frequency slider the top anchor ‘all of the time’ is intuitive; however, there is no comparably clear top anchor for a severity scale that can be applied across all three domains. The top of the severity scale was therefore left undefined, both to avoid introducing potential focusing effects from terms that are not usually part of the response option set and to allow respondents to place ‘very much’ at the top of the scale should they wish.

Third, all participants responded to a question with a difficulty response option, linked to mobility: ‘Thinking about how things have been over the last 7 days… How well were you able to get around outside?’ on a 5-point scale (ranging from ‘no difficulty’ to ‘unable’). After this, participants were asked to assign values to the response options ‘a lot of difficulty’, ‘some difficulty’ and ‘slight difficulty’ on a 0 to 100 slider scale (0 = no difficulty, 100 = unable).

To help with data quality, the survey was designed to be short (less than 10 min); respondents were timed out after 30 min. The research team included a patient researcher who supported the design of the study, including the survey, and provided input into the clarity and content of the study information sheet.

Data quality

We judged a respondent to have answered in a logically inconsistent way when they gave a quantitative answer for the bottom response option (e.g. ‘only occasionally’ or ‘none of the time’) that was equal to or higher than their response for the top response option (e.g. ‘most of the time’). Inconsistent responses were dropped from the group of questions in which the inconsistency was identified (i.e. frequency quantification in number of times over 7 days, frequency slider, severity slider, or difficulty slider). However, for the question about the number of times over 7 days, answers for the bottom and top response options that were both at the top end of the scale (7 or more times) were not considered inconsistent, owing to the upper censoring of the scale. In addition, we excluded entirely individuals with three or more inconsistencies, as these respondents were considered not to have paid attention to or understood the tasks. An additional analysis was conducted with all respondents with any inconsistencies dropped, to explore the effect on the results (see Online Resource 4).
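As an illustration of this cleaning logic only, the sketch below flags inconsistencies per task group and identifies respondents for exclusion; the column names and group labels are hypothetical and do not correspond to the actual study dataset.

```python
import pandas as pd

def flag_inconsistencies(df: pd.DataFrame) -> pd.DataFrame:
    """Flag logically inconsistent responses per task group (illustrative sketch).

    Assumes hypothetical columns holding each respondent's quantitative answers,
    e.g. 'freq_slider_bottom' / 'freq_slider_top' for the lowest and highest
    frequency response options on the 0-100 slider.
    """
    flags = pd.DataFrame(index=df.index)

    # A group is inconsistent when the bottom option is rated equal to or
    # higher than the top option.
    for group in ["freq_days", "freq_slider", "sev_slider", "diff_slider"]:
        flags[group] = df[f"{group}_bottom"] >= df[f"{group}_top"]

    # Exception: for the 'number of times over 7 days' question, answers at the
    # censored top of the scale (7 or more times) for both options are allowed.
    top_censored = (df["freq_days_bottom"] >= 7) & (df["freq_days_top"] >= 7)
    flags["freq_days"] = flags["freq_days"] & ~top_censored

    # Inconsistent answers are dropped within the affected group only, while
    # respondents with three or more inconsistencies are excluded entirely.
    flags["exclude_respondent"] = flags.sum(axis=1) >= 3
    return flags
```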

Data analysis

Descriptive analyses were conducted to document characteristics of respondents and to compare results of the slider and frequency quantification questions for different response options across different domains (e.g. happiness versus loneliness). Differences were tested using Fisher’s exact test (for medians) and unpaired t-tests (for means). We also explored descriptively the relative gaps between mean responses across the response options (no statistical test performed) and the variability of respondents’ answers for each slider response (using a variance comparison test) to indicate the consistency of interpretation of the options across our sample.
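For illustration, the comparisons described above could be implemented as in the following sketch (Python/scipy; group labels and variable names are assumptions). The two-sided variance comparison returns a p-value of the form 2*Pr(F < f), matching the notation used in the Results.

```python
import numpy as np
from scipy import stats

def compare_groups(x, y):
    """Illustrative comparisons of two groups of slider values (e.g. happiness vs loneliness)."""
    # Unpaired t-test for a difference in means.
    t_stat, t_p = stats.ttest_ind(x, y)

    # Median comparison via a 2x2 table of counts above/below the pooled median,
    # evaluated with Fisher's exact test (one way of testing medians exactly).
    grand_median = np.median(np.concatenate([x, y]))
    table = [[np.sum(x > grand_median), np.sum(x <= grand_median)],
             [np.sum(y > grand_median), np.sum(y <= grand_median)]]
    _, fisher_p = stats.fisher_exact(table)

    # Variance comparison (F-test): two-sided p-value of the form 2*Pr(F < f).
    f = np.var(x, ddof=1) / np.var(y, ddof=1)
    cdf = stats.f.cdf(f, len(x) - 1, len(y) - 1)
    var_p = 2 * min(cdf, 1 - cdf)

    return {"t": t_stat, "t_p": t_p, "median_p": fisher_p, "f": f, "var_p": var_p}
```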

We used regression analysis to explore the combined impact of respondent characteristics and the domain of the question on the assigned values. We ran separate ordinary least squares (OLS) regressions that combined all slider answers relating to a particular response option (e.g. ‘sometimes’); this meant combining respondents from different arms of the study (for the frequency and severity response options, see Fig. 1). For each model, the value assigned to the response option was the dependent variable, with domain and respondent characteristics as the independent variables. We also ran the frequency and severity models without the domain variables to explore the extent of variance explained by individual characteristics alone. As each respondent was included only once in each model, no adjustment for clustering of standard errors was needed. Ideally, respondent characteristics should have no impact on interpretation of the labels; therefore, small coefficients and low variance explained by the model were preferred. We did not, however, have any a priori effect sizes against which to judge these effects.
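A minimal sketch of one such model is given below, using statsmodels formula syntax; all column names are hypothetical and the model is illustrative rather than a reproduction of the study’s exact specification.

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_response_option_model(survey_df: pd.DataFrame, option: str = "sometimes_slider"):
    """Fit an OLS model for one response option's slider value (illustrative sketch).

    `survey_df` is assumed to hold one row per respondent, with the slider value
    for the response option plus the randomised domain and respondent
    characteristics; all column names here are hypothetical.
    """
    formula = (
        f"{option} ~ C(domain) + age + female + degree + esl"
        " + mental_health_problem + physical_health_problem"
    )
    # One observation per respondent, so no clustering adjustment is needed.
    return smf.ols(formula, data=survey_df).fit()

# Small coefficients and a low R-squared are the desirable outcome here, since
# respondent characteristics should ideally not affect label interpretation.
```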

Results

Inconsistency checks

Of 1401 survey completions, 229 had one inconsistency, 41 had two, and 24 had three or more. The latter were dropped, leaving a valid sample of 1377 participants. Most remaining inconsistencies occurred in interpreting the ‘difficulty’ response options linked to the mobility question (15.6% of the remaining sample had an inconsistency on this question). For those randomised to the happiness and activities domains, the slider for the difficulty question had reversed anchors, such that the right-hand end was the most negative (i.e. greater problems with mobility); this contrasts with earlier questions, in which the right-hand anchor was the most positive (e.g. more happiness). Accordingly, inconsistencies on the difficulty question were more common in the happiness (17.5% of this sub-sample) and activities (22.9% of this sub-sample) groups than in the loneliness group (6.5% of this sub-sample).

Respondent characteristics

Basic characteristics of the valid sample are in Table 1. Of the 1377 included completions, 53.1% were female, 37.5% reported a health condition, and 85.6% reported English as their first language. The mean age of the sample was 42.4 years (SD = 14.0), with a minimum of 18 and maximum of 86 years. On average, participants completed the survey in 6.9 min (SD = 3.4), with a minimum of 2.2 and a maximum of 30.2 min. Respondents were evenly randomised into three groups (i.e. happiness n = 458, loneliness n = 460, and activity n = 459), and there were no significant differences in characteristics between groups (see Table 1).

Table 1 Respondents’ characteristics

Response option interpretation based on number of times experienced

Table 2 shows the median number of times over the last 7 days that participants reported their own selected response option referred to, and the same for a randomly specified ‘other’ response option. The median responses increase in line with expectations.

Table 2 Median number of times over the last 7 days that participants reported each possible frequency response referred to, split by ‘occasionally’ and ‘only occasionally’ arm

Comparisons between the ‘only occasionally’ arm and the ‘occasionally’ arm found significant differences in interpretations of the responses to the ‘felt happy’ question for ‘occasionally’/‘only occasionally’ (p = 0.032, one-tailed Fisher’s exact test) and for ‘sometimes’ (p = 0.003, one-tailed Fisher’s exact test). All other comparisons between these arms were not significant. This indicates that the interpretation of ‘sometimes’ for the happiness question is lower when ‘only occasionally’ is presented as the neighbouring response choice.

Response option interpretation based on sliders

Figure 2 shows the mean answer on the sliders by question, for each response option (see Online Resource 2 for a table of these data). ‘Only occasionally’ is significantly lower than ‘occasionally’ for all three questions (unpaired t-test: lonely t = − 5.698, p < 0.001; happy t = − 5.364, p < 0.001; activities t = − 5.798, p < 0.001). For the other response options, those presented within the ‘only occasionally’ versus ‘occasionally’ arm were not significantly different from one another, with one exception (‘most of the time’ within the activities domain; unpaired t-test, t = − 2.562, p = 0.005), and consequently answers to the sliders are shown combined across these two arms (except for the ‘only occasionally’/‘occasionally’ option).

Fig. 2
figure 2

Mean value of slider responses for frequency and severity response options. Legend: value attributed by participants to each response option in a slider task across the ‘activities’, ‘happiness’ and ‘loneliness’ quality of life domains. Response options were presented simultaneously in slider tasks with either a frequency (‘occasionally’ or ‘only occasionally’, ‘sometimes’, ‘often’, ‘most of the time’) or severity (‘a little bit’, ‘somewhat’, ‘some’, ‘quite a bit’, ‘very much’) response option scale

The graph shows a significantly lower quantitative interpretation of response options in the loneliness domain compared to the happiness and activities domains (as indicated by non-overlapping 95% confidence intervals), with the exception of ‘most of the time’, where the confidence intervals overlap between the loneliness and happiness domains.

The ordinal interpretation of response options is in line with expectations. ‘Occasionally’ was interpreted as quantitatively greater than ‘only occasionally’. There are some interesting differences between the frequency and severity terms: ‘a little bit’ is quantitatively interpreted as being closer to ‘only occasionally’ than to ‘occasionally’. ‘Often’ was given a higher score than the fourth severity category, ‘quite a bit’. The response option ‘some’ was interpreted similarly to ‘somewhat’, although the SD for ‘somewhat’ was higher in all three domains (variance comparison test: happy f = 0.821, 2*Pr(F < f) = 0.039; activities f = 0.630, 2*Pr(F < f) < 0.001; lonely f = 0.703, 2*Pr(F < f) < 0.001). There is a greater distance between the third (‘often’/‘quite a bit’) and second (‘sometimes’/‘somewhat’) response options than between other neighbouring options.

The slider responses to the difficulty mobility question (i.e. ‘I was able to get around outside with…’) show that the terms ‘a lot of difficulty’ (M = 85.7, SD = 10.6), ‘some difficulty’ (M = 49.5, SD = 17.2), and ‘slight difficulty’ (M = 26.7, SD = 16.2) were interpreted broadly as expected.

Response option interpretation based on regression analysis

Table 3 shows the predictive models for the slider responses to frequency and severity response options regressed on respondent characteristics across the three randomised domains (i.e. loneliness, happiness, activities). Respondents gave statistically significantly higher values to all response options in the happiness and activities domains than in the loneliness domain, and these effects remained after individual characteristics were controlled for.

Table 3 OLS regression results for respondent characteristics predicting slider responses to frequency and severity response options

Older participants tended to give more polarised responses, giving a lower value for lower anchored response options (e.g. ‘occasionally’) and a higher value for higher anchored response options (e.g. ‘often’) than younger participants. Women gave higher values for some of the response options at the top of the scale (i.e. ‘sometimes’, ‘often’, ‘most of the time’, ‘very much’). Participants with a degree gave lower values for ‘a little bit’, ‘occasionally’ and ‘only occasionally’. Those disclosing a mental health problem gave a higher value for ‘quite a bit’ than those not disclosing a problem.

Finally, participants with ESL reported a higher value for ‘only occasionally’ and lower values for ‘quite a bit’ and ‘somewhat’. Tests of equality of variance between slider values from participants with English as a first language and those with ESL found significantly greater variance for ESL respondents for ‘quite a bit’ across all three domains (variance comparison test: happy f = 0.422, 2*Pr(F < f) < 0.001; lonely f = 0.519, 2*Pr(F < f) < 0.001; activities f = 0.511, 2*Pr(F < f) < 0.001) and for ‘very much’ in the loneliness domain (f = 0.489, 2*Pr(F < f) < 0.001). No other significant differences were observed. Overall, minimal variance in slider values is explained by respondent characteristics. When question type is not included as a covariate, the highest adjusted R-squared is 0.034 for the severity responses and 0.024 for the frequency responses (see Online Resource 3). Accordingly, the biggest variation in responses is driven by context (i.e. the domain being measured).

Table 4 shows how well respondent characteristics predict the slider responses for the difficulty response options linked to the mobility question. Only a small amount of the variation in slider responses was explained by respondent characteristics (between 0.4 and 3.0%). Women provided significantly higher slider values (i.e. closer to ‘unable’) for all levels of difficulty. Those declaring a physical health problem, and older respondents, interpreted ‘slight difficulty’ as quantitatively lower (i.e. closer to ‘no difficulty’). Whether a respondent had caring responsibilities was not significantly related to any slider values and was therefore not included as a covariate in any of the final models.

Table 4 OLS regression results for respondent characteristics predicting slider responses to difficulty response options in the mobility domain

Supplementary analyses excluding any respondent with one or more inconsistency resulted in slight changes to the significance level of some of the individual characteristics in the frequency and severity models, most notably age and education (see Online Resource 4), but overall findings remained consistent.

Discussion

This study addressed three key questions. Initially, we explored whether respondents’ quantification of different response options reflected their intuitive or linguistic ordering. In general, this was the case, suggesting that the assumed qualitative ordering of these common response options (when presented together) has underlying validity. Nevertheless, there are further takeaways. First, in previous studies, respondents tended to spread out all response options on a numerical scale when evaluating them simultaneously [25, 30]. Therefore, the same response option might be given a different numerical value when the other options presented alongside it vary. One strength of this study is that respondents were allocated to different arms through randomisation. As a result, we can infer that the labels given to response options influenced numerical interpretation beyond the positioning or order of the options shown. Despite being in the same position within the choice set, ‘occasionally’ was given a higher value by respondents than ‘only occasionally’. Similarly, the values across frequency and severity response options differed, with values for ‘often’ being above those for ‘quite a bit’, despite the fact that both were ordered in fourth position within their respective response option scales.

Second, there was greater quantitative differentiation between some terms than others (e.g. see Fig. 2). For example, the difference between ‘a little bit’ and ‘somewhat’ was smaller than the difference between ‘somewhat’ and ‘quite a bit’. Further, the distance between ‘sometimes’ and ‘occasionally’ was smaller than the distances between other neighbouring response options. The similarity of interpretation of these two terms has been found elsewhere: Spector [30], in an exercise with students from the University of South Florida, found that ‘sometimes’ and ‘occasionally’ were given the same overall ranking. Our results suggest that people do not simply apply an interval approach when rating a finite number of response options on the same scale, which has implications for scale design and analysis. Rather than interval-based scoring systems for PROMs, this may support the use of scoring systems that draw upon other information to determine the relative score for a response option, such as item response theory (IRT) or preference-based scoring [31, 32].

Third, as Fig. 2 shows, there is a clear gap in the middle of the quantitative rating scale for the selection of response options tested in this study. This suggests that none of the options tested was well suited to serve as the mid-point of a Likert scale, and it prompts the need for further testing of response options that may better fulfil this role within PROM design (e.g. ‘half the time’).

The second research question was whether domain affected the quantitative value participants placed on response options. We found some support for this. The loneliness domain, which featured a negatively worded question stem (i.e. more is worse), resulted in response options having a lower numerical interpretation than in the two positively worded happiness and activities domains. Similar results have been reported elsewhere, with negatively phrased questions receiving lower values on average than positively phrased items [13]. Nevertheless, as we used only one negatively phrased question, it is unclear whether our findings are due solely to the negative phrasing of the item, or to something specific about the content of the domain (i.e. loneliness) relative to the comparators.

Our final research question was whether participants’ individual characteristics influenced the way response options were interpreted. The overall effect of individual characteristics was substantially smaller than the effect of domain context, which was a positive finding. However, some characteristics made a difference, and these may have a cumulative impact when multi-item PROMs are completed. Women and older respondents tended to report significantly higher values for a number of response options, especially towards the top of the scale, and this should be taken into consideration in research involving mixed samples. Of particular interest was the effect of ESL on response option interpretation. The labels ‘only occasionally’ (but not ‘occasionally’), ‘somewhat’, and ‘quite a bit’ were interpreted significantly differently by respondents with ESL, with the variation in the interpretation of ‘quite a bit’ being significantly greater for respondents with ESL than for those without. Interestingly, the numeric value of ‘quite a bit’ (without a specific domain context) was found to be higher in a Swedish study (mean = 73.5) [23] than in our findings. Additionally, ‘quite a bit’ was also found to have a higher numerical value (mean = 75.1) in an international study testing the translation equivalence of the SF-36 in different countries [28], with the interpretation of ‘quite a bit’ varying from country to country [24, 33]. These previous studies show that country and translation might have an impact on the interpretation of response options, whereas our study further explored the effect of ESL on the understanding of response options within the same language.

We acknowledge some limitations of the present study. As the survey was administered online through a commercial panel, the data are subject to concerns over quality. This risk was mitigated by keeping the survey short, careful design (and piloting) of the survey, and dropping logically inconsistent responses to individual questions, as well as excluding entirely respondents with three or more inconsistencies. As a consequence of keeping the survey short, individuals interpreted response options for only one of the three domains. This meant that comparisons between questions were based on different individuals; while this had the advantage that the values given were not affected by potential ordering effects, it is a limitation when making comparisons. Furthermore, only three domains were explored, which limits the interpretation and generalisation of the findings. The quantitative interpretation of the response options relied upon a visual analogue scale (VAS; our slider), which has well-known biases, such as end aversion [34]; the extent to which such biases may interact here with individual characteristics or question domain is unknown.

This study has shown that while respondents quantitatively interpret common response options in a logical way, this interpretation may differ systematically depending on the domain being measured and on certain individual characteristics, and this should be taken into account in PROM design and analysis. Several recommendations can be made based on our findings. First, in PROM design, it is sensible not to mix negatively and positively phrased domain items within the same measure [35]. This is particularly the case within the same multi-item scale (or set of items that are combined to calculate a score). It may be possible to include consistent sets of positively and negatively worded items within a domain (or subscale) of a PROM, when those domains are not then combined to form a total score. However, if these domains use the same response options, then PROM developers should be aware that the same scale may be interpreted differently across domains as a function of negative or positive wording. Further, if a PROM is intended to be valued for use in cost-effectiveness analysis (i.e. is to be ‘preference-based’ [15]), there is a particular need to avoid positively and negatively phrased items within the same PROM: prototypical items from different domains are typically used together in health state valuation exercises, and mixing positive and negative items may lead to a differential quantitative interpretation by respondents. Researchers following this advice will also need to consider whether they want to include wholly positively or wholly negatively phrased items during PROM design. This is an issue that, in our opinion, is best addressed through collaborative patient and public involvement and engagement work with the PROM’s target population and/or appropriate cognitive debriefing exercises.

Second, in PROM design and analysis, simple PROM scoring systems that rely on an assumption of interval properties of Likert response options should be avoided wherever possible [31]. Instead, methods such as IRT scoring can be used to adjust for uneven distances between response options. If the PROM is to be valued for use in cost-effectiveness analysis and requires item reduction, then the output from IRT analyses can be used to select items that produce the best spread across the latent scale.
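For illustration only (this is a general sketch of the approach, not a model used in this study), one widely applied IRT specification for ordered response options is the graded response model, in which the probability that a respondent with latent level $\theta$ endorses category $k$ or above of item $i$ depends on a discrimination parameter $a_i$ and category thresholds $b_{ik}$ that need not be evenly spaced:

$$\Pr(X_i \ge k \mid \theta) = \frac{1}{1 + \exp\left[-a_i\left(\theta - b_{ik}\right)\right]}, \qquad k = 1, \dots, m_i$$

The estimated thresholds $b_{ik}$ give an empirical spacing of the response options on the latent scale, rather than imposing equal intervals between them.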

Third, our findings can be used to inform the selection of response options for future PROM development, depending on the target research sample. Most successful PROMs are not designed to be used solely by people with English as a first language, and so researchers should consider their choice of response options carefully for interpretability by people with ESL during the design stage [16]. For example, if the sample includes participants with ESL, then researchers should consider avoiding the response options ‘somewhat’ and ‘quite a bit’.