Introduction

Patient-reported outcome measures (PROMs) are completed by patients (or proxies) to summarise patients’ evaluation of their health or health-related quality of life. PROMs are used to assess the impact of conditions and/or interventions in effectiveness studies and cost-effectiveness analyses, or to track changes in individual health in routine care [1, 2]. Evidence on their validity and reliability, for example as a function of mode of administration [3], relationship with other outcomes [4], and linguistic content [5, 6], is therefore of interest to those working in health care decision making.

PROMs consist of questions covering different domains (e.g. pain, mental health or wellbeing, physical, social, role functioning, etc.). Patients report their level of symptoms or functioning using numeric rating scales (NRS), visual analogue scales (VAS), or verbal rating scales (VRS) [7, 8]. Response options can be based on how frequently patients experience a symptom or have problems with functioning, how severe the symptom is, or on levels of difficulty (e.g. with functioning) [9]. Finally, PROM questions can also be phrased as agreement scales, for example, from strongly agree to strongly disagree [10].

The type of response option used is related to the concept being measured. For example, experience of symptoms is usually linked to either frequency or severity scales. Summary scores or weighted index values can be generated from the PROM, which can be used to assess health [11, 12]. However, verbal response options are considered to be vague quantifiers, as they rely on respondents’ interpretation of terms such as ‘sometimes’ or ‘occasionally’ [13, 14]. The vagueness of verbal quantifiers is potentially problematic in the assessment of health-related quality of life, which often relies heavily on them.

Vague quantifiers are also problematic when PROMs are used by health economists to elicit health state utility values, which play an important practical role in cost-effectiveness analysis to determine health care resource allocation [15]. As response levels are displayed independently of the other response options in utility elicitation, it is important for PROMs to have response choices that are clear and can be interpreted consistently over time, across contexts, and between people [16]. Within each type of response option, there are variations in the number of options and in the qualitative labels used to distinguish between them. There are ongoing methodological uncertainties around potential differences in the interpretation of response options [13] and concerns as to whether respondents can clearly distinguish between different numbers of response categories [17, 18].

For frequency response options, the relationship between participants’ numerical estimates and corresponding linguistic terms (e.g. ‘often’, ‘some of the time’, ‘seldom’) has been explored to understand the order and the degree of difference between displayed options [13, 19,20,21,22]. While similar analysis has also been conducted on severity response options (e.g. ‘very much’, ‘quite a bit’, ‘some’) [23,24,25], little has been investigated regarding response options that quantify difficulties (e.g. ‘a little difficulty’, ‘moderate difficulty’) [26].

How participants assign a quantitative value to qualitative response options (e.g. how many times an event has to have happened to match the label ‘often’ or ‘sometimes’) is not clearly understood. Additionally, evidence on whether such interpretation varies across different contexts or domains remains scarce. The interpretation of response options could also be heterogeneous across subpopulations, for example, with regard to health, language, and cultural background. Whether a uniform questionnaire can provide a generally consistent measure across different groups requires exploration [27, 28].

This project aimed to explore how respondents quantitatively interpret common PROM response options. This was in part to support ongoing instrument development work of a new preference-based measure of health and wellbeing (the EQ-HWB; https://scharr.dept.shef.ac.uk/e-qaly/) and the choice of questions and response options investigated in this study was linked to those being considered for inclusion within this broader project. Three key questions were addressed: 1) whether the quantification of different response options reflects their intuitive or linguistic ordering; 2) whether response options to questions assessing different domains are interpreted differently; and 3) whether individual characteristics, such as age and having English as a second language (ESL), influence the way response options are interpreted.

Methods

Sample

Adult residents in the UK were recruited via the Prolific online panel [29] in January 2019 (n = 1401), pre-screened to cover a spread of age (18–47 versus 48+ years old), gender, and ESL. No formal a priori sample size estimation was undertaken. Participants received £1.20 for completing the online survey.

Survey

Ethics approval was granted by the host institution’s ethics committee. Figure 1 illustrates the survey flow. Following consent, participants provided background characteristics, including their age, gender, ethnic group, highest educational qualification, health, and any chronic mental or physical health problems (see the full survey in Online Resource 1). Participants were randomised to one of three domains (loneliness, happiness, or activities) and asked a series of questions about the quantitative interpretation of commonly used PROM response options. Each participant thus received a question stem corresponding to one of these domains: either (1) a negatively phrased question related to social functioning, ‘I felt lonely’; (2) a positively phrased mental health/wellbeing question, ‘I felt happy’; or (3) an activity/role functioning question, ‘I was able to do the things I wanted to do’.

Fig. 1
figure 1

Schematic of survey flow. Respondents were randomised to answer questions on one of three health related quality of life domains (happiness, loneliness, or activities), and then to either the ‘only occasionally’ or ‘occasionally’ frequency response option. Frequency slider response options included: (‘none of the time’), ‘only occasionally’/’occasionally’, ‘sometimes’, ‘often’, ‘most of the time’, (‘all of the time’). Severity slider response options included: (‘not at all’), ‘a little bit’, ‘some’, ‘somewhat’, ‘quite a bit’, ‘very much’. Difficulty slider response options included: (‘no difficulty’), ‘slight difficulty’, ‘some difficulty’, ‘a lot of difficulty’, (‘unable’)

Following stratification by one of the three questions, participants were further randomised such that half were given the response option ‘occasionally’ to interpret and half were given ‘only occasionally’. Given that ‘occasionally’ and ‘only occasionally’ always fell in the same position within the response options, this randomisation enabled testing of whether the actual wording of the response option made a difference beyond its ranked order. ‘Occasionally’ and ‘only occasionally’ were tested in separate arms in part to inform selection of response options for the new measure.

First, participants were asked to respond to the question ‘Thinking about how things have been over the last 7 days…’ with the stem dependent on question randomisation (e.g. ‘I felt lonely’), using a 5-point scale (ranging from ‘none of the time’ to ‘most or all of the time’; for the full response scales to all survey questions see Online Resource 1). Participants were then asked to provide a quantitative interpretation of their own response to this question, based on the number of days over a 7-day period they thought the response best referred to, on an 8-point scale (ranging from ‘not even once in the last 7 days’ to ‘seven or more times in the last 7 days’). The same quantification question was then asked for one other, randomly selected response option in addition to their own answer.

In order to observe what quantitative values participants assigned to each response option, relative to one another on the same scale, participants then completed three slider tasks. First, for each of a set of randomly ordered frequency response options (‘occasionally’ or ‘only occasionally’, ‘sometimes’, ‘often’, ‘most of the time’), participants were asked to assign a numeric value between 0 and 100 using a slider (0 = none of the time, 100 = all of the time).

Second, they were asked to assign values to a set of randomly ordered severity response options (‘a little bit’, ‘somewhat’, ‘some’, ‘quite a bit’, ‘very much’) on a similar 0 to 100 slider scale (0 = not at all; 100 was undefined). In both cases, these questions related to the domain to which the participant had been randomised (e.g. loneliness). For the frequency slider the top anchor ‘all of the time’ is intuitive; however, there is no comparably clear top anchor for a severity scale that can be applied across all three domains. The top of the severity scale was therefore left undefined, both to avoid introducing potential focusing effects from terms that are not usually part of the response option set and to allow respondents to place ‘very much’ at the top of the scale should they wish.

Third, all participants responded to a question with a difficulty response option, linked to mobility: ‘Thinking about how things have been over the last 7 days… How well were you able to get around outside?’ on a 5-point scale (ranging from ‘no difficulty’ to ‘unable’). After this, participants were asked to assign values to the response options ‘a lot of difficulty’, ‘some difficulty’ and ‘slight difficulty’ on a 0 to 100 slider scale (0 = no difficulty, 100 = unable).

To help with data quality, the survey was designed to be short (less than 10 min); respondents were timed out after 30 min. The research team included a patient researcher who supported the design of the study, including the survey, and provided input into the clarity and content of the study information sheet.

Data quality

We judged a respondent to have answered in a logically inconsistent way when they gave a quantitative answer for the bottom response option (e.g. ‘only occasionally’ or ‘none of the time’) that was equal to or higher than their response for the top response option (e.g. ‘most of the time’). Inconsistent responses were dropped from the group of questions in which the inconsistency was identified (i.e. frequency quantification in number of times over 7 days, frequency slider, severity slider, or difficulty slider). However, for the question about the number of times over 7 days, answers for the bottom and top response options that were both at the top end of the scale (7 or more times) were not considered inconsistent, owing to the upper censoring of the scale. In addition, we excluded entirely individuals with three or more inconsistencies, as these respondents were considered not to have paid attention to or understood the tasks. An additional analysis was conducted with all respondents with any inconsistencies dropped, to explore the effect on the results (see Online Resource 4).
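As an illustration of this cleaning logic only, the sketch below flags inconsistencies per task group and identifies respondents for exclusion; the column names and group labels are hypothetical and do not correspond to the actual study dataset.

```python
import pandas as pd

def flag_inconsistencies(df: pd.DataFrame) -> pd.DataFrame:
    """Flag logically inconsistent responses per task group (illustrative sketch).

    Assumes hypothetical columns holding each respondent's quantitative answers,
    e.g. 'freq_slider_bottom' / 'freq_slider_top' for the lowest and highest
    frequency response options on the 0-100 slider.
    """
    flags = pd.DataFrame(index=df.index)

    # A group is inconsistent when the bottom option is rated equal to or
    # higher than the top option.
    for group in ["freq_days", "freq_slider", "sev_slider", "diff_slider"]:
        flags[group] = df[f"{group}_bottom"] >= df[f"{group}_top"]

    # Exception: for the 'number of times over 7 days' question, answers at the
    # censored top of the scale (7 or more times) for both options are allowed.
    top_censored = (df["freq_days_bottom"] >= 7) & (df["freq_days_top"] >= 7)
    flags["freq_days"] = flags["freq_days"] & ~top_censored

    # Inconsistent answers are dropped within the affected group only, while
    # respondents with three or more inconsistencies are excluded entirely.
    flags["exclude_respondent"] = flags.sum(axis=1) >= 3
    return flags
```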

Data analysis

Descriptive analyses were conducted to document characteristics of respondents and to compare results of the slider and frequency quantification questions for different response options across different domains (e.g. happiness versus loneliness). Differences were tested using Fisher’s exact test (for medians) and unpaired t-tests (for means). We also explored descriptively the relative gaps between mean responses across the response options (no statistical test performed) and the variability of respondents’ answers for each slider response (using a variance comparison test) to indicate the consistency of interpretation of the options across our sample.
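For illustration, the comparisons described above could be implemented as in the following sketch (Python/scipy; group labels and variable names are assumptions). The two-sided variance comparison returns a p-value of the form 2*Pr(F < f), matching the notation used in the Results.

```python
import numpy as np
from scipy import stats

def compare_groups(x, y):
    """Illustrative comparisons of two groups of slider values (e.g. happiness vs loneliness)."""
    # Unpaired t-test for a difference in means.
    t_stat, t_p = stats.ttest_ind(x, y)

    # Median comparison via a 2x2 table of counts above/below the pooled median,
    # evaluated with Fisher's exact test (one way of testing medians exactly).
    grand_median = np.median(np.concatenate([x, y]))
    table = [[np.sum(x > grand_median), np.sum(x <= grand_median)],
             [np.sum(y > grand_median), np.sum(y <= grand_median)]]
    _, fisher_p = stats.fisher_exact(table)

    # Variance comparison (F-test): two-sided p-value of the form 2*Pr(F < f).
    f = np.var(x, ddof=1) / np.var(y, ddof=1)
    cdf = stats.f.cdf(f, len(x) - 1, len(y) - 1)
    var_p = 2 * min(cdf, 1 - cdf)

    return {"t": t_stat, "t_p": t_p, "median_p": fisher_p, "f": f, "var_p": var_p}
```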

We used regression analysis to explore the combined impact of respondent characteristics and the domain of the question on the assigned values. We ran separate ordinary least squares (OLS) regressions that combined all slider answers relating to a particular response option (e.g. ‘sometimes’); this meant combining respondents from different arms of the study (for the frequency and severity response options, see Fig. 1). For each model, the value assigned to the response option was the dependent variable, with domain and respondent characteristics as the independent variables. We also ran the frequency and severity models without the domain variables to explore the extent of variance explained by individual characteristics alone. As each respondent was included only once in each model, no adjustment for clustering of standard errors was needed. Ideally, respondent characteristics should have no impact on interpretation of the labels; therefore, small coefficients and low variance explained by the model were preferred. We did not, however, have any a priori effect sizes against which to judge these effects.
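A minimal sketch of one such model is given below, using statsmodels formula syntax; all column names are hypothetical and the model is illustrative rather than a reproduction of the study’s exact specification.

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_response_option_model(survey_df: pd.DataFrame, option: str = "sometimes_slider"):
    """Fit an OLS model for one response option's slider value (illustrative sketch).

    `survey_df` is assumed to hold one row per respondent, with the slider value
    for the response option plus the randomised domain and respondent
    characteristics; all column names here are hypothetical.
    """
    formula = (
        f"{option} ~ C(domain) + age + female + degree + esl"
        " + mental_health_problem + physical_health_problem"
    )
    # One observation per respondent, so no clustering adjustment is needed.
    return smf.ols(formula, data=survey_df).fit()

# Small coefficients and a low R-squared are the desirable outcome here, since
# respondent characteristics should ideally not affect label interpretation.
```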

Results

Inconsistency checks

Of 1401 survey completions, 229 had one inconsistency, 41 had two, and 24 had three or more. The latter were dropped, leaving a valid sample of 1377 participants. Most remaining inconsistencies occurred in interpreting the ‘difficulty’ response options linked to the mobility question (15.6% of the remaining sample had an inconsistency on this question). For those randomised to the happiness and activities domains, the slider for the difficulty question had reversed anchors, such that the right-hand end was the most negative (i.e. greater problems with mobility); this contrasts with earlier questions, in which the right-hand anchor was the most positive (e.g. more happiness). Accordingly, inconsistencies on the difficulty question were more common in the happiness (17.5% of this sub-sample) and activities (22.9% of this sub-sample) groups than in the loneliness group (6.5% of this sub-sample).

Respondent characteristics

Basic characteristics of the valid sample are in Table 1. Of the 1377 included completions, 53.1% were female, 37.5% reported a health condition, and 85.6% reported English as their first language. The mean age of the sample was 42.4 years (SD = 14.0), with a minimum of 18 and maximum of 86 years. On average, participants completed the survey in 6.9 min (SD = 3.4), with a minimum of 2.2 and a maximum of 30.2 min. Respondents were evenly randomised into three groups (i.e. happiness n = 458, loneliness n = 460, and activity n = 459), and there were no significant differences in characteristics between groups (see Table 1).

Table 1 Respondents’ characteristics

Response option interpretation based on number of times experienced

Table 2 shows the median number of times over the last 7 days that participants reported their own selected response option referred to, and the same for a randomly specified ‘other’ response option. The median responses increase in line with expectations.

Table 2 Median number of times over the last 7 days that participants reported each possible frequency response referred to, split by ‘occasionally’ and ‘only occasionally’ arm

Comparisons between the ‘only occasionally’ arm and the ‘occasionally’ arm found significant differences in interpretations of the responses to the ‘felt happy’ question for ‘occasionally’/‘only occasionally’ (p = 0.032, one-tailed Fisher’s exact test) and for ‘sometimes’ (p = 0.003, one-tailed Fisher’s exact test). All other comparisons between these arms were not significant. This indicates that the interpretation of ‘sometimes’ for the happiness question is lower when ‘only occasionally’ is presented as the neighbouring response choice.

Response option interpretation based on sliders

Figure 2 shows the mean answer on the sliders by question, for each response option (see Online Resource 2 for a table of these data). ‘Only occasionally’ is significantly lower than ‘occasionally’ for all three questions (unpaired t-test: lonely t = − 5.698, p < 0.001; happy t = − 5.364, p < 0.001; activities t = − 5.798, p < 0.001). For the other response options, those presented within the ‘only occasionally’ versus ‘occasionally’ arm were not significantly different from one another, with one exception (‘most of the time’ within the activities domain; unpaired t-test, t = − 2.562, p = 0.005), and consequently answers to the sliders are shown combined across these two arms (except for the ‘only occasionally’/‘occasionally’ option).

Fig. 2
figure 2

Mean value of slider responses for frequency and severity response options. Legend: value attributed by participants to each response option in a slider task across the ‘activities’, ‘happiness’ and ‘loneliness’ quality of life domains. Response options were presented simultaneously in slider tasks with either a frequency (‘occasionally’ or ‘only occasionally’, ‘sometimes’, ‘often’, ‘most of the time’) or severity (‘a little bit’, ‘somewhat’, ‘some’, ‘quite a bit’, ‘very much’) response option scale

The graph shows a significantly lower quantitative interpretation of response options in the loneliness domain compared to the happiness and activities domains (as indicated by non-overlapping 95% confidence intervals), with the exception of ‘most of the time’, where the confidence intervals overlap between the loneliness and happiness domains.

The ordinal interpretation of response options is in line with expectations. ‘Occasionally’ was interpreted as quantitatively greater than ‘only occasionally’. There are some interesting differences between the frequency and severity terms: ‘a little bit’ is quantitatively interpreted as being closer to ‘only occasionally’ than to ‘occasionally’. ‘Often’ was given a higher score than the fourth severity category, ‘quite a bit’. The response option ‘some’ was interpreted similarly to ‘somewhat’, although the SD for ‘somewhat’ was higher in all three domains (variance comparison test: happy f = 0.821, 2*Pr(F < f) = 0.039; activities f = 0.630, 2*Pr(F < f) < 0.001; lonely f = 0.703, 2*Pr(F < f) < 0.001). There is a greater distance between the third (‘often’/‘quite a bit’) and second (‘sometimes’/‘somewhat’) response options than between other neighbouring options.

The slider responses to the difficulty mobility question (i.e. ‘I was able to get around outside with…’) show that the terms ‘a lot of difficulty’ (M = 85.7, SD = 10.6), ‘some difficulty’ (M = 49.5, SD = 17.2), and ‘slight difficulty’ (M = 26.7, SD = 16.2) were interpreted broadly as expected.

Response option interpretation based on regression analysis

Table 3 shows the predictive models for the slider responses to frequency and severity response options regressed on respondent characteristics across the three randomised domains (i.e. loneliness, happiness, activities). Respondents gave statistically significantly higher values to all response options in the happiness and activities domains than in the loneliness domain, and these effects remained after individual characteristics were controlled for.

Table 3 OLS regression results for respondent characteristics predicting slider responses to frequency and severity response options

Older participants tended to give more polarised responses, giving a lower value for lower anchored response options (e.g. ‘occasionally’) and a higher value for higher anchored response options (e.g. ‘often’) than younger participants. Women gave higher values for some of the response options at the top of the scale (i.e. ‘sometimes’, ‘often’, ‘most of the time’, ‘very much’). Participants with a degree gave lower values for ‘a little bit’, ‘occasionally’ and ‘only occasionally’. Those disclosing a mental health problem gave a higher value for ‘quite a bit’ than those not disclosing a problem.

Finally, participants with ESL reported a higher value for ‘only occasionally’ and lower values for ‘quite a bit’ and ‘somewhat’. Tests of equality of variance between slider values from participants with English as a first language and those with ESL found significantly greater variance for ESL respondents for ‘quite a bit’ across all three domains (variance comparison test: happy f = 0.422, 2*Pr(F < f) < 0.001; lonely f = 0.519, 2*Pr(F < f) < 0.001; activities f = 0.511, 2*Pr(F < f) < 0.001) and for ‘very much’ in the loneliness domain (f = 0.489, 2*Pr(F < f) < 0.001). No other significant differences were observed. Overall, minimal variance in slider values is explained by respondent characteristics. When question type is not included as a covariate, the highest adjusted R-squared is 0.034 for the severity responses and 0.024 for the frequency responses (see Online Resource 3). Accordingly, the biggest variation in responses is driven by context (i.e. the domain being measured).

Table 4 shows how well respondent characteristics predict the slider responses for the difficulty response options linked to the mobility question. Only a small amount of the variation in slider responses was explained by respondent characteristics (between 0.4 and 3.0%). Women provided significantly higher slider values (i.e. closer to ‘unable’) for all levels of difficulty. Those declaring a physical health problem, and older respondents, interpreted ‘slight difficulty’ as quantitatively lower (i.e. closer to ‘no difficulty’). Whether a respondent had caring responsibilities was not significantly related to any slider values and was therefore not included as a covariate in any of the final models.

Table 4 OLS regression results for respondent characteristics predicting slider responses to difficulty response options in the mobility domain

Supplementary analyses excluding any respondent with one or more inconsistency resulted in slight changes to the significance level of some of the individual characteristics in the frequency and severity models, most notably age and education (see Online Resource 4), but overall findings remained consistent.

Discussion

This study addressed three key questions. Initially, we explored whether respondents’ quantification of different response options reflected their intuitive or linguistic ordering. In general, this was the case, suggesting that the assumed qualitative ordering of these common response options (when presented together) has underlying validity. Nevertheless, there are further takeaways. First, in previous studies, respondents tended to spread out all response options on a numerical scale when evaluating them simultaneously [25, 30]. Therefore, the same response option might be given a different numerical value when the other options presented alongside it vary. One strength of this study is that respondents were allocated to different arms through randomisation. As a result, we can infer that the labels given to response options influenced numerical interpretation beyond the positioning or order of the options shown. Despite being in the same position within the choice set, ‘occasionally’ was given a higher value by respondents than ‘only occasionally’. Similarly, the values across frequency and severity response options differed, with values for ‘often’ being above those for ‘quite a bit’, despite the fact that both were ordered in fourth position within their respective response option scales.

Second, there was greater quantitative differentiation between some terms than others (e.g. see Fig. 2). For example, the difference between ‘a little bit’ and ‘somewhat’ was smaller than the difference between ‘somewhat’ and ‘quite a bit’. Further, the distance between ‘sometimes’ and ‘occasionally’ was smaller than the distances between other neighbouring response options. The similarity of interpretation of these two terms has been found elsewhere: Spector [30], in an exercise with students from the University of South Florida, found that ‘sometimes’ and ‘occasionally’ were given the same overall ranking. Our results suggest that people do not simply apply an interval approach when rating a finite number of response options on the same scale, which has implications for scale design and analysis. Rather than interval-based scoring systems for PROMs, this may support the use of scoring systems that draw upon other information to determine the relative score for a response option, such as item response theory (IRT) or preference-based scoring [31, 32].

Third, as Fig. 2 shows, there is a clear gap in the middle of the quantitative rating scale for the selection of response options tested in this study. This suggests that none of the options tested was well suited to serve as the mid-point of a Likert scale, and it prompts the need for further testing of response options that may better fulfil this role within PROM design (e.g. ‘half the time’).

The second research question was whether domain affected the quantitative value participants placed on response options. We found some support for this. The loneliness domain, which featured a negatively worded question stem (i.e. more is worse), resulted in response options having a lower numerical interpretation than in the two positively worded happiness and activities domains. Similar results have been reported elsewhere, with negatively phrased questions receiving lower values on average than positively phrased items [13]. Nevertheless, as we used only one negatively phrased question, it is unclear whether our findings are due solely to the negative phrasing of the item, or to something specific about the content of the domain (i.e. loneliness) relative to the comparators.

Our final research question was whether participants’ individual characteristics influenced the way response options were interpreted. The overall effect of individual characteristics was substantially smaller than the effect of domain context, which was a positive finding. However, some characteristics made a difference, and these may have a cumulative impact when multi-item PROMs are completed. Women and older respondents tended to report significantly higher values for a number of response options, especially towards the top of the scale, and this should be taken into consideration in research involving mixed samples. Of particular interest was the effect of ESL on response option interpretation. The labels ‘only occasionally’ (but not ‘occasionally’), ‘somewhat’, and ‘quite a bit’ were interpreted significantly differently by respondents with ESL, with the variation in the interpretation of ‘quite a bit’ being significantly greater for respondents with ESL than for those without. Interestingly, the numeric value of ‘quite a bit’ (without a specific domain context) was found to be higher in a Swedish study (mean = 73.5) [23] than in our findings. Additionally, ‘quite a bit’ was also found to have a higher numerical value (mean = 75.1) in an international study testing the translation equivalence of the SF-36 in different countries [28], with the interpretation of ‘quite a bit’ varying from country to country [24, 33]. These previous studies show that country and translation might have an impact on the interpretation of response options, whereas our study further explored the effect of ESL on the understanding of response options within the same language.

We acknowledge some limitations of the present study. As the survey was administered online through a commercial panel, the data are subject to concerns over quality. This risk was mitigated by keeping the survey short, careful design (and piloting) of the survey, and dropping logically inconsistent responses to individual questions, as well as excluding entirely respondents with three or more inconsistencies. As a consequence of keeping the survey short, individuals interpreted response options for only one of the three domains. This meant that comparisons between questions were based on different individuals; while this had the advantage that the values given were not affected by potential ordering effects, it is a limitation when making comparisons. Furthermore, only three domains were explored, which limits the interpretation and generalisation of the findings. The quantitative interpretation of the response options relied upon a visual analogue scale (VAS; our slider), which has well-known biases, such as end aversion [34]; the extent to which such biases may interact here with individual characteristics or question domain is unknown.

This study has shown that while respondents quantitatively interpret common response options in a logical way, this interpretation may differ systematically depending on the domain being measured and on certain individual characteristics, and this should be taken into account in PROM design and analysis. Several recommendations can be made based on our findings. First, in PROM design, it is sensible not to mix negatively and positively phrased domain items within the same measure [35]. This is particularly the case within the same multi-item scale (or set of items that are combined to calculate a score). It may be possible to include consistent sets of positively and negatively worded items within a domain (or subscale) of a PROM, when those domains are not then combined to form a total score. However, if these domains use the same response options, then PROM developers should be aware that the same scale may be interpreted differently across domains as a function of negative or positive wording. Further, if a PROM is intended to be valued for use in cost-effectiveness analysis (i.e. is to be ‘preference-based’ [15]), there is a particular need to avoid positively and negatively phrased items within the same PROM: prototypical items from different domains are typically used together in health state valuation exercises, and mixing positive and negative items may lead to a differential quantitative interpretation by respondents. Researchers following this advice will also need to consider whether they want to include wholly positively or wholly negatively phrased items during PROM design. This is an issue that, in our opinion, is best addressed through collaborative patient and public involvement and engagement work with the PROM’s target population and/or appropriate cognitive debriefing exercises.

Second, in PROM design and analysis, simple PROM scoring systems that rely on an assumption of interval properties of Likert response options should be avoided wherever possible [31]. Instead, methods such as IRT scoring can be used to adjust for uneven distances between response options. If the PROM is to be valued for use in cost-effectiveness analysis and requires item reduction, then the output from IRT analyses can be used to select items that produce the best spread across the latent scale.
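For illustration only (this is a general sketch of the approach, not a model used in this study), one widely applied IRT specification for ordered response options is the graded response model, in which the probability that a respondent with latent level $\theta$ endorses category $k$ or above of item $i$ depends on a discrimination parameter $a_i$ and category thresholds $b_{ik}$ that need not be evenly spaced:

$$\Pr(X_i \ge k \mid \theta) = \frac{1}{1 + \exp\left[-a_i\left(\theta - b_{ik}\right)\right]}, \qquad k = 1, \dots, m_i$$

The estimated thresholds $b_{ik}$ give an empirical spacing of the response options on the latent scale, rather than imposing equal intervals between them.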

Third, our findings can be used to inform the selection of response options for future PROM development, depending on the target research sample. Most successful PROMs are not designed to be used solely by people with English as a first language, and so researchers should consider their choice of response options carefully for interpretability by people with ESL during the design stage [16]. For example, if the sample includes participants with ESL, then researchers should consider avoiding the response options ‘somewhat’ and ‘quite a bit’.