Item selection occurs after the conceptual framework for a new measure has been established. For preference-based measures, the conceptual framework identifies independent domains, each requiring one (or potentially more) questions that will be used in valuation. This differs from profile measures, which may have several questions representing each domain. There is also the additional complexity that, although patients complete the preference-based measure, the valuation is often undertaken by members of the public who may or may not have any health condition. The criteria therefore need to be considered in relation to both those completing measures and those undertaking the valuation. In this article, we consider how standard item selection criteria should be modified and discuss some additional criteria which could usefully be considered to ensure items will be appropriate to take forward to valuation. The list has been derived from existing lists [4–8] and through presentations and discussion with researchers and advisors associated with the ‘Extending the QALY’ project (https://scharr.dept.shef.ac.uk/e-qaly/). This article has arisen out of a project to develop a new instrument (the EQ-HWB) to capture the impact of interventions on patients, carers and social-care users (hence capturing health, carer and social-care related quality of life), which could be used to derive the quality adjustment value of a state to estimate QALYs.
Accuracy and completion
Criteria 1–12 relate to accurate and complete responses and aim for questions which are “brief, clearly worded, easily understood, unambiguous and easy to respond to” [4].
The first three relate to ensuring items are easy to read. Streiner and Norman [5] recommend a reading age of not more than 12 years for patient-reported outcome measures (PROMs) [9]. Reading ease can be assessed using some combination of the number of words per sentence, number of syllables per word, ratio of complex words to easy words and number of characters per word (e.g. Flesch–Kincaid Grade Level, Gunning Fog Score, SMOG Index, Coleman–Liau Index, ARI). Many of these reading level assessments focus upon whole blocks of text, but for preference-elicitation tasks the item will be read alone and out of context, offering fewer contextual clues for the individual to draw upon to aid reading. This implies a need for the reading level of isolated items to be evaluated (sensitively) within qualitative work.
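As an illustration of scoring items in isolation, the following Python sketch computes the Flesch–Kincaid Grade Level, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59, for a single item. The syllable counter is a crude vowel-group heuristic assumed here for brevity, and the function names are illustrative; dedicated readability tools apply more careful rules.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic (an assumption of this sketch): count runs of vowels
    and drop one for a trailing silent 'e' (but not for '-le' endings)."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level for a short piece of text, e.g. one item."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


# Score each candidate item on its own rather than as a block of text.
items = ["I had no pain.", "I felt unable to concentrate."]
grades = {item: flesch_kincaid_grade(item) for item in items}
```

Very short items can produce grade levels below zero (the first item above scores about -2.23 under this heuristic), which is one reason formula-based scores for isolated items are best treated as a rough screen alongside qualitative work.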
Bradburn et al. [8] note that ill patients and some older people in particular may be confused by long, complicated sentences, suggesting a need to keep items short whenever possible (Criterion 2). When response options are built into the question through repeating a core question stem (e.g. I had no pain, I had slight pain, I had moderate pain etc.), respondents may get frustrated at being asked to read repetitive text. However, when the endorsed statement is used within a health state classification (see Fig. 1), the built-in style has the advantage that the exact wording can be shown within the description of the state to be valued; therefore, what is valued is also what is described by the full measure.
Double negatives [8] make items difficult to understand and to complete (Criterion 3). Double negatives may be created by the choice of response options, e.g. ‘I felt I had no control – none of the time’. When respondents are completing measures, the full set of options will be visible (e.g. for frequency options: most of the time, often, occasionally etc.) and the double negative may be less problematic. However, once the state is described for valuation, the item appears alongside only the selected response option, which makes the double negative clash more apparent.
Criterion 4 seeks to avoid items which are potentially ambiguous because they are too complex, vague, hard to interpret, or could be interpreted in different ways. For example, a term such as ‘satisfied’ may be problematic as it can be interpreted as a positive or a neutral state. Ambiguity also arises where the individual’s specific context or circumstances affect their interpretation of items; for example, ‘being able to communicate’ may relate to social media use for some young people, presentation skills for some working-age individuals, or the ability to be understood when speaking for post-stroke patients.
Avoiding specialist terminology or jargon (Criterion 5) also relates to ambiguity, since some respondents may interpret an item in its specialist sense and others may not, or may not understand it at all. This includes medical terminology, which some respondents may link to specific meanings or diagnostic criteria while others apply a more colloquial interpretation. For example, if asked about ‘being depressed’, some respondents may only respond positively if they have a diagnosis of depression. Colloquial items (Criterion 6) such as ‘feeling down in the dumps’ are also likely to be problematic for consistent interpretation and translation, and may date quickly.
Respondents may find it difficult to answer items where two or more questions are asked at the same time [5] (Criterion 7). For the patient completing the questionnaire, there may be a conflict between the different components of an item. It is also problematic for valuation, as those undertaking valuation may focus on one part of the question, e.g. the SF-6D item ‘You feel tense or downhearted and low’ [10]. In valuation of the EQ-5D item on pain/discomfort (‘I have moderate pain or discomfort’), valuation has been found to focus mainly on pain [11].
Some questions use two or more component parts to help clarify the meaning of a single construct. For example, in the question ‘I was able to focus and concentrate’, both parts ask about the same domain. Whilst this may help convey clarity of meaning, it may still raise problems if one part of the compound taps into mild problems and the other taps into more severe problems. It will not then be clear which level of severity should be focused upon in the valuation. There may be a trade-off between this criterion and the need to communicate the concept of interest adequately, which may be best done through the use of additional terms.
The next three criteria promote high completion rates across all relevant groups. Highly personal or intrusive items (Criterion 8), such as those on suicidal ideation or problems with sexual activity, may lead to missing values or annoy or upset responders. Bradburn et al. [8] note that potentially embarrassing or offensive questions, if necessary, should be included at the end with an opt-out provided. However, algorithms to generate a quality adjustment value require complete data across all items; hence, missing data should be avoided.
Consideration needs to be given to the ethical issues around asking specific questions of vulnerable groups (Criterion 9), for example, individuals caring for very sick or dying loved ones, individuals with severe physical and mental health problems, or individuals with very limited remaining life expectancy. How completing the question is likely to make people feel is a legitimate concern at the item selection stage. Asking positively framed questions (e.g. ‘how satisfied are you with your life?’) may be insensitive for those in very difficult circumstances or for those close to the end of life. Asking particularly negative questions (e.g. ‘I felt like a failure’) may spark upsetting feelings or thoughts for responders. Asking questions relating to safeguarding concerns (such as suicidal ideation) may be problematic where no clinical follow-up is incorporated.
Criterion 10 relates to avoiding items that refer to a particular circumstance, situation or lifestyle that may not be universal across all future responders. This includes avoiding questions which refer to spouses or families, to employment, or to particular activities or circumstances which might not be relevant to all (e.g. questions about working or sexual activity). If the domain is not relevant, this may result in missing data which, as noted, would make it difficult to generate the quality adjustment value for the state.
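The dependence on complete data can be seen in a minimal sketch of a scoring algorithm. The item names, levels and decrements below are hypothetical and are not taken from any published value set, and the additive form is itself an assumption made for illustration.

```python
from typing import Mapping, Optional

# Hypothetical decrements per item level (level 0 = no problems).
# These numbers are invented for illustration only.
DECREMENTS = {
    "pain": [0.0, 0.05, 0.12, 0.25],
    "mobility": [0.0, 0.04, 0.10, 0.20],
    "work": [0.0, 0.03, 0.08, 0.15],
}


def quality_adjustment(responses: Mapping[str, Optional[int]]) -> Optional[float]:
    """Return 1 minus the summed decrements, or None if any item is unanswered."""
    total = 0.0
    for item, levels in DECREMENTS.items():
        level = responses.get(item)
        if level is None:
            # A single skipped item leaves the whole state without a value.
            return None
        total += levels[level]
    return 1.0 - total


complete = quality_adjustment({"pain": 1, "mobility": 0, "work": 2})
# A responder for whom the 'work' item is not applicable and is left blank.
incomplete = quality_adjustment({"pain": 1, "mobility": 0, "work": None})
```

In this sketch the second responder receives no value at all, even though two of the three items were answered, which is why items that are not applicable to some responders are costly in a preference-based measure.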
Criterion 11 recommends that items do not draw upon another piece of knowledge, such as what other people think, as such items may be difficult to complete if the responder is not confident in that knowledge, e.g. ‘other people care about me’, ‘I am a burden to others’. This also raises the problem of attribution (discussed below): in valuation, it is unclear whether it is the actual burden to others that is valued or the individual’s experience of feeling like a burden.
Criterion 12 recommends caution around potentially value-laden or judgmental questions which may lead to socially desirable responding [12]. The tone of the question should be neutral to avoid respondents trying to conform to social norms or present themselves in a good light. That said, many instruments may draw on a theory of quality of life which does contain normative judgments. For example, a question on extent of social contact may be included on the basis that more social contact is assumed to be an improvement in QoL (even though some people may not agree with this). Normative judgments within a measure should be clear and transparent and not arise accidentally through choice of item.
Ensure items are suitable for measuring QALYs
Criteria 14–19 relate to the need to avoid items that will be unsuitable for estimating the quality adjustment component used in the calculation of QALYs. QALY calculations require an assumption that states can be valued independently of their duration and of their position within a sequence of states [13]. The quality adjustment for a state is assumed to be inter-personally and inter-temporally comparable. Consequently, it is important that each item clearly taps into the specified time period and does not rely upon comparisons to other people or other time periods.
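Under these assumptions, the QALYs for a sequence of states can be sketched as the sum, over states, of the quality adjustment value multiplied by the time spent in the state. The function name and the example values below are illustrative only.

```python
def qalys(profile):
    """Sum value x duration over a sequence of (value, years) pairs."""
    total = 0.0
    for value, years in profile:
        if years < 0:
            raise ValueError("duration must be non-negative")
        total += value * years
    return total


# Two years in a state valued at 0.8, then one year in a state valued at 0.5.
sequence = [(0.8, 2.0), (0.5, 1.0)]
total = qalys(sequence)

# Each state is valued independently of its position in the sequence,
# so reversing the order leaves the total unchanged under this model.
reversed_total = qalys(list(reversed(sequence)))
```

An item whose meaning shifts with the comparator or the time frame the respondent has in mind would undermine this calculation, because the same value could then stand for different states in different people or periods.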
Inter-personal comparisons rely upon the absence of differential item functioning (DIF) [14] (Criterion 14). DIF is present when sub-groups, despite having the same underlying level of an attribute, answer an item differently, either by a consistent amount across levels of the domain (uniform DIF) or by an amount that varies across levels of the domain (non-uniform DIF). For example, crying questions can be answered differently by men and women even when they have the same level of depression [15]. DIF may arise when different groups interpret items in different ways.
Psychometric analysis can test for DIF across different groups where there is a hypothesized reason for exploring this difference (e.g. age, gender and ethnicity). One problem with QoL items arises because potentially relevant domains may also be symptoms of certain health conditions. Where items tap into specific symptoms, we may expect to see DIF for that patient group. For example, feelings of hopefulness may be part of a full QoL and general positive affect questionnaire, yet a lack of hopefulness is also a symptom of depression. Hence findings of DIF need to be interpreted with caution.
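One common screening approach, assumed here purely for illustration and not necessarily the analysis the cited work has in mind, is the Mantel–Haenszel common odds ratio. Respondents are stratified by a matching variable (for example, bands of total score), and the odds of endorsing the item are compared between a reference group and a focal group within each stratum. Values near 1 suggest little uniform DIF. The record layout and group labels below are hypothetical.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """records: iterable of (group, stratum, endorsed), where group is
    'ref' or 'focal', stratum is the matching score band and endorsed
    is a bool. Returns the Mantel-Haenszel common odds ratio."""
    # Per stratum: [ref endorsed, ref not, focal endorsed, focal not]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, stratum, endorsed in records:
        if group not in ("ref", "focal"):
            raise ValueError(f"unknown group: {group!r}")
        index = (0 if group == "ref" else 2) + (0 if endorsed else 1)
        tables[stratum][index] += 1

    numerator = denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio undefined for these data")
    return numerator / denominator
```

A ratio well above or below 1 only flags the item for scrutiny; as the paragraph above notes, the difference may reflect a genuine symptom of the condition rather than a flaw in the item.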
Inter-personal comparability also includes avoiding items that make comparisons to other people directly (Criterion 15) or to expectations (Criterion 16). Items that make comparisons to others depend upon whom the individual chooses as a comparator and therefore conflict with the need for inter-personal comparability, e.g. the response to ‘I felt just as good as other people’ depends on which ‘other people’ the respondent has in mind. Where one item adopts a comparison approach and others do not, we would expect to identify DIF on that item. Given the self-complete nature of items, reliance on individual expectations may not be avoided entirely; however, items which are less likely to draw on individual expectations will be preferred.
Items which ask the respondent to make a comparison to another time period or to ‘usual’ are not suitable as they depend upon what the past or ‘usual’ is like for the individual (Criterion 17), e.g. ‘I’m bothered by things that don’t usually bother me’.
Criterion 18 focuses on the need for items which clearly link to a specified time period. Items will not be suitable if they refer (directly or via the respondent’s interpretation) to the recent or distant past beyond the specified time period, or to the future. For example, an item using the term ‘life’, as in ‘how good is your life’, may not lend itself to a confined time period. Including a specific time period for consideration in the preamble to the question may not overcome the time-period framing created by the item. This includes items which could be interpreted as referring to a personality trait that extends beyond the specified time period, e.g. ‘I had a bad temper’. If the items refer to the last seven days, then the respondent’s answer should be based only on their judgment of the last seven days.
Criterion 19 seeks to avoid items, and underlying constructs, for which there may be disagreement about whether a better response option always represents a better quality of life. This can be explored in qualitative work. For example, people may disagree as to whether more self-confidence, control or independence is always a good thing. Quantitative assessment using item response theory (IRT) or Rasch analysis can also be used to assess whether response options are ordered as expected.
Criteria 20–22 relate specifically to considerations of the valuation task. All items included within a classification for valuation should represent domains of life that are important, with supporting evidence from qualitative interviews or previous valuation studies. Those doing the valuation should be willing to trade off improvements in the domain against other domains (Criterion 20). This criterion may be hard to establish in advance of conducting valuation exercises, reinforcing the need for an iterative approach to the design of the measure.
Trade-offs which involve more than one person, such as those involving concern for others’ wellbeing, may be problematic. Consider, for example, the AQoL-8D item “How much of a burden do you feel you are to other people?” [16]. If improvement in that item is traded against deterioration in another item, it will not be clear whether the individual is valuing the feeling or experience of perceiving oneself to be a burden, or the actual burden or impact upon others.
Within a valuation task, it should be clear what is being valued. Items which attribute a decrement or problem to a particular circumstance will be problematic in this regard (Criterion 21). For example, an item of the form “because of X I am unable to do Y”, such as “because of my pain I am unable to see my friends”, is difficult to value due to uncertainty as to whether X (pain) or Y (seeing friends) is being valued. It is also problematic because respondents may not be able to attribute the cause accurately.
Criterion 22 relates to the direction of the framing of an item and its response options. Whether items should be positively or negatively framed has been extensively debated [17]. Positively framed items can have advantages in terms of willingness to complete and how completing the questionnaire may affect mood. However, this is not universal, as some groups with very poor mental health or life circumstances may prefer to complete negatively framed items.
A case has been made for including both, in part so that those responding as ‘flat liners’ down one end of the scale would have less impact upon the average results, as the positive and negative scoring would cancel each other out. However, this has been found to have a negative impact upon validity [17–19]. When items are designed for preference-elicitation tasks, it would be better for response options to run in a consistent direction. Switching the meaning of terms, e.g. where ‘none of the time’ is good for a negative item (pain) but bad for a positive item (energy), would be confusing for respondents and may have an impact on quality adjustment values.
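A simple consistency check of this kind can be sketched as follows. The item names and response labels are hypothetical, and the function merely flags any response label that represents the best level for one item and the worst level for another.

```python
def mixed_direction_labels(items):
    """items: dict mapping item name to its response labels, ordered from
    best to worst. Returns the labels that are best for at least one item
    and worst for at least one other."""
    best = {labels[0] for labels in items.values()}
    worst = {labels[-1] for labels in items.values()}
    return sorted(best & worst)


classification = {
    # Negative item: 'none of the time' is the best level.
    "pain": ["none of the time", "some of the time", "most of the time"],
    # Positive item: 'none of the time' is the worst level.
    "energy": ["most of the time", "some of the time", "none of the time"],
}
flagged = mixed_direction_labels(classification)
```

Here both ‘most of the time’ and ‘none of the time’ are flagged, signalling that the same wording changes meaning between items within one health state description.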