Introduction
A number of generic and disease-targeted health-related quality of life (HRQOL) instruments have been developed. Survey measurement of HRQOL assumes that respondents comprehend the questions. Several studies have evaluated national levels of literacy. For example, in the United Kingdom, a government report showed that 56% of a randomly selected sample of adults had literacy skills at the lowest level of ability [1]. In addition, Smith et al. report that 22% of the working population in the United Kingdom have a low level of literacy [2, 3]. Analysis of the 2003 National Assessment of Adult Literacy in the United States indicated that 44% of adults had a basic or below basic literacy level [4]. The same report found that 36% of the adult US population had basic or below basic health literacy [4]. Gazmararian et al. [5] specifically examined functional health literacy in a US national sample of Medicare enrollees in a managed care organization and found that more than one-third of respondents had inadequate or marginal health literacy.
Low literacy is associated with lower socioeconomic status and poor health. In the United States, it disproportionately affects ethnic minorities, including immigrants who often arrive with low levels of education, socio-economic status, and English proficiency, and with cultural models of disease and disease prevention that differ from US models [6]. Discrepancies between the readability of health information and the literacy skills of patients have been reported extensively since health-related readability evaluation began in the 1980s [2, 7–17].
Studies that have evaluated patient literacy have found that educational level is not always consistent with literacy level. Davis et al. [18] reported that among adult patients with a fifth to tenth grade education, 60% were reading at least three grades below their grade level. Similar results have been reported in other studies, which found reading levels up to six grades below the highest grade completed [19].
US norms recommend that surveys not include items requiring more than 8 or 9 years of formal schooling for the general population, or more than 5 years of formal schooling for vulnerable populations [12, 13]. Likewise, in the United Kingdom, it is recommended that health literature be written so that no more than 5 years of education are needed to fully understand the passage [17]. It therefore seems appropriate to suggest that health materials be written assuming a maximum of 5 years of formal education, to assure comprehension by the widest possible population [16, 17, 20]. Items that are not easily understood will have higher rates of non-response, and the data may become unreliable because items are incomprehensible to subjects with low literacy levels.
Reading ease evaluation has become increasingly important as research has shown that comprehension is higher when texts are easy to read. Readability refers to the ease with which a piece of text can be read and understood. Most health-related readability studies have focused on educational materials and consent forms, and more recently some studies of internet-based health information have been conducted [16, 21, 22]. By contrast, relatively few studies have evaluated the readability of health surveys, and only a few of these evaluated the readability of each item separately [16, 21, 22]. This distinction matters because, when an instrument is evaluated as a whole, computerized methods calculate a weighted average of text readability, and this average score reflects only the mean readability of the whole instrument. In a survey, however, the average readability score tells only part of the story, because the respondent needs an adequate literacy level to understand each item independently. In addition, mean readability scores do not describe the actual reading demands that participants face, since the variation in item reading levels may be high and the full range of scores would not be captured. Thus, assessing readability at the item level before collecting survey data is an important contribution that will help close the gap between survey research and what the general population truly understands.
The Health Measurement Research Group conducted a multisite study to evaluate extensively used HRQOL instruments [23, 24]. Five of these are generic instruments: the Short-Form Health Survey-36 item (SF-36v2), the Health Utilities Index (HUI), the European Quality of Life-5-Dimensional (EQ-5D), the Quality of Well-Being Scale-Self-Administered (QWB-SA), and the Health and Activities Limitations Index (HALex). In addition, two disease-targeted instruments were included to learn how health assessments function differently in subjects with specific conditions: the Minnesota Living with Heart Failure Questionnaire (MLHFQ) and the National Eye Institute Visual Functioning Questionnaire-25 item (VFQ-25). These instruments were selected because they focus on the patients for whom the study data were collected, those with heart disease and cataracts, and because they are considered legacy measures for these conditions [25]. The generic instruments are among the most widely used measures. The purpose of this article is to assess the readability of these seven commonly used HRQOL instruments at the item level.
Results (data available upon request from S. Paz)
1. Short-Form Health Survey-36 item (SF-36v2)
The mean and median F–K grade level scores placed this instrument at the “fairly easy” and “very easy” levels of readability, respectively (see Tables 1, 2). Nineteen items (53%) scored above the recommended 5 years of schooling (see Fig. 1). In addition, nine items (25%) fell in the categories of “fairly difficult,” “difficult,” or “very difficult,” and eight items (22%) require more than 12 years of formal schooling to be properly understood. The mean and median on the FRE readability index placed this survey at a “fairly easy” level of reading difficulty (see Tables 1, 2). Even though the mean and median values are “fairly easy,” 18 items (50%) fell in the categories of “fairly easy,” “standard,” “fairly difficult,” “difficult,” or “very difficult” according to the FRE scoring method; i.e., 50% are harder than the recommended categories of “very easy” or “easy.” Eight items (22%) scored “fairly difficult,” “difficult,” or “very difficult” according to both scoring methods. The overall readability scores for the SF-36 were 6.7 using the F–K grade level scoring and 70.3 using the FRE readability formula. These results set this survey at an “easy” and “fairly easy” level of readability, respectively, according to the classification presented in Table 1.

2. Health Utilities Index (HUI)
The mean and median F–K grade level scores for the HUI items were 9.6 and 9.0, respectively, setting this survey at a “standard” level of readability according to the classification presented in Tables 1 and 2. All 15 items (100%) scored above the recommended 5 years of formal schooling (see Fig. 2). Using the FRE readability formula, which does not depend on grade level, the mean and median for this questionnaire’s items also set the survey at a “standard” level of readability on average (see Tables 1, 2). Even though the mean and median values are at the “standard” level of readability, 100% of items (15/15) fell in categories harder than “very easy” or “easy,” and 40% of the items (6/15) fell in the categories of “fairly difficult,” “difficult,” or “very difficult” according to both scoring methods (see Table 1). When calculated as a whole instrument, the overall readability scores for the HUI were 7.1 using the F–K grade level scoring and 65.7 using the FRE readability formula. These results would set this survey at a “fairly easy” and “standard” level of readability, respectively, according to the classification presented in Table 1.

3. European Quality of Life-5-Dimensional (EQ-5D)
The mean and median F–K grade level scores for the EQ-5D items set this survey at the “easy” level of readability according to the classification given in Table 1. The standard deviation was 3.5 and scores ranged from 3.0 to 12.0 (VAS item) (see Table 2). Three items (50%) scored above the recommended 5 years of schooling (see Fig. 3). Using the FRE readability formula, the mean and median for the EQ-5D placed this survey at a “standard” level according to Table 1. Even though the mean and median values are “standard,” only 33% (2/6 items) fall in the categories of “very easy” or “easy.” The VAS item, with a score of 12.0 (the highest on the F–K scale) and a rating of “fairly difficult,” has a “standard” rating on the FRE scale and therefore affected the mean score less under this latter method. Item 3, the hardest item using the FRE, is the only “difficult” item in this survey according to that method. The overall readability scores for the EQ-5D were 4.2 using the F–K grade level scoring and 78.4 using the FRE readability formula. These results set this survey at a “very easy” and “fairly easy” level of readability, respectively (see Table 1).

4. Quality of Well-Being Scale-Self-Administered (QWB-SA)
The mean and median F–K grade level scores set this survey at a “standard” level of readability (see Tables 1, 2). Only 11 items scored at or below the recommended 5 years of formal schooling, meaning that 85% (64/75) of the items in this survey may not be appropriately understood by individuals with less education (see Fig. 4). Furthermore, 14 items scored above 12.0 using the F–K method, meaning that a college-level education or higher is needed to appropriately comprehend 19% (14/75) of this survey. The FRE readability mean and median estimates for the QWB-SA placed this survey at a “standard” level according to Tables 1 and 2. The standard deviation was 18.7 and scores ranged from 0.0 to 100.0. Even though the mean and median values are “standard,” only 21% (16/75 items) fell in the recommended categories of “very easy” or “easy” according to this method of evaluating readability. The overall readability scores for the QWB-SA were 3.1 using the F–K grade level scoring and 79.3 using the FRE readability formula, setting this survey at a “very easy” and “fairly easy” level of readability, respectively (see Table 1).

5. Health and Activities Limitations Index (HALex)
The mean and median F–K grade level scores for the HALex items set this survey at a “standard” and “fairly difficult” level of readability, respectively (see Tables 1, 2). All seven items scored above the recommended 5 years of formal schooling, meaning that 100% (7/7) of the items in this survey may not be appropriately understood by individuals with less education (see Fig. 5). Furthermore, one item (14%) requires 12 completed years of formal schooling and one item requires more than a college-level education to be properly understood. On the FRE readability formula, the mean and median placed this survey at a “fairly difficult” level according to Tables 1 and 2. As with the F–K formula, 100% (7/7 items) fell in categories above the recommended “very easy” or “easy” categories using this scoring method. The overall readability scores for the HALex were 10.1 using the F–K grade level scoring and 55.4 using the FRE readability formula, setting this survey at a “fairly difficult” level of readability with both methods, according to the classification presented in Table 1.

6. Minnesota Living with Heart Failure Questionnaire (MLHFQ)
The mean and median F–K grade level scores for the MLHFQ items set this survey at a “standard” level of readability (see Tables 1, 2). All 21 items (100%) scored above the recommended 5 years of formal schooling, making this survey inappropriate for subjects with less education (see Fig. 6). The mean and median FRE readability estimates also placed this survey at a “standard” level according to Tables 1 and 2. Even though the mean and median values are “standard,” 100% (21/21 items) fell in categories harder than the recommended “very easy” or “easy” using this scoring method. When calculated as a whole instrument, the overall readability scores for the MLHFQ were 5.5 using the F–K grade level scoring and 69.2 using the FRE readability formula. These results place this survey at a “very easy” and “standard” level of readability, respectively (see Table 1).

7. National Eye Institute Visual Functioning Questionnaire-25 item (VFQ-25)
The mean and median F–K grade level scores for the VFQ-25 placed this survey at a “standard” level of readability (see Tables 1, 2). Twenty items (80%) scored above the recommended 5 years of schooling (see Fig. 7). Furthermore, two items require more than a high school education to be properly understood. Using the FRE readability formula, the mean and median also placed this survey at a “standard” level according to Tables 1 and 2. Even though the mean and median values are “standard,” 80% (20/25) of items did not fall in the recommended categories of “very easy” or “easy.” The overall readability scores for the VFQ-25 were 8.9 using the F–K grade level scoring and 63.7 using the FRE readability formula. These results set this instrument at a “standard” level of readability using both calculation methods (see Table 1).
Figure 8 shows, for each item of the VFQ-25, the score obtained by scoring the item alone alongside the score for the item together with its response choices. For two items the scores were identical, for 20 items including the response choices produced a higher score, and for three items excluding the response choices produced a higher score. For these last three items, this occurred because an added second sentence, which is normally longer but has a lower readability score, ordinarily contributes a higher weight to the average total item score; in these three items, however, the second sentence, which contained the response choices, was shorter than the first sentence and therefore contributed less to the weighted average.
Discussion
The results of this study reveal that current HRQOL measures may be inappropriate for general population surveys and, in particular, for populations with lower socio-economic status. Readability analysis of HRQOL surveys is important, and analysis at the item level is essential. Mean scores for all of these widely used surveys required more than the recommended 5 years of formal schooling. Moreover, all surveys had a substantial number of items with scores above the recommended threshold. These findings show that most readability studies, which report survey mean scores, are inadequate, since a significant segment of the population will not have the literacy skills needed to comprehend and respond correctly to many items in the surveys. Vulnerable populations will be especially affected by the administration of surveys that are beyond their literacy skills.
Ethnic minorities and underserved populations in the United States consistently show worse health outcomes, lower preventive screening rates, worse disease management, and lower survival rates [44]. Health literacy and limited reading skills are known to be important barriers to improving health outcomes. Meade et al. [44] reported alarmingly low levels of literacy in the general population, which are disproportionately prevalent among vulnerable populations. Multiple studies report health materials written at readability levels far above the recommended US national norms [20]. Although educational level is not always consistent with literacy level, before developing new measures of HRQOL it behooves outcomes researchers to consider the educational background of the target populations. A discrepancy between the actual readability level and the readability appropriate for underserved populations was found in most surveys analyzed in this paper. In addition, data are at even higher risk of poor quality when surveys are administered to populations who lack the literacy levels necessary for full comprehension of items. This is exacerbated when immigrant populations, who tend to have less education and English proficiency, are included in the sample.
Readability formulae are useful in that they provide a quantifiable estimate of the reading ease of a given text. However, they do not take into account other factors that are important in predicting survey comprehension. Content, layout, learning stimulation, and cultural appropriateness are examples of additional factors that might influence the readability of surveys. Furthermore, the formulae do not account for complementarities among individual items, which can also facilitate understanding when the items are taken as a whole in a specific context. Other personal factors that have been studied and found to affect readability are previous experience, motivation, and interest. These formulae may also underestimate the difficulty of new material containing vocabulary not usually used by the general population.
Bailin and Grafstein [45] reported on a study documenting that reading ability is significantly determined by the knowledge procedures involved in deriving meaning from a given text. An additional caveat of these formulae is that they rely solely on sentence and word length, and therefore assign equal scores to sentences containing the same words scrambled into a different order. Less amenable to formula-based evaluation are recommendations concerning design factors and the visuals that can accompany written text, which have also been found to increase readability. Even though most of these studies have been done on educational materials or web-based information, some extensively reported suggestions that might improve reading ease, and that could be helpful when working with surveys, are a font size of 12 points or larger, the use of black ink on white paper, and ample white space on the page [46, 47].
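The word-order caveat can be checked directly: because such formulae use only sentence, word, and syllable counts, a sentence and a word-scrambled version of it receive exactly the same score. A minimal Python check (the item wording is hypothetical, and the syllable counter is a rough heuristic restated here for self-containment):

```python
import re

def fre_score(text):
    """Flesch Reading Ease computed from sentence, word, and syllable
    counts alone (naive vowel-group syllable heuristic)."""
    sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syl = sum(max(len(re.findall(r"[aeiouy]+", w.lower())), 1) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sents)) - 84.6 * (syl / len(words))

original = "During the past month, how often did pain interfere with your sleep?"
scrambled = "Sleep pain the during how interfere month did your often with past?"

# Same words in a different order: all counts are identical, so the
# formula assigns the same score to grammatical and nonsense text.
print(fre_score(original) == fre_score(scrambled))  # True
```

This is why formula scores are best read as estimates of surface complexity, not of comprehensibility.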
An additional limitation of this study is that readability analyses were performed in only one language, using methods developed primarily for a US population. The use of other indices, such as the SMOG (Simple Measure of Gobbledygook) index, which estimates the years of education needed to appropriately understand a piece of text and is often used in the United Kingdom, would be an important contribution to the literature. In addition, future studies could estimate readability in other languages. For example, it would be of interest to use the Fernandez Huerta formula to estimate the readability of the SF-36 in Spanish, or the Kandel and Moles formula to estimate its readability in French.
Despite these limitations, readability formulae provide a fast and efficient measurement tool that is readily available in commonly used computer software. The use of these formulae when developing surveys could help investigators select simpler vocabulary and sentence structure. Both scoring algorithms used in this paper yield better scores with shorter, more commonly used words and shorter sentences. The methods used in this analysis can therefore serve as a helpful tool when developing new surveys and when modifying existing ones to reduce the discrepancy between survey readability and population skills. For example, in Part IV of the QWB-SA instrument, all nine items include the instruction “please fill in all days that apply” as part of the question. By removing this phrase and placing it at the beginning of the section as an instruction for all the following items, readability scores are reduced from 8.3 to 5.8 for item 1 and from 10.5 to 8.3 for item 2, using the F–K method.
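The QWB-SA example above can be reproduced in spirit with the F–K formula itself. In the Python sketch below, the item stem is invented for illustration; only the instruction phrase is quoted from the survey, and the naive syllable counter means the computed grades will not match the published 8.3 and 5.8, though the direction of the change holds:

```python
import re

def fk_grade(text):
    """Flesch-Kincaid grade level with a naive vowel-group
    syllable heuristic (approximate values only)."""
    sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syl = sum(max(len(re.findall(r"[aeiouy]+", w.lower())), 1) for w in words)
    return 0.39 * (len(words) / len(sents)) + 11.8 * (syl / len(words)) - 15.59

# Hypothetical item stem; only the instruction phrase comes from the QWB-SA.
with_instruction = ("Over the last three days, did you have trouble sleeping, "
                    "please fill in all days that apply.")
without_instruction = "Over the last three days, did you have trouble sleeping?"

# Moving the instruction out of the item stem shortens the sentence
# and lowers the estimated grade level.
print(fk_grade(with_instruction) > fk_grade(without_instruction))  # True
```

Because both formulas penalize words-per-sentence, any boilerplate repeated inside item stems inflates every item's score; hoisting it into a section-level instruction is a cheap, systematic fix.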
An interesting finding of this study was the variation in readability both within and between surveys. The largest within-survey range was found in the QWB-SA, with an item variation of 100.0 using the FRE algorithm and 21.0 using the F–K formula; the smallest was seen in the MLHFQ, with 24.7 using the FRE and 4.7 using the F–K algorithm. Thus both the largest and smallest ranges occurred in the same surveys under both formulae. Between surveys, and considering the median value of each, the hardest survey on the FRE algorithm was the HALex (59.6) and the easiest was the SF-36 (79.6). Using the F–K scoring algorithm, the highest (hardest) median score was seen in both the HALex and the MLHFQ (9.9 each), and the lowest in the SF-36 (4.5). Setting these extremes aside, the remaining survey medians all fell within the 60s using the FRE algorithm, while showing more variability, from the 6th to the 9th grade level, using the F–K algorithm. The median was used for this comparison because it is a more stable statistic, less influenced by extreme values.
No major differences were found between the generic and the disease-targeted instruments. Both disease-targeted instruments had means and medians above the recommended scores, as did most of the generic instruments. Of interest, both disease-targeted instruments had the same mean score using the F–K algorithm, but the median was lower for the VFQ-25; as Fig. 7 confirms, this instrument has more items within the recommended range.
As seen in Fig. 8, most items have higher readability scores when all response choices are included within the question. When surveys are administered by professional interviewers, all response choices are most probably read aloud. The items may be literally longer, but the interviewer can help by explaining items that are unclear to the subject or by emphasizing an item’s important part; neither option is available with self-administered surveys. In addition, Krosnick and Alwin found that the order of response choices affects the response selected, and that this effect differs between self-administered and interviewer-administered questionnaires: the likelihood of choosing the first response choices increased when the survey was self-administered, while the likelihood of selecting the last choices increased with interviewer-administered surveys. The authors also concluded that subjects with lower levels of education were more likely to be influenced by changes in the order of response choices [48].
The validity of data collected from self-reported outcome measures depends upon the subject’s ability to comprehend each item in the survey. The gap between survey readability levels and the reading skills necessary for comprehension must be reduced. Working alongside educators and editors, researchers working with survey data need to become more conscious of the population’s low literacy levels. If the goal of outcome measurement is ultimately to improve HRQOL, sensitivity to an ever-changing population is necessary both when using existing measures and when creating new methods of evaluation. Surveys that are multicultural, multilingual, and literacy-sensitive to a demographically changing population are warranted.