Introduction

A seminal finding of twentieth-century epidemiology was that a population’s mean score predicts the proportion of high-scoring (‘deviant’) individuals. This was first demonstrated for physical health [1] and has recently been reported for mental health in adult populations across Europe [2] and in child populations within Great Britain [3].

These findings highlight the importance of implementing population-wide interventions alongside interventions which target the highest-risk individuals [1]. They also suggest the possibility of using population mean scores to compare health over time, space or culture. Caution is needed, however, when making such comparisons using subjectively reported outcomes such as mental health, because differences in mean scores may reflect not differences in population health but systematic bias in how mental health is reported. Such biases may be particularly likely in brief questionnaires which, unlike structured diagnostic interviews, ask only a small number of broad questions and allow no role for clinical judgement [4, 5].

We have previously shown that, in general, such systematic reporting biases do not seem to apply within Great Britain when using the Strengths and Difficulties Questionnaire (SDQ) [6]. Mean SDQ symptom scores predicted the prevalence of disorder in an accurate and unbiased manner across populations defined by multiple child, family and area characteristics (e.g. ethnicity, family type, and area deprivation) [3]. This was true for the parent, teacher and youth SDQs alike, and allowed us to derive and validate UK ‘SDQ prevalence estimators’. For the parent and teacher (but not youth) SDQs, the prevalence of disorder was also closely estimated by (1) the proportion of individuals with high SDQ symptoms plus impact; and (2) the proportion of individuals reporting ‘definite’ or ‘severe’ difficulties in a one-item, global rating of child mental health problems.

It would be a great boost to child psychiatric epidemiology if these British findings applied cross nationally, i.e. if the same set of equations could be used to generate prevalence estimates in, and within, countries other than Britain. First, it would allow researchers in other settings to treat the SDQ as an accurate and unbiased method for monitoring and comparing child mental health. This could be particularly important in low- and middle-income settings, which frequently lack the money and clinical expertise to conduct prevalence studies using detailed diagnostic interviews and/or to use diagnostic interviews to derive country-specific prevalence estimating equations. Second, it would greatly facilitate comparisons of child mental health across many different countries, and so aid the identification of population-level determinants of health [7].

Interesting findings regarding cross-cultural similarities and differences in child mental health have already emerged from international comparisons using brief questionnaires [8–10], including the SDQ [11, 12]. Yet interpreting these findings is substantially complicated by uncertainty about how far these brief questionnaires provide unbiased cross-cultural estimates of disorder prevalence. Several studies indicate that rating norms may differ across cultures [13, 14], providing indirect evidence that brief questionnaires may be problematic. To our knowledge, only one study has examined this issue directly, demonstrating that differences in mean SDQ scores between Norway and Britain only sometimes reflected differences in disorder prevalence [4]. This paper builds upon this Norway–Britain comparison to examine whether caseness indicators based on the parent, teacher or youth SDQ provide meaningful prevalence estimates cross nationally.

Methods

Study samples

Our data come from 29,225 5- to 16-year-olds from seven different countries: Britain [15, 16], Norway [17], Brazil [18, 19], Yemen [20], India [21], Bangladesh [22] and Russia [23]. These represent the participants in all published population-based studies (eight in total) which have: administered the parent SDQ; estimated prevalence using a highly comparable form of clinician-rated diagnosis (including shared supervision, as described below); and based prevalence estimates upon Development and Well-Being Assessment (DAWBA) interviews about at least 100 children.

All these studies have previously been reported in detail individually [15–23] and Table 1 summarises their survey methodology, including sampling procedures and informants used. Four out of eight studies were two-phase, administering the DAWBA to all children who screened positive on the SDQ and to a random subsample of children who screened negative. All studies approached parents for written informed consent to take part and the present analyses include only those children with complete parent SDQ data. With parental permission, 7/8 studies also collected mental health data from teachers (all except India) and 5/8 collected data from youth aged 11–16 (all except Bangladesh, Norway and Yemen). All studies received ethical approval from local and/or UK research ethics committees.

Table 1 Key methodological features and sample characteristics of study populations

In three studies (from Brazil, Britain and Yemen), we subdivided the study samples a priori into further socio-demographic populations. The result was 10 British and 10 non-British populations, the age range and sex composition of which are reported in Table 1.

Measures

Strengths and Difficulties Questionnaire (SDQ)

The Strengths and Difficulties Questionnaire (SDQ) is a brief questionnaire measure of child mental health problems that can be administered to parents and teachers of children aged 4–16 and to young people aged 11–16 [6, 24]. Its 20 difficulty items, covering emotional symptoms, conduct problems, hyperactivity and peer problems (five items each), are summed to give a ‘total difficulty score’. The total difficulty score is a measure of overall child mental health problems that has been shown to have good psychometric properties in studies from around the world [6, 25–31]. This includes evidence that the total difficulty score is correlated with existing questionnaire and interview measures; differentiates clinic and community samples; and is associated with increasing rates of clinician-rated diagnoses of child mental disorder across its full range.
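For illustration, the following minimal sketch (in Python, not the official SDQ scoring syntax) shows how a total difficulty score is assembled once the four five-item subscales have each been scored in the standard 0–10 range; the subscale names used here are hypothetical labels for the scales listed above.

```python
# The four difficulty subscales named in the text; each is assumed to have
# already been scored 0-10 (five items scored 0-2).
DIFFICULTY_SUBSCALES = ("emotional", "conduct", "hyperactivity", "peer")

def total_difficulty(subscale_scores: dict) -> int:
    """Sum the four difficulty subscales to give the total difficulty score (range 0-40)."""
    return sum(subscale_scores[s] for s in DIFFICULTY_SUBSCALES)

# Example with hypothetical subscale scores for one child
print(total_difficulty({"emotional": 4, "conduct": 3, "hyperactivity": 6, "peer": 2}))  # 15
```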

This paper makes cross-cultural comparisons using three SDQ caseness indicators:

1. ‘SDQ prevalence estimates’. Within Britain, we have previously derived and validated equations which estimate disorder prevalence based on mean total difficulty scores, adjusting for the population’s age and sex composition [3; prevalence estimator equations in Supplementary material]. We used these prevalence estimates rather than raw mean scores in order to allow for age differences between our study samples.

2. SDQ ‘symptoms+impact’. The SDQ impact supplement asks whether reported difficulties cause the child distress (1 item) or impairment in their daily life (4 items for parents and youth, and 2 items for teachers) [29]. We calculated the proportion of children with borderline or high symptoms (total difficulty score cut-points 13/14 for parent SDQ, 11/12 for teacher SDQ, and 15/16 for youth SDQ) plus high impact (impact score cut-point 1/2 for all informants) [32].

3. ‘Definite/severe’ difficulties. The SDQ symptom questions are followed by a single item asking whether the child has difficulties with “emotions, concentration, behaviour or being able to get along with other people”. We calculated the proportion of informants reporting ‘definite’ or ‘severe’ difficulties (vs. ‘no’ or ‘minor’ difficulties). A schematic sketch showing how the second and third indicators can be computed is given after this list.
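The following minimal sketch (in Python; the study’s own analyses used Stata) illustrates how indicators 2 and 3 could be computed from individual-level SDQ data using the cut-points quoted above. It is an illustration under stated assumptions rather than the code actually used: the data-frame column names are hypothetical, and indicator 1 is not shown because it relies on the published UK prevalence estimator equations cited above.

```python
import pandas as pd

# Cut-points quoted above: a total difficulty score at or above the upper value of
# each 'x/y' cut-point counts as borderline/high symptoms; an impact score of 2 or
# more counts as high impact for all informants.
SYMPTOM_CUTOFF = {"parent": 14, "teacher": 12, "youth": 16}
IMPACT_CUTOFF = 2

def symptoms_plus_impact(df: pd.DataFrame, informant: str) -> float:
    """Indicator 2: proportion of children with borderline/high symptoms plus high impact.
    Assumes hypothetical columns 'total_difficulty' and 'impact'."""
    caseness = (df["total_difficulty"] >= SYMPTOM_CUTOFF[informant]) & (df["impact"] >= IMPACT_CUTOFF)
    return float(caseness.mean())

def definite_severe(df: pd.DataFrame) -> float:
    """Indicator 3: proportion reporting 'definite' or 'severe' difficulties on the
    single global item (hypothetical column 'global_rating')."""
    return float(df["global_rating"].isin(["definite", "severe"]).mean())
```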

Development and Well-being Assessment (DAWBA)

We measured disorder prevalence using the DAWBA, which comprises a detailed psychiatric interview administered by lay interviewers to parents and youth, plus a briefer questionnaire for teachers [33]. The main DAWBA interview is fully structured, closely following the diagnostic criteria operationalised in the Diagnostic and Statistical Manual of Mental Disorders, 4th edition (DSM-IV) [34]. Responding parents, teachers and youth are then prompted to describe any reported difficulties in detail, with answers recorded verbatim by the interviewer. Experienced clinicians review the structured responses and open-ended accounts of all available informants, and rate the presence or absence of individual diagnoses according to DSM-IV [35].

In our eight study populations, the DAWBA diagnoses have been shown to have high inter-rater reliabilities [17, 18, 36, 37], to discriminate clinic and community samples [18, 22, 33, 36], to show plausible patterns of comorbidity and association with risk factors [17, 19, 23, 37, 38], and to be strongly predictive of mental health service contact [17, 39]. All diagnostic ratings were carried out by the DAWBA’s creator (RG) or by experienced local professionals supervised by RG. These experienced local professionals trained initially on the 54 practice cases in the on-line DAWBA manual (http://www.dawba.info/manual/m0.html). They were then supervised individually by RG who reviewed a mixture of randomly selected cases and difficult cases that the trainee had provisionally rated.

Analyses

We calculated all prevalence estimates and confidence intervals using sampling weights to correct for the two-phase design of some studies (see Table 1). We also adjusted for the complex survey design of those studies that used stratification or clustered sampling. We plotted each of our nine SDQ caseness indicators (three measures × three informants) against the measured prevalence of disorder using the DAWBA, deriving the measured prevalence from the same subset of children (e.g. comparing predictors based on teacher SDQs with the prevalence of disorder in children with teacher SDQ data). We fitted nine corresponding linear regression models, with the relevant SDQ caseness indicator as the explanatory variable and giving all study populations equal weight. We present the adjusted R² values from these regression models as a measure of how much of the variance in prevalence was explained. All analyses were performed in Stata 10.2.
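As a schematic illustration of this step, the sketch below (in Python rather than the Stata code actually used) fits the population-level regression and extracts the adjusted R². It assumes the survey-weighted prevalence for each study population has already been calculated, and the column names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

def caseness_indicator_fit(populations: pd.DataFrame) -> float:
    """Regress DAWBA-measured prevalence on one SDQ caseness indicator across study
    populations, giving each population (one row) equal weight, and return the
    adjusted R-squared reported in the Results.

    Expected (hypothetical) columns:
      dawba_prevalence -- survey-weighted prevalence of disorder in that population
      sdq_indicator    -- the corresponding SDQ caseness indicator for the same children
    """
    model = smf.ols("dawba_prevalence ~ sdq_indicator", data=populations).fit()
    return model.rsquared_adj
```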

Results

The prevalence of disorder measured using the DAWBA ranged from 2.2% in our Indian sample to 17.1% in our Russian sample. Figure 1 plots these prevalence values (y-axis) against the three parent-based SDQ caseness indicators (x-axis) and presents the R² values; Figs. 2 and 3 present the corresponding graphs for the teacher and youth SDQs. This information is also presented in tables in the Supplementary material, together with the raw mean total difficulty scores upon which the SDQ prevalence estimates are based. The Supplementary material also shows the prevalence rates and relative proportions of emotional, behavioural and hyperactivity disorders; these relative proportions were much less variable than the overall prevalence rates.

Fig. 1 Parent SDQ caseness indicators versus prevalence of disorder: data from seven countries

Fig. 2 Teacher SDQ caseness indicators versus prevalence of disorder: data from six countries

Fig. 3 Youth SDQ caseness indicators versus prevalence of disorder: data from four countries

The three parent-based SDQ caseness indicators yielded R² values of 0.14–0.38; that is, they explained 14–38% of the observed cross-national variation in the prevalence of disorder ascertained using the DAWBA (see the Figures for individual R² values). The corresponding R² values were 0.30–0.56 for teachers and 0.08–0.41 for youth. These values were similar when the analyses were repeated separately for study populations aged 5–10 years and for populations aged 11–16 years (see Supplementary material), and generally fell when the British samples were removed. Only within Britain did the SDQ prevalence estimates closely approximate the true prevalence (i.e. lie close to the 45-degree line plotted in the Figures); in most other populations the SDQ prevalence estimator equations overestimated the prevalence, while in Norway they underestimated it.

As a result, none of these SDQ caseness indicators could be used to make meaningful estimates of prevalence across the non-British samples. To illustrate this point, it is useful to consider the performance of the parent SDQ in the 10 populations with the highest measured prevalence of disorder. The actual prevalences as measured by the DAWBA ranged from 11 to 17% in these 10 populations (see Fig. 1). By contrast, the estimated prevalences from the parent SDQ prevalence estimators were 10–15% in rural Brazil and the four most deprived British populations; 22% in urban Yemen and urban slum Brazil; 30–32% in Russia and Bangladesh; and 60% in rural Yemen. The other two parent SDQ caseness indicators did no better, giving values ranging from 5 to 47% for these same populations. An instance of inaccurate prediction affecting a population with a low prevalence of disorder was the Northeastern Brazilian quilombo (a predominantly African–Brazilian rural area), which had a parent SDQ prevalence estimate of 39%, as compared to a measured prevalence of 7%.

Populations with a similar measured prevalence of disorder therefore showed large variations in the parent SDQ caseness indicators. The same was true for the teacher and youth SDQs, as shown in Figs. 2 and 3. Furthermore, the relative ordering of populations was not consistent across these measures. For example, in Bangladesh the parent SDQ prevalence estimate was 32%, reflecting a high level of symptoms reported by parents. Yet only around 5% of the Bangladeshi children had SDQ symptoms+impact or were reported by their parents to have ‘definite/severe’ difficulties, among the lower values across the study populations. The Brazilian quilombo likewise had one of the highest parent SDQ prevalence estimates (39%), but only 1–3% had symptoms+impact or ‘definite/severe’ difficulties. The teacher and youth SDQs produced similar findings. These discrepancies suggested cross-cultural variation in the relationship between symptoms and impact within the SDQ. To investigate this, we plotted mean parent SDQ impact scores against the SDQ prevalence estimates (that is, against age-adjusted parent SDQ symptoms). As Fig. 4 shows, Bangladesh and quilombo Brazil stand out in having unusually low impact scores for a given level of symptoms. The same was true of rural Yemen, where mean parent impact scores were slightly lower than in urban Yemen but the SDQ prevalence estimates were much higher.

Fig. 4 Estimated prevalence of disorder from parent SDQ versus mean impact score on parent SDQ, stratified by disorder status. Bang Bangladesh; Br,mc Brazil, middle class; Br,r Brazil, rural; Br,sl Brazil, urban slum; Br,q Brazil, quilombo; Ind India; Rus Russia; Nor Norway; Y,u Yemen, urban; Y,r Yemen, rural; Unlabelled points Great Britain

Indeed, parent SDQ symptom scores were so high in rural Yemen that the population mean for non-disordered children was comparable to that of children with a disorder in Britain. The converse was true of the final notably anomalous population in Fig. 4, namely the 26 Indian children with a disorder. These children had mean levels of parent-reported symptoms and impact which were far lower than those of disordered children in any other population (p ≤ 0.003), and indeed lower than those of non-disordered children in Russia and Yemen. This pattern was replicated for the youth SDQ, where again the Indian children with a disorder had mean SDQ symptom and impact scores indistinguishable from those of non-disordered children in most other populations (see Supplementary material for teacher and youth graphs).

Discussion

This study of 29,225 5- to 16-year-olds from seven countries has examined whether measures based on the parent, teacher or youth SDQ can be used to estimate the prevalence of child mental disorder cross nationally without the need for population-specific norms. Our findings suggest that this is not possible, and that population-specific norms may be needed when estimating prevalence. They also imply the need for substantial caution when interpreting cross-cultural comparisons of levels of child mental health problems which are based solely upon brief questionnaires.

When interpreting these findings, it is worth bearing in mind the limitations of our study. First, our study populations had different age ranges. However, the low correlations between the SDQ measures and the DAWBA were almost unchanged after stratifying by age, suggesting that this cannot explain the large cross-national discrepancies observed. A second limitation is that although all studies collected mental health data from parents, one study did not include teachers and three did not include youth. This undermines comparability because multi-informant DAWBA information generates slightly higher prevalence estimates (e.g. clinicians in Britain were 6% more likely to diagnose a disorder if teachers completed a DAWBA as well as parents [15]). Again, however, these effects are too small to plausibly affect our substantive conclusions. Finally, the DAWBA-generated prevalence figures are themselves only estimates of the true prevalence. Despite our efforts to standardize ratings through shared training and supervision, the DAWBA diagnoses are subject to measurement error, some of which may be systematic across countries. Nevertheless, we believe that the DAWBA’s use of multiple detailed questions, open-ended transcripts and local clinical judgement renders it less prone to cross-cultural bias than the SDQ [4]. Moreover, any bias in the DAWBA cannot plausibly account for the extremely large cross-national differences we observed in the SDQ.

We are therefore confident in our substantive conclusion that the SDQ shows large cross-cultural reporting effects and cannot be assumed a priori to be a valid method for comparing prevalences cross nationally without recourse to population-specific norms. Of course, brief questionnaires may nonetheless be important for monitoring mental health or examining risk factor associations. Moreover, cross-cultural bias between countries does not necessarily translate into cross-cultural bias within a country. For example, despite the differences between the Indian and British studies in this paper, the SDQ and DAWBA have very similar psychometric properties in British Indians and British Whites [40]. More broadly, within Britain the parent, teacher and youth SDQs generally provide accurate and unbiased prevalence estimates for populations defined by multiple child, family and area characteristics [3].

Yet what our findings do indicate is that population-specific SDQ norms may be necessary for valid international comparisons. Moreover, it cannot necessarily be assumed that the same norms will always apply within a single country. For example, parent SDQ symptom scores were far higher in rural Yemen than in urban Yemen, despite similar disorder prevalences and SDQ impact scores. Much the same was true comparing the Northeastern Brazilian quilombo with the Southeastern Brazilian populations. One possible explanation is that in relatively isolated rural communities, respondents have little experience of completing questionnaires and may find it hard to judge what level of symptoms the investigators are interested in [19]. In Yemen, rural parents may also show lower tolerance for problematic child behaviour than urban parents. This would be consistent with our previous demonstration that harsh physical punishment is particularly common in rural Yemen, perhaps reflecting a higher work burden and reduced childcare support [41]. Thus, SDQ symptom scores may be higher when respondents have little familiarity with questionnaires and perhaps when stressful life circumstances reduce tolerance for troubled children. We believe both factors may partly explain why, relative to British norms, the SDQ tended to overestimate the prevalence of disorder in all our low- and middle-income country samples. Only Norway showed an effect in the opposite direction, possibly reflecting a more ‘normalizing’ attitude towards some child mental health problems [4].

One final striking cross-national anomaly was the low SDQ symptom and impact scores of children with a DAWBA diagnosis in Goa, India. This could reflect a cross-national rating bias, such that the threshold for assigning DAWBA diagnoses was lower in India than elsewhere. This, however, would imply that the true prevalence in our Indian sample was even lower than the (already exceptionally low) 2.2% recorded. Instead, the judgement of the experienced local adolescent psychiatrist (VP) is that Indian informants were understating child mental health symptoms and impact. This counterpoint to the overstatement hypothesised in rural Yemen again highlights the importance of using local cultural and linguistic knowledge when reading the DAWBA transcripts and interpreting responses to structured questions.

To summarise, this paper uses a uniquely rich dataset to demonstrate substantial cross-cultural differences in how parents report child mental health problems on the SDQ. Our findings also demonstrate that these cross-cultural differences take many different forms and do not show any obvious systematic pattern. We conclude that the SDQ cannot be used as a short-cut to comparing prevalence cross nationally. Furthermore, we hypothesise that this may also apply to other widely used questionnaires such as the Rutter questionnaires [42] and the ASEBA [43–45], which are similar to the SDQ in their format, items and psychometric properties [6, 25, 46]. We therefore recommend that questionnaires only be used in cross-cultural comparisons when their cross-cultural equivalence has been empirically demonstrated. Doing so may require detailed diagnostic measurements that employ local and contextual knowledge in order to provide population-specific reference points for judging the performance of brief questionnaire measures.

Such cross-national comparisons based on detailed, culturally sensitive assessments will clearly require substantially more time and resources than questionnaire-based studies. Nonetheless, their potential importance is illustrated by the almost eightfold difference between the 2.2% prevalence of child mental disorder in our Indian sample and the 17.1% prevalence in our Russian sample. This is far greater than the variation typically seen within populations from the same country; for example, prevalence ‘only’ varied from 5.7 to 13.5% between the least deprived and most deprived deciles in Britain. Under such circumstances, multi-population studies may yield powerful new insights into the determinants of population health [7]. Understanding international differences in child mental health therefore remains a key research goal in seeking to improve child mental health worldwide, but achieving this may require more than questionnaire comparisons.