Introduction
According to statistics by ROAMER (
Roadmap for Mental Health and Wellbeing Research in Europe), mental health disorders account for about 11% to 27% of all diseases (Haro et al.,
2014). One of the most common mental health disorders is depression, which is a leading cause of disability worldwide with more than 322 million people all around the world suffering from it (WHO,
2017,
2020). Depression can lead to problems at work, at school, and in the family (WHO,
2020). The total estimated number of people living with depression increased by 18.4% between 2005 and 2015 (Vos et al.,
2016). Another common mental health problem besides depression is anxiety. The total estimated number of people living with anxiety disorders worldwide is 264 million (WHO,
2017). This number increased by 14.9% between 2005 and 2015 (Vos et al.,
2016). Depression and anxiety affect all age groups, including children and adolescents. According to the World Health Organization, 10–20% of children and adolescents worldwide experience mental disorders (WHO,
1992), while 2–3% of children ages 6 to 12 and 6–8% of teens may have serious depression (ADAA,
2020). According to the Anxiety and Depression Association of America, about 80% of kids with anxiety, and 60% with depression are not getting treatment (ADAA,
2020).
Another common mental health problem among children is attention-deficit hyperactivity disorder (ADHD). Studies show that between 5% and 8.5% of children and 2.5% of adults have ADHD (Danielson et al.,
2016; Dulcan,
1997; Polanczyk et al.,
2007; Simon et al.,
2009), while symptoms persist into adulthood in about 50% of cases (Karam et al.,
2015; Schmitz et al.,
2007).
Both genetic and non-genetic factors play important roles in mental health problems (e.g. Hettema et al.,
2005; Kendler et al.,
2011; Nadder et al.,
1998; Silberg et al.,
1999; Silberg et al.,
2001; Silove et al.,
1995; van den Berg et al.,
2006). Therefore, it is important to study the causal mechanisms and the development of mental health symptoms across the entire lifespan, as well as prevention and the effects of interventions. This can be done with the help of genetically informative cohorts, genotyped samples, or a combination of both.
The EU-funded project CAPICE (
Childhood and Adolescence Psychopathology: unravelling the complex etiology by a large interdisciplinary collaboration in Europe; Rajula et al.,
2021) has objectives directly relevant to this. It focuses on improving the later outcomes of child and adolescent mental health problems related to anxiety, depression, and ADHD. The CAPICE consortium consists of several research groups with both phenotypic and genotypic data (Rajula et al.,
2021). A common problem in research consortia is that the phenotypes are assessed with different questionnaires, resulting in difficulties combining the data from different studies (Luningham et al.,
2019; van den Berg et al.,
2014). This paper focuses on one objective of CAPICE: to construct a common metric for anxiety, depression, and ADHD phenotypes in order to harmonize them across research groups. At a later stage, these harmonized phenotypes can be used to make meaningful comparisons across countries and to boost statistical power by increasing sample size, which is particularly important for genetic studies.
Two widely used screening instruments (or questionnaires) for psychopathology in children are
(a) the Strengths and Difficulties Questionnaire (SDQ) and
(b) the Child Behaviour Checklist (CBCL). These questionnaires have been used extensively by the consortium partners, and research would benefit greatly if the data from the various partners could be harmonized. Various studies have explored the development of internalizing and externalizing problem behaviour using the CBCL (Achenbach,
1991; Allen & Prior,
1995; Caspi et al.,
1995) and SDQ (Muris et al.,
2003; Ortuno-Sierra et al.,
2015). The CBCL consists of 113 items and operationalizes childhood behaviour on eight subscales/dimensions (social withdrawal, somatic complaints, anxiety/depression, social problems, thought problems, attention problems, delinquent behaviour, and aggressive behaviour; Achenbach & Ruffle,
2000; Achenbach et al.,
1991). The SDQ consists of 25 items equally divided across five scales, also called
dimensions (Emotional, Conduct, Hyperactivity, Peer, and Prosocial problems; Goodman,
1997,
2001) and it is used for children aged 3–16 years.
The two questionnaires largely assess the same dimensions of child psychopathology: Emotional problems, Hyperactivity, Social problems, and Conduct problems. When it comes to anxiety and depression, the CBCL has Anxiety/depressed and Withdrawn/depressed subscales. Although the SDQ does not have separate scales for anxiety and depression, it has an Emotional problems scale that addresses these difficulties. When it comes to ADHD-type problems, the SDQ has a hyperactivity-inattention scale that also includes items related to concentration problems, while the CBCL has an attention problems subscale that includes both hyperactivity and attention problems.
The main differences between the two questionnaires concern the phrasing of the individual questions and the response format. For example, the SDQ asks whether the child is easily distracted, whereas the CBCL asks whether the child can concentrate. The SDQ is rated with the answer categories not true, somewhat true, and certainly true, whereas the CBCL is rated with the answer categories not true, somewhat/sometimes true, and very true/often true. Given these differences in phrasing and response formats, item responses or scale scores on the SDQ and CBCL cannot be directly compared, as they cover slightly different sets of behaviour and different numbers of items. This means that if one child scores 4 points on the SDQ subscale for anxiety/depression (Emotional problems), and another child scores 4 points on the anxiety/depression subscale of the CBCL, it is usually not possible to say which child shows the more problematic behaviour and by how much they differ. To compare SDQ and CBCL scores, we need to know how the items were phrased (do they address serious or less serious problems?), how the items are scored and with what response categories, and the number of items. Even if we know all of the above, it is still hard to draw conclusions regarding the quantitative differences between these two scores. Therefore, a common metric is needed to quantify individual differences on two subscales from different questionnaires. Once a common metric is found, scores from different questionnaires can be harmonized by transforming the original scores to new scores on the common metric.
Finding a common metric for comparable subscale scores can be achieved by different methodologies. A common choice is to use the methodology of test linking (Kolen & Brennan,
2014). There are multiple ways to carry out test linking and one of the widely used approaches is the framework of Item Response Theory (IRT; Embretson & Reise,
2000; Kolen & Brennan,
2014). The use of IRT is a common and flexible method for modeling the relationship between participants' trait levels and their responses to items (Park et al.,
2019; van den Berg et al.,
2007,
2000). In the IRT approach, a participant's response to an item depends on both the participant's trait level and the item parameters of that particular item (Embretson & Reise,
2000). In order to construct a common metric, we need a sample from individuals with overlapping data on both questionnaires (Hussong et al.,
2013; Luningham et al.,
2019). By applying the IRT approach to such a sample with data on both questionnaires, we are able to define a common metric. As an example, in the Genetics of Personality Consortium, big-five personality phenotypes were harmonized across several personality questionnaires (van den Berg et al.,
2014). They used IRT models on data sets in which groups of participants responded to multiple questionnaires. Assuming that the level of the underlying trait does not change between filling in Questionnaire A and Questionnaire B, one can use IRT to link the items from both questionnaires and define one common metric. Using the IRT approach, data from more than 23 cohorts worldwide could be harmonized, resulting in genome-wide association (GWA) meta-analyses of neuroticism (De Moor et al.,
2015) and extraversion (van den Berg et al.,
2016) with large sample sizes. This approach not only increases sample size but also overcomes the problem of comparing allelic effect sizes across studies (van den Berg & de Moor,
2020).
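To make the IRT approach concrete, the sketch below computes category response probabilities under a graded response model, the kind of IRT model typically used for ordinal questionnaire items. The item parameters are hypothetical, chosen only for illustration (they are not the calibrated values from this study); the point is that, on a common metric, one trait level theta drives the responses to items from both questionnaires.

```python
import math

def grm_category_probs(theta, a, b):
    """Graded response model: probabilities of each ordered response
    category (e.g. 0 = not true, 1 = somewhat true, 2 = certainly true)
    given trait level theta, discrimination a, and ordered thresholds b."""
    # Cumulative probabilities P(X >= k); P(X >= 0) = 1 by definition.
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
    # Category probabilities are differences of adjacent cumulative curves.
    return [cum[k] - cum[k + 1] for k in range(len(b) + 1)]

# Hypothetical parameters for one SDQ-style and one CBCL-style item;
# on a common metric, the same theta drives the responses to both.
probs_sdq = grm_category_probs(theta=1.0, a=1.5, b=[0.0, 1.2])
probs_cbcl = grm_category_probs(theta=1.0, a=2.0, b=[-0.3, 0.9])
# Each list holds three probabilities that sum to 1.
```

Because the two items share the same theta but have different discriminations and thresholds, they contribute different amounts of information about the trait, which is exactly what linking exploits.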
In this paper, we focus on harmonizing phenotypes for childhood psychopathology with a specific interest in the CAPICE phenotypes: anxiety, depression, and ADHD. We used the data from the Western Australian Pregnancy Cohort – Raine study (Newnham et al.,
1993) to link CBCL and SDQ data from children aged 10–11.5 years. In the Raine study, parents were asked to fill in the CBCL and SDQ about their child at the same moment in time, when the child was between 10 and 11.5 years old. We have responses of 1330 participants on both the CBCL and SDQ questionnaires, which allows us to link the items from both questionnaires and to define one common metric. This provided a unique opportunity to see whether anxiety, depression, and ADHD dimensions shared by both questionnaires can be found and whether common metrics can be defined. Item response theory models were applied to define the common metrics.
The results of this study are useful for data harmonization of anxiety, depression, and ADHD in the CAPICE project as well as in other research consortia. Harmonization of phenotypes is especially important in large research consortia, but it can also be helpful in smaller research studies. Our findings can be applied in all situations where researchers need to harmonize data from two samples which have filled in two different questionnaires for measuring anxiety, depression, or ADHD (one sample with CBCL data, one sample with SDQ data) to be able to compare research results across countries or subpopulations or to increase the size of the sample (Hamilton et al.,
2011; Smith-Warner et al.,
2006; Thompson,
2009; van den Berg et al.,
2014). In addition, the results can be used to analyse longitudinal data where CBCL data were gathered at age
x and SDQ data were gathered at age
y. Lastly, if a sample contains both CBCL and SDQ data at the same age from the same individual, our results can be used to increase measurement reliability or validity (Fortier et al.,
2010,
2011).
It is especially important to mention the advantages of using harmonized scores in behaviour genetic studies. Most importantly, the use of harmonized scores leads to an increase in sample size and/or an increase in measurement precision of the phenotypes, and that in turn increases the statistical power of the results (van den Berg et al.,
2014). We will introduce the advantages of using harmonized scores in behaviour genetic studies in more detail below.
Increasing Statistical Power of the Results Using Harmonized Scores in Behaviour Genetic Studies
The effect sizes in the case of complex human traits are often small and that is one of the main reasons why meta-analytic studies (e.g. genome-wide association (GWA) studies) are required in the field of behaviour genetics (van den Berg et al.,
2014; Zeggini & Ioannidis,
2009). One of the biggest problems in the meta-analysis of behavioural measures is caused by the use of different measurement instruments across studies for assessing a particular phenotype (van den Berg et al.,
2014). In most meta-analytic studies, one is not able to meaningfully compare effect sizes across studies (van den Berg & de Moor,
2020). The coefficient of regression of phenotype on the number of alleles for a single-nucleotide polymorphism (SNP) gets a different meaning every time that the scale of the phenotypic measure changes. Instead, one could compare
p-values, but then we lose information about the direction of the effect (Begum et al.,
2012; Sullivan & Feinn,
2012; Zeggini & Ioannidis,
2009). Instead of
p-values, one could look at standardized regression coefficients, which yield information about the direction of the effect. Unfortunately, however, they do not give information about the absolute size of the effect, as unstandardized regression coefficients do. If the phenotype is assessed by the same instrument in all studies, the unstandardized regression coefficient tells us how many units on the scale of the common instrument we gain for every additional allele (in the case of an additive inheritance pattern); it yields information on both the direction and the size of the effect.
Thanks to harmonized scores, researchers can use larger samples in meta-analytic studies, leading to higher statistical power (Sullivan et al.,
2012). For example, van den Berg et al. (
2014) showed that the statistical power to detect a SNP at the genome-wide significance level that explains 0.1% of the true phenotypic variance with an allele frequency of 0.5 substantially increased from 18% to 44% after harmonization.
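For intuition about how GWA power depends on sample size, the following stdlib-only Python sketch approximates power under the common assumption that the association test statistic is a 1-df noncentral chi-square with noncentrality n times the variance explained. The sample sizes below are illustrative; the figures reported by van den Berg et al. (2014) come from their own analysis.

```python
import math
from statistics import NormalDist

def gwa_power(n, var_explained, alpha=5e-8):
    """Approximate power to detect a SNP explaining the fraction
    `var_explained` of phenotypic variance in a sample of size n,
    tested two-sided at genome-wide significance level alpha.
    The test statistic is modelled as a 1-df noncentral chi-square
    with noncentrality parameter n * var_explained."""
    z = NormalDist()
    crit = z.inv_cdf(1.0 - alpha / 2.0)   # critical value on the z scale
    s = math.sqrt(n * var_explained)      # sqrt of the noncentrality
    # chi2_1(ncp) = (Z + s)^2, so the rejection region splits into two tails.
    return z.cdf(s - crit) + z.cdf(-s - crit)

# Power rises steeply with n for a SNP explaining 0.1% of the variance.
power_small = gwa_power(15000, 0.001)
power_large = gwa_power(35000, 0.001)
```

The steep dependence on n is why doubling the usable sample through harmonization can more than double power at genome-wide significance thresholds.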
Besides that, the use of harmonized scores also leads to an increase in measurement precision, and higher measurement precision leads to an increase in the statistical power of the results (van den Berg et al.,
2014). In other words, the use of harmonized scores can also lead to higher statistical power than the use of non-harmonized scores (i.e. sum scores), even if the sample size is the same in both cases. This happens, for example, when items from other scales are included in the measurement model.
In this study, after the scale linking, we will conduct a simulation analysis in order to compare the statistical power of the results in meta-analyses based on sum scores and harmonized scores. In both cases, we will use the same sample size and the same number of items.
Discussion
This study aimed to construct common metrics for anxiety, depression, and ADHD as measured by the Child Behaviour Checklist (CBCL) and Strengths and Difficulties Questionnaire (SDQ). We used a top-down (theoretical) and bottom-up (data-driven) approach. In the top-down approach, we used existing scales related to anxiety/depression and ADHD, while in the bottom-up approach we conducted a factor analysis to identify anxiety/depression and ADHD scales and to examine whether the theoretical (top-down) approach is supported by the data-driven (bottom-up) approach.
Regarding the theoretical approach, existing anxiety/depression scales consisting only of CBCL items or only of SDQ items showed good measurement properties in the Raine data set, as did a combined anxiety/depression scale consisting of both CBCL and SDQ items. In the case of ADHD, the separate CBCL and SDQ scales, as well as a combined scale consisting of both CBCL and SDQ items, are good-quality scales.
Regarding the bottom-up (data-driven) approach, the psychometric analysis showed that all items in the anxiety/depression scales consisting only of CBCL items or only of SDQ items show a good fit; both scales, analysed separately, are good. The items from the scale consisting of both CBCL and SDQ items also show good item fit, so we concluded that the CBCL and SDQ items together form a good-quality scale that operationalizes anxiety/depression. For ADHD, the scale consisting only of CBCL items shows a good fit, as does the scale consisting only of SDQ items, and the combined ADHD scale consisting of both CBCL and SDQ items is good, too.
A comparison between the top-down and bottom-up scales showed a large overlap. The top-down approach, where we started from existing subscales, is therefore supported by a purely data-driven approach. Accordingly, we advise using the top-down scales for both anxiety/depression and ADHD for data harmonization in other studies (Tables
3 and
4). Based on these item parameters, EAP estimates can be calculated for each participant, which can function as harmonized scores.
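To illustrate how EAP estimates are obtained from item parameters, here is a minimal Python sketch using hypothetical graded-response parameters, a standard-normal prior, and simple grid integration. In practice, software such as the mirt/mirtCAT packages performs this step; this sketch only shows the underlying computation.

```python
import math

def grm_prob(theta, a, b, x):
    """P(response = x) for one item under the graded response model."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
    return cum[x] - cum[x + 1]

def eap(responses, items, lo=-4.0, hi=4.0, n_grid=81):
    """EAP trait estimate: posterior mean of theta over an equally spaced
    grid, combining a standard-normal prior with the GRM likelihood."""
    step = (hi - lo) / (n_grid - 1)
    num = den = 0.0
    for i in range(n_grid):
        theta = lo + i * step
        weight = math.exp(-0.5 * theta * theta)  # prior density (unnormalized)
        for x, (a, b) in zip(responses, items):
            weight *= grm_prob(theta, a, b, x)   # multiply in the likelihood
        num += theta * weight
        den += weight
    return num / den

# Hypothetical (discrimination, thresholds) pairs on the common metric.
items = [(1.5, [0.0, 1.2]), (2.0, [-0.3, 0.9]), (1.2, [0.5, 1.5])]
theta_hat = eap([2, 1, 0], items)  # harmonized score for one response pattern
```

Because the estimate is driven by the item parameters rather than by a raw sum, response patterns from different questionnaires land on the same theta scale, which is what makes the scores comparable.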
One problem we encountered is that some items are very similar across questionnaires. For psychometric reasons, such items had to be deleted from one of the scales (the CBCL). This, however, leads to a loss of information, both content-wise and in the number of items, affecting validity and reliability. We showed that an approach in which latent trait levels are estimated using the parameters of only the CBCL items from the combined scale (without the CBCL items excluded for psychometric reasons) indeed leads to a loss of information. Accordingly, we devised an approach using linear regression to overcome this problem: we obtained item parameters for the excluded items via a linear transformation, so that other researchers also have item parameters for these items (Tables
3 and
4). This restores both reliability and validity.
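The linear-transformation step can be sketched as follows, under the common assumption that the two metrics are related by theta* = A·theta + B, in which case graded-response discriminations divide by A and thresholds map to A·b + B. The anchor-item thresholds and the least-squares fit below are hypothetical and for illustration only; the exact regression used in this study may differ.

```python
def fit_linear_link(b_source, b_target):
    """Estimate slope A and intercept B of b_target ~ A * b_source + B
    by ordinary least squares, using threshold parameters of items
    calibrated on both metrics."""
    n = len(b_source)
    mx = sum(b_source) / n
    my = sum(b_target) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(b_source, b_target))
    sxx = sum((x - mx) ** 2 for x in b_source)
    A = sxy / sxx
    B = my - A * mx
    return A, B

def transform_item(a, b, A, B):
    """Map one item's graded-response parameters onto the target metric:
    the discrimination divides by A, thresholds are rescaled and shifted."""
    return a / A, [A * bk + B for bk in b]

# Hypothetical thresholds of anchor items calibrated on both metrics.
A, B = fit_linear_link([-0.5, 0.3, 1.1], [-0.4, 0.7, 1.8])
a_new, b_new = transform_item(1.4, [0.2, 1.0], A, B)
```

Once A and B are fitted on the shared (anchor) items, the same transformation can be applied to items that were excluded from the joint calibration, recovering parameters for them on the common metric.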
The results also showed that the use of harmonized scores solves one of the biggest problems in the meta-analysis of behavioural measures. Namely, one of the main difficulties in the meta-analytic approach is the use of different measurement instruments across studies for assessing a particular phenotype (van den Berg et al.,
2014). Consequently, effect sizes cannot be meaningfully compared across studies (van den Berg & de Moor,
2020). Instead, researchers need to use standardized regression coefficients or
p-values, but that leads to a loss of information about the absolute size of the effect, in the case of standardized regression coefficients, and about both the absolute size and the direction of the effect, in the case of
p-values. The use of harmonized scores solves this problem because it allows the use of unstandardized regression coefficients as a measure of effect size. These provide information on both the direction and the size of the effect; that is, they allow meaningful comparison of absolute effect sizes across studies. In addition, previous studies showed that the use of harmonized scores can increase the sample size and, accordingly, the statistical power of the results (e.g. van den Berg et al.,
2014). In this study, we showed that the use of harmonized scores leads to higher statistical power than the use of sum scores, even if the sample size is the same in both cases.
These findings can help future researchers harmonize data from different samples and/or different questionnaires, and they can be useful in various ways, both in CAPICE and in other research consortia. Researchers can combine data from different groups of respondents with different questionnaires to obtain larger sample sizes, to compare research results across subpopulations, or to increase the generalizability, validity, or statistical power of research results (Fortier et al.,
2010,
2011; Hamilton et al.,
2011; Smith-Warner et al.,
2006; Thompson,
2009; van den Berg et al.,
2014). For example, researchers can use the CBCL item parameters from a combined scale to estimate latent trait levels among participants who filled in only CBCL items, and the SDQ item parameters from a combined scale for participants who filled in only SDQ items. Based on those estimates, they can then compare results between persons who filled in only the CBCL or only the SDQ. When participants' answers on both questionnaires are available, the results of this study can be used to estimate participants' latent trait levels based on both questionnaires together, thereby increasing measurement reliability. These findings can also be used in longitudinal studies where CBCL data are gathered at age
x and SDQ data are gathered at age
y.
In the supplementary material, we describe how to use these results in practice. The procedure is very simple, and a detailed example R script is provided that shows how item responses can be used to obtain harmonized scores for anxiety/depression and ADHD on the respective common metrics using the Computerized Adaptive Testing with Multidimensional Item Response Theory (mirtCAT) package in R (Chalmers,
2016) of R. In Supplementary Table
16, we presented harmonized scores for various example data vectors. For instance, response pattern A includes only CBCL items with a total sum score of 6; response pattern B includes only SDQ items with the same sum score of 6. Based on our combined scale, however, we see that the scores on the common metric differ, with different standard errors (reliability) as well.
One important limitation of the current study is that the quality of the harmonization depends on the extent to which there is measurement invariance across different populations: other cohorts from other countries, using different languages, and at different ages. Future research should compare item parameters from different cohorts to determine to what extent the results of this harmonization effort extend beyond Australian 10-year-olds.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.