Introduction

Children scoring high on withdrawn behavioral scales are characterized by shy, inhibited, introverted and withdrawn behavior (WB). WB correlates with symptoms of anxiety and depression (Verhulst et al. 1996), and WB in childhood has been shown to predict anxiety disorders and major depression in adolescence and adulthood (Goodwin et al. 2004). In a follow-up study spanning 14 years, Hofstra et al. (2000) found that parent-reported WB was an important predictor of malfunctioning in adulthood. WB at the time of first measurement predicted both adult internalizing and externalizing problems 14 years later. Furthermore, inhibited 3-year-olds (children who are shy, fearful and easily upset) were more likely to meet diagnostic criteria for depression when they were 21 years old (Caspi et al. 1996). Children described as “shy” at multiple time points showed increased incidence of anxiety problems in adolescence (Prior et al. 2000). The evidence that childhood WB is a predictor of anxiety and depression later in life is further supported by laboratory studies of behavioral inhibition. Behavioral inhibition (characterized by shy, inhibited behavior, and fear of novel situations) is present in about 10 to 15% of children (Kagan et al. 1988). Behaviorally inhibited children have higher rates of childhood anxiety disorders (Rosenbaum et al. 1993; Biederman et al. 2001) and are at increased risk of developing adolescent social phobia (Hayward et al. 1998).

Longitudinal studies indicate that behavioral problems show considerable stability. In a large population sample of Dutch children, a correlation of .48 for problem behaviors across an 8-year period was found (Verhulst et al. 1996). In a follow-up of this study the stability of general behavioral problems over 14 years was found to be .43 (Hofstra et al. 2000), whilst the continuity of withdrawn behavioral problems was .36. The continuity of problem behaviors advocates research into the underlying mechanisms influencing stability of behavioral traits. In the last decades a range of longitudinal studies have focused on childhood externalizing problem behaviors in general (e.g. Van der Valk et al. 2003a; Bartels et al. 2004b; Fergusson 1998; Haberstick et al. 2005) and more specific problem behaviors such as attention problems (e.g. Rietveld et al. 2004; Mannuzza et al. 2003), aggression (e.g. Campbell et al. 2006; Alink et al. 2006) or conduct disorder (e.g. Fergusson et al. 2005; Kim-Cohen et al. 2003). Likewise, substantial attention has been devoted to internalizing behaviors in general (e.g. Van der Valk et al. 2003a; Bartels et al. 2004b; Haberstick et al. 2005) and more narrowly defined problems such as anxiety disorder and depression (e.g. Fombonne et al. 2001; Tram and Cole 2006). However, surprisingly little research has focused on withdrawn behavioral problems.

A powerful way of unraveling the genetic and environmental effects on individual differences in the development of behavioral problems is the study of genetically related individuals. Both cross-sectional and longitudinal studies using the classical twin design have been conducted to assess heritability estimates for broad band internalizing and externalizing problem behaviors (Bartels et al. 2004b; Van der Valk et al. 2003a) as well as for specific syndrome scales such as aggression (Van Beijsterveldt et al. 2003; Haberstick et al. 2006), obsessive compulsive disorder (Hudziak et al. 2004; van Grootheest et al. 2007), juvenile bipolar disorder (Boomsma et al. 2006b), attention problems (Rietveld et al. 2004), and anxious/depression (Boomsma et al. 2005). However, no large scale longitudinal twin studies into WB have been reported.

Family studies into childhood WB have been scarce, but there are indications that familial factors play a role. Behavioral inhibition is more frequent in children whose parents have agoraphobia and panic disorder (Rosenbaum et al. 1988), and anxiety disorders are more frequent in the families of behaviorally inhibited children (Rosenbaum et al. 1991). Furthermore, a study in a large sample of 4-year-old twins reported a heritability of 76% (boys) and 66% (girls) for shyness/inhibition, as assessed with a 3-item questionnaire (Eley et al. 2003). A few twin studies examined the cross-sectional heritability of WB at various ages in childhood using the Child Behavior Checklist (CBCL; Achenbach 1991; Achenbach 1992). An early twin study in a relatively small sample of 2 to 3 year-old twins found no significant genetic effects on variance in WB (Schmitz et al. 1995). Contrary to these findings, Van den Oord et al. (1996) reported major genetic influences (74%) and no evidence for shared environmental influences on individual differences in WB in a sample of 1,358 3-year-old twin pairs. Eight years later, Derks et al. (2004b) analyzed data on WB of more than 9,000 3-year-old twin pairs, including the data used in Van den Oord’s study and found moderate heritability (about 60% in boys; 45% in girls) and significant shared environmental effects. Two early twin studies examined the heritability of WB in middle childhood (sample sizes 181 and 203 pairs) and reported significant genetic effects (Edelbrock et al. 1995; Schmitz et al. 1995). On the other hand, a twin study from Taiwan including 279 12 to 16-year-old twin pairs (Kuo et al. 2004) found no significant genetic influences and major effects of shared and non-shared environment. One study compared WB data of biological and non-biological adopted siblings and found modest genetic influences at first assessment (age between 10 and 15 years), but no significant genetic effects 3 years later (Van der Valk et al. 1998). 
These family studies are all based on parental ratings of WB. Using teacher report data of WB in 5-year-old twins, Polderman et al. (2006) found moderate genetic (49%) and non-shared environmental (51%) effects.

Only two longitudinal studies (Van der Valk et al. 1998; Schmitz et al. 1995) have examined the genetic influences on the stability of childhood WB, and both failed to find significant genetic contributions to stability. However, in both studies the power to detect such effects was very low, due to limited sample size (Schmitz et al. 1995) or the design of the study (Van der Valk et al. 1998).

To summarize, studies into the heritability of childhood WB have yielded varying results. Large scale studies into WB at later ages in childhood are lacking. Moreover, little is known about the genetic and environmental mechanisms underlying stability in WB. Unlike the WB syndrome scale of the CBCL, the broad band internalizing scale that it is part of has received a fair amount of attention in the field of behavior genetics. Longitudinal studies in our Dutch twin sample suggested increasing influence of the shared environment and decreasing genetic effects between the age of 3 and 12 years (Van der Valk et al. 2003a; Bartels et al. 2004b, 2007a). Exploring the relative influence of genes and environment on behavioral stability, Bartels et al. (2004b) found that both genes (average influence 43%) and shared environment (average influence 47%) had a major impact on the continuity of internalizing problems over time. A study using teacher ratings of internalizing behavior failed to detect significant effects of the shared environment and found that genetic effects were mainly responsible for the stability of the behavior (Haberstick et al. 2005). Apart from WB, the internalizing problems scale of the CBCL also encompasses the syndrome scales Somatic Complaints and Anxious/Depressed behavior. Focusing on the latter syndrome scale, a recent study in our twin sample showed decreasing genetic effects with age and increasing shared environmental effects (Boomsma et al. 2005). Furthermore, virtually no sex differences in the magnitude of the heritability estimates were observed, suggesting that the influences of genetic and environmental effects are similar across the sexes. Interestingly, the two largest studies into WB (Derks et al. 2004b) and shyness/inhibition (Eley et al. 2003), both conducted in early childhood, did detect significant sex differences in the heritability estimates, with the influence of genetic effects being larger in boys than in girls. These results suggest that, in contrast to the Anxious/Depressed scale, the genetic and environmental influences on WB may be different for boys and girls, thus warranting further studies into WB at later ages. The aims of the current study are twofold. Firstly, this project, which is a follow-up of the twin sample studied by Derks et al. (2004b), aims to examine the etiology of variation in WB at various time points across childhood. Secondly, using the longitudinal nature of this study, we aim to assess the genetic and environmental factors underlying the stability of childhood WB.

Ratings of both maternal and paternal reported WB were incorporated in the analyses. Several studies into childhood behavioral problems have shown that different informants can provide different information about children’s behavior (Verhulst et al. 1996; Achenbach and Rescorla 2000; Achenbach and Rescorla 2001; Van der Ende and Verhulst 2005; Seiffge-Krenke and Kollmar 1998). Achenbach and Rescorla (2000; 2001) reported correlations between maternal and paternal ratings of WB of .69 for preschool children and of .57 for school-aged children. These correlations were based on data from a combined clinical and general population sample. In a general population only sample using the Dutch CBCL, the correlation between parental ratings of WB was found to be between .48 and .79 (Verhulst et al. 1996). The less-than-perfect correlation between parental ratings implies rater disagreement. Various studies have explored the sources of parental disagreement in ratings of problem behavior (Derks et al. 2004b; Bartels et al. 2003, 2004a; Van der Valk et al. 2003b) using different structural equation models. Generally, it was found that parental agreement and disagreement was best explained by a psychometric model. This model, developed by Hewitt et al. (1992), assumes that parents not only assess the exact same behavior of a child, but also rate an informant specific aspect of the child’s behavior. This unique perception of the child’s behavior can arise if the child behaves differently towards the different raters (e.g. the child is more withdrawn when it is with its mother than when it spends time with its father), or if the raters observe the child in different situations (e.g. the mother observes the child more often in the home environment, whilst the father often observes the child interacting with other children in the playground). 
Apart from these real differences in behavior, the unique perception of the child’s behavior may also be influenced by rater bias and unreliability. Rater bias may arise if parents hold on to different normative standards, have specific response styles, or tend to stereotype the child’s behavior. Unreliability may be an important source of rater disagreement if raters cannot give an accurate description of the behavior under study. This may be relevant to our analyses, as some studies suggest that parents may be relatively insensitive to the more covert internalizing problems of children (Seiffge-Krenke and Kollmar 1998; Ollendick and King 1994). A previous longitudinal multiple rater study in our sample found that rater disagreement variance accounted for 35% of the individual differences in internalizing problems (Bartels et al. 2007a).

In the present study, stability of WB was assessed in longitudinal CBCL data from a large sample of 3, 7, 10 and 12 year-old twin pairs. The sample included roughly equal numbers of boys and girls. Genetic and environmental effects on stability in childhood WB were examined for both sexes. As both mother and father ratings of the twin’s behavior were incorporated in the analyses, we could partial out rater bias effects by distinguishing between variance that is shared between parents (i.e. perception of the child’s behavior common to both raters) and variance that is specific to one rater and might include variance due to rater bias. To examine the mechanism underlying the stability of WB, various developmental models were fitted to this longitudinal data set.

Method

Participants

All participants were contacted via the Netherlands Twin Register (NTR), kept by the Department of Biological Psychology at the VU University in Amsterdam (Boomsma et al. 2002, 2006a; Bartels et al. 2007b). From 1986 onwards the NTR has recruited families with multiples a few weeks or months after birth. Currently 40–50% of all multiple births are registered at the NTR. For the present study data from twins born in 1986–2001 were included. Parents of the twins were asked to fill out a questionnaire assessing the twin’s behavior at age 3, 7, 10 and 12 years. The questionnaires were mailed within 3 months of the twin’s 3rd, 7th, 10th and 12th birthdays. Two to three months after this mailing, reminders were sent to the non-responders. If finances permitted, persistent non-responders were contacted by phone. This procedure yielded a response rate between 61% and 73% (Bartels et al. 2007b). Non-responders also include twin families who moved to an unknown address. From the original sample, 281 families were excluded because either one or both of the children had a disease or handicap that interfered with daily functioning. The total sample consisted of 14,889 twin families. Ratings from both parents were available for 8,479 families when the twins were 3 years old, 6,414 at age 7, 4,133 at age 10, and 2,900 at age 12. Complete data from both parents at all time points were available for 1,160 families. Maternal ratings were available for 14,735 families, of which 13,095 participated at age 3, 8,855 at age 7, 5,863 at age 10, and 3,958 at age 12. Maternal data at all ages were available for 2,797 families. Paternal ratings were available for 11,499 families, of which 8,794 families participated when the twins were 3 years old, 6,522 at age 7, 4,237 at age 10, and 2,974 at age 12. Complete father data on all four time points were available for 1,290 families. 
This study is part of an ongoing project; the children born in later birth cohorts have not reached the age of 7, 10 or 12 years yet, which explains the decreasing numbers of participating families at the later ages.

To examine effects of sample attrition, we compared WB scores at age 3 of families who continued to participate at all other time points (when the twins were 7, 10, and 12 years of age) with those of families who only responded twice, once, or zero times at the subsequent time points. For the father ratings there were no mean differences between these groups, neither for boys nor for girls. For the mother ratings a significant effect of attrition was observed for both boys and girls. Mothers who continued to participate reported lower WB scores when their twins were 3 years old than mothers who did not participate at one or more of the later measurement occasions. However, these effects were small (effect size r = .07 in both boys and girls). No systematic differences in variance were observed between complete and partial responders.

Of all participating twin pairs, 2,310 were monozygotic males (MZM), 2,591 were dizygotic males (DZM), 2,619 monozygotic females (MZF), 2,339 dizygotic females (DZF), 2,566 opposite sex twins with a male firstborn (DOSMF), and 2,464 opposite sex twins with a female firstborn (DOSFM). For 1,380 same-sex twin pairs zygosity was based on DNA polymorphisms (n = 1,039) or blood group (n = 341; Van Dijk et al. 1996). For the remaining same sex twin pairs (n = 8,479) zygosity was determined by discriminant analysis using longitudinally collected questionnaire items. This method has proven to be of sufficient reliability: Rietveld et al. (2000) reported that agreement between this method and zygosity determination by blood/DNA polymorphisms was 93%.

Measures

Mother and father ratings of WB problems were obtained from the withdrawn/depressed syndrome scale of the CBCL/2-3 at age 3. The CBCL/2-3 (Achenbach 1992) has been translated and validated for the Dutch population (Koot et al. 1997). The withdrawn/depressed scale in the CBCL/2-3 consists of 10 items (see Table 1). The parents were asked to rate the behavior of the child on a 3-point scale based on the occurrence of the behavior in the past 2 months. They were asked to rate the behavior as 0 if the problem item was not true; 1 if the item was somewhat or sometimes true; and 2 if it was very true or often true.

Table 1 Overview of the items included in the withdrawn behavior scale of the CBCL

At age 7, 10 and 12 WB was assessed using ratings of the WB syndrome scale of the CBCL/4-18 (Achenbach 1991; Verhulst et al. 1996). The syndrome scale withdrawn encompasses nine items, partially overlapping with the items in the CBCL/2-3 (see Table 1). This time the parents were asked to rate the behavior of the child in the preceding 6 months, on a 3-point scale identical to the scale used in the CBCL/2-3. For all ages, if more than two items on the WB scale were missing, the data were regarded incomplete and excluded from the analyses.
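The scoring rule described above can be sketched as follows. This is a minimal illustration, not the authors' scoring procedure: the function name `score_wb_scale` is hypothetical, and the text does not say how one or two missing items were handled, so the sketch simply sums the answered items and labels that choice as an assumption.

```python
from typing import Optional, Sequence

MAX_MISSING = 2  # from the text: more than two missing items means "incomplete"


def score_wb_scale(items: Sequence[Optional[int]]) -> Optional[int]:
    """Return a WB sum score, or None when the scale counts as incomplete.

    Each item is rated 0 (not true), 1 (somewhat or sometimes true) or
    2 (very true or often true); None marks a missing answer.
    """
    missing = sum(1 for item in items if item is None)
    if missing > MAX_MISSING:
        return None  # excluded from the analyses
    answered = [item for item in items if item is not None]
    if any(item not in (0, 1, 2) for item in answered):
        raise ValueError("CBCL items are rated 0, 1 or 2")
    # Assumption (not stated in the text): missing items add nothing to the sum.
    return sum(answered)
```

For a nine-item CBCL/4-18 record with two unanswered items the function returns the sum of the remaining seven ratings; with three unanswered items it returns `None`.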

Data analyses

Descriptive statistics for WB at age 3, 7, 10 and 12 were calculated using SPSS 13; correlations and cross-correlations were estimated using the software package Mx (Neale et al. 2006). To assess stability of WB over time, phenotypic correlations between the time points were estimated for boys and girls separately. Twin correlations at each age and cross-twin-cross-age correlations were estimated for each zygosity group separately. These correlations give a first impression of the contribution of genetic and environmental effects to the variance of WB at each age, and to the stability of WB over time. Within-person inter-parent correlations were inspected to examine parental agreement on WB. The cross-rater cross-twin correlations (e.g. the correlation between the mother rating of the oldest of the twins and the father rating of the youngest of the twins) were estimated to gain insight into the contribution of genes and environment to the phenotypic variance that both raters agree on.

Genetic modeling

Since monozygotic (MZ) twins are (nearly) genetically identical, while dizygotic (DZ) twins on average only share 50% of their segregating genes, genetic model fitting of twin data allows for separation of the observed phenotypic variance into variance due to additive genetic factors (A), shared environmental (C) and non-shared environmental (E) factors. To incorporate the WB ratings from both parents simultaneously, a psychometric model was used (see Fig. 1). The psychometric model enables a distinction between the variance that is shared by both raters, and is therefore independent of rater bias and unreliability, and the variance that is rater specific. The shared variance is also called common phenotypic variance or reliable trait variance. The variance that is rater specific is variance in the child’s behavior that is uniquely perceived by one of the parents, or is associated with parental characteristics. This is also called unique or rater specific variance. In the psychometric twin model, both the common and the unique phenotypic variance are decomposed into components due to genetic, shared environmental and non-shared environmental influences. Significant genetic effects on the rater specific variance indicate that these unique perceptions of the child’s behavior are real, in the sense that they reflect variance associated with heritable behavior rather than systematic or unsystematic error, as error and unreliability do not cause systematic effects and cannot mimic genetic influences. Shared environmental effects on the unique phenotype may be confounded by rater bias, as possible influences of rater bias will act independently of the zygosity of the twins. Unique non-shared environmental influences may be confounded by measurement error or unreliability.
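The expected variances and covariances implied by this decomposition can be written out as a worked sketch. The notation is introduced here for illustration and is not taken from the original: R denotes a rating, subscript r the rater (m = mother, f = father), subscripts 1 and 2 the twins, c the common phenotype and u the rater specific phenotype. The sketch assumes that both ratings load with weight 1 on the common phenotype and that the mother specific and father specific factors are uncorrelated; the model as fitted may estimate these loadings freely.

```latex
% Assumptions: unit loadings on the common phenotype; rater specific
% factors of mother and father are uncorrelated.
\begin{align*}
\operatorname{Var}(R_{r,1})
  &= (\sigma^2_{A_c} + \sigma^2_{C_c} + \sigma^2_{E_c})
   + (\sigma^2_{A_{u,r}} + \sigma^2_{C_{u,r}} + \sigma^2_{E_{u,r}}) \\
\operatorname{Cov}(R_{m,1}, R_{f,1})
  &= \sigma^2_{A_c} + \sigma^2_{C_c} + \sigma^2_{E_c} \\
\operatorname{Cov}_{MZ}(R_{r,1}, R_{r,2})
  &= (\sigma^2_{A_c} + \sigma^2_{C_c})
   + (\sigma^2_{A_{u,r}} + \sigma^2_{C_{u,r}}) \\
\operatorname{Cov}_{DZ}(R_{r,1}, R_{r,2})
  &= (\tfrac{1}{2}\sigma^2_{A_c} + \sigma^2_{C_c})
   + (\tfrac{1}{2}\sigma^2_{A_{u,r}} + \sigma^2_{C_{u,r}}) \\
\operatorname{Cov}_{MZ}(R_{m,1}, R_{f,2})
  &= \sigma^2_{A_c} + \sigma^2_{C_c} \\
\operatorname{Cov}_{DZ}(R_{m,1}, R_{f,2})
  &= \tfrac{1}{2}\sigma^2_{A_c} + \sigma^2_{C_c}
\end{align*}
```

Under these assumptions the cross-twin-cross-rater covariances contain only common components, which is why they carry the information about the reliable trait variance, whereas the cross-twin-within-rater covariances also contain the rater specific genetic and shared environmental components.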

Fig. 1 Psychometric model for multiple raters. A = Additive genetic effects; C = Shared environmental effects; E = Non-shared environmental effects; MRT1/2/FRT1/2 = Mother/Father rating twin 1/2

To examine the stability of WB throughout childhood, the psychometric model was extended to incorporate data on all four time points (see Bartels et al. (2007a) for a comprehensive description of the longitudinal application of the psychometric model). To gain insight in what factors are important for the continuity of WB, all possible genetic and environmental influences were specified using Cholesky decompositions for both the common and rater specific phenotype. A path diagram depicting the genetic influences in this model is shown in Fig. 2; the same structures were applied to capture the shared environmental and non-shared environmental influences. This model (a saturated psychometric model) is descriptive rather than driven by any specific developmental hypothesis. However, it is useful to gain a first insight in what factors are important for the stability of WB and serves as a reference to evaluate the fit of more specific developmental models. If the more constrained developmental model fits the data significantly worse than the fully parameterized model, this indicates that the predicted developmental mechanism is inconsistent with the data, and the hypothesized model should be rejected.

Fig. 2 Longitudinal psychometric model for multiple raters, shown for one member of the twin pair and for additive genetic influences (A) only. MRT1/FRT1 = Mother/Father rating twin 1

Two types of developmental models were fitted to the data. Both models were fitted for A, C, and E influences separately (leaving the other influences specified using Cholesky decompositions), and evaluated against the fit of the saturated model. Firstly, the transmission, or simplex model (Boomsma and Molenaar 1987) was tested. In this model, covariances between the four times of measurement are accounted for by genetic and/or environmental influences that are carried over to subsequent time points. Apart from the influences from prior time points, an innovation term unique to each measurement occasion can affect the variance. The total variance at each time point is the sum of the innovation effect and the age-to-age carry-over effect. The second developmental model that was tested was the common factor model (Martin and Eaves 1977). In the common factor model, one underlying factor is specified, implying a continuous influence throughout the different time points. To also account for age specific variance, age specific influences are added to the model. If the common factor model fitted the data well, the significance of the age specific influences was tested subsequently, by dropping these effects from the model. Likewise, the fit of a model with only age specific influences (without an underlying common factor) was tested. Since the non-shared environmental component also includes measurement error, the age specific non-shared environmental influences on the unique phenotype were by definition specified in the model.
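The difference between the two developmental structures is easiest to see in the covariance matrices they imply across the four occasions. The following is a minimal sketch in plain Python, not the Mx specification used in the study; the function names and all parameter values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def simplex_cov(betas: Sequence[float],
                innovations: Sequence[float]) -> List[List[float]]:
    """Covariance matrix implied by a first-order simplex (transmission) model.

    betas[t] carries the factor from occasion t to occasion t + 1;
    innovations[t] is the variance of the new influence entering at occasion t.
    """
    n = len(innovations)
    if len(betas) != n - 1:
        raise ValueError("need one transmission path between each pair of occasions")
    # Variance at each occasion = carried-over variance + innovation variance.
    variances = [innovations[0]]
    for t in range(1, n):
        variances.append(betas[t - 1] ** 2 * variances[t - 1] + innovations[t])
    cov = [[0.0] * n for _ in range(n)]
    for s in range(n):
        carry = 1.0
        for t in range(s, n):
            if t > s:
                carry *= betas[t - 1]  # product of the paths from s to t
            cov[s][t] = cov[t][s] = variances[s] * carry
    return cov


def common_factor_cov(loadings: Sequence[float],
                      specifics: Sequence[float]) -> List[List[float]]:
    """Covariance matrix implied by one common factor plus age specific variances."""
    n = len(loadings)
    if len(specifics) != n:
        raise ValueError("need one age specific variance per occasion")
    return [[loadings[s] * loadings[t] + (specifics[s] if s == t else 0.0)
             for t in range(n)]
            for s in range(n)]
```

In the simplex sketch, covariances shrink as the interval between occasions grows, because each additional transmission path multiplies the carried-over effect. In the common factor sketch the covariance between two occasions depends only on their loadings, and with all age specific variances set to zero the implied correlation between any two occasions is 1.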

The model fit was evaluated using likelihood ratio tests and Akaike’s information criterion (AIC = χ2 − 2Δdf). The best fitting most parsimonious model was used to obtain estimates of genetic, shared environmental and non-shared environmental effects on the variances and covariances of WB. Genetic modeling was performed in Mx (Neale et al. 2006). In order to use all available data, including information of incomplete longitudinal data or data of which one of the parental ratings is missing, analyses were performed on the raw data.
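The two fit statistics named above can be computed as in the following sketch, which uses only the standard library. The helper names are hypothetical. The p-value routine relies on the closed-form upper tail of the chi-square distribution that holds for even degrees of freedom only; this is a convenience for the sketch, not a description of how Mx computes it.

```python
import math


def aic(chi2: float, delta_df: int) -> float:
    """AIC as defined in the text: chi-square minus twice the difference in df."""
    return chi2 - 2 * delta_df


def chi2_sf_even_df(chi2: float, df: int) -> float:
    """Upper-tail probability of a chi-square statistic (even df only).

    For df = 2k: P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires positive, even df")
    half = chi2 / 2.0
    term, total = 1.0, 1.0  # i = 0 term
    for i in range(1, df // 2):
        term *= half / i     # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total
```

As a check against values reported in the Results, a likelihood ratio test with χ2(20) = 25.133 gives a p-value of about .196 and an AIC of 25.133 − 40 = −14.867; a negative AIC favours the more parsimonious model.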

Results

Table 2 shows the means and standard deviations for maternal and paternal rated WB, as assessed with the WB syndrome scale of the CBCL. Problem scores are shown for boys and girls from the twin sample, and from a community sample (Verhulst et al. 1996; Van den Oord et al. 1995). Mother ratings of WB were significantly higher than father ratings at all ages and in both boys and girls (all tests significant at P < .001 level). Withdrawn behavioral problems were significantly higher in boys than in girls at age 3 (P < .001 for both raters), age 10 (mother rating P = .04; father rating P = .01) and age 12 years (mother rating P = .05; father rating P < .001), but not at age 7 years (mother rating P = .24; father rating P = .72). As the power to detect mean differences was high in our study, these mean differences seem only of practical importance at age 3. In all subsequent analyses, the means were specified per sex, zygosity, birth order and age. The ratings of WB in our twin sample were similar to the scores in the community sample at age 3, 7, and 10 years. At age 12, both the mean scores and the variances were higher in the community sample.

Table 2 Sample sizes, means and standard deviations (SD) for mother and father ratings of withdrawn behavior at age 3, 7, 10 and 12 for boys and girls separately

Although skewness was observed, we used the untransformed scores. A simulation study by Derks et al. (2004a) showed that a square root transformation of the data (the most commonly used transformation when data are censored) does not remove bias induced by non-normality of the data.

The phenotypic correlations across age are given in Table 3 for both the mother and the father ratings, separately for boys and girls. These correlations give an indication of the stability of WB over time. Correlations were around .30 between age 3 and later ages, and increased to .44–.65 between age 7, 10, and 12 years. This pattern was similar in both parental ratings and was observed in both boys and girls. The last two columns of Table 3 show the level of agreement between maternal and paternal ratings. Within-person-cross-rater correlations were similar across sex and age and varied between .51 and .62.

Table 3 Phenotypic correlations across age for mother and father ratings of withdrawn behavior in boys (above diagonal) and girls (below diagonal). The last 2 columns give cross-rater correlations (r m−f) for boys and girls as a function of age

The twin correlations and cross-twin-cross-age correlations are presented in Table 4 for both the mother and the father ratings. Inspection of the MZ and DZ twin correlation (on the diagonal) at the four time points gives a first impression of what factors influence variance in WB. At all ages, MZ correlations were higher than DZ correlations in both sexes, indicating that genetic factors play a role. Twin correlations in opposite sex twins were similar to the correlations in DZ same sex twins, suggesting that there are no qualitative sex differences in genes or shared environment influencing WB. This was confirmed in the genetic model. When the genetic correlation in opposite sex twins was estimated freely (within the biologically plausible range of 0–.5) it was estimated at .5, identical to the genetic correlation in same sex twins. Apart from age 7, the MZ correlations were not twice as high as the DZ correlations, suggesting that shared environmental factors also play a role. Inspection of the MZ and DZ cross-twin-cross-age correlations (off-diagonal in Table 4) can provide insight in what factors are important for the stability of WB over time. As compared to the DZ cross correlations, MZ cross correlations were slightly higher between age 3 and subsequent ages, and considerably higher between age 7 and later ages, indicating genetic effects on stability. However, MZ cross correlations were not twice as high, particularly between age 3 and later ages, suggesting that shared environmental effects on stability are also important.
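The reasoning in this paragraph (MZ correlations higher than DZ correlations point to genetic effects; MZ correlations less than twice the DZ correlations point to shared environment) corresponds to the classical Falconer approximations. The sketch below is only that rule of thumb, not the maximum likelihood estimation performed in Mx, and the correlations passed to it are hypothetical values rather than entries from Table 4.

```python
from typing import Dict


def falconer(r_mz: float, r_dz: float) -> Dict[str, float]:
    """Rough ACE proportions from MZ and DZ twin correlations.

    A = 2 * (rMZ - rDZ)   additive genetic variance
    C = 2 * rDZ - rMZ     shared environmental variance
    E = 1 - rMZ           non-shared environment (includes measurement error)
    """
    return {
        "A": 2.0 * (r_mz - r_dz),
        "C": 2.0 * r_dz - r_mz,
        "E": 1.0 - r_mz,
    }


# Hypothetical correlations, for illustration only.
estimates = falconer(r_mz=0.70, r_dz=0.45)
```

With these illustrative inputs the MZ correlation is less than twice the DZ correlation, so the C estimate is positive; when the MZ correlation is exactly twice the DZ correlation, C is zero and the twin resemblance is attributed to additive genetic effects alone.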

Table 4 Twin correlations, cross-twin-cross-age correlations and cross-twin-cross-rater correlations (within and across age) for mother and father ratings of withdrawn behavioral problems per zygosity group (MZM, DZM, DOSMF above diagonals; MZF, DZF, DOSFM below diagonals)

The far right panel of Table 4 displays the cross-twin-cross-rater correlations within age (diagonal) and across age (off-diagonal). These correlations yield a first impression on the importance of genes and environment on the common phenotypic variance, and thus the reliable trait variance. At all ages, the MZ cross-rater correlations were larger than the DZ cross-rater correlations, indicating genetic effects on the common phenotype. Apart from the correlations at age 7, the DZ correlations were higher than would be expected based on genetic influences alone, therefore shared environmental influences also seem to influence the common phenotypic variance. For all ages the cross-twin-cross-rater correlations were lower than the cross-twin-within-rater correlations. These differences indicate parental disagreement, and reveal the part of the total variance that is uniquely observed by a specific rater. Similar to the pattern of the within age twin correlations, the cross-twin-cross-rater correlations across age were higher in MZ than in DZ twins, indicating genetic effects. These correlations were less than twice as high in MZ twins compared to DZ twins, especially in girls, indicating that shared environmental influences also play a role in the stability of the common phenotype. The cross-twin-cross-rater-cross-age correlations were similar to the cross-twin-within-rater-cross-age correlations between age 3 and later ages. This pattern indicates that practically all the stability of WB was perceived by both raters. In later phases of childhood, the cross-twin-within-rater-cross-age correlations were larger than the cross-twin-cross-rater-cross-age correlations, indicating that both the common and the unique phenotype show considerable continuity over time.

Table 5 displays the results of the model fitting procedure. The relative importance of the different variance components was found to be significantly different in boys and girls (χ2(90) = 579.841, P < .001; model 2), indicating sex differences in heritability. The significance of all genetic and environmental components was tested by examining the deterioration of the model fit after each component was dropped from the saturated psychometric model. Shared environmental influences on the phenotype unique to the mother could be omitted from the model without a significant reduction in fit (χ2(20) = 25.133, P = .196; model 7). All other variance components were significant, for both the common phenotype (models 3, 6, and 9) and the unique phenotype (models 4, 5, and 8). Next, both transmission and common factor models were fitted to the different variance components. The saturated psychometric model indicated that the contribution of the rater specific variance to the stability of WB over time was negligible. Therefore, the developmental models were only tested for the variance common to both raters. When the developmental models were fitted to the common phenotype, the Cholesky decomposition was maintained for the rater specific factors. Both the genetic and the non-shared environmental influences appeared to be too complex to be captured by one of the developmental models; fitting these models to the common genetic and non-shared environmental influences resulted in a significant deterioration of the model fit (models 10–12 and 17–19). The shared environmental influences however, could be described by a common factor model without age specific influences (χ2(12) = .130, P = .999; model 15). A good fit of this developmental structure implies that there is a continuous influence of one underlying factor (in this case: the environment shared by the twins) over time. 
The best fitting most parsimonious model was a longitudinal psychometric model without shared environmental influences on the mother specific phenotype and including a common factor structure capturing the shared environmental influences on the common phenotype (model 20).

Table 5 Model fitting results for multivariate longitudinal analyses of withdrawn behavior

Table 6 presents the relative contributions of genetic, shared and non-shared environmental influences to the common phenotype (Ac, Cc, and Ec), and to the phenotype unique to either the mother or the father ratings (Au, Cu, Eu) in the best fitting model. Together these influences make up the total (common + unique) variances of WB at each age (diagonal) and the total covariance of WB over time (off-diagonal). The heritability of paternal and maternal rated behavior ranged between 50 and 66% in both sexes at age 3 and age 7, and decreased slightly to 38–59% at age 12 years. Shared environmental effects were of modest importance for the variance in both sexes and at all ages but were slightly more pronounced in girls. Non-shared environmental influences increased somewhat with age, and explained 22–28% of the variance at age 3 to 35–41% at age 12 years. This increase is mainly reflected in the rater specific non-shared environmental factors. At age 3 and age 7, the common and the unique phenotypic variance contributed equally to the total variance. At later ages in childhood, substantial extra information was added by the specific raters, especially by the mothers. A large proportion of the rater specific variance was due to genetic influences, and therefore reflects variance associated with heritable behavior rather than systematic or unsystematic error. No significant mother specific shared environmental influences were found, whereas about half of the shared environmental variance in father-rated WB was rater specific. These effects may be real father specific shared environmental effects, but may also be due to rater bias.
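The decomposition described in this paragraph can be written compactly. The notation below is a sketch under two assumptions that are not spelled out in the text: each symbol denotes the corresponding variance component for one rater at age t, and the reported heritability sums the common and the rater specific genetic variance.

```latex
\begin{align}
\operatorname{Var}(\mathrm{WB}_{t}) &= A_c + C_c + E_c + A_u + C_u + E_u \\
h^2_{t} &= \frac{A_c + A_u}{\operatorname{Var}(\mathrm{WB}_{t})}
\end{align}
```

The same six components, evaluated on the covariance between two ages rather than on the variance at one age, give the off-diagonal entries of Table 6.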

Table 6 Relative contributions of genetic (A), shared (C) and non-shared (E) environmental influences to the variance (diagonal) and covariances (off-diagonal) of withdrawn behavior for the common phenotype and the unique (rater specific) phenotype for boys (above diagonal) and girls (below diagonal)

The off-diagonals of Table 6 show the influences of common and rater specific genetic and environmental influences on the covariances. Common genetic influences were most important for explaining continuity of WB, and accounted on average for 53% ([69% + 60% + 58% + 36% + 66% + 24% + 75% + 60% + 60% + 41% + 59% + 29%]/12 = 53%) of the total covariance in girls and 64% of the covariance in boys. Rater specific genetic influences were mainly of importance for explaining stability across short time intervals (e.g. between age 3 and 7, 7 and 10, or 10 and 12 years) and accounted for 0–44% of the stability. Shared environmental influences common to both raters were important for the stability of WB in girls, and explained on average 14% of the behavioral continuity. In boys, these effects only explained about 4% of the covariance. Shared environmental influences unique to the father explained 6% of the stability in both sexes; these effects might reflect rater bias. Shared environmental influences on the mother specific phenotype were not significant. All in all, most of the shared environmental influences on stability were common to both raters. As this covariance reflects behavior that was perceived by both raters, these effects cannot be due to rater bias. The bulk of the non-shared environmental influences on the continuity of WB was perceived by both raters and thus cannot be attributed to systematic or unsystematic error.
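The average quoted for girls can be reproduced directly from the twelve percentages listed in the text. The snippet below is a minimal check of that arithmetic; the variable names are illustrative.

```python
# The twelve common genetic contributions (%) to the covariances in girls,
# exactly as listed in the text
common_genetic_pct_girls = [69, 60, 58, 36, 66, 24, 75, 60, 60, 41, 59, 29]

mean_pct = sum(common_genetic_pct_girls) / len(common_genetic_pct_girls)
print(f"{mean_pct:.1f}%")  # 637 / 12, prints 53.1%
```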

Lastly, the correlations between the genetic influences over time and the correlations between shared and non-shared environmental effects across development are displayed in Table 7. The genetic correlation of the common phenotype remained high across time, indicating that roughly the same genes influence the stability of the reliable WB phenotype between age 3 and age 12 years. The shared environmental effects on the stability of the common phenotype could be described by a common factor model and thus reflect one continuous influence over time. The non-shared environmental correlations and the rater specific genetic and shared environmental correlations over time were lower, indicating that these effects are more variable over time.
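The contrast between a single common factor and a Cholesky structure can be illustrated numerically. The sketch below uses hypothetical path coefficients, not estimates from this study, and the helper names `implied_covariance` and `to_correlations` are illustrative. It shows that a common factor without age specific influences implies correlations of exactly 1 across ages, whereas a full Cholesky decomposition allows correlations below 1.

```python
import math

def implied_covariance(paths):
    """Covariance matrix implied by path coefficients: paths * transpose(paths)."""
    n = len(paths)
    width = len(paths[0])
    return [[sum(paths[i][k] * paths[j][k] for k in range(width))
             for j in range(n)] for i in range(n)]

def to_correlations(cov):
    """Standardize a covariance matrix into a correlation matrix."""
    n = len(cov)
    return [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(n)]
            for i in range(n)]

# Hypothetical lower-triangular Cholesky paths for ages 3, 7, 10 and 12
cholesky_paths = [
    [0.7, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.0, 0.0],
    [0.4, 0.3, 0.3, 0.0],
    [0.3, 0.3, 0.2, 0.3],
]

# Hypothetical loadings on one common factor, with no age specific paths
common_factor_paths = [[0.30], [0.40], [0.35], [0.30]]

r_cholesky = to_correlations(implied_covariance(cholesky_paths))
r_common = to_correlations(implied_covariance(common_factor_paths))

for label, matrix in (("Cholesky", r_cholesky), ("Common factor", r_common)):
    print(label)
    for row in matrix:
        print("  " + "  ".join(f"{value:.2f}" for value in row))
```

Under the common factor structure every cross-age correlation equals 1, which is why a good fit of that model is read as one continuous influence over time.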

Table 7 Genetic, shared and non-shared environmental correlations across time for boys (above diagonal) and girls (below diagonal)

Discussion

We studied the etiology of WB in a large longitudinal sample of 3-, 7-, 10-, and 12-year-old twins and explored the genetic and environmental influences on stability of WB across childhood. Both maternal and paternal ratings of their children’s behavior were analyzed in order to identify the part of the phenotype that both raters agree on. In this way the variance and covariance in behavioral measurements could be distinguished from the effects of rater bias.

Individual differences in WB were largely influenced by genetic effects at all ages, in both boys (heritability estimates 50–66%) and girls (heritability estimates 38–64%). Shared environmental influences explained a small to modest proportion (0–24%) of the variance at all ages in both sexes, but were slightly more pronounced in girls. Non-shared environmental influences increased slightly with age and accounted for 22–28% of the variance at age 3 to 35–41% at age 12 years. This study is the first large scale study examining genetic and environmental influences on WB at multiple time points in childhood that is adequately powered to test for shared environmental effects. Previous family studies of childhood WB have yielded varying results. Two earlier twin studies in 3-year-olds from the NTR (Van den Oord et al. 1996; Derks et al. 2004b) found significant genetic effects, varying in magnitude from 45 to 74%. The most recent study with the largest sample size (Derks et al. 2004b) also found significant shared environmental effects that were more pronounced in girls than in boys. Two studies in middle to late childhood reported heritabilities of 40% (Schmitz et al. 1995) and 53% (Edelbrock et al. 1995). Two other family studies in middle to late childhood found little or no genetic effects (Kuo et al. 2004; Van der Valk et al. 1998; Schmitz et al. 1995). In the Taiwanese twin study (Kuo et al. 2004) the 95% confidence interval for additive genetic effects, however, was between 0% and 55%, indicating that the proportion of the variance explained by additive genetic influences may have been as large as 55%. On the other hand, the different results in the Taiwanese study compared to ours may also be due to cultural differences between the populations. Previous studies have suggested cultural effects on WB (Murad et al. 2003; Crijnen et al. 1999).

The longitudinal nature of the current study allowed for examination of the stability of WB throughout childhood. WB showed considerable continuity over time, with stability coefficients ranging from .23–.29 between age 3 and 12 years to .56–.65 for the shortest time interval. These correlations across age were similar to the correlations reported in other studies of childhood WB. In a small longitudinal twin sample, Schmitz et al. (1995) found a correlation of .33 between CBCL/2-3 scores and CBCL/4-18 scores of WB. In a large general population sample from the Netherlands (Verhulst et al. 1996), an 8-year stability coefficient of .36 was reported for the CBCL/4-18 WB scale. Smaller time intervals gave higher stability coefficients (.47 for a 6-year interval; .46 for a 4-year interval and .60 for a 2-year interval). Studying the development of behavioral problems over a broad time span necessitates the use of different instruments at different ages. At ages 7, 10 and 12 years the CBCL/4-18 was used, whilst the age-adjusted CBCL/2-3 was used when the twins were 3 years old. The use of different instruments could affect the longitudinal correlations in our study. However, the CBCL/2-3 and CBCL/4-18 were developed in parallel, and our 9-year stability coefficient (.23–.29) using the two different instruments is not much lower than the 8-year stability coefficient (.36) using the same instrument reported by Verhulst et al. (1996). Therefore we feel that it is likely that our longitudinal measures capture true development. Since both maternal and paternal ratings of WB were included in this study, we could also examine the extent to which the raters agree on the behavior of the child at the various ages. The cross rater correlations varied between .51 and .62 in both sexes and at all ages, indicating that rater agreement was similar in boys and girls and throughout childhood.

The stability in WB problems was largely accounted for by genetic effects, both in boys (on average 74%) and in girls (on average 65%). In girls, the shared environment was of moderate importance for continuity of WB; these influences explained 17% of the stability over time. In boys, these effects were less important, explaining about 7% of the stability. Interestingly, most of the shared environmental influences on stability were common to both raters, indicating that these effects are not due to rater bias. Non-shared environmental effects explained 18–19% of the stability over time in both sexes. Most of these effects were common to both raters.

In a longitudinal study into childhood internalizing behavior conducted in the same sample as the current report, Bartels et al. (2004b) found that stability in internalizing problems was accounted for by both genetic and shared environmental effects, and that these effects were roughly of the same importance for stability (43 vs. 47%). A multiple rater analysis including both maternal and paternal ratings (Bartels et al. 2007a) indicated similar influence of genes and shared environment on stability of internalizing problems. We found that genetic effects largely explained stability in WB (65% in girls; 74% in boys), whilst shared environmental influences were only of modest importance (7% in boys, 17% in girls). Our study suggests that, unlike the influences on general internalizing problems, shared environmental effects on stability in WB are only modest. Moreover, in line with findings in early childhood (Derks et al. 2004b; Eley et al. 2003), we found that genetic influences were stronger in boys than in girls. This is in contrast to studies into Anxious/Depressed behavior in which no meaningful sex differences in heritability estimates were found (Derks et al. 2004b; Boomsma et al. 2005).

The correlation between the genetic influences on the common phenotype over the course of development approaches unity. As the common phenotype represents the behavior that both raters agree upon, and can thus be considered a reliable phenotype, the high genetic correlations suggest that the same genes influence the continuity of WB over time. The shared environmental influences on the common phenotype could be modeled as a common factor, indicating that a stable persistent shared environmental influence is important for behavioral stability. Non-shared environmental correlations, on the other hand, vary over time in both boys and girls, suggesting that these effects are less persistent. Some studies suggest that parental behavior may moderate the stability of behavioral inhibition and shyness in young children. Inappropriate affectionate parenting (Park et al. 1997) or maternal over-control (Rubin et al. 2002) could increase stability of these behaviors. A study by Hastings and Rubin (1999) suggested that over-protective mothers may particularly show this behavior toward shy daughters. Such parenting practices (if they are employed toward both twins in the family) could explain why the shared environment has a stronger influence on WB in girls than in boys. If the parenting is child specific, this influence would be reflected in the non-shared environmental component. Other non-shared environmental influences, such as traumatic experiences, or the consequences of an accident or illness could also account for stability in WB.

The extent to which a child displays WB might be influenced by the composition of the family in which the child is raised. Our study focused on variation in WB in twins. It may be that twins show less WB compared to singletons, because they are raised with a sibling of the exact same age. On the other hand, if twins mainly interact with each other in childhood, twins may be more inhibited than non-twin siblings or singletons in interactions with people other than their co-twin. Comparing the mean scores of our twin sample with the scores in a Dutch community sample showed that withdrawn scores are similar at age 3, 7 and 10 years. At age 12, however, mean problem scores were higher in the community sample than in the twin sample. In a large twin-singleton comparison study (Pulkkinen et al. 2003), 12-year-old twins were reported to be more socially adaptable than non-twins, but no twin-singleton differences were found for social anxiety. A twin-singleton comparison of both maternal CBCL withdrawn ratings and laboratory assessment of inhibition in 5-year-olds yielded inconsistent results (DiLalla and Caraway 2004). According to laboratory ratings, twins were more inhibited than non-twins, whilst maternal ratings showed the opposite. These studies yield no explanation for the low mean withdrawn scores observed in our twin sample at age 12. To explore possible twin-singleton differences, future studies are needed.

The present study assessed withdrawn behavioral problems using the WB syndrome scale of the CBCL. The CBCL is a clinical rating scale; since the majority of the twins included in our study are typically developing children, mean problem scores were low, and the score distribution was significantly skewed. A data simulation study (Derks et al. 2004a) showed that skewness could lead to overestimation of non-shared environmental effects and underestimation of shared environmental effects. The same study indicated that a square root data transformation did not attenuate the observed bias. Inspection of the level of censoring in our data showed that respectively 41/44% (3-year-olds maternal/paternal ratings); 28/36% (7-year-olds); 32/40% (10-year-olds) and 39/46% (12-year-olds) of the children scored at base level. Data simulation showed that a similar degree of censoring (39%) resulted in an underestimation of the shared environmental influences of 8%, and an overestimation of the non-shared environment by 10% (Derks et al. 2004a). These findings should be kept in mind when interpreting the results of our present study. Future studies could avoid the problem of censoring by using a questionnaire assessing continuously distributed behaviors.

The current study highlights the importance of genetic effects on stability in WB. Unlike general internalizing problems, shared environmental influences have only a modest effect on the stability of WB throughout childhood. Results from longitudinal (Bongers et al. 2003) and cross-sectional studies (Achenbach 1991; Verhulst et al. 1996) indicate an increase in WB from childhood to adolescence. Future longitudinal research should extend the current study and investigate stability of WB into adolescence. Since childhood WB has been shown to be a predictor for anxiety disorders and depression later in life (Goodwin et al. 2004), insight into the developmental mechanisms underlying stability of WB into adolescence would be highly desirable. This is particularly true given the differences between the Anxious/Depressed and WB scales of the broad band internalizing scale. It appears from our studies that WB has a different genetic architecture and longitudinal course than Anxious/Depressed. While Anxious/Depressed has received a great deal of research attention, there has been relatively little investigation into the long term sequelae of WB. Our research provides evidence for the need for further work on this phenotype.