Introduction

The Fundación Musical Simón Bolívar (FMSB) manages a Venezuelan network of núcleos (music centers), collectively known as El Sistema. Children and youth receive instrumental and choral training in classical music but also in traditional and popular genres. Training includes group and individual practice and regular group performances. Notwithstanding its potential effects on musical ability and appreciation, El Sistema emphasizes the importance of holistic child development and social inclusion (Abreu 2009). The instructional model has been internationally praised (Majno 2012; Wakin 2012) and replicated (El Sistema 2015) but subjected to little empirical study.

A vast literature reports associations between musical training and a variety of positive child developmental outcomes (e.g., Hallam 2010), with some studies showing that music experiences are beneficial for violence-exposed populations (e.g., Garrido et al. 2015). However, estimates largely rely on non-random variation in exposure to musical training, which may confound training effects with those of correlated and unobserved variables such as parent income or motivation. A small number of studies randomly assign exposure to musical training, usually focusing on cognitive ability or academic achievement outcomes. Thirty-six weeks of music training improved the general intelligence of Canadian 6 year-olds (Schellenberg 2004). Portuguese third-graders had improved reading abilities but not intelligence after 24 weeks of musical training (Moreno et al. 2009). Four weeks of computer-based listening activities improved verbal (but not spatial) ability of preschool children and performance on a go/no-go task of executive function (Moreno et al. 2011). Boston preschoolers exposed to 6 weeks of parent-accompanied music enrichment showed no differences in vocabulary or numerical discrimination skills relative to a control group (Mehr et al. 2013). A meta-analysis of 19 music intervention studies for children ages 3–12 years, which included a few experiments, found increased visual-spatial skills compared to control conditions (Hetland 2000).

Even less experimental research exists on outcomes beyond intelligence and academic achievement. In a South Korean experiment, 15 weeks of music training improved self-esteem and reduced aggressive behavior among highly aggressive 10–12 year-olds (Choi et al. 2010). Quasi-experimental evaluations of school-based music lessons found impacts on self-esteem in Australia (Rickard et al. 2013) and on school engagement in Finland (Eerola and Eerola 2014).

This study aims to evaluate the effects of a large-scale music program on child functioning in the context of high rates of violence exposure. The study makes four contributions to the literature on musical training and child outcomes. First, it presents the only experimental evidence on the effects of musical training in a developing country with high rates of violence. Second, it is the only experimental evaluation in any country of a scaled-up, government-implemented intervention. Third, it uses a considerably larger sample than prior experiments, as we randomly assigned 2529 guardians (with 2914 children) to early or delayed admission in 16 orchestra centers. Fourth, we measure a wider set of child outcomes than prior research, including self-regulation, behavior, prosocial skills and connections, and cognitive skills.

Study outcomes are based on a theory of change developed through consultation with FMSB administrators, site observations, and the extant literature. Although childhood music participation may contribute to long-term outcomes, such as secondary school graduation and workforce engagement, such outcomes are beyond the scope of this evaluation. Our theory of change thus specifies intermediate processes (short-term outcomes) that have been associated with long-term outcomes and are likely to be affected by short-term participation in El Sistema.

We hypothesized that short-term participation in orchestras or choruses may foster positive change in four child functioning domains: self-regulatory skills, behavior, prosocial skills and connections, and cognitive skills. Participation may increase self-regulation skills, or the modulation of emotion and behavior, as it requires dedicated practice as well as turn-taking, patience, and careful monitoring of one’s performance to synchronize playing and singing with others (McPherson and Renwick 2001). The collaborative nature of participation in an orchestra or chorus as well as the increased demands for self-regulation suggests that the experience may also increase prosocial behaviors and reduce negative conduct. It may also discourage individual risk-taking (such as playing out of sequence) and reward collective action. Music-making may also foster social bonding, group cohesion, and shared goals (Eerola and Eerola 2014; Kirschner and Tomasello 2010), which could, in turn, increase prosocial connections or engagement with peers and family. Finally, although the program was not designed with an explicit goal of improving cognitive skills, we hypothesized that short-term participation could improve working memory, visual-spatial skills, and processing speed, as these cognitive skills have been associated with musical training (Hetland 2000; Kraus et al. 2014; Schellenberg 2004).

We conducted pre-specified moderation analyses of program outcomes according to several socio-demographic variables. Given that research indicates that the child functioning domains included in our theory of change may be more or less malleable during different developmental periods (e.g., Berger 2011; Skoe and Kraus 2013), we examined program effects by age. As social inclusion is particularly important in contexts of economic inequality and exposure to violence, we additionally examined whether program effects varied by maternal education, a proxy for economic disadvantage, and by violence exposure for male and female subgroups. Violence exposure has deleterious effects on development and has been shown to disproportionately impact disadvantaged youth (Fowler et al. 2009). Rates of youth violence and homicide in Venezuela are among the highest worldwide (Munyo 2013; World Health Organization 2014). We anticipated that disadvantaged and/or violence-exposed youth would benefit the most from orchestra or chorus participation, as the program provides a free, developmental opportunity that includes adult supervision in a safe and accessible setting. Such opportunities are rare or are costly for disadvantaged youth or for those living in high-violence contexts (Roffman et al. 2001). Given that young males in Latin America are at increased risk for both being victimized and perpetrating violence (Munyo 2013), we examined violence exposure by gender in relation to program outcomes, as Venezuelan boys in particular may show increased benefit from structured, supervised engagement in a prosocial activity.

Method

Study Design and Experimental Sample

We initially assessed 24 music centers—the experimental sites—for potential inclusion in the experiment (see Fig. 1). In consultation with FMSB administrators, the sites were chosen because of likely excess demand by families in the 2012–2013 academic year and their dispersion across five states: Aragua, Bolívar, the Capital District (Caracas), Lara, and Miranda. Two sites were excluded because their directors declined to follow the experimental protocol, and six were excluded because of insufficient demand. In the remaining 16 sites, directors agreed to participate in the experiment and received training in the experimental protocol.

Fig. 1 Trial profile

We conducted a cluster-randomized, controlled trial in the 16 music centers between May 2012 and November 2013. Under normal circumstances, the music centers accept written applications from adult guardians on behalf of one or more children. All children of a particular guardian are admitted on a rolling basis until (and after) classes begin in September. As a condition of participating in the experiment, directors agreed to accept written applications from guardians between May 7 and July 8, 2012 (without informing guardians of admission decisions). Children were eligible to apply if they would be 6 to 14 years old on September 1, 2012. Sites received applications from 2603 guardians on behalf of 2999 children. By prior agreement, each site director could award early admission to a small number of applicants (no more than 5% of positions). In eight sites, 85 children (of 74 guardians) were admitted in this way.

Approximately half of the remaining 2529 guardians (representing 2914 children) were randomly offered early admission in September 2012, with the rest offered admission in September 2013. This constitutes the experimental sample. Baseline and follow-up data were collected on their outcomes, as further described in a later section.

Randomization and Masking

We randomized admissions to the guardians of children (the experimental clusters), rather than children. This maintained FMSB’s policy of jointly admitting siblings. It also reduced the likelihood of spillover effects between treated and untreated children within a household. On July 12, 2012, we assigned each guardian a random number between 0 and 1, drawn from a continuous uniform distribution. Within each site, guardians and their applicants were allocated to the treatment group (early admission in September 2012) in ascending order of the random number until the number of positions was exhausted. Remaining guardians and children were allocated to the control group (delayed admission in September 2013). The median site allocated 50% of its applicants to the treatment group (with a range of 39 to 67%).
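The allocation rule described above can be illustrated with a minimal sketch. It is not the code used in the study. The function name `allocate`, the input layout, and the seed are hypothetical. It also assumes that positions are counted in children and that allocation at a site stops at the first guardian whose children no longer fit; the text does not specify either detail.

```python
import random
from collections import defaultdict


def allocate(applicants, positions, seed=0):
    """Assign guardians (clusters) to treatment or control within each site.

    applicants: list of (site, guardian_id, n_children) tuples.
    positions:  dict mapping site -> number of early-admission places
                (assumed here to be counted in children).
    Returns a dict mapping (site, guardian_id) -> "treatment" or "control".
    """
    rng = random.Random(seed)
    by_site = defaultdict(list)
    for site, guardian, n_children in applicants:
        # One draw per guardian from a continuous uniform distribution on [0, 1).
        by_site[site].append((rng.random(), guardian, n_children))

    assignment = {}
    for site, draws in by_site.items():
        remaining = positions[site]
        exhausted = False
        # Ascending order of the random number, as in the protocol.
        for _, guardian, n_children in sorted(draws):
            if not exhausted and n_children <= remaining:
                assignment[(site, guardian)] = "treatment"
                remaining -= n_children
            else:
                # Assumption: once a family does not fit, everyone after it
                # at that site goes to the control group.
                exhausted = True
                assignment[(site, guardian)] = "control"
    return assignment
```

Because the draw is made per guardian rather than per child, siblings always share a treatment status, which is the property the design relies on.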

We prepared identically formatted rosters of each site’s treatment and control groups. Site directors were instructed to contact guardians during July and August using a consistent script to provide the early or delayed admission offer. Guardians in the treatment group were not obligated to accept the offer of early admission, and guardians in both groups were not prevented from seeking admission to a non-experimental site. Given the nature of the intervention, it was not possible to prevent guardians and children from learning their treatment status, although treatment status was not revealed to interviewers during baseline or follow-up data collection.

We calculated a minimum detectable effect size (MDES) of 0.106 standard deviations in the randomized sample. This assumes two-tailed hypothesis tests with α = 0.05, power of 80%, an intracluster correlation (ICC) of 0.3 (recalling that guardians are the experimental clusters), fixed site effects, and balanced allocation to the treatment and control groups (Schochet 2005). Even with an ICC of 0.6, the MDES is 0.109. The MDES is 0.094 when α = 0.1.
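As a rough check on these figures, the sketch below applies a simplified textbook MDES formula for a clustered design: the sum of the normal critical values for significance and power, multiplied by the standard error of the impact estimate inflated by a design effect of 1 + (m − 1) × ICC, where m is the mean number of children per guardian. This is an approximation under stated assumptions, not the exact Schochet (2005) calculation: it ignores the fixed site effects and any degrees-of-freedom adjustment, and the function name `mdes` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mdes(n_children, n_clusters, icc, alpha=0.05, power=0.80, share_treated=0.5):
    """Approximate minimum detectable effect size in standard deviation units."""
    z = NormalDist()
    # Two-tailed critical value plus the quantile corresponding to the power.
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    mean_cluster_size = n_children / n_clusters
    # Variance inflation from randomizing guardians rather than children.
    design_effect = 1 + (mean_cluster_size - 1) * icc
    variance = design_effect / (share_treated * (1 - share_treated) * n_children)
    return multiplier * sqrt(variance)


# Randomized sample: 2914 children of 2529 guardians.
for icc, alpha in [(0.3, 0.05), (0.6, 0.05), (0.3, 0.10)]:
    print(icc, alpha, round(mdes(2914, 2529, icc, alpha=alpha), 3))
```

With an average of only about 1.15 children per guardian, the design effect stays close to 1, which is why doubling the ICC barely moves the MDES.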

Intervention

Each site includes at least one orchestra and one choir. Sites are guided by a national curriculum (or “sequence”) that specifies compositions and arrangements of increasing complexity, although site directors can modify it at their discretion. During their initial year of participation, school-aged children typically receive instruction in both an instrument (usually the recorder and/or a percussion instrument) and in choral singing. In subsequent years, children select a string, wind, or percussion instrument. Teacher-led musical instruction typically occurs several times per week. The instruction may take place in a full ensemble, within instrument sections, or during individual lessons. Advanced students may additionally provide instruction to less advanced peers. All children, even beginners, are included in public, community performances. Children attend performances of their peers and, in some cases, of regional or national orchestras composed of advanced students. Musical instruction and instruments are free.

Data Collection

During the initial application period mentioned above, guardians provided written responses to a few demographic and socioeconomic questions. We additionally collected data from guardians and children on two further occasions. First, we conducted a baseline survey of outcome measures between October 2012 and February 2013 (the median survey date was November 19). Surveys were completed in households by children and adult caregivers (usually mothers). Trained enumerators blinded to treatment and control status used laptops and a data entry tool during household interviews. Second, we conducted a similarly structured follow-up survey between September and November 2013 (the median survey date was October 3).

Outcome Measures

We measured 26 primary outcome variables within the four domains of self-regulatory skills, behaviors, prosocial skills and connections, and cognitive skills. Self-regulation variables include self- and guardian-reported questionnaires as well as computerized games measuring future orientation (delay discount), response inhibition (go/no-go), attention functioning (Flanker task), and planning skills (Tower of London). Regarding behaviors, we focused on self- and guardian-reported measures of broad prosocial behavior, difficulties (Strengths and Difficulties Questionnaire), and aggression, along with a risk-taking task (risky driving game). Prosocial skills and connections included scale measures of self-esteem, empathy, and school and family engagement. Cognitive skills included working memory (digit recall), processing speed (symbol search), and visual-spatial reasoning. Appendix Table 6 provides information about measure scoring methodology, baseline internal consistency reliability, and references.

Program Moderators

Child age, gender, and maternal education were collected in the application form. Violence exposure (Pynoos et al. 1998) was collected in the baseline survey. Violence-exposed children responded affirmatively to at least one of the following: “in the city where you live, have you (1) been hit, shot or threatened? (2) seen someone get shot or killed? (3) seen a dead body (except for a funeral)? or (4) learned about the violent death or injury of a loved one?”

Statistical Analysis

For each outcome, we report a difference-in-differences estimate of the intention-to-treat (ITT) effect. We used a long-format dataset, in which the number of observations is equal to the number of valid observations on the outcome in the baseline and follow-up surveys, excluding observations missing either a baseline or follow-up measure. We estimated the following regression:

$$ {O}_{ijt}={\beta}_0+{\beta}_1{\mathrm{Post}}_t+{\beta}_2{\mathrm{Treatment}}_j*{\mathrm{Post}}_t+{\delta}_{ij}+{\varepsilon}_{ijt} $$

where \( {O}_{ijt} \) is the outcome of child i with guardian j at time t. \( {\mathrm{Treatment}}_j \) indicates whether the children of guardian j were ever assigned to the treatment group (versus control), while \( {\mathrm{Post}}_t \) indicates follow-up observations (versus baseline). The \( {\delta}_{ij} \) are fixed effects, or separate intercepts, for each child. (The fixed effects for music centers are absorbed by the child fixed effects and cannot be separately estimated.) The coefficient on the interaction term, \( {\beta}_2 \), is the ITT effect, that is, the effect of being offered the treatment. Robust standard errors are clustered by guardians. (All reported p values are similarly adjusted for clustering.)
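Because the panel has exactly two rounds and only children observed in both rounds are retained, the child fixed-effects estimate of \( {\beta}_2 \) equals the difference between the treatment and control groups in the mean change from baseline to follow-up. The sketch below uses that equivalence. It is an illustration rather than the authors' Stata code: the function name `itt_did` and the data layout are hypothetical, and the guardian-clustered standard error omits the finite-sample correction that statistical packages typically apply, so it would differ slightly from packaged output.

```python
from collections import defaultdict
from math import sqrt


def itt_did(rows):
    """Difference-in-differences ITT estimate with guardian-clustered SE.

    rows: list of (guardian_id, treated, baseline, followup), one per child,
          where treated is 1 or 0 and is constant within a guardian.
    Returns (beta2, clustered_standard_error).
    """
    # First-differencing removes the child fixed effect delta_ij.
    changes = {0: [], 1: []}
    for _, treated, baseline, followup in rows:
        changes[treated].append(followup - baseline)
    mean_change = {t: sum(v) / len(v) for t, v in changes.items()}
    beta2 = mean_change[1] - mean_change[0]

    # Sum residuals within each guardian (the experimental cluster).
    cluster_resid = defaultdict(float)
    for guardian, treated, baseline, followup in rows:
        cluster_resid[(treated, guardian)] += (followup - baseline) - mean_change[treated]

    # Sandwich variance for a regression of the change on a constant and a
    # treatment dummy; no small-sample correction (an assumption of this sketch).
    variance = sum(
        resid ** 2 / len(changes[treated]) ** 2
        for (treated, _), resid in cluster_resid.items()
    )
    return beta2, sqrt(variance)
```

Summing residuals within a guardian before squaring is what allows siblings' outcomes to be correlated without overstating precision.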

Due to the large number of hypothesis tests, we control the k familywise error rate (k-FWE), or the probability of making k or more false rejections. While we follow the Romano-Wolf procedure (Romano et al. 2008; Romano and Wolf 2005), we do not apply the traditional familywise error rate (the k-FWE with k = 1), given that this has been shown to be too conservative (Delattre and Roquain 2015). We set k = h/2, where h is the number of outcomes within a domain (when h/2 is not an integer, k is rounded down; Guo et al. 2014). There is a tradeoff between screening for false discoveries too conservatively and too liberally. In addition to reporting standard p values, we report statistical significance at 10% post-adjustment. The adjustment was implemented through a Stata 13.0 bootstrap procedure which establishes an empirical distribution of adjusted critical t values. A Matlab algorithm uses this distribution to determine the level of significance according to the Romano-Wolf procedure.
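To make the notion of k-FWE control concrete, the sketch below applies a much simpler procedure than the one used in this study: the single-step generalized Bonferroni rule, which rejects any hypothesis whose p value is at most kα/h and controls the probability of k or more false rejections at level α. It is not the bootstrap-based Romano-Wolf step-down procedure and is generally more conservative. The function name and the p values are hypothetical, and the floor of k at 1 for a single-outcome domain is an assumption of the sketch.

```python
def generalized_bonferroni(p_values, alpha=0.10):
    """Single-step k-FWE control within one outcome domain.

    p_values: unadjusted p values for the h outcomes in the domain.
    Returns (k, rejections), where rejections[i] is True if outcome i
    remains significant after adjustment.
    """
    h = len(p_values)
    # k = h/2 rounded down, as in the text; at least 1 (assumption for h = 1).
    k = max(1, h // 2)
    threshold = k * alpha / h
    return k, [p <= threshold for p in p_values]


# Hypothetical domain with six outcomes: k = 3, threshold = 0.05.
k, rejected = generalized_bonferroni([0.001, 0.03, 0.04, 0.2, 0.5, 0.8])
```

Setting k = 1 recovers the ordinary Bonferroni threshold α/h, which shows how tolerating a few false rejections loosens the per-test cutoff.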

Results

Sample Description

Table 1 reports means of demographic and socioeconomic characteristics of households and children in the treatment and control groups. Guardians reported the variables at the time of application from May to July 2012. The table reports adjusted treatment-control differences that control for site-specific dummy variables because the probability of treatment was not equal across sites. As expected, the treatment-control differences in child and household variables are small and not statistically different from zero. Table 2 further shows means for the 26 outcome measures collected during the baseline survey. Each variable is standardized to a z-score, using the baseline mean and standard deviation (follow-up outcomes are standardized using the same values). None of the differences are larger than 10% of a standard deviation.
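The standardization described above can be sketched as follows. Both survey rounds are scaled by the baseline mean and standard deviation, so a follow-up z-score is read as distance from the baseline average in baseline standard deviation units. The function name is hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, as the text does not say which version was used.

```python
from statistics import mean, stdev


def standardize_to_baseline(baseline, followup):
    """Convert raw scores from both rounds to z-scores on the baseline scale."""
    mu = mean(baseline)
    sd = stdev(baseline)  # sample standard deviation (assumption)
    z_baseline = [(x - mu) / sd for x in baseline]
    # Follow-up scores reuse the baseline mean and SD, not their own.
    z_followup = [(x - mu) / sd for x in followup]
    return z_baseline, z_followup


z0, z1 = standardize_to_baseline([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Reusing the baseline parameters keeps the two rounds on a common scale, so changes over time are not removed by re-centering the follow-up data.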

Table 1 Baseline demographic and socioeconomic characteristics
Table 2 Definitions of outcome measures and baseline means

How representative is the experimental sample of the population of similarly aged Venezuelan children in 2012? To partially assess this, we compared the experimental sample to a representative sample from the Encuesta de Hogares por Muestreo, collected in the first half of 2012. Among 6 to 14 year-olds in the five states represented in the experiment, 46.5% reside in a household with an income per capita below a poverty line of US$4 per day (or 678 Bs.F. per month). Using a logit specification, we regressed a dummy variable indicating poverty on the child and household variables in Table 1. We used parameter estimates and application form data to predict each experimental child’s probability of residing in a poor household. The sample mean (16.7%) is an estimate of the poverty rate in the experimental sample (Tarozzi and Deaton 2009). We conclude that experimental children are less poor, on average, than all 6 to 14 year-olds residing in the same states. Data limitations prevent us from assessing whether the experimental sample is representative of all applicants to FMSB music centers.

Attrition

Figure 1 shows participant flow throughout the study. At baseline, 88.1% of treatment group children and 88.2% of controls completed the survey. At follow-up, 74.5% of the treatment group and 77.3% of the control group completed the survey. To examine any systematic differences in attrition by experimental group, Table 3 reports means of demographic and socioeconomic variables from the application form (as in Table 1), but limited to the sample of children who answered at least one question in both the baseline and follow-up surveys. The differences between treatment and control participants are small and not statistically different from zero. Given the possibility of imbalance in unobserved variables, our preferred estimates include child fixed effects that control for any time-invariant unobserved variables.

Table 3 Baseline demographic and socioeconomic characteristics for children who answered at least one question in both the baseline and follow-up surveys

Table 1 in the supplemental online appendix further compares treatment and control group response rates for each of the 26 outcome variables, conditional on participation. Response is defined as completing a scale or task, and a child is defined as participating if she has non-missing data for both rounds. Response was nearly universal for the scale measures (over 98%), as for these outcomes it was easier to ensure complete answers by repeating a question when necessary. Task outcomes had lower response rates, ranging from 77 to 89%. The lowest response rate was for the Tower of London task, because the cumulative test stopped after failure to complete two consecutive trials so that the child's difficulty level was not exceeded. This table shows that response rates, conditional on participation, did not differ significantly between the treatment and control groups except in the case of one task-based instrument. As this is not among the indicators for which we find an impact, we are reassured that neither attrition nor non-response poses a threat to the internal validity of our analysis.

Uptake and Implementation

The treatment period was characterized by important political events, including presidential elections on October 7, 2012, and the death of President Hugo Chavez on March 5, 2013. A retrospective qualitative survey collected from directors of the music centers in October 2015 inquired about implementation challenges in recent school years. Only one of the 16 directors reported that implementation in the 2012–2013 school year was temporarily disrupted by school closures related to election activities. Seven directors reported that implementation in 2012–2013 was normal or better compared to other years, and four reported that implementation was worse because of problems with crowding. At worst, interruptions would reduce the dosage of the intervention and dilute the impacts measured. In no case, however, did implementation challenges threaten the internal validity of the evaluation design, as treatment was assigned randomly at the guardian level, and there is no reason to believe that political events may have differentially affected the treatment and control group.

Guardian and child enrollment and attendance were voluntary. In the treatment and control groups, respectively, 69 and 15% of children participated in a music center during the first semester (September to December 2012); 58 and 14% participated during the second semester (January to June 2013); and 56 and 11% participated during both semesters. On average, treatment group children participated 0.98 semesters more than control group children (p < 0.001), controlling for site-specific fixed effects and clustering standard errors by guardians. The estimates are based on those who completed the follow-up survey and provided retrospective participation data. Consistent with the design of the experiment, members of the treatment group were not required to enroll, while members of the control group were not prohibited from enrolling in a music center not included in the experiment. The ITT approach addresses crossover by comparing children who were offered early or delayed admission.

The data suggest that music centers offered instruction consistent with El Sistema guidelines. Among first-semester participants, 35% received instruction 5 or more days per week and 47% between 2 and 4 days per week (39 and 45% among second-semester participants). During the first semester, 63% received choral training; 67% played an instrument (a recorder, in 6 of 10 cases); and 40% did both. During the second semester, 57% received choral training, 73% played an instrument, and 37% did both. The majority of instruction in the first year occurred in large-group sessions. In two semesters, respectively, only 15 and 18% of participants received section-specific instruction, while 16 and 17% received individual lessons. About half of participants gave a public performance to parents or the public.

Impacts

Table 4 reports ITT estimates in the full sample of applicants. Two outcomes are statistically significant at 10% after controlling the k-FWE (as described above). The offer of early admission to a site increased child-reported self-control by 0.10 standard deviations, compared with delayed admission. It reduced child-reported behavioral difficulties by 0.08 standard deviations. No significant effects were found for outcomes in other domains.

Table 4 ITT estimates in full sample

Table 5 reports ITT estimates for moderation analyses within maternal education, gender-by-violence, and age subgroups. It reports estimates that remain statistically significant at 10% after controlling the k-FWE (full results are available online). The effect sizes for children with less-educated mothers are approximately 50% higher for self-control and behavioral difficulties than the full-sample estimates, but there were no significant results in these outcomes for children with more-educated mothers. In the latter group, there were also two unexpectedly negative effects on guardian-reported measures of prosocial skills and connections.

Table 5 ITT estimates in subgroups

Overall, 46% of males and 42% of females were exposed to violence. The effect sizes for child-reported self-control and behavioral difficulties are more than doubled among boys exposed to violence. An offer of early admission also reduced aggressive behavior in this subgroup by 24% of a standard deviation. The results for girls are less consistent with the theory of change. We find unexpectedly negative effects on empathy (among girls exposed to violence) and on working memory and prosocial behavior (among girls not exposed to violence). The full-sample effect on self-control is observed among younger children (6 to 9 year-olds) but not older children (10 to 14). This finding is consistent with previous research indicating that executive functions and self-regulation skills in particular are more malleable at younger ages (Berger 2011). Finally, there is an unexpectedly negative effect on the go/no-go task among older children.

Discussion

After 1 year, the early-admission group had higher self-control and fewer behavioral difficulties, based on child reports. Larger effects were found for children with less-educated mothers, which may reflect the ability of more-educated mothers to finance alternative activities. The effects were concentrated among boys, especially those exposed to violence. The latter group also showed reductions in self-reported aggressive behavior. It bears emphasis that these are ITT estimates of the effect of offering the opportunity to enroll, rather than the effect on those actually treated. On average, the early-admission group actually attended about one semester more than the delayed-admission group.

We did not find any full-sample effects on cognitive skills—adding to the mixed findings from wealthier countries (Mehr et al. 2013; Moreno et al. 2009, 2011; Schellenberg 2004)—or on prosocial skills and connections. Unexpectedly, we found few effects for girls overall, with some unexpected decreases in different skill domains. While it could be that males and females engage in different aspects of the program in different ways, further study of El Sistema and the Venezuelan context is necessary to better understand these results.

The findings suggest that exposure to El Sistema might serve an important role as a preventive strategy to promote positive outcomes among disadvantaged children. The subgroup results are especially relevant given research showing that, relative to their female and higher-income peers, male youth are at increased risk for poor developmental outcomes when exposed to disadvantaged or high-violence contexts (Anderson 2008; Moffitt et al. 2011). That El Sistema is particularly effective for vulnerable males is promising, especially as many interventions have been found to be relatively less effective for this group or even to impose adverse effects (Kling et al. 2005; Osypuk et al. 2012; Rodríguez-Planas 2012). While it is possible that group music participation could mitigate the effects of violence exposure for males in particular, experimental studies of this gender effect and the potential benefits of music programs on violence-exposed populations are needed (e.g., Garrido et al. 2015).

Nonetheless, this study highlights the challenges of targeting interventions towards vulnerable groups of children in the context of a voluntary social program. As noted above, just over half of the early-admission group participated in a music center for two semesters. In light of these results, it may be desirable to consider additional targeting and retention mechanisms beyond free tuition and instruments (e.g., travel vouchers, scholarships, or other inducements).

Our lack of findings in cognitive and prosocial skills and connections could be due to the short duration of this evaluation, as changes in these domains may take longer than 1 year to emerge. For example, Kraus et al. (2014) found that changes in neural development took place after 2, but not 1, years of music exposure, suggesting that program duration is important to explore in subsequent studies of El Sistema. A benefit of continued study—and a limitation of the present results—is that many children were only exposed to introductory training in a single instrument (the recorder) or none at all. After the first year, children select a single string, wind, or percussion instrument and are exposed to individual and section training (in addition to group training), which could also facilitate positive change in these skill domains.

Although program impacts were concentrated in a few outcomes, these are increasingly identified as critical for individual wellbeing. The long-term economic returns to socio-emotional skills and behaviors can be as large as, if not larger than, the returns to cognitive skills (Cunha and Heckman 2007, 2010). For example, Daly et al. (2015) found that self-control measured at age 11 is predictive of subsequent unemployment at older ages. This is plausibly because the skills that allow children to control their emotions and behavior during school age are closely related to skills used to secure and maintain good jobs and healthy relationships.

Limitations and Future Directions

There are some limitations of this study that have important implications for future research. Findings are limited to self- and guardian-reported outcomes, which may introduce bias associated with scale measures. Future research should utilize additional reporters (e.g., music directors, peers, or school teachers). In the context of longer-run impacts, it is also necessary to examine possible fadeout of impacts and whether cognitive impacts emerge after a longer time frame. Although the evaluation was of a fully scaled program, the generalizability of these results is potentially hampered by a focus on a modest number of music centers. To facilitate experimental assignment, the evaluation was limited to over-subscribed music centers, which may have implicitly favored better known and/or higher-quality sites. This selection of sites did not affect the internal validity of the results, and assessing the effect on external validity would require additional information on the other music centers. Finally, although we examined program moderators, how characteristics such as gender and exposure to violence contribute to program outcomes is unknown; as such, further longitudinal and qualitative research is needed.

Despite these limitations, this study, to our knowledge, presents the only experimental evidence on the effects of musical training in a developing country. Previous research has not adequately addressed causality, as studies have primarily been correlational with the few experiments failing to correct for classroom or school-level clustering. The experiment is also notable for its analysis of a scaled-up, government-implemented musical training intervention. It can thus be considered an effectiveness trial rather than an efficacy trial, perhaps with increased generalizability to the growing body of developing-country policies inspired by El Sistema.