Randomized Controlled Trials
In an RCT conducted in the province of Manitoba, Canada, first-grade children in 197 schools were randomly assigned either to receive PAX GBG in the 2011–2012 school year or to a waitlist control condition that received PAX in the subsequent school year (Jiang et al. 2018). Notably, the province-wide prevention strategy funding the study did not include coaching because of equity policy issues. At follow-up at the end of the school year, students in the PAX classrooms scored significantly better than students in the waitlist condition on teacher ratings of the Strengths and Difficulties Questionnaire (SDQ; Goodman 2001) for all five SDQ subscales: Prosocial Behavior, Emotional Symptoms, Conduct Problems, Hyperactivity, and Peer Relationship Problems (Cohen’s d ranging from 0.11 to 0.23). Latent transition analysis indicated that children in the moderate- and high-risk categories on the SDQ were significantly more likely to transition to a lower risk category if they were in a PAX classroom (35.1% probability for the moderate-risk group and 44.7% for the high-risk group; Jiang et al. 2018). Although sensitivity analyses were conducted and found no bias, the large attrition of SDQ data at both pre-intervention (30.7%) and post-intervention (37.9%) is a weakness of this study. Relying only on ratings by teachers, who are liable to bias because they also deliver the intervention, is a further limitation. Other, more objective outcome measures would have added significantly to the strength of this study. Latent transition analysis is a suitable statistical model for analyzing and reporting outcomes of a universal prevention trial beyond traditional effect sizes, since most students do not have difficulties and are therefore unlikely to show significant improvements. Fidelity was determined by asking teachers at the end of the school year to fill out an implementation form on the extent of their PAX usage, a method that can be affected by recency bias and misestimation. The implementation form had an attrition rate of 50%, which limits the conclusions that can be drawn about fidelity.
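For readers less familiar with the effect sizes cited throughout this section, the following is a minimal sketch of Cohen's d as a standardized mean difference with a pooled standard deviation. It is an illustration only: the reviewed trials may have derived d from multilevel models rather than raw group means, the function name `cohens_d` is our own, and the scores are hypothetical rather than data from any study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treated, control):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treated), len(control)
    pooled_var = (
        (n1 - 1) * variance(treated) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treated) - mean(control)) / sqrt(pooled_var)


# Hypothetical scores, for illustration only (not data from any reviewed trial).
intervention_scores = [2, 4, 6]
control_scores = [1, 3, 5]
print(cohens_d(intervention_scores, control_scores))
```

By the conventional benchmarks, values around 0.2 are considered small, which places the d values of 0.11 to 0.23 reported for the Manitoba trial in the small range.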
In a second RCT, conducted in Baltimore, Ialongo et al. (2019) compared the impact of PAX with a “standard setting control condition” and with a condition that integrated PAX with the Promoting Alternative THinking Strategies program (PATHS; Greenberg et al. 1995). The PATHS program teaches children about their emotions and social skills relevant to emotional regulation. At the measurement conducted 6 months after the intervention began, PAX combined with PATHS (PATHS to PAX) showed a small effect (Cohen’s d = 0.08) on independent observations of off-task, disruptive, and/or aggressive behavior in the classroom. Using the Johnson-Neyman technique to identify regions of significance, the study found that students with elevated Total Problem Behavior at the pre-intervention measurement improved on this outcome if they received PAX alone (Cohen’s d = 0.05). Similarly, the PATHS to PAX intervention produced small but significant improvements on four teacher-rated variables, but only for students rated low on these variables at pretest: Readiness to Learn, Authority Acceptance, Social Competence, and Emotion Regulation (Cohen’s d ranging from 0.03 to 0.09), measured with the TOCA-R (Werthamer-Larsson et al. 1991) and the Social Health Profile (Conduct Problems Prevention Research Group 2010). However, there were no significant main effects for the PAX-only condition (without PATHS) compared with control schools. No significant difference was found between the two PAX conditions in the dose of GBG use during the school year or in the fidelity ratings.
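The Johnson-Neyman technique mentioned above can be sketched as follows. Assuming a regression of the outcome on treatment, baseline score, and their interaction, the treatment effect at baseline value x is b_treat + b_inter·x, and the region boundaries are the values of x at which that effect divided by its standard error equals the critical value. All coefficient and variance values below are hypothetical, the function names are our own, and a normal approximation is used in place of the t distribution; this is not the analysis code of Ialongo et al. (2019).

```python
from math import sqrt
from statistics import NormalDist


def conditional_effect(b_treat, b_inter, x):
    """Treatment effect at baseline (moderator) value x."""
    return b_treat + b_inter * x


def conditional_se(var_treat, var_inter, cov_ti, x):
    """Standard error of the conditional treatment effect at x."""
    return sqrt(var_treat + 2 * x * cov_ti + x * x * var_inter)


def johnson_neyman_bounds(b_treat, b_inter, var_treat, var_inter, cov_ti, alpha=0.05):
    """Baseline values where the conditional effect is exactly at the
    significance threshold (normal approximation). Returns [] if none exist."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Solve (b_treat + b_inter*x)^2 = z^2 * Var(effect at x), a quadratic in x.
    a = b_inter ** 2 - z ** 2 * var_inter
    b = 2 * (b_treat * b_inter - z ** 2 * cov_ti)
    c = b_treat ** 2 - z ** 2 * var_treat
    disc = b * b - 4 * a * c
    if a == 0 or disc < 0:
        return []
    root = sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])


# Hypothetical estimates, for illustration only.
est = dict(b_treat=0.10, b_inter=-0.30, var_treat=0.01, var_inter=0.01, cov_ti=0.0)
print(johnson_neyman_bounds(**est))
```

Which side of each boundary is the significant one has to be checked by evaluating the conditional effect and its standard error at a point inside and outside the interval.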
In a third RCT, conducted in Estonia (Streimann et al. 2017, 2015, 2019), PAX GBG was first implemented in ten first-grade classrooms to test its adaptation to the Estonian culture. It was then provided to 23 schools in a cluster-randomized trial with 23 schools allocated to a waitlist control. At the two-year follow-up, children receiving PAX had improved overall mental health (Cohen’s d = 0.39) and reduced conduct problems, peer problems, and hyperactivity (Cohen’s d ranging from 0.17 to 0.24) on the teacher-rated SDQ (Streimann et al. 2019). No improvement was found for SDQ ratings of prosocial behavior. Subgroup analyses were conducted by applying cut-off scores to the SDQ and calculating odds ratios, but no statistically significant differences were found. The two-year follow-up period is a strength, but this study would have been even stronger if it had reported independent data rather than only teacher ratings as the primary outcome. Fidelity assessment in this study was exemplary, combining mentor ratings and ratings by independent researchers. The dose measure was less optimal, relying on a retrospective estimate by teachers at the end of the school year rather than on data collected throughout the school year on the number of games and their duration.
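The cut-off approach to subgroup analysis described above reduces each child's SDQ score to "above" or "below" a threshold and compares the groups with an odds ratio. A minimal sketch follows, assuming a simple 2×2 table and a Wald confidence interval; the counts are hypothetical, the function name is our own, and the Estonian trial may have used a model-based estimate that accounts for clustering.

```python
from math import exp, log, sqrt


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a, b: intervention children above / below the cut-off
    c, d: control children above / below the cut-off
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only (not data from the trial).
print(odds_ratio_with_ci(a=10, b=90, c=20, d=80))
```

In this hypothetical table the point estimate is well below 1, yet the interval still includes 1, which illustrates how dichotomizing a rating scale can leave a subgroup comparison without statistical significance.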
A pilot cluster-randomized controlled trial was conducted in Northern Ireland (O’Keeffe et al. 2017) with 353 children, ages 6–8, at 15 schools (19 classrooms) in areas of high socio-economic disadvantage. Short-term outcomes (12 weeks) reported in a thesis by Mulgrew (2019) indicate significantly improved self-rated student self-regulation (Cohen’s d = 0.42) compared to a passive control group. No other outcomes (SDQ and TOCA) were statistically significant after controlling for clustering at the school level. As the study was a pilot trial, it suffered from low statistical power that was further affected by attrition, which left the intervention group with a sample almost twice as large as that of the control group at follow-up.
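The loss of power from clustering noted above can be illustrated with the standard design effect, 1 + (m − 1) × ICC, where m is the mean cluster size and ICC is the intraclass correlation. The sketch below uses the reported 353 children and 15 schools, but the ICC of 0.10 is purely an assumption for illustration, since no ICC is given in this review; the function names are our own.

```python
def design_effect(mean_cluster_size, icc):
    """Variance inflation due to clustering: 1 + (m - 1) * ICC."""
    return 1 + (mean_cluster_size - 1) * icc


def effective_sample_size(n, n_clusters, icc):
    """Number of independent observations the clustered sample is worth."""
    return n / design_effect(n / n_clusters, icc)


n_children, n_schools = 353, 15  # figures reported for the pilot trial
assumed_icc = 0.10  # hypothetical value, not reported in the review

print(effective_sample_size(n_children, n_schools, assumed_icc))
```

Under this assumed ICC, the clustered sample carries considerably less information than 353 independent children would, which is consistent with the low power described for the pilot.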
One RCT studied the use of PAX GBG in an afterschool setting with children in grades 2–5 (Smith et al. 2018), at 76 afterschool sites in diverse geographical areas. Sites were matched pairwise on relevant variables and randomized to receive either PAX GBG or “business-as-usual.” Independent observers, blind to condition, conducted on-site observations on two pre-intervention and two post-intervention occasions. Fidelity was also independently rated. Compared to control sites, PAX GBG sites were found to improve on observed belonging (γ = 0.23, p < 0.05) and on child self-reported hyperactivity measured with the SDQ (γ = 0.76, p < 0.05). Sites with high fidelity ratings also showed statistically significant effects on observer ratings of harshness/criticism (γ = -1.11, p < 0.05), supportive relations with adults (γ = 1.83, p < 0.01), appropriate structure (γ = 1.54, p < 0.01), and levels of engagement (γ = 1.82, p < 0.01). No subgroup analyses were conducted. The implementation procedure and outcomes, while outside the scope of this review section, were also studied in greater detail and published separately (Smith et al. 2014).
Summing up the RCT review, there is a fair amount of variation in outcome measures, statistical methods, and follow-up timespans across these studies. Several rely largely on teacher-reported evaluations of students’ progress, such as the SDQ and TOCA. While these are important data, the teachers are not independent raters (Pas & Bradshaw 2014). Increased use of independent behavior observations and objective data records, such as attendance and test scores, would strengthen study designs. The follow-up period varies from three months to two years, and the Estonian study (Streimann et al. 2019) is the only one reporting outcomes at multiple time points (one and two years), which is helpful in understanding development over time. Three different approaches were used to explore subgroup effects based on baseline measurements: Latent Transition Analysis (LTA), the Johnson-Neyman technique, and cut-off scores. Both the Johnson-Neyman technique and cut-off scores use rating scale sum scores, which can be problematic (McNeish & Wolf 2020), whereas LTA also takes the measurement model and measurement error into account. LTA is likely to be the most robust analysis method, but it requires large datasets to be appropriate. All studies report Cronbach’s alpha for their measures, but they do so for pre- and post-measurement together and for all groups combined. This may be described as standard practice, but it does not allow any insight into possible issues with measurement invariance between timepoints and groups. Most studies assessed both fidelity and dose delivered, which is praiseworthy, not least from an implementation perspective. However, the assessment methods could be stronger and more systematic. Retrospective ratings over long time periods are likely to be unreliable. Finally, two of the studies published study protocols prior to conducting their trial, which is a commendable practice (Streimann et al. 2017; O’Keeffe et al. 2017).
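As a first descriptive step toward the concern about pooled reliability raised above, Cronbach’s alpha could be reported separately for each combination of group and timepoint. The sketch below assumes item-level scores are available; the function names, the record layout, and the item scores are hypothetical. Comparing alphas across cells is not a formal test of measurement invariance, which would require a latent variable model, but large discrepancies would flag a possible problem.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one list of item scores per respondent."""
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def alpha_by_cell(records):
    """Alpha per (group, timepoint); records are (group, timepoint, item_scores)."""
    cells = {}
    for group, timepoint, items in records:
        cells.setdefault((group, timepoint), []).append(items)
    return {cell: cronbach_alpha(rows) for cell, rows in cells.items()}


# Hypothetical item scores, for illustration only.
records = [
    ("PAX", "pre", [1, 2, 1]), ("PAX", "pre", [2, 3, 3]), ("PAX", "pre", [3, 3, 2]),
    ("PAX", "post", [1, 1, 2]), ("PAX", "post", [2, 2, 2]), ("PAX", "post", [3, 2, 3]),
]
for cell, alpha in alpha_by_cell(records).items():
    print(cell, round(alpha, 2))
```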