Recent years have seen an ever-increasing interest in bilingualism, especially in how the bilingual experience leads to cognitive changes (Kroll & Bialystok, 2013). One key discovery was that the two languages of a bilingual speaker are always active, even if the speaker is using only one language in a particular situation (Bijeljac-Babic, Biardeau, & Grainger, 1997; Colome, 2001; Spivey & Marian, 1999; Wu & Thierry, 2010). Therefore, bilingual experience requires that the speaker constantly monitors and controls language choice. The soaring interest in bilingualism, however, stems mainly from reports that the constant demand on control processes appears to lead to a bilingual cognitive control advantage (for a review see Bialystok et al., 2009). Possibly the most exciting finding was that the onset of dementia appears to be later for bilingual speakers than for monolingual speakers (Alladi et al., 2013; Bialystok, Craik, & Freedman, 2007), suggesting that bilingualism provides a cognitive reserve. Another reason for the interest in bilingualism is the fact that evidence regarding the bilingual cognitive advantage is inconclusive. Quite a large number of studies have reported a bilingual advantage in cognitive control. But recent developments in the field have suggested that the evidence concerning the bilingual advantage, especially in non-verbal inhibition tasks, is far from conclusive (de Bruin et al., 2015; Hilchey & Klein, 2011; Paap, 2014; Paap & Greenberg, 2013).

Evidence for and against bilingual advantage

Early studies investigating speakers’ cognitive control abilities have reported a bilingual advantage. These studies utilized the Simon task (Simon & Rudell, 1967), a task in which participants have to respond to a stimulus feature (e.g. respond according to the colour of a stimulus, left hand for red, and right hand for blue). The position of the stimulus could be either compatible or incompatible with the response hand. In this paradigm, responses are typically slower when stimulus position and response hand are incompatible. Bilingual speakers outperformed monolingual speakers in this task: they showed smaller stimulus congruency effects and/or overall faster response times (Bialystok, Craik et al., 2005; Bialystok, Craik, Klein, & Viswanathan, 2004; Bialystok, Martin, & Viswanathan, 2005). Using other variants of the Simon task (e.g. Spatial Stroop task, Bialystok, 2006) or paradigms that also require interference inhibition/conflict resolution (e.g. Flanker task, Eriksen & Eriksen, 1974), such a bilingual advantage over monolingual performance has been reported many times over the years since the first report (e.g. Costa, Hernandez, Costa-Faidella, & Sebastian-Galles, 2009; Costa, Hernandez, & Sebastian-Galles, 2008; Kapa & Colombo, 2013; Tao, Marzecova, Taft, Asanowicz, & Wodniecka, 2011).

But the evidence is not consistent. While a bilingual advantage is often found, this is not always the case (for a review see Hilchey & Klein, 2011). Most strikingly, in a recent comprehensive study, Paap and Greenberg (2013) did not find any evidence for the bilingual advantage. They conducted a series of non-verbal conflict tasks that had commonly been used in previous studies, including the Simon task, with monolingual and bilingual college students. Bilingual speakers were neither less vulnerable in the conflict condition nor faster overall. On the contrary, the only group difference pointed to a bilingual disadvantage.

These inconsistent findings have raised serious discussions about the nature of the bilingualism effect and thought-provoking debates on how the evidence should be evaluated (e.g. Kroll & Bialystok, 2013; Paap, 2014; Valian, 2015). Several reviews have drawn attention to the published and non-published null results regarding the bilingualism effect, arguing that one should neither overstate the significance of positive findings nor understate the meaning of null results. Others discussed methodological issues (Paap, Johnson, & Sawi, 2015), such as the appropriateness of using covariates to control for additional factors, and the convergent validity of tasks used to measure executive control, i.e. the lack of significant correlations between standard measures of inhibition. This series of discussions provided a great chance to reflect on the current status of this research field, and more importantly on where the field is heading. Bearing that in mind, one constructive way to enhance our understanding of the consequences of bilingualism is to understand which factors drive the divergence of results. The focus of this article is to contribute to the discussion from a novel perspective, i.e. the impact of seemingly trivial data trimming procedures.

Factors that potentially drive the inconsistency

In order to shed light on the reasons for the inconsistencies in the literature, it is important to understand how other factors might interact with speakers’ cognitive control ability. Quite a number of factors have been pointed out. It was evident from the beginning that bilingual research is challenging due to the diversity of speakers’ linguistic profiles and experiences (Bialystok, 2001; Grosjean, 1998). Depending on their life experience, one bilingual speaker can differ from another in many ways. Such heterogeneity in linguistic experience has been shown to lead to diverse cognitive consequences. The relevant dimensions include level of language proficiency (e.g. Mishra, Hilchey, Singh, & Klein, 2012), stage of second language acquisition (early bilingual vs. late bilingual, Kalia, Wilbourn, & Ghio, 2014), the degree of bilingualism (dominant vs. balanced bilingual, Goral, Campanelli, & Spiro, 2015), pattern of language use, experience with frequent language switching (Soveri, Rodriguez-Fornells, & Laine, 2011), the similarity between a bilingual speaker’s two languages (Coderre & van Heuven, 2014, but see Paap et al., 2015a) and multilingualism (Poarch & van Hell, 2012). In addition, there are factors that are closely related to bilingualism or that drive the different language experiences, and which at the same time are related to general cognitive performance. These include socioeconomic status (SES) (Morton & Harper, 2007), different cultural backgrounds (Yang, Yang, & Lust, 2011) and immigration status. Last but not least, there are factors that affect one’s general executive functioning and that probably affect monolingual and bilingual speakers in the same way, such as age, education, exercise, music training, active video game experience and others (for an overview see Valian, 2015).
These latter factors emphasize that cognitive control can be trained in other ways than by being bilingual, and that the populations of monolinguals and bilinguals can substantially overlap with regard to their performance in cognitive tasks. Such an overlap would also explain why the bilingualism effect has not been found in every study.

A new proposal

Another reason for the inconclusiveness of the literature might be the nature of the bilingual cognitive effect, which is better described as a mixture of effects rather than a single effect. For instance, Hilchey and Klein (2011) differentiated between two patterns of bilingual advantage, namely an inhibitory control advantage (i.e. bilinguals showing a reduced conflict effect), and an overall response speed advantage. This distinction suggests two routes through which bilingual experience could affect cognitive control: inhibitory control and attentional control. While enhanced inhibitory control ability should help to resolve conflict, resulting in a reduced conflict effect, enhanced attentional control should help to maintain task goals, leading to an overall speed advantage. Due to the general impurity of cognitive control tasks, a specific task does not provide a pure measure of a single control ability, but draws on many aspects of cognitive control. Some tasks might be more sensitive to participants’ inhibitory control ability and some to their attentional control ability. Therefore, depending on the task, one might observe a result pattern rather consistent with a bilingual inhibitory control advantage and/or bilingual attentional control advantage.

Tse and Altarriba (2012) utilized a novel analytical approach to the bilingual advantage effect. They performed an ex-Gaussian analysis to investigate response time distributions of bilingual speakers’ performance in a Colour Stroop task. Response times in cognitive experiments typically present themselves in a positively skewed distribution, which can be approximated by an ex-Gaussian distribution (Heathcote, Popiel, & Mewhort, 1991), i.e. a convolution of a Gaussian distribution (the mean of which is captured by the parameter μ) and an exponential distribution (the mean and standard deviation of which both equal the parameter τ). While the Gaussian component (the parameter μ) can be understood as the main body of the distribution, the exponential component (the parameter τ) captures the tail of the distribution, i.e. extremely slow responses. Tse and Altarriba (2012) suggested that the parameters of ex-Gaussian models of response time distributions in a Colour Stroop experiment are differentially sensitive to inhibitory control and attentional control. They argued that inhibitory control ability modulates the Gaussian component (μ) because differences in inhibitory control would affect the ease with which one resolves the competition between the two conflicting responses, leading to an overall shift of the response distribution in the conflict condition as compared to the no-conflict condition. In contrast, attentional control ability modulates the tail of the distribution (τ) because lapses of attention should lead to extremely long responses independent of condition. They found that more proficient speakers in L1/L2 showed a smaller interference effect in μ, and also smaller τ, independent of condition. This result is important, first because it proposes a way to disentangle the contribution of inhibitory and attentional control, and second because it suggests that what is usually treated as unwanted responses (τ) conveys important information.
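To make the roles of μ and τ concrete, the following minimal Python sketch draws response times from an ex-Gaussian distribution and recovers the parameters with a simple method-of-moments estimator. It is an illustration only and not the fitting procedure used in the studies cited above, which is not described here. The function names, the additional Gaussian standard deviation σ and all parameter values are assumptions chosen for the example.

```python
import random
import statistics


def sample_ex_gaussian(mu, sigma, tau, n, rng):
    """Draw n response times as Gaussian(mu, sigma) plus Exponential(mean tau)."""
    return [rng.gauss(mu, sigma) + rng.expovariate(1.0 / tau) for _ in range(n)]


def fit_ex_gaussian_moments(rts):
    """Estimate (mu, sigma, tau) by the method of moments.

    Relies on three identities of the ex-Gaussian distribution:
    mean = mu + tau, variance = sigma^2 + tau^2,
    third central moment = 2 * tau^3.
    """
    n = len(rts)
    mean = statistics.fmean(rts)
    variance = statistics.pvariance(rts, mean)
    third_moment = sum((rt - mean) ** 3 for rt in rts) / n
    if third_moment <= 0:
        raise ValueError("sample is not positively skewed; tau cannot be estimated")
    tau = (third_moment / 2.0) ** (1.0 / 3.0)
    # Guard against a slightly negative estimate in small or noisy samples.
    sigma = max(variance - tau ** 2, 0.0) ** 0.5
    mu = mean - tau
    return mu, sigma, tau


# Hypothetical parameters in milliseconds, chosen only for illustration.
demo_rng = random.Random(42)
demo_rts = sample_ex_gaussian(mu=400.0, sigma=40.0, tau=150.0, n=50000, rng=demo_rng)
print(fit_ex_gaussian_moments(demo_rts))
```

Moment estimators are sensitive to a few extreme values, so published analyses typically rely on maximum likelihood fitting instead; the sketch is meant only to show how the tail parameter τ is separated from the body of the distribution.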

One other study that has utilised the ex-Gaussian approach to examine response time distributions is Calabria, Hernandez, Martin, and Costa (2011). They re-analysed results from an attentional network task (ANT) originally reported in Costa et al. (2008) and Costa et al. (2009). In the original studies, they tested participants’ attentional networks using the ANT, which generates measurements for three attentional networks: an alerting network, an orienting network and an executive network. In their re-analysis, they focused on the executive network, or more specifically, on response times for trials with conflict stimuli versus trials with non-conflict stimuli (regardless of cue type). Results revealed an overall speed advantage for bilinguals in both the Gaussian (μ) and exponential (τ) components of the response distributions. Also, for an experiment that contained only 25 % incongruent trials, i.e. under high monitoring demands, monolinguals had significantly longer distribution tails (larger τ) in the incongruent compared to the congruent condition. This congruency effect was absent for bilinguals. These results suggest that the conflict effect in interference tasks is at least partially located in the tail of response distributions.

These observations lead to a new proposal that we investigated in the present study, namely that one reason for observing or not observing the bilingual advantage might be how the data were handled with regard to slow responses. In a traditional central tendency analysis, the typical procedure is to trim extreme responses, treating them as outliers. This is problematic if long responses are most sensitive to the experimental manipulation and/or group differences, as in studies where group/condition differences only emerged in the tail of response distributions (e.g. Epstein et al., 2011; Hervey et al., 2006). In other words, if a difference resides in the tail of the response distribution, by trimming the tail one also trims the potential to observe an effect. For instance, Leth-Steensen, Elbaz, and Douglas (2000) investigated response time distributions in a four-choice reaction time task for a group of children diagnosed with attention deficit hyperactivity disorder (ADHD) and a matched group of typically developing children. Using an ex-Gaussian analysis, they found that the two groups’ performances differed only in the τ parameter (the distribution tails), not in the main part of the response time distribution. The authors concluded that data trimming in this situation is equivalent to an artificial elimination of effects.
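The argument that trimming the tail also trims the effect can be illustrated with a small simulation. The Python sketch below generates two hypothetical groups whose response times share the same Gaussian component and differ only in the exponential tail, and then compares the difference in mean response time with and without a cut-off. All parameter values, the 700 ms cut-off and the helper names are assumptions made for this illustration and are not taken from any of the studies discussed.

```python
import random
import statistics


def simulate_rts(mu, sigma, tau, n, rng):
    """Simulate n ex-Gaussian response times (Gaussian body plus exponential tail)."""
    return [rng.gauss(mu, sigma) + rng.expovariate(1.0 / tau) for _ in range(n)]


def mean_rt(rts, cutoff_ms=None):
    """Mean response time, optionally after discarding responses above cutoff_ms."""
    kept = rts if cutoff_ms is None else [rt for rt in rts if rt <= cutoff_ms]
    return statistics.fmean(kept)


rng = random.Random(2016)
# Hypothetical groups: identical body (mu, sigma), different tail (tau).
short_tail_group = simulate_rts(mu=450.0, sigma=50.0, tau=100.0, n=20000, rng=rng)
long_tail_group = simulate_rts(mu=450.0, sigma=50.0, tau=200.0, n=20000, rng=rng)

diff_untrimmed = mean_rt(long_tail_group) - mean_rt(short_tail_group)
diff_trimmed = mean_rt(long_tail_group, 700.0) - mean_rt(short_tail_group, 700.0)

print(f"group difference without trimming: {diff_untrimmed:.1f} ms")
print(f"group difference after trimming at 700 ms: {diff_trimmed:.1f} ms")
```

Because the two simulated groups differ only in τ, most of the group difference in mean response time is carried by responses beyond the cut-off, and the difference shrinks substantially once those responses are removed.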

Meta-analysis

In what follows, we present a new perspective on previous studies that compared monolingual and bilingual performance in non-verbal inhibition tasks, investigating their data trimming procedures. If bilingual advantages are at least partly located in the tails of response distributions, i.e. in the slow responses as in Calabria et al. (2011), then one would expect that cutting off slow responses would reduce the chance of finding such advantages. We focused on three non-linguistic inhibition tasks that have been used most intensely to investigate the bilingual advantage: the Simon task, the Spatial Stroop task and the Flanker task (the latter sometimes embedded in an ANT). To ensure comparability, some variations of the tasks were excluded (e.g. the Simon task with a delay component in Martin-Rhee & Bialystok, 2008). We found 68 experiments taken from 33 articles (see Table 1). Within these 68 experiments, 23 (34 %) reported data trimming procedures, 4 reported excluding very short responses but did not report trimming of response distribution tails, 1 stated explicitly that long responses were not trimmed, and the remaining 40 did not mention whether or not they trimmed the data. For the purpose of our analyses, we treated the latter studies as ones that did not trim the data.

Table 1 Data trimming procedures and results of studies using non-verbal interference tasks in bilingual research. ANT = attentional network task

Two issues need to be pointed out. First, for studies that did trim the data, the practices differed. While some studies rejected long responses using standard deviations (e.g. 2.5 SD in Paap & Greenberg, 2013, which was approximately 700 ms after stimulus onset), others used a specific time cut-off (e.g. response times above 1700 ms). For comparison purposes, we translated cut-offs based on standard deviations into time cut-offs (using reported means and standard deviations). Second, for studies that did not trim the data, there was considerable variation in the maximum time allowed for making a response. For example, in Kousaie and Phillips (2012) 750 ms was allowed for making a response, meaning that in this particular design it was not possible to observe responses slower than 750 ms. This is equivalent to trimming the data at 750 ms. For these reasons, we focused on the maximum response times included in analyses (Fig. 1). For studies that did report a data trimming procedure, this was either the explicitly stated cut-off time or the response time (RT) calculated from the reported mean and SD. For studies that did not report a data trimming procedure, this was the maximum time allowed for making a response. For simplicity, we will refer to both types of trimming as the maximum RT allowance.
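The coding rule described in this paragraph can be written down as a short Python sketch. The function names are hypothetical, and the mean and SD in the first example are invented for illustration; they are not the values reported in any of the reviewed studies. The 750 ms response window in the second example is the one mentioned above for Kousaie and Phillips (2012).

```python
from typing import Optional


def sd_cutoff_to_ms(mean_rt_ms: float, sd_rt_ms: float, n_sd: float) -> float:
    """Translate an SD-based trimming rule into a time cut-off in milliseconds."""
    return mean_rt_ms + n_sd * sd_rt_ms


def max_rt_allowance(response_window_ms: Optional[float],
                     trim_cutoff_ms: Optional[float] = None) -> Optional[float]:
    """Return the maximum RT that could enter a study's analysis.

    If a trimming cut-off was reported, it defines the allowance; otherwise
    the maximum time allowed for a response is used. None means that neither
    piece of information is available.
    """
    if trim_cutoff_ms is not None:
        return trim_cutoff_ms
    return response_window_ms


# Hypothetical study trimming at mean + 2.5 SD, with invented mean and SD.
cutoff = sd_cutoff_to_ms(mean_rt_ms=600.0, sd_rt_ms=150.0, n_sd=2.5)
print(max_rt_allowance(response_window_ms=5000.0, trim_cutoff_ms=cutoff))

# Study without reported trimming and a 750 ms response window.
print(max_rt_allowance(response_window_ms=750.0))
```

Returning None when neither value is known mirrors the situation of studies that provide too little information to be coded.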

Fig. 1 Reported bilingual advantage for maximum response times included in the analyses in 58 non-verbal conflict experiments. * Cut-off time was estimated from the mean and SD

We acknowledge that data trimming and varying maximum time allowances are not the same. Different maximum time allowances, but not data trimming, can lead to different response strategies. For example, participants might not monitor their response accuracy as thoroughly in an experiment with a response allowance of 1000 ms compared to an experiment with a response allowance of 5000 ms and data trimming at 1000 ms. However, data trimming was not applied in any of the 16 studies in the long allowance group, which means that this cannot have affected our results.

Some of the studies reviewed were not included in further analysis because there was not adequate information about either the maximum time allowed or how the data were treated (the 3rd study in Bialystok, Martin, et al., 2005, all studies in Gathercole et al., 2014, Yang et al., 2011, and Mohades et al., 2014). Carlson and Meltzoff (2008) was not included either because they did not report RTs. This led to 58 studies being included in our statistical analysis.

Figure 1 shows the number of experiments that did and did not find a bilingual advantage for the various maximum times allowed. A clear pattern emerges: the shorter the maximum RT allowance, the less likely a bilingualism effect was to be found. In order to statistically test whether observing a bilingualism effect depends on the maximum time allowed, we grouped the experiments into three types of maximum time allowance: short allowance (below 1000 ms), medium allowance (1001–3000 ms) and long allowance (above 3001 ms; see Table 2). A chi-square test of independence confirmed that the result patterns differed for the three allowance groups, χ² (2, N = 58) = 21.99, P < .001. Thus, consistent with our hypothesis, studies with short RT allowance were more likely to report no group difference whereas studies with longer allowance were more likely to report a bilingualism effect.
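For readers who want to retrace this type of test, the following Python sketch computes a chi-square test of independence for a table of allowance group by outcome. The counts in the example are hypothetical and do not reproduce Table 2; the function names are likewise assumptions. The p-value function uses the closed-form upper tail of the chi-square distribution, which exists for 1 and 2 degrees of freedom, and is applied at the end to the test statistic reported above.

```python
import math


def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for a contingency table.

    table is a list of rows, e.g. one row per allowance group with the
    counts of experiments that did and did not find an advantage.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


def chi_square_p_value(statistic, df):
    """Upper-tail p-value, using the closed forms for df = 1 and df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(statistic / 2.0))
    if df == 2:
        return math.exp(-statistic / 2.0)
    raise ValueError("this sketch only supports df = 1 or df = 2")


# Hypothetical counts (advantage, no advantage) for short, medium, long groups.
hypothetical_table = [[5, 15], [12, 8], [18, 2]]
stat, df = chi_square_independence(hypothetical_table)
print(stat, df, chi_square_p_value(stat, df))

# p-value implied by the reported statistic of 21.99 with 2 degrees of freedom.
print(chi_square_p_value(21.99, 2))
```

With small expected cell counts, as in some of the age-specific comparisons, the chi-square approximation becomes less accurate, so an exact test may be preferable in practice.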

Table 2 Number of experiments with short, medium or long cut-offs / maximum allowed times and findings of bilingual advantage

In a meta-analysis of the bilingual advantage literature, Donnelly, Brooks, and Homer (2015) reported an effect of research laboratory. They suggested that this effect might be due to laboratory differences in, for instance, access to bilingual populations. In our long allowance group (>3001 ms), many studies are from the same research group. In fact, 14 out of 16 data points are from a research group around one particular author. In order to rule out that the current result is driven by a potential effect of laboratory, we excluded all data points from this laboratory. This left only two data points in the long allowance group. We therefore focussed on the short and medium allowance groups, both of which comprised contributions from different laboratories, meaning that the new result could not have been driven by any laboratory effect. This analysis confirmed our original result. The likelihood of observing a bilingualism effect depended on RT allowance, χ² (1, N = 42) = 12.14, P < .001, with the medium allowance group being more likely to observe a bilingualism effect than the short allowance group.

As noted in the introduction, the age of the participants might play a role in whether a bilingualism effect can be detected or not. More specifically, elderly bilinguals have been found to show the cognitive control advantage more consistently than other age groups. In addition, elderly participants and children respond on average much more slowly than adults. It might therefore be that our result arose because the group with long response allowances consisted of studies with very young and very old populations. We therefore tested the relationship between maximum response time allowance and the likelihood of finding a bilingualism effect in children, adults and elderly participants separately (see Table 2). A chi-square test of independence showed that while the relationship was not significant for children, χ² (2, N = 10) = 2.86, P = .24, it was significant for adults, χ² (2, N = 37) = 13.40, P = .001, and was marginally significant for elderly participants, χ² (2, N = 11) = 5.29, P = .071. The non-significant result for the children was likely driven by limited power due to small numbers, especially in the short allowance category. Nevertheless, descriptively 83 % (5 out of 6) of the studies in the long allowance group versus 67 % (2 out of 3) in the medium allowance group showed a bilingualism effect. This result pattern is consistent with the conclusion that studies with longer response allowances are more likely to show a bilingualism advantage. Similarly, in the elderly group, studies with longer allowances were descriptively more likely to report a bilingualism effect. The marginal effect again seems to be due to a small sample size. In summary, it appears that data trimming reduces the likelihood of observing a bilingualism effect regardless of age group. Also, the very robust finding for adults showed that our overall result is not driven by the results for children or older participants, who tend to respond much more slowly than adult participants.

Discussion

Based on the assumption that the effect of bilingualism on cognitive control resides at least partly in the tail of response distributions, we investigated a potential relationship between the data trimming procedures adopted and the likelihood of observing a bilingualism effect by reviewing 68 experiments reported in 33 articles that compared monolingual and bilingual speakers using non-verbal interference tasks. We found that studies that included longer responses in their analysis were more likely to report a bilingualism effect, either in the form of an overall response speed advantage or in the form of a reduced interference effect. This was also the case when the potential effect of laboratory was eliminated. This is consistent with earlier findings that the bilingualism effect emerges partially in the tail of response distributions (Abutalebi et al., 2015; Calabria et al., 2011; Tse & Altarriba, 2012). It appears that, when these prolonged responses were trimmed or not recorded, group differences might have also been eliminated.

A further analysis showed that the general result pattern was true for studies testing children, adults and elderly alike, even though significantly so only for adults, suggesting that data trimming might be problematic independent of the age of the participants and therefore independent of the average response times or the cognitive abilities of the participants. It also showed that the overall result pattern was not caused by studies with participant groups with very long response times. There is unfortunately an insufficient number of studies and/or information about the participants in studies published to date to test whether the data trimming procedure affects results independent of other factors that have been suggested to interact with the bilingual advantage (such as SES, immigrant status, language dominance, age of language acquisition, language usage, etc.). Future studies will need to disentangle the relevance of and the potential interplay of all these factors.

This review provides some practical implications for future endeavours. Employing a more fine-grained investigation approach might be useful, particularly in situations where effects are subtle. As pointed out, one fruitful alternative to traditional approaches of analysis is the ex-Gaussian analysis of response time distributions (Abutalebi et al., 2015; Calabria et al., 2011). For instance, Abutalebi et al. (2015) reported that a group of bilingual elderly showed an advantage in the τ component in the incongruent condition and the μ component in the congruent condition in a Flanker task, supporting the notion that bilingual speakers have enhanced attentional control. It has to be pointed out, though, that one needs to be cautious when interpreting the results from ex-Gaussian parameters in terms of underlying cognitive processes, because there is no one-to-one mapping between cognitive processes and those parameters (Matzke & Wagenmakers, 2009).

An alternative approach to the ex-Gaussian analysis is a delta plot analysis, which examines condition effects as a function of RT, and which has been used in analysing conflict tasks such as the Simon task and the Colour Stroop task (Ridderinkhof, 2002a, 2002b; Ridderinkhof, Wildenberg, Wijnen, & Burle, 2004). In delta plot analyses, responses for each condition are grouped into bins according to their RTs. Condition differences are then calculated and plotted for these bins. Delta plots prototypically have a positive slope due to effect sizes being larger for slower responses (Roelofs, Piai, & Garrido Rodriguez, 2011). If one condition requires more inhibition than the other, the difference between conditions does not linearly increase with RT, but is reduced instead, resulting in a ‘levelling-off’ of the delta plot at longer RTs. The levelling-off has been explained by inhibition building up slowly (Ridderinkhof, 2002a). The extent to which the plots level off can effectively reflect the amount of inhibition involved. The more strongly inhibition is applied, the smaller the slope of the plot. For example, Ridderinkhof (2002b) observed that, in a Simon task, delta plots for participants with a smaller Simon effect, who are believed to have more efficient inhibitory control, levelled off more than those for participants with a larger Simon effect. Because a delta plot analysis is most useful within the discussion of inhibitory control, this approach could be used to test whether inhibition was applied equally fast and to the same degree in monolinguals and bilinguals.
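The binning step of a delta plot analysis can be sketched in a few lines of Python. The sketch below assumes a single pooled set of response times per condition; in published analyses the bins are usually formed per participant and then averaged, which is omitted here for brevity. The function names, the number of bins and the example response times are hypothetical.

```python
import statistics


def quantile_bins(rts, n_bins):
    """Sort response times and split them into n_bins bins of nearly equal size."""
    if len(rts) < n_bins:
        raise ValueError("need at least one response time per bin")
    ordered = sorted(rts)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]


def delta_plot(congruent_rts, incongruent_rts, n_bins=4):
    """Return one (mean RT, incongruent minus congruent difference) pair per bin."""
    points = []
    bins = zip(quantile_bins(congruent_rts, n_bins),
               quantile_bins(incongruent_rts, n_bins))
    for congruent_bin, incongruent_bin in bins:
        congruent_mean = statistics.fmean(congruent_bin)
        incongruent_mean = statistics.fmean(incongruent_bin)
        points.append(((congruent_mean + incongruent_mean) / 2.0,
                       incongruent_mean - congruent_mean))
    return points


def delta_slopes(points):
    """Slopes between successive delta plot points; smaller late slopes
    correspond to the levelling-off described in the text."""
    return [(y2 - y1) / (x2 - x1)
            for (x1, y1), (x2, y2) in zip(points, points[1:])]


# Hypothetical response times in milliseconds.
congruent = [300, 320, 340, 360, 380, 400, 420, 440]
incongruent = [330, 360, 390, 420, 450, 480, 510, 540]
points = delta_plot(congruent, incongruent, n_bins=4)
print(points)
print(delta_slopes(points))
```

Comparing the slope of the last segment between monolingual and bilingual groups would be one way to operationalise the levelling-off, under the assumptions stated above.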

In conclusion, the current review adds to the discussion about the reality of the bilingual cognitive advantage by showing that seemingly insignificant details such as data trimming and the maximum time allowed for a response might have a significant influence on the findings. Therefore, it is important to take these into account, alongside other factors already pointed out in the literature, in order to fully judge the evidence for and against a bilingual cognitive effect. In addition, future studies are encouraged to report in detail how data were handled, and possibly to use more fine-grained analyses of RT data to shed light on the effect of bilingualism on speakers’ cognitive control abilities.