The advance of eye-tracking technology has provided an unprecedented opportunity to understand how the reading process unfolds in space and time. One of the advantages of this method is the possibility of precisely manipulating what participants see in real time while they are reading sentences on the screen. The use of such techniques has shown that readers not only process the currently fixated word, but also benefit from previewing the upcoming word in parafoveal vision (Rayner, 1998). This so-called preview benefit effect is measured as shorter fixation durations on a word for which correct preview information was available during the preceding fixation.

The preview benefit effect is arguably one of the most robust and least controversial findings in the literature, and it has inspired many subsequent experiments as well as models of reading (Schotter, Angele, & Rayner, 2012). A particular issue highlighted by the preview benefit effect is the distinction between gaze location and attentional focus during reading. Although gaze location and attentional focus are usually identical in single-word recognition tasks, readers appear to routinely attend to the upcoming word as well. Many experiments have been dedicated to determining exactly which properties of an upcoming word readers can process while they are still fixating on the preceding word. It has been shown that readers can preprocess the orthographic, phonological, morphological, and possibly semantic properties of the upcoming word (for a review, see Schotter et al., 2012).

However, despite four decades of research on the topic, the size of this effect has not been systematically quantified for different experimental conditions and dependent measures. Even though many estimates of the size of this effect have been volunteered in the literature, no systematic statistical analysis has been undertaken. As a result, it has not been possible to precisely evaluate the predictions of theories and the simulations of computational models of eye-movement control during reading (Engbert, Nuthmann, Richter, & Kliegl, 2005; Reichle, Pollatsek, Fisher, & Rayner, 1998; Reilly & Radach, 2006; see also Trukenbrod & Engbert, 2014). Also, although the existence of the preview benefit effect is accepted almost universally, there is currently no consensus as to whether readers also benefit from previewing the second word to the right of the current fixation. Moreover, the question of whether the preview benefit differs in languages with different writing systems, such as Western versus Chinese orthography, has not been answered satisfactorily. Finally, it is currently not clear whether and to what extent the size of the preview benefit is confounded by the interference effects associated with seeing a masked word in parafoveal vision.

The purpose of the present article is twofold. First, we undertake a critical evaluation of the literature that highlights unresolved questions related to parafoveal processing during reading. In this respect, the article does not attempt a comprehensive review of the literature. Second, and more importantly, we present a meta-analytic investigation of parafoveal preview effects during reading, in order to answer the questions raised in this introduction.

The perceptual span during reading

When reading a sentence, the eyes do not move smoothly, but instead alternate between quick, jump-like movements, known as saccades, and short periods in which they are relatively stable. These periods of stability, known as fixations, are crucial for the word recognition process. When a given word is fixated, readers view it with their fovea, the central 2° of visual angle, where visual acuity is at its highest. However, it is now well known that readers also obtain information from the parafovea, which extends up to 5° of visual angle (Rayner, 1998). An illustration of this is presented in Fig. 1a.

Fig. 1

(a) Illustration of the perceptual span during reading. (b) Demonstration of the gaze-contingent boundary paradigm (Rayner, 1975) for studying N + 1 and N + 2 preview effects. The invisible boundary is represented by a red vertical line. Panel B illustrates only the invalid preview condition; during valid preview, no change occurs on the screen

One paradigm that has been widely used to study parafoveal processing during reading is the gaze-contingent boundary technique (Rayner, 1975). In this technique, an invisible boundary is placed before the target word, to manipulate what participants see before the target word is fixated in foveal vision (see Fig. 1b for an illustration). In the control condition, participants view the actual target word in parafoveal vision (this will be referred to as a valid preview from now on). In the experimental condition, the target word is masked (e.g., with a string of Xs), and participants do not acquire any useful information from it prior to crossing the boundary (this will be referred to as an invalid preview). Once the eyes cross the invisible boundary, the parafoveal mask is permanently replaced with the target word (during valid preview, no physical change occurs on the monitor—i.e., the target word is replaced with itself). By using this manipulation, it is possible to calculate how much parafoveal information participants obtain from the word to the right of the current fixation, by subtracting the fixation durations during valid previews from the fixation durations during invalid previews. In this article, the standard terminology is adopted, in which the word before the boundary is referred to as word N, the first word following the boundary is referred to as word N + 1, and the second word following the boundary is referred to as word N + 2. As is shown in the lower panel of Fig. 1b, the gaze-contingent display change manipulation can also be applied to word N + 2.

The preview benefit effect: A reexamination

The advantage of previewing word N + 1 in parafoveal vision is known as the preview benefit effect (Rayner, 1998) and is often thought to reflect a benefit in terms of word recognition times: Larger preview benefits are interpreted to mean faster recognition times once word N + 1 is fixated in foveal vision. However, the exact size of this effect is not known, in part because it varies from study to study, and in part because it has not been systematically investigated. Previous literature reviews have estimated that the preview benefit effect is between 20 and 50 ms (Rayner, 2009; Rayner, White, Kambe, Miller, & Liversedge, 2003). However, there are a number of issues with such estimates. First, they are not derived in ways that are sufficiently well documented. Second, they are not based on all the available evidence, and thus are susceptible to researchers’ intuitions and their own experience with this type of research. Third, because a number of different fixation duration measures are used, the effect size for each individual measure is also not known with certainty.

A few standard fixation duration measures are used for calculating the preview benefit effect. The first-fixation duration (FFD) measures the very first fixation on the target word. The single-fixation duration (SFD) is similar to the FFD, but it reflects only cases in which the target word was fixated exactly once. Another measure is gaze duration (GD), which sums all fixations on the target word before the eyes move to another word. These three measures are often referred to as first-pass reading measures. In addition, second-pass reading measures are also sometimes used. The most common one is the total viewing time (TVT), which sums all fixations on the target word (including when the word is refixated during a regression).
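To make the relationship between these measures concrete, the following R sketch (our illustration, not taken from any of the reviewed studies; the function name and the exact single-fixation criterion are our own choices) computes the four measures from one trial's fixation sequence on a hypothetical target word:

fixation_measures <- function(word, dur, target) {
  # 'word' gives the index of the word fixated on each fixation; 'dur' its duration (ms)
  on_target <- which(word == target)
  if (length(on_target) == 0) return(c(FFD = NA, SFD = NA, GD = NA, TVT = NA))
  first <- on_target[1]
  run_end <- first
  while (run_end < length(word) && word[run_end + 1] == target) run_end <- run_end + 1
  first_pass <- first:run_end                                # initial uninterrupted run on the target
  c(FFD = dur[first],                                        # very first fixation on the word
    SFD = if (length(first_pass) == 1) dur[first] else NA,   # defined only when a single first-pass fixation was made
    GD  = sum(dur[first_pass]),                              # all first-pass fixations
    TVT = sum(dur[on_target]))                               # all fixations, including rereading
}
# Example: word 5 receives two first-pass fixations and one more after a regression
fixation_measures(word = c(4, 5, 5, 6, 5), dur = c(210, 180, 160, 220, 150), target = 5)
# FFD = 180, SFD = NA, GD = 340, TVT = 490, illustrating the ordering FFD <= GD <= TVT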

The problems associated with using multiple fixation duration measures have not been seriously considered until very recently (von der Malsburg & Angele, 2015). One issue with this practice is the high degree of collinearity between these measures: The later measures incorporate the fixations that already contribute to the earlier ones. For example, since all the fixations that contribute to the FFD also contribute to GD and TVT, the preview benefit effect is, in practice, additive across measures, and both the effect sizes and the general magnitudes of the fixation time measures are virtually always related in the following way: FFD ≤ SFD ≤ GD ≤ TVT. Because the effect size for each measure is not known with certainty, not much can be said about how the effect is distributed across the four measures (e.g., is it constrained mostly to the first fixation? Do GD and TVT add to it equally?). A precise estimate for each measure would be useful not only for planning experiments (e.g., sample size calculations), but also for formulating more precise hypotheses and interpreting the results of boundary studies. For example, it is often not clear how to interpret an effect found in one measure but not the others; precise per-measure estimates would make it possible to formulate more specific predictions about the expected outcomes of boundary experiments.

Another aspect of the preview benefit effect that has not been addressed in a satisfactory way is whether it is similar in size across languages. Research in this area has a long-standing bias, in the sense that most studies have been conducted in a few Western languages, such as English, German, and Finnish. Fortunately, this situation has begun to change in recent years, and studies are now being published in languages such as Chinese, Thai, and Korean, to name just a few (see Radach & Kennedy, 2013, for a recent discussion). Because some of these languages, such as Chinese, use a different writing system, the preview benefit effect may differ in nonalphabetical languages.

In fact, one common assumption is that the preview benefit effect should be bigger in Chinese because information is more densely packed and there are no interword spaces (e.g., see Yang, Wang, Tong, & Rayner, 2012). This assumption, however, has not been tested empirically, and it is not immediately clear whether and to what extent preview benefit effects differ between alphabetical languages and Chinese. This question has important theoretical implications, because parafoveal processing plays a key role in computational models such as E-Z Reader (Reichle et al., 1998) and SWIFT (Engbert et al., 2005), which were originally developed for alphabetical languages. Finally, no estimate of the size of the preview benefit effect has so far been based on all of the languages that have been studied with the boundary paradigm.

Preview benefit versus N + 1 preview effects

As we mentioned previously, the preview benefit has traditionally been viewed as the advantage of previewing the target word, in terms of word recognition times. However, two recent lines of evidence suggest that this may be an oversimplification. The first is the finding that parafoveal masks that are used during invalid preview conditions may introduce processing costs (Gagl, Hawelka, Richlan, Schuster, & Hutzler, 2014; Hutzler et al., 2013; Kliegl, Hohenstein, Yan, & McDonald, 2013; Marx, Hawelka, Schuster, & Hutzler, 2015; Yan, Risse, Zhou, & Kliegl, 2012). This evidence suggests that the preview benefit effect is a combination of both preview benefits and preview costs. Because the effect is calculated by subtracting the fixation durations after valid preview from the fixation durations after invalid preview, any processing costs that inflate the latter will also result in an inflated preview benefit effect.

In alphabetical languages, at least, there are a number of ways in which the target word can be masked in parafoveal vision. For example, it can be replaced with Xs (cottagexxxxxxx), random letters (cottagefdiuekl), an unrelated word (cottagekitchen), or a pseudoword (cottageoxypane). Therefore, a number of baselines could be used for calculating the preview benefit effect. The problem of what constitutes a proper baseline is not unique to eye-tracking studies of reading, but is also relevant to priming studies more broadly (see, e.g., Jacobs, Grainger, & Ferrand, 1995). To date, the extent to which different types of parafoveal masks introduce processing costs is poorly understood. In one study, Hutzler et al. (2013) investigated the effect of parafoveal X masks on reading lists of words. By using fixation-related brain potentials, they found that such masks interfere with foveal word recognition. More recently, Marx et al. (2015) also presented evidence that parafoveal X masks and random-letter masks lead to overestimation of the preview benefit effect relative to a more neutral baseline in which the target word is visually degraded. The search for a new and more neutral mask is indeed a commendable effort. However, regardless of whether a mask that yields a “pure” preview benefit without any interference may be found in the future (e.g., through visual degradation; Marx et al., 2015), the processing costs associated with each of the masks outlined above still have not been quantified. Since the boundary technique has been used for the last 40 years, this question is very pertinent, in that dozens, if not hundreds, of boundary experiments have used such masks.

A second line of evidence also suggests that the preview benefit effect may not reflect the pure advantage of having a valid preview of the target word. More specifically, Risse and colleagues (Risse & Kliegl, 2012, 2014) manipulated both the validity of the word in parafoveal vision (valid vs. invalid) and its difficulty (high vs. low). The difficulty was operationalized as the lexical frequency of the target word. They found that what is traditionally known as the preview benefit effect is in fact a combined effect of both the difficulty and the validity of the parafoveal preview. In other words, the preview validity and preview difficulty are two separate effects that, when combined, make up the preview benefit effect. Therefore, even though readers benefit from having a valid preview of the target word, the nature of this preview is also influenced by the lexical difficulty of the word in parafoveal vision. In this sense, these results suggest that the preview benefit may be inherently confounded with preview frequency effects.

In light of the evidence reviewed above, it can be argued that the preview benefit may not necessarily be the pure advantage of having a valid preview of the target word in parafoveal vision. For this reason, in the present article we purposefully avoid talking about a preview benefit effect. Instead, we will use the slightly more conservative term N + 1 preview effect, which denotes the effect resulting from manipulating the preview of word N + 1 in parafoveal vision. Even though there is no doubt that readers obtain a net benefit from previewing word N + 1 in parafoveal vision, at present it is not known what contributes to this benefit. For example, it could consist of a large benefit effect offset by large mask interference effects, or of a smaller benefit effect that is reduced little by interference. For this reason, we prefer to remain agnostic on this issue until more evidence is available (nevertheless, we will attempt to address this question in the empirical part of this article). The issue of whether and to what extent interference effects exist is also theoretically important for computational models of reading. For example, models such as E-Z Reader (Reichle et al., 1998) and SWIFT (Engbert et al., 2005) assume that parafoveal processing confers only a benefit and involves no interference. In this sense, the issue is also important for future model development.

To summarize the discussion so far, it is well known that readers obtain an advantage from previewing word N + 1 in parafoveal vision. However, the size of this effect for the different fixation measures is much less clear, and there is a great deal of uncertainty regarding how the effect differs between alphabetical and Chinese studies. Likewise, the consequences of using different parafoveal masks are not well understood. These are all issues that we will return to in the empirical part of this article. Before that, however, we will also consider the evidence on N + 2 preview effects.

N + 2 preview effects during reading

Originally, research on parafoveal preview effects focused on word N + 1. Interest in N + 2 preview effects is more recent and has primarily been motivated by the possibility of testing competing predictions of computational models of eye-movement control during reading (Rayner, Juhasz, & Brown, 2007). Whereas N + 2 preview effects are expected in guidance-by-attentional-gradient models such as SWIFT (Engbert et al., 2005), they are generally not predicted by serial-attention-shift models such as E-Z Reader (Reichle et al., 1998). More recent simulations with SWIFT have confirmed that the model is able to simulate N + 2 preview effects without being directly fitted to such effects (Risse, Hohenstein, Kliegl, & Engbert, 2014). Interestingly, simulations with E-Z Reader have also demonstrated that N + 2 preview effects are theoretically possible, although they are likely to be small in size (Schotter, Reichle, & Rayner, 2014). However, it is currently not clear how well these simulations fit the empirical data, due to the mixed findings in the literature. As is shown in Fig. 1, if word N + 1 is short, then the next word (N + 2) can also be previewed in parafoveal vision prior to crossing the boundary. However, even though N + 2 preview effects have been investigated for the last 10 years, the relevant studies have yielded inconsistent findings. As a result, there is currently no clear consensus as to whether any processing of word N + 2 occurs in parafoveal vision.

Judgments about the existence of such effects have usually been made by comparing statistical significance across studies. However, this practice is problematic, because it can result in misleading conclusions (Gelman & Stern, 2006). The use of p values also does not answer the question of real interest—How likely is it that readers obtain an advantage from previewing word N + 2? If these studies are examined even more closely, other statistical issues may further obscure the underlying effect. Because word N + 2 is situated closer to the limits of parafoveal vision than is word N + 1, the visual acuity there should by definition be lower. For this reason, N + 2 preview effects, if they exist, should almost certainly be smaller in size than N + 1 effects. Even though it is well known that larger sample sizes are needed to achieve sufficient statistical power with smaller effect sizes, this issue has not always been considered when planning studies investigating N + 2 preview effects.

Another issue that may create further confusion is the fact that some N + 2 studies have also included an N + 1 boundary in the same experiment (Rayner et al., 2007; Yang, Rayner, Li, & Wang, 2012; Yang, Wang, Xu, & Rayner, 2009). The general reasoning behind this was to see whether a dissociation could be found between parafoveal preview and boundary location. Unfortunately, such a dissociation can be found only by analyzing the interaction between boundary location and parafoveal preview, and not by comparing the statistical significance of the two preview effects (see Nieuwenhuis, Forstmann, & Wagenmakers, 2011). Because not all experiments have reported such an interaction, N + 2 preview effects at the very least warrant further investigation. The use of two boundary locations in the same experiment may also introduce additional artifacts that may not be found in experiments with a single (N + 2) boundary (Footnote 1). For example, participants may be more likely to detect display changes when there are two types of boundary manipulation in the same experiment. Since preview effects from word N + 1 are well established, and since comparison of the statistical significance of the effects of two boundaries is generally not very informative, it may be worthwhile to use only one boundary location in future N + 2 studies.

The present study

The main goal of the present study was to reexamine N + 1 and N + 2 preview effects by undertaking a systematic statistical synthesis of the previous findings. We had four main questions of interest: (1) What is the size of N + 1 and N + 2 preview effects? (2) What is the probability that N + 2 preview effects exist, given the evidence to date? (3) Do N + 1 and N + 2 preview effects differ between alphabetical and Chinese studies? (4) Are N + 1 preview effects influenced by the type of parafoveal mask?

Although it may seem that determining the size of the N + 1 preview effect is not theoretically important, estimating this effect is crucial, among other things, for evaluating the magnitude of the N + 2 effect. Because word N + 1 and word N + 2 are two points on the continuum of decreasing visual acuity, N + 2 preview effects can be meaningfully compared only in relation to the N + 1 preview effect. However, because a precise estimate of the latter has not yet been systematically derived, determining the size of this effect was also important. In addition to that, the two effect sizes are very useful for evaluating the theoretical claims and simulations of computational models of reading.

Motivation for the meta-analysis

In order to fully answer these questions, the present meta-analysis adopted a Bayesian approach to statistical inference. Although this method of inference is less commonly known in psychology, it confers a number of advantages for the present meta-analysis. First, Bayesian statistics offers an intuitive and rigorous way of handling uncertainty that is based on the laws of probability. Second, the results of the analysis are conditional on the data, and thus can be used to calculate the probability of preview effects given the data—that is, P(Effect | Data). This is in stark contrast to the frequentist approach to statistical inference, which yields the less informative probability of obtaining the data given the effect—that is, P(Data | Effect). Finally, the results of Bayesian inference, which are expressed as posterior probability distributions, offer much richer information about parafoveal preview effects (e.g., what is the probability that N + 2 preview effects are bigger than 5 ms?). Posterior distributions also avoid the use of p values, which are often misinterpreted as providing information that they do not (Perezgonzalez, 2015). Because null hypothesis significance testing is a mishmash of two incompatible theories, p values are often ascribed statistical properties they do not actually possess (for more details, see Kline, 2004).

Although Bayesian statistics is less familiar to psychology researchers, it has recently increased in popularity (Andrews & Baguley, 2013). In light of the well-known limitations of null hypothesis significance testing (Cohen, 1994; Cumming, 2013; Gigerenzer, 2004; Royall, 1997), in recent years there has been a call for the use of Bayesian inference in psychology (e.g., Andraszewicz et al., 2014; Kruschke, 2010; Kruschke & Liddell, 2015; Rouder, Speckman, Sun, Morey, & Iverson, 2009; Wagenmakers, 2007; Wagenmakers, Lodewyckx, Kuriyal, & Grasman, 2010). Bayesian approaches to meta-analysis have traditionally been used in the biomedical sciences (e.g., Higgins & Spiegelhalter, 2002; Salpeter, Cheng, Thabane, Buckley, & Salpeter, 2009; see Sutton and Abrams, 2001, for a review), but more recently also in psychology (Rouder & Morey, 2011; Rouder, Morey, & Province, 2013) and psycholinguistics (Engelmann, Jäger, & Vasishth, 2016; Vasishth, Chen, Li, & Guo, 2013). For example, Vasishth et al. (2013) used a Bayesian random-effects meta-analysis to investigate conflicting evidence about the processing of relative clauses in Chinese. Their results showed that the probability that subject-relative clauses are easier to process than object-relative clauses, given the data, is about 80 %.

The present meta-analysis is similar in spirit to that of Vasishth et al. (2013), and here we also adopted a random-effects meta-analytical model. This particular type of model was chosen because it assumes that the observed effect sizes vary around a true, unknown effect. This makes it possible to model the variability associated with the use of different languages and writing systems, as well as the variability due to different experimental designs, lab practices, and so forth. The present meta-analysis was mostly exploratory in nature. However, on the basis of the literature reviewed above, we expected that (1) N + 1 preview effects would be somewhat smaller (e.g., around 20–30 ms) than is sometimes assumed, and (2) the posterior distribution of N + 2 preview effects would not be centered on 0.

Method

The meta-analysis was conducted by following the guidelines of the PRISMA statement (Moher, Liberati, Tetzlaff, Altman, & the PRISMA Group, 2009). The graphical presentation of results was done by following the suggestions of Anzures-Cabrera and Higgins (2010).

Literature search

A search of the literature was done separately for N + 1 and N + 2 preview studies. Figure 2 shows a flowchart of the process. Google Scholar and Web of Science were searched with “word n + 2 preview” and “n + 2 preview reading” as keywords, for N + 2 studies, and “word n + 1 preview,” “preview benefit reading,” and “parafoveal preview reading” as keywords, for N + 1 studies. This was done in September 2015. The search process was completed in four stages: (1) recording all relevant articles in a database, (2) removing duplicate entries, (3) screening the full-text articles for experiments with an N + 1/N + 2 boundary manipulation, and (4) reading the full text of the articles identified in the previous stage and coding them if they met the inclusion criteria. During the fourth stage, care was also taken to identify cited studies that had been missed in the literature search and that might be relevant. Such studies (N = 18) were then reviewed using the same process.

Fig. 2

Flowchart of the literature search process for N + 1 and N + 2 preview studies. The breakdown of excluded studies is only approximate, because some of them were excluded for more than one reason. Six of the included experiments had both N + 1 and N + 2 boundaries

The full study inclusion criteria can be found in Appendix A. In short, the experiments had to be methodologically sound and not deviate too much from conditions that are typical for everyday reading. As a result of this, 44.2 % of all reviewed N + 1 experiments and 64.7 % of all N + 2 experiments were included in the meta-analysis. Although this may seem like a high exclusion rate, this was necessary to ensure that only studies that had similar parafoveal masks and experimental designs were analyzed together.

Publication bias

Publication bias is a threat to meta-analyses, because the literature consists mostly of studies that report positive findings (Vasilev, 2013), and negative results are less likely to be written up and submitted for publication (Franco, Malhotra, & Simonovits, 2014). In this meta-analysis, two unpublished (at the time of the analysis) experiments were included (Footnote 2). Furthermore, funnel plots were used to assess the data for publication and other related biases (Sterne et al., 2011; see Appendix B). On the whole, we found no clear evidence of systematic bias in the data, while keeping in mind that a small degree of heterogeneity was expected, for reasons discussed below.

Data coding

For all experiments included in the meta-analysis, the following types of information were extracted: mean fixation durations (FFD, SFD, GD, or TVT), their respective standard deviations, the sample size, and the language of the study (a summary of this information is available in supplementary file #1). This information was further used to combine the experiments into different groups for analysis. For studies with an N + 1 boundary, descriptive statistics were coded for the following types of parafoveal masks: random string of letters, a string of Xs, an unrelated word, a pseudoword, an orthographically related word/nonword, a semantically related word, or a phonologically related word/nonword (Footnote 3). A pseudoword mask was defined as a nonword that was pronounceable. An orthographic mask was defined as a word/nonword that had at least a 50 % letter overlap with the target word, regardless of the position of these overlapping letters (in all Chinese studies, the orthographic masks were classified as such by the original authors). Since there was no easy way to quantify the degrees of semantic and phonological relatedness of the mask and the target word, the classification of the original authors was taken at face value. Finally, due to the much smaller number of N + 2 studies, the effect of the mask type could not be addressed in any meaningful way. Whenever two masks were available, the one that best preserved the consistency across studies was chosen (Footnote 4). For example, if one experiment had only an unrelated-word mask, but another experiment had both unrelated-word and pseudoword masks, the unrelated-word mask was coded in both experiments.
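As an illustration of the 50 % letter-overlap criterion, a position-independent check could be implemented in R as follows (a sketch only; the function name and the exact way repeated letters are counted are our own assumptions, not necessarily the procedure used when coding the studies):

letter_overlap <- function(mask, target) {
  m <- strsplit(mask, "")[[1]]
  tg <- strsplit(target, "")[[1]]
  shared <- 0
  for (ch in tg) {
    hit <- match(ch, m)
    if (!is.na(hit)) { shared <- shared + 1; m <- m[-hit] }   # each mask letter can match only once
  }
  shared / length(tg)                                         # proportion of target letters present in the mask
}
letter_overlap("oxypane", "cottage") >= 0.5   # FALSE: this pseudoword mask would not count as orthographic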

Missing data

All of the N + 2 studies reported the full data needed for the analysis. However, 27 % of all N + 1 studies did not report a measure of variance. This was handled in the following way. For studies published in the last 10 years, the authors were contacted and asked whether they could provide the missing data (8 % of all the missing data were recovered in this way). In all other cases, the missing values were imputed by pooling the standard deviations from the remaining studies. This approach has been shown to produce accurate results (Furukawa, Barbui, Cipriani, Brambilla, & Watanabe, 2006). Also, it had a number of advantages over simply excluding such studies (e.g., it increased the sample size and it did not introduce study-selection biases; see Lajeunesse, 2013).
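One simple way such pooling could be implemented is sketched below (this is our assumption about the pooling rule; the procedure actually used follows Furukawa et al., 2006, and may differ in detail):

# Impute a missing SD with the sample-size-weighted pooled SD of the studies that did report one
pooled_sd <- function(sds, ns) sqrt(sum((ns - 1) * sds^2) / sum(ns - 1))
pooled_sd(sds = c(45, 52, 60), ns = c(24, 32, 40))   # hypothetical reported SDs and sample sizes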

Data analysis

A random-effects model was chosen because some heterogeneity between studies was expected a priori, due to the different types of languages (alphabetical vs. Chinese). It should be noted, however, that a random-effects model does not account for heterogeneity per se. Rather, heterogeneity is no longer a problem, because the model already assumes that such heterogeneity exists (Higgins & Green, 2011).

Meta-analysis

A Bayesian random-effects meta-analysis was defined as follows (Engelmann et al., 2016; Schmid & Mengersen, 2013). Let T_i be the observed effect size (in milliseconds) for study i, where i ranged from 1 to n (the number of experiments in the analysis). Positive values represented evidence for a preview effect, and negative values represented evidence against it. Let θ be the true, unknown preview effect that was to be estimated by the model. Furthermore, let θ_i be the true preview effect in the ith study, and σ_i² the true variance of the sampling distribution in the ith study. Finally, let τ² be the unknown between-study variance. Therefore, the model was constructed as:

$$ \begin{aligned} T_i \mid \theta_i, \sigma_i^2 &\sim N(\theta_i, \sigma_i^2), \quad i = 1, 2, 3, \dots, n \\ \theta_i \mid \theta, \tau^2 &\sim N(\theta, \tau^2), \\ \theta &\sim \mathrm{Uniform}(-200, 200), \\ \tau &\sim \mathrm{Uniform}(0, 200). \end{aligned} $$
(1)

Each σ_i was estimated from the standard error of study i. In a within-study design such as the experiments in the present meta-analysis, this is calculated in the following way (Borenstein, Hedges, & Higgins, 2009, p. 24). If S_1 is the standard deviation of the valid preview condition, and S_2 is the standard deviation of the invalid preview condition, the variance of the mean difference T, V_T, is calculated as

$$ V_T = \frac{S_{diff}^2}{n}, $$
(2)

where n is the sample size and S_diff is given by

$$ S_{diff} = \sqrt{S_1^2 + S_2^2 - 2 \times r \times S_1 \times S_2}, $$
(3)

where r × S_1 × S_2 denotes the covariance of the two means. The standard error is then given by

$$ SE_T = \sqrt{V_T}. $$
(4)

In Eq. 3 it was not possible to estimate the covariance of the two means, because we did not have access to the raw data. Therefore, the calculations assume that the covariance was equal to 0. It should be noted that, as a consequence, the variance parameter is likely overestimated in the analyses. In the meta-analysis, precision was defined as the inverse of the within-study variance of the sampling distribution (i.e., 1/SE_T²). This means that studies with smaller variance were given greater weight in the analysis. The same applies to the between-study variance τ², since the between-study variance is added to the within-study variance of the sampling distribution in calculating the weight of each study.
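In R, Eqs. 2–4 as applied here reduce to the following sketch (our illustration; the function and variable names are ours, and r is fixed at 0 because the correlation between conditions was unavailable):

se_effect <- function(sd_valid, sd_invalid, n, r = 0) {
  s_diff <- sqrt(sd_valid^2 + sd_invalid^2 - 2 * r * sd_valid * sd_invalid)   # Eq. 3
  v_t <- s_diff^2 / n                                                         # Eq. 2
  sqrt(v_t)                                                                   # Eq. 4: SE of the mean difference
}
se <- se_effect(sd_valid = 45, sd_invalid = 52, n = 30)   # hypothetical condition SDs and sample size
weight <- 1 / se^2                                        # the precision used to weight this study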

Uniform distributions were used as the priors in the analysis. These noninformative prior distributions make each value on these intervals equally likely; that is, they do not contain any meaningful assumptions about the sizes of preview effects. In this way, maximizing the posterior probability is equivalent to maximizing the likelihood (Edwards, 1974). In other words, the data are allowed to “speak for themselves,” and the prior distributions have very little to no influence on the results.
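For concreteness, the model in Eq. 1 with these priors can be expressed in JAGS roughly as follows (a sketch with our own variable names, not a reproduction of the code used for the reported analyses; note that JAGS parameterizes the normal distribution by its precision, 1/variance):

meta_model <- "
model {
  for (i in 1:n) {
    prec[i] <- 1 / (se[i] * se[i])       # within-study precision, 1/SE^2
    y[i] ~ dnorm(theta.i[i], prec[i])    # observed effect size T_i of study i (ms)
    theta.i[i] ~ dnorm(theta, prec.tau)  # true preview effect in study i
  }
  theta ~ dunif(-200, 200)               # overall preview effect (ms)
  tau ~ dunif(0, 200)                    # between-study SD (ms)
  prec.tau <- 1 / (tau * tau)
}"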

To check whether the chosen priors influenced the results, sensitivity analyses were also conducted by using different priors (the results are reported in supplementary file #3). The additional priors used were Gamma(0.001, 0.001) and Normal(0, 100²)I(0,) for τ, and Normal(0, 100²) for θ (note that the Gamma prior was on the precision of τ). The results of the sensitivity analysis showed that the choice of priors had very little effect on the results and had practically no influence on the conclusions from the analyses. The only notable finding was that the Gamma prior tended to underestimate τ (cf. Gelman, 2006), but even that did not result in a dramatic difference in the results.

Meta-regression

As we mentioned above, a random-effects meta-analysis can account for heterogeneity between studies, but it does not help explain what gives rise to this heterogeneity in the first place (Welton, Sutton, & Cooper, 2012). However, it is possible to use random-effects meta-regression to investigate how categorical study characteristics (e.g., whether the language was alphabetical or Chinese) are associated with the effect of interest (Deeks, Higgins, & Altman, 2008). In this case, the model in Eq. 1 was extended by adding a regression coefficient β for the underlying effect of the covariate of interest (Welton et al., 2012). Therefore, the meta-regression model was constructed as follows (Engelmann et al., 2016; Welton et al., 2012):

$$ \begin{aligned} T_i \mid \theta_i, \beta, \sigma_i^2 &\sim N(\theta_i + \beta \times \mathrm{covariate}_i, \sigma_i^2), \quad i = 1, 2, 3, \dots, n \\ \theta_i \mid \theta, \tau^2 &\sim N(\theta, \tau^2), \\ \beta &\sim \mathrm{Uniform}(-200, 200), \\ \theta &\sim \mathrm{Uniform}(-200, 200), \\ \tau &\sim \mathrm{Uniform}(0, 200). \end{aligned} $$
(5)

In this model, a prior on β was also needed. For consistency purposes, and since the possible range of values for β was constrained, we also used a uniform prior. However, a sensitivity analysis with a Normal(0, 100²) prior on β showed that the results were not influenced by this decision (see supplementary file #3). In this meta-regression model, θ_i is the effect in the ith study that was adjusted for the covariate effect β; therefore, the effect of the covariate β became the parameter of interest, rather than θ (Engelmann et al., 2016).

Meta-regression was used to investigate two of the main questions identified in the introduction: (1) whether effect sizes differ as a function of the type of language (alphabetical vs. Chinese), and (2) whether the N + 1 effect size differs between different parafoveal masks. For the first question, a sum contrast was used in which alphabetical languages were coded as –1 and Chinese studies as 1; therefore, positive values of β would indicate that preview effects were larger for Chinese than for alphabetical languages. For the second question, the same type of contrast was set up post hoc, after inspecting the effect sizes of the different masks; for simplicity, the specific comparisons are shown in the Results section. It is important to note, however, that the conclusions from meta-regression are only observational in nature (Thompson & Higgins, 2002). Therefore, the meta-regression results should be considered exploratory, and they need to be verified by future experiments.
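Relative to the JAGS sketch given earlier, the meta-regression of Eq. 5 changes only the likelihood line and adds a prior for β (again our own illustration, with the covariate supplied as data coded –1 for alphabetical and +1 for Chinese studies):

metareg_model <- "
model {
  for (i in 1:n) {
    prec[i] <- 1 / (se[i] * se[i])
    y[i] ~ dnorm(theta.i[i] + beta * covariate[i], prec[i])  # covariate-adjusted study effect
    theta.i[i] ~ dnorm(theta, prec.tau)
  }
  beta ~ dunif(-200, 200)    # coefficient of interest; positive values indicate larger effects for Chinese studies
  theta ~ dunif(-200, 200)
  tau ~ dunif(0, 200)
  prec.tau <- 1 / (tau * tau)
}"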

Posterior sampling

Sampling from the posterior distribution was done with JAGS (Plummer, 2003) using the R software, version 3.1.3 (R Development Core Team, 2015). Three Markov-chain Monte Carlo (MCMC) chains were run with 75,000 iterations each. Checks were performed to ensure that the initial values did not influence the results. The first 3,000 iterations were discarded as burn-in. Chain thinning of 5 was used to reduce the influence of autocorrelation. Convergence was assessed through visual inspection and Gelman and Rubin’s (1992) convergence diagnostic. The evidence suggested that all models had converged (for details on MCMC sampling, see Lynch, 2007).
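The sampling setup just described corresponds roughly to the following rjags calls (a sketch; 'meta_model' is the model string sketched earlier, and 'effect_ms' and 'se_ms' stand for hypothetical vectors of study-level effect sizes and standard errors):

library(rjags)                              # interface to JAGS; also attaches coda
jags_data <- list(y = effect_ms, se = se_ms, n = length(effect_ms))
m <- jags.model(textConnection(meta_model), data = jags_data, n.chains = 3)
update(m, n.iter = 3000)                    # discard burn-in
draws <- coda.samples(m, variable.names = c("theta", "tau"),
                      n.iter = 75000, thin = 5)
gelman.diag(draws)                          # Gelman and Rubin (1992) convergence diagnostic
summary(draws)                              # posterior means and 95% credible intervals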

Types of analyses

In the main analysis, we fitted separate models for N + 1 and N + 2 preview effects. For N + 2 effects, all 11 studies were included in the analysis. For N + 1 studies, only the experiments containing a random-letter, unrelated-word, X-string, or pseudoword mask were included. This was done because such masks do not contain any useful information about the target word. The effects found in these two analyses will be referred to as N + 1 and N + 2 preview effects, respectively. In all analyses, separate models were fit for each dependent variable.

In addition to these general analyses, we did a separate breakdown of N + 1 preview effects by parafoveal masks. The minimum number of studies needed to fit a model was set at six. The purpose of this breakdown was to investigate the effect sizes associated with each mask type. Separate models were again fit for each fixation duration measure when at least six studies were available. Additionally, whenever enough experiments were available, all analyses were completed once for all studies, once for alphabetical-only studies, and once for Chinese studies. Finally, it should be kept in mind that all probabilities reported in this article are not p values and cannot be interpreted as such; rather, they are the posterior probabilities of parafoveal preview effects, given the data. For a gentle introduction to using and interpreting the results of Bayesian inference, readers are referred to Nicenboim and Vasishth (2016; see also Kruschke, 2014).

Results

Baseline fixation time measures

First, we analyzed the fixation time measures (FFD, SFD, GD, and TVT) on the target word in the valid preview condition. In gaze-contingent display change experiments, this condition provides the baseline that the experimental preview conditions are compared against. The posterior distribution in this analysis gives the fixation time given all the available data. The results of this analysis are presented in Table 1. The purpose of this analysis was to calculate the average duration of each fixation time measure when parafoveal preview was allowed. Since the valid preview condition corresponds to normal reading, the results can also be used as benchmark data for designing experiments or computational models of reading.

Table 1 Posterior distribution means and 95 % credible intervals for fixation time measures on the target word when parafoveal preview was allowed (i.e., in the valid preview condition)

For this analysis, nine additional experiments were added. These experiments were not included in the meta-analysis of parafoveal preview effects because they did not have a mask that could be assigned to one of the mask types analyzed in this article. However, since these studies did have a valid preview condition, they were suitable for this baseline analysis.

N + 2 preview effects

Means and 95 % credible intervals for the posterior distributions of the fixation time measures for N + 2 preview effects are presented in Fig. 3. When all the studies were combined, the posterior distributions of FFD and GD showed a 5-ms preview effect as compared to the valid-preview baseline. When only the alphabetical studies were analyzed, the estimates of the two effects were slightly smaller and less certain. Due to the fact that some previous studies (e.g., Kliegl, Risse, & Laubrock, 2007) have reported N + 2 preview effects on word N + 1, we also ran a separate analysis with the eight studies that reported descriptive statistics for word N + 1. The mean N + 2 preview effects on word N + 1 were 6.7 ms for FFD [95 % credible interval: –4, 17.2] and 7.4 ms for GD [95 % credible interval: –5.2, 20.2].

Fig. 3

Forest plot of N + 2 preview effects on word N + 2. Plotted are the observed effect sizes with 95 % confidence intervals (CIs) and the posterior estimates of the effect sizes with 95 % credible intervals (CrIs). The sizes of squares are proportional to the weight of the study (i.e., the inverse of the within-study variance of the sampling distribution). The results for individual studies are based on the full model with all 11 experiments

When investigating N + 2 preview effects, the length of word N + 1 is an important factor, because it determines whether and to what extent word N + 2 can be previewed in parafoveal vision. Therefore, it can be argued that, for alphabetical languages, N + 2 preview effects can be reliably found only when word N + 1 is short (e.g., three or four letters). Ideally, one would include the length of word N + 1 in alphabetical languages as a covariate in a meta-regression model. However, in the present analysis this was not feasible, because there were only seven alphabetical studies, and meta-regression is not recommended with fewer than ten studies (Deeks et al., 2008). Instead, the length of word N + 1 for all studies is included in Fig. 3 for visual comparison. In addition, we fit a separate meta-analytical model for the five alphabetical studies that had a word N + 1 length of three letters. The expectation that N + 2 preview effects would be larger when word N + 1 was three letters long was confirmed for GD (mean effect size: 6.9 ms [95 % credible interval: –19.7, 33.4]), but not for FFD (mean effect size: 2.3 ms [95 % credible interval: –17.2, 21.8]). Therefore, the results suggest that the effect size for GD is larger when word N + 1 is three letters long, although it should be kept in mind that this finding could be influenced by the small sample size.

To take full advantage of the Bayesian approach in the present meta-analysis, we also calculated the probability that N + 2 preview effects would be bigger than 1 ms. The threshold of 1 ms was taken as a purely statistical measure of the minimum effect size anyone would sensibly consider to indicate that N + 2 preview effects exist. Of course, researchers may have their own beliefs about what threshold can be used to determine whether N + 2 preview effects exist. For example, a skeptical researcher may believe that only effect sizes greater than 4 ms can be taken as evidence that N + 2 preview effects are real. Such beliefs can be easily accommodated within the Bayesian framework, since researchers can simply look up the probabilities associated with their beliefs. This information is presented in Fig. 4.
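Computing these probabilities from the posterior is straightforward: they are simply the proportion of MCMC draws of θ that exceed the chosen threshold (a sketch; 'draws' is the posterior sample object from the sampling sketch given in the Method section):

theta_draws <- as.matrix(draws)[, "theta"]   # pool the chains into one vector of posterior samples
p_greater <- function(x) mean(theta_draws > x)
p_greater(1)    # P(N + 2 preview effect > 1 ms | data)
p_greater(4)    # the threshold a more skeptical reader might adopt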

Fig. 4

Probabilities that N + 2 preview effects are greater than some number X, given the data. All, all studies; alphab., alphabetical studies

When all studies were combined, the probabilities that the N + 2 preview effects on word N + 2 were bigger than 1 ms were .87 and .77 for FFD and GD, respectively. When only alphabetical studies were analyzed, these probabilities were .70 for FFD and .53 for GD. The probabilities that N + 2 preview effects on word N + 1 were bigger than 1 ms were .85 for both FFD and GD.

Given that the credible intervals for both FFD and GD included 0, one could reasonably ask whether this could indicate a null effect. The fact that the intervals include 0 means that this number cannot be rejected as a credible value for N + 2 preview effects. However, this also does not prove that the effect is null. To prove a null effect, it would be necessary to show that the 95 % credible interval is contained within the null region that was defined in the present analysis (from –1 to 1 ms; cf. Kruschke, 2014). In this sense, a null effect can only be accepted or rejected when the posterior estimate is sufficiently precise.
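In terms of the posterior samples, this check amounts to asking whether the entire 95 % credible interval falls inside the null region (a sketch using the hypothetical 'theta_draws' vector from above):

ci <- quantile(theta_draws, probs = c(0.025, 0.975))
all(ci > -1 & ci < 1)   # TRUE would support a null effect; otherwise the result remains inconclusive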

Finally, we used meta-regression to investigate whether the N + 2 preview effects on word N + 2 were greater for Chinese than for alphabetical studies. This was possible due to the larger number of studies (n = 11). Recall that β here reflects the difference in effect sizes between the alphabetical and Chinese studies, and that positive values indicate bigger effect sizes for Chinese studies. There was a .65 probability that Chinese studies had a bigger effect size for FFD (mean estimate of β: 1.5 [95 % credible interval: –6.2, 9.4]). There was a higher probability (.80) that the effect size was bigger for Chinese studies for GD (mean estimate of β: 5.1 [95 % credible interval: –6.7, 16.7]). Therefore, the results suggest that the N + 2 preview effect on word N + 2 is bigger for Chinese than for alphabetical studies, and that this difference is more pronounced for GD than for FFD.

N + 1 preview effects

Since our analysis included many more studies on N + 1 than on N + 2 preview effects, the results for the N + 1 preview effect are presented in Table 2 (however, see supplementary file #2 for forest plots). This analysis includes all experiments with an N + 1 boundary manipulation that had an invalid parafoveal preview (unrelated-word, pseudoword, random-letter, or X-string mask). The probabilities associated with these effects are presented in Fig. 5. As the breakdown shows, the majority of studies were alphabetical (78 % for FFD, 73 % for SFD, 75 % for GD, and 100 % for TVT). The effect sizes for all studies increased in the expected direction: FFD was smaller than SFD, which was smaller than GD, which in turn was smaller than TVT.

Table 2 Posterior distribution means and 95 % credible intervals for the effect sizes of the N + 1 preview effect (in milliseconds)
Fig. 5

Probabilities that N + 1 preview effects are greater than some number X, given the data. The analysis was done on all studies from all languages

Differences between alphabetical and Chinese studies

As in the N + 2 analysis, we also fitted meta-regression models to investigate differences in the N + 1 preview effects between Chinese and alphabetical studies when no useful information was obtained from the target word. The findings are presented in Table 3. These results suggest that the common assumption that N + 1 preview effects are bigger for Chinese studies holds up only for GD. However, this result could be confounded by the type of mask, because two masks (a random string of letters and a string of Xs) are not possible in Chinese studies. A separate analysis with the unrelated-word mask alone showed an overall high probability that effect sizes were bigger for Chinese studies. Critically, the difference for GD was almost 10 ms. However, the opposite result was found for the pseudoword mask: There was a relatively low probability that effect sizes were bigger for Chinese studies.

Table 3 Mean estimates of the difference (β) in the N + 1 preview effects between Chinese and alphabetical studies

Differences between parafoveal masks

The N + 1 preview effect was further analyzed by checking how the type of parafoveal mask influenced the effect size. This exploratory analysis was done by fitting separate meta-analytical models for each mask type (detailed results for each effect size are available in supplementary file #2 [Figs. S2.5 and S2.6]). These exploratory findings were then used to fit meta-regression models that tested for differences in all possible pairwise comparisons of the masks. The results of these analyses are presented in Fig. 6. In contrast to all of the analyses so far, this one also included masks that contained some information about the target word (orthographic, phonological, or semantic).

Fig. 6

A heat map of facilitation (green) versus interference effects (red) associated with the type of parafoveal mask (based on all studies). The masks on the y axis are compared to the masks on the x axis (arrows are added to show the direction in which masks can be compared)

It should be noted that although meta-regression does not solve the problem of the different sample sizes for each mask type, the mean difference for each comparison is weighted by the precision of the studies that were included in each analysis. It is also worth remembering that these results should not be interpreted as orthographic, phonological, or semantic preview benefit effects. Rather, this is an analysis of the trade-off between facilitation and interference effects, depending on what type of mask is used.

There were three main findings. First, as expected, parafoveal masks that contained some information about the target word (orthographic, phonological, or semantic) resulted in the greatest facilitation. Conversely, unrelated-word, pseudoword, random-letter, and X-string masks led to the greatest interference in parafoveal processing, since they did not contain any useful information about the target word. Second, we also found interference effects among the masks that did not provide any information about the target word: The unrelated-word mask resulted in the least interference relative to the other three masks, and a string of Xs resulted in the greatest interference. Finally, the amount of facilitation produced by the masks that provided some information about the target word depended on the baseline that they were compared to.

To check whether the results were influenced by including all Chinese studies, the analysis was also repeated with alphabetical-only studies (fewer mask comparisons were possible, due to the smaller sample sizes). Although there were minor differences, the main results remained: Unrelated-word masks led to the least interference, whereas X-string masks led to the greatest interference (4.6 ms larger for FFD, and 10.5 ms larger for GD).

Robustness of the main results

We checked the robustness of the main results in order to make sure that they were not sensitive to the exclusion of any individual study (Deeks et al., 2008). To do this, we used the leave-one-out method, in which the same analysis was repeated by omitting one different study each time (see, e.g., Greenhouse & Iyengar, 2009, pp. 423–424). For the main N + 2 results (presented in Fig. 3), the effect sizes ranged from 4.7 to 5.9 ms for FFD (SD = 0.32), and from 3.3 to 7.9 ms for GD (SD = 1.16). Therefore, the results for FFD were influenced very little by individual studies; the results for GD were influenced slightly more, but this was still within a reasonable range for this number of studies. The results from the main N + 1 analysis presented in Table 2 were also largely unaffected by this analysis. When studies from all languages were analyzed together, the effect sizes ranged from 24.9 to 26.4 ms for FFD (SD = 0.23), 30.7 to 33.7 ms for SFD (SD = 0.71), 38.6 to 40.5 ms for GD (SD = 0.32), and 44 to 49.1 ms for TVT (SD = 1.29).
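Schematically, the leave-one-out check refits the same model once per omitted study (a sketch; 'studies' is a hypothetical data frame of study-level inputs, and 'fit_meta' stands for a hypothetical wrapper that fits the model of Eq. 1 and returns the posterior mean of θ):

loo <- sapply(seq_len(nrow(studies)), function(i) fit_meta(studies[-i, ]))
range(loo)   # range of the effect size across the leave-one-out refits
sd(loo)      # the SD reported above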

Discussion

The present results provide a new, comprehensive perspective on what can be believed about parafoveal preview effects, given the available data. We will use this new perspective to revisit the open questions that we identified in our review of the literature: (1) the existence of N + 2 effects, (2) how the N + 1 preview effect differs as a function of the parafoveal mask, and (3) how it differs as a function of the type of language. Finally, we will consider issues related to interpreting parafoveal preview effects. Before we address these issues, we will briefly consider the size of the N + 1 preview effect.

How big is the N + 1 preview effect?

The N + 1 preview effect has been investigated for the past 40 years. However, considerable uncertainty has surrounded the sizes of this effect reported in previous literature reviews on the topic (Rayner, 2009; Rayner et al., 2003). By taking into account all of the available data, the present meta-analysis has greatly reduced this uncertainty. On the basis of the present results, we can conclude that there is a 95 % probability that the N + 1 preview effect is smaller than 29 ms for FFD and 45 ms for GD. As the results show, the effect was not constrained to the so-called early measures (FFD, SFD, and GD), but increased further for TVT. The precise size of the N + 1 preview effect may not seem highly theoretically important, but as we will show in the next section, it has important implications when we evaluate model predictions about the existence and effect sizes of N + 2 preview effects.

Do N + 2 preview effects exist?

Perhaps the most controversial question addressed in this article is whether N + 2 preview effects exist. The present results demonstrate that the N + 2 preview effect on word N + 2 is mostly constrained to FFD and that its size is about 5 ms. Furthermore, there was a high probability that this effect was not centered on 0. Additionally, we found that the effect was larger for Chinese than for alphabetical studies, and that this difference was bigger for GD than for FFD. This is consistent with the view that preview effects are larger for nonalphabetical languages such as Chinese, due to the greater proximity of parafoveal words to the current fixation point (Yang, Wang, et al., 2012). However, it is important to keep in mind that this analysis was based on a small number of studies. In this sense, more research will be needed to obtain a better estimate of how N + 2 preview effects differ between alphabetical and Chinese studies. Studies comparing the magnitudes of N + 2 preview effects in different languages within bilingual participants may be particularly helpful. In the meantime, it can be concluded that, on the basis of the available evidence, the N + 2 preview effect on word N + 2 across languages is very likely not centered on 0 but is modest in size. Therefore, to answer the question posed at the beginning of this section, it does appear that N + 2 preview effects exist.

One important question is whether the magnitude of this effect is of any practical or theoretical significance. Realistically, large N + 2 preview effects should be prevented by the distance of word N + 2 from the fixation point and the resulting loss of visual acuity. Therefore, as has been argued previously, the size of the N + 2 effect can be meaningfully compared only in relation to the size of the N + 1 effect. This comparison is of course limited by the different numbers of studies in each analysis. However, the best available evidence to date suggests that the size of the N + 2 preview effect is a nontrivial 20 % of the N + 1 effect size for FFD, and 14 % for GD. Therefore, even though the N + 2 effect is rather small, it is not negligible when considered in relation to the N + 1 effect.

On a more practical level, the present results are very useful for planning future N + 2 studies. They also have theoretical implications for computational models of reading. Models such as E-Z Reader (Reichle et al., 1998) and SWIFT (Engbert et al., 2005) make different predictions about the parafoveal processing of word N + 2. However, recent simulations with the two models have shown that they can both generate N + 2 preview effects of the magnitudes observed in this meta-analysis (Risse et al., 2014; Schotter et al., 2014). This suggests that modest N + 2 effects can be accounted for by both serial-attention and parallel-attention models. Even so, the two models explain these effects in different ways: In parallel-attention models, they are generated because the attention gradient usually includes word N + 2; in serial-attention models, they can be generated when the preceding words are recognized prior to fixating word N + 2.

The present results, taken together with the simulations, suggest that N + 2 preview effects are not likely to distinguish between the two models. More importantly, however, the present findings greatly constrain what is possible with regard to the parafoveal processing of word N + 2. First, the results do not favor models that cannot explain N + 2 preview effects. Second, they also constrain the size of the N + 2 effects that should be simulated. In fact, any model that simulates N + 2 effects greater than 15 ms would be at odds with the present findings.

The precise estimates of both the N + 1 and N + 2 preview effects are also critically important for evaluating simulations of computational models of reading. As an example, we will use our estimates of the N + 1 and N + 2 preview effects to evaluate the claims made by Schotter et al. (2014) about N + 2 effects. Because of the difficulties inherent in simulating the gaze-contingent boundary paradigm, Schotter et al. estimated the “preview time” available for parafoveal processing of words N + 1 and N + 2 in the E-Z Reader model, rather than directly simulating actual preview effects. We can then link these simulated preview times to our numerical estimates of the N + 1 and N + 2 preview effects. Specifically, the key result of Schotter et al.’s Simulation 2 was that word N + 2 would be previewed on only 20 % of trials, for a preview time of 69 ms (and for 0 ms on 80 % of trials). In our meta-analysis, we observed a mean N + 2 effect size of 5.4 ms in FFD. Following Schotter et al.’s simulation results, we can treat our estimate of 5.4 ms as a weighted average of a 0-ms effect on 80 % of the trials and an unknown, nonzero effect on 20 % of the trials. It follows that this unknown N + 2 preview effect on the trials in which N + 2 was actually previewed must be 5.4 × 1/0.20 = 27 ms.

We can do the same calculation for the N + 1 preview effect. Simulation 1 of Schotter et al. (2014) showed that word N + 1 is previewed on 89 % of trials and that, if it is previewed, the internal “preview time” in the E-Z Reader model is 98 ms. If we assume, following Schotter et al.’s simulation results, that our estimate for the N + 1 effect of 25.9 ms is a weighted average of the 0-ms effect on 11 % of the trials and an unknown, nonzero effect on 89 % of the trials, it follows that this unknown effect on the trials in which N + 1 was actually previewed must be 25.9 × 1/0.89 = 29.1 ms. This leads us to a surprising conclusion: For N + 1, every 1 ms of “preview time” corresponds to 0.297 ms of observed N + 1 preview effect in FFD, whereas, for N + 2, every 1 ms of “preview time” corresponds to 0.391 ms of observed N + 2 preview effect in FFD. Given that it is usually assumed that preprocessing of word N + 2 is less efficient due to visual acuity constraints, this result is quite unexpected and, at the very least, demands additional theoretical explanation. Without numerical estimates for the N + 1 and N + 2 effects, it would be impossible to detect this discrepancy between the theoretical claim of N + 2 preview occurring on only 20 % of trials and our empirically observed effect sizes. In this sense, the numerical values of both the N + 2 and N + 1 preview effects are critically important for evaluating computational models against empirical data.
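
To make these back-of-the-envelope calculations explicit, the short sketch below (in Python, purely for illustration; the function name is ours) reproduces the two conversions and the resulting ratios, using only the numbers cited above.

```python
# A minimal sketch of the back-calculations above: converting the observed
# (unconditional) preview effects into effects conditional on a preview having
# occurred, and relating them to the simulated "preview times" reported by
# Schotter et al. (2014). All numerical inputs are taken from the text above.

def conditional_effect(observed_ms, preview_probability):
    """Observed effect = p(preview) * conditional effect + (1 - p(preview)) * 0,
    so the conditional effect is the observed effect divided by p(preview)."""
    return observed_ms / preview_probability

# N + 1: observed FFD effect of 25.9 ms, previewed on 89 % of trials,
# simulated preview time of 98 ms
n1_conditional = conditional_effect(25.9, 0.89)      # ~29.1 ms
n1_per_preview_ms = n1_conditional / 98              # ~0.297 ms of effect per ms of preview

# N + 2: observed FFD effect of 5.4 ms, previewed on 20 % of trials,
# simulated preview time of 69 ms
n2_conditional = conditional_effect(5.4, 0.20)       # 27.0 ms
n2_per_preview_ms = n2_conditional / 69              # ~0.391 ms of effect per ms of preview

print(round(n1_conditional, 1), round(n1_per_preview_ms, 3))
print(round(n2_conditional, 1), round(n2_per_preview_ms, 3))
```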

Finally, the results of the N + 2 analysis are also very clear in one other respect: There are considerable differences in precision across studies (see Fig. 3). Although the robustness analysis showed that this had little effect on the present results, it illustrates very well the need for high-powered N + 2 studies in the future. Although meta-analysis is a good tool for summarizing evidence, it is no substitute for replication studies (Van Elk et al., 2015). We therefore recommend that high-powered, registered replications of N + 2 studies be undertaken in the future. Also, due to the lack of consistently reported data, it was not possible to examine whether the N + 2 results are influenced by skipping probability. This remains an issue to be explored in future research.

Does the type of parafoveal mask influence the size of the N + 1 preview effect?

The present data clearly demonstrate that the N + 1 preview effect is influenced by the type of parafoveal mask. On the surface, this conclusion is hardly surprising. However, it is of interest to examine two types of masks: ones that provide some information about the target word, and others that do not. If the parafoveal mask during invalid preview contains orthographic, phonological, or semantic information about the target word, this can potentially facilitate word recognition times. Previous studies have suggested that both orthographic and phonological overlap between the mask and the target word can reduce the N + 1 preview effect and thus reduce target-word processing times (see Schotter et al., 2012). Similarly, semantic overlap has also been shown to result in such a benefit in word processing times in German (Hohenstein & Kliegl, 2014; Hohenstein, Laubrock, & Kliegl, 2010), but so far has been observed only under certain conditions in English (cf. Rayner & Schotter, 2014; Rayner, Schotter, & Drieghe, 2014; Schotter, Lee, Reiderman, & Rayner, 2015). Interestingly, individual studies in Chinese have also reported numerically greater effect sizes for phonological than for semantic masks (e.g., Pan, Laubrock, & Yan, 2016, Exp. 2; Tsai, Kliegl, & Yan, 2012; Yan, Richter, Shu, & Kliegl, 2009), thus suggesting that semantic information in Chinese may be more readily accessible in parafoveal vision than is phonological information.

The approach taken in the present article was simply to quantify the amount of facilitation that semantic, phonological, and orthographic masks induce. A more fine-grained analysis of how this facilitation differs between alphabetical and Chinese languages was not possible, due to the small number of studies that used such masks. The findings were consistent with the previous literature in indicating that all three mask types led to facilitation relative to masks that contained no useful information about the target word. Moreover, orthographic masks led to the greatest facilitation and semantic masks to the least. However, it is also notable that the differences among the three mask types were very small.

A second question of perhaps greater importance is whether N + 1 preview effects also differ between invalid masks that contain no useful information about the target word. Until now, no systematic investigation of this question has been undertaken. The results from the present study suggest that the type of mask indeed influences the size of the N + 1 preview effect. The analysis indicated that the unrelated-word mask resulted in the smallest N + 1 effect sizes. All other masks (pseudowords, strings of letters, strings of Xs) resulted in greater effect sizes, which indicates that interference effects are associated with these masks.

A conceptual illustration of the results is presented in Fig. 7. Specifically, we assume that a “pure” preview benefit effect of unknown size exists, and that the different masks used as invalid previews can introduce interference effects that inflate the overall size of the observed effects. In contrast, parafoveal masks that contain some information about the target word lead to facilitation effects that speed up word processing. The present findings suggest that these interference effects are modest, amounting to no more than a few milliseconds for first-pass measures. However, because the present findings are exploratory in nature, it is important that this issue be investigated further experimentally.
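
Put schematically, this is merely our shorthand for the idea in Fig. 7, not a formal model: PB denotes the unknown “pure” preview benefit, and I(m) and F(m) the interference and facilitation associated with a given mask m.

```latex
\[
\text{Observed } N+1 \text{ effect}(m) =
\begin{cases}
  \mathrm{PB} + I(m), & \text{if } m \text{ carries no information about the target word,}\\
  \mathrm{PB} - F(m), & \text{if } m \text{ carries orthographic, phonological, or semantic information.}
\end{cases}
\]
```

Because PB is common to all conditions, the difference between any uninformative mask and the unrelated-word baseline, which produced the smallest observed effects, gives a rough estimate of that mask’s additional interference.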

Fig. 7 Illustration of the effects of different mask types on the N + 1 preview effect (based on all studies). Masks that contain no information about the target word can lead to interference, whereas masks that contain information about the target word can lead to facilitation

One possible explanation of the present results is that masks such as random letter strings and strings of Xs look more “unnatural” and may therefore be more likely to be noticed by participants (whether consciously or not). This would explain why unrelated-word masks resulted in the smallest N + 1 effect sizes, since these masks are actual words of the language. Indeed, there is some evidence that preview effects are larger for participants who notice display changes than for those who do not (Angele, Slattery, & Rayner, 2016; Slattery, Angele, & Rayner, 2011; White, Rayner, & Liversedge, 2005). The present results show that interference effects roughly increase as the parafoveal mask becomes less “word-like” (see Fig. 7). This is consistent with the finding of Angele et al. (2016) that participants are more sensitive to display changes after nonword-like previews than after word-like previews. Therefore, the interference effects observed in this analysis could be due to differences in display change sensitivity. However, this conclusion remains speculative, and more research will be needed to understand why such interference effects exist.

Currently, data about the processing costs associated with different parafoveal masks are very scarce. In one recent study, Hutzler et al. (2013) argued that parafoveal X-string masks interfere with word recognition. Interestingly, however, very few studies (six, in the present analysis) have used such masks. Nevertheless, the present results support Hutzler et al.’s (2013) claim, although it should be noted that this evidence is suggestive at best, given the different sample sizes. More recently, Marx et al. (2015) provided further evidence for the processing costs associated with such masks and argued in favor of a more neutral mask (visual degradation). Clearly, the jury is still out on which mask is best; this is an area that deserves more attention from researchers.

Finally, it should be noted that the present findings have a couple of limitations. First, because the different mask types were represented by different numbers of studies, this could have influenced the results. However, this is not a critical problem, because the Bayesian approach adopted in this article makes it possible to update the present results easily as more evidence becomes available. Second, all effect sizes in the present analysis carried the confound that the invalid preview conditions involve a display change, whereas the valid preview condition does not. This limitation, however, is not unique to the present analysis; it is present in much of the reading research that has utilized gaze-contingent display change techniques. Therefore, there is a need to better understand how display changes influence the underlying effects, particularly when different types of masks are used.

Are preview effects larger in Chinese studies?

There has been an ongoing effort to determine how much reading in Chinese differs from reading in alphabetical languages, and whether models of reading developed for alphabetical languages are also appropriate for Chinese. Given that parafoveal processing is a major component of most models of skilled reading, it is important to determine whether parafoveal preview effects differ between Chinese and alphabetical languages. Overall, the results of the meta-regression analyses showed that both the N + 1 and N + 2 preview effects are greater in Chinese studies when measured with GD. Although in some analyses a similar result was also found for FFD, the mean difference was not very large. Interestingly, the difference in the N + 1 preview effect was influenced by the type of mask. For the unrelated-word mask (the most common one), the effect sizes in Chinese studies were reliably higher. However, the opposite result was found for pseudoword masks. This discrepancy could have been due to the different sample sizes of the two analyses; a better estimate of the difference will be possible once more studies are available. It was also not possible to investigate differences in “late” measures such as TVT, because Chinese studies did not report descriptive statistics for this measure. This, too, will need to be addressed by future research.

The present findings are therefore consistent with the assumption that readers of Chinese make more efficient use of parafoveal processing. The intuitive explanation is that information is more densely packed in Chinese and is situated closer to the point of fixation. Indeed, because the majority of Chinese words consist of one or two characters (Li, Zang, Liversedge, & Pollatsek, 2015; Zang et al., 2016), word N + 1 may sometimes even fall within foveal vision. This should facilitate parafoveal processing, since there would be a smaller loss of visual acuity. However, it could also be speculated that some of this advantage in parafoveal processing is confounded by display change awareness. Indeed, Slattery et al.’s (2011) display change detection experiments showed that awareness of the change was greater when participants’ last fixation before the display change was closer to the invalid preview. Therefore, because the boundary will generally be closer to the point of fixation in Chinese, it would be interesting for future research to examine how display change awareness affects preview effects in Chinese.

Limitations

One potential criticism of the present findings is that they do not tell researchers much beyond what they already knew. However, this argument misses the point of the present analysis. Even though researchers may have their own intuitions (i.e., prior beliefs) about parafoveal preview effects, the present analysis makes it possible to update these beliefs in light of all the available evidence. Also, it could be argued that the N + 2 preview effects can be explained by misallocated fixations or by the limited reliability of the measures. Although this possibility cannot be excluded, such an argument assumes that these sources of error selectively affect N + 2 effects, but not other preview effects, which seems unlikely. Even so, this is a valid concern, and we argue that such sources of error will tend to average out in the long run as more data become available.

On being cautious when interpreting parafoveal preview effects

In the present article, we have purposefully avoided talking about preview benefit effects. The rationale is that accumulating evidence indicates that N + 1 preview effects are not just the result of being able to preprocess the target word, but also depend on the properties of the mask. The present results reinforce this notion. Although unrelated-word, pseudoword, random-letter, and X-string masks all provide no useful information about the target word, they result in different effect sizes. This difference was not trivial in some cases, and it is unlikely to be explained by sample-size differences or sampling variability alone, because differences between mask types were found even in analyses based on comparable sample sizes. Therefore, it is important that the differences between parafoveal masks be investigated further experimentally. Also, future studies should consider the type of mask used when interpreting their results.

The present findings reflect our best understanding of parafoveal preview effects in light of the available data. However, they are not the last word on this subject. The beauty of the Bayesian approach is that the present results can be updated in a natural and intuitive way as more evidence becomes available. This will further reduce the uncertainty and give us an even better understanding of what can be believed about parafoveal preview effects, given the data.
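
As an illustration of what such updating might look like, the sketch below assumes, purely for the sake of example, that the current meta-analytic estimate and a new study’s estimate can both be approximated by normal distributions; all numbers are hypothetical and the function is our own.

```python
# A minimal sketch of how an effect-size estimate could be updated as new data
# arrive, assuming (for illustration only) a conjugate normal-normal model:
# the current estimate serves as the prior and a new study supplies the likelihood.

def update_normal(prior_mean, prior_sd, study_mean, study_se):
    """Precision-weighted update of a normally distributed effect estimate."""
    prior_prec = 1 / prior_sd**2
    study_prec = 1 / study_se**2
    post_prec = prior_prec + study_prec
    post_mean = (prior_prec * prior_mean + study_prec * study_mean) / post_prec
    return post_mean, post_prec**-0.5

# Hypothetical example: treat an N + 1 FFD effect of ~26 ms (SD 2 ms) as the
# prior and update it with a new study reporting 20 ms (SE 5 ms).
posterior_mean, posterior_sd = update_normal(prior_mean=26, prior_sd=2,
                                             study_mean=20, study_se=5)
print(round(posterior_mean, 1), round(posterior_sd, 1))  # mean shifts toward the new data
```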

The present results also show that the true complexity of parafoveal preview effects may not always be captured by simply comparing statistical significance across studies. Even if no statistically significant result is found when comparing two parafoveal masks, this does not necessarily mean that the effect is null or that it has no practical importance (see Kirk, 1996). To better capture such complexity in the data, we recommend that researchers use Bayes factors as a supplementary analysis (Rouder et al., 2009; Rouder, Morey, Speckman, & Province, 2012; for a recent application, see Abbott & Staub, 2015). We did not use Bayes factors in the present study, since it was mostly concerned with the estimation of effect sizes. However, Bayes factors are very useful for hypothesis testing, as they quantify how much more likely the data are under the null hypothesis than under the alternative (or vice versa). In this way, it becomes possible to test whether the data support the null hypothesis that there is no difference between two parafoveal preview conditions, or the alternative hypothesis that such a difference exists.
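
For readers who wish to compute such a Bayes factor, the sketch below shows one way to implement a default (JZS) Bayes factor for a paired comparison along the lines of Rouder et al. (2009); in practice, an established statistical package would normally be used, and the example t value and sample size are hypothetical.

```python
import numpy as np
from scipy import integrate

def jzs_bf10(t, n, r=0.707):
    """Default (JZS) Bayes factor BF10 for a one-sample / paired t test,
    following Rouder et al. (2009), with a Cauchy(0, r) prior on effect size."""
    nu = n - 1
    # Likelihood of the observed t value under the null hypothesis (effect size = 0)
    m0 = (1 + t**2 / nu) ** (-(nu + 1) / 2)
    # Marginal likelihood under the alternative: integrate over g, which has an
    # inverse-gamma(1/2, r^2/2) prior (equivalent to the Cauchy prior on delta)
    def integrand(g):
        log_val = (-0.5 * np.log1p(n * g)
                   - (nu + 1) / 2 * np.log1p(t**2 / ((1 + n * g) * nu))
                   + np.log(r) - 0.5 * np.log(2 * np.pi)
                   - 1.5 * np.log(g) - r**2 / (2 * g))
        return np.exp(log_val)
    m1, _ = integrate.quad(integrand, 0, np.inf)
    return m1 / m0

# Hypothetical example: a nonsignificant mask comparison, t(29) = 0.8, n = 30 participants
print(round(jzs_bf10(t=0.8, n=30), 2))  # BF10 < 1 would indicate that the data favor the null
```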

Conclusion

Parafoveal processing during reading has been an active area of research for the past four decades, and it will likely remain so in the future. The present study is one of the first attempts at a statistical synthesis of previous findings. The results showed that there is a high probability that N + 2 effects are not centered on 0 and are modest in size. The N + 1 preview effect was perhaps smaller than has sometimes been assumed, and it was influenced by the type of parafoveal mask, both by masks that provide information about the target word and by those that do not. The type of language also affected the results, with Chinese studies generally showing larger effects. The present results are also relevant for computational models of eye-movement control during reading. Actual model simulations will be needed to determine whether existing models can accommodate the effects observed in this meta-analysis. It is hoped, however, that the present data will prove useful both for computational modeling and for planning new experiments.