Introduction

Behavior can be controlled by reward or punishment, and by the environmental stimuli that predict them. The way that animals develop representations of these predictive relationships has been described in terms of mathematical models of reinforcement learning, a restricted set of which have dominated experimental and theoretical attention. With the advent of new neurophysiological and imaging methods, insights from these models have advanced our understanding of the role of cortico-striato-thalamic networks, the midbrain, the amygdala, and the monoamine systems in behavioral adaptation. In particular, the activity of dopamine neurons in the mesostriatal pathway has been shown to conform to the predictions derived from formal learning rules (Waelti, Dickinson, & Schultz, 2001), and may also distinguish between particular instantiations of reinforcement learning models (Roesch, Calu, & Schoenbaum, 2007). Combined with imaging and neurophysiology, these models have helped us better understand the types of computations that take place in the reward system and the alterations observed in neurological and psychological disorders, including Parkinson’s disease (M. J. Frank, 2005), depression (Kumar et al., 2008), schizophrenia (Gradin et al., 2011), eating disorders (G. K. Frank, Reynolds, Shott, & O’Reilly, 2011), addiction (Chiu, Lohrenz, & Montague, 2008), and suicidal behavior (Dombrovski, Szanto, Clark, Reynolds, & Siegle, 2013). Here, we provide an introduction to the constructs of prediction error—the discrepancy between the expected and obtained outcomes—and expected value. We then offer a brief overview of the putative neural substrates of these computations and present a meta-analysis of functional imaging studies that have examined the neural correlates of the prediction error and expected value constructs derived from reinforcement learning models.

The Rescorla–Wagner model of Pavlovian conditioning

Building on the earlier Bush–Mosteller model (Bush & Mosteller, 1951, 1953), Rescorla and Wagner (RW) developed their influential model of Pavlovian conditioning (Rescorla & Wagner, 1972). The RW model provides an account of animal learning from multiple conditioned stimuli (CSs). One challenge here is posed by the interactions between stimuli—such as the Kamin blocking effect, or diminished conditioned responding to stimulus X following AX → unconditioned stimulus (US) pairing preceded by A → US (Kamin, 1968). The dependent variable in the RW model is the unobserved, but theoretically plausible associative strength (V) of the CS–US pairing. Associative strength is conceptually close to the expected reward value of a given stimulus (at least when a single appetitive US is presented). Another innovation, which has enabled an elegant explanation of the Kamin blocking effect, was to combine the associative strength of all stimuli present on a given trial, in order to generate a prediction error (PE). In other words, according to RW, an outcome is surprising only to the extent that it is not predicted by any of the stimuli. Here is how the model describes the change in the associative strengths of the two stimuli after a trial in which the stimulus compound AX is followed by a US:

$$ \begin{aligned} \Delta V_{\mathrm{A}} &= \alpha_{\mathrm{A}}\beta_{\mathrm{US}}\left(\lambda_{\mathrm{US}}-V_{\mathrm{AX}}\right),\\ \Delta V_{\mathrm{X}} &= \alpha_{\mathrm{X}}\beta_{\mathrm{US}}\left(\lambda_{\mathrm{US}}-V_{\mathrm{AX}}\right), \end{aligned} $$
(1)

where α is the learning rate for each stimulus, β is the learning rate for the US, λ_US is the asymptote of associative strength that the US will support, and V_AX = V_A + V_X. Thus, if stimulus A is pretrained to the asymptote, subsequent training with the AX compound generates no PE for X. Besides blocking and overshadowing, the RW model has successfully accounted for a variety of Pavlovian and instrumental phenomena, despite a number of limitations (see Miller, Barnet, & Grahame, 1995).
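For illustration, Eq. 1 can be simulated directly. The following minimal sketch (in Python; the parameter values are arbitrary illustrative choices, not estimates from any study) reproduces the blocking effect: once stimulus A has been trained to asymptote, the compound AX generates almost no PE, so X acquires almost no associative strength.

```python
def rw_update(V, present, alpha, beta, lam):
    """One Rescorla-Wagner trial: all presented stimuli share a common PE
    computed from their summed associative strength (Eq. 1)."""
    pe = lam - sum(V[s] for s in present)   # PE based on V_AX = V_A + V_X
    for s in present:
        V[s] += alpha[s] * beta * pe
    return pe

V = {"A": 0.0, "X": 0.0}                    # associative strengths
alpha = {"A": 0.3, "X": 0.3}                # stimulus saliences (arbitrary)
beta, lam = 1.0, 1.0                        # US learning rate and asymptote

for _ in range(50):                         # Phase 1: A -> US, trained to asymptote
    rw_update(V, ["A"], alpha, beta, lam)
for _ in range(50):                         # Phase 2: AX -> US
    rw_update(V, ["A", "X"], alpha, beta, lam)

print(V)  # V_A is near 1.0, V_X stays near 0: the Kamin blocking effect
```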

Temporal difference models

Temporal difference (TD) models of animal learning, like RW, learn from PEs (Sutton & Barto, 1998), and describe an approach to the twin problems of prediction and optimal control. TD aims to predict the sum of all future rewards, discounted over time:

$$ R(t)=r\left(t+1\right)+\gamma r\left(t+2\right)+{\gamma}^2r\left(t+3\right)+\dots +{\gamma}^kr\left(t+k+1\right), $$
(2)

where r is future reward and γ is the temporal discount factor, reflecting a preference for immediate over delayed rewards. Instead of waiting until all of the outcomes are experienced, TD estimates future rewards by repeating the following algorithm in each learning episode (time step):

$$ V(t)\leftarrow V(t)+\alpha \left[r\left(t+1\right)+\gamma V\left(t+1\right)-V(t)\right], $$
(3)

where r(t + 1) + γV(t + 1) – V(t) is the prediction or temporal difference error, α is the learning rate, and γV(t + 1) takes the place of the remaining terms γr(t + 2) + γ^2 r(t + 3) + . . . + γ^k r(t + k + 1).

To deal with the temporal distribution of predictive cues or response options, TD methods introduce the idea of eligibility traces. That is, only closely preceding (eligible) cues or actions are credited for reward or blamed for punishment.
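Both ideas can be illustrated with a minimal tabular TD(0) sketch (in Python; the trial structure and parameter values are illustrative assumptions, and eligibility traces are omitted for simplicity). A trial consists of discrete time steps with a CS at the first step and a US at the last; applying Eq. 3 at every step moves the PE backward from the US to the CS over trials:

```python
# Tabular TD(0): a trial has T states, with the CS at state 0 and the
# US (reward) at state T-1. Parameter values are arbitrary.
T, alpha, gamma = 5, 0.2, 0.95
V = [0.0] * T                                  # value of each within-trial state
reward = [0.0] * (T - 1) + [1.0]               # reward delivered only at the US

for trial in range(300):
    # CS onset is itself unpredicted (pre-CS value is 0), so the PE at the
    # CS is the jump in predicted value when the CS appears:
    cs_pe = gamma * V[0]
    for t in range(T):
        v_next = V[t + 1] if t + 1 < T else 0.0    # no state follows the US
        delta = reward[t] + gamma * v_next - V[t]  # TD error, as in Eq. 3
        V[t] += alpha * delta
    us_pe = delta                                  # PE at the US (last step)
    if trial in (0, 299):
        print(f"trial {trial}: PE at CS = {cs_pe:.2f}, PE at US = {us_pe:.2f}")
# trial 0:   PE at CS ~ 0.00, PE at US ~ 1.00
# trial 299: PE at CS ~ 0.77, PE at US ~ 0.00 (the PE has transferred to the CS)
```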

TD provides a real-time account of learning that RW and other trial-level models do not. A key area of divergence between RW and TD is that TD treats rewards themselves and the cues that predict them as, in principle, equivalent, insofar as they are both stimuli that can invoke changes in the valuation of future rewards. Both conditioned cues and outcomes can influence value prediction and can elicit PEs. This innovation provides an effective account of the learning of sequences of stimuli, since conditioned cues can come to operate as reinforcers in their own right (Dayan & Walton, 2012). Moreover, the reinforcement value is collapsed into a single, common currency across different reinforcers. On the other hand, RW is a model that describes the extent to which the US (e.g., reward or punishment) can be predicted by environmental stimuli. Thus the major focus of RW is the processing of the US, PEs occur only at the US, and all conditioned cues are treated as distinct entities competing to predict the US (Rescorla & Wagner, 1972). At the same time, one can see the parallel between the summed associative strengths of all presented CSs in RW and value in TD.

These differences between trial-level models such as RW and TD lead to differential predictions regarding the putative neural learning signals, as is illustrated in Fig. 1. A trial-level model aligns its associative strength (or expected value) signal with the CS, and PE with the US. One can see that, when the signals from a trial-level model such as RW are aligned with stimuli in real time, the time course of TD error approximates the combination of associative strength at the CS and PE at the US. On the other hand, in trial-by-trial functional magnetic resonance imaging (fMRI) learning experiments with short and, especially, fixed CS–US intervals, the predicted blood oxygenation level dependent (BOLD) signal corresponding to the associative strength or value generated by trial-level models will often approximate that of TD.
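The collinearity induced by short, fixed intervals can be demonstrated directly. The sketch below (Python with NumPy; the double-gamma HRF shape, the 2-s CS-US interval, and the trial timing are simplified assumptions, not any study's actual pipeline) convolves a trial-level value regressor (impulses at the CS) and a PE regressor (impulses at the US) with a canonical-like hemodynamic response function:

```python
import numpy as np
from math import gamma as g

dt = 0.1                                       # seconds per sample
t = np.arange(0, 30, dt)
# Simplified double-gamma HRF (canonical-like; shape parameters are assumptions)
hrf = t**5 * np.exp(-t) / g(6) - (1 / 6) * t**15 * np.exp(-t) / g(16)

n_trials, trial_len = 6, int(10 / dt)          # one 10-s trial after another
n = n_trials * trial_len
value, pe = np.zeros(n), np.zeros(n)
cs = np.arange(n_trials) * trial_len           # CS onsets
us = cs + int(2 / dt)                          # US onsets, fixed 2-s CS-US interval
value[cs] = 1.0                                # trial-level value signal at the CS
pe[us] = 1.0                                   # trial-level PE signal at the US

bold_value = np.convolve(value, hrf)[:n]
bold_pe = np.convolve(pe, hrf)[:n]
print(np.corrcoef(bold_value, bold_pe)[0, 1])  # substantially correlated regressors
```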

Fig. 1

The temporal difference (TD) model describes a real-time course of reward prediction error (PE) signals; PEs transfer from the unconditioned stimulus (US) to the conditioned stimulus (CS) as learning progresses. In contrast, trial-level models such as Rescorla–Wagner describe PE only at the US, whereas associative strength (conceptually close to value) signals build at the CS. It is easy to see the resemblance between the TD error signal and the combination of PE and associative strength signals in trial-level models. *Before the asymptote is reached. At asymptote, PE at the US disappears

Neural correlates of prediction errors: model-based neuroimaging and electrophysiology

Prediction-error-based learning models have also enabled neuroscientists to interpret neural signals, most prominently from midbrain dopaminergic neurons (Schultz, Dayan, & Montague, 1997). The firing rates of these midbrain dopaminergic neurons are consistent with the predictions of RW: A blocking experiment revealed that firing rates reflect the contingency between a stimulus and a reward, rather than the mere pairing of the two (Waelti et al., 2001). Moreover, specific predictions of the TD model were also corroborated in these neurons: Most notably, their firing gradually becomes coupled to predictive stimuli rather than to the rewards themselves (Schultz et al., 1997). In addition, a study of conditioned inhibition revealed that an inhibitory cue, predictive of reward omission, could reduce the firing rates of subpopulations of these neurons (Tobler, Dickinson, & Schultz, 2003).

A natural development of this work was to apply the same behavioral paradigms and reasoning to human neurophysiological research. Although event-related potential and magnetoencephalographic research has attempted to address analogous questions (Holroyd & Coles, 2008; Krigolson, Hassall, & Handy, 2014), the relatively limited capability of these methods to register unambiguous physiological responses from subcortical or brainstem regions has meant that the majority of progress must depend on fMRI. Since one of the seminal studies of this field (O’Doherty, Dayan, Friston, Critchley, & Dolan, 2003), the primary focus of fMRI studies has generally been the ventral striatum, rather than the midbrain itself. A typical explanation (see, e.g., Roesch, Calu, Esber, & Schoenbaum, 2010; Tobler, O’Doherty, Dolan, & Schultz, 2006) is that the fMRI response reflects the phasic input to a structure (Logothetis & Pfeuffer, 2004), rather than the local processing or the region’s output. Thus, given that the dopaminergic neurons of the ventral tegmental area (VTA) project to the areas of the striatum (Haber, Fudge, & McFarland, 2000), fMRI-measured ventral striatal activation might then be seen as the downstream consequence of VTA firing. This perspective has found considerable support in the literature, although there are two areas of possible complication. First, there is evidence of prediction-error-related activation in the VTA itself (e.g., D’Ardenne, McClure, Nystrom, & Cohen, 2008), implying that local processing may also be relevant. Second, the ventral striatum also receives input from a wide range of cortical and subcortical regions (Voorn, Vanderschuren, Groenewegen, Robbins, & Pennartz, 2004), any of which could influence its activity and information processing within it. A further advantage of fMRI is that, although focused analysis of PE responses in the VTA and ventral striatum has been performed with this technique (D’Ardenne et al., 2008), its capability to identify signal across the entire brain has allowed for an examination of related signals in other parts of the cortex. Integration and analysis of the rich data sets obtained using fMRI methods are the focus of the present work.

Learned value, economic subjective value, and their neural correlates

In economics, subjective value or utility is the theoretical common currency used to compare disparate goods. Economic commodities can be thought of as reinforcers, and labor or a price paid as analogues of effort during operant conditioning (Lea, 1978). Although economic decision-making has traditionally been studied using stylized description-based prospects, recent research has suggested that experience-based experiments resembling animal-learning paradigms provide complementary models of real-life economic decision-making (Hertwig & Erev, 2009). Thus, to the degree that economic preferences incorporate one’s reinforcement history, one may hypothesize that revealed preferences and feedback-based animal learning depend on similar neural computations (Fellows, 2011). One of the motivations for the present analysis was to examine whether the cortical regions tracking learned reward value coincide with the medial prefrontal regions that have been shown to signal economic subjective value on revealed preference tasks (Peters & Buchel, 2010).

In addition, animal electrophysiological studies have shown responses that accord well with what might be expected of learned-value signals in regions including the ventral prefrontal cortex (vPFC) and limbic areas such as the cingulate, and the striatum (Samejima, Ueda, Doya, & Kimura, 2005; Simmons, Ravel, Shidara, & Richmond, 2007; Wallis & Miller, 2003). Here, the vPFC refers to the orbitofrontal cortex (OFC), the ventromedial prefrontal cortex (vmPFC), and more lateral regions of the ventral prefrontal cortex. The vmPFC denotes the mammalian paralimbic agranular/dysgranular prefrontal cortex, encompassing monkey areas 14, 25, and rostral 24 and 32 of Petrides and Pandya (1994), and human areas 25 and rostral 32 and 24; the orbital aspect of this region is also referred to as the medial orbitofrontal cortex (mOFC). Associative signals represented in the vPFC possess many properties of abstract value, in that they are sensitive to delays and probability of reward, as well as to the presence of alternatives (Kennerley, Dahmubed, Lara, & Wallis, 2009; Kennerley & Wallis, 2009b; Kobayashi, Pinto de Carvalho, & Schultz, 2010; Padoa-Schioppa & Assad, 2008; Roesch & Olson, 2005; Tremblay & Schultz, 1999). These signals are “subjective,” integrating such internal states as hunger (Bouret & Richmond, 2010; Critchley & Rolls, 1996). Other decision-related signals have been found in motor prefrontal and parietal cortex (Platt & Glimcher, 1999). However, it appears that these signals may reflect salience (Leathers & Olson, 2012) or motivation (Roesch & Olson, 2004), rather than value.

The present meta-analysis

The present work provides a quantitative summary of fMRI evidence on PE and expected value representations in the human brain using an activation likelihood estimation (ALE) meta-analysis. It extends recent meta-analyses of value and PE signals (Bartra, McGuire, & Kable, 2013; Clithero & Rangel, 2014; Garrison, Erdeniz, & Done, 2013; Levy & Glimcher, 2012) in two ways. First, to control methodological heterogeneity, our analysis included only studies that have used delta-rule reinforcement learning models. This enabled a better-controlled evaluation of the consequences of variations in methodology. We could thus identify the core networks that are most reliably detected. Second, to reveal the distributed networks that subserve human reward learning, we jointly mapped the regions responsive to value and PE. On the basis of the animal and human literature reviewed above, we hypothesized that PE signals would be observed in the striatum (including putamen, caudate, and nucleus accumbens) and midbrain. In contrast, we hypothesized that expected value signals would be represented in the vmPFC.

In contrast to previous meta-analyses (Bartra et al., 2013; Garrison et al., 2013; Levy & Glimcher, 2012), we focused only on studies in which signals derived from a reinforcement learning algorithm served as explanatory variables in the analysis of fMRI data. This allowed us to examine whether differences in approaches to generating such signals could yield different neural maps. We also examined other methodological variables that could have an impact on the observed coordinate maps derived from reward prediction error (RPE) experiments. Our variables of theoretical interest included instrumental or Pavlovian designs and reinforcer type (monetary, liquid, or social). Accounting for the effects of these variables would demonstrate the degree to which the RPE maps are dependent on choices of experimental parameters. To this end, we had several secondary hypotheses.

  1.

    Pavlovian versus instrumental paradigms: Prior studies had suggested differential roles for striatal subregions in Pavlovian versus instrumental tasks. Pavlovian RPEs recruit the ventral striatum, whereas RPEs from instrumental tasks (most of which include a Pavlovian component) appear to recruit both ventral and dorsal striatum (O’Doherty et al., 2004).

  2.

    Fixed/individual learning: All models evaluated in the present work include a parameter that controls the rate at which conditioning occurs. There are three main strategies for determining the learning rate, all of which are evaluated in a study by Cohen (2007). He compared the neural correlates of the parameters generated by individual fits of each participant’s responses (“individual”) with the correlates of either the group means of such parameters (“group fixed”) or an arbitrary fixed estimate of the group response (“fixed”). Despite somewhat different patterns of activation, the methods were broadly consistent in indexing similar limbic and prefrontal regions of interest. In general, individually fitted parameters can arguably better accommodate the subject’s behavior (Estes & Maddox, 2005), and thus may provide a more optimal fit of the underlying neural signals. Yet noisy, stochastic behavior, or directed exploration, may deleteriously affect the reliability of estimated parameters. Group fitting (“group fixed”) of parameters provides a form of regularization (Daw, 2011), leading to a more conservative parameterization that is potentially less susceptible to such misspecification. It may also be well suited to studies of patient groups (e.g., Bernacer et al., 2013). We tested whether each approach biased the discovery of particular brain regions. Alternatively, either approach could simply be a more accurate way of characterizing the neural correlates of individual acquisition curves, and thus be associated with similar, if more finely resolved, patterns of activation. (A sketch of individual parameter fitting follows this list.)

  3.

    US-aligned outcome PE versus CS- and US-aligned TD error: As we noted above, the time course of TD error differs from that of the outcome PE generated by trial-level models. It has been suggested that TD error may be exclusively represented in the ventral striatum, whereas outcome PE is signaled by a larger network including the caudate (Niv, Edlund, Dayan, & O’Doherty, 2012). Moreover, exclusively outcome-coupled PE regressors may be more susceptible to ongoing activation coupled to the outcome, distinct from PE itself, such as the appetitive response to a rewarding outcome (Rohe, Weber, & Fliessbach, 2012). We contrasted TD and outcome PE studies, expecting to see more extensive activation to outcome PE and also anticipating that a conjunction analysis would reveal the ventral striatum as the site of overlap between these studies.

  4.

    Reward type: Previous meta-analyses have examined patterns of activation in response to various primary and secondary rewards (Sescousse, Caldu, Segura, & Dreher, 2013). However, any differences and commonalities may have been driven by sensory properties of the rewarding stimuli. By contrast, our focus on model-estimated PEs allowed us to examine the spatial segregation or dissociation of more abstract neural computations triggered by disparate rewards. On the basis of the animal studies reviewed above, we hypothesized that the ventral striatum would be the shared area of activation for all types of rewards.

  5.

    Smoothing: A variable without theoretical interest that might affect the pattern of data was the smoothing kernel employed by the study. Recently, Sacchet and Knutson (2013) have shown that the application of large smoothing kernels can bias the localization of ventral striatal responses to reward anticipation. In addition, it is not easy to detect BOLD activations in subcortical, and especially brainstem, nuclei because of their small size: only 60 mm3 for the nucleus of the VTA, for example (Paxinos & Huang, 1995). Yet, when preprocessing whole-brain fMRI images, researchers often use spatial filters exceeding the size of potential signal sources in these nuclei. The matched filter principle suggests that such large filters are likely to reduce the signal-to-noise ratio (SNR) in these structures. We tested whether this size mismatch affected the detection of PE signal sources in the basal ganglia and midbrain. We contrasted studies that used smaller (<8-mm) filters with those that used larger filters.
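To make the fixed/individual distinction in hypothesis 2 concrete, the sketch below (Python with NumPy and SciPy; the two-armed task, the placeholder data, and the parameter bounds are illustrative assumptions) shows what “individual” fitting typically involves: choosing the learning rate (and here, a softmax temperature) that maximizes the likelihood of one participant’s choices under a delta-rule model. A “fixed” analysis would instead apply a single learning rate, such as a group-level estimate or a heuristic value near 0.2, to every participant.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(params, choices, rewards):
    """Negative log-likelihood of one participant's choices under a
    two-armed delta-rule model with a softmax choice rule."""
    alpha, beta = params
    Q = np.zeros(2)                            # action values
    nll = 0.0
    for c, r in zip(choices, rewards):
        p = np.exp(beta * Q) / np.sum(np.exp(beta * Q))
        nll -= np.log(p[c] + 1e-12)
        Q[c] += alpha * (r - Q[c])             # delta-rule update
    return nll

# Placeholder data; in practice these are one participant's trials.
rng = np.random.default_rng(0)
choices = rng.integers(0, 2, 100)
rewards = rng.binomial(1, np.where(choices == 0, 0.7, 0.3))

fit = minimize(neg_log_lik, x0=[0.2, 2.0], args=(choices, rewards),
               bounds=[(0.001, 1.0), (0.01, 20.0)])
alpha_hat, beta_hat = fit.x                    # individually estimated parameters
print(alpha_hat, beta_hat)
```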

Method

Study selection criteria and definitions

Studies were selected by searching PubMed and Google Scholar to identify fMRI studies that employed computational algorithms to investigate the neural correlates of reinforcement learning. Combinations of keywords were used: [“reinforcement learning” OR “reward learning”], [“prediction error” OR “expected value”], and [“rescorla-wagner” OR “temporal-difference” OR “Q-learning”]. We also identified studies using reference tracing and citations within reviews. The search yielded 40 studies. Each article was reviewed by at least two authors to make sure that it fulfilled the following criteria:

  1.

    Only studies that used a reinforcement learning model (i.e., trial-level delta-rule model, TD, or back-propagating connectionist model) to create regressors for a general linear model (GLM) analysis of BOLD signal were included. The common feature of these studies was a PE-based learning rule.

  2.

    Our PE analyses used maps that revealed a positive coupling with appetitive “signed” RPEs, which are positive when the reward is higher than expected or negative when it is lower than expected. Maps reporting aversive PEs were excluded, since their number was insufficient for an ALE analysis. Similarly, negative correlations with RPE or expected value (EV) regressors were also not analyzed, since these are not systematically reported.

  3.

    EV was defined as the extent to which stimuli or actions were predictive of reward.

  4.

    Studies that had used modified delta-rule algorithms were included as long as they involved no additional equations or components that would fundamentally change the representational structure (e.g., an upper layer in a hierarchical model).

  5.

    Studies in which a reinforcement learning model of the sort described above was refuted or outperformed by a model from a different class (e.g., by a hidden-Markov model, Kalman filter, hierarchical Bayesian model, or hybrid models with separate representational systems) were excluded, to avoid the inclusion of maps derived from potentially disadvantaged models.

  6.

    Only studies reporting whole-brain results were included. For studies reporting only region-of-interest or otherwise restricted analyses, we contacted the authors to obtain whole-brain coordinates and included the study if the data were received.

  7.

    We included only studies of nonclinical adult populations, excluding rare genotypes, subclinical psychopathology, and placebo-treated participants.

In total, we included in our ALE analyses 38 studies reporting RPE maps and 16 studies reporting EV maps, with 751 and 337 participants, respectively. Of the EV studies, two did not contribute RPE maps. The details of all included studies are listed in Tables 1, 2 and 3, and proportions of different study designs are displayed in Fig. 2.

Table 1 Studies reporting reward prediction error (PE) maps, including details about sample size (n) and number of foci, learning rule (US = unconditioned stimulus, TD error = temporal difference error), Pavlovian/instrumental design, learning rate parameter estimation (Fixed = fixed at group level, Individual = individually estimated per participant), and reinforcer type
Table 2 Studies reporting expected value (EV) maps
Table 3 Overall numbers of participants and foci contributing to each of the contrasts investigated
Fig. 2

Pie charts showing the percentages of studies in each condition that were included in producing the activation likelihood estimation (ALE) maps for reward prediction error

Subgroup analyses

Various subgroup analyses investigated heterogeneity across our studies. We classified the studies into the following categories:

  • Pavlovian/instrumental: In “instrumental” paradigms, outcome is contingent on a behavioral response (choice). In “Pavlovian” paradigms, outcome is not contingent on choice, although a response may be made—for example, in order to signal outcome probability.

  • Fixed/individual: A “fixed” learning rate is assumed to be equivalent for all participants within the cohort. The learning rate may be estimated at the group level (e.g., Bernacer et al., 2013) or by taking a reasonable heuristic (often around 0.2; e.g., Kumar et al., 2008). Alternatively, “individual” learning rates are estimated separately for each participant, and the PE and EV signals for each participant reflect the individually estimated learning rate.

  • Outcome PE/TD: Although a wide variety of algorithms were used, we made a broad distinction between RW-like trial-level models and TD-like algorithms. Put simply, trial-level models have a single update mechanism at the time of the outcome that forms the basis of the RPE, whereas RPEs are computed at both the stimulus/action and outcome phases of the task in TD algorithms.

  • Monetary/liquid/cognitive/social: “Monetary” and “liquid” paradigms involved the respective reinforcers; “cognitive” paradigms employed cognitive reinforcement, such as numerical or symbolic feedback; and “social” paradigms involved smiles, frowns, fearful, or beautiful faces as reinforcement.

  • High/low smoothing: “High” studies employed a smoothing kernel of 8 mm or more; “low” studies employed a smoothing kernel of 7 mm or less.

Where there was a choice of maps to use from a given study that fulfilled our criteria, we selected the one in which the GLM regressor was estimated on the basis of the largest number of trials. For example, we included the overall social and monetary RPE maps reported in the study of Fareri, Chang, and Delgado (2012) for the main RPE analysis, but the social RPE map only for all of the subgrouping analyses. Other arbitrary choices included the decision to include the liquid reinforcement map in Metereau and Dreher (2013), due to the relatively low number of these studies. Finally, where slightly different models were fitted to the data, the better-fitting or otherwise preferred model was selected.

Activation likelihood estimation

Our statistical analysis of the studies was conducted using the revised activation likelihood estimation (ALE) algorithm (Eickhoff, Bzdok, Laird, Kurth, & Fox, 2012) for coordinate-based analyses (Turkeltaub, Eden, Jones, & Zeffiro, 2002). The method generates meta-analytic maps of consistent brain activation locations from the coordinates derived from neuroimaging studies with similar experimental conditions. The method provides an estimate of the convergence of foci across activation maps, and determines the significance of these estimates via an empirically derived null distribution (Eickhoff et al., 2012). The null hypothesis is that the foci are distributed randomly across the brain, and the test statistic supports a random-effects inference, that the modeled activation maps reflect an above-chance convergence across studies (Eickhoff et al., 2012; Turkeltaub et al., 2012). A detailed description of the ALE technique can be found elsewhere (Eickhoff et al., 2012; Turkeltaub et al., 2012). In short, the activation foci reported for a given experiment are treated as centers of a 3-D Gaussian probability distribution, the width of which is empirically derived and reflects an estimate of the spatial uncertainty of the foci of a given map and the sample size of each experiment (Eickhoff et al., 2009). On the basis of the ICBM tissue probability maps, each focus is given a probability value of how likely the activation is to be located at exactly that position. One modeled activation map is then created for each experiment by merging the probability distributions of all activation foci. If more than one focus from a single experiment is jointly influencing the modeled activation map, then the maximum probability associated with any one focus reported by the given experiment is used. ALE scores are then calculated by taking the union of these individual modeled activation maps, and these scores reflect the voxel-wise convergence of activations across experiments. The p values of the ALE scores are determined with reference to the null distribution. The resulting nonparametric p values were transformed into z scores and thresholded at a cluster-level family-wise error rate-corrected threshold of p < .05 (cluster-forming threshold at voxel-level p < .001).
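As a rough illustration of this computation, the toy sketch below (Python with NumPy; the grid size, kernel width, and foci are invented, and the permutation-based significance testing is omitted) builds one modeled activation (MA) map per experiment and combines them into an ALE map:

```python
import numpy as np

def modeled_activation(foci, shape, sigma):
    """Modeled activation (MA) map for one experiment: a 3-D Gaussian
    probability blob at each focus; nearby foci contribute through the
    voxel-wise maximum, as in the revised ALE algorithm."""
    grid = np.indices(shape)                   # voxel coordinates, 3 x X x Y x Z
    peak = (2 * np.pi * sigma ** 2) ** -1.5    # peak density of a 3-D Gaussian
    ma = np.zeros(shape)
    for focus in foci:
        d2 = sum((g - f) ** 2 for g, f in zip(grid, focus))
        ma = np.maximum(ma, peak * np.exp(-d2 / (2 * sigma ** 2)))
    return ma

# Toy data: three "experiments" reporting foci on a 20^3 voxel grid.
shape, sigma = (20, 20, 20), 2.0               # the real width reflects sample size
experiments = [[(10, 10, 10), (3, 4, 5)],
               [(11, 9, 10)],
               [(10, 11, 11), (15, 15, 2)]]
ma_maps = [modeled_activation(f, shape, sigma) for f in experiments]

# ALE score = union of the per-experiment MA probabilities at each voxel.
ale = 1.0 - np.prod([1.0 - m for m in ma_maps], axis=0)
print(ale[10, 10, 10] > ale[15, 15, 2])        # convergent voxel beats isolated focus
```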

Comparison of the different subgroups was performed by subtracting the voxel-wise modeled activation maps from one another and comparing the result to an empirically derived null distribution of ALE-difference scores (10,000 permutations). To this end, ALE analyses were performed separately on the experiments in each condition, and the voxel-wise differences were computed between the ensuing ALE maps. All experiments contributing to either analysis were then pooled and randomly divided into two groups of the same sizes as the two original sets of experiments (Eickhoff et al., 2011). The ALE scores for these two randomly assembled groups were calculated, and the difference between these ALE scores was recorded for each voxel in the brain. Repeating this process 10,000 times yielded a null distribution of differences in ALE scores between the two subgroups. The “true” difference in ALE scores was then tested against this null distribution, yielding a posterior probability that the true difference was not due to random relabeling, based on the proportion of permuted differences that were smaller than the observed difference. The resulting probability values were then thresholded at p > .95 (i.e., a 95 % chance of a true difference) and a cluster size (k) of 20.
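Continuing the toy example above, the subtraction and permutation procedure can be sketched as follows (the subgroup assignment and the number of permutations are placeholders; the actual analysis used 10,000 permutations over the real sets of experiments):

```python
def ale_map(maps):
    """Union of modeled activation maps into an ALE map."""
    return 1.0 - np.prod([1.0 - m for m in maps], axis=0)

group_a, group_b = ma_maps[:2], ma_maps[2:]    # the two subgroups of experiments
true_diff = ale_map(group_a) - ale_map(group_b)

rng = np.random.default_rng(0)
pooled = group_a + group_b
count = np.zeros_like(true_diff)
n_perm = 1000
for _ in range(n_perm):                        # random relabeling of experiments
    idx = rng.permutation(len(pooled))
    perm_a = [pooled[i] for i in idx[:len(group_a)]]
    perm_b = [pooled[i] for i in idx[len(group_a):]]
    count += (ale_map(perm_a) - ale_map(perm_b)) < true_diff
posterior = count / n_perm                     # proportion of smaller permuted diffs
# Voxels with posterior > .95 would be retained, subject to the cluster
# extent threshold (k = 20) used in the present analyses.
```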

Results

Reward prediction error

The activations revealed by the main categories were largely in line with our hypotheses (Table 4, Figs. 3 and 4). The ALE meta-analysis of the RPE maps revealed clusters encompassing bilateral ventral striatum, bilateral amygdala, midbrain, thalamus, frontal operculum, and insula. The largest clusters were seen in the ventral striatum: one activation cluster in each hemisphere that extended from the ventromedial caudate (nucleus accumbens) to the lateral putamen and amygdala (predominantly the superficial subregion). The left frontal operculum cluster impinged on both the pars orbitalis of the inferior frontal gyrus and the anterior insula. RPE-related activation was also observed in the left visual cortex, predominantly located in V3 and V4.

Table 4 ALE clusters representing reward prediction errors, including peak t statistics, Montreal Neurological Institute (MNI) coordinates, and cluster sizes
Fig. 3

Map of significant ALE clusters associated with the reward prediction error contrast, with activations in the striatum circled. Pie charts show the contributions of the studies of a particular class to the bilateral striatum activation. Percentages are not corrected for base rate

Fig. 4

Map of significant ALE clusters associated with the reward prediction error contrast, with activations in the midbrain and frontal operculum circled. Pie charts show the contributions of the studies of a particular class to each activation. Percentages are not corrected for base rate

RPE: subgroup analysis

We performed a number of analyses focused on different subcategories of the RPE studies, in order to identify the distinct activations associated with different designs. First, in order to interpret these contrasts appropriately, we examined the extent to which the different categories of experimental designs were statistically independent.

Confounding

Fisher’s exact tests between the subcategories assessed the contingencies between design factors. There was a highly significant association between reinforcer type and Pavlovian/instrumental design (exact test = 14.67, p < .001). Monetary reinforcers were more common in instrumental studies, and liquid reinforcers were more common in Pavlovian studies. Three other relationships showed trend-level associations (ps between .061 and .088): fixed/individual versus Pavlovian/instrumental, outcome PE/TD error versus reinforcer type, and outcome PE/TD error versus Pavlovian/instrumental.

This confounding between Pavlovian designs, liquid reinforcers, and TD modeling proved relevant, because the activations associated with Pavlovian designs were mostly collected from studies employing liquid reinforcement and also included a high contribution from TD studies. There were relatively few TD studies, but these employed either monetary or liquid reinforcers, and about half were Pavlovian designs. In general, given the small number of such studies (Pavlovian/TD/liquid) and the potential for confounding, the findings from these maps should be interpreted cautiously.

Both the individual-related striatal and the fixed-related midbrain activations were predominantly collected from instrumental rather than Pavlovian studies, as would be expected from the higher proportion of instrumental studies. The striatal activations associated with individual studies were elicited half by monetary and half by other reinforcers, whereas the midbrain activation associated with fixed studies was also represented by studies employing a variety of different reinforcers.

Pavlovian versus instrumental (Table 5)

Table 5 ALE clusters representing instrumental (Instr) and Pavlovian (Pav) activations, including peak t statistics, MNI coordinates, and cluster sizes

The instrumental RPE map was similar to the overall RPE map, aside from the lack of midbrain activation. Striatal activations were slightly more medial than the overall RPE cluster and did not extend as convincingly into the lateral striatum (putamen), nor farther into the amygdala. In addition, the left caudate was activated in this contrast. By contrast, the Pavlovian studies yielded two clusters in the left putamen/amygdala and right amygdala. The amygdala activations were predominantly located in the superficial subregion.

Bilateral amygdala and left lateral putamen were significantly more likely to be activated in Pavlovian than in instrumental paradigms. The reverse contrast yielded a significant cluster in the left caudate (anterior and dorsally located), as well as smaller activations in more ventral regions of the medial striatum. A small region reflecting the conjunction of instrumental and Pavlovian tasks was apparent in the left putamen.

Fixed versus individual (Table 6)

Table 6 ALE clusters representing individual (Ind) and fixed activations, including peak t statistics, MNI coordinates, and cluster sizes

The individual map was also similar to the overall RPE map, without the presence of the midbrain cluster or any activation within the dorsal striatum. The striatal activations were focused within the medial regions of the ventral striatum. By contrast, the fixed map yielded two clusters: one in left putamen and one in the midbrain. Statistical comparison of the contrasts yielded greater activation in the bilateral ventral striatum (medially focused) for the individual contrast, as well as the left operculum and left visual cortex. The fixed contrast yielded a large midbrain cluster, as well as very small differences in the left lateral putamen. A cluster representing the conjunction of fixed and individual was present in the left putamen.

PE at outcome versus TD error (Table 7)

Table 7 ALE clusters representing temporal difference (TD) error and prediction error (PE) at outcome activations, including peak t statistics, MNI coordinates, and cluster sizes

Studies that modeled PE only at the US made up a large proportion of the data, and consequently the US PE map was very similar to the overall RPE map. The seven TD error studies yielded a cluster including the left lateral striatum (putamen) and amygdala. A conjunction between the two was again observed within the left putamen. The TD error studies activated the left amygdala/hippocampus more than did the US PE studies, whereas the latter showed greater activation in the left caudate and left frontal operculum.

Reinforcer type (Table 8)

Table 8 ALE clusters representing the activations associated with different reinforcers, including peak t statistics, MNI coordinates, and cluster sizes

As with the outcome PE map, monetary reinforcement occurred frequently in the selection of studies. Thus, the monetary subanalysis revealed a pattern of activations very similar to the overall RPE contrast. The other reinforcer-type subanalyses were somewhat underpowered, and we did not perform statistical contrasts of these maps. The cognitive subanalysis did not reveal any significant clusters, but the liquid and social reinforcement maps yielded several distinct clusters. Liquid rewards elicited lateral putamen and amygdala activations, whereas social rewards produced two left hemispheric activations: One was similar to the frontal opercular/insula cluster in the main reward PE contrast; the second was in the left inferior parietal cortex.

High versus low smoothing (Table 9)

Table 9 ALE clusters representing activations associated with high and low smoothing kernels, including peak t statistics, MNI coordinates, and cluster sizes

High-smoothing studies were associated with bilateral putamen and amygdala activation, as well as activation in the left frontal operculum. Low-smoothing studies were associated with the thalamus/midbrain and left frontal operculum. The opercular activations were not similar enough to yield a significant conjunction. High-smoothing studies were significantly more likely to activate the right amygdala than were low-smoothing studies. The low-smoothing studies were more likely to activate a small cluster of the thalamus, toward the top of the midbrain/thalamus cluster identified in the main RPE contrast.

Overall conjunction

A conjunction analysis was conducted across all of the main contrast types (Pavlovian/instrumental, fixed/individual, RW/TD, high/low smoothing) using the minimum statistic across the cluster-thresholded contrasts for each of the eight maps (Rottschy et al., 2012). A 30-voxel cluster was revealed in the left putamen (–22, 6, 9) across the first three pairs of contrasts (i.e., excluding smoothing). This cluster thus reflects the strongest convergent evidence for a neural correlate of a signed RPE signal that we were able to obtain (see Fig. 5). However, when the smoothing-related contrasts were included, no clusters were identified.
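The minimum-statistic conjunction used here can be sketched as follows (Python with NumPy; the maps are random placeholders standing in for the eight cluster-thresholded ALE maps):

```python
import numpy as np

rng = np.random.default_rng(0)
z_maps = rng.random((8, 20, 20, 20)) * 4.0     # stand-ins for thresholded z-maps
z_maps[z_maps < 2.0] = 0.0                     # zero outside "significant" clusters

conjunction = z_maps.min(axis=0)               # minimum statistic across the maps
overlap = np.argwhere(conjunction > 0)         # voxels significant in every map
print(len(overlap))
```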

Fig. 5

Conjunction map showing overlap of the ALE maps from individual subgroup analyses (fixed, individual, Pavlovian, instrumental, outcome PE, TD, monetary, liquid, and social), with the left putamen cluster (x = –22, y = 6, z = 9, cluster size = 30) from the conjunction analysis marked with arrows

Expected value (Table 10)

Table 10 ALE cluster representing the activation associated with expected value (EV), including peak t statistics, MNI coordinates, and cluster sizes

The ALE analysis of studies reporting EV yielded a single activation in the subgenual anterior cingulate cortex (ACC; Table 10, Fig. 6). To illustrate specificity, the RPE and EV maps were contrasted. The subgenual ACC was significantly more likely to be activated in the EV than in the RPE condition, whereas the left striatum and midbrain were significantly more likely to be activated in the RPE than in the EV condition. No significant clusters representing the conjunction of EV and RPE were observed.

Fig. 6

Map of significant ALE clusters associated with the expected value contrast. Pie charts show the contributions of the studies of a particular class to the subgenual cingulate activation. Percentages are not corrected for base rate

Discussion

In line with previous animal and human studies, the present meta-analysis confirmed our core hypotheses: that the midbrain and striatum represent reward prediction errors, whereas the subgenual cingulate—a caudal region of the vmPFC—represents expected value. In addition, this meta-analysis revealed that the frontal operculum and visual cortices are part of the RPE network, mainly recruited during social rewards and attentional processing, respectively. Although these results are largely compatible with previous meta-analyses of the neural bases of PEs (Garrison et al., 2013), reward anticipation and receipt (Diekhof, Kaps, Falkai, & Gruber, 2012; Liu, Hairston, Schrier, & Fan, 2011; Sescousse et al., 2013), and value (Bartra et al., 2013; Clithero & Rangel, 2014; Levy & Glimcher, 2012; Peters & Buchel, 2010), the present study extends this work by focusing exclusively on the neural correlates of parametric RPEs and EV derived from reinforcement learning models. We identified methodological factors that might have contributed to the divergent findings, including Pavlovian/instrumental designs, reinforcer type, and smoothing kernel size.

Core PE network

The reproducibility of fMRI BOLD images is often a concern, with test–retest reliability of the method being generally modest, and very poor in some cases (Bennett & Miller, 2010). Moreover, methodological differences across studies, including differences between scanners, paradigms, participants, and analysis software may further conspire to amplify between-study heterogeneity. Nevertheless, a core network of regions associated with PEs was readily identified, including the ventral striatum and midbrain, as predicted. Indeed, even for two regions that were not predicted—the left frontal operculum and left visual cortex—over ten studies contributed to each of these clusters. This suggests that this core PE network is robust to between-study variability and reflects a level of specificity of the activations. However, each of the activations should be interpreted carefully; it is often difficult to distinguish certain psychological events, due to a shared but spurious correlation with the general linear model regressor. The variability of paradigms may act to provide some decorrelation of irrelevant variables from the RPE construct. For example, the lack of PE signals in the medial PFC is consistent with animal electrophysiological studies (Roesch et al., 2010), although medial OFC activation has been shown to be coupled to RPE in some human fMRI studies. Our findings are consistent with the view that this is likely to be due to the correlation inherent between appetitive properties of the outcome and RPE in many of these designs (Erdeniz, Rohe, Done, & Seidler, 2013; Rohe et al., 2012).

Aside from the reinforcement learning signal hypothetically encoded by dopamine-rich regions such as the midbrain and ventral striatum, associative learning algorithms are often extended to account for salience and attentional phenomena. These constructs may be necessary for interpreting RPE correlates in the visual cortex, amygdala, and insula. For example, the Pearce–Hall (PH) model (Pearce & Hall, 1980) emphasizes that the cues associated with surprising outcomes command attention: PEs not only strengthen associations, but a similar signal, reflecting surprise associated with the outcome, may control the rate at which such associations are strengthened. In the PH model, stimuli that are accompanied by larger PEs attract attention, and thus become more readily associated with other stimuli. A recent theme has been to argue that a PH signal might be coupled to the surprising outcome itself, rather than to conditioned stimuli. For example, a recent study by Li, Schiller, Schoenbaum, Phelps, and Daw (2011) suggested that, consistent with animal learning studies (Maddux, Kerfoot, Chatterjee, & Holland, 2007), the amygdala codes surprise, as predicted by the PH model, rather than a signed RPE signal.

In the present study, we found amygdala activation coupled to the RPE contrast. In the probabilistic designs that are widely used, it would be difficult to dissociate a PH signal from the basic RPE contrast. It may then be that RPE-coupled amygdala activation reflects some confounding of a PH signal with the RPE signal, particularly because a PH parameter is often not concurrently modeled. However, amygdala activation was particularly associated with studies in which liquid was used as a reinforcer, and larger smoothing kernels were also associated with greater amygdala activation. These factors should be independent of the learning rule and contingency under investigation, and should be adequately controlled in future studies of the PH rule.

Other regions that have played a well-established role in attention in the fMRI literature were also coupled to the RPE contrast, including the left visual cortex. Although reward-related responses in the visual cortex have been identified, a recent study argued that these signals may reflect attentional processing rather than the appetitive and dopamine-related properties of the reward (Arsenault, Nelissen, Jarraya, & Vanduffel, 2013). With the RPE contrast, we also identified a left frontal operculum/anterior insula region that is activated by a wide range of stimuli and task designs, and thus perhaps has a general role in task set representation (Dosenbach et al., 2006). Nevertheless, the activation of this region by reward has been quite well characterized. A study by Rutledge, Dean, Caplin, and Glimcher (2010) parametrically manipulated the reward probabilities of wins and losses, finding that the response of the anterior insula to reward did not follow a pattern that would be expected from a PE signal. It was, however, modulated to some degree by the probability of the outcome, insofar as activation was not observed in the region if the outcome was fully predicted, and showed fairly consistent activation across wins and losses if the outcome was uncertain. Given that the paradigms in the present study have generally included a degree of outcome uncertainty, this opens the possibility that anterior insula activation may become coupled with an RPE regressor, while not accurately reflecting the predicted RPE signal. Less obvious is the fact that paradigms employing social reinforcement were particularly able to elicit activation in this region. An interpretation of the Rutledge et al. study might suggest that this is simply related to the kind of contingencies employed in the social paradigms, but equally it is worth considering the possibility that the anterior insula may play a distinct role in the reinforcement process itself.

Pavlovian versus instrumental

Although the majority of studies have been instrumental, requiring participants to make a choice, we contrasted these studies with a small number of Pavlovian designs. We found differential activation in the left caudate (dorsal striatum), consistent with an influential study by O’Doherty and colleagues (2004) in which the striatum was argued to follow the “actor–critic” model: the anterior, dorsal caudate (“actor”) was engaged when behavioral output was required. By contrast, the ventral striatum (“critic”) was engaged during errors of value prediction, whether or not a response was required to obtain reward. This distinction is also broadly consistent with animal lesion studies, since the dorsomedial striatum of rodents—a likely homologue of the caudate region identified in the present study and that of O’Doherty et al. (2004)—plays a key role in instrumental, goal-directed behavior (Yin, Ostlund, Knowlton, & Balleine, 2005), whereas the ventral striatum is more consistently implicated in Pavlovian behaviors (Corbit & Balleine, 2011; Parkinson, Olmstead, Burns, Robbins, & Everitt, 1999).

Although the notion that the striatum contributes to action selection in a manner predicted by the actor–critic model has steadily gathered currency, it was somewhat undermined by a previous meta-analysis by Garrison and colleagues (2013). That study showed that both the dorsal and ventral striatum were engaged by instrumental designs, and that both were significantly more activated by these designs than by Pavlovian designs. Our findings contrast with that study, since we did find significant activation in the ventral striatum elicited by Pavlovian designs, although it was somewhat more lateral than the equivalent activations seen in instrumental designs.

Together, the present study and that of Garrison et al. (2013) may provoke further debate about the success of the actor–critic model as an account of the striatum’s influence on behavior. However, there are several important reasons why providing a definitive contribution to this question might be difficult. First, it has been noted (e.g., Coricelli et al., 2005; Yeung, Holroyd, & Cohen, 2005) that designs in which a (human) participant is required to make a choice, and is reinforced for doing so, are potentially more engaging than Pavlovian designs, and consequently can provide more robust neural signals. Given that the magnetic resonance scanner requires that an individual lie for long periods in a darkened room, performing an often repetitive task, this consideration is not to be taken lightly, and can make it difficult to design an effective Pavlovian paradigm. This may explain both the preponderance of instrumental tasks in the literature and the second key limitation—that Pavlovian designs tend to focus on liquid reinforcers rather than other domains. This is presumably because liquid is a powerful primary reinforcer, particularly when the participant is thirsty (e.g., Kumar et al., 2008), and this may somewhat compensate for the potential lack of engagement described above. A final limitation is the nature of the definition of instrumental and Pavlovian designs. Instrumental behavior can be defined on the basis of the contingency between a particular action and an outcome (Balleine & Dickinson, 1998), and the manner in which a participant can use this information to obtain reinforcement. The presence of stimuli in all of the paradigms that we considered in the present work complicates this issue somewhat. Specifically, in any of the instrumental designs included in the present work, it cannot be assumed that this action–outcome contingency was the sole factor that determined choice. Rather, an individual’s responses may also have been susceptible to influence by the presented stimuli and by the relationships between the stimuli and reinforcement.

Fixed versus individual learning rates

We investigated whether the strategy of reinforcement learning model fitting, upon which the pattern of the RPE (and EV) regressors was based, was associated with different patterns of neural activation. Although across most situations the patterns of RPEs associated with fixed and individual model fitting should be highly similar, it is nevertheless unclear exactly how sensitive the pattern of activations is to the parameterization of the underlying model. Daw (2011) has consistently argued that the fixed (or, more particularly, group fixed) strategy offers advantages over estimating the model parameters per individual. On the other hand, regarding the fitting of models to behavioral data, Estes and Maddox (2005) have argued that individual-participant fitting avoids certain sources of bias associated with group averaging.

The fixed subgroup showed the strongest corroboration of the classic RPE hypothesis pioneered by Schultz and colleagues (Schultz et al., 1997), since the midbrain was engaged in these studies. In addition, activation in the lateral putamen was also observed, as would be expected on the basis of anatomical connectivity (Haber et al., 2000). However, if the individual method were suboptimal, we would not expect the method to have obtained traction in the literature—individual studies being more common than fixed ones—and more importantly, we would not expect a distinct pattern of activations to emerge. It is possible to imagine scenarios in which suboptimal acquisition or preprocessing parameters that impair the detection of midbrain activations would sustain a pattern of weaker ventral striatal RPE-associated responses beyond the canonical network, but even then, the activation should not show such a reproducibly medial focus within the striatum. It also does not seem likely that a suboptimal RPE regressor would be better coupled to an experimental confound, such as the response to the reward itself (Rohe et al., 2012). Within the reinforcement learning framework we have set out, the most likely remaining explanation is that the neural responses to RPEs generated by different learning rates are reflected across different regions of the brain (Glascher & Buchel, 2005). For example, a model by M. J. Frank, Moustafa, Haughey, Curran, and Hutchison (2007) distinguished a rapid but time-dependent learning mechanism, ascribed to the OFC, and a slower, incremental learning mechanism, ascribed to the striatum. Both mechanisms used similar RW-based learning rules, although more recent, comparable models have employed a working-memory-based system rather than a rapid reinforcement learning system (Collins & Frank, 2012). This might, therefore, provide one interpretation of our data, with the modification that the medial striatum encodes a more variable learning rate (across individuals), perhaps better linked to trial-by-trial choice performance, whereas the midbrain and lateral putamen reflect a more homogeneous, slower learning rate that is not as strongly reflected in behavior.

Conjunction analyses

A further level of specificity is afforded by the conjunction analysis examining which regions have been identified across different designs, and thus are relatively invariant. Across several of the subgroup analyses (i.e., fixed/individual, Pavlovian/instrumental, and RW/TD), the left putamen was identified. The region was notable insofar as it was positioned at the midpoint between the classic ventromedial striatal region, which may correspond to the nucleus accumbens in humans (Haber & Knutson, 2010), and a more clearly lateralized putamen region. Given that these two regions may be anatomically distinct (Haber et al., 2000), it is important to consider the extent to which smoothing may have played a part in this finding. The smoothing of individual participant images is considered to be an important preprocessing step: Though not without drawbacks, the method is thought to enhance statistical power, by increasing the ratio of signal to noise (Yue, Loh, & Lindquist, 2010), and increases the underlying smoothness for Gaussian random field-based (cluster) analyses (Hayasaka & Nichols, 2003). It is intriguing that one subgrouping analysis that did not yield activation in this region was the conjunction of studies that used high and low smoothing kernels. In a recent study, Sacchet and Knutson (2013) demonstrated that larger smoothing kernels can influence the localization of peak activation within the ventral striatum, with larger kernels yielding more posterior activations. In our study, the variability in the magnitudes of smoothing kernels across studies was relatively small, with the large majority of studies choosing an 8-mm kernel, and no significant differences between the low/high smoothing subgroups were seen. However, it was also notable that studies using a small smoothing kernel were (nonsignificantly) more capable of revealing midbrain activation. Given that the midbrain is a small structure, matched filter theory (for fMRI, see Yue et al., 2010) would predict that a smaller filter should therefore be advantageous to identify activation in this region. Overall, as was suggested by Sacchet and Knutson, differences in smoothing across studies may provide significant additional heterogeneity, and alternative smoothing methods that honor the geometry and sizes of these regions may be valuable in future studies.

Core expected value network

Our meta-analysis of reinforcement learning studies of EV identified a subregion of the subgenual cingulate cortex, corresponding most closely to areas 25 and 32 of the human and monkey vmPFC. This phylogenetically ancient agranular region is likely homologous to the prelimbic and infralimbic cortex of rodents (Wallis, 2012).

To a first approximation, our findings converge with primate electrophysiological (Kennerley et al., 2009; Kennerley & Wallis, 2009a, 2009b; Morrison & Salzman, 2009; Padoa-Schioppa & Assad, 2006, 2008; Roesch & Olson, 2004, 2005; Wallis & Miller, 2003) and lesion (Izquierdo, Suda, & Murray, 2004; Noonan et al., 2010; Rudebeck & Murray, 2011) studies, as well as rodent lesion studies (Gallagher, McMahan, & Schoenbaum, 1999; McDannald, Lucantonio, Burke, Niv, & Schoenbaum, 2011; Takahashi et al., 2009), implicating the OFC in value computations. Yet, the substantial anatomical heterogeneity between these literatures cannot be ignored. Most primate electrophysiological studies have recorded value signals from more rostral, central orbitofrontal regions (BAs 11 and 13). Rodent studies have often employed lesions of the more rostral and lateral OFC (Gallagher et al., 1999; McDannald et al., 2011; Takahashi et al., 2009). In contrast, our subgenual cingulate cluster is more medial and caudal and does not extend to the orbital surface. This discrepancy was recently discussed by Wallis (2012), who pointed out a few possible solutions to this puzzle. First, rostromedial OFC BOLD activations in BA 11, medial BA 13, and ventral BA 10 are obscured by the susceptibility artifact. Thus, value signals in the human brain may well extend into the rostral and central OFC areas highlighted by primate physiological studies. However, a recent meta-analysis of fMRI studies of reward value that was not limited to reinforcement learning studies, by Bartra and colleagues (2013), reported value-related activations in the medial rostral OFC areas most affected by the susceptibility artifact, but not in the more lateral central OFC, in which signal is often better preserved.

Another set of considerations stems from the medial–lateral organization of the orbitofrontal circuits (Ongur & Price, 2000). The lateral, “orbital” circuit of Carmichael and Price (1996) encompasses central OFC areas, which integrate sensory inputs carrying information about extrinsic food values: taste, olfaction, and vision. It is often argued that this lateral circuit represents not only the values of foods and liquids typically used in animal experiments, but those of external stimuli and outcomes in general (Schoenbaum, Takahashi, Liu, & McDannald, 2011; Wallis, 2012). Physiologists have typically recorded from this circuit in their studies of primate and rodent OFC (Kennerley et al., 2009; Kennerley & Wallis, 2009a, 2009b; Morrison & Salzman, 2009; Padoa-Schioppa & Assad, 2006, 2008; Roesch & Olson, 2004, 2005; Wallis & Miller, 2003).

An additional reason why fMRI studies may not have detected value signals in the central OFC is the opposing value-encoding schemes of its neurons (Wallis, 2012): Some OFC neurons increase and others decrease their firing rates as value increases (Kennerley & Wallis, 2009a; Morrison & Salzman, 2009; Padoa-Schioppa & Assad, 2006). These opposing responses may cancel each other out at the level of the BOLD signal (see the sketch following this paragraph). The medial orbital circuit, encompassing the vmPFC and the subgenual cingulate in particular, has prominent visceral and motor connections (Carmichael & Price, 1996; Ongur & Price, 2000). Its putative functions include sensing internal states, tracking social value, and bridging outcome value and action selection (Bouret & Richmond, 2010; Noonan et al., 2010; Rudebeck et al., 2008; Rudebeck, Buckley, Walton, & Rushworth, 2006). Grabenhorst and Rolls (2011) have placed the vmPFC downstream from the OFC in the processing of reward signals, proposing that the vmPFC receives stimulus value information from the OFC, incorporates other variables such as cost into the decision, and transmits the resulting decision variable to motor areas. Responses in the vmPFC often scale with subjective pleasure, which may best correspond to the reward rate or to the total value of the contingencies that can be exploited.
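This cancellation argument can be made concrete with a toy simulation (ours, for illustration only; the neuron counts, gains, baseline rate, and the use of summed firing as a proxy for BOLD are all simplifying assumptions). With balanced populations whose firing increases or decreases with value, single neurons carry strong value signals while the aggregate is nearly flat:

import numpy as np

rng = np.random.default_rng(1)
values = np.linspace(0.0, 1.0, 50)          # hypothetical trial-wise reward values
n_pos = n_neg = 60                          # balanced opposing populations
pos_gain = rng.uniform(0.5, 1.5, n_pos)     # these neurons fire more as value rises
neg_gain = rng.uniform(0.5, 1.5, n_neg)     # these neurons fire less as value rises

baseline = 5.0
pos_rates = baseline + np.outer(values, pos_gain)
neg_rates = baseline - np.outer(values, neg_gain)

# Sum across the whole population as a crude stand-in for the BOLD signal
aggregate = np.hstack([pos_rates, neg_rates]).sum(axis=1)
slope = np.polyfit(values, aggregate, 1)[0]
print(f"value slope of one 'positive' neuron: {pos_gain[0]:.2f}")
print(f"value slope of the aggregate signal: {slope:.2f}")
# The aggregate slope is near zero relative to the ~120 expected if all
# neurons modulated their firing in the same direction.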

Although findings of vmPFC value signals are consistent in human fMRI studies, they are less well established in the primate electrophysiological literature (Wallis, 2012; but see Strait, Blanchard, & Hayden, 2014). This discrepancy may reflect methodological differences between the human and monkey studies. For example, human studies have mostly used secondary reinforcers, such as money and correct/incorrect feedback. Only 2 of the 16 value studies in our meta-analysis used primary rewards (liquid); one of them detected value signals in the vmPFC (Takemura, Samejima, Vogels, Sakagami, & Okuda, 2011) and one did not (Gradin et al., 2011), and neither found value signals in the central OFC. Furthermore, the meta-analysis by Bartra and colleagues (2013) reported vmPFC value signals for both primary and monetary rewards. A related explanation focuses on the putative predilection of the vmPFC for social value signals (Rudebeck et al., 2006); however, the presence of vmPFC value signals in fMRI studies that have used primary, nonsocial rewards argues against this explanation. That said, demand characteristics may be a confound in human imaging studies of value signals, and experimenters may thus need to conceal contingency manipulations. In summary, our finding of reinforcement-learning-estimated value signals in the vmPFC/subgenual cingulate is consistent with non-reinforcement-learning-based human imaging studies and diverges somewhat from the primate electrophysiological studies, which have tended to find value signals in the central OFC.

Because the EV map was restricted to the vmPFC, a supplementary conjunction analysis of the RPE and EV contrasts did not reveal significant results. Given that the EV maps reflect future expected rewards, a TD-related signal, and thus concurrent striatal or midbrain activation, might plausibly have been expected at this stage. In fact, significantly different activations were observed between the RPE network (RPE > EV) and the vmPFC EV cluster (EV > RPE). A statistical account of this observation may relate to the combined inclusion of RPE and EV regressors in the general linear models used in many of the studies: Entering both regressors concurrently, combined with a suitable design, may act to orthogonalize the two events and separate the resulting maps (see the sketch below). Nevertheless, our findings are also consistent with the view that a phasic TD signal might be distinct (in this case, neuroanatomically) from an EV signal (Ludvig, Sutton, & Kehoe, 2008).
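The orthogonalization point can be illustrated with a toy general linear model (a sketch of the statistical logic only; the trial count, effect sizes, and noise level are arbitrary, and no hemodynamic convolution is included). Because the prediction error is defined as the outcome minus the expectation, it is uncorrelated with EV by construction in a balanced design, so a voxel driven purely by the RPE yields a near-zero EV beta even when both regressors are entered together:

import numpy as np

rng = np.random.default_rng(2)
n_trials = 200
ev = rng.uniform(0.0, 1.0, n_trials)         # expected value at cue onset
reward = rng.binomial(1, ev).astype(float)   # outcomes delivered with probability EV
rpe = reward - ev                            # prediction error at outcome

# Hypothetical voxel response driven only by the prediction error
y = 0.8 * rpe + rng.normal(0.0, 0.3, n_trials)

# Design matrix with intercept, EV, and RPE regressors entered concurrently
X = np.column_stack([np.ones(n_trials), ev, rpe])
betas, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"EV beta:  {betas[1]:+.2f}  (true effect 0)")
print(f"RPE beta: {betas[2]:+.2f}  (true effect 0.8)")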

Limitations

Although striking consistency in the patterns of activation was observed across paradigms, there was nevertheless evidence, as we have discussed, that different classes of paradigms led to different patterns of findings. The inferences that can be drawn from analyses of these differences were limited by confounds between categories, a problem that was particularly acute for Pavlovian–TD–liquid designs, because of their relative infrequency. In particular, amygdala RPE-coupled activations were associated with these classes of designs, making it difficult to draw strong conclusions about the amygdala’s engagement by any one paradigm class. Overall, our method of contrasting paradigm classes would require that all other dimensions be controlled for strong inferences to be obtained. Although this was not possible here, the findings nevertheless point to particular aspects of experimental design that may produce differences in the patterns of neural activation obtained.

Refutations or refinements of reinforcement learning models are, of course, a crucial part of their theoretical development within neuroscientific investigation (Gamez, 2012). However, we restricted our analysis to studies in which the reinforcement learning model was not refuted or otherwise argued to be an inferior account of the pattern of data, although we did allow for some modifications of the parameterization of the basic RW or TD model. Bayesian models such as the Bayesian learner (Behrens, Woolrich, Walton, & Rushworth, 2007), hidden Markov models (Hampton, Bossaerts, & O’Doherty, 2006), and Bayesian reinforcement learning (Mathys, Daunizeau, Friston, & Stephan, 2011), as well as the Kalman filter (Daw, O’Doherty, Dayan, Seymour, & Dolan, 2006), can all exhibit advantages over many of the models examined in the present work. However, the superior performance of these alternative models in the studies that we opted to exclude may have resulted from peculiarities of the experimental designs, which might render those studies more heterogeneous a priori, and thus less suitable for meta-analysis. In addition, the nature of this advantage should be carefully qualified (Myung, 2000): Often these models are representationally more powerful, perhaps reflecting inherent features of the experimental design (e.g., the rule transitions embedded within reversal learning: Behrens et al., 2007; Hampton et al., 2006). Although pursuing the benefits of these models is likely to be a topic of major ongoing interest, we argue that the incremental increase in complexity and representational capacity of many of these models creates a natural, qualitative distinction from the more traditional reinforcement learning methods that were the focus of the present work. For instance, whereas the RW and TD rules update value with a fixed learning rate, a Kalman-filter learner additionally tracks its own uncertainty and scales each update accordingly, as sketched below.
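The sketch below makes this concrete with a minimal scalar Kalman-filter learner, following the treatment of Daw et al. (2006) in spirit only; the process and observation noise variances are arbitrary assumptions. The update retains the delta-rule form of RW, but the fixed learning rate is replaced by a Kalman gain computed from a running estimate of uncertainty, a latent quantity that the RW and TD models do not represent:

import numpy as np

rng = np.random.default_rng(3)
v, p = 0.0, 1.0            # value estimate and its variance (uncertainty)
q, r = 0.01, 0.25          # assumed process and observation noise variances

for t in range(10):
    reward = 1.0 + rng.normal(0.0, np.sqrt(r))  # noisy reward around a true value of 1
    p = p + q                  # uncertainty grows between trials (the value may drift)
    k = p / (p + r)            # Kalman gain: an uncertainty-dependent learning rate
    v = v + k * (reward - v)   # the same prediction-error update form as in RW
    p = (1.0 - k) * p          # uncertainty shrinks after each observation
    print(f"trial {t + 1:2d}: gain={k:.2f}, value estimate={v:.2f}")

Early trials receive a high effective learning rate that declines as the estimate stabilizes, behavior that a fixed-rate RW learner cannot produce without an explicit annealing schedule.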

Another limitation of the present study stems from the nature of meta-analysis itself, as compared with the direct pooling of data in a “mega”-analysis. A judicious combination of fMRI studies of conditioning could in principle be performed, perhaps along lines similar to the analysis of task-related neural activation by Dosenbach and colleagues (2006). Where possible, this would afford a more direct contrast of different modeling strategies (e.g., fixed/individual learning rates, smoothing kernels), and possibly also of procedural differences (e.g., reinforcer types, response contingencies). Moreover, this approach could support a more detailed investigation of the relationships between individual functional activations and anatomy, provided that adequate structural data were available. Relating activations to individually defined regions of interest would diminish the need for spatial smoothing and potentially increase specificity in regions of high between-participant anatomical variation.

We also restricted our study inclusion to healthy adult groups. Individual differences in a variety of demographic factors can influence the patterns of reinforcement-learning-related neural activation and represent possible unmeasured sources of intersubject variability. Again, a “mega”-analysis with suitably recorded data might provide some control over these effects. However, the consistency of some of our findings (e.g., the left putamen) across methodological dimensions suggests that such factors may modulate a core pattern of activation rather than yield qualitative differences. Overall, because ALE has been argued to be statistically conservative (Graham et al., 2013), our findings likely represent a central, reproducible motif that can provide a useful reference point for future studies of reinforcement learning and reward-based conditioning. Although the number of studies available was adequate, a larger corpus would afford greater power to address the full diversity of reinforcement-learning-related processes in the human brain (e.g., Rottschy et al., 2012), particularly if it included designs not well represented in the present selection (e.g., liquid–TD studies).

Summary

In the present work, we have identified a pattern of human neural correlates of the RPE and EV signals derived from simple reinforcement learning algorithms. Our findings accord well with the existing literature, particularly with electrophysiological studies in experimental animals, in implicating dopamine-rich regions such as the midbrain and striatum in RPE signaling, and the ventromedial prefrontal cortex in EV representation. The main contribution of the present work has been to demonstrate that various methodological factors can influence the patterns of findings: some can be controlled at the analysis stage (e.g., learning rate estimation, smoothing), whereas others must be examined experimentally (e.g., reinforcer type, behavioral output). Overall, the reinforcement learning framework has been an empirically successful paradigm for investigating the neurobiology of appetitive behavior, and we anticipate that a new generation of studies will develop the implications of these findings further.