Introduction

Over four decades ago, James J. Gibson presented the seminal concept of affordances to describe the relationships that exist between organisms and their environments, indicating that “the affordances of the environment are what it offers the animal” (Gibson, 1979, p. 127). According to this view, common manipulable objects, such as tools, handles, or kitchenware, automatically trigger responses that have acquired a strong association with them, resulting in automatic and specific motor plans for interacting with them (Makris et al., 2013; Proverbio et al., 2011; Tucker & Ellis, 2001). In the classic version of the affordances task, participants classify images of manipulable objects according to a certain rule (e.g., natural vs. manufactured; upright vs. inverted) by responding with their right or left hand (Tucker & Ellis, 1998, 2004; Wilf et al., 2013). Typically, the objects have a prominent handle and thus trigger an automatic grasping response in one hand (e.g., a cup with the handle facing right or left will trigger a grasping response in the corresponding hand). Responses are slower and more error-prone when the relevant response (classifying the object) and the irrelevant, stimulus-driven grasping response activate different hands (incongruent condition) than when they activate the same hand (congruent condition). Recent studies have elaborated on this finding by adding a neutral condition to the task and demonstrated that two cognitive conflicts exist in the affordances task: a response conflict between responding with the relevant versus the irrelevant hand, and a task conflict between the goal-directed classification task and the stimulus-driven grasping task (Littman & Kalanthroff, 2021, 2022). While response conflict manifests only in incongruent trials, task conflict exists in both incongruent and congruent trials. Thus, typical results indicate a congruency effect (longer reaction time [RT] to incongruent than to congruent trials, indicating a response conflict), a reversed facilitation effect (congruent RT > neutral RT, indicating task conflict), and an interference effect (incongruent RT > neutral RT, which encompasses both task and response conflicts; Littman & Kalanthroff, 2022).

Since its presentation in the seminal work by Tucker and Ellis (1998), the affordances task has been employed in a variety of studies and in various iterations to promote our understanding of human cognition, attention, and visuomotor functioning. However, despite its importance in experimental science, an evaluation of the task’s psychometric properties, including its test–retest reliability, has not been undertaken. Critically, the lack of reliability measures poses a significant limitation to our ability to draw valid conclusions regarding aspects of individual differences measured by the task. Thus, our primary goal here was to establish the test–retest reliability of the affordances task.

For cognitive tasks, test–retest reliability is often assessed by correlating RT performance across different occasions of assessment (Enkavi et al., 2019). However, such efforts often result in low test–retest measures, falling short of the minimal satisfactory value of 0.7 (Barch et al., 2008), even with the most well-established tasks (Draheim et al., 2021; von Bastian et al., 2020). The “reliability paradox” (Hedge et al., 2018) refers to the phenomenon whereby behavioral tasks that produce highly replicable group-level effects often fail to capture individual differences among participants, for example by yielding low test–retest reliability (Enkavi et al., 2019; Haines et al., 2020). For instance, in the case of the Stroop task, which yields a very robust and replicable interference effect (MacLeod, 1991; Stroop, 1935), simple test–retest correlations often yield satisfactory reliabilities for the congruent, incongruent, and neutral conditions individually, but considerably lower and unsatisfactory correlations for the robust Stroop interference effect, calculated as the difference between the incongruent and the congruent conditions (Bender et al., 2016; Hedge et al., 2018; Strauss et al., 2005). The gap between the satisfactory reliability of the task’s conditions and the low reliability of the task’s effects, which otherwise produce robust, replicable group-level effects, presumably stems from the well-studied problem of the “reliability of difference scores” (see review in Draheim et al., 2019). This term refers to the fact that the difference between scores on two highly correlated conditions often yields a value that is unstable across administrations, which in turn lowers reliability. Nevertheless, the focus on congruency effects rather than on congruency conditions is crucial for evaluating the “refined” processes of cognitive control that operate beyond general RT performance. Thus, as many robust group-level effects result in unsatisfactory test–retest estimates, various researchers have raised concerns regarding their employment in the measurement of individual differences, or have deemed them unsuitable for doing so (Dang et al., 2020; Elliott et al., 2019; Gawronski et al., 2017; Hedge et al., 2018; Schuch et al., 2022; Wennerhold et al., 2020).
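
To make the logic of the “reliability of difference scores” concrete, the following toy simulation (illustrative only; the numbers are assumptions and not the study data) shows how two condition scores can each be highly reliable while their difference score is not.

```python
"""Toy simulation of the reliability paradox for difference scores."""
import numpy as np

rng = np.random.default_rng(1)
n = 300  # hypothetical participants

true_speed = rng.normal(600, 80, n)   # stable person-level mean RT (ms)
true_effect = rng.normal(30, 10, n)   # stable person-level congruency effect (ms)

def session(noise_sd=25):
    # Observed condition means for one session, with measurement noise
    congruent = true_speed + rng.normal(0, noise_sd, n)
    incongruent = true_speed + true_effect + rng.normal(0, noise_sd, n)
    return congruent, incongruent

c1, i1 = session()
c2, i2 = session()

r = lambda a, b: np.corrcoef(a, b)[0, 1]
print(r(c1, c2), r(i1, i2))    # condition RTs: high test-retest correlations
print(r(i1 - c1, i2 - c2))     # difference score (congruency effect): much lower
```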

A few recent studies have suggested that hierarchical Bayesian models could serve as key instruments to bypass the limitations of the common summary-statistics practice for obtaining test–retest reliability (Chen et al., 2021; Haines et al., 2020; Romeu et al., 2020; Rouder & Haaf, 2019). As mentioned above, summary statistics correlate mean RT differences across times of assessment. However, the summary-statistics approach has two major limitations: (a) it ignores the specific distribution for each participant from which this mean derives, and (b) it neglects to account for trial-level variance, which constitutes an important source of data variability (Chen et al., 2021; Haines et al., 2020). As opposed to summary statistics, hierarchical Bayesian models are generative. That is, for a likelihood function predetermined by the researcher (e.g., a lognormal distribution for RT, a Bernoulli distribution for accuracy), hierarchical Bayesian models allow one to simulate data at the trial level based on the specific distribution estimated for each participant in the different experimental conditions (McElreath, 2020). Thus, hierarchical Bayesian models address the limitations mentioned above by (a) specifying a likelihood distribution for the model, which in turn provides each participant with their own specific distribution, and (b) incorporating the data into the model at the trial level, thus accounting for trial-level variance. That is to say, while summary statistics ignore the uncertainty (i.e., measurement error) associated with each participant’s summary score, generative models specify a single model that jointly captures individual- and group-level uncertainty. Given that means alone often characterize entire distributions imprecisely, models that capture the full shape of participants’ RT distributions may yield very different inferences. Therefore, because hierarchical Bayesian models are generative, they provide an individualized distribution for each participant (per experimental condition) and can thus yield a more reliable and more accurate estimation of a task’s test–retest reliability than summary statistics.
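
The generative aspect can be illustrated with a minimal sketch: given assumed person-level lognormal parameters, trial-level RTs can be simulated, so the model speaks to each participant’s full RT distribution rather than only to its mean. The parameter values below are arbitrary placeholders, not estimates from the study.

```python
"""Minimal sketch of trial-level simulation from a person-specific lognormal."""
import numpy as np

rng = np.random.default_rng(0)

mu_log = 6.3      # assumed person-level mean of log RT (exp(6.3) is roughly 545 ms)
sigma_log = 0.25  # assumed trial-to-trial spread on the log scale

# Simulate 48 trials of one condition for this hypothetical participant
sim_rt = rng.lognormal(mean=mu_log, sigma=sigma_log, size=48)

print(sim_rt.mean(), np.median(sim_rt))  # right-skewed: mean exceeds median
```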

Recently, researchers have compared the test–retest estimates of several well-established cognitive paradigms by using both mean RT correlations and hierarchical generative models (Chen et al., 2021; Haines et al., 2020; Snijder et al., 2022). The generative models consistently inferred higher test–retest measures relative to the summary-statistics approach, and in many cases resulted in substantial differences in test–retest estimates. Moreover, Haines et al. (2020) demonstrated that the generative-model estimates are highly consistent across replications of the same task, whereas estimates based on summary statistics may vary considerably. A core process that differentiates the two methods is hierarchical pooling, which takes place in generative models and refers to the regression of individual-level parameters toward the group-level mean. Simply put, hierarchical pooling improves the estimate for each participant such that when a participant's data are highly inconsistent (e.g., show large variability), the estimate borrows strength from the group-level mean to yield a better estimate of that participant's performance. Importantly, Haines et al. (2020) showed that generative models do not automatically generate higher test–retest reliability than summary statistics; rather, hierarchical pooling occurs only to the extent that it is warranted by the data.
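
The shrinkage behind hierarchical pooling can be sketched with the textbook normal–normal case, a deliberate simplification of the full model: a participant's estimate is a precision-weighted average of their own data and the group mean, so noisy participants are pulled toward the group while consistent participants largely keep their own estimate. All numbers below are hypothetical.

```python
"""Back-of-the-envelope sketch of hierarchical (partial) pooling."""

def pooled_estimate(person_mean, n_trials, trial_sd, group_mean, group_sd):
    # Precision-weighted average of the person's own mean and the group mean
    w_person = n_trials / trial_sd ** 2
    w_group = 1.0 / group_sd ** 2
    return (w_person * person_mean + w_group * group_mean) / (w_person + w_group)

# Few, noisy trials: the estimate is shrunk strongly toward the group mean (600 ms)
print(pooled_estimate(720, n_trials=10, trial_sd=200, group_mean=600, group_sd=40))
# Many consistent trials: the estimate stays close to the person's own mean (720 ms)
print(pooled_estimate(720, n_trials=90, trial_sd=60, group_mean=600, group_sd=40))
```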

In the present study, we followed the method employed by Haines et al. (2020) to assess the test–retest reliability of the affordances task for the first time, first by using traditional summary statistics and then by employing a Bayesian generative model. As mentioned above, the task has been shown to serve as a valuable behavioral tool for the assessment of visuomotor functioning, cognitive conflicts, and the activation of cognitive control at both the task level and the response level (Buccino et al., 2009; Goslin et al., 2012; Grezes & Decety, 2002; Littman & Kalanthroff, 2021, 2022; Rice et al., 2007; Schulz et al., 2018). Furthermore, in recent years there has been growing interest in psychopathological models that focus on the imbalance between stimulus-driven habitual behaviors and goal-directed behaviors (Gillan et al., 2014, 2015; Kalanthroff et al., 2017, 2018b; Robbins et al., 2012). Thus, establishing the affordances task’s test–retest reliability is important for its utilization as a behavioral and neurocognitive tool for the assessment of control over stimulus-driven habitual behaviors. Finally, we administered the task online at both time points. Given the rising popularity of online administration of cognitive tasks (e.g., Feenstra et al., 2018; Gillan & Daw, 2016), establishing reliability measures for an online version of the affordances task may be useful for both cognitive and clinical scientists.

Method

Participants

Three hundred and thirty-one students from the Hebrew University of Jerusalem, Ben Gurion University of the Negev, and Achva Academic College (all in Israel) took part in the experiment for course credit or small monetary compensation (~12 USD). Participants had normal or corrected-to-normal vision, were native speakers of Hebrew, and were naïve as to the purposes of the experiment. The experiment was approved by the Hebrew University institutional ethics committee (HUJI-500119). Informed consent was obtained from all participants prior to their participation in the experiment. The participants were instructed to register for the experiment only if they were able to complete its second part precisely one week after the first part (on the same day and hour in which they completed the first part). The results of 20 participants who did not complete the second part of the study and of an additional 10 participants who failed to complete the second part within six hours of the designated time were removed from the analyses. Following Hedge et al. (2018), the results of six participants were removed due to having more than 30% missed trials (three participants) or due to accuracy rates below 60% in either session (three participants). The analyzed sample thus consisted of 295 participants (226 female, 69 male) between the ages of 18 and 42 (M = 24.5, SD = 3.1). The proportion of left-handed participants was 10.8% (see Footnote 1).

Materials and methods

The experiment was programmed and administered online using Gorilla Experiment Builder (Anwyl-Irvine et al., 2020). Participants were instructed to complete the experiment on their private PCs, in a quiet environment, devoid of interruptions, and after turning off their mobile phones. The experiment was limited to participation via stationary or laptop computers; tablet devices or mobile phones were not permitted. The program adjusted the image resolution so that stimulus size was held constant across participants’ monitors. The participants completed the affordances task twice, with a one-week gap between the two administrations. One day prior to the designated Time 2 administration, participants were reminded via email to complete the second part at the same time of day at which Time 1 was administered. The versions of the task at Time 1 and Time 2 were identical except for the practice block, which consisted of 90 practice trials at Time 1 and 30 practice trials at Time 2. We designed a longer practice block at Time 1 to familiarize participants with the keyboard keys. Icons of the relevant response keys appeared at the bottom of the screen throughout the practice block and disappeared during the experiment. At the end of the practice block in each session, a minimum 80% accuracy rate was required to start the experimental block. If a participant fell short of this requirement, an additional 30 practice trials were added.

Prior to undertaking the affordances task, participants viewed four brief video clips, each lasting two seconds, in which a male/female hand (consistent with the participant’s gender) reached for and grasped a teapot or a cup by the handle, with the handle facing left in one clip and right in a second clip (for similar procedures see: Garrido-Vásquez & Schubö, 2014; Littman & Kalanthroff, 2021, 2022; Tipper et al., 2006). This was done in line with previous suggestions according to which affordances tendencies may become more prominent under conditions that emphasize the object’s graspability or the contextual correspondence of perception and action (Girardi et al., 2010; Lu & Cheng, 2013; Netelenbos & Gonzalez, 2015). Next, participants completed a practice block and then performed an experimental block consisting of 288 trials. Each trial began with a 500 ms fixation (a white plus sign at the center of a black screen), followed by the target stimulus, which appeared for 1500 ms or until keypress, and an additional 500 ms of a black screen. Trials in which there was no response within 1500 ms were coded as missed trials and were not further analyzed. The target stimuli consisted of one of three black-and-white images (a cup, a teapot, or a house), obtained from the Amsterdam Library of Object Images (ALOI; Geusebroek et al., 2005). Stimuli were 767 × 574 pixels and appeared at the center of a black screen. Each stimulus appeared either in its upright form or in its inverted form on a random selection of 50% of the trials, a common procedure in affordances tasks (e.g., Iani et al., 2019; Saccone et al., 2016; Tucker & Ellis, 1998). Participants were instructed to indicate whether each object appeared in its upright or inverted form as quickly and as accurately as possible by pressing the “A” key with their left index finger or the “L” key with their right index finger. The mapping rule was counterbalanced across participants. To provoke an affordances effect, the cup and the teapot stimuli had a horizontal handle that could appear on the right or the left side of the object. In half of the trials, the handle direction was congruent with the correct response key (i.e., both on the right or both on the left), while in the other half the direction of the handle and the correct response key were incongruent (i.e., a left-facing handle and a right correct response key, and vice versa). House images were previously shown to function as a neutral condition in the affordances task (Littman & Kalanthroff, 2022), serving as large objects that do not afford grasping tendencies (Chao & Martin, 2000). Within each orientation (upright vs. inverted), trials were divided equally among the neutral, congruent, and incongruent conditions and presented in random order. As tools were previously shown to evoke affordances effects when presented in their functional orientation, but not in other, non-functional orientations (Bub et al., 2018; Iani et al., 2019; Littman & Kalanthroff, 2022; Masson et al., 2011), we focused our analyses on the upright trials only (and indeed, the inverted-trial data did not produce an affordances effect).
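
For readers who wish to reproduce the design, the following sketch shows one way the balanced trial list described above could be constructed (the study itself was built in Gorilla Experiment Builder; the field names and structure here are illustrative assumptions).

```python
"""Illustrative construction of a 288-trial list matching the described design."""
import random

conditions = ["neutral", "congruent", "incongruent"]
orientations = ["upright", "inverted"]
n_per_cell = 288 // (len(conditions) * len(orientations))  # 48 trials per cell

trials = [
    {"condition": c, "orientation": o}
    for c in conditions
    for o in orientations
    for _ in range(n_per_cell)
]
random.shuffle(trials)  # random presentation order
assert len(trials) == 288
```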

Statistical analysis

We began by trimming RTs shorter than 150 ms (0.06% of the data). To evaluate the within-task effects, a two-way repeated-measures analysis of variance (ANOVA) was applied to the RT data of correct responses, with congruency condition (congruent vs. neutral vs. incongruent) and time of assessment (Time 1, Time 2) as within-subject factors. Next, we assessed test–retest correlations of the RT data of correct responses. First, we employed the traditional summary-statistics method and calculated Pearson’s r correlations between the mean RTs of Times 1 and 2 for the congruency conditions (congruent, incongruent, and neutral) and the congruency effects (congruency, interference, and reversed facilitation). Following this, we assessed test–retest reliability for the congruency conditions and congruency effects by using a Bayesian generative model similar to the one presented by Haines et al. (2020). In this model, group-level normal distributions serve as prior distributions on the individual-level parameters. This allows information to be pooled across participants such that each individual-level estimate influences its corresponding group-level mean and standard deviation estimates, which in turn influence all other individual-level estimates. This interplay between the individual- and group-level parameters constitutes hierarchical pooling, a core feature of hierarchical models, which increases the precision of individual-level estimates and allows the group- and individual-level model parameters to be estimated simultaneously (Gelman & Pardoe, 2006). Here, we analyzed our data by using a generative model with a lognormal link function, in which changes in stimulus difficulty produce changes in both the means and the variances of RT distributions (Rouder et al., 2015). Like empirical RT distributions, the lognormal is a positive-only, right-skewed distribution. For further discussion of analyzing RT data with a lognormal link function, see Lo and Andrews (2015). We used four chains, each with 3000 iterations and 1500 warm-up samples. A figure of the Rhat distribution and caterpillar plots are presented in section S1 of the Supplementary Material.
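
For illustration, the sketch below implements a simplified hierarchical lognormal model in PyMC rather than the Stan model actually used; it covers only congruent and incongruent trials and treats the test–retest correlation of the person-level congruency effect as an explicit parameter (rho), in the spirit of Haines et al. (2020). Variable names, priors, and the data format are assumptions, not the authors' code.

```python
"""Simplified PyMC sketch of a hierarchical lognormal test-retest model."""
import pymc as pm

# Assumed long-format inputs (NumPy arrays, one entry per correct trial):
# log_rt : log reaction time
# subj   : participant index, 0..n_subj-1
# incong : 1 for incongruent trials, 0 for congruent trials
# time2  : 1 for Time 2 trials, 0 for Time 1 trials
def build_model(log_rt, subj, incong, time2, n_subj):
    with pm.Model() as model:
        # Group-level means and spreads, separately per session (index 0/1)
        mu_b = pm.Normal("mu_b", 6.3, 1.0, shape=2)   # baseline log RT
        sd_b = pm.HalfNormal("sd_b", 0.5, shape=2)
        mu_d = pm.Normal("mu_d", 0.0, 0.5, shape=2)   # congruency effect (log scale)
        sd_d = pm.HalfNormal("sd_d", 0.5, shape=2)

        # Test-retest correlation of the person-level congruency effect
        rho = pm.Uniform("rho", -1.0, 1.0)

        # Person-level parameters; drawing Time 2 effects conditionally on Time 1
        # effects induces a bivariate normal prior with correlation rho
        b1 = pm.Normal("b1", mu_b[0], sd_b[0], shape=n_subj)
        b2 = pm.Normal("b2", mu_b[1], sd_b[1], shape=n_subj)
        d1 = pm.Normal("d1", mu_d[0], sd_d[0], shape=n_subj)
        d2 = pm.Normal(
            "d2",
            mu=mu_d[1] + rho * (sd_d[1] / sd_d[0]) * (d1 - mu_d[0]),
            sigma=sd_d[1] * pm.math.sqrt(1.0 - rho ** 2),
            shape=n_subj,
        )

        # Lognormal likelihood: a normal model on log RT with trial-level noise
        sigma_e = pm.HalfNormal("sigma_e", 0.5)
        mu_trial = (1 - time2) * (b1[subj] + d1[subj] * incong) + time2 * (
            b2[subj] + d2[subj] * incong
        )
        pm.Normal("obs", mu=mu_trial, sigma=sigma_e, observed=log_rt)
    return model  # the posterior of "rho" is the test-retest estimate

# Sampling would mirror the settings described above, e.g.:
# with build_model(...): idata = pm.sample(draws=3000, tune=1500, chains=4)
```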

We estimated parameters using Stan (version 2.2.1), a probabilistic programming language that uses a variant of Markov chain Monte Carlo to estimate posterior distributions for parameters within Bayesian models (Carpenter et al., 2017). For the generative-model analysis, we report maximum a posteriori (MAP) estimates of the posteriors, i.e., the value associated with the highest probability density (the “peak” of the posterior distribution), which serves as an estimate of the mode for continuous parameters. To further illustrate the interpretability of the posterior distributions, we report 89% posterior highest density intervals (HDIs), which have been deemed stable in Bayesian analyses (Kruschke, 2014; Makowski et al., 2019). The HDI is a generalization of the concept of the mode, but it is an interval rather than a single value.
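
As a concrete illustration of these summaries, the helper functions below (assumed helpers, not the authors' code) compute a KDE-based MAP estimate and an 89% HDI directly from posterior draws.

```python
"""Sketch of MAP and 89% HDI summaries computed from posterior draws."""
import numpy as np
from scipy.stats import gaussian_kde

def map_estimate(samples):
    # Approximate the mode of a continuous posterior as the peak of a KDE
    kde = gaussian_kde(samples)
    grid = np.linspace(samples.min(), samples.max(), 1000)
    return grid[np.argmax(kde(grid))]

def hdi(samples, prob=0.89):
    # Narrowest interval containing `prob` of the sorted posterior draws
    x = np.sort(samples)
    k = int(np.ceil(prob * len(x)))
    widths = x[k - 1:] - x[: len(x) - k + 1]
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]

# Example with synthetic draws
draws = np.random.default_rng(2).normal(0.75, 0.05, 6000)
print(map_estimate(draws), hdi(draws))
```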

Results

We began by inspecting the within-task effects for each administration time. Table 1 presents RTs and accuracy rates. As can be seen in Table 1, the task yielded significant congruency effects in both administrations.

Table 1 Mean (standard deviation for congruency conditions, standard error for congruency effects) reaction time (ms) and accuracy rates on the affordances task at Time 1 and Time 2

To assess test–retest reliability, we began by evaluating correlations for the congruency conditions and effects by calculating Pearson’s r correlations between Time 1 and Time 2 for each pair of values. As can be seen in Table 2, Pearson’s r test–retest measures for the congruent, incongruent, and neutral conditions resulted in acceptable values between .72 and .75. However, in line with the “reliability paradox,” when inspecting test–retest correlations for the congruency, interference, and reversed facilitation effects, test–retest measures dropped substantially, yielding weak correlations between .22 and .29.

Table 2 Test–retest correlations between performance at Time 1 and Time 2 for the affordances task estimated by Pearson’s r correlations and the Bayesian generative model

Next, we assessed test–retest correlations by using the Bayesian generative-model method. In comparison to the traditional (Pearson) test–retest evaluation, the generative-model method resulted in a modest improvement in the test–retest values for the congruent, incongruent, and neutral trials, yielding acceptable to good correlations between .77 and .79 (see Table 2). More importantly, the generative-model method resulted in considerably higher test–retest values for the congruency, interference, and reversed facilitation effects, with the affordances task yielding acceptable to good test–retest measures for all three congruency effects, with correlations between .70 and .83 (see Table 2).

Discussion

In the present study, we evaluated the reliability of the affordances task for the first time. To that end, we administered the task online twice, with a one-week gap. We assessed the task’s test–retest reliability both by using traditional summary statistics and by employing a hierarchical Bayesian generative model that has recently been suggested as more suitable for the assessment of individual differences (Chen et al., 2021; Haines et al., 2020; Rouder & Haaf, 2019; Snijder et al., 2022). The task’s online administration replicated common group-level results obtained in in-lab experiments, yielding congruency, interference, and reversed facilitation effects (see Table 1). Test–retest correlations for the three congruency conditions were satisfactory under both traditional summary statistics and the Bayesian generative model. However, for the congruency effects (congruency, interference, and reversed facilitation), which are the more important measures in the task, the test–retest correlations obtained by the use of traditional summary statistics were weak, while the employment of the Bayesian generative model improved those correlations considerably, resulting in satisfactory and reliable test–retest estimates. These results raise several important points for consideration.

First, the current findings provide the first assessment of the affordances task test–retest reliability, indicating the task’s stability for the study of individual differences. These findings may prove significant for future investigations of visuomotor and neurocognitive processes. Recent studies have demonstrated that the affordances task consists of cognitive conflicts at both the task level and the level of response (Littman & Kalanthroff, 2021, 2022). Response conflict manifests in the task as a conflict between responding with one’s right versus left hand, illustrated by the longer RT to incongruent than to congruent trials (i.e., the congruency effect). Task conflict, which evolves between competing task demands (Kalanthroff et al., 2018a; Littman et al., 2019), manifests as a conflict between the goal-directed object classification task versus the stimulus-driven object grasping task. Hence, task conflict is indicated by longer RT to congruent (conflict-laden) than to neutral (conflict-free) trials (i.e., the reversed facilitation effect, see Littman & Kalanthroff, 2022). The current results provide evidence of good test–retest reliability for the congruency and reversed facilitation effects, thus supporting the task’s reliability in the assessment of task and response conflicts. Importantly, while past studies mainly demonstrated the emergence of task conflict under conditions that trigger mental reactions such as word-reading in the Stroop task (Goldfarb & Henik, 2007; Parris, 2014) and object recognition in the object-interference task (La Heij et al., 2010; La Heij & Boelens, 2011; Prevor & Diamond, 2005), the affordances task is the first to demonstrate the emergence of task conflict under conditions that trigger a behavioral reaction (object-grasping). As such, the affordances task serves as a nonlinguistic, behavioral measure of task conflict that is potentially closer to participants’ everyday experiences. The current findings also illustrate the affordances task as a promising tool for the assessment of control over stimulus-driven habitual behaviors in healthy populations as well as in pathological populations characterized by increased reliance on stimulus-driven habitual behaviors, such as obsessive-compulsive disorder patients (Gillan et al., 2014, 2015; Kalanthroff et al., 2017, 2018b; Robbins et al., 2012), patients with substance use or behavioral addictions (Voon et al., 2015), and individuals suffering from a pre-supplementary motor area brain lesion (Haggard, 2008). Importantly, while stimulus-driven habitual behaviors have been demonstrated using various tasks, the current findings support the use of the affordances task as a unique measure of the specific cognitive control impairments that result in increased reliance on stimulus-driven habitual behaviors.

An important point regarding the affordances task needs to be acknowledged. Although many researchers attribute the affordances effect to the automatic activation of grasping responses, an alternative view has been suggested. According to this suggestion, the affordances effect represents a spatial correspondence effect, essentially similar to the Simon effect for stimulus location, and not grasping tendencies (Proctor & Miles, 2014). According to this approach, the effect is not triggered by a conflict between a correct response behavior and an incongruent activation of a stimulus-driven behavior, but rather by a conflict between a correct response behavior and an incongruent spatial cue. In other words, this alternative approach suggests that the conflict would be evident regardless of the graspability characteristics of the presented stimulus, since only its asymmetrical spatial form determines the conflict. Behavioral studies that examined the two alternatives have yielded mixed results: while some concluded that the observed affordances effects may be explained by a mere spatial correspondence effect (Cho & Proctor, 2010; Proctor et al., 2017; Song et al., 2014; Xiong et al., 2019), others have demonstrated that dissociable Simon and affordances effects can co-occur and that the affordances effect may emerge even in the absence of spatial correspondence (Azaad & Laham, 2019; Buccino et al., 2009; Iani et al., 2019; Netelenbos & Gonzalez, 2015; Pappas, 2014; Saccone et al., 2016; Scerrati et al., 2020; Symes et al., 2005). Importantly, a wide body of brain imaging studies has demonstrated the activation of premotor areas when participants view manipulable objects (Chao & Martin, 2000; Creem-Regehr & Lee, 2005; Grafton et al., 1997; Grezes & Decety, 2002; Proverbio et al., 2011), an activation that is absent in classic Simon tasks (e.g., Kerns, 2006), as well as unique patterns of brain activity for manipulable objects that go beyond the effects of spatial correspondence (Buccino et al., 2009; Rice et al., 2007). Nonetheless, the findings of recent studies have refined the initial notion of complete automaticity of the affordances effect, suggesting that it becomes more behaviorally evident when objects are presented in their functional orientation (Bub et al., 2018; Masson et al., 2011) and under conditions that emphasize the object’s graspability (Girardi et al., 2010; Lu & Cheng, 2013). To ascertain the emergence of an affordances effect, we followed the specific suggestions made by these studies. In doing so, we believe that the current results reflect a reliable measure of control over stimulus-driven motor behavior.

Second, the current study also allows us to evaluate the task’s functioning and reliability under online administration conditions. In recent years, the online administration of cognitive tasks has gained popularity due to its ability to save resources, allow large sample sizes, and reach diverse populations across the globe (Feenstra et al., 2018; Gillan & Daw, 2016; Hansen et al., 2016; Haworth et al., 2007; Ruano et al., 2016). This tendency became even more prominent during the COVID-19 pandemic, when the administration of in-lab experiments became limited or impossible for periods of time. Recently, a wide body of studies has reported encouraging findings following online administrations of a variety of cognitive tasks (Anwyl-Irvine et al., 2020; Crump et al., 2013; de Leeuw & Motz, 2016; Hilbig, 2016; Ratcliff & Hendrickson, 2021; Semmelmann & Weigelt, 2017; Simcox & Fiez, 2014). Overall, these studies reported results that were comparable to those typically obtained under in-lab administration. The results of the current study are comparable to those of previous studies which used similar task designs in a laboratory setting (e.g., Littman & Kalanthroff, 2021, 2022; Saccone et al., 2016; Tucker & Ellis, 1998). Specifically, a comparison of the current study results to those reported by Littman and Kalanthroff (2022), Experiment 1, which used an identical design but was administered in the lab, yielded very similar results, albeit with minor differences in general RTs, which were somewhat shorter in the current study. The full data are presented in section S3 of the Supplementary Material. Most importantly, the effects found in the current study were all in the same direction as the ones reported by Littman and Kalanthroff (2022), and were all significant, yielding medium to large effect sizes. Furthermore, the current results provide essential data regarding the reliability of an online administration of the task, together with the application of generative modeling to behavioral data obtained online. Alongside their advantages, web-based experiments are limited in that administration may be less standardized than in the lab and may introduce additional sources of noise. Here, the replication of the task’s effects under these (noisier) conditions strengthens their replicability and the utility of the web-based administration of the affordances task.

Lastly, the inspection of test–retest reliability using a traditional method of assessment (Pearson’s r) resulted in weak test–retest correlations for the congruency, interference, and reversed facilitation effects. These findings replicate the “reliability paradox” that is often observed when using summary statistics to assess individual differences in cognitive tasks that yield robust group-level effects (Haines et al., 2020; Rouder & Haaf, 2019), typically resulting in low estimates of the congruency effects (Bender et al., 2016; Hedge et al., 2018; Paap & Sawi, 2016; Soveri et al., 2018; Strauss et al., 2005). Following this, the application of the hierarchical Bayesian generative model resulted in a substantial improvement in the test–retest estimates of the congruency, interference, and reversed facilitation effects, all yielding acceptable or good test–retest reliability. These results are in line with recent findings that illustrated the utility of generative models in the assessment of individual differences (Chen et al., 2021; Haines et al., 2020; Rouder & Haaf, 2019). Haines et al. (2020) demonstrated that employing generative models results in richer and more accurate test–retest estimates for a variety of well-established cognitive paradigms, including the Stroop, flanker, and Posner tasks. Additionally, Chen et al. (2021) showed how the use of generative models accounts for trial-level variability and incorporates it into the model, allowing for a more precise evaluation of reliability in comparison to the summary-statistics approach, in which trial-level variability is treated as measurement error. Importantly, the employment of generative models does not automatically result in an inflation of test–retest measures but does so only when such changes are warranted by the data (see Haines et al., 2020). The results of the current study are in line with the recent findings presented by Chen et al. (2021) and Haines et al. (2020), demonstrating the importance of employing finer, more capable tools (such as Bayesian generative models) for the psychometric assessment of cognitive tasks. Such methods may deepen our understanding of the tasks themselves, their psychometric properties, and the cognitive structures they are designed to measure.

The findings of our study demonstrate that the affordances task can yield reliable individual differences. However, this is only the first step in a broader psychometric investigation. It is crucial to further examine the variability of these individual differences for clinical use and to determine their relationships to other relevant constructs. Additionally, our study suggests that Bayesian hierarchical models are an effective method for understanding these individual differences (Draheim et al., 2019, 2021), and we recommend continuing to use this approach to account for uncertainty in the affordances task. Further research is needed to fully understand the psychometric potential of the affordances task.

Conclusion

The affordances task can serve as an important tool for studying aspects of cognitive control and visuomotor functioning. In the current study, we assessed the task’s test–retest reliability for the first time by using a hierarchical Bayesian generative model in an online administration. The affordances task yielded good test–retest properties, supporting its applicability to the study of individual differences. The employment of the generative model replicated recent findings demonstrating its higher precision in the assessment of test–retest reliability relative to traditional methods of assessment that are based on summary statistics. Bayesian generative models may thus be used in future evaluations of individual differences and of the reliability of cognitive tasks.