Introduction
Eyewitnesses often play an important role in the investigation of crime. They testify to the course of events and the identity of perpetrators. Current procedures for establishing the identity of a perpetrator largely rely on explicit identification from lineups or showups. Fifty years of laboratory, field, and archival research have shown that average error rates for lineups and showups can be as high as 50% (e.g., Clark et al.,
2008; Fitzgerald & Price,
2015) and case studies have painfully demonstrated that misidentifications can lead to wrongful convictions. Examples of such wrongful convictions are known all across North America and Europe (Christianson et al.,
1992; Davies & Griffiths,
2008; Garrett,
2011; Lindemans,
2019; Thompson-Cannino et al.,
2009; van Koppen & van der Horst,
2006). As a result, some countries dismiss explicit identification procedures altogether (e.g., South Korea, Indonesia) and scholars have started to question researchers’ sustained commitment to traditional lineups (Brewer & Wells,
2011; Wells et al.,
2006) and called for the development of radically different procedures (Dupuis & Lindsay,
2007). Indirect measures, being less intentional, faster, and more stimulus driven, might offer such a radical alternative. The Concealed Information Test (CIT; Lykken,
1959) provides such an indirect assessment of recognition. We conducted two experiments that tested the usefulness of the CIT as a means of diagnosing face recognition when encoding conditions are favorable.
The CIT is a well-established technique for memory detection (Lykken,
1959; for a review see Verschuere et al.,
2011). Similar to a lineup, a CIT consists of several stimuli, only one of which is crime related (e.g., stolen goods: 500 EUR banknote), embedded in a series of plausible, yet crime-unrelated answers (e.g., a cell phone, a credit card, a laptop, a Rolex). However, rather than relying on explicit responses, the CIT infers recognition from indirect measures, such as skin conductance, respiration, and the P300 event-related potential. In our above example, police could ask the suspect about the stolen goods: Was it … A cell phone? … A credit card? …A laptop? … A Rolex? … A 500 EUR banknote? Stronger physiological reactions to the actual stolen banknote, compared to other items, indicate concealed recognition. When combining multiple questions, for example about a weapon, the crime scene, and the location of the crime, the CIT can detect recognition with high validity. The diagnostic efficiency of the CIT, measured as the area under the receiver operating characteristic curve, is about 0.82–0.94 (Meijer et al.,
2016). In other words, a randomly chosen person who experiences recognition for the critical stimulus has an 82–94% chance of responding more strongly to it in the CIT than a randomly chosen person who does not experience recognition. In addition, a meta-analysis reported large effect sizes for different physiological measures, varying between d = 0.89 and 1.89 (Meijer et al., 2014).
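The probabilistic reading of the area under the curve can be made concrete with a small sketch. It computes the share of all (informed, uninformed) pairs in which the informed person shows the stronger response, with ties counted as half. The function name and the response scores are hypothetical and chosen only for illustration; they are not data from any cited study.

```python
from itertools import product


def auc_probability(informed, uninformed):
    """Share of (informed, uninformed) pairs in which the informed score is larger.

    Ties count as half a win. This pairwise proportion equals the area under
    the ROC curve for separating the two groups.
    """
    pairs = list(product(informed, uninformed))
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a, b in pairs)
    return wins / len(pairs)


# Hypothetical response scores to the critical stimulus (arbitrary units).
informed = [0.9, 0.4, 1.2, 0.7, 0.1]
uninformed = [0.0, 0.3, -0.2, 0.5, 0.2]

auc = auc_probability(informed, uninformed)
print(f"AUC = {auc:.2f}")
```

With these made-up scores, 21 of the 25 pairs favor the informed person, so the value falls inside the 0.82–0.94 range quoted above.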
Recently, reaction times have enjoyed increasing popularity as the response measure in the CIT (for a theoretical analysis, see Verschuere & De Houwer,
2011; for a meta-analytic review, see Suchotzki et al.,
2017). Reaction times constitute an attractive outcome measure because their use is resource friendly, requiring only a single computer, and Web-based testing is possible with high reliability and validity (Kleinberg & Verschuere,
2015). In the reaction time-based CIT (RT-CIT), the stimuli appear sequentially on the computer screen for a brief interval. The three different types of stimuli are probes, irrelevants, and targets. The probe is the crime-related stimulus, and the irrelevants are foils. Participants learn that they should press one of two keys as quickly and as accurately as possible when they see the probe or an irrelevant. The other key is reserved for the so-called targets. Targets are non-crime-related stimuli that participants study just before the test. The use of a response deadline prevents strategic slowing (Suchotzki et al.,
2021). Building on the example above, participants may be instructed that the CIT will examine recognition of stolen goods and asked to press the YES button whenever encountering the target (e.g., a laptop) and the NO button for all other items. For innocent (uninformed) participants, all NO-reaction times should be similar. For guilty (informed) participants, the option 500 EUR banknote should stand out and affect participants’ response. Longer reaction times for 500 EUR banknote than for the other NO responses provide an index of recognition. A recent meta-analysis reported a large effect size of Cohen’s d = 1.04 (corrected), confirming the diagnostic efficiency of the RT-CIT (Suchotzki et al., 2017).
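As a minimal sketch of how such an index could be computed, the code below takes each participant's mean probe reaction time minus the mean irrelevant reaction time and then standardizes the mean difference by the standard deviation of the differences. This within-subject convention (often written d_z) is an assumption made here for illustration and is not necessarily the corrected estimate used in the cited meta-analysis. All names and reaction times are hypothetical.

```python
from statistics import mean, stdev


def cit_effect(trials):
    """Mean probe RT minus mean irrelevant RT (in ms) for one participant.

    Target trials are ignored because they require the other response key.
    """
    probe = [rt for kind, rt in trials if kind == "probe"]
    irrelevant = [rt for kind, rt in trials if kind == "irrelevant"]
    return mean(probe) - mean(irrelevant)


def cohens_dz(differences):
    """Standardized mean difference: mean of the differences over their SD."""
    return mean(differences) / stdev(differences)


# Hypothetical trials per participant: (stimulus type, reaction time in ms).
participants = {
    "p1": [("probe", 520), ("probe", 540), ("irrelevant", 480), ("irrelevant", 500), ("target", 610)],
    "p2": [("probe", 500), ("probe", 520), ("irrelevant", 490), ("irrelevant", 510), ("target", 590)],
    "p3": [("probe", 470), ("probe", 490), ("irrelevant", 450), ("irrelevant", 470), ("target", 600)],
    "p4": [("probe", 560), ("probe", 580), ("irrelevant", 520), ("irrelevant", 560), ("target", 640)],
}

effects = [cit_effect(trials) for trials in participants.values()]
d_z = cohens_dz(effects)
print(effects, round(d_z, 2))
```

A positive difference means slower responses to the probe than to the irrelevants, which is the pattern expected for informed participants. For uninformed participants the differences should scatter around zero.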
The association between recognition and reaction times is theorized to result from familiarity-based responding. After all, the most efficient way to take an RT-CIT is to rely on familiarity—a fast and automatic process (Yonelinas,
2002). For innocent (uninformed) participants, familiarity-based responding leads to correct responses for all stimuli (target: YES, probe: NO, irrelevant: NO). For guilty (informed) participants, familiarity-based responding leads to the correct response for targets and irrelevants, whereas for probes, familiarity (YES, recognized) conflicts with the responses required by the task (NO, not the target). It is this conflict and the required control to override it that slows down the response. Increases in the RT-CIT effect for interventions promoting familiarity-based responding, such as using familiar targets or adding familiarity-related fillers, support this idea (Lukács et al.,
2017; Suchotzki et al.,
2018). Furthermore, probe processing is associated with measures indexing response conflict and resolution (e.g., Hadar et al.,
2012,
2019; Seymour & Schumacher,
2009; Suchotzki et al.,
2015).
First support for the idea that indirect measures in general and the CIT in particular can provide information about
face recognition came from two studies with pre-school and school children (Newcombe & Fox,
1994; Stormark,
2004). Participants viewed slides of playmates or former classmates and unfamiliar faces, with their skin conductance, heart rate, or both being recorded. Participants also provided direct face recognition responses. Direct and indirect measures scored above chance in both studies, but the indirect measures outperformed direct recognition decisions. In the first application of the CIT for the purpose of identifying incidentally encountered faces, participants watched four mock crimes across two testing sessions (Lefebvre et al.,
2007). In the perpetrator-present conditions, participants sequentially viewed the photograph of the perpetrator, the victim, and five foils, while EEG recordings were made. Deviating from the classic CIT procedure, participants had three response options, indicating that a picture depicted the perpetrator, the victim, or another person. In other words, participants made an explicit identification in an ERP-based CIT. The CIT revealed recognition of the perpetrator, as did explicit identification. While the results point to the potential of the CIT for cooperative eyewitness identification, the electrophysiological index of recognition may have been evoked by the explicit identification. Additionally, the facial stimuli of Lefebvre et al. (2007, 2009) were matched for gender, age, race, and hair length, but it is unclear to what extent the stimuli matched in terms of hair color or hair style, and no measures of effective lineup size were provided.
Recent investigations applied a stricter CIT protocol in a typical eyewitness paradigm to assess the usefulness of the CIT as an alternative to classic eyewitness identification procedures (Georgiadou et al.,
2019; Sauerland et al.,
2019). Participants viewed a filmed mock crime and then took an RT-CIT. To ensure the fairness of the CIT, the included pictures were selected with the same procedure that is considered best practice for selecting lineup fillers (Doob & Kirshenbaum,
1973; cf. Wells et al.,
1998,
2020). In one experiment, the CIT showed a good capacity to differentiate the film actors (i.e., probes) from fillers (d = 1.21; Georgiadou et al., 2019, Experiment 2b) and a moderate capacity in another (d = 0.39; Sauerland et al., 2019, Experiment 4). Yet, averaged across five experiments, the overall effect size was negligibly small, d = 0.14 (Sauerland et al., 2019).
The RT-CIT effect sizes in the eyewitness identification field tend to be smaller than typically reported for RT-CIT experiments (i.e.,
d = 1.04 in the meta-analysis by Suchotzki et al.,
2017). One reason for this finding is that the probes in a facial identification RT-CIT protocol have to match the irrelevants more closely than in other fields (Georgiadou et al.,
2019, Experiment 1; Sauerland et al.,
2019, Experiment 5). This is necessary for an identification procedure to be fair (Wells et al.,
2020). Differences in encoding conditions and event complexity offer an explanation for the conflicting findings
within facial identification RT-CIT experiments. More specifically, the experiments with moderate to large effects included only two rather than four actors and provided ample close-ups of both (Georgiadou et al.,
2019, Experiment 2b; Sauerland et al.,
2019, Experiment 4). In Georgiadou et al. (2019, Experiment 2b), encoding was additionally enhanced by presenting the pictures of the actors for 15 s after participants had viewed the stimulus film and prior to taking the RT-CIT. From an applied eyewitness recognition perspective, this setup was somewhat flawed, though, because the presented picture was identical to the picture used in the CIT (Burton,
2013). Nevertheless, the two experiments combined suggest that observation time might be key for applying the RT-CIT in face recognition. Observation time is associated with initial memory strength and predictive of face recognition performance (Bornstein et al.,
2012). It is possible that a certain degree of memory strength is required to ensure reliable performance in the CIT. Although observation time is not under the control of investigators, this finding might be useful in cases with known long observation time.
In two preregistered experiments, we manipulated overall observation time, close-up duration, and facial viewing time during encoding to test whether encoding conditions critically determine the validity of the RT-CIT as an index of face recognition. Extending prior work, we included a classic lineup condition as a benchmark of eyewitness performance. In Experiment 1, participants viewed a stimulus film with shorter or longer observation time before completing an RT-CIT or making lineup decisions from probe-present lineups. In Experiment 2, we increased the observation time difference across conditions further and added probe-absent conditions to test to what extent insights from previous work (Georgiadou et al.,
2019; Sauerland et al.,
2019) apply to a situation where the suspect is in fact innocent. We expected a stronger CIT effect (i.e., difference in reaction times to probes vs. irrelevants) when observation time was longer, rather than shorter (CIT encoding effect; hypothesis 1). In Experiment 2, we additionally hypothesized a larger CIT effect in the probe-present compared to the probe-absent condition (hypothesis 2). In Experiment 2, we also predicted that identification performance with lineups would vary as a function of observation time (lineup encoding effect; hypothesis 3). The relative capacity of the CIT and lineups to diagnose face recognition is of strong applied interest, but we had no hypotheses for this comparison.
Discussion
In two preregistered experiments, we tested whether a reaction time-based computerized test might serve as a radical alternative to the classic lineup (under favorable encoding conditions). Based on previous experiments (Georgiadou et al.,
2019; Sauerland et al.,
2019), we expected favorable encoding conditions to improve the capacity of the CIT to diagnose face recognition (CIT encoding effect; hypothesis 1). Extending earlier work, we also included a classic lineup condition as a benchmark of eyewitness performance. While overall capacity to diagnose face recognition by means of the CIT was strong (Cohen’s ds between 0.74 and 0.97), observation time did not moderate this effect. In Experiment 2, we tested, for the first time, the usefulness of an RT-CIT protocol for diagnosing the absence of face recognition when the perpetrator was absent (hypothesis 2). As expected, probe presence moderated the CIT effect: there was a CIT effect in the probe-present condition (d = 0.74 to 0.76), but not in the probe-absent CIT condition (d = 0.02). Replicating earlier work (Bornstein et al.,
2012), identification performance varied as a function of observation time in Experiment 2 (hypothesis 3), but only for probe-absent (and not probe-present) lineups. Comparisons of identification performance in the RT-CIT vs. lineups were inconclusive.
Essentially, both the CIT and the lineup are memory tests. Generally, better encoding conditions should, through increased memory strength, improve CIT and lineup performance. One possible explanation for our finding that observation time did not (CIT) or only partly (lineup) moderate identification performance could be that shorter observation times were still relatively long, with at least 24 s of overall viewing time, 12 s of facial viewing time, and 5 s of close-ups. Indeed, a meta-analysis on the effect of exposure time on facial recognition accuracy (Bornstein et al.,
2012) reported the strongest effects when short observation times were only a few seconds long and the ratio of observation time conditions was 1:4 or higher. While this could be seen as a limitation of this work, we were mindful of creating encoding conditions that still justified an identification procedure. More specifically, we were wary of creating a condition that would render identification performance generally unreliable, as reported, for example, at overall viewing times of 12 s (Memon et al.,
2003). Rather, we aimed at creating conditions that were in line with the notion that an identification procedure should only take place when the encoding conditions justify a memory test (Wagenaar & Van Der Schrier,
1996). In real-life cases, investigators can enquire about the viewing conditions in the prelineup interview (Wells et al.,
2020).
To be useful in the field, an identification procedure needs to diagnose not only the presence of face recognition, but also the lack thereof. In other words, a procedure needs to show that it leads to accurate decisions when the police suspect is guilty, but also when the police suspect is innocent. We therefore, for the first time, included a probe-absent CIT condition in Experiment 2. Reassuringly, the CIT effect in the probe-absent CIT condition was nil (d = 0.02; i.e., no statistically significant difference in reaction times to probes vs. irrelevants), compared to strong CIT effects in the probe-present condition in both experiments (Experiment 1: d = 0.85; Experiment 2: d = 0.74). These findings indicate a good capacity of the CIT to differentiate between guilty and innocent suspects.
Including a lineup condition allowed us to compare correct classification rates in the CIT with identification performance in lineups. Although Experiment 1 may have been underpowered to pick up differences between the two methods, across both experiments, we found no compelling evidence that one method would outperform the other. If future investigations support this notion of equivalence, the two procedures could be considered comparable in performance, at least under certain conditions. It is possible, however, that one of the two procedures is superior under certain conditions that have yet to be identified. For example, the RT-CIT might be more robust against some factors that are known to impair lineup performance. Indeed, in the current two experiments, observation time did not affect RT-CIT performance in a statistically significant way, whereas it did in probe-absent lineups in Experiment 2.
Limitations and future perspectives
One limitation of this work is that the current two experiments largely used the same stimulus materials as the previous experiments on which this work was based (Georgiadou et al., 2019; Sauerland et al., 2019, Experiment 4). The fact that our conclusions are based on a single set of materials (with modifications) warrants replication. On the other hand, the variation in effect sizes across those experiments points to a susceptibility of the RT-CIT effect to variations in stimuli. For example, compared to Sauerland et al., we edited the clothing in the lineup and CIT photos to be black. Although lineup fairness is similarly vulnerable to bias in the images, the particular clothing worn in lineup pictures is not of the essence, as long as it does not make the suspect stand out (Lindsay et al.,
1987; Wetmore et al.,
2015). However, for the RT-CIT, variations in clothing color or pattern can provide undesired cues for recognition that affect reaction times, independent of
face recognition. Similarly, compared to Georgiadou et al. (
2019), we edited the target and irrelevant pictures corresponding to one probe to include a mole. Although the effective lineup size for this probe was decent without editing the moles in, it is possible that this discrepancy in the images affected the reaction times in the earlier experiments. Another point worth discussing is that the CIT effects we observed here were well below the average effect size reported in a meta-analysis, namely
d = 1.04 ([0.93; 1.17], Suchotzki et al.,
2017). These large effects are most likely due to the use of autobiographic, self-relevant stimuli (e.g., participant’s name, address) in memory detection experiments. Additionally, these studies use several groups of stimuli (e.g., weapons, crime locations, names of co-perpetrators), whereas for our purposes, we are limited to faces. Options for enhancing the size of the CIT effect include increasing the number of targets, using familiar targets (cf. Suchotzki et al.,
2018), and increasing the number of probes, for example by adding photographs of different aspects of a person (e.g., body vs. face) or by using other information known about the perpetrator (e.g., clothing, jewelry, or a bag; cf. Pryke et al.,
2004; Sauerland & Sporer,
2008). Finally, differences in the lineup vs. CIT instructions (i.e., to pay attention to every detail vs. the faces, respectively) are a limitation of Experiment 1. Clearly, for comparing the usefulness of each method, facial encoding should be identical. Directing special attention to the faces in the CIT but not the lineup might have introduced a bias in favor of the CIT in Experiment 1. Yet, the findings of Experiments 1 and 2 were strikingly similar. Thus, it seems that the differences in the instructions did not have a meaningful impact on the pattern of results.
Thus far, researchers have tested the usefulness of RT-CIT as a means of diagnosing face recognition under pristine conditions in ten experiments (the current work; Georgiadou et al.,
2019; Sauerland et al.,
2019). One possible next step could be to investigate how the RT-CIT fares, in comparison to lineups, in the presence of bias, for example in the (dis)similarity of facial stimuli or the administration of the procedure. Given the implicit nature of the RT-CIT, the procedure might be less prone to factors related to the social situation that unfolds during lineup administration, such as experimenter effects and demand characteristics (Wells & Luus,
1990).