Introduction: The n-back working memory task

The n-back task is widely used as a way of placing a continuous demand on working memory (WM) in neuroimaging and behavioural dual-task experiments (for reviews, see Jaeggi, Buschkuehl, Perrig, & Meier, 2010; Kane, Conway, Miura, & Colflesh, 2007; Owen, McMillan, Laird, & Bullmore, 2005). This task requires online monitoring, updating of WM, and rule-governed as opposed to familiarity-based decisions, and therefore loads the executive component of WM (e.g., Owen et al., 2005). Participants monitor a series of visually presented digits or letters and decide whether each item repeats the one presented n back in the sequence. Participants may be asked to press a response button when the current stimulus is an n-back repeat or, alternatively, may respond to each stimulus in turn by pressing one of two keys for yes/no. A key advantage of this method is that the demands on executive WM can be easily adjusted by changing the size of n (i.e., by comparing 0, 1, 2, or 3 back). In addition, researchers often contrast n-back tasks involving different materials and modalities—for example, verbal information (i.e., visually presented digits or letters) and spatial locations (e.g., Nystrom et al., 2000; Smith, Jonides, & Koeppe, 1996).

Dual-task studies frequently use n-back tasks to examine the influence of divided attention on a primary task (e.g., Baddeley, Hitch, & Allen, 2009; McKinnon & Moscovitch, 2007). Researchers have also compared verbal and spatial n-back tasks to determine the involvement of the verbal and visuospatial components of WM in different tasks or contexts (e.g., Baddeley et al., 2009). However, n-back tasks typically involve visual rather than auditory presentation (even for verbal materials) and, to facilitate the data analysis, almost invariably involve buttonpress as opposed to spoken responses. This prevents the n-back paradigm from being used in combination with primary tasks that involve vision and action, such as sequential object use—a difficulty addressed by the N-backer program presented here.

Sequential object use is thought to be executively demanding (Duncan, Emslie, Williams, Johnson, & Freer, 1996; Giovannetti, Schwartz, & Buxbaum, 2007; Morady & Humphreys, 2009) and is impaired in clinical populations with attentional deficits following stroke or dementia (Corbett, Jefferies, & Lambon Ralph, 2009; Giovannetti et al., 2008; Giovannetti, Schmidt, Sestito, Libon, & Gallo, 2006; Schwartz et al., 1998). Consequently, object use tasks are vulnerable to disruption by demanding secondary tasks (see Giovannetti et al., 2007; Morady & Humphreys, 2009). Dual-task studies provide a means of (a) exploring which aspects of sequential object use place the heaviest demands on attention and (b) evaluating the efficacy of technologies designed to assist everyday activities involving complex object use, such as cooking. We have therefore developed N-backer: an automated version of n-back, involving auditory presentation and spoken responses, allowing us to explore the impact of divided attention on sequential object use tasks that last for several minutes. Participants listen to a series of digits and attempt to repeat them with a lag of 1, 2, or 3 back. For example, on hearing the sequence 5, 2, 8, 9, 1 . . . , a participant performing 2-back would stay silent for the first two digits, then say “5” after hearing 8, “2” after hearing 9, and so on. This version of n-back, requiring spoken recall of each digit, has a number of advantages: It places a substantial and continuous load on attention and working memory, it overcomes the need to use a two-alternative forced choice decision (i.e., buttonpress responses to indicate whether the current digit matches the digit n back), and it allows for free movement of the hands and body. The computer is trained to recognise participants’ spoken responses, and N-backer determines whether each spoken digit is correct or incorrect, avoiding the need for time-consuming manual coding. 
We examined the impact of this automated 2-back task on the Naturalistic Action Test (NAT; Schwartz, Buxbaum, Ferraro, Veramonti, & Segal, 2002).
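The 2-back example above (5, 2, 8, 9, 1) can be expressed as a short sketch. The function name `expected_responses` is hypothetical and is not part of N-backer; it simply lists what a perfectly accurate participant would say after each presented digit.

```python
from typing import List, Optional


def expected_responses(digits: List[int], n: int) -> List[Optional[int]]:
    """Return, for each presented digit, the digit that should be spoken.

    The correct response is the digit presented n items earlier; None marks
    the first n items, during which the participant stays silent.
    """
    return [digits[i - n] if i >= n else None for i in range(len(digits))]


# Silent for the first two digits, then "5" after hearing 8, "2" after 9, ...
print(expected_responses([5, 2, 8, 9, 1], 2))  # [None, None, 5, 2, 8]
```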

How N-backer works

Principles of operation

N-backer utilises speech recognisers compliant with Microsoft’s Speech Application Programming Interface (SAPI), which is bundled with Microsoft Windows XP, Vista, and Windows 7. The participants wear a wireless headset with headphones and a boom microphone to facilitate free movement. They hear a sequence of synthesised digits through the headphones. This sequence may be of fixed length and read from a file, or randomly generated and continuous until terminated by the experimenter. The rate of presentation is set by the experimenter. The participant’s microphone picks up what he or she says, and speech recognition software records each digit spoken with a time stamp. Time boundaries are set to determine whether the digit spoken occurred within the parameters of the n-back task being performed. A free copy of the N-backer application can be downloaded from http://irgen.ncl.ac.uk/go/nback/.

A sample of the output from the program for a 2-back task is presented in Table 1. By following this table through time from the earliest presented digits (at the top of the table) to the later digits (at the bottom), it is possible to see how N-backer operates. In the first row, the computer presented the digit 5 (“spoken” by the system at speak time 7417). The system registered “5” as a potential response by the participant to this stimulus (at received time 11995). If no response is detected (see Section Speech output and recognition), “Received” is given as −1. N back is the difference between speak time and receive time, divided by the interstimulus interval—here, 1,500 ms. Using criteria detailed in Section Automated scoring of n-back performance, automatic scoring algorithms judged this response to be both correct as a 2-back response (Match = TRUE) and within the right time frame (In Time = TRUE; i.e., not 1 back or 3 back). Consequently, this response was considered to be correct (Pass = TRUE).

Table 1 Output from N-backer

The information in Table 1 is provided by the N-backer software at the end of the experiment as a .CSV file, allowing response accuracy (i.e., the proportion of rows where Pass = TRUE) to be computed. Of course, there is a degree of uncertainty in any system involving the automatic recognition of speech, so in Section An experiment using N-backer we present an empirical case for the validity of the scores obtained.
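As an illustration of how response accuracy might be computed from such a file, the following minimal sketch reads CSV text and returns the proportion of rows where Pass = TRUE. It assumes that the file begins with a header row using the column names shown in Table 1, which may not match the real output exactly. The first data row uses the values discussed above; the second row is invented purely for illustration.

```python
import csv
import io


def accuracy(csv_text: str) -> float:
    """Proportion of rows whose Pass column reads TRUE (case-insensitive)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("no data rows found")
    passed = sum(1 for row in rows if row["Pass"].strip().upper() == "TRUE")
    return passed / len(rows)


# Assumed layout: header row plus one line per presented digit.
sample = """Speak time,Spoken,Receive time,Received,N back,Match,In Time,Pass
7417,5,11995,5,3.05,TRUE,TRUE,TRUE
8917,2,13600,7,3.12,FALSE,TRUE,FALSE
"""

print(accuracy(sample))  # 0.5
```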

In terms of a suggested procedure for using N-backer in a continuous task, pilot experiments showed that better recognition accuracy could be obtained by using the standard Microsoft tools to adjust the software to an individual participant’s voice. These can be found in the Speech Control panel. The participant reads out continuous prose that turns grey as the program recognises it. A user voice profile can then be saved for that participant. It was also found advantageous to have the participant practice the 2-back task to criterion performance before the experiment started (see Section Procedure below).

Basic controls

Figure 1 shows the main interface window of the software. The toolbar controls should be largely self-explanatory. The first group are standard Windows controls for saving the comma-separated value (CSV) text files that contain the data presented in the main panel (Speak time, Spoken, Receive time, etc.). CSV text files are supported by many analysis software tools.

Fig. 1 N-backer main interface window

With long sequences of n-backing, such as those used in the experiment described below, one would normally have the program generate a pseudorandom sequence of numbers for itself (by selecting the Generate button, as in Fig. 1). For other uses of N-backer, one might want to generate the number sequences elsewhere and read them into the application. This can be achieved by generating a CSV file with just column 2 specified and the other seven columns entered as blank. In this case, the Generate button should be unselected; otherwise, N-backer will append new numbers to the end of the specified sequence.
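A digit sequence prepared elsewhere could be written in this layout with a few lines of code. The sketch below is only an illustration: the helper name `sequence_csv` is hypothetical, and it assumes that the file needs no header row and that blank columns are written as empty fields, neither of which is confirmed here.

```python
import csv
import io
import random


def sequence_csv(digits, n_columns=8):
    """Return CSV text with each digit in column 2 and all other columns blank."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for digit in digits:
        row = [""] * n_columns
        row[1] = str(digit)  # column 2 holds the digit to be spoken
        writer.writerow(row)
    return buffer.getvalue()


# A seeded pseudorandom sequence of 20 digits drawn from 1 to 9.
rng = random.Random(0)
digits = [rng.randint(1, 9) for _ in range(20)]
with open("sequence.csv", "w", newline="") as handle:
    handle.write(sequence_csv(digits))
```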

Play, Pause, Restart, Speak, and Listen are used to control presentation. The use of Simulate is explained in Section Automated scoring of n-back performance. With the toolbar settings depicted in Fig. 1, pressing Play will cause the software to highlight the active row and speak the number in that row while recording a timestamp. A speech synthesiser is used to say the numbers aloud to the participant. The active row will advance automatically at the interval defined by the operator. The n-back interstimulus interval is the first setting in the control panel, here 1,500 ms (see Fig. 2).

Fig. 2 N-backer settings window

Speech output and recognition

Speech recognition software is used to recognise the participant’s spoken numbers. A fixed set of single-token “recognition grammars” is used, one for each of the numerals 1 to 9. The speech recogniser is configured to perform recognition only against these grammars, preventing it from trying to match against any other possible speech.

We have found that under normal circumstances, if the standard Microsoft tools to adjust the software to an individual participant’s voice are used, simply limiting the speech recognised to numerals in this way is sufficient to get good hit rates and low false negatives. However, we have provided settings to cater for other circumstances, where it may be necessary to increase the recognition rate of correct answers, at the expense of higher false positive rates (see Fig. 2 for these control panel settings). These use the fact that some numbers are expected and some unexpected. “Unexpected numbers” are numbers that are not expected to be spoken by the participant, since they would not be a correct response given the numbers earlier in the sequence. Numbers become “expected” after a specified amount of time after speak time (the “Numbers become expected after (ms)” setting in Fig. 2), and stay expected until that prompt is matched with a correct answer, or until the presented number exceeds the maximum matching age (“Maximum age to match against (n)” setting).

Unexpected numbers can have lowered recognition weighting (“Unexpected numbers have lower weighting (0–1)” in Fig. 2, set to less than 1), making them less likely to be matched than expected numbers, or lower recognition priority, so that unexpected numbers will lose in a recognition tie (“Unexpected numbers have lower priority (0–100),” set to less than 100). In very extreme circumstances, one might need to set the recogniser to ignore unexpected numbers entirely (“Unexpected numbers are ignored,” set to True). This will result in the Received column registering −1 (unknown) rather than a best-matched digit in the case of an inappropriate response.
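The notion of an "expected" number can be sketched as a simple time-window test. This is one reading of the description above, not N-backer's actual code: the function name, the tuple layout, and the conversion of the maximum matching age from n units to milliseconds (by multiplying by the interstimulus interval) are all assumptions. The first speak time comes from Table 1; the later speak times and digits are invented.

```python
def expected_digits(prompts, now_ms, isi_ms=1500,
                    expected_after_ms=500, max_age_n=3.4):
    """Return the set of digits currently expected as responses.

    prompts is a list of (speak_time_ms, digit, matched) tuples. A prompt is
    expected once expected_after_ms has elapsed since it was spoken, and it
    stays expected until it is matched or exceeds the maximum matching age.
    """
    expected = set()
    for speak_ms, digit, matched in prompts:
        age_ms = now_ms - speak_ms
        if not matched and expected_after_ms <= age_ms <= max_age_n * isi_ms:
            expected.add(digit)
    return expected


prompts = [(7417, 5, False), (8917, 2, False), (10417, 8, False)]
# At 10500 ms the digit 8 was spoken only 83 ms ago, so it is not yet expected.
print(expected_digits(prompts, 10500))
```

Digits outside the returned set would then be candidates for lower weighting, lower priority, or being ignored, depending on the settings chosen.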

Automated scoring of n-back performance

Recognised numbers are compared against any outstanding unmatched numbers, up to the configured maximum matching age (“Maximum age to match against (n)” setting in Fig. 2). Incorrect (recognised but unmatched to an expected number) responses are stored alongside their probable target (i.e., the n-back number that should have been produced). A timestamp for when the spoken response was recognised is recorded with the number spoken. This allows for the computation of the delay between the timestamp recorded when the software issued a command to synthesise the number (“Speak time” in Fig. 1) and the participant’s response. This difference is expressed as n in the “n-back” column—that is, the difference in milliseconds is divided by the interstimulus interval, here 1,500 ms.

For example, consider the first row of data in Table 1. The routine to speak “5” was called at time 7417. Then, 4,578 ms later, at 11995 (Table 1, column 3), the program registered that “5” (column 4) had been spoken, so that “n back” is recorded as 3.05 (column 5). This would appear to be too late, and should thus have been scored as In time = FALSE; in fact, it is scored as In time = TRUE because the limits for n back to be judged as in time were set to between 2.7 and 3.4. This rather confusing situation arises because the first version of the software, used in the experiment described below, had no other way to take account of the delay in the system caused by the time taken by the participant to say the digit and for the speech recogniser to recognise it. These delays are schematically depicted in Fig. 3. The recognition delay is the time between the points at which the participant begins to vocalise the number and at which recognition is complete. This depends on processor speed and the time it takes to say the number. By setting the “Minimum n-back time” parameter to 2.7 (4,100 ms), and “Maximum n-back time” to 3.3 (5,100 ms), we are effectively adding around 1,000 ms for the recognition delay.
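The arithmetic in this example can be checked with a minimal sketch. The function names are hypothetical, the limits are the "Minimum n-back time" and "Maximum n-back time" values of 2.7 and 3.3, and treating both limits as inclusive is an assumption.

```python
def n_back_value(speak_ms, receive_ms, isi_ms=1500):
    """Delay between prompt and recognised response, in units of the ISI."""
    return (receive_ms - speak_ms) / isi_ms


def in_time(n_back, min_n=2.7, max_n=3.3):
    """True when the response falls inside the configured n-back window."""
    return min_n <= n_back <= max_n


# First row of Table 1: spoken at 7417, recognised at 11995 (4,578 ms later).
n = n_back_value(7417, 11995)
print(round(n, 2), in_time(n))  # 3.05 True
```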

Fig. 3 Schematic timing diagram illustrating delay overhead between prompt and response (not based on data)

With the initial version of the software, setting the limits for n back to be judged as “in time” required some trial and error. The latest version of the software includes an extra setting with which the user can specify the recognition delay. To estimate this delay, this new version of N-backer also has a Simulate mode (see the toolbar at the top of main interface window depicted in Fig. 1). Selecting this mode simulates the responses of a highly accurate 1-back participant by speaking the prompt very quietly and the response loudly at the correct time—the microphone is adjusted to hear the response, and any deviation from 1 back can be adjusted for. The recognition delay (“Estimated recognition delay (ms)” setting in Fig. 2) should only have to be set once for a particular computer configuration.
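One way to turn a Simulate run into a delay estimate is to average the deviation of each recognised response from an exact 1-back lag. The sketch below is an assumption about how this could be done, not a description of N-backer's internals. In particular, it assumes that the simulated response is spoken exactly one interstimulus interval after its prompt, and the timing values are invented.

```python
from statistics import mean


def estimate_recognition_delay(pairs, isi_ms=1500, n=1):
    """Mean lateness (ms) of recognised responses relative to an exact n-back lag.

    pairs is a list of (speak_time_ms, receive_time_ms) tuples taken from a
    Simulate-mode run.
    """
    return mean(receive - speak - n * isi_ms for speak, receive in pairs)


# Invented timings: responses recognised 1,000 ms and 1,100 ms late.
simulated = [(0, 2500), (1500, 4100)]
print(estimate_recognition_delay(simulated))
```

The resulting figure would be a candidate value for the "Estimated recognition delay (ms)" setting.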

Finally, to provide continuous feedback to an experimenter on the participant’s performance throughout a study, a simple “success metric” was implemented. The rightmost item in the toolbar at the top of the window depicted in Fig. 1 signals pass or fail in terms of “x correct matches out of the previous y”—for example, where x = 2 and y = 4, the user will “pass” if two or more of the last four matches are within the allowed n-back limits.
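The "x correct matches out of the previous y" rule amounts to a sliding window over recent outcomes. The following sketch shows one possible implementation; the class name `SuccessMetric` is hypothetical, and reporting a fail until enough passes have accumulated is an assumption.

```python
from collections import deque


class SuccessMetric:
    """Pass when at least x of the previous y matches were within limits."""

    def __init__(self, x=2, y=4):
        self.x = x
        self.recent = deque(maxlen=y)  # keeps only the last y outcomes

    def record(self, within_limits):
        """Store one outcome and return the current pass/fail status."""
        self.recent.append(bool(within_limits))
        return sum(self.recent) >= self.x


metric = SuccessMetric(x=2, y=4)
statuses = [metric.record(ok) for ok in [True, False, False, True, False]]
print(statuses)  # [False, False, False, True, False]
```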

The application is written in the C# programming language on Microsoft .NET Framework 3.5 and uses SAPI (minimum version 5.1) speech synthesis and recognition, found on Windows XP and later platforms.

An experiment using N-backer

The purpose of this experiment was to assess the accuracy of automated n-back scoring in a continuous dual-task procedure using this software. The primary tasks for the participants to carry out while n-backing were provided by the NAT (Schwartz et al., 2002). The NAT is widely used to assess patients with brain injury and includes three subtasks of varying difficulty. Subtask 1 requires two highly familiar tasks, making toast and coffee, to be performed simultaneously. Subtasks 2 and 3 are harder, since their goals (wrapping a present and packing a lunchbox and schoolbag) are slightly less familiar. Moreover, these tasks are performed in the presence of irrelevant distractor objects that must be ignored. Subtask 3 is arguably the hardest of all, because it involves two overlapping tasks, packing a lunch box and a school bag, and requires participants to avoid packing irrelevant objects. The first two subtasks took between 2 and 4 min to complete, and the last between 3 and 5 min.

These relatively long primary tasks were selected as a hard test for the software and as an example of an experimental procedure in which automated scoring would be most valuable. In particular, as activities of daily living, they may be judged to have greater ecological validity than simpler but more artificial tasks. We were also interested in the effect of n-backing on performance of the NAT, which has a standardised scoring scheme.

Method

Design

Each participant completed one subtask under normal (control) conditions and the other two while n-backing, in a balanced design. This yielded n-back runs for 8 participants for each of the three subtasks and normal runs for 4 participants for each subtask. This compromise maximised the data available for assessing the accuracy of automated scoring while still permitting a meaningful comparison of n-back and control trials.
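The cell counts follow directly from rotating the control condition across subtasks. The sketch below uses a simple rotation as a stand-in, because the actual assignment scheme is not specified here; the function name is hypothetical.

```python
def assign_conditions(n_participants=12, n_subtasks=3):
    """Give each participant one control subtask, rotating across participants."""
    plan = []
    for participant in range(n_participants):
        control_subtask = participant % n_subtasks
        plan.append(["control" if subtask == control_subtask else "2-back"
                     for subtask in range(n_subtasks)])
    return plan


plan = assign_conditions()
for subtask in range(3):
    controls = sum(row[subtask] == "control" for row in plan)
    two_back = sum(row[subtask] == "2-back" for row in plan)
    print(subtask + 1, controls, two_back)  # 4 control and 8 2-back runs each
```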

Participants

A total of 12 participants were recruited through personal contacts. They were all graduate or undergraduate students at the University of York between the ages of 18 and 25. Eight were female.

Procedure

Table 2 gives the basic structure of the experiment. After completing a consent form, participants trained the speech recognition software to recognise their voice (see above). They were given instructions explaining how to 2-back (see the Appendix), and then they listened to a synthesised digit sequence to familiarise them with its sound and pace. This was followed by practice at 2-backing. Participants continued to practice the 2-back task, with breaks if necessary, until they could produce at least 8 consecutive responses without error and 20 responses with 80% accuracy. Furthermore, they needed to demonstrate that they were able to get quickly back into the sequence (within 3–4 digits) if they paused or made an error—for example, switching to 1-backing or 0-backing. Feedback was given about accuracy and the types of errors made, particularly if the participant slipped into 0- or 1-backing. When practiced to the criteria described above, participants were asked to 2-back while sorting a pack of playing cards into suits, as an introduction to the dual-task method, and then the first baseline 2-back measurement was made. This involved participants 2-backing in the absence of a primary task until the experimenter stopped them after 3 min.
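The practice criterion could be checked automatically along the following lines. This is a sketch under one interpretation of the criterion: a run of at least 8 consecutive correct responses anywhere in the record, plus at least one block of 20 consecutive responses with 80% or more correct. The function name and this interpretation are assumptions.

```python
def meets_criterion(responses, run_len=8, window=20, min_accuracy=0.8):
    """Check the practice criterion on a list of booleans (True = correct)."""
    longest = current = 0
    for correct in responses:
        current = current + 1 if correct else 0  # reset the run on any error
        longest = max(longest, current)
    window_ok = any(
        sum(responses[i:i + window]) >= min_accuracy * window
        for i in range(len(responses) - window + 1)
    )
    return longest >= run_len and window_ok


print(meets_criterion([True] * 16 + [False] * 4))  # True
```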

Table 2 The basic structure of the experiment

Before performing each NAT subtask, participants left the room while the test objects for that subtask were set out in their standard positions on a U-shaped table. On participants’ return, they were given the standard NAT instructions for the subtask. The NAT also has instructions about how to answer participants’ questions. In general, these instructions are just to reiterate the overall task and to explain that “everything you need is out here.” The “time-based cues” for the NAT were not used (e.g., prompting after 1 min of unproductive behaviour). At the end of the task, the participant either indicated of their own accord that they had finished or they were asked, “Are you finished?” at which point the program was stopped and the output saved. Upon completing the third subtask, participants provided a second, 3-min 2-backing-only baseline, so that any improvements in 2-back performance over the course of the experiment could be measured.

Apparatus

The N-backer software was run on a Dell PC running Windows XP Version 2002, Service Pack 3. This had an Intel Pentium dual CPU at 2.2 GHz and 3 GB of RAM. The stimuli were presented and responses registered via a Logitech ClearChat wireless USB headset with over-ear phones and a boom microphone.

In normal use, such a headset would be the only additional hardware required. For the purpose of this experiment, which was to assess the accuracy of the N-backer software, sessions were video recorded, and participants wore two extra microphones, the inputs from which were fed into the separate audio channels on the video camera. A clip microphone on the participant’s shirt recorded verbal responses, and a small microphone in the left headphone recorded the stimuli output by the N-backer software. There was some crosstalk on this channel because of bone conduction, but it was still possible to clearly identify the digits generated. The objects for the NAT were as specified in the test manual.

The N-backer settings used in the experiment (Fig. 2; see Section How N-backer works in general for explanations) were Inter-stimulus interval (ms) = 1500; Numbers become expected after (ms) = 500; Unexpected numbers have lower weighting = 0.3; Unexpected numbers have lower priority = 1; Unexpected numbers are ignored = False; Maximum age to match against (n) = 3.4; Minimum n-back time (n) = 2.7; and Maximum n-back time (n) = 3.3. Note that the latter two parameters are set deliberately high, because there was no estimated recognition delay in this version of the software (see Section Automated scoring of n-back performance for an explanation).

Results

N-backer reliability and accuracy

In order to assess the accuracy with which N-backer scored the participants’ 2-backing performance, the videos had to be scored by hand (this, of course, would not be necessary in the normal use of N-backer). Manual scoring was done by the third author, who had considerable experience of doing this with shorter sequences on another project. Scoring 2-back recordings without the help of a computer is hard, since one can make only approximate judgements of timing. It is quite easy, however, to transcribe the sequence of digits spoken by the participant. Manual scoring was thus achieved by matching this transcript to the digits generated by N-backer (Table 1, column 2).

Discrepancies between the transcription and the digits generated by N-backer commonly occurred after participants had slipped into 1- or 0-back responding (i.e., when they repeated the digit they had just heard). Participants would then pause for 2 digits and then start 2-backing again. This pause caused all of the subsequent digits in the transcript to be two items out of place, so these needed to be shifted down in the spreadsheet. Similar adjustments to the transcripts were made wherever necessary to match the transcription to the digits generated by N-backer.

Having achieved a manual transcript in this way, a by-hand score could be computed by counting the proportion of transcribed digits that were correctly 2 back. These percentages are presented in Table 3, along with the automatic accuracy score generated by N-backer. The means and minimum scores across participants are very similar for the two types of scoring, and there is no suggestion that either is more or less lenient. This similarity is reflected in the correlations between the proportions of transcribed digits that were correctly 2 back, computed for both manual and automatic scoring. These are reported in Table 3 and are very high. If the four measurements obtained from each participant are treated as if they were independent, an overall r of .93 (N = 48) is obtained.
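For readers wishing to run the same kind of comparison on their own data, a Pearson correlation between manual and automatic accuracy scores can be computed as in the sketch below. The score lists are invented for illustration and are not the values in Table 3.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented proportions correct for five hypothetical recordings.
manual = [0.91, 0.78, 0.85, 0.60, 0.95]
auto = [0.89, 0.80, 0.83, 0.64, 0.93]
print(round(pearson_r(manual, auto), 3))
```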

Table 3 N-backing accuracy as scored by N-backer (AUTO) and by hand from the audio recording

The Pearson correlations reported above characterise the interrater reliability of the score (proportion of transcribed digits that were correctly 2 back) rather than interrater agreement across individual judgements, as measured by Cohen’s kappa. Interrater reliability will be of most interest to investigators using N-backer to obtain an n-backing score. Interrater agreement is of interest to investigators who wish to use N-backer to look at detailed n-backing behaviour—for example, exactly where in the task n-backing breaks down. Kappa was computed for each of the 24 n-back records—that is, for the two n-back conditions experienced by each of the 12 participants. The mean kappa was .72 (s = .17).
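Cohen's kappa for a single n-back record compares the two raters' judgements item by item and corrects for chance agreement. The sketch below implements the standard formula, kappa = (p_o - p_e) / (1 - p_e); the judgement lists are invented, and the function does not handle the degenerate case where chance agreement equals 1.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical judgements."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    # Chance agreement: product of the raters' marginal proportions per category.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented pass/fail judgements for four responses.
auto = [True, True, False, False]
manual = [True, False, False, False]
print(cohens_kappa(auto, manual))  # 0.5
```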

NAT errors

The NAT is scored by counting the number of task steps accomplished. This accomplishment score is combined with an error score to give a summary score. Few of our participants produced major errors or step omissions, so these scores were close to ceiling for Subtasks 1 and 2 (see Table 4). Subtask 3 (lunchbox and schoolbag) included some steps that were less intuitive to our participants, and here the NAT scores were lower. However, there was little difference in the mean scores between the 2-back and control (no 2-back) conditions. It would appear that 2-backing had little effect on performance on this clinically diagnostic scale. The NAT manual reports mean scores (summing across all three subtasks) of 17.3 (SD = 1.2) for control subjects. All of the scores reported in Table 4 are close to this range.

Table 4 NAT summary scores (out of 6) for each subtask for the 4 participants under control conditions and the 8 participants under 2-back conditions

In addition to this clinically diagnostic NAT score, the test also provides a scoring scheme for a more detailed analysis, the comprehensive error score. Comprehensive error scores were calculated separately for each subtask using the video recordings. Table 5 gives the mean comprehensive error score for each subtask for participants performing under control conditions (without concurrent 2-back task) and under 2-back conditions. Because of the counterbalanced design adopted, there were 8 participants in the 2-back group and 4 participants in the control group. The participants in each group varied across the subtasks. For this reason, and given the small sample sizes involved, Mann–Whitney U tests were carried out for each subtask. This analysis found that 2-backing had little effect on the first subtask (toast and coffee), which was arguably the easiest, but produced more errors in the two other subtasks (wrapping a present and preparing a lunchbox and schoolbag; see Table 5).
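The Mann–Whitney U statistic for a comparison of this size is easy to compute directly. The sketch below counts, over all pairs of scores from the two groups, how often one group's score exceeds the other's, with ties counted as one half. The error scores are invented for illustration. The sketch returns only the statistic, so a p value would still have to be obtained from tables or from a statistics package such as SciPy (`scipy.stats.mannwhitneyu`).

```python
def mann_whitney_u(xs, ys):
    """Smaller of the two Mann-Whitney U statistics for samples xs and ys."""
    u_x = sum(1.0 if x > y else 0.5 if x == y else 0.0
              for x in xs for y in ys)
    u_y = len(xs) * len(ys) - u_x
    return min(u_x, u_y)


# Invented comprehensive error scores: 4 control and 8 2-back participants.
control = [1, 2, 2, 3]
two_back = [3, 4, 4, 5, 6, 6, 7, 8]
print(mann_whitney_u(control, two_back))  # 0.5
```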

Table 5 Comprehensive error scores for each subtask for the 4 participants under control conditions and the 8 participants under 2-back conditions

The authors of the NAT explicitly exclude “prolonged struggles to manipulate and use items and misreaching that falls short of taking the object” from the error scores. This misses “toying,” a commonly observed behaviour of people with executive impairment following stroke or dementia (e.g., Corbett et al., 2009; Wherton & Monk, 2010), where patients repeatedly touch, move, or aimlessly pick up and then replace an object. This type of directionless and inefficient behaviour was observed when participants were performing 2-back concurrently with the NAT subtasks. Therefore, to get a more sensitive measure of the disruption to behaviour caused by 2-backing, a second error score was computed that included toying. Toying was defined as any action in which an object was picked up and put back down without being used. Actions judged as intended to clear space on the work surface or clean up after a task (e.g., closing the butter or putting the lid back on the juice bottle) were not categorised as toying. In addition, hesitations—defined as when the participant stopped both 2-backing and the primary task for more than 2 s—were coded. Table 6 presents means for total numbers of errors (i.e., the comprehensive error score plus toying and hesitations). All three subtasks show significant effects of 2-backing on this score.

Table 6 Combined comprehensive error scores plus toying and hesitations for the 4 participants under control conditions and the 8 participants under 2-back conditions

Discussion

This article presents a methodology for auditory–verbal n-back tasks with automated scoring, using software we have developed called N-backer. The empirical findings demonstrate that the N-backer program produces reliable results, even when N-backer is used as a secondary task and combined with primary tasks lasting several minutes in duration—circumstances in which manual coding is time consuming and challenging. Automated scoring was highly correlated with the results obtained by manual coding of participants’ spoken responses (r = .933).

A 2-back version of this task was used with young healthy adults carrying out the NAT, a standardised neuropsychological assessment used extensively with brain-injured patients (Schwartz et al., 2002). While the samples are very small, this task was sufficiently demanding to significantly disrupt sequential object use. The greatest disruption was seen on the more executively demanding subtasks, which involved multitasking and required that distracting objects be ignored. Low-level errors, hesitations, and toying behaviour—in which participants aimlessly picked up and put down objects—occurred more frequently under dual-task conditions.

Patients with executive or attentional deficits following brain injury also perform poorly on the same tasks within the NAT, and also show frequent hesitations and toying (Corbett et al., 2009). However, complete action additions and omissions were relatively rare in our healthy participants, and hence there were no striking effects of 2-backing on the diagnostic “NAT score.” This null result is difficult to interpret, given the very low N in this comparison, but it is clear that our procedure disrupts sequential object use without mimicking all of the symptoms of executive working memory and attention deficits observed in brain-injured patients.

The results reported above suggest that N-backer is a useful tool for exploring the role of executive control in tasks involving vision and/or action, particularly sequential object use tasks that place significant demands on executive mechanisms. We have shown that N-backer produces accurate results over long test durations for healthy young individuals, who are broadly representative of the participants that are typically recruited in n-back studies. Nevertheless, other potential uses for the software might include investigations of sequential object use in older adults (with whom our methods could be used to explore the benefits of technological and ergonomic aids designed to support the completion of complex everyday tasks) and in settings outside the laboratory (in tasks such as cooking and personal care in participants’ homes). In pursuing these research questions, it will be important to establish the degree to which N-backer can cope with suboptimal speech, recall that includes long pauses, and extraneous background noises.