Cognitive flexibility refers to the mental ability that enables us to effectively adjust to changing task and/or environmental demands (Deák, 2003; Scott, 1962) and is thought to arise from the interaction between higher-order executive functions (Dajani & Uddin, 2015). Further, cognitive flexibility is a critical cognitive function that allows an individual to switch their cognitive strategies, consider two or more aspects of an object, idea, or complex situation simultaneously, and appropriately adapt behavioural strategies (Bilgin, 2009; Dennis & Vander Wal, 2010; Diamond, 2013). There are many strategies by which researchers and clinicians assess cognitive flexibility, one of which is the use of neurocognitive tests.

The most commonly used neurocognitive test for assessing cognitive flexibility is the Wisconsin Card Sorting Test (WCST; Berg, 1948; Johnco, Wuthrich, & Rapee, 2014; Tchanturia et al., 2012). Originally popular as an assessment of frontal lobe damage (Barceló & Knight, 2002; Milner, 1963), by 2005, the WCST was the seventh most frequent test implemented by clinical neuropsychologists (Rabin, Barr, & Burton, 2005). The WCST yields upwards of seven variables. ‘Perseverative responses’ and ‘perseverative errors’ are the most commonly used to assess cognitive flexibility (Baker, Georgiou-Karistianis, et al., 2018a; Baker, Gibson, Georgiou-Karistianis, & Giummarra, 2018b; Dickson, Ciesla, & Zelic, 2017; Garcia-Willingham, Roach, Kasarskis, & Segerstrom, 2018; Gelonch, Garolera, Valls, Rosselló, & Pifarré, 2016; Wollenhaupt et al., 2019), but ‘number of categories completed’, ‘failure-to-maintain-set’, ‘trials to complete the first category’ and ‘non-perseverative errors’ have also been used (e.g. Abbate-Daga, Buzzichelli, Marzola, Amianto, & Fassino, 2014; Aloi et al., 2015; Bischoff-Grethe et al., 2013; Dickson et al., 2017; Gelonch et al., 2016; Tchanturia et al., 2012; Wollenhaupt et al., 2019; Zmigrod, Rentfrow, & Robbins, 2018). This has led to criticism of the WCST for having too many outcome variables (Figueroa & Youmans, 2013) that are not clearly linked to particular cognitive domains (Greve, Stickle, Love, Bianchini, & Stanford, 2005). Moreover, varying definitions and interchanging of variables, inconsistency in how the variables are obtained, and unclear, incomplete, or misleading reporting cloud the field. Here we highlight several critical problems with the current use of the WCST and recommend solutions for implementation in research and clinical practice.

An overview of the Wisconsin Card Sorting Test

The WCST is a card-matching task. For each trial, a response card is placed above four multidimensional stimulus cards (see Fig. 1 for an example of the WCST). The cards presented in the task vary on three dimensions: colour (red, blue, yellow, green), form (circles, triangles, stars, crosses), and number (one, two, three, four). For each ‘trial’, participants ‘match’ the response card to one of the four stimulus cards, without specific instruction by the administrator. The sorting rule is the dimension on which the card needs to be correctly matched, and the participant identifies the sorting rule through a process of trial and error. For example, a response card with two blue triangles can be matched according to colour (blue), form (triangle), or number (two). After each response, the participant receives feedback (i.e. ‘correct’ or ‘incorrect’) that can be used to establish the correct sorting rule. Typically, the sorting rule changes without warning after ten correct responses in a row (this is called ‘completing a category’) and the participant must ‘start again’ to establish the new sorting rule for the following category. In the standard administration of the task, the order of the sorting rules is colour, form, number, colour, form, number (Heaton, Chelune, Talley, Kay, & Curtiss, 1993). The WCST terminates when either (i) all six categories are completed, or (ii) 128 trials are completed. The WCST was first developed as a manually administered task using physical cards; however, it has since been adapted to a computerised format (Heaton & PAR Staff, 2008), and is widely used in both physical and electronic forms.

Fig. 1 An example of the WCST display. The figure depicts a response card with two blue stars. The stimulus cards are, from left to right: one red triangle, two green stars, three yellow crosses, and four blue circles. In the manual administration, the four stimulus cards would be laid on a table in front of the participant, and the participant is handed a deck of multidimensional response cards to sort.
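The category-change and termination rules described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published administration software: the function name `administer` and its boolean feedback stream are hypothetical, and only the ten-in-a-row criterion, the six-category rule order, and the 128-trial limit are taken from the text.

```python
from itertools import repeat

# Values taken from the description above (Heaton et al., 1993).
RULE_ORDER = ("colour", "form", "number", "colour", "form", "number")
RUN_LENGTH = 10    # consecutive correct responses that complete a category
MAX_TRIALS = 128   # the test stops here if six categories are not completed


def administer(feedback):
    """Apply the rule-change and stopping rules to a stream of feedback.

    `feedback` yields one boolean per trial (True = 'correct'); this input
    format is an assumption made for the sketch.
    Returns (categories_completed, trials_used, rule_per_trial).
    """
    categories, streak = 0, 0
    rule_per_trial = []
    for correct in feedback:
        rule_per_trial.append(RULE_ORDER[categories])  # rule in effect on this trial
        streak = streak + 1 if correct else 0           # any error resets the run
        if streak == RUN_LENGTH:                        # category completed
            categories += 1
            streak = 0
        if categories == len(RULE_ORDER) or len(rule_per_trial) == MAX_TRIALS:
            break
    return categories, len(rule_per_trial), rule_per_trial


# A participant who is always correct finishes in 6 x 10 = 60 trials.
print(administer(repeat(True))[:2])    # (6, 60)
# A participant who is never correct stops at the 128-trial limit.
print(administer(repeat(False))[:2])   # (0, 128)
```

The sketch models only the bookkeeping of categories and trials; it does not model the cards themselves or how a response is judged correct.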

Changes in the WCST since its development

The WCST was first developed by Berg (1948) to assess flexibility in thinking, but has since seen some changes (see Eling, Derckx, & Maes, 2008, for a full historical and conceptual review). Briefly, the original WCST (Berg, 1948) required participants to make five correct responses in a row before the sorting rule changed (it is now ten), and nine categories were completed (it is now six). The original WCST consisted of 60 stimulus cards (Berg, 1948); Grant and Berg (1948) modified the task to include 64 stimulus cards, which were repeated if the participant used all 64 cards before completing the task (i.e. completing all categories). The original WCST had no limit on the number of trials taken to complete the task (Berg, 1948); there is now a maximum of 128 trials (Heaton et al., 1993). However, shorter and perhaps more practical versions of the WCST have been developed and are commonly used in clinical settings (i.e. the WCST-64 card version; Greve, 2001). Berg (1948) considered only the average number of errors made and the number of categories completed during the WCST; Grant and Berg (1948) added perseverative and non-perseverative errors as key outcome variables for the WCST and provided basic scoring information for the task.

Despite the test’s development over time, its popularity, and widespread use, little to no standardisation of the administration, scoring, and terminology existed until Heaton, Chelune, Talley, Kay, and Curtiss (1981) published a standardised manual. Even then, the administration and scoring rules remained unclear to many (Flashman, Horner, & Freides, 1991; Greve, 1993), and supplementary, more transparent scoring guidelines were developed (e.g. Axelrod, Goldman, & Woodard, 1992; Flashman et al., 1991). An updated WCST manual was published in 1993 (Heaton et al., 1993), providing clarification for how to score perseverative responses, one of the key variables used today to assess cognitive flexibility.

Inconsistent terminology and unclear definitions

There are inconsistencies in the WCST terminology used throughout the literature, and the definitions used to operationalise these terms vary between studies. These inconsistencies have caused significant confusion for both researchers and clinicians who use the WCST, and have led to further discrepancies, bespoke scoring approaches, and a lack of transparency in reporting administration and scoring. Many sources of confusion exist. For example, Heaton et al. (1993) used ‘dimension’ to refer to the characteristics of the cards by which they can be sorted (i.e. colour, form, number, or other); WCST studies have variably used attribute, characteristic, criterion, rule, category, or principle in lieu of dimension. The term ‘category’ refers to the discrete sections of the WCST (Grant & Berg, 1948; Heaton et al., 1993). There are six categories, which follow the order colour, form, number, colour, form, number. The category determines the correct sorting rule. Given that the dimensions are repeated twice, categories are sometimes referred to in terms of their numerical order (e.g. first category).

Confusion and problems in scoring the WCST

The ‘perseverated-to’ principle

The ‘perseverated-to’ principle, a key scoring principle of the WCST, remains a poorly understood concept. The perseverated-to principle was formally conceptualised by Heaton et al. (1993) after supplementary scoring guidelines for the WCST were developed by Flashman et al. (1991). The perseverated-to principle was described by Heaton et al. (1993) in conjunction with a definition of perseveration, which may have caused confusion between the perseverated-to principle and the meaning of perseverative responses. In an attempt to clarify the key scoring terms of the WCST, we have provided definitions in Table 1 that are in line with the standardised manual (Heaton et al., 1993). The perseverated-to principle is established in the first category of the WCST after the first unambiguous incorrect response. Subsequent responses that match the perseverated-to principle are scored as perseverative. However, an ambiguous response must also meet the criteria for the ‘sandwich rule’, which specifies that an ambiguous response preceded and followed by an unambiguous response must match the dimension of the unambiguous response, thus demonstrating a perseverative response pattern. Following the completion of a category (i.e. when ten correct responses are made in a row), the previously correct sorting rule becomes the perseverated-to principle in the next category (Heaton et al., 1993). For example, if the correct sorting rule in the previous category was colour and the correct sorting rule for the current category is form, colour becomes the perseverated-to principle (see Fig. 2, column (a) for an example). According to Heaton et al. (1993), the perseverated-to principle can also change within a category. Following from the previous example, if the participant makes three sequential unambiguous errors according to number within the form category (during which colour is the initial perseverated-to principle), then the dimension ‘number’ becomes the new perseverated-to principle (see Fig. 2, column (a) for an example). It is important to note that in this situation the new perseverated-to principle does not come into effect until the second unambiguous error, and thus the second unambiguous error (not the first) is the first response to be scored as perseverative.

Table 1 Key WCST scoring terminology derived from the revised WCST manual (Heaton et al., 1993)
Fig. 2 Examples of various scoring scenarios for the WCST according to the Heaton et al. (1993) scoring manual. Note: Each main column comprises three scoring columns (Number sequentially correct, Response, and Perseverative response). Each column is divided by a line at the point of category and rule change. Column (a) provides an illustration of the perseverated-to principle after a category change and how the perseverated-to principle can change within a category after three unambiguous responses; column (b) provides an illustration of how to score perseverative responses according to the sandwich rule when there is an ambiguous response; and column (c) provides an illustration of how to score perseverative errors. The number in the correct column signifies the number of correct sequential responses; a bolded and underlined letter indicates which dimensions the participant’s response matches; C = colour; F = form; N = number; O = other (i.e. the response did not match on any of the three key dimensions); p = perseverative response; ptp = establishment of the perseverated-to principle.
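The within-category change of the perseverated-to principle can be made concrete with a short Python sketch. It is a deliberately simplified reading of the rule as worded in this article, not a reimplementation of the Heaton et al. (1993) manual: the function name `score_unambiguous` and the per-trial sets of matched dimensions are assumptions, the three errors are required to fall on strictly consecutive trials, and ambiguous responses are never flagged here (the sandwich rule is not modelled).

```python
def score_unambiguous(matches, rule, ptp):
    """Flag perseverative responses within one category (simplified sketch).

    matches: one set per trial holding the dimension(s) the response matched
             ('C', 'F', 'N'); an empty set stands for 'other'.
    rule:    the correct sorting rule for this category.
    ptp:     the perseverated-to principle in force when the category starts
             (assumed to differ from `rule`).
    Returns (flags, final_ptp); flags[i] is True for a perseverative response.
    """
    flags = [False] * len(matches)
    run_dim, run = None, []      # current run of same-dimension unambiguous errors
    for i, matched in enumerate(matches):
        dim = next(iter(matched)) if len(matched) == 1 else None
        if dim is not None and dim == ptp:
            flags[i] = True      # unambiguous match to the current principle
            run_dim, run = None, []
        elif dim is not None and dim != rule:
            # Unambiguous error on another dimension: extend or restart the run.
            run = run + [i] if dim == run_dim else [i]
            run_dim = dim
            if len(run) == 3:
                ptp = dim        # the principle changes within the category
                # Scored from the second error onwards, not the first.
                flags[run[1]] = flags[run[2]] = True
                run_dim, run = None, []
        else:
            run_dim, run = None, []   # correct, ambiguous, or 'other' response
    return flags, ptp


# Form category, with colour carried over as the initial perseverated-to principle.
form_category = [{"C"}, {"N"}, {"N"}, {"N"}, {"N"}, {"F"}]
flags, final_ptp = score_unambiguous(form_category, rule="F", ptp="C")
print(flags)      # [True, False, True, True, True, False]
print(final_ptp)  # N
```

In the example, the first ‘number’ error is left unscored and the second is the first to be flagged, mirroring the description in the text.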

Perseverative responses and perseverative errors: Two key variables to assess cognitive flexibility

Perseverative responses are persistent responses made by a participant on the basis of an incorrect (previous or novel) stimulus dimension (Heaton et al., 1993; McCallum, 2017). For example, a participant may attempt to sort a card based on its form, receive feedback that this response is incorrect, and yet continue to make the same mistake on the following card. Typically, perseverative responses are incorrect (i.e. the response does not match the correct sorting rule), and are termed perseverative errors (Grant & Berg, 1948). However, a perseverative response may also be correct because of the potential for ambiguous responses and the above-mentioned sandwich rule. On some trials of the WCST, the chosen stimulus card will match the response card on multiple dimensions (e.g. the response card ‘four green circles’ matches the stimulus card ‘four blue circles’ on both number and form). These sorts of responses are considered ambiguous because the administrator cannot be certain of which dimension the participant is using to sort the card. Typically, an inspection of the unambiguous responses before and after an ambiguous response can indicate the sorting rule that a participant is using. According to the sandwich rule, an ambiguous response that follows and is followed by an unambiguous perseverative error, and matches the perseverated-to principle, is scored as a perseverative response (see Fig. 2, column (b) for an example of this). Thus, even in cases wherein the ambiguous response matches the correct sorting rule, the response is scored as perseverative, i.e. it is a perseverative response, but not a perseverative error. To summarise: all perseverative errors are perseverative responses, but not all perseverative responses are perseverative errors (see Fig. 3). The terms have sometimes been used interchangeably (e.g. Gelonch et al., 2016; Øverås, Kapstad, Brunborg, Landrø, & Lask, 2015; Strauss, Sherman, & Spreen, 2006), illustrating the potential for erroneous results for both variables. This problem even infiltrates the revised WCST manual (Heaton et al., 1993), which uses the terms interchangeably and fails to highlight whether perseverative responses, perseverative errors, or both, should be reported.

Fig. 3 A Venn diagram illustrating the relationship between perseverative responses and errors made on the WCST.
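The distinction between ambiguous and unambiguous responses, and between perseverative responses and perseverative errors, can be illustrated with a small Python sketch. The names `Card`, `matched_dimensions`, and `score` are hypothetical, and the sandwich rule is implemented only in the simplified form worded above (both immediate neighbours must be unambiguous matches to the perseverated-to principle, which is held fixed); the manual's full conditions are not modelled.

```python
from typing import NamedTuple


class Card(NamedTuple):
    colour: str
    form: str
    number: int


def matched_dimensions(response, stimulus):
    """Return the set of dimensions on which the two cards agree."""
    return {d for d in Card._fields if getattr(response, d) == getattr(stimulus, d)}


def score(matches, rule, ptp):
    """Return one (perseverative_response, perseverative_error) pair per trial.

    matches: per-trial sets of matched dimensions, all from one category in
             which the perseverated-to principle `ptp` stays fixed.
    """
    results = []
    for i, matched in enumerate(matches):
        if len(matched) == 1:
            persev = matched == {ptp}            # unambiguous response
        elif ptp in matched:
            # Sandwich rule (simplified): both immediate neighbours must be
            # unambiguous responses matching the perseverated-to principle.
            persev = (0 < i < len(matches) - 1
                      and matches[i - 1] == {ptp}
                      and matches[i + 1] == {ptp})
        else:
            persev = False
        # A perseverative response is an error only if it misses the correct rule.
        results.append((persev, persev and rule not in matched))
    return results


# The ambiguous example from the text: the cards agree on form and number.
response = Card("green", "circle", 4)
stimulus = Card("blue", "circle", 4)
print(sorted(matched_dimensions(response, stimulus)))  # ['form', 'number']

# Form category with 'number' as the perseverated-to principle.
trials = [{"number"}, {"form", "number"}, {"number"}]
scored = score(trials, rule="form", ptp="number")
print(sum(r for r, _ in scored), sum(e for _, e in scored))  # 3 2
```

The middle trial is correct (it matches form) yet is still counted as a perseverative response, so the two totals differ: three perseverative responses but only two perseverative errors.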

Grant and Berg (1948) and Heaton et al. (1993) have different methods for scoring perseverative responses and perseverative errors. Grant and Berg (1948) specified that perseverative responses occurred after a rule change and were responses that matched the sorting rule of the previous category. Yet, the more contemporary definition provided by the revised WCST manual specifies that perseverative responses are those that match the perseverated-to principle (Heaton et al., 1993). Thus, a perseverative response (or perseverative error) can occur at any point in the task, including before a rule change and during the first category of the task (Heaton et al., 1993). Problematically, both scoring methods continue to be used in current research, which is likely to create difficulties in comparing results across studies.

When describing the number of categories achieved on the WCST, Strauss et al. (2006) explain that ‘scores can range from 0 for the subject who never gets the idea at all to 6’ (pp. 528–529). Paradoxically, a participant who never learns the correct sorting rule (i.e. they fail to achieve ten correct responses in a row), and thus never advances from the first category, will receive a score of zero perseverative errors under the Grant and Berg (1948) scoring method. Because a score of zero perseverative errors is considered to reflect high cognitive flexibility, it would be not only imprecise but contradictory in this situation. In these instances, the Grant and Berg (1948) scoring method provides an erroneous impression of cognitive flexibility. Typically, in instances where the participant has failed to complete the first category and the Grant and Berg (1948) method has been used, if the participant’s responses are examined in greater detail, it becomes evident that the participant has perseverated within the first failed category. For example, the participant may have sorted according to an incorrect dimension several times in a row before testing another dimension, thus demonstrating cognitive rigidity. Consequently, in these cases, the Heaton et al. (1993) scoring method may be more appropriate to capture all instances of perseverative responses. Although the number of categories completed may not be of importance when considering other WCST variables (e.g. failure-to-maintain-set), it is evident that the number of categories completed, in conjunction with the scoring method used, influences the number or percentage of perseverative responses scored.
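The paradox can be reproduced with a minimal Python sketch that contrasts the two definitions as they are worded in this article. Both functions are hypothetical simplifications: `grant_berg_errors` counts errors that match the previous category's rule, and `first_category_perseverations` counts unambiguous responses that match a perseverated-to principle established by the first unambiguous error. Neither models ambiguous responses, and the example data are invented purely for illustration.

```python
def grant_berg_errors(trials):
    """Count errors that match the previous category's sorting rule.

    trials: list of (previous_rule, rule, matched) tuples; previous_rule is
            None throughout the first category.
    """
    return sum(1 for previous_rule, rule, matched in trials
               if previous_rule is not None
               and previous_rule in matched
               and rule not in matched)


def first_category_perseverations(matches, rule):
    """Count perseverative responses in the first category (simplified).

    The first unambiguous error establishes the perseverated-to principle and
    is not itself scored; later unambiguous matches to it are counted.
    """
    ptp, count = None, 0
    for matched in matches:
        if len(matched) != 1:
            continue                 # ambiguous and 'other' responses are skipped
        (dim,) = matched
        if ptp is None:
            if dim != rule:
                ptp = dim            # principle established by the first error
        elif dim == ptp:
            count += 1
    return count


# Illustrative participant: sorts by form on all 128 trials while the rule is
# colour, so the first category is never completed.
matches = [{"form"}] * 128
trials = [(None, "colour", m) for m in matches]

print(grant_berg_errors(trials))                          # 0
print(first_category_perseverations(matches, "colour"))   # 127
```

Under the first definition this maximally rigid participant scores zero, whereas the second definition flags every response after the principle is established.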

Perhaps in response to the problems outlined, some authors implement their own scoring method. The implementation of different scoring techniques in some papers has led to discrepancies in the total number/percentage of perseverative responses scored. For example, according to the 1993 WCST manual, if the first response after the completion of a category (i.e. the first trial in which a new sorting rule is in effect) is an unambiguous error that matches the previous sorting rule, it is marked as a perseverative error (see Fig. 2, column (c); Heaton et al., 1993). However, in Channon (1996), these responses were not scored as perseverative errors ‘[because] the sorting principle changed without warning and they had no means of knowing this in advance’ (p. 109). In both examples, the sorting rule changes without warning at the end of a category. The two scoring techniques represent differences in the approach to, and understanding of, perseverative responses. Indeed, Channon’s (1996) method of scoring would afford up to six fewer perseverative responses than Heaton et al. (1993), making studies using these different methods incomparable. There are many cases of studies implementing their own scoring techniques for the WCST or failing to specify the scoring method at all (e.g. Perpiñá, Segura, & Sanchez-Reales, 2017; Van Eylen et al., 2011). The inconsistencies in scoring methods present challenges when comparing results and generalising findings.

Using the WCST to assess cognitive flexibility

Although the WCST is widely accepted as an assessment of cognitive flexibility (Cragg & Chevalier, 2012; Dennis & Vander Wal, 2010; Figueroa & Youmans, 2013), neurocognitive tasks inevitably assess other executive functions (Buchsbaum, Greer, Chang, & Berman, 2005; Miyake, Emerson, & Friedman, 2000). Importantly, the WCST is a complex task that requires the use of multiple executive functions including attention, memory, and implicit learning (Buchsbaum et al., 2005; Cepeda, Kramer, & Gonzalez de Sather, 2001; Friederich & Herzog, 2011; Wu et al., 2014). Consequently, when the WCST is used to assess cognitive flexibility, the influence of other executive functions should be considered and accounted for where possible.

Typically, perseverative responses and/or perseverative errors are presented as indicators of cognitive flexibility (e.g. Baker, Georgiou-Karistianis, et al., 2018a; Baker, Gibson, et al., 2018b; Dickson et al., 2017; Garcia-Willingham et al., 2018; Gelonch et al., 2016; Wollenhaupt et al., 2019). However, some research has also used other variables (e.g. the number of categories completed, the number of trials taken to complete the first category, non-perseverative errors, and failure-to-maintain-set) as indicators of cognitive flexibility (e.g. Abbate-Daga et al., 2014; Aloi et al., 2015; Bischoff-Grethe et al., 2013; Dickson et al., 2017; Gelonch et al., 2016; Tchanturia et al., 2012; Wollenhaupt et al., 2019; Zmigrod et al., 2018). The extent to which some of these variables assess cognitive flexibility remains under debate; we recommend the reader refer to Figueroa and Youmans (2013) for discussion of the failure-to-maintain-set variable and its appropriateness as a measure of cognitive flexibility or distractibility. Further, despite the consensus that perseverative responses and/or perseverative errors are indicative of cognitive flexibility, there is no empirical evidence that conclusively verifies that these variables assess this construct. Rather, there is a general acceptance that a pattern of repetitive incorrect responding suggests rigidity and an inability to adapt to change. Comparing the WCST variables, particularly perseverative responses and/or perseverative errors, to other accepted neurocognitive tasks commonly used to assess cognitive flexibility (e.g. the Trail Making Test (TMT); Reitan, 1958) would be a first step in establishing the validity of these outcomes as markers of cognitive flexibility. Previous studies have identified a moderate to non-existent relationship between the cognitive flexibility outcomes of the WCST and the TMT (Chaytor, Schmitter-Edgecombe, & Burr, 2006; Herbrich, Kappel, Winter, & van Noort, 2019; Kortte, Horner, & Windham, 2002; O’Donnell, Macgregor, Dabrowski, Oestreicher, & Romero, 1994; Pignatti & Bernasconi, 2013; Van Autreve, De Baene, Baeken, van Heeringen, & Vervaet, 2013). However, it is noteworthy that these studies have correlated diverse outcomes of these tests, and some of the reported variables may not be appropriate for assessing cognitive flexibility (i.e. TMT – Part B; Vall & Wade, 2015). Establishing consensus on which WCST and TMT variables to use when assessing cognitive flexibility will enable the field to compare findings across studies to elucidate the construct validity of such neurocognitive tests. In addition, leveraging advances in the study of cellular, circuit-level, and whole-brain imaging to uncover the neural correlates of cognitive flexibility is necessary to move the field forward (Lie, Specht, Marshall, & Fink, 2006; Specht, Lie, Shah, & Fink, 2009; Yuan & Raz, 2014).

Recommendations and conclusions

Establishing consensus on how to best define the WCST key terms and variables should be made a priority to promote the standardisation of assessments and comparability of results across studies. We recommend that perseverative responses and perseverative errors be defined and reported separately. Given that the Heaton et al. (1993) method of scoring is perhaps better able to assess perseverative responses in the first category of the WCST, we recommend that the Heaton et al. (1993) method be used to score the WCST. However, we acknowledge the complexity of this scoring system and recognise that training may be required before administrators are confident in scoring and interpreting any WCST data using the Heaton et al. (1993) method. Automated scoring that is facilitated by a computer program may be the best choice for a novice administrator. The use of a computerised version of the WCST (ideally the Heaton and PAR Staff (2008) program) reduces the opportunity for human error and misinterpretation of the scoring instructions described by Heaton et al. (1993). A non-commercial open-source version could be utilised in research and in clinical practice. In this instance, we recommend that the scoring code conform to the scoring methods outlined in the Heaton et al. (1993) manual. In instances where other computerised versions are implemented, manually inspecting the raw data to ensure that there are no atypical summary scores is an appropriate precaution to take.
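As one possible aid to such an inspection, the following Python sketch flags summary scores that are internally inconsistent with the standard 128-trial, six-category administration described in this article. The function name `check_summary` and the dictionary keys are hypothetical and would need to be mapped onto whatever a given program exports; the limits would also differ for shortened versions such as the WCST-64.

```python
def check_summary(scores):
    """Return a list of problems found in one participant's summary scores.

    `scores` is a dict using the hypothetical keys read below. The limits
    assume the standard administration (six categories, 128 trials, ten
    consecutive correct responses per category).
    """
    trials = scores["trials_administered"]
    categories = scores["categories_completed"]
    responses = scores["perseverative_responses"]
    errors = scores["perseverative_errors"]

    problems = []
    if not 0 <= categories <= 6:
        problems.append("categories completed must lie between 0 and 6")
    if not 1 <= trials <= 128:
        problems.append("trials administered must lie between 1 and 128")
    if categories * 10 > trials:
        problems.append("each completed category needs at least ten trials")
    if categories < 6 and trials < 128:
        problems.append("test ended before either stopping criterion was met")
    if errors > responses:
        # Every perseverative error is also a perseverative response.
        problems.append("perseverative errors cannot exceed perseverative responses")
    if responses > trials:
        problems.append("perseverative responses cannot exceed trials administered")
    return problems


plausible = {"trials_administered": 60, "categories_completed": 6,
             "perseverative_responses": 5, "perseverative_errors": 4}
suspect = {"trials_administered": 128, "categories_completed": 3,
           "perseverative_responses": 10, "perseverative_errors": 12}

print(check_summary(plausible))  # []
print(check_summary(suspect))    # ['perseverative errors cannot exceed perseverative responses']
```

An empty list does not show that the scoring is correct; it only indicates that none of these basic consistency checks failed, so the raw trial-level data should still be reviewed.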

To reduce the complexity of scoring the WCST, alternate versions of the test have been developed with all ambiguous cards removed (e.g. the Modified Card Sorting Test; Nelson, 1976). However, more research is needed to examine the extent to which the scores from the original WCST (e.g. perseverative errors and perseverative responses) relate to the scores of the modified versions. The reduction in the total number of cards presented due to the removal of ambiguous cards creates fewer opportunities for a participant to initiate cognitive flexibility. Relatedly, the maximum number of possible perseverative responses in modified versions of the WCST is lower than in the original WCST. Hence, raw scores on the modified WCST may not be directly comparable with raw scores on the original WCST, potentially leading to further confusion in the literature, concerns surrounding validity, and problems with study comparability. We recommend that future research investigate the validity of the different outcomes of the modified WCST to provide clarification on whether removing ambiguous cards from the WCST is an appropriate step to improve scoring simplicity.

In conclusion, the WCST is undoubtedly a popular neurocognitive task that has been widely used by both researchers and clinicians since its inception. Despite the wide implementation of the WCST within the field of neuropsychology, this task is not without its limitations. The inconsistent scoring of the WCST is a major source of confusion for users and creates challenges in interpreting and comparing findings. There has also been a lack of consensus in the key WCST terminology and its corresponding definitions. Further, the outcome variables for assessing cognitive flexibility vary among studies. We recommend that users of the WCST follow the recommendations described above and presented in Table 2. Specifically, it is fundamental that authors who are considering using the WCST are transparent about the format of the task, cite the chosen scoring method, and report both perseverative responses and perseverative errors when using this task as an index to assess cognitive flexibility so as to better capture this latent construct.

Table 2 A checklist of recommendations for using and reporting the WCST