Introduction
The Autism Diagnostic Observation Schedule (ADOS; Lord et al. 2000) is a semi-structured, standardized observational instrument for assessing the social and communicative abilities of individuals with possible autism spectrum disorder (ASD). The ADOS consists of four modules intended for use in children, adolescents, and adults with different developmental and language levels. Items are scored from 0 (not abnormal) to 2 or 3 (most abnormal), and a diagnosis of autism or ASD is established if the individual assessed has scores higher than the established cut-off values in the Communication domain, in the Social domain, and on the sum of the two. The current ADOS diagnostic algorithm does not include items related to repetitive behaviors and restricted interests, although these behaviors are coded if they occur (Lord et al. 2001). Initially, this choice was based on the limited time available to observe these kinds of behaviors in the context of the assessment. In research and clinical practice, the ADOS used in combination with the Autism Diagnostic Interview-Revised (ADI-R; Lord et al. 1994) is considered the “gold standard” for diagnosing autism. The ADI-R is a comprehensive, standardized, and semi-structured parent interview that includes items covering all three major domains of dysfunction in autism, namely, quality of reciprocal social interaction, communication, and repetitive, restricted, and stereotyped patterns of behavior.
In 2007, Gotham and colleagues proposed revised algorithms for Modules 1 through 3 to improve the sensitivity and specificity of the ADOS. In the revised algorithms, the original ADOS domains and cut-off values for Social and Communication items have been collapsed into a single factor of 10 items covering the social and communication domains: the Social Affect (SA) factor. In addition, 4 items from a second factor, restricted, repetitive behavior (RRB), have been included because RRB domain items may contribute to the diagnosis of autism or ASD, even in the limited context of the ADOS (Lord et al. 2006). There are two diagnostic cut-off scores for the combined SA&RRB domain total, one for autism and one for ASD. In order to reduce ceiling effects in communication items, the revised algorithms distinguish between “Some words” and “No words” in Module 1, on the basis of the item A1 score (overall level of non-echoed language). To reduce the difference between younger, more rapidly developing children and older children, the revised algorithms distinguish between children younger and older than 5 years in Module 2.
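The decision rule described above, a combined SA&RRB domain total compared against an ASD cut-off and an autism cut-off, can be illustrated with a minimal sketch. The names `Cutoffs` and `classify` are hypothetical, the cut-off values are placeholders rather than the published ones, and the sketch assumes that a total at or above a cut-off meets it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cutoffs:
    """Cut-offs for one developmental cell.

    The values used below are hypothetical placeholders,
    not the published ADOS cut-offs.
    """
    asd: int
    autism: int


def classify(sa_total: int, rrb_total: int, cutoffs: Cutoffs) -> str:
    """Classify a case from the combined SA&RRB domain total.

    Assumption: a total at or above a cut-off counts as meeting it.
    """
    total = sa_total + rrb_total
    if total >= cutoffs.autism:
        return "autism"
    if total >= cutoffs.asd:
        return "ASD"
    return "non-spectrum"


example = Cutoffs(asd=7, autism=10)  # hypothetical values for illustration
print(classify(8, 3, example))  # combined total 11
print(classify(6, 1, example))  # combined total 7
print(classify(3, 1, example))  # combined total 4
```

In practice a separate set of cut-offs would be needed for each developmental cell (Module 1/No words, Module 1/Some words, Module 2/younger than 5 years, Module 2/older than 5 years).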
The new algorithms have important advantages over the original algorithms: they are more homogeneous, which makes comparisons between and within cases across all three modules easier; the effects of age and IQ are probably reduced; and they make direct comparison with ADI-R scores possible. In addition, on the basis of these revised ADOS algorithms, Gotham et al. (2009) recently proposed a calibrated autism severity metric that could prove very useful for identifying trajectories of autism severity in clinical, genetic, and neurobiological research.
To the best of our knowledge, four studies have replicated the revised ADOS algorithms. In the first, small study (N = 26), Overton et al. (2008) reported mixed results: the revised algorithms were slightly more accurate for Module 1, whereas revised algorithm scores for Modules 2 and 3 were similar to those for the original algorithms. In the second study, Gotham et al. (2008) replicated their own study in a large and independent sample (N = 1282) and presented sensitivity and specificity data that were similar to or better than those for the original algorithms, with the exception of scores for young children with phrase speech and a clinical diagnosis of pervasive developmental disorder not otherwise specified (PDD-NOS). In the third replication study (N = 195), Gray et al. (2008) found improved sensitivity with slightly reduced specificity for Module 1/No Words, and improved sensitivity and specificity for Module 1/Some Words. Additional indices of diagnostic accuracy were taken into consideration; however, it was difficult to compare findings with those of the other studies due to dissimilarities in the definition of diagnostic groups. In the fourth study, De Bildt et al. (2009) evaluated the revised algorithms in a Dutch, low-functioning sample (N = 558). The balance between sensitivity and specificity for Modules 2 and 3 was better with the revised algorithms, without changes in the efficiency of classification. The sensitivity and specificity of Module 1 showed a more modest improvement, possibly because of the low-functioning sample.
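The accuracy indices compared across these studies can be made concrete with a short sketch. The function names and the counts are hypothetical and are not data from any of the studies cited; efficiency is taken here to mean the overall proportion of cases classified correctly, which is an assumption about how the term is used.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of clinically diagnosed cases that the algorithm classifies as cases."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Proportion of non-spectrum cases that the algorithm classifies as non-spectrum."""
    return true_neg / (true_neg + false_pos)


def efficiency(true_pos: int, false_neg: int, true_neg: int, false_pos: int) -> float:
    """Overall proportion of correct classifications (assumed definition)."""
    total = true_pos + false_neg + true_neg + false_pos
    return (true_pos + true_neg) / total


# Hypothetical cell: 40 ASD cases (26 meet the cut-off) and
# 30 non-spectrum cases (27 fall below it).
tp, fn, tn, fp = 26, 14, 27, 3
print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
print(f"efficiency  = {efficiency(tp, fn, tn, fp):.2f}")
```

With these illustrative counts, a cell can combine high specificity with modest sensitivity, which is why the two indices are reported separately rather than summarized by efficiency alone.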
Although these replication studies generally support the use of the revised algorithms, additional research is warranted because the revised algorithms could not be applied to some developmental cells due to limited data (De Bildt et al. 2009; Gotham et al. 2008; Gray et al. 2008). The sample studied by Overton et al. (2008) was too small to draw firm conclusions from the results. The sample in the current study allowed us to fill some gaps in the replication literature, especially for Module 1/Some Words and both Module 2 cells. In addition, as the age and diagnostic groups represented in a sample will always influence outcome measures, it is of interest to replicate the findings of Gotham et al. (2007) in an independent sample with a slightly different make-up.
In the current study we sought to augment the findings of earlier studies in a large independent Dutch sample (N = 532). The main aim was to examine whether the revised algorithms improved the diagnostic validity of the ADOS. We administered Modules 1 and 2 but not Module 3, because of the limited data available for this module. Additional aims were to study the effects of age and IQ on the revised algorithms, evaluated against the original algorithms, and to verify the factor structure of the revised algorithms.
Discussion
This study replicates previous findings on the diagnostic validity of the revised algorithms for ADOS Modules 1 and 2, proposed by Gotham et al. (2007), in a relatively large independent Dutch sample. Overall, the children in our sample were younger and higher functioning than those included in previous replication studies and had a different distribution of clinical diagnoses (autism, non-autism ASD, and non-spectrum disorders). For some specific cells, our study fills gaps in the literature, namely for the Module 1/Some Words cell and for both the Module 2/Younger than 5 years and the Module 2/Older than 5 years cells.
Regarding our main aim, to investigate whether the revised algorithms improved the diagnostic validity of the ADOS compared to the original algorithms, we found: (a) increased predictive validity for autism cases with the revised algorithms (combined Social Affect and Restricted, Repetitive Behavior domains, SA&RRB) except for participants who were administered Module 1/No words and had a mental age of > 15 months (cell A-II), (b) less consistent improvement in predictive validity for non-autism ASD cases, and (c) increased predictive validity for ASD cases as a whole, especially when the revised algorithms based on the SA domain total were used. Previous studies generally indicated that the sensitivity, specificity, and/or efficiency of the revised algorithms were the same as or better than those of the original algorithms (De Bildt et al. 2009; Gotham et al. 2007, 2008; Gray et al. 2008; Overton et al. 2008). Our findings support the use of the more homogeneous revised algorithms, with similar items used across developmental cells to allow for easier comparison of ADOS scores within and between individuals.
In previous research, the only exception to the overall finding of improved predictive validity of the revised algorithms was the performance of the ADOS revised algorithm for non-autism ASD in young children administered Module 2 (cell C). Gotham et al. (2008) reported a marked decrease in sensitivity for non-autism ASD with the revised algorithm based on the combined SA&RRB domain total compared with the original algorithm in this cell (0.65 vs. 0.88). In cell C (with 17 non-autism ASD and 18 non-spectrum participants), they reported a high mean score (4) for the ADI-R restricted, repetitive and stereotyped behavior domain but a low ADOS mean score for this domain (RRB score of 1.3), which suggests that the ADOS may miss this type of behavior. We found a similar pattern in cell C, which consisted of 41 individuals with non-autism ASD and 30 individuals with non-spectrum disorders. However, in our substantially larger sample, we did not observe a decrease in sensitivity relative to the original algorithms. This suggests that the revised algorithm also has improved diagnostic validity in children with non-autism ASD who have phrase speech and are younger than 5 years.
The sensitivity of the revised algorithms was much lower in the different developmental cells than that reported in previous studies, whereas the specificity was comparable or even better (De Bildt et al. 2009; Gotham et al. 2007, 2008; Gray et al. 2008). As most cases, including the non-spectrum cases, had been referred for suspected ASD, we would have expected the opposite, namely, a high sensitivity with a lower specificity. The high specificity of the ADOS found in our study supports the usability of the ADOS for research purposes in order to select narrowly defined groups of participants. However, the low sensitivity is a problem, because it is important not to miss cases. There are a number of possible explanations for this discrepancy in sensitivity with earlier studies:
(a) In some cells, the sample size per diagnostic group was limited, which may have influenced the results. Gotham et al. (2007, 2008) excluded cells with fewer than 15 participants per diagnostic group, whereas we included some diagnostic groups with only 10 cases (Module 1/No Words, mental age > 15 months) or 14 cases (Module 2, > 5 years) per cell.
(b) Flaws in ADOS coding could distort results. However, this is improbable because our teams were well trained, and five members are recognized as ADOS trainers by leading institutions in the UK or USA. The trainers supervised less experienced colleagues in order to ensure reliability.
(c) Sample variation may play a role. Most children in this study were clinically referred within the context of an extensive early screening project (see Oosterling et al. 2009). The fact that our sample was younger, higher functioning, and included fewer core cases of autism than the samples in other studies may have given rise to misclassification. Higher functioning children often show milder ASD symptoms or they may “cover up” their weaknesses in a semi-structured one-to-one context. As the disabilities of higher-functioning ASD children are often more evident in complex situations of daily life, we would expect the ADI-R to be more sensitive, which proved to be the case, although only modestly so, for participants who were administered Module 2.
(d) Another point concerns how diagnoses were established. As the agreement between ADOS and ADI-R classifications is poor, clinicians must have often relied on only one of the instruments in assigning a DSM-IV diagnosis (APA 2000) in addition to observations from other assessments. As the clinicians who made the final diagnoses were highly experienced and the independent variable ‘clinician responsible for final diagnosis’ was not a predictor of the ADOS false negative cases, there is little reason to question the quality of clinical diagnoses made in this study. However, it is a matter of speculation to what extent differences in ADOS outcomes between the sites included in this study and those in the original study performed in the USA are related to differences in assigning clinical diagnoses (clinician based vs. DSM-IV based). The definition of non-autism ASD might be broader in Europe than in the USA, or the distinction between autism and non-autism ASD might be slightly different in clinical practice. For this reason, we added the ASD versus non-spectrum disorders comparison to the analyses, which improved diagnostic validity relative to the stricter comparisons (autism vs. non-spectrum disorders and non-autism ASD vs. non-spectrum disorders), especially when the revised algorithms based on SA alone were used.
(e) Finally, the choice of module may have influenced results. Guidelines for module choice are clearly described (Lord et al. 2001), but in clinical practice the choice of whether to administer Module 1 or Module 2 is sometimes difficult and arbitrary. In this regard, Klein-Tasman et al. (2007) found that administration of an ‘easier’ module instead of an appropriate module resulted in underclassification of autism in participants administered Module 1/2, and in underclassification of ASD in participants administered Module 2/3. Therefore, we sought to verify whether the choice of module could have been responsible for the low sensitivity in our study. Despite careful consideration by the diagnostic team of which module to administer, the distribution of item A1 (Level of Non-echoed Language), with a disproportionately large number of 0 scores, suggests that the choice of module resulted in underclassification of at least some cases. This notion is supported by the observation that verbal IQ was higher in incorrectly classified cases than in correctly classified cases.
To recapitulate, while a combination of factors could have led to the relatively low sensitivity in our study, which is a problem, this does not detract from our overall finding that the diagnostic validity of the revised algorithms is better than that of the original algorithms.
Interestingly, the revised algorithms based on a single factor (SA) were better than revised algorithms based on two factors (combined SA&RRB) for identifying cases of non-autism ASD in the youngest/lowest functioning group (cell A-II) and in the oldest/highest functioning group (cell D). Although these findings are not in line with Gotham et al. (2007) and are based on a limited amount of data, at least for cell A-II, they might indicate that restricted and repetitive behavior is of less diagnostic value in very young or low-functioning children (cell A-II) and/or older or higher functioning children at the milder ends of the spectrum (cell D) than in other children.
One of the aims of revising the ADOS algorithms was to minimize the effect of verbal and nonverbal IQ on ADOS totals (Gotham et al. 2007). However, we found a diminished effect of IQ (verbal and nonverbal) on revised algorithm totals in only a few cells and only with regard to the SA domain total. This suggests that RRB items are (partly) responsible for the influence of verbal/nonverbal IQ on the combined SA&RRB domain total. This is contrary to previous research (De Bildt et al. 2009; Gotham et al. 2007, 2008), which generally found verbal and nonverbal IQ not to affect the ADOS total calculated with the revised algorithms. The fact that our sample was higher functioning than the samples in previous studies could play a role in this difference.
We found that age did not affect the revised algorithm (SA&RRB) domain total in cell C and had only a minor effect in the other cells; the magnitude of this effect was not smaller with the revised algorithm than with the original algorithm, consistent with the findings of other studies (Gotham et al. 2007, 2008; De Bildt et al. 2009).
We found the two-factor model (SA&RRB) of the revised algorithms to fit the data better than the one-factor model (SA), with a slightly unsatisfactory fit for cell D only. In both independent samples of Gotham et al. (2007, 2008), the two-factor model provided a better fit than the one-factor model in all developmental cells. While variation in factor structure is to be expected across samples, the slightly unsatisfactory fit of the model in cell D may be due to the difference in the distribution of diagnoses across samples, which was most apparent in this cell. In our sample, cell D included relatively few autism cases, while autism was overrepresented in cell D in the two studies by Gotham et al. In addition to the two-factor model, Gotham et al. (2007) found a three-factor model, with ‘joint attention’ as the third factor, to fit the data better for both cells A and B, but they decided to use the two-factor model due to its greater consistency across developmental cells. In the current study, the three-factor model was not better in any developmental cell, supporting Gotham et al.’s decision to use the two-factor model.
Limitations
There are some limitations to this study. We did not include any participants assessed with Module 3, which limited the comparison with previous studies. However, we did fill gaps in the literature on other modules. Moreover, the small sample size of some cells (A-II and D) limited the interpretation of some results and prevented analysis of algorithm performance in cell A-I. However, in this regard, a new Toddler Module of the ADOS with novel tasks for infants and toddlers has been reported by Luyster et al. (2009) to meet the diagnostic conditions applicable to very young and/or severely delayed children. A final limitation is that clinical diagnoses were made on the basis of the old algorithms, which may have led to biased outcome measures for the original algorithms. Given this bias, one would have expected the diagnostic agreement to be better with the original algorithms than with the revised algorithms, which was not the case, thereby emphasizing the improved diagnostic validity of the revised algorithms.
Clinical Implications
In general, the current replication study of ADOS Modules 1 and 2 in a large, independent and well-defined population shows that the revised ADOS algorithms (Gotham et al. 2007) improve diagnostic validity compared with the original algorithms. The improvement in predictive validity was most apparent for autism; the sensitivity and specificity of the revised algorithms for non-autism ASDs were only marginally better than those of the original algorithms. The low sensitivity of the revised algorithms may be problematic for some developmental cells or diagnostic subgroups since this reflects an increasing discrepancy between the ADOS algorithm and clinical diagnosis. Although the source of the lower sensitivity merits further study, it does once again emphasize that a diagnosis of ASD requires a multidisciplinary approach that includes a variety of assessment measures.
We confirmed the factor structure proposed by Gotham et al. (2007), and the revised algorithms were minimally influenced by age; however, they were not entirely independent of (non)verbal IQ. Overall, our study continues to support the use of the more homogeneous new ADOS algorithms for clinical and research purposes.
Acknowledgements
We gratefully acknowledge all children and parents who participated in our research project and all clinicians who contributed to the data gathering, especially Nicky de Waal, Tim Woudenberg, Sammy Roording-Ragetlie and Kina Potze. This study was supported by a grant from the Korczak Foundation.