Introduction

Autism spectrum disorders (ASDs) are a complex group of sporadic and familial developmental disorders affecting 1 in 150 births1 and characterized by abnormal social interaction, impaired communication and stereotypic behaviors.2 The etiology of ASD is poorly understood; however, a genetic basis is evidenced by the greater than 70% concordance in monozygotic twins and elevated risk in siblings compared with the general population.3, 4, 5 The search for genetic loci in ASD, including linkage and genome-wide association screens (GWAS), has identified a number of candidate genes and loci on almost every chromosome,6, 7, 8, 9, 10, 11 with multiple hotspots on several chromosomes (for example, CNTNAP2, NLGN4X, NRXN1, IMMP2L, DOCK4, SEMA5A, SYNGAP1, DLGAP2, SHANK2 and SHANK3),7, 12, 13, 14, 15 and copy number variations.9, 13, 16, 17, 18, 19, 20, 21 However, none of these has provided adequate specificity or accuracy to be used in ASD diagnosis. Novel approaches are required22 to examine multiple genetic variants and their additive contribution,19, 23, 24 taking into account genetic differences between ethnicities and the distinction between protective and vulnerability single-nucleotide polymorphisms (SNPs).

The present study interrogated the Autism Genetic Resource Exchange (AGRE)25 SNP data with two aims: (1) to identify groups of SNPs that populate known cellular pathways that may be pathogenic or protective for ASD, and (2) to apply machine learning to the identified SNPs to generate a predictive classifier for ASD diagnosis.26 The results were validated in two independent samples: the US Simons Foundation Autism Research Initiative (SFARI) and the UK Wellcome Trust 1958 normal birth cohort (WTBC). This approach assessed the contribution of various SNPs within an additive SNP-based predictive test for ASD.

Materials and methods

The University of Melbourne Human Research Ethics Committee approved the study (Approval Numbers 0932503.1, 0932503.2).

Subjects

(i) Index sample: subject data from 2609 probands with ASD (including Autism, Asperger’s or Pervasive Developmental Disorder-not otherwise specified, but excluding Rett syndrome and Fragile X) and 4165 relatives of probands were available from AGRE (http://www.agre.org); 1862 probands and 2587 first-degree relatives had SNP data from the Illumina 550 platform relevant to analyses (Figure 1a). Diagnosis of ASD was made by a specialist clinician and confirmed using the Autism Diagnostic Interview Revised (ADI-R27). Control training data were obtained from HapMap28 instead of relatives, as the latter may possess SNPs that predispose to ASD and skew the analysis (Figures 1a and b).

Figure 1

(a and b) Flow charts show the subjects used in the analyses. Key: AGRE, Autism Genetic Resource Exchange; SFARI, Simons Foundation Autism Research Initiative; WTBC, Wellcome Trust 1958 normal birth cohort; CEU, of Central (Western and Northern) European origin; HAN, of Han Chinese origin; TSI, of Tuscan Italian origin. For panels 1a and b: ‘red boxes’: samples used in developing the predictive algorithm; ‘blue boxes’: samples used to investigate different ethnic groups; ‘green boxes’: validation sets; ‘light green boxes’: relatives assessed, including parents and unaffected siblings. Numbers in brackets represent numbers of males/females.

(ii) Independent validation samples: 737 probands with ASD (ADI-R diagnosed) derived from SFARI; 2930 control subjects from WTBC (Figure 1b).

As SNP incidence rates vary according to ancestral heritage, HapMap data (Phase 3 NCBI build 36) was utilized to allocate individuals to their closest ethnicity, and individuals of mixed ethnicity were excluded. HapMap data has 1 403 896 SNPs available from 11 ethnicities. Any SNPs not included in the AGRE data measured on the Illumina 550 platform were discarded, resulting in 407 420 SNPs. Mitochondrial SNPs reported in AGRE but not available in HapMap were excluded. The 30 most prevalent (>95%) SNPs within each ethnicity were identified, and each ASD individual was assigned to the group with which they shared the highest number of ethnically specific SNPs. HapMap groups were determined to be appropriate for analysis, as prevalence rates of the 30 SNPs relevant to each ethnicity were similar for each AGRE group assigned to that ethnicity, P<0.05.
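
The assignment rule described above can be sketched as follows. This is a minimal illustration, not the study's code: the function name, the toy marker identifiers and the choice to return no assignment on a tie are assumptions made here, since the text does not say how ties were resolved.

```python
def assign_ethnicity(individual_snps, marker_sets):
    """Return the group whose marker SNPs the individual shares most often.

    individual_snps: set of marker SNP identifiers carried by the individual.
    marker_sets: dict mapping a group name to its set of marker SNPs
                 (in the study, the 30 most prevalent SNPs per ethnicity).
    Returns None on a tie (an assumption of this sketch).
    """
    shared = {group: len(individual_snps & markers)
              for group, markers in marker_sets.items()}
    best = max(shared.values())
    winners = [group for group, count in shared.items() if count == best]
    return winners[0] if len(winners) == 1 else None


# Toy marker sets with made-up identifiers, three markers per group.
markers = {"CEU": {"m1", "m2", "m3"}, "HAN": {"m4", "m5", "m6"}}
print(assign_ethnicity({"m1", "m2", "m5"}, markers))
```

In this toy case the individual shares two markers with the first group and one with the second, so the first group is returned.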

Gene set enrichment analysis (GSEA)

Pathway analysis was selected because it depicts how groups of genes may contribute to ASD etiology (Supplementary S1) and mitigates the statistical problem of conducting the large number of multiple comparisons required in GWAS. The current pathway analysis differs from previous ASD analyses in three ways: (1) we divided the cohort into ethnically homogeneous samples with similar SNP rates; (2) both protective and contributory SNPs were accounted for in the analysis; and (3) the pathway test statistic was calculated using permutation analysis. Although this is computationally expensive, benefits include taking account of rare alleles, small sample sizes and familial effects. It also relaxes the Hardy–Weinberg equilibrium assumption that allele and genotype frequencies remain constant within a population over generations. Pathways were obtained from the Kyoto Encyclopedia of Genes and Genomes (KEGG) and SNP-to-gene data were obtained from the National Center for Biotechnology Information (NCBI). Intronic and exonic SNPs were included. AGRE individuals most closely matching the genetics of Utah residents of Western and Northern European (CEU), Tuscan Italian (TSI) and Han Chinese origin were used in the analysis. CEU individuals (975 affected individuals and 165 controls) were chosen as the index sample, representing the largest group affected in AGRE (Figure 1a). At P<1 × 10⁻⁵, allelic prevalence differed for 116 753 SNPs between the CEU and Han Chinese, but for only 627 SNPs between the CEU and TSI. The pathway test statistic was calculated for CEU and Han individuals using a ‘set-based test’ in the PLINK29 software package, with P=0.05, r²=0.5 and permutations set to at least 2 000 000. The significance threshold was set conservatively at P<1 × 10⁻⁵. A Bonferroni-style correction for the number of pathways examined (200) would require only P<0.05/200 = 2.5 × 10⁻⁴, so the chosen threshold is more stringent than the correction demands (see Supplementary S1).
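
The threshold arithmetic and the idea behind a permutation P-value can be illustrated with a short sketch. It is a generic label-shuffling test on a difference of means, written for illustration only; it is not PLINK's set-based test, and the function names, toy data and default of 2000 permutations are assumptions.

```python
import random
from statistics import mean


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold for n_tests comparisons."""
    return alpha / n_tests


def permutation_p_value(cases, controls, n_permutations=2000, seed=0):
    """Two-sided permutation P-value for a difference in group means.

    Group labels are shuffled repeatedly; the P-value is the fraction of
    shuffles whose statistic is at least as extreme as the observed one,
    with the usual +1 correction so the estimate is never exactly zero.
    """
    rng = random.Random(seed)
    observed = abs(mean(cases) - mean(controls))
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        permuted = abs(mean(pooled[:n_cases]) - mean(pooled[n_cases:]))
        if permuted >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# 0.05 spread over 200 pathways; the study's 1e-5 cut-off is stricter.
print(bonferroni_threshold(0.05, 200))
# Toy genotype scores, clearly separated between the two groups.
print(permutation_p_value([2] * 10, [0] * 10))
```

The smallest P-value such a test can report is 1/(n_permutations + 1), which is one reason a threshold of 1 × 10⁻⁵ calls for permutation counts in the millions.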

Predicting ASD phenotype based upon candidate SNPs

For each individual, a 775-dimensional vector was constructed, corresponding to the 775 unique SNPs identified as part of the GSEA. Each dimension of the vector was assigned a value of 0, 1 or 3, depending on whether the individual was homozygous for the major allele, heterozygous or homozygous for the minor allele. The ‘0, 1, 3’ weighting provided greater classification accuracy than ‘0, 1, 2’. Such approaches using superadditive models have been used previously to understand genetic interactions.30 To examine whether SNPs could predict an individual’s clinical status (ASD versus non-ASD), two-tailed unpaired t-tests were used to identify which of the 775 SNPs had statistically significant differences in mean SNP value (P<0.005). This significance level provided low classification error while maintaining acceptable variance in the estimation of regression coefficients for each SNP’s contribution status, and provided the set of SNPs that maximized the classifier output between the populations (Figure 2 and Supplementary S2). This resulted in 237 SNPs selected for regression analysis. The formula for the classifier and classifier performance are presented in Supplementary S3.
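
A minimal sketch of the ‘0, 1, 3’ encoding and the per-SNP screening step is given below. Two points are assumptions of this sketch rather than details from the study: genotypes are supplied as minor-allele counts (0, 1 or 2), and the two-tailed P-value uses a normal approximation to the t distribution, which is reasonable only for samples of several hundred individuals. All names are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

# Minor-allele count -> '0, 1, 3' weighting described in the text.
GENOTYPE_CODE = {0: 0, 1: 1, 2: 3}


def encode(minor_allele_counts):
    """Map minor-allele counts (0, 1, 2) to the superadditive 0, 1, 3 code."""
    return [GENOTYPE_CODE[count] for count in minor_allele_counts]


def two_tailed_p(cases, controls):
    """Unpaired pooled-variance t statistic, normal-approximation P-value."""
    n1, n2 = len(cases), len(controls)
    pooled_var = ((n1 - 1) * variance(cases)
                  + (n2 - 1) * variance(controls)) / (n1 + n2 - 2)
    if pooled_var == 0:
        return 1.0 if mean(cases) == mean(controls) else 0.0
    t = (mean(cases) - mean(controls)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return 2 * (1 - NormalDist().cdf(abs(t)))


def select_snps(case_rows, control_rows, alpha=0.005):
    """Return column indices of SNPs whose group means differ at P < alpha.

    case_rows / control_rows: one list of encoded SNP values per individual.
    """
    selected = []
    for j in range(len(case_rows[0])):
        p = two_tailed_p([row[j] for row in case_rows],
                         [row[j] for row in control_rows])
        if p < alpha:
            selected.append(j)
    return selected


# Toy data: the first SNP separates the groups, the second does not.
cases = [[3, 1], [3, 0], [1, 1], [3, 0], [3, 1], [1, 0]]
controls = [[0, 1], [0, 0], [1, 1], [0, 0], [0, 1], [0, 0]]
print(encode([0, 1, 2]), select_snps(cases, controls))
```

With real data an exact t distribution would normally be preferred for the P-value; the approximation is used here only to keep the sketch free of external dependencies.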

Figure 2

Cumulative coefficient estimation error and percentage classification error as a function of P-value; P=0.005 provides a good trade-off between classification performance and cumulative regression coefficient error.

The CEU sample was divided into a training set (732 ASD individuals and 123 controls), with the remainder comprising the validation set. An affected individual was given a value of 10 and an unaffected individual a value of −10, providing a sufficiently large separation to maximize the distance between means (see Supplementary S3). Least-squares regression on the training set determined the coefficients; the sum of each coefficient multiplied by its SNP value mapped an individual’s SNPs to clinical status. A Kolmogorov–Smirnov goodness-of-fit test assessed the nature of the distribution of SNPs by classification. At P=0.05, the distributions were accepted as being normally distributed, allowing determination of positive and negative predictive values (see ROC, Supplementary S4). The Durbin–Watson test was used to investigate the residual errors of the training set to determine whether further correlations existed. At P=0.05, the residuals were uncorrelated. Regression coefficients were used to assess individual SNP contribution to clinical status.
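
The regression step can be sketched as an ordinary least-squares fit of SNP vectors to targets of +10 (ASD) and −10 (control). The actual classifier formula is in Supplementary S3 and is not reproduced here; the intercept term, the function names and the toy data are assumptions of this sketch, and the goodness-of-fit and residual tests are not shown.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; SNP columns may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_classifier(snp_rows, is_asd, target=10.0):
    """Least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + [float(v) for v in row] for row in snp_rows]
    y = [target if affected else -target for affected in is_asd]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def score(coefficients, snp_row):
    """Classifier score: intercept plus the weighted sum of SNP values."""
    return coefficients[0] + sum(w * v
                                 for w, v in zip(coefficients[1:], snp_row))


# Toy training set with a single SNP in the 0, 1, 3 encoding.
coef = fit_classifier([[3], [3], [0], [0]], [True, True, False, False])
print(coef, score(coef, [1]))
```

A positive coefficient pushes the score towards the ASD target and a negative one towards the control target, which is how the text later distinguishes contributory from protective SNPs.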

AGRE validation

After analyzing the CEU training cohort, three cohorts were used for validation: 285 CEU individuals (243 probands, 42 controls); a genetically similar TSI sample (65 patients, 88 controls); and a genetically dissimilar Han Chinese population (33 patients, 169 controls). To illustrate overlap in SNPs in first-degree relatives of individuals with ASD (n=1512), we mapped the SNPs of parents (n=1219; 581 male) and unaffected siblings (n=293; 98 male) of CEU origin who did not meet criteria for ASD. Finally, the predictive model was modified to use only the 10, 30 or 60 SNPs with the greatest weightings, to test how predictive accuracy changed.
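
The reduced-SNP test can be sketched as ranking SNPs by the size of their regression coefficient and scoring with only the top few. Whether the study ranked by absolute weight, or refitted the coefficients after reducing the set, is not stated; both choices below (absolute value, no refit) are assumptions, and the SNP identifiers are made up.

```python
def top_k_snps(weights, k):
    """Return the k SNP identifiers with the largest absolute weight."""
    ranked = sorted(weights, key=lambda snp: abs(weights[snp]), reverse=True)
    return ranked[:k]


def reduced_score(weights, genotype, kept):
    """Weighted sum over the retained SNPs only (no intercept, no refit)."""
    return sum(weights[snp] * genotype[snp] for snp in kept)


# Hypothetical weights: one contributory, one protective, one near zero.
weights = {"snp_a": 2.0, "snp_b": -5.0, "snp_c": 0.5}
genotype = {"snp_a": 1, "snp_b": 3, "snp_c": 0}
kept = top_k_snps(weights, 2)
print(kept, reduced_score(weights, genotype, kept))
```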

Independent validation

Samples included 507 CEU and 18 TSI subjects with ASD from SFARI, and 2557 CEU and 63 TSI from WTBC (Figure 1b).

Results

Identification of affected pathways

Analyses focused on 975 CEU ASD individuals, in which 13 KEGG pathways were significantly affected (P<1 × 10⁻⁵). The pathway analysis identified 775 significant SNPs perturbed in ASD. A number of the pathways were populated by the same genes and had inter-related functions (Table 1).

Table 1 Statistically significant pathways for the CEU and Han Chinese

The most significant pathways were: calcium signaling, gap junction, long-term depression (LTD), long-term potentiation (LTP), olfactory transduction and mitogen-activated kinase-like protein signaling. GSEA on the genetically distinct Han Chinese identified six pathways that overlapped with 13 pathways in the CEU cohort (estimate of this occurring by chance, P=0.05), including: purine metabolism, calcium signaling, phosphatidylinositol signaling, gap junction, long-term potentiation and long-term depression. Related to these pathways, the statistically significant SNPs in both populations were rs3790095 within GNAO1, rs1869901 within PLCB2, rs6806529 within ADCY5 and rs9313203 in ADCY2.

Diagnostic prediction of ASD

From the 775 SNPs identified within the CEU cohort, accurate genetic classification of ASD versus non-ASD was possible using 237 SNPs determined to be highly significant (P<0.005). Figure 3a shows the distribution of ASD and non-ASD individuals based on genetic classification. An individual’s clinical status was set to ASD if their score exceeded the threshold of 3.93, which corresponds to the intersection point of the two normal curves. The theoretical classification error was 8.55%, and positive (ASD) and negative (controls) predictive values were 96.72% and 94.74%, respectively. Classification accuracy was 85.6% for the 285 CEU AGRE validation individuals and 84.3% for the TSI, whereas accuracy for the Han Chinese population was only 56.4%. Using the same classifier with the identical set of SNPs, accuracy of prediction of ASD in the independent data sets was 71.6%; positive and negative predictive accuracies were 70.8% and 71.8%, respectively.
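
Placing the threshold where two fitted normal curves intersect can be sketched as below. The fitted means and standard deviations of the score distributions are not given in the text, so the values used here are hypothetical and will not reproduce the reported threshold of 3.93 or the 8.55% error; the function names are likewise illustrative.

```python
from math import log, sqrt
from statistics import NormalDist


def decision_threshold(control, asd):
    """Point between the two means where the normal densities are equal."""
    mu1, sd1 = control.mean, control.stdev
    mu2, sd2 = asd.mean, asd.stdev
    # Equating the log-densities gives a*x^2 + b*x + c = 0.
    a = 1 / (2 * sd1 ** 2) - 1 / (2 * sd2 ** 2)
    b = mu2 / sd2 ** 2 - mu1 / sd1 ** 2
    c = mu1 ** 2 / (2 * sd1 ** 2) - mu2 ** 2 / (2 * sd2 ** 2) + log(sd1 / sd2)
    if abs(a) < 1e-15:
        if b == 0:
            raise ValueError("distributions are identical")
        roots = [-c / b]  # equal spreads: a single crossing point
    else:
        disc = sqrt(b * b - 4 * a * c)
        roots = [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]
    low, high = sorted((mu1, mu2))
    between = [r for r in roots if low <= r <= high]
    if not between:
        raise ValueError("no intersection between the two means")
    return between[0]


def error_rates(control, asd, threshold):
    """(false-positive rate, false-negative rate) at the given threshold."""
    return 1 - control.cdf(threshold), asd.cdf(threshold)


# Hypothetical score distributions for controls and ASD individuals.
control, asd = NormalDist(-5, 3), NormalDist(8, 4)
t = decision_threshold(control, asd)
print(t, error_rates(control, asd, t))
```

Scores above the returned threshold would be classified as ASD; the two tail areas give the theoretical error of each group under the normality assumption.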

Figure 3

(a) Genetic-based classification of CEU population (AGRE and Controls) for ASD and non-ASD individuals, showing Gaussian approximation of distribution of individuals. As both the mapped ASD and control populations were well approximated by normal distributions, the asymptotic Test Positive Predictive Value (PPV) and Negative Predictive Value (NPV) were determined. For individuals with CEU ancestry, the PPV and NPV were 96.72% and 94.74%, respectively. (Note that the test was substantially less predictive for individuals of different ancestry, that is, Han Chinese.) (b) Genetic-based classification of CEU population, including first-degree relatives (parents and siblings of ASD children). Note that the distribution of relatives of ASD children maps between the ASD and the control groups, with no difference found between mothers and fathers (see Supplementary material S5). Key: ASD, autism spectrum disorder; relatives, first-degree relatives (parents and siblings); Siblings, siblings of ASD cases not meeting criteria for ASD; Autism Classifier Score, scores for each individual derived from the predictive algorithm, with greater values representing greater risk for autism.

The SNPs of relatives were compared with those of affected and unaffected individuals. Figure 3b shows that relatives (parents and unaffected siblings combined) fall between the two distributions, with a mean score of 2.68 (s.d.=2.27). The percentage overlap of the relatives and affected individuals was 30.4%. The mean scores of the mothers and fathers did not differ (at P=0.05), with scores of 2.83 (s.d.=2.17) and 2.93 (s.d.=2.34), respectively (see Supplementary S5), whereas unaffected siblings (not meeting diagnostic criteria for ASD) fell between parents and cases (mean=4.74, s.d.=3.80). In testing the robustness of the predictive model, using fewer SNPs monotonically decreased accuracy in the AGRE-CEU analyses to 72% for 60 SNPs, 58% for 30 SNPs and 53.5% for 10 SNPs, with the distribution of parents being indistinguishable from controls.

Of the 237 SNPs within our classifier, some contributed to vulnerability to ASD (Table 2a), whereas others were protective (Table 2b). Eight SNPs in three genes, GRM5, GNAO1 and KCNMB4, were highly discriminatory in determining an individual’s classification as ASD or non-ASD. For KCNMB4, rs968122 highly contributed to a clinical diagnosis of ASD, whereas rs12317962 was protective; for GNAO1, SNP rs876619 contributed, whereas rs8053370 was protective; for GRM5, SNP rs11020772 was contributory, whereas rs905646 and rs6483362 were protective.

Table 2 List of 15 most contributory (Table 2a) and 15 most protective (Table 2b) SNPs for ASD diagnosis in the CEU Cohort

Discussion

Using pathway analysis, we have generated a genetic diagnostic classifier based on a linear function of 237 SNPs that accurately distinguished ASD from controls within a CEU cohort. This same diagnostic classifier correctly identified ASD individuals with accuracies of 85.6% and 84.3% in the unseen CEU and TSI cohorts, respectively. Our classifier was then able to predict ASD group membership in subjects derived from two independent data sets with an accuracy of 71.6%, adding strength to our original finding. However, the classifier was sub-optimal at predicting ASD in the genetically distinct Han Chinese cohort, which may be explained by differences in allelic prevalence. Although only 627 SNPs significantly differed between the TSI and CEU cohorts, this figure increased to 116 753 SNPs between the CEU and Han Chinese. It is likely that an additional set of SNPs may be predictive of ASD diagnosis in Han Chinese and that methods used for our classifier could be applicable to other ethnicities. Interestingly, parents and siblings of ASD-CEU individuals fell as distinct groups between the ASD and controls, reinforcing a genetic basis for ASD; neurobehavioral abnormalities reported in parents of ASD individuals also support our findings.31 When we altered the classifier by reducing the number of SNPs, not only did the predictive accuracy suffer but also the relatives merged into the control group. This suggests that use of relatives as controls in SNP-based GWAS is only valid when examining small numbers of SNPs and may not be appropriate when assessing genetic interactions.

There was considerable overlap in the pathways implicated in both the CEU and Han Chinese populations. The analysis demonstrated that SNPs in the Wnt signaling pathway contributed to a diagnosis of ASD in the CEU cohort, but not in the Han Chinese population. Although of interest, a firm conclusion regarding these differences and similarities will require replication in a larger Han Chinese population. Completion of diagnostic classification studies for other ethnic groups will invariably aid in identification of common pathological mechanisms for ASD.

The SNPs contributing most to diagnosis in our classifier corresponded to genes for KCNMB4, GNAO1, GRM5, INPP5D and ADCY8. The three SNPs that markedly skewed an individual towards ASD were related to the genes coding for KCNMB4, GNAO1 and GRM5. Homozygosity for the KCNMB4 SNP carries a higher risk of ASD than SNPs related to GNAO1 and GRM5. By contrast, a number of SNPs protected against ASD, including rs8053370 (GNAO1), rs12317962 (KCNMB4), rs6483362 and rs905646 (GRM5). KCNMB4 is a potassium channel that is important in neuronal excitability and has been implicated in epilepsy and dyskinesia.32, 33 It is highly expressed within the fusiform gyrus, as well as in superior temporal, cingulate and orbitofrontal regions (Allen Human Brain Atlas, http://human.brain-map.org/), which are areas implicated in the face identification and emotional face processing deficits seen in ASD.34 GNAO1 protein is a subgroup of Gα(o), a G-protein that couples with many neurotransmitter receptors. Gα(o) knockout mice exhibit ‘autism-like’ features, including impaired social interaction, poor motor skills, anxiety and stereotypic turning behavior.35 GNAO1 has also been shown to have a role in nervous system development, co-localizing with GRIN1 at neuronal dendrites and synapses,36 and interacting with GAP-43 at neuronal growth cones,37 with increased levels of GAP-43 demonstrated in the white matter adjacent to the anterior cingulate cortex in brains from ASD patients.38

In our findings, GRM5 SNPs have both a contributory (rs11020772) and a protective (rs905646, rs6483362) effect on ASD. GRM5 is highly expressed in hippocampus, inferior temporal gyrus, inferior frontal gyrus and putamen (Allen Human Brain Atlas), regions implicated in ASD brain MRI studies.39 GRM5 has a role in synaptic plasticity, modulation of synaptic excitation, innate immune function and microglial activation.40, 41, 42, 43 Positive allosteric modulators of GRM5 can reverse the negative behavioral effects of NMDA receptor antagonists, including stereotypies, sensorimotor gating deficits and deficits in working, spatial and recognition memory,44 features described in ASD.45, 46 With regard to GRM5’s involvement with neuroimmune function, this receptor is expressed on microglia,40, 47 with microglial activation demonstrated by us and others in frontal cortex in ASD.48, 49

Further, as GRM5 signaling is mediated through G-protein-coupled receptor signaling, an interaction between GNAO1 and GRM5 is plausible. Genes such as PLCB2, ADCY2, ADCY5 and ADCY8 encode proteins involved in G-protein signaling. Given this association, GRM5 may represent a pivotal etiological target for ASD; however, further work is needed to demonstrate these potential interactions and their contribution to glutamatergic dysregulation in ASD.

In conclusion, within genetically homogeneous populations, our predictive genetic classifier obtained a high level of diagnostic accuracy. This demonstrates that genetic biomarkers can correctly distinguish ASD from non-ASD individuals. Further, our approach of identifying groups of SNPs that populate known KEGG pathways has identified potential cellular processes that are perturbed in ASD and are common across ethnic groups. Finally, we identified a small number of genes with various SNPs of influential weighting that strongly determined whether a subject fell within the control or ASD group. Overall, these findings indicate that a SNP-based test may allow for early identification of ASD. Further studies to validate the specificity and sensitivity of this model within other ethnic groups are required. A predictive classifier as described here may provide a tool for screening at birth or during infancy to provide an index of ‘at-risk status’, including probability estimates of ASD-likelihood. Identifying clinical and brain-based developmental trajectories within such a group would provide the opportunity to investigate potential psychological, social and/or pharmacological interventions to prevent or ameliorate the disorder. A similar approach has been adopted in psychosis research, which has improved our understanding of the disorder and prognosis for affected individuals.50