Background

JMI

JMIR Med Inform

JMIR Medical Informatics

2291-9694

JMIR Publications

Toronto, Canada

v9i4e24754

33714937

10.2196/24754

Original Paper

Retracted: Diagnostic Classification and Prognostic Prediction Using Common Genetic Variants in Autism Spectrum Disorder: Genotype-Based Deep Learning

Eysenbach

Gunther

Pang

Shanchen

Fang

Manzanares

Maria

Wang

Haishuai

PhD 1 2

https://orcid.org/0000-0003-1617-0920

Avillach

Paul

MD, PhD 1

Department of Biomedical Informatics Harvard Medical School

10 Shattuck Street

Boston, MA, 02115

United States 1 617 432 2144 Paul_Avillach@hms.harvard.edu

https://orcid.org/0000-0002-0235-7543

1 Department of Biomedical Informatics Harvard Medical School

Boston, MA

United States 2 Department of Computer Science and Engineering Fairfield University

Fairfield, CT

United States

Corresponding Author: Paul Avillach Paul_Avillach@hms.harvard.edu

4 2021

7 4 2021

9 4

e24754

3 10 2020 23 11 2020 18 2 2021 14 3 2021

©Haishuai Wang, Paul Avillach. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 07.04.2021.

2021

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.

https://medinform.jmir.org/2025/1/e76833

https://medinform.jmir.org/2025/1/e66556

Background

In the United States, about 3 million people have autism spectrum disorder (ASD), and around 1 out of 59 children are diagnosed with ASD. People with ASD have characteristic social communication deficits and repetitive behaviors. The causes of this disorder remain unknown; however, in up to 25% of cases, a genetic cause can be identified. Detecting ASD as early as possible is desirable because early detection of ASD enables timely interventions in children with ASD. Identification of ASD based on objective pathogenic mutation screening is the major first step toward early intervention and effective treatment of affected children.

Objective

Recent investigation interrogated genomics data for detecting and treating autism disorders, in addition to the conventional clinical interview as a diagnostic test. Since deep neural networks perform better than shallow machine learning models on complex and high-dimensional data, in this study, we sought to apply deep learning to genetic data obtained across thousands of simplex families at risk for ASD to identify contributory mutations and to create an advanced diagnostic classifier for autism screening.

Methods

After preprocessing the genomics data from the Simons Simplex Collection, we extracted top ranking common variants that may be protective or pathogenic for autism based on a chi-square test. A convolutional neural network–based diagnostic classifier was then designed using the identified significant common variants to predict autism. The performance was then compared with shallow machine learning–based classifiers and randomly selected common variants.

Results

The selected contributory common variants were significantly enriched in chromosome X while chromosome Y was also discriminatory in determining the identification of autistic individuals from nonautistic individuals. The ARSD, MAGEB16, and MXRA5 genes had the largest effect in the contributory variants. Thus, screening algorithms were adapted to include these common variants. The deep learning model yielded an area under the receiver operating characteristic curve of 0.955 and an accuracy of 88% for identifying autistic individuals from nonautistic individuals. Our classifier demonstrated a considerable improvement of ~13% in terms of classification accuracy compared to standard autism screening tools.

Conclusions

Common variants are informative for autism identification. Our findings also suggest that the deep learning process is a reliable method for distinguishing the diseased group from the control group based on the common variants of autism.

deep learning autism spectrum disorder common genetic variants, diagnostic classification

Introduction

Autism spectrum disorder (ASD) is a common neurodevelopmental disorder that begins early in childhood and lasts throughout a person's life. In the United States, around 1 out of 59 children have been diagnosed with ASD. People with ASD have characteristic social communication deficits and repetitive behaviors. Early detection of ASD enables timely interventions for children with ASD. Such interventions could provide the best opportunity to improve outcomes as opposed to treatments started after diagnosis. The epigenetic landscape has revealed that ASD may result from a complex regulatory network, including epigenetic, genetic, and environmental factors [1]. Although the causes of ASD remain unknown, recent studies have found that ASDs are 80% reliant on the inherited genes [1-3]. Twin studies of ASD show heritability as a highly responsible factor causing the disorder [4,5]. Therefore, identifying genomic mutations for autism based upon genotype information for early diagnosis of autism is significantly important. The genetic landscape of ASD is heterogeneous and consists of various types of genetic abnormalities involving almost all genes (eg, SHANK3, SHANK2, CHD8, SEMA5A, DOCK4) with different levels of penetrance [6-8]. Thus, autism studies have been conducted with different types of genetic variants [9-14], including de novo or inherited copy number variants, multiple hits, rare variants, common variants, and genetic pathways associated with ASD.

Rare variants, both inherited and de novo, are causal in 10%-30% of people with ASDs [15-17]. Although risk-associated genes of autism have been identified from rare variations, recent studies have shown that most genetic risks for ASD reside with common variations [18]. A Population-Based Autism Genetics and Environment Study on a Swedish epidemiological sample shows synthesis of results regarding the genetic architecture of ASD and concludes that inherited rare variations constitute a smaller fraction of the total heritability than common variations [18]. Several genome-wide association studies have also examined that 15%-40% of the genetic risk associated with ASD diagnosis is tagged by common variants [19-21]. Therefore, common variants may be informative with respect to the identification of ASD. Numerous studies have since used genetic information to predict the diagnosis of ASD. A single nucleotide polymorphism–based test has been demonstrated to allow for early identification of ASD [22]. In this study, they applied machine learning to identify single nucleotide polymorphisms to generate a predictive classifier for ASD diagnosis and have proved and concluded that the predictive classifier can be a tool to estimate the probability of at-risk status for ASD. To enable earlier and more accurate diagnoses of ASD, a statistical model has been developed for autism to analyze measurements of metabolite concentrations and it indicated that the metabolites under consideration are highly associated with an autism diagnosis [23]. A gene expression–based study has demonstrated that the accuracy of distinguishing ASD subgroups from nonautistic controls by using a support vector machine can be up to 94% [24]. Combining a brain-specific gene network with a complementary machine learning approach has also been conducted to present a genome-wide prediction of autism risk genes [25]. However, none of the existing works provide adequate accuracy or specificity that can be used for autism diagnosis with common variations. Recently, deep neural networks have achieved record-breaking performance in a variety of real-world applications [26-29]. In this study, we adapt deep learning to the task of predicting ASD and propose a deep learning–based framework, named DeepAutism, to predict autism disorder phenotypes by using common variants.

This study first identified significant common variants that may be protective or pathogenic for ASD as well as their additive contribution to ASD; therefore, deep learning models are applicable using common variants. Then, this study applied deep learning prediction algorithms to verify the identified common variants and generate a predictive classifier for ASD diagnosis. The results were tested on a hold-out test data set from the Simons Simplex Collection (SSC), and the proposed strategic approach achieved the best performance in distinguishing the diseased group from the control group based on selected significant common variants of ASD.

The objectives of this study were to (1) discover significant common variants that may be protective or pathogenic for ASD, (2) create an advanced diagnostic classifier for autism screening based on the identified common variants, and (3) verify the developed classifier and significant common variants across thousands of simplex families.

Methods Data Set

We used an autism data set from the SSC [30]. The SSC data consist of 2600 simplex families, each of which has 1 child affected with ASD (a proband), unaffected parents, and at least one unaffected sibling. The data consist of 3931 individuals whose exome sequences are available (Figure 1A), and 2249 samples of these individuals are labeled as diseased group (ASD). From the SSC data set, we can query the specific variables for exome variants (Figure 1A), and the variants are in the variant call format (VCF). There are more than 1.5 million variants in the data set, which has the genotype information along with read depth, allele depth, and genotype quality. In the VCF data, VCF_GT represents the genotype quality, encoded as allele values separated by “/,” such as “0/1” and “2/3”, where 0 represents the reference allele, 1 for the first allele listed in the alternate allele, 2 for the second allele listed in the alternate allele, and so on. Thus, VCF_GT can be “0/0”, “0/1”, “2/0”, “1/2”, and so on. The read depth is denoted as VCF_DP, and the conditional genotype quality is denoted as VCF_CQ. We mainly used the information of VCF_GT, VCF_DP, and VCF_CQ in the SSC data for this study. The Harvard Medical School Research Ethics Committee approved this study.

Figure 1

Overall framework for deciphering contributory common variants and predicting autism spectrum disorder diagnosis. A. Data preprocessing. VCF_GT recoding is to encode VCF_GT values as dummy variables. If both alleles are reference alleles, it is encoded as 0; if both alleles are alternate alleles, it is encoded as 2; otherwise, it is 1. B. Data split and significant variant selection. The data set was split into training set and test set. Variants were ranked based on their chi-score and P value, and only top ranked (high chi-score value and low P value) variants were selected as contributory common variants for autism spectrum disorder. C. Convolutional neural network classifier. The selected significant common variants in the training data were fed into a convolutional neural network to train a classifier. Thereafter, the trained model was applied on the test data for autism spectrum disorder diagnosis prediction. ASD: autism spectrum disorder; CNN: convolutional neural network; SSC: Simons Simplex Collection; VCF: variant call format; VCF_CQ: variant call format-conditional genotype quality; VCF_DP: variant call format-read depth; VCF_GT: variant call format-genotype quality.

Data Preprocessing and Genotype Quality Filters

For all the variants, we have their unphased genotype information using the format of VCF_GT. To make the data processable for deep learning models, we encoded the VCF_GT data by creating categorical values to represent different types of genotype [27]. Specifically, 0 denotes that both allele values are reference alleles, 1 represents one allele value is a reference allele and the other one is the alternate allele, and 2 represents both are alternate alleles. For example, 0/0 is made as 0, 1/0 is made as 1, 1/1 is made as 2, and so on. Therefore, the variants consist of 3 categories: 0, 1, and 2. We used VCF_DP (read depth at a position for a sample) and VCF_CQ (conditional genotype quality) as a filter to control the genotype information quality (Figure 1A). We extracted the genotype information for each variant that has a read depth no less than 10 and genotype quality no less than 90 [31]. Therefore, the genotypes of read depth less than 10 (VCF_DP < 10) or genotype quality less than 90 (VCF_CQ < 90) were excluded. Since we only explored common variants in our study, we removed all the variants with occurrence frequency less than 1% over the whole data set, resulting in 153,347 variants selected as common variants after the genotype quality filters (Figure 1A). We used these common variants for our study. After selecting the common variants, the SSC samples were partitioned into 2 sets based on random sampling of individuals into a training set (80%) and a hold-out test set (20%). There was no overlap of individuals across the 2 partitions. The test set was only used after model fitting to assess performance.

Identifying Contributory Common Genetic Variants

As the number of variants was too large to apply deep learning models directly, to construct the features for the deep learning models, we used feature selection to reduce variant dimension (Figure 1B). Feature selection is one of the core concepts in machine learning that hugely impacts the performance of a model [32-35]. The data features that are used to train machine learning models have a huge influence on the performance that we can achieve. Therefore, our hypothesis is that not all variables contribute to the predictive performance of the models we built. Variant selection is the process wherein we automatically select those features that contribute most to our prediction accuracy and are considered as contributory variants to ASD diagnosis. Therefore, significant common variation selection was applied because variant selection is the process of removing redundant or irrelevant features from the original data set to reduce overfitting. To this end, we analyzed the importance scores of the common variants that are related with ASD development mechanisms in the training data set. For each individual, a 153,347-dimensional vector was constructed, corresponding to 153,347 common variants identified from the data preprocessing. Chi-square test was applied to evaluate the importance of each variant to distinguish the class in order to select the most significant common variants. Given the training data D, we estimated the following quantity for each variant and ranked them by their scores:

where N is the observed frequency in D and E is the expected frequency, e_t takes the value 1 if the training data contains term t and 0; otherwise, e_c takes the value 1 if the training data is in class c and 0 otherwise. For each variant, a corresponding high score indicates that the null hypothesis H₀ of independence (meaning the individual’s category has no influence over the term’s frequency) should be rejected and the occurrence of the variant and class are dependent. In this case, we select the variant for the ASD diagnosis prediction. We used the implementation from scikit-learn [36] for “Chi-Square Feature Selection” with default settings.

By calculating the chi-square scores for all the variants, we can rank the variants by the chi-square scores and then choose the top ranked variants as significant variants for model training. Figure 2A lists the variant importance of the high scoring 20 features (variants) that are selected via the chi-square test. In Figure 2A, while the Y-axis corresponds to variant IDs of the variants, the X-axis corresponds to relative importance, which is calculated using the chi-square score. We selected the top 100 most significant variants as inputs to train a deep learning classifier. Therefore, the number of input contributory variants to our classifier after selection was 100 for ASD prediction. In order to analyze whether the variation data can be divided into 2 clusters representing control and ASD cases, the first 2 groups of data were obtained using t-distributed stochastic neighbor embedding (t-SNE) as an unsupervised learning approach. The visualization of clusters for the top 100 variants using t-SNE in both case group and control group is shown in Figure 2B. From the visualization, accurate genetic classification of control group versus ASD is possible using 100 common variants determined to be highly significant. Therefore, for each individual in the training set, a 100-dimensional input vector was constructed corresponding to 100 selected significant common variants for training a deep learning model.

Figure 2

A. Variants with high relative importance scores in chi-square test. The Y-axis corresponds to variant IDs of these variants, and the X-axis corresponds to the relative importance values of the corresponding variants. B. Visualization of the top 100 selected significantly common variants using t-distributed stochastic neighbor embedding. Different colors represent different classes (ie, case and control). This visualization indicates that the 2 groups are differentiable using the selected top common variants. t-SNE: t-distributed stochastic neighbor embedding.

DeepAutism Architecture

The overall framework of the proposed DeepAutism (Figure 1) consists of 3 components, namely, data preprocessing, variant selection, and neural network classifier. Figure 1C illustrates the convolutional neural network (CNN) architecture. We used Keras and TensorFlow version 2.0 for constructing and training the CNN model. We used a block of two 1D convolutional layers, followed by a max-pooling layer to generate feature maps that contain only the most important features. The max-pooling layer is followed by a dropout layer to avoid overfitting the data. Then, the learned feature maps are combined using a fully connected layer. The final layer contains a sigmoid function to produce probabilities of output from 0 to 1, with the diseased group belonging to class 1 and the control set belonging to class 0. All the parameters, including the weights and biases of hidden layers, are learned through backpropagation [37]. The detailed network topology used in our CNN architecture is shown in Multimedia Appendix 1.

DeepAutism Training and Evaluation

For training, DeepAutism uses a set of selected common variants (top 100 significant common variants) to estimate the probabilities of an individual belonging to control case or autism. For a set of variants v from a testing individual, DeepAutism computes a probability p(v) using 4 states:

p(v) = Sigmoid(netW(pool(ReLUb(convf(v))))

The sigmoid function is used for computing probabilities of a set of variants v belonging to either control group or autism group, and the produced probabilities are from 0 to 1, with the control set belonging to class 0 and the ASD group belonging to class 1. The convolution stage (convf) scans a set of filters as feature maps across the variants. Each neuron consists of a rectified linear unit (ReLU) activation function to introduce nonlinearity between 2 neural networks. The pooling layer only picks the maximum values from the convolved feature maps. Since the variant data are categorical values, one-hot encoding is employed to ensure that the DeepAutism model is unbiased and does not favor one genotype over the others. The DeepAutism is then trained using mini-batch gradient descent by backpropagation algorithm [37]. The performance was evaluated using the area under the receiver operating characteristic curve (AUC). We also used the most common procedure for evaluating classifiers for ASD prediction, including accuracy, sensitivity (recall), specificity (precision), F1-score, and false discovery rate, in which the lower value indicates better performance to evaluate the classifiers for ASD diagnosis.

Baseline Methods to Compare the Effectiveness of DeepAutism

Apart from CNNs, we also employed conventional machine learning techniques to evaluate the effectiveness of DeepAutism for classifying autism diagnosis. The conventional machine learning models that we compared were random forests, logistic regression, and Naive Bayes. We used the same training and test data (with the selected 100 common variants) for the conventional machine learning models as used for the DeepAutism model, aiming to evaluate whether the CNN model outperforms other machine learning classifiers. To evaluate whether the selected top 100 common variants are significant for ASD diagnosis, we also compared the chi-square–based variant selection method with random variant selection by using the same training and test data sets. We randomly selected 100 common variants as inputs that were fed into both DeepAutism and conventional machine learning models to compare the changes in their performance.

Results Identification of Contributory Variants and Genes

Statistical analyses focused on the selected top 100 common variants, which most significantly contributed to the classifiers of ASD. Of the 100 common variants within our classifier, 66% are exonic mutations and 23% are intronic mutations, while small proportions are splicing mutations or from an untranslated region. Within the 66% exonic mutations, about half are synonymous single nucleotide variants and about half are nonsynonymous single nucleotide variants. It is important to point out that the selected contributory common variants were significantly enriched on chromosome X while chromosome Y is also discriminatory in identifying individuals with ASD from individuals without ASD.

A number of variants were populated by the same genes. Related to the contributory common variants, the statistically significant genes were ARSD, MAGEB16, and MXRA5. There are 18 common variants in the ARSD gene. ARSD is a protein-coding gene and is located within a cluster of similar arylsulfatase genes on chromosome X, while a related pseudogene has been identified in the pseudoautosomal region of chromosome Y. Variants rs209372, rs2109135, and rs1047248 in 3 genes, namely, NRK, TLR8, and MAGEA4, respectively, have the highest scores in determining an individual’s classification as with ASD or with no ASD.

Deep Learning Performance Based Upon Contributory Common Variants

After the training phase was over, we picked the same common variations from the test data for each individual. We used the rest of the 787 samples for testing. Based on the trained DeepAutism model, each test individual was predicted the probabilities of belonging to the control group or the diseased group. The deep learning model was extremely accurate in classification of the holdout test set with an AUC of 0.955 (Figure 3A). Figure 3B describes the performance of the DeepAutism classifier on the test data. DeepAutism predicted ASD in 423 samples out of 456 samples with ASD.

Figure 3

A. The area under the receiver operating characteristic curve of DeepAutism, random forest, logistic regression, and Naive Bayes for predicting autism spectrum disorder diagnosis based on the selected top 100 significantly common variants on the test data. B. The visualization table that describes the performance of the DeepAutism classifier on the test data. DeepAutism correctly predicted 697 out of 787 total samples and correctly predicted autism spectrum disorder in 423 samples out of 456 samples with autism spectrum disorders. AUC: area under the receiver operating characteristic curve; ASD: autism spectrum disorder; NB: Naive Bayes; LR: logistic regression; RF: random forest.

Apart from deep learning, we also employed Naive Bayes, logistic regression, support vector machine, random forest, and deep neural network classifiers to compare the prediction of ASD diagnosis. We applied five-fold cross-validation to evaluate the selected significant common variants. Our classifier performed better than the conventional machine learning techniques in terms of AUC, accuracy, specificity, sensitivity, and F1-score. As shown in Table 1, accuracy was 0.886 in the case of DeepAutism, followed by 0.808 for random forest in the same test data set for ASD diagnosis prediction. DeepAutism also yielded the best sensitivity of 0.881 for prediction of ASD and best specificity of 0.893 for non-ASD prediction. The false positive (discriminatory) rate is minimum for DeepAutism with 7% compared with other machine learning techniques. These results are shown in Table 1.

Table 1

Performance of the classifiers with respect to accuracy, sensitivity, specificity, F1-score, and false discovery rate on test sets.^a

Model	Accuracy	Sensitivity	Specificity	F1-score	False discovery rate
DeepAutism	0.886	0.881	0.893	0.905	0.072
Naive Bayes	0.679	0.706	0.633	0.733	0.237
Random forest	0.808	0.785	0.857	0.848	0.079
Logistic regression	0.704	0.715	0.683	0.761	0.186
Support vector machine	0.789	0.773	0.821	0.831	0.101
Deep neural network	0.804	0.766	0.885	0.842	0.073

^aItalicized data demonstrate the best performance; DeepAutism outperformed other models on all the metrics.

Performance Using Randomly Selected Common Variants for ASD Diagnosis

We assessed the classification performance by using randomly picked 100 common variants as inputs to train classifiers. We used the same training and test data as in the above experiment. As shown in Table 2, when the classifiers classify ASD using randomly selected common variants, all the classifiers achieved reduced performance compared to using selected significant common variants. For instance, the AUC and accuracy of DeepAutism dramatically dropped from 0.955 to 0.670 and from 0.885 to 0.689, respectively. The random 100 common variants yielded accuracy of 0.454 and 0.583 using Naive Bayes and logistic regression classifiers, respectively, which is like random guessing. This revealed that the random 100 common variants are not discriminative in distinguishing ASD diagnosis. These results suggest that variant selection is important for identifying significant common variants that are more correlated and significant in improving the classification accuracy.

Table 2

Performance of the classifiers with respect to area under receiver operating characteristic curve, accuracy, sensitivity, specificity, F1-score, and false discovery rate on test sets with randomly picked 100 common variants.^a

Model	Area under receiver operating characteristic curve	Accuracy	Sensitivity	Specificity	F1-score	False discovery rate
DeepAutism	0.670	0.689	0.685	0.697	0.755	0.145
Naive Bayes	0.556	0.454	0.717	0.432	0.166	0.906
Random forest	0.701	0.629	0.612	0.855	0.754	0.018
Logistic regression	0.571	0.583	0.598	0.489	0.704	0.143
Support vector machine	0.672	0.679	0.633	0.571	0.696	0.139
Deep neural network	0.656	0.677	0.681	0.702	0.733	0.143

^aItalicized data show the best performance; the performance of all models became worse on all the metrics with randomly selected common variants.

Discussion

Predicting ASD based on genetic data is challenging. Using common variant analysis, we generated a genetic diagnostic classifier (DeepAutism) based on a deep learning architecture using 100 significant common variants, and we accurately distinguished ASD from controls within the SSC data set. The diagnostic classifier was able to correctly classify individuals with ASD with an accuracy of 88.6% and an AUC of 0.955. Our findings showed that the sensitivity and specificity of the classifier when applied to identify ASD were 88% and 89%, respectively. It is notable that the sensitivity for identifying cases is highly desirable for screening purposes. We also investigated the classification performance of different approaches and the corresponding proportion of subjects who did not have ASD who could be reliably classified as controls. DeepAutism can be suggested as an alternative to conventional shallow machine learning approaches. In the comparisons among the classifiers, DeepAutism performed the best, followed by random forest. Both these classifiers are nonlinear models. Therefore, the causes of ASD are not a simple linear combination of common variants.

Interestingly, when we altered the classifier by using randomly selected 100 common variants, the AUC and accuracy of DeepAutism reduced to 0.670 and 0.689, respectively. The performance became worse because irrelevant variants can include noisy data, thereby affecting the classification accuracy negatively. This verifies the significance of selecting common variants and greatly adds strength to our original findings. Our results suggest that common variants may contribute to ASD diagnosis. A study [18] has shown that the genetic architecture of ASD is contributed by inherited common variants, which supports our findings. The common variants contributing most to the diagnosis in our classifier corresponded to genes on chromosome X. This suggests that ASD is associated with gender. As ASD is strongly biased toward males with ratios of 4:1 (male:female) [38] and statistics have also shown that ASD has a higher prevalence in males than in females [39], mutations in the genes on the X chromosome may explain the increased prevalence of autism in boys compared to that in girls. Thus, this supports our finding that gender bias affects individuals with autism.

In our findings, ARSD, MAGEB16, and MXRA5 genes were found to have a high contributory effect on ASD. ARSD is located within a cluster of similar arylsulfatase genes on chromosome X. ARSD is clinically heterogeneous and is likely to result from mutations in developmental genes or from regulating transcription factors [40]. ARSD has already been reported to be related to ASD or Asperger’s syndrome [41]. The cytogenetic location of ARSD is Xp22.33, and this location significantly contributes to ASD, as shown in the SFARI Gene Database [42]. These regions play a role in neurodevelopment disorders [31,43-48]. Although we used common variants as features in our classifier, we also found that prevalence of X-chromosome copy number variations contribute to ASD. MAGEB16 is also a protein-coding gene, which is located on Xp21.1. MAGEB16 has been implicated in syndromic X-linked intellectual disability and neurodevelopmental disorders. It has also been reported to be associated with autistic disorders [49]. MXRA5 is a protein-coding gene and encodes a protein that forms the extracellular matrix structural constituent. It is involved in the response to transforming growth factor beta and has a pseudogene on chromosome Y. An association has been curated linking MXRA5 and an autistic disorder in Pan paniscus. Although mutations in SHANK3 have been identified in multiple individuals with ASD, most of the mutations are rare variants and not common variants, where the ratio between rare variants and common variants is 230:9 according to the SFARI Gene Database [50].

ASD is a complex behavioral disorder with a strong genetic influence [51]. Diagnosing ASD can be difficult because there is no medical test (such as a blood test) to diagnose this disorder. Although the majority of studies toward biomarker identification for autism have focused on rare genetic variants, we have proven that common genetic variants are also informative with respect to the identification of ASD. In our study, our genetic classifier obtained a high level of diagnostic accuracy, thereby demonstrating that genetic biomarkers can correctly identify individuals with ASD from individuals without ASD. Common variants can play a very important role in screening ASD at an early stage. We identified a few genes with various common variants that could determine whether an individual fell within the case or control group. Our results demonstrate the value of a data-driven approach for the identification of significant common variants and a deep learning method for ASD diagnosis. Overall, these findings indicate that a common variant–based test may allow for early identification of ASD. A genetic predictive classifier as described here may be a tool for ASD screening at birth to provide probability estimates of ASD.

Although our approach for identifying autism based on the selected common variants achieves high accuracy, some limitations exist that need improvement in the future work: (1) the experiments were conducted on the SSC dataset; however, more datasets could be used to evaluate the proposed method and the selected common variants and (2) the proposed algorithm, based on CNN, is a straightforward solution for identifying autism from nonautism; however, more state-of-the-art classifiers could be applied to this ASD classification problem.

While the proposed DeepAutism approach has achieved great success in ASD identification with promising empirical results, we would still like to explore several important directions on DeepAutism in the future. First, we plan to further design an advanced deep learning algorithm that can handle high-dimensional features and output the feature importance for variant selection. By using the designed model, we can select significant variants and classify autistic individuals simultaneously as an end-to-end framework. Second, we will evaluate the proposed method on 2 more distinct ASD cohorts: (1) Simons Foundation Powering Autism Research for Knowledge data and (2) Autism Speaks MMSSNG cohort. We will also validate our algorithms with the UK Biobank clinical and genomic data. Third, we will investigate the full sequences of coding and noncoding regions of the genome between probands and unaffected siblings to explore all of the components in the genetic architecture of ASD.

Multimedia Appendix 1

Architecture of DeepAutism.

Abbreviations

ASD

autism spectrum disorder

AUC

area under the receiver operating characteristic curve

CNN

convolutional neural network

SSC

Simons Simplex Collection

t-SNE

t-distributed stochastic neighbor embedding

VCF

variant call format

VCF_CQ

variant call format-conditional genotype quality

VCF_DP

variant call format-read depth

VCF_GT

variant call format-genotype quality

None declared.

Rylaarsdam

Guemez-Gamboa

Genetic Causes and Modifiers of Autism Spectrum Disorder

Front Cell Neurosci 2019 13 385

10.3389/fncel.2019.00385

31481879

PMC6710438

Frazier

Thompson

Youngstrom

Law

Hardan

Eng

Morris

A twin study of heritable and shared environmental contributions to autism

J Autism Dev Disord 2014 08 44 8 2013 25

10.1007/s10803-014-2081-2

24604525

PMC4104233

Sutton

Autism caused mostly by genetics, according to study

Disability Compliance for Higher Education 2019 08 22 25 2 9 9

10.1002/dhe.30707

McDonald

Senturk

Scheffler

Brian

Carver

Charman

Chawarska

Curtin

Hertz-Piccioto

Jones

EJH

Klin

Landa

Messinger

Ozonoff

Stone

Tager-Flusberg

Webb

Young

Zwaigenbaum

Jeste

Developmental Trajectories of Infants With Multiplex Family Risk for Autism: A Baby Siblings Research Consortium Study

JAMA Neurol 2020 01 01 77 1 73 81

10.1001/jamaneurol.2019.3341

31589284

2752283

PMC6784852

de la Torre-Ubieta

Won

Stein

Geschwind

Advancing the understanding of autism disease mechanisms through genetics

Nat Med 2016 04 22 4 345 61

10.1038/nm.4071

27050589

nm.4071

PMC5072455

Derecki

Cronk

Abbott

SBG

Guyenet

Kipnis

Wild-type microglia arrest pathology in a mouse model of Rett syndrome

Nature 2012 03 18 484 7392 105 9

10.1038/nature10907

22425995

nature10907

PMC3321067

Trost

Engchuan

Nguyen

Thiruvahindrapuram

Dolzhenko

Backstrom

Mirceta

Mojarad

Yin

Dov

Chandrakumar

Prasolava

Shum

Hamdan

Pellecchia

Howe

Whitney

Klee

Baheti

Amaral

Anagnostou

Elsabbagh

Fernandez

Hoang

Lewis

MES

Liu

Sjaarda

Smith

Szatmari

Zwaigenbaum

Glazer

Hartley

Stewart

Eberle

Sato

Pearson

Scherer

Yuen

RKC

Genome-wide detection of tandem DNA repeats that are expanded in autism

Nature 2020 10 586 7827 80 86

10.1038/s41586-020-2579-z

32717741

10.1038/s41586-020-2579-z

Maestrini

Pagnamenta

Lamb

Bacchelli

Sykes

Sousa

Toma

Barnby

Butler

Winchester

Scerri

Minopoli

Reichert

Cai

Buxbaum

Korvatska

Schellenberg

Dawson

de Bildt

Minderaa

Mulder

Morris

Bailey

Monaco

IMGSAC

High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility

Mol Psychiatry 2010 09 15 9 954 68

10.1038/mp.2009.34

19401682

mp200934

PMC2934739

Bai

Yip

BHK

Windham

Sourander

Francis

Yoffe

Glasson

Mahjani

Suominen

Leonard

Gissler

Buxbaum

Wong

Schendel

Kodesh

Breshnahan

Levine

Parner

Hansen

Hultman

Reichenberg

Sandin

Association of Genetic and Environmental Factors With Autism in a 5-Country Cohort

JAMA Psychiatry 2019 10 01 76 10 1035 1043

10.1001/jamapsychiatry.2019.1411

31314057

2737582

PMC6646998

Lintas

Picinelli

Piras

Sacco

Brogna

Persico

Copy number variation in 19 Italian multiplex families with autism spectrum disorder: Importance of synaptic and neurite elongation genes

Am J Med Genet B Neuropsychiatr Genet 2017 07 174 5 547 556

10.1002/ajmg.b.32537

28304131

Sahin

Keshav

Salisbury

Vahabzadeh

Second Version of Google Glass as a Wearable Socio-Affective Aid: Positive School Desirability, High Usability, and Theoretical Framework in a Sample of Children with Autism

JMIR Hum Factors 2018 01 04 5 1 e1

10.2196/humanfactors.8785

29301738

v5i1e1

PMC5773819

Ahmed

Simon

Dempsey

Samaco

Goin-Kochel

Evaluating Two Common Strategies for Research Participant Recruitment Into Autism Studies: Observational Study

J Med Internet Res 2020 09 24 22 9 e16752

10.2196/16752

32969826

v22i9e16752

PMC7545324

Siu

Butcher

D T

Turinsky

A L

Cytrynbaum

Stavropoulos

D J

Walker

Caluseriu

Carter

Lou

Nicolson

Georgiades

Szatmari

Anagnostou

Scherer

S W

Choufani

Brudno

Weksberg

Functional DNA methylation signatures for autism spectrum disorder genomic risk loci: 16p11.2 deletions and CHD8 variants

Clin Epigenetics 2019 07 16 11 1 103

10.1186/s13148-019-0684-3

31311581

10.1186/s13148-019-0684-3

PMC6636171

Turner

Coe

Dickel

Hoekzema

Nelson

Zody

Kronenberg

Hormozdiari

Raja

Pennacchio

Darnell

Eichler

Genomic Patterns of De Novo Mutation in Simplex Autism

Cell 2017 10 19 171 3 710 722.e12

10.1016/j.cell.2017.08.047

28965761

S0092-8674(17)31006-1

PMC5679715

Vorstman

JAS

Parr

Moreno-De-Luca

Anney

RJL

Nurnberger

John I

Hallmayer

Autism genetics: opportunities and challenges for clinical translation

Nat Rev Genet 2017 06 18 6 362 376

10.1038/nrg.2017.4

28260791

nrg.2017.4

Ronemus

Iossifov

Levy

Wigler

The role of de novo mutations in the genetics of autism spectrum disorders

Nat Rev Genet 2014 02 15 2 133 41

10.1038/nrg3585

24430941

nrg3585

Sanders

Willsey

Ercan-Sencicek

Samocha

Cicek

Murtha

Bal

Bishop

Dong

Goldberg

Jinlu

Keaney

Klei

Mandell

Moreno-De-Luca

Poultney

Robinson

Smith

Solli-Nowlan

Teran

Walker

Werling

Beaudet

Cantor

Fombonne

Geschwind

Grice

Lord

Lowe

Mane

Martin

Morrow

Talkowski

Sutcliffe

Walsh

Autism Sequencing Consortium Ledbetter

David H

Martin

Christa Lese

Cook

Edwin H

Buxbaum

Joseph D

Daly

Mark J

Devlin

Bernie

Roeder

Kathryn

State

Matthew W

Insights into Autism Spectrum Disorder Genomic Architecture and Biology from 71 Risk Loci

Neuron 2015 09 23 87 6 1215 1233

10.1016/j.neuron.2015.09.016

26402605

S0896-6273(15)00773-4

PMC4624267

Gaugler

Klei

Sanders

Bodea

Goldberg

Lee

Mahajan

Manaa

Pawitan

Reichert

Ripke

Sandin

Sklar

Svantesson

Reichenberg

Hultman

Devlin

Roeder

Buxbaum

Most genetic risk for autism resides with common variation

Nat Genet 2014 08 46 8 881 5

10.1038/ng.3039

25038753

ng.3039

PMC4137411

Grove Ripke

Stephan

Als

Thomas D

Mattheisen

Manuel

Walters

Raymond K

Won

Hyejung

Pallesen

Jonatan

Agerbo

Esben

Andreassen

Ole A

Anney

Richard

Awashti

Swapnil

Belliveau

Rich

Bettella

Francesco

Buxbaum

Joseph D

Bybjerg-Grauholm

Jonas

Bækvad-Hansen

Marie

Cerrato

Felecia

Chambert

Kimberly

Christensen

Jane H

Churchhouse

Claire

Dellenvall

Karin

Demontis

Ditte

De Rubeis

Silvia

Devlin

Bernie

Djurovic

Srdjan

Dumont

Ashley L

Goldstein

Jacqueline I

Hansen

Christine S

Hauberg

Mads Engel

Hollegaard

Mads V

Hope

Sigrun

Howrigan

Daniel P

Huang

Hailiang

Hultman

Christina M

Klei

Lambertus

Maller

Julian

Martin

Joanna

Martin

Alicia R

Moran

Jennifer L

Nyegaard

Mette

Nærland

Terje

Palmer

Duncan S

Palotie

Aarno

Pedersen

Carsten Bøcker

Pedersen

Marianne Giørtz

dPoterba

Timothy

Poulsen

Jesper Buchhave

Pourcain

Beate St

Qvist

Per

Rehnström

Karola

Reichenberg

Abraham

Reichert

Jennifer

Robinson

Elise B

Roeder

Kathryn

Roussos

Panos

Saemundsen

Evald

Sandin

Sven

Satterstrom

F Kyle

Davey Smith

George

Stefansson

Hreinn

Steinberg

Stacy

Stevens

Christine R

Sullivan

Patrick F

Turley

Patrick

Walters

G Bragi

Xinyi

Autism Spectrum Disorder Working Group of the Psychiatric Genomics Consortium BUPGEN Major Depressive Disorder Working Group of the Psychiatric Genomics Consortium 23andMe Research Team Stefansson

Kari

Geschwind

Daniel H

Nordentoft

Merete

Hougaard

David M

Werge

Thomas

Mors

Ole

Mortensen

Preben Bo

Neale

Benjamin M

Daly

Mark J

Børglum

Anders D

Identification of common genetic risk variants for autism spectrum disorder

Nat Genet 2019 03 51 3 431 444

10.1038/s41588-019-0344-8

30804558

10.1038/s41588-019-0344-8

PMC6454898

Thapar

Rutter

Genetic Advances in Autism

J Autism Dev Disord 2020 09 17

10.1007/s10803-020-04685-z

32940822

10.1007/s10803-020-04685-z

Chen

Peñagarikano

Olga

Belgard

Swarup

Geschwind

The emerging picture of autism spectrum disorder: genetics and pathology

Annu Rev Pathol 2015 10 111 44

10.1146/annurev-pathol-012414-040405

25621659

Skafidas

Testa

Zantomio

Chana

Everall

Pantelis

Predicting the diagnosis of autism spectrum disorder using gene pathway analysis

Mol Psychiatry 2014 04 19 4 504 10

10.1038/mp.2012.126

22965006

mp2012126

PMC3966080

Howsmon

Kruger

Melnyk

James

Hahn

Classification and adaptive behavior prediction of children with autism spectrum disorder based upon multivariate data analysis of markers of oxidative stress and DNA methylation

PLoS Comput Biol 2017 03 13 3 e1005385

10.1371/journal.pcbi.1005385

28301476

PCOMPBIOL-D-16-01316

PMC5354243

Amoedo

Martnez-Costa

MDP

Moreno

An analysis of the communication strategies of Spanish commercial music networks on the web: http://los40.com, http://los40principales.com, http://cadena100.es, http://europafm.es and http://kissfm.es

Radio Journal: International Studies in Broadcast & Audio Media 2009 02 01 6 1 5 20

10.1386/rajo.6.1.5_4

Krishnan

Zhang

Yao

Theesfeld

Wong

Tadych

Volfovsky

Packer

Lash

Troyanskaya

Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder

Nat Neurosci 2016 11 19 11 1454 1462

10.1038/nn.4353

27479844

nn.4353

PMC5803797

Chen

Yuan

Gerstein

Liang

Froehlich

The Development of a Practical Artificial Intelligence Tool for Diagnosing and Evaluating Autism Spectrum Disorder: Multicenter Study

JMIR Med Inform 2020 05 08 8 5 e15767

10.2196/15767

32041690

v8i5e15767

PMC7244998

Sundaram

Laksshman

Bhat

Viswanath

DeepBipolar: Identifying genomic mutations for bipolar disorder via deep learning

Hum Mutat 2017 09 38 9 1217 1224

10.1002/humu.23272

28600868

PMC5656045

Moon

Hwang

Kana

Torous

Kim

Accuracy of Machine Learning Algorithms for the Diagnosis of Autism Spectrum Disorder: Systematic Review and Meta-Analysis of Brain Magnetic Resonance Imaging Studies

JMIR Ment Health 2019 12 20 6 12 e14108

10.2196/14108

31562756

v6i12e14108

PMC6942187

Ben-Sasson

Robins

Yom-Tov

Risk Assessment for Parents Who Suspect Their Child Has Autism Spectrum Disorder: Machine Learning Approach

J Med Internet Res 2018 04 24 20 4 e134

10.2196/jmir.9496

29691210

v20i4e134

PMC5941093

Sullivan

Gallagher

Heron

Gaining Insights into Aggressive Behaviour in Autism Spectrum Disorder Using Latent Profile Analysis

J Autism Dev Disord 2019 10 49 10 4209 4218

10.1007/s10803-019-04129-3

31292900

10.1007/s10803-019-04129-3

PMC6751274

Doan

Lim

De Rubeis

Betancur

Cutler

Chiocchetti

Overman

Soucy

Goetze

Autism Sequencing Consortium Freitag

Christine M

Daly

Mark J

Walsh

Christopher A

Buxbaum

Joseph D

Timothy W

Recessive gene disruptions in autism spectrum disorder

Nat Genet 2019 07 51 7 1092 1098

10.1038/s41588-019-0433-8

31209396

10.1038/s41588-019-0433-8

PMC6629034

Velusamy

Ramasamy

Ensemble of heterogeneous classifiers for diagnosis and prediction of coronary artery disease with reduced feature subset

Comput Methods Programs Biomed 2021 01 198 105770

10.1016/j.cmpb.2020.105770

33027698

S0169-2607(20)31603-5

Jiang

Cameron

Yang

Analysis of Massive Online Medical Consultation Service Data to Understand Physicians' Economic Return: Observational Data Mining Study

JMIR Med Inform 2020 02 18 8 2 e16765

10.2196/16765

32069213

v8i2e16765

PMC7055801

Chikersal

Doryab

Tumminia

Villalba

Dutcher

Liu

Cohen

Creswell

Mankoff

Creswell

Goel

Dey

Detecting Depression and Predicting its Onset Using Longitudinal Symptoms Captured by Passive Sensing

ACM Trans Comput Hum Interact 2021 02 28 1 1 41

10.1145/3422821

Zhang

Zhou

Zhang

Song

A Stroke Risk Detection: Improving Hybrid Feature Selection Method

J Med Internet Res 2019 04 02 21 4 e12437

10.2196/12437

30938684

v21i4e12437

PMC6466481

Scikit-learn: Machine learning in Python 2021-03-22

http://scikit-learn.org

Obeid Dahne

Jennifer

Christensen

Sean

Howard

Samuel

Crawford

Tami

Frey

Lewis J

Stecker

Tracy

Bunnell

Brian E

Identifying and Predicting Intentional Self-Harm in Electronic Health Record Clinical Notes: Deep Learning Approach

JMIR Med Inform 2020 07 30 8 7 e17784

10.2196/17784

32729840

v8i7e17784

PMC7426805

Cahill

A New Link Between Autism and Masculinity

JAMA Psychiatry 2017 04 01 74 4 318

10.1001/jamapsychiatry.2016.4066

28196210

2601065

Hull

Petrides

Mandy

The Female Autism Phenotype and Camouflaging: a Narrative Review

Rev J Autism Dev Disord 2020 01 29 7 4 306 317

10.1007/s40489-020-00197-9

Turnpenny

Bulman

Frayling

Abu-Nasra

Garrett

Hattersley

Ellard

A gene for autosomal recessive spondylocostal dysostosis maps to 19q13.1-q13.3

Am J Hum Genet 1999 07 65 1 175 82

10.1086/302464

10364530

S0002-9297(07)63741-0

PMC1378088

ARSD: arylsulfatase D 2021-03-22

https://www.wikigenes.org/e/gene/e/414.html

Copy number variants/Xp22.33

SFARI Gene 2021-03-22

https://gene-archive.sfari.org/database/cnv/Xp22.33

Willemsen

de Leeuw

de Brouwer

Pfundt

Hehir-Kwa

Yntema

Nillesen

de Vries

van Bokhoven

Kleefstra

Interpretation of clinical relevance of X-chromosome copy number variations identified in a large cohort of individuals with cognitive disorders and/or congenital anomalies

Eur J Med Genet 2012 11 55 11 586 98

10.1016/j.ejmg.2012.05.001

22796527

S1769-7212(12)00138-3

Asadollahi

Oneda

Joset

Azzarello-Burri

Bartholdi

Steindl

Vincent

Cobilanschi

Sticht

Baldinger

Reissmann

Sudholt

Thiel

Ekici

Reis

Bijlsma

Andrieux

Dieux

FitzPatrick

Ritter

Baumer

Latal

Plecko

Jenni

Rauch

The clinical significance of small copy number variants in neurodevelopmental disorders

J Med Genet 2014 10 51 10 677 88

10.1136/jmedgenet-2014-102588

25106414

jmedgenet-2014-102588

PMC4173859

Kushima Aleksic

Branko

Nakatochi

Masahiro

Shimamura

Teppei

Okada

Takashi

Uno

Yota

Morikawa

Mako

Ishizuka

Kanako

Shiino

Tomoko

Kimura

Hiroki

Arioka

Yuko

Yoshimi

Akira

Takasaki

Yuto

Yanjie

Nakamura

Yukako

Yamamoto

Maeri

Iidaka

Tetsuya

Iritani

Shuji

Inada

Toshiya

Ogawa

Nanayo

Shishido

Emiko

Torii

Youta

Kawano

Naoko

Omura

Yutaka

Yoshikawa

Toru

Uchiyama

Tokio

Yamamoto

Toshimichi

Ikeda

Masashi

Hashimoto

Ryota

Yamamori

Hidenaga

Yasuda

Yuka

Someya

Toshiyuki

Watanabe

Yuichiro

Egawa

Jun

Nunokawa

Ayako

Itokawa

Masanari

Arai

Makoto

Miyashita

Mitsuhiro

Kobori

Akiko

Suzuki

Michio

Takahashi

Tsutomu

Usami

Masahide

Kodaira

Masaki

Watanabe

Kyota

Sasaki

Tsukasa

Kuwabara

Hitoshi

Tochigi

Mamoru

Nishimura

Fumichika

Yamasue

Hidenori

Eriguchi

Yosuke

Benner

Seico

Kojima

Masaki

Yassin

Walid

Munesue

Toshio

Yokoyama

Shigeru

Kimura

Ryo

Funabiki

Yasuko

Kosaka

Hirotaka

Ishitobi

Makoto

Ohmori

Tetsuro

Numata

Shusuke

Yoshikawa

Takeo

Toyota

Tomoko

Yamakawa

Kazuhiro

Suzuki

Toshimitsu

Inoue

Yushi

Nakaoka

Kentaro

Goto

Yu-Ichi

Inagaki

Masumi

Hashimoto

Naoki

Kusumi

Ichiro

Son

Shuraku

Murai

Toshiya

Ikegame

Tempei

Okada

Naohiro

Kasai

Kiyoto

Kunimoto

Shohko

Mori

Daisuke

Iwata

Nakao

Ozaki

Norio

Comparative Analyses of Copy-Number Variation in Autism Spectrum Disorder and Schizophrenia Reveal Etiological Overlap and Biological Insights

Cell Rep 2018 09 11 24 11 2838 2856

10.1016/j.celrep.2018.08.022

30208311

S2211-1247(18)31293-2

Rosenfeld

Ballif

Torchia

Sahoo

Ravnan

Schultz

Lamb

Bejjani

Shaffer

Copy number variations associated with autism spectrum disorders contribute to a spectrum of neurodevelopmental disorders

Genet Med 2010 08 30 12 11 694 702

10.1097/gim.0b013e3181f0c5f3

Edens

Lyons

Duron

Dupont

Barbara R

Holden

Kenton R

Autism in two females with duplications involving Xp11.22-p11.23

Dev Med Child Neurol 2011 05 53 5 463 6

10.1111/j.1469-8749.2010.03909.x

21418194

Ben-David

Granot-Hershkovitz

Monderer-Rothkoff

Lerer

Elad

Levi

Shlomit

Yaari

Maya

Ebstein

Richard P

Yirmiya

Nurit

Shifman

Sagiv

Identification of a functional rare variant in autism using genome-wide screen for monoallelic expression

Hum Mol Genet 2011 09 15 20 18 3632 41

10.1093/hmg/ddr283

21680558

ddr283

MAGEB16

Alliance of genome resources 2021-03-22

https://www.alliancegenome.org/gene/HGNC:21188

SHANK3: SH3 and multiple ankyrin repeat domains 3

SFARI Gene 2021-03-22

https://gene.sfari.org/database/human-gene/SHANK3

Yoo

Genetics of Autism Spectrum Disorder: Current Status and Possible Clinical Applications

Exp Neurobiol 2015 12 24 4 257 72

10.5607/en.2015.24.4.257

26713075

PMC4688327