This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
In the United States, about 3 million people have autism spectrum disorder (ASD), and around 1 in 59 children is diagnosed with ASD. People with ASD have characteristic social communication deficits and repetitive behaviors. The causes of this disorder remain unknown; however, in up to 25% of cases, a genetic cause can be identified. Early detection of ASD is desirable because it enables timely interventions for affected children. Identifying ASD through objective screening for pathogenic mutations is a major first step toward early intervention and effective treatment.
Recent investigations have interrogated genomic data for detecting and treating ASD, in addition to using the conventional clinical interview as a diagnostic test. Because deep neural networks often perform better than shallow machine learning models on complex, high-dimensional data, we sought to apply deep learning to genetic data obtained from thousands of simplex families at risk for ASD, with the aims of identifying contributory mutations and creating an advanced diagnostic classifier for autism screening.
After preprocessing the genomics data from the Simons Simplex Collection, we extracted top ranking common variants that may be protective or pathogenic for autism based on a chi-square test. A convolutional neural network–based diagnostic classifier was then designed using the identified significant common variants to predict autism. The performance was then compared with shallow machine learning–based classifiers and randomly selected common variants.
The selected contributory common variants were significantly enriched on chromosome X, while chromosome Y was also discriminatory in distinguishing individuals with autism from those without. The
Common variants are informative for autism identification. Our findings also suggest that the deep learning process is a reliable method for distinguishing the diseased group from the control group based on the common variants of autism.
Autism spectrum disorder (ASD) is a common neurodevelopmental disorder that begins early in childhood and lasts throughout a person's life. In the United States, around 1 out of 59 children have been diagnosed with ASD. People with ASD have characteristic social communication deficits and repetitive behaviors. Early detection of ASD enables timely interventions for children with ASD. Such interventions could provide the best opportunity to improve outcomes as opposed to treatments started after diagnosis. The epigenetic landscape has revealed that ASD may result from a complex regulatory network, including epigenetic, genetic, and environmental factors [
Rare variants, both inherited and
This study first identified significant common variants that may be protective or pathogenic for ASD, along with their additive contribution to ASD, which makes common variants suitable inputs for deep learning models. We then applied deep learning prediction algorithms to verify the identified common variants and to generate a predictive classifier for ASD diagnosis. The results were tested on a hold-out test data set from the Simons Simplex Collection (SSC); the proposed approach achieved the best performance among the compared models in distinguishing the diseased group from the control group based on the selected significant common variants.
The objectives of this study were to (1) discover significant common variants that may be protective or pathogenic for ASD, (2) create an advanced diagnostic classifier for autism screening based on the identified common variants, and (3) verify the developed classifier and significant common variants across thousands of simplex families.
We used an autism data set from the SSC [
Overall framework for deciphering contributory common variants and predicting autism spectrum disorder diagnosis. A. Data preprocessing. VCF_GT recoding converts each VCF_GT value to a numeric code: 0 if both alleles are reference alleles, 2 if both alleles are alternate alleles, and 1 otherwise. B. Data split and significant variant selection. The data set was split into a training set and a test set. Variants were ranked based on their chi-score and
For all the variants, we have their unphased genotype information using the format of VCF_GT. To make the data processable for deep learning models, we encoded the VCF_GT data by creating categorical values to represent different types of genotype [
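The genotype recoding described in the framework figure (0 for two reference alleles, 2 for two alternate alleles, 1 otherwise) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `encode_genotype` is hypothetical, and rejecting missing calls such as `./.` is an assumption, since the source does not say how they were handled.

```python
def encode_genotype(gt: str) -> int:
    """Encode a diploid VCF GT string as 0, 1, or 2 (hypothetical helper)."""
    # Accept both unphased ("/") and phased ("|") separators.
    alleles = gt.replace("|", "/").split("/")
    # Assumption: missing or non-diploid calls are rejected rather than imputed.
    if len(alleles) != 2 or any(a == "." for a in alleles):
        raise ValueError(f"unsupported or missing genotype: {gt!r}")
    ref_count = sum(a == "0" for a in alleles)
    if ref_count == 2:
        return 0  # both alleles are reference alleles
    if ref_count == 0:
        return 2  # both alleles are alternate alleles
    return 1      # one reference allele and one alternate allele


sample_row = ["0/0", "0/1", "1/1", "1/0"]
encoded = [encode_genotype(gt) for gt in sample_row]
print(encoded)  # [0, 1, 2, 1]
```

Under this rule, a call with two different alternate alleles (for example `1/2`) is also encoded as 2, because neither allele is the reference.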
Because the number of variants was too large for deep learning models to be applied directly, we used feature selection to reduce the variant dimension when constructing the model features (
where
By calculating the chi-square scores for all the variants, we can rank the variants by the chi-square scores and then choose the top ranked variants as significant variants for model training.
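The ranking step can be illustrated with a short sketch. It assumes the Pearson chi-square statistic computed on a genotype-by-diagnosis contingency table; the study's exact formulation may differ, and the names `chi_square_score` and `top_k_variants` as well as the toy data are hypothetical.

```python
from collections import Counter


def chi_square_score(genotypes, labels):
    """Pearson chi-square statistic for a genotype-by-label contingency table."""
    n = len(labels)
    observed = Counter(zip(genotypes, labels))
    genotype_totals = Counter(genotypes)
    label_totals = Counter(labels)
    score = 0.0
    for g, g_total in genotype_totals.items():
        for y, y_total in label_totals.items():
            expected = g_total * y_total / n  # count expected under independence
            score += (observed[(g, y)] - expected) ** 2 / expected
    return score


def top_k_variants(matrix, labels, k):
    """Return the column indices of the k variants with the highest score."""
    n_variants = len(matrix[0])
    scores = [
        chi_square_score([row[j] for row in matrix], labels)
        for j in range(n_variants)
    ]
    return sorted(range(n_variants), key=lambda j: scores[j], reverse=True)[:k]


# Toy data: rows are individuals, columns are variants encoded as 0/1/2.
X = [
    [0, 1, 2],
    [0, 0, 2],
    [2, 1, 2],
    [2, 0, 2],
]
y = [0, 0, 1, 1]  # 0 = control, 1 = ASD
print(top_k_variants(X, y, k=1))  # [0]
```

In the toy data, only the first variant separates the two groups, so it receives the highest score; the other two columns are independent of the label and score 0.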
A. Variants with high relative importance scores in chi-square test. The Y-axis corresponds to variant IDs of these variants, and the X-axis corresponds to the relative importance values of the corresponding variants. B. Visualization of the top 100 selected significantly common variants using t-distributed stochastic neighbor embedding. Different colors represent different classes (ie, case and control). This visualization indicates that the 2 groups are differentiable using the selected top common variants. t-SNE: t-distributed stochastic neighbor embedding.
The overall framework of the proposed DeepAutism (
For training, DeepAutism uses a set of selected common variants (the top 100 significant common variants) to estimate the probability that an individual belongs to the control group or the autism group. For a set of variants
$$p(v) = \mathrm{Sigmoid}\left(\mathrm{net}_W\left(\mathrm{pool}\left(\mathrm{ReLU}_b\left(\mathrm{conv}_f(v)\right)\right)\right)\right)$$
The sigmoid function is used for computing probabilities of a set of variants v belonging to either control group or autism group, and the produced probabilities are from 0 to 1, with the control set belonging to class 0 and the ASD group belonging to class 1. The convolution stage (
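The composition p(v) = Sigmoid(net_W(pool(ReLU_b(conv_f(v))))) can be traced with a small forward pass. This is only a sketch under several assumptions: a single convolution filter, max pooling, one fully connected output unit, and ReLU_b read as max(0, x + b). The layer sizes and parameter values are hypothetical toy numbers, not the trained DeepAutism model.

```python
import math


def conv1d(v, kernel):
    """Valid 1D convolution (cross-correlation) of a sequence with one filter."""
    k = len(kernel)
    return [
        sum(v[i + j] * kernel[j] for j in range(k))
        for i in range(len(v) - k + 1)
    ]


def relu(xs, bias):
    """Rectifier with a bias term: max(0, x + b). The sign convention is assumed."""
    return [max(0.0, x + bias) for x in xs]


def max_pool(xs, size):
    """Non-overlapping max pooling; a trailing partial window is kept."""
    return [max(xs[i:i + size]) for i in range(0, len(xs), size)]


def dense(xs, weights, bias):
    """Fully connected layer with a single output unit."""
    return sum(x * w for x, w in zip(xs, weights)) + bias


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(v, kernel, relu_bias, pool_size, weights, out_bias):
    """Compute Sigmoid(net_W(pool(ReLU_b(conv_f(v)))))."""
    features = max_pool(relu(conv1d(v, kernel), relu_bias), pool_size)
    return sigmoid(dense(features, weights, out_bias))


# Hypothetical toy parameters; real values would be learned during training.
v = [0, 1, 2, 1, 0, 2]  # six encoded genotypes
p = predict(v, kernel=[1.0, -1.0], relu_bias=0.0, pool_size=2,
            weights=[0.5, 0.5, 0.5], out_bias=-1.0)
print(round(p, 3))  # 0.378
```

A probability below 0.5, as here, would assign the individual to class 0 (control) under a 0.5 threshold; the threshold itself is also an assumption.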
Apart from CNNs, we also employed conventional machine learning techniques to evaluate the effectiveness of DeepAutism for classifying autism diagnosis. The conventional machine learning models that we compared were random forests, logistic regression, and Naive Bayes. We used the same training and test data (with the selected 100 common variants) for the conventional machine learning models as used for the DeepAutism model, aiming to evaluate whether the CNN model outperforms other machine learning classifiers. To evaluate whether the selected top 100 common variants are significant for ASD diagnosis, we also compared the chi-square–based variant selection method with random variant selection by using the same training and test data sets. We randomly selected 100 common variants as inputs that were fed into both DeepAutism and conventional machine learning models to compare the changes in their performance.
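The random-selection baseline can be sketched as drawing 100 variant columns uniformly at random and training on that subset instead of the chi-square selection. The helper names, the seed, and the pool of 1,000 candidate variants are hypothetical; the source does not state how the random draw was performed.

```python
import random


def select_random_variants(n_variants, k, seed):
    """Pick k distinct variant column indices uniformly at random."""
    rng = random.Random(seed)  # fixed seed so the baseline is reproducible
    return sorted(rng.sample(range(n_variants), k))


def subset_columns(matrix, columns):
    """Keep only the chosen variant columns for every individual."""
    return [[row[j] for j in columns] for row in matrix]


# Hypothetical sizes: 1,000 candidate variants, 100 drawn for the baseline.
baseline_columns = select_random_variants(n_variants=1000, k=100, seed=0)
print(len(baseline_columns), len(set(baseline_columns)))  # 100 100
```

Applying `subset_columns` with the same indices to both the training and test matrices keeps the comparison with the chi-square selection like for like.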
Statistical analyses focused on the selected top 100 common variants, which most significantly contributed to the classifiers of ASD. Of the 100 common variants within our classifier, 66% are exonic mutations and 23% are intronic mutations, while small proportions are splicing mutations or from an untranslated region. Within the 66% exonic mutations, about half are synonymous single nucleotide variants and about half are nonsynonymous single nucleotide variants. It is important to point out that the selected contributory common variants were significantly enriched on chromosome X while chromosome Y is also discriminatory in identifying individuals with ASD from individuals without ASD.
Several of the variants mapped to the same genes. Related to the contributory common variants, the statistically significant genes were
After the training phase, we extracted the same common variants from the test data for each individual. We used the remaining 787 samples for testing. Using the trained DeepAutism model, we predicted for each test individual the probability of belonging to the control group or the diseased group. The deep learning model classified the holdout test set with high accuracy, achieving an AUC of 0.955 (
A. The area under the receiver operating characteristic curve of DeepAutism, random forest, logistic regression, and Naive Bayes for predicting autism spectrum disorder diagnosis based on the selected top 100 significantly common variants on the test data. B. The visualization table that describes the performance of the DeepAutism classifier on the test data. DeepAutism correctly predicted 697 out of 787 total samples and correctly predicted autism spectrum disorder in 423 samples out of 456 samples with autism spectrum disorders. AUC: area under the receiver operating characteristic curve; ASD: autism spectrum disorder; NB: Naive Bayes; LR: logistic regression; RF: random forest.
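For readers unfamiliar with the AUC, it can be computed as the probability that a randomly chosen case receives a higher predicted probability than a randomly chosen control. The sketch below uses that pairwise definition on hypothetical toy scores; it is not the evaluation code used in the study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random case outranks a random control.

    Ties between a case and a control count as half a win.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy example with hypothetical predicted probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_prob))  # 0.75
```

The pairwise loop is quadratic in the number of samples, which is fine for a test set of a few hundred individuals; a rank-based implementation would be preferable for larger data.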
In addition to DeepAutism, we employed Naive Bayes, logistic regression, support vector machine, random forest, and deep neural network classifiers to compare the prediction of ASD diagnosis. We applied five-fold cross-validation to evaluate the selected significant common variants. DeepAutism performed better than the other classifiers in terms of AUC, accuracy, specificity, sensitivity, and F1-score. As shown in
Performance of the classifiers with respect to accuracy, sensitivity, specificity, F1-score, and false discovery rate on test sets.a
Model | Accuracy | Sensitivity | Specificity | F1-score | False discovery rate |
DeepAutism | | | | | |
Naive Bayes | 0.679 | 0.706 | 0.633 | 0.733 | 0.237 |
Random forest | 0.808 | 0.785 | 0.857 | 0.848 | 0.079 |
Logistic regression | 0.704 | 0.715 | 0.683 | 0.761 | 0.186 |
Support vector machine | 0.789 | 0.773 | 0.821 | 0.831 | 0.101 |
Deep neural network | 0.804 | 0.766 | 0.885 | 0.842 | 0.073 |
a Italicized data demonstrate the best performance; DeepAutism outperformed other models on all the metrics.
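The metrics in the table all derive from the four confusion-matrix counts. The sketch below shows the formulas. The counts are illustrative assumptions, not the paper's confusion matrix: they were chosen to agree with the totals given in the figure caption (787 test samples, 697 correct predictions, 423 correctly predicted ASD samples) and with the reported sensitivity of 88% and specificity of 89%.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive accuracy, sensitivity, specificity, F1, and FDR from counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # recall, or true positive rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "fdr": fp / (tp + fp),  # false discovery rate = 1 - precision
    }


# Assumed counts (see the note above); they sum to 787 test samples.
metrics = classification_metrics(tp=423, fp=33, tn=274, fn=57)
print({name: round(value, 3) for name, value in metrics.items()})
# {'accuracy': 0.886, 'sensitivity': 0.881, 'specificity': 0.893, 'f1': 0.904, 'fdr': 0.072}
```

Note that the false discovery rate depends on the predicted positives (tp + fp), whereas sensitivity depends on the actual positives (tp + fn), so the two can move independently.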
We assessed classification performance using 100 randomly selected common variants as inputs to train the classifiers. We used the same training and test data as in the above experiment. As shown in
Performance of the classifiers with respect to area under receiver operating characteristic curve, accuracy, sensitivity, specificity, F1-score, and false discovery rate on test sets with randomly picked 100 common variants.a
Model | Area under receiver operating characteristic curve | Accuracy | Sensitivity | Specificity | F1-score | False discovery rate |
DeepAutism | 0.670 | | 0.685 | 0.697 | | 0.145 |
Naive Bayes | 0.556 | 0.454 | | 0.432 | 0.166 | 0.906 |
Random forest | | 0.629 | 0.612 | | 0.754 | |
Logistic regression | 0.571 | 0.583 | 0.598 | 0.489 | 0.704 | 0.143 |
Support vector machine | 0.672 | 0.679 | 0.633 | 0.571 | 0.696 | 0.139 |
Deep neural network | 0.656 | 0.677 | 0.681 | 0.702 | 0.733 | 0.143 |
a Italicized data show the best performance; the performance of all models became worse on all the metrics with randomly selected common variants.
Predicting ASD based on genetic data is challenging. Using common variant analysis, we generated a genetic diagnostic classifier (DeepAutism) based on a deep learning architecture using 100 significant common variants, and we accurately distinguished individuals with ASD from controls within the SSC data set. The diagnostic classifier correctly classified individuals with an accuracy of 88.6% and an AUC of 0.955. The sensitivity and specificity of the classifier in identifying ASD were 88% and 89%, respectively. High sensitivity for identifying cases is particularly desirable for screening purposes. We also investigated the classification performance of different approaches and the corresponding proportion of subjects without ASD who could be reliably classified as controls. DeepAutism can be suggested as an alternative to conventional shallow machine learning approaches. In the comparisons among the classifiers, DeepAutism performed the best, followed by random forest. Both of these classifiers are nonlinear models, which suggests that the contribution of common variants to ASD is not a simple linear combination.
Interestingly, when we retrained the classifier using 100 randomly selected common variants, the AUC and accuracy of DeepAutism dropped to 0.670 and 0.689, respectively. Performance likely worsened because irrelevant variants introduce noise that degrades classification accuracy. This supports the importance of variant selection and strengthens our original findings. Our results suggest that common variants may contribute to ASD diagnosis. A study [
In our findings,
ASD is a complex behavioral disorder with a strong genetic influence [
Although our approach for identifying autism based on the selected common variants achieves high accuracy, it has limitations that should be addressed in future work: (1) the experiments were conducted only on the SSC data set, and additional data sets are needed to evaluate the proposed method and the selected common variants; and (2) the proposed CNN-based algorithm is a straightforward solution for distinguishing individuals with autism from those without, and more state-of-the-art classifiers could be applied to this ASD classification problem.
Although the proposed DeepAutism approach yielded promising empirical results in ASD identification, we plan to explore several directions in the future. First, we plan to design an advanced deep learning algorithm that can handle high-dimensional features and output feature importance for variant selection. With such a model, we could select significant variants and classify autistic individuals simultaneously in an end-to-end framework. Second, we will evaluate the proposed method on 2 more distinct ASD cohorts: (1) Simons Foundation Powering Autism Research for Knowledge data and (2) the Autism Speaks MSSNG cohort. We will also validate our algorithms with the UK Biobank clinical and genomic data. Third, we will investigate the full sequences of coding and noncoding regions of the genome in probands and unaffected siblings to explore all of the components in the genetic architecture of ASD.
Architecture of DeepAutism.
ASD: autism spectrum disorder
AUC: area under the receiver operating characteristic curve
CNN: convolutional neural network
SSC: Simons Simplex Collection
t-SNE: t-distributed stochastic neighbor embedding
VCF: variant call format
variant call format-conditional genotype quality
variant call format-read depth
variant call format-genotype quality
None declared.