This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
In the United States, a rare disease is defined as one affecting no more than 200,000 patients at any given time. Patients with rare diseases are often misdiagnosed or left undiagnosed, possibly because clinical practitioners have insufficient knowledge of or experience with the disease. As the volume of electronically accessible medical data grows exponentially, a large amount of information on thousands of rare diseases and their potentially associated diagnostic information is buried in electronic medical records (EMRs) and medical literature.
This study aimed to leverage information contained in heterogeneous datasets to assist rare disease diagnosis. Phenotypic information about patients that exists in EMRs and biomedical literature could be leveraged to speed up the diagnosis of diseases.
In our previous work, we advanced the use of a collaborative filtering recommendation system to support rare disease diagnostic decision making based on phenotypes derived solely from EMR data. However, the influence of using heterogeneous data with collaborative filtering was not discussed, which is an essential problem while facing large volumes of data from various resources. In this study, to further investigate the performance of collaborative filtering on heterogeneous datasets, we studied EMR data generated at Mayo Clinic as well as published article abstracts retrieved from the Semantic MEDLINE Database. Specifically, in this study, we designed different data fusion strategies from heterogeneous resources and integrated them with the collaborative filtering model.
We evaluated the performance of the proposed system using characterizations derived from various combinations of EMR data and literature, as well as from EMR data alone. We extracted nearly 13 million EMRs from the patient cohort generated between 2010 and 2015 at Mayo Clinic and retrieved all article abstracts published through the end of 2016 from the semistructured Semantic MEDLINE Database. We applied a collaborative filtering model and compared the performance obtained with different metrics. Log likelihood ratio similarity combined with k-nearest neighbor on heterogeneous datasets showed the optimal performance in patient recommendation, with an area under the precision-recall curve (PRAUC) of 0.475 (string match), 0.511 (systematized nomenclature of medicine [SNOMED] match), and 0.752 (Genetic and Rare Diseases Information Center [GARD] match). Log likelihood ratio similarity also performed best on mean average precision, with 0.465 (string match), 0.5 (SNOMED match), and 0.749 (GARD match). Performance of rare disease prediction was also demonstrated by using the optimal algorithm. Macro-average
This study demonstrated potential utilization of heterogeneous datasets in a collaborative filtering model to support rare disease diagnosis. In addition to phenotypic-based analysis, in the future, we plan to further resolve the heterogeneity issue and reduce miscommunication between EMR and literature by mining genotypic information to establish a comprehensive disease-phenotype-gene network for rare disease diagnosis.
In the United States, a rare disease is defined as one affecting no more than 200,000 patients at any given time [
The first step in diagnosing a rare disease is to stratify patients into subgroups with similar phenotypic characterizations. In addition, with computationally accessible medical data growing at an exponential rate, an abundance of rare disease-related phenotypic information is believed to be buried in electronic medical records (EMRs) and medical literature. Therefore, we hypothesize that patients’ phenotypic information available in these resources can be leveraged to accelerate disease diagnosis. Few studies have focused on the phenotypic characterization of diseases and the analysis of phenotype-disease associations from free-text data such as EMRs and medical literature. One of the most representative efforts, the Human Phenotype Ontology (HPO) [
Since all datasets are flawed, it is important to prepare data with good quality, as machine learning depends heavily on data [
One of these challenges is the alignment of semantic heterogeneity. Semantic heterogeneity refers to a situation where 2 or more datasets are provided by different parties with various perspectives and purposes [
Another challenge is to benefit from heterogeneous data to improve the performance of machine learning. To investigate this, Lewis et al applied a support vector machine to heterogeneous biological data to infer gene function [
The related work described above demonstrated some success, but semantic heterogeneity remains an unsolved problem. Moreover, to the best of our knowledge, no study has examined the impact of applying collaborative filtering to heterogeneous data, especially in the biomedical domain. It is therefore worthwhile to investigate how data fusion strategies on heterogeneous resources can work with collaborative filtering to produce optimal recommendations.
In this work, we developed a new framework based on our previously designed collaborative filtering system to incorporate heterogeneous data sources with different data fusion strategies to assist in diagnosing rare diseases. We extracted Unified Medical Language System concepts with MetaMap [
For the EMR dataset, we collected clinical notes generated at the Mayo Clinic from 2010 to 2015. The extracted corpus contained about 13 million unstructured clinical notes for over 700,000 patients. We annotated only the sections covering problems and diagnoses. For the medical literature dataset, we extracted abstracts of research articles from the SemMedDB. We then used HPO and GARD terms to match either subject or object for each predication [
In e-commerce, collaborative filtering techniques [
In our previous work, we developed a collaborative filtering model based on a cohort of rare disease patients to stratify patients into subgroups and accelerate the diagnosis of rare diseases. Here, we treated patient profiles with their respective phenotypes as binary inputs, which means that the patient either has or does not have a phenotype. For the patients with a confirmed rare disease diagnosis, we used their phenotypes as input and treated their rare disease diagnosis as labels to train the collaborative filtering model.
Specifically, we applied the Tanimoto coefficient similarity (TANI), overlap coefficient similarity (OL), Fager & McGowan coefficient similarity (FMG), and log likelihood ratio similarity (LL) as 4 measurements to compute patient similarity [
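To make the set-based similarity measurements concrete, the following is a minimal sketch that treats each patient profile as a set of phenotypes, in line with the binary inputs described above. The function names, the `n_phenotypes` parameter (the size of the phenotype vocabulary), and the mapping of the log likelihood ratio statistic to a score in [0, 1) are assumptions made for illustration and are not taken from the study. FMG is omitted because its exact form is not stated in this text.

```python
import math


def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient: shared phenotypes over all distinct phenotypes."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def overlap(a: set, b: set) -> float:
    """Overlap coefficient: shared phenotypes over the smaller profile size."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def ll_similarity(a: set, b: set, n_phenotypes: int) -> float:
    """Log likelihood ratio (G^2) on the 2x2 co-occurrence table.

    Assumes n_phenotypes >= len(a | b). The final squashing to [0, 1)
    is an assumed normalization, not necessarily the one used in the study.
    """
    k11 = len(a & b)                      # phenotypes in both profiles
    k12 = len(a - b)                      # only in a
    k21 = len(b - a)                      # only in b
    k22 = n_phenotypes - k11 - k12 - k21  # in neither profile
    table = [[k11, k12], [k21, k22]]
    rows = [k11 + k12, k21 + k22]
    cols = [k11 + k21, k12 + k22]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            k = table[i][j]
            if k > 0:
                g2 += k * math.log(k * n_phenotypes / (rows[i] * cols[j]))
    g2 = max(0.0, 2.0 * g2)
    return 1.0 - 1.0 / (1.0 + g2)
```

For two hypothetical profiles {P1, P2, P3} and {P2, P3, P4}, the Tanimoto coefficient is 2/4 and the overlap coefficient is 2/3. The log likelihood ratio also depends on how many phenotypes neither patient has, which is why it needs the vocabulary size.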
We also applied 2 neighborhood algorithms to provide recommendations: k-nearest neighbors (KNN) and threshold patient neighbor (TPN) [
SemMedDB is a repository of semantic predications (ie, subject-predicate-object triples) extracted from the titles and abstracts of all PubMed citations [
The HPO is a standardized vocabulary of phenotypic terms, built by collecting phenotypic knowledge from biomedical literature and databases. In this study, we used the HPO release of September 2016 to annotate phenotypic terms.
The GARD is a database that contains information on rare diseases. It groups the 4560 diseases it has collected into 32 disease categories. In this study, we used the GARD to extract rare disease terms.
Input: Sorted Similarity Score Map S (Neighbor_Patient, Score) for each patient, number of neighbors k, similarity threshold t
Output for KNN: Neighbor List LK
Output for TPN: Neighbor List LT
1. FOR each neighbor_patient NP in S
2.     score_NP = S.get(NP)
3.     IF (LK.size() < k)
4.         add NP into LK
5.     IF (score_NP > t)
6.         add NP into LT
7. RETURN LK, LT
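The neighbor selection steps above can be sketched in Python as follows. This is an illustrative translation under the assumption that the similarity scores arrive as a list of (neighbor, score) pairs already sorted from highest to lowest score; the function name `select_neighbors` and the patient identifiers are hypothetical.

```python
def select_neighbors(sorted_scores, k, t):
    """Return (KNN list, TPN list) from scores sorted in descending order.

    sorted_scores: list of (neighbor_id, similarity_score) pairs.
    k: number of neighbors kept by KNN.
    t: similarity threshold used by TPN (strictly greater than t is kept).
    """
    knn = []
    tpn = []
    for neighbor, score in sorted_scores:
        if len(knn) < k:   # KNN: take the first k neighbors
            knn.append(neighbor)
        if score > t:      # TPN: take every neighbor above the threshold
            tpn.append(neighbor)
    return knn, tpn


scores = [("patient_7", 0.9), ("patient_2", 0.6), ("patient_5", 0.3)]
knn, tpn = select_neighbors(scores, k=2, t=0.8)
```

With these hypothetical scores, KNN keeps the first 2 neighbors regardless of their scores, whereas TPN keeps only the single neighbor whose score exceeds 0.8. This shows how the two neighborhoods can differ for the same patient.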
Each input record consists of a patient identifier or PMID and the unique phenotypes manifested by that patient or article. We treated a positive rare disease diagnosis as the gold standard for patient association tasks and PMID-to-rare disease mentions as the gold standard for literature association tasks. We used 3 different data fusion strategies to prepare homogeneous and heterogeneous resources:
EMR only: Only patient-phenotype information extracted from the EMR was used.
EMR and literature (EMR+L): We first preprocessed the medical literature. Since each publication might mention only 1 phenotype with 1 rare disease, we merged multiple literature sources into 1 large document if those sources shared the same rare disease, to strengthen the evidence provided by the literature. Therefore, the number of documents used will be less than 91,680. We then mixed patient-phenotype associations with literature-phenotype information and randomly permuted them without any additional treatment. Detailed steps of this process are shown as case 1 in
EMR and pruned literature (EMR+PL): The same approach as for EMR+L was followed, but phenotype-rare disease associations mined from the literature were additionally filtered out if they did not appear in the EMR. In this case, we tried to further enhance the correlation and coexisting evidence between phenotypes and rare diseases to provide a better prediction output. Case 2 in
The phenotype-disease associations produced by the 3 data fusion strategies were fed into the collaborative filtering model, which produced a final recommendation output for each of the 3 inputs. For example, if a new patient has phenotypes
We evaluated 24 groups, formed by crossing the 4 similarity measurements with the 2 neighborhood algorithms and the 3 datasets: (1) TANI with KNN on EMR; (2) TANI with KNN on EMR and literature; (3) TANI with KNN on EMR and pruned literature; (4) TANI with TPN on EMR; (5) TANI with TPN on EMR and literature; (6) TANI with TPN on EMR and pruned literature; (7) LL with KNN on EMR; (8) LL with KNN on EMR and literature; (9) LL with KNN on EMR and pruned literature; (10) LL with TPN on EMR; (11) LL with TPN on EMR and literature; (12) LL with TPN on EMR and pruned literature; (13) OL with KNN on EMR; (14) OL with KNN on EMR and literature; (15) OL with KNN on EMR and pruned literature; (16) OL with TPN on EMR; (17) OL with TPN on EMR and literature; (18) OL with TPN on EMR and pruned literature; (19) FMG with KNN on EMR; (20) FMG with KNN on EMR and literature; (21) FMG with KNN on EMR and pruned literature; (22) FMG with TPN on EMR; (23) FMG with TPN on EMR and literature; and (24) FMG with TPN on EMR and pruned literature.
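The 24 groups correspond to every combination of similarity measurement, neighborhood algorithm, and dataset. As a small illustration (the variable names are hypothetical), the same grid and numbering can be generated programmatically:

```python
from itertools import product

similarities = ["TANI", "LL", "OL", "FMG"]
neighborhoods = ["KNN", "TPN"]
datasets = ["EMR", "EMR+L", "EMR+PL"]

# product() varies the last argument fastest, which reproduces the order
# of the enumerated list: group 1 is TANI/KNN/EMR, group 2 is TANI/KNN/EMR+L.
groups = list(product(similarities, neighborhoods, datasets))

for number, (similarity, neighborhood, dataset) in enumerate(groups, start=1):
    print(f"({number}) {similarity} with {neighborhood} on {dataset}")
```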
We used the same metrics adopted in our previous work to evaluate system performance. Specifically, we applied root mean square error (RMSE) [
Similar to our previous study, we used 3 matching strategies to measure the similarity between any 2 rare diseases: string matching, systematized nomenclature of medicine-clinical terms (SNOMED) matching, and GARD matching to provide different levels of relaxation on predicting rare diseases [
System workflow. EMR: electronic medical record; UMLS: Unified Medical Language System.
Input: Map A (PMID, Rare Disease), Map B (PMID, Map(Rare Disease, List(Phenotypes))), Map E of phenotype-disease pairs observed in the EMR (used in case 2)
Output for Case 1: Map C, in which literature sharing the same rare disease is merged and each rare disease is stored with its associated phenotypes
Output for Case 2: Pruned Map C’
Case 1: EMR+L
1. FOR each PMID and Rare Disease RD in A
2.     retrieve all relevant phenotypes {P} for RD and PMID from B
3.     IF C does not contain RD
4.         create new document_ID
5.         add {P} to list L
6.         add (document_ID, (RD, L)) to C
7.     ELSE
8.         List L = C.retrieve(document_ID)
9.         add nonduplicate elements from {P} to list L
10.        add (document_ID, (RD, L)) to C
11. RETURN C
Case 2: EMR+PL
12. C’ = C
13. FOR each phenotype-disease pair PD in Map C’
14.     IF E does not contain PD
15.         remove PD from C’
16. RETURN C’
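The two literature-processing cases can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it keys merged documents directly by rare disease instead of a separate document_ID, assumes each PMID is linked to a single rare disease, represents the EMR associations as a set of (phenotype, disease) pairs, and drops a disease entirely when none of its literature phenotypes survive pruning. All function and variable names are hypothetical.

```python
def merge_literature(pmid_to_disease, pmid_to_phenotypes):
    """Case 1 (EMR+L): merge abstracts that mention the same rare disease."""
    merged = {}
    for pmid, disease in pmid_to_disease.items():
        phenotypes = pmid_to_phenotypes.get(pmid, {}).get(disease, [])
        document = merged.setdefault(disease, [])
        for phenotype in phenotypes:
            if phenotype not in document:  # keep nonduplicate phenotypes only
                document.append(phenotype)
    return merged


def prune_literature(merged, emr_pairs):
    """Case 2 (EMR+PL): keep only pairs that also appear in the EMR."""
    pruned = {}
    for disease, phenotypes in merged.items():
        kept = [p for p in phenotypes if (p, disease) in emr_pairs]
        if kept:
            pruned[disease] = kept
    return pruned


map_a = {"pmid1": "D1", "pmid2": "D1", "pmid3": "D2"}
map_b = {
    "pmid1": {"D1": ["fever", "rash"]},
    "pmid2": {"D1": ["rash", "ataxia"]},
    "pmid3": {"D2": ["tremor"]},
}
emr_pairs = {("fever", "D1"), ("ataxia", "D1")}

merged = merge_literature(map_a, map_b)
pruned = prune_literature(merged, emr_pairs)
```

In this toy input, the 2 abstracts about D1 collapse into 1 document with 3 phenotypes, and pruning then removes the pair (rash, D1) and all of D2 because neither is supported by the EMR pairs.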
As shown in
For KNN combined with different similarity measurements,
Statistics for prepared datasets.
Datasets | EMRa only, n | EMR and literature (EMR+L), n | EMR and pruned literature (EMR+PL), n |
Patients or literature sources | 38,607 | 40,241 | 39,677 |
Phenotypes | 3271 | 3818 | 3271 |
Rare diseases | 1074 | 1634 | 1074 |
Phenotype-disease associations | 141,036 | 154,802 | 141,036 |
GARDb categories covered | 28 | 31 | 28 |
aEMR: electronic medical record.
bGARD: Genetic and Rare Diseases Information Center.
Root mean square error (RMSE) for k-nearest neighbors (KNN) with four similarity measurements. EMR: electronic medical record; FMG: Fager and McGowan coefficient similarity; L: literature; LL: log likelihood ratio similarity; OL: overlap coefficient similarity; PL: pruned literature; TANI: Tanimoto coefficient similarity.
We plotted precision-recall curves for each of the 24 experiments and computed the area under the precision-recall curve (PRAUC) for each matching criterion. Overall, we observed that GARD matching yielded the best performance among all matching criteria, and SNOMED semantic matching consistently ranked second.
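As background for the precision-based results, the following sketch shows one common way to compute average precision for a single ranked recommendation list and to average it over patients. The exact evaluation protocol is not spelled out in this text, so the normalization (dividing by the number of relevant items retrieved) and the function names are assumptions. Each entry of `relevance` states whether the recommendation at that rank matched the gold standard under the chosen matching criterion.

```python
def average_precision(relevance):
    """Average of precision values taken at each rank where a hit occurs."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(relevance_lists):
    """Mean of average precision over all evaluated patients."""
    if not relevance_lists:
        return 0.0
    total = sum(average_precision(r) for r in relevance_lists)
    return total / len(relevance_lists)


# Hits at ranks 1 and 3 give precisions 1/1 and 2/3, averaging to 5/6.
example = average_precision([True, False, True])
```

A more relaxed matching criterion turns more recommendations into hits, which is one way to read the ordering of the string, SNOMED, and GARD results.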
Root mean square error (RMSE) for threshold patient neighbor (TPN) with four similarity measurements. EMR: electronic medical record; FMG: Fager and McGowan coefficient similarity; L: literature; LL: log likelihood ratio similarity; OL: overlap coefficient similarity; PL: pruned literature; TANI: Tanimoto coefficient similarity.
Optimal thresholds for different evaluation groups.
Optimal parameters | TANIa | | | LLb | | | OLc | | | FMGd | | |
 | EMRe | EMR+Lf | EMR+PLg | EMR | EMR+L | EMR+PL | EMR | EMR+L | EMR+PL | EMR | EMR+L | EMR+PL |
Optimal k for KNNh | 11 | 10 | 9 | 4 | 4 | 4 | 4 | 4 | 4 | 7 | 6 | 6 |
Optimal threshold for TPNi | 0.19 | 0.19 | 0.2 | 0.72 | 0.73 | 0.76 | 0.51 | 0.49 | 0.51 | 0.12 | 0.11 | 0.12 |
aTANI: Tanimoto coefficient similarity.
bLL: log likelihood ratio similarity.
cOL: overlap coefficient similarity.
dFMG: Fager and McGowan coefficient similarity.
eEMR: electronic medical record.
fL: literature.
gPL: pruned literature.
hKNN: k-nearest neighbors.
iTPN: threshold patient neighbor.
Precision-recall curves and area under the precision-recall curve (PRAUC) for Tanimoto coefficient similarity (TANI) with k-nearest neighbors (KNN) and threshold patient neighbors (TPN). EMR: electronic medical record; GARD: Genetic and Rare Diseases Information Center; KNN: k-nearest neighbors; SNOMED: systematized nomenclature of medicine; TANI: Tanimoto coefficient similarity.
Precision-recall curves and area under the precision-recall curve (PRAUC) for log likelihood ratio similarity with k-nearest neighbors and threshold patient neighbors. EMR: electronic medical record; GARD: Genetic and Rare Diseases Information Center; KNN: k-nearest neighbors; LL: log likelihood ratio similarity; SNOMED: systematized nomenclature of medicine; TPN: threshold patient neighbor.
Precision-recall curves and area under the precision-recall curve (PRAUC) for overlap coefficient similarity with k-nearest neighbors and threshold patient neighbors. EMR: electronic medical record; GARD: Genetic and Rare Diseases Information Center; KNN: k-nearest neighbors; OL: overlap coefficient similarity; SNOMED: systematized nomenclature of medicine; TPN: threshold patient neighbor.
Precision-recall curves and area under the precision-recall curve (PRAUC) for Fager and McGowan coefficient similarity with k-nearest neighbors and threshold patient neighbors. EMR: electronic medical record; FMG: Fager and McGowan coefficient similarity; GARD: Genetic and Rare Diseases Information Center; KNN: k-nearest neighbors; SNOMED: systematized nomenclature of medicine; TPN: threshold patient neighbor.
Mean average precision for TANIa with EMRb, EMR+Lc, and EMR+PLd (optimal in italics).
Matching criterion | EMR | EMR+L | EMR+PL | |||
KNNe | TPNf | KNN | TPN | KNN | TPN | |
String | 0.435 | 0.441 | 0.436 | 0.445 | ||
SNOMEDg | 0.469 | 0.475 | 0.474 | 0.479 | ||
GARDh | 0.739 | 0.742 | 0.742 | 0.745 |
aTANI: Tanimoto coefficient similarity.
bEMR: electronic medical record.
cL: literature.
dPL: pruned literature.
eKNN: k-nearest neighbors.
fTPN: threshold patient neighbor.
gSNOMED: systematized nomenclature of medicine.
hGARD: Genetic and Rare Diseases Information Center.
Mean average precision for LLa with EMRb, EMR+Lc, and EMR+PLd (optimal in italics).
Matching criterion | EMR | EMR+L | EMR+PL | |||
KNNe | TPNf | KNN | TPN | KNN | TPN | |
String | 0.368 | 0.351 | 0.46 | 0.391 | ||
SNOMEDg | 0.405 | 0.386 | 0.495 | 0.426 | ||
GARDh | 0.683 | 0.67 | 0.745 | 0.71 |
aLL: log likelihood ratio similarity.
bEMR: electronic medical record.
cL: literature.
dPL: pruned literature.
eKNN: k-nearest neighbors.
fTPN: threshold patient neighbor.
gSNOMED: systematized nomenclature of medicine.
hGARD: Genetic and Rare Diseases Information Center.
Mean average precision for OLa with EMRb, EMR+Lc, and EMR+PLd (optimal in italics).
Matching criterion | EMR | EMR+L | EMR+PL | |||
KNNe | TPNf | KNN | TPN | KNN | TPN | |
String | 0.117 | 0.11 | 0.167 | 0.148 | ||
SNOMEDg | 0.126 | 0.122 | 0.179 | 0.162 | ||
GARDh | 0.457 | 0.48 | 0.505 | 0.509 |
aOL: overlap coefficient similarity.
bEMR: electronic medical record.
cL: literature.
dPL: pruned literature.
eKNN: k-nearest neighbors.
fTPN: threshold patient neighbor.
gSNOMED: systematized nomenclature of medicine.
hGARD: Genetic and Rare Diseases Information Center.
Mean average precision for FMGa with EMRb, EMR+Lc, and EMR+PLd (optimal in italics).
Matching criterion | EMR | EMR+L | EMR+PL | |||
KNNe | TPNf | KNN | TPN | KNN | TPN | |
String | 0.18 | 0.205 | 0.264 | 0.27 | ||
SNOMEDg | 0.192 | 0.221 | 0.288 | 0.297 | ||
GARDh | 0.568 | 0.584 | 0.651 |
aFMG: Fager and McGowan coefficient similarity.
bEMR: electronic medical record.
cL: literature.
dPL: pruned literature.
eKNN: k-nearest neighbors.
fTPN: threshold patient neighbor.
gSNOMED: systematized nomenclature of medicine.
hGARD: Genetic and Rare Diseases Information Center.
We selected LL with KNN as the optimal metric, trained it with EMR+PL, and applied it on 44,060 patients with only 1 rare disease. We only selected rare diseases with at least 3 affected patients, which resulted in 702 rare diseases in total. Prediction performances for different matching criteria are described as shown in
In
For string matching,
Prediction performance for rare diseases. GARD: Genetic and Rare Diseases Information Center; SNOMED: systematized nomenclature of medicine.
Recommendation performance for selected rare diseases (3 high, 3 medium to high, 3 low).
Approaches and top diseases | Number of patients affected | Precision | Recall | F1 score |
Holoprosencephaly | <10 | 0.75 | 1 | 0.86 | |
Huntington disease | <10 | 1 | 0.67 | 0.8 | |
Juvenile polyposis syndrome | <10 | 0.91 | 0.71 | 0.8 | |
Sacrococcygeal teratoma | 15 | 0.83 | 0.67 | 0.74 | |
Frontotemporal dementia | 202 | 0.69 | 0.58 | 0.63 | |
Polycystic liver disease | 72 | 0.64 | 0.58 | 0.61 | |
Hemicrania continua | 36 | 0.08 | 0.25 | 0.12 | |
Intrahepatic cholangiocarcinoma | 94 | 0.08 | 0.22 | 0.12 | |
Neuromyelitis optica | 50 | 0.16 | 0.1 | 0.12 | |
Myxoid liposarcoma | 37 | 0.94 | 0.89 | 0.91 | |
Linear scleroderma | 16 | 0.91 | 0.71 | 0.8 | |
Migraine with brainstem aura | 15 | 0.75 | 1 | 0.86 | |
Hypophosphatemic rickets | <10 | 0.83 | 0.75 | 0.79 | |
Congenital radio ulnar synostosis | 14 | 0.67 | 0.86 | 0.75 | |
Spasmodic dysphonia | 177 | 0.83 | 0.67 | 0.74 | |
Acute graft-versus-host disease | 20 | 0.1 | 0.5 | 0.15 | |
Cryptogenic organizing pneumonia | 37 | 0.14 | 0.17 | 0.15 | |
Cerebellar degeneration | 29 | 0.14 | 0.17 | 0.15 | |
Acrospiroma | <10 | 1 | 1 | 1 | |
Birt-Hogg-Dube syndrome | <10 | 1 | 1 | 1 | |
Dendritic cell tumor | <10 | 1 | 1 | 1 | |
Acute promyelocytic leukemia | 15 | 0.97 | 0.93 | 0.95 | |
Migraine with brainstem aura | 15 | 1 | 0.88 | 0.93 | |
Thyroid cancer, anaplastic | 30 | 1 | 0.86 | 0.92 | |
Addison disease | 34 | 0.88 | 0.45 | 0.6 | |
Encephalocele | 56 | 0.4 | 0.59 | 0.48 | |
Mixed connective tissue disease | 78 | 0.4 | 0.48 | 0.43 |
aLL: log likelihood ratio similarity.
bKNN: k-nearest neighbors.
cSNOMED: Systematized Nomenclature of Medicine.
dGARD: Genetic and Rare Diseases Information Center.
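The last numeric column of the table above agrees, row by row, with the harmonic mean of precision and recall (the F1 score). A short sketch makes this relationship explicit. The function name is hypothetical, and the three checked rows use the precision and recall values printed in the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (disease, precision, recall) taken from three rows of the table.
rows = [
    ("Holoprosencephaly", 0.75, 1.0),
    ("Huntington disease", 1.0, 0.67),
    ("Hemicrania continua", 0.08, 0.25),
]
for disease, precision, recall in rows:
    print(disease, round(f1_score(precision, recall), 2))
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, as in the rows where precision is about 0.1.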
In our EMR,
For SNOMED matching, the top predicted diseases are
Because GARD matching allows broader recommendations based on the system categories of rare diseases, it usually yielded better prediction performance than the other 2 strategies.
This study demonstrates the potential to provide decision support for the differential diagnosis of rare diseases. With more comprehensive knowledge extracted from clinical notes and literature, collaborative filtering performed better on both patient recommendation and rare disease prediction. Current clinical decision support (CDS) systems are limited to a narrow area of clinical practice, both because they cannot utilize information embedded in clinical narratives and because it is difficult to achieve good semantic alignment between precision medicine knowledge and clinical data stored in various formats and heterogeneous resources. Therefore, there is a substantial opportunity to integrate our proposed work into current CDS systems to improve rare disease differential diagnosis in clinical practice.
For homogeneous data (eg, EMR only), LL performed worse than TANI. On the other hand, LL handles heterogeneous data well: because phenotype-rare disease associations extracted from EMR and medical literature reflect different perspectives, this flexibility helps find more patterns than TANI does. It is therefore not surprising that patient recommendation performance improved when we combined EMR and literature randomly, and improved further when we used pruned literature. OL and FMG, however, performed worse than TANI and LL. We found that OL gives too much weight to patient similarities even when few phenotypes are shared. Such strict similarity measurements have difficulty finding semantic relationships and lack the ability to stratify patients well. This possibly explains the better performance of OL on EMR data alone, which is highly homogeneous, and its poor performance on the combined datasets, which are highly heterogeneous. Similar to OL, FMG does not handle heterogeneous data well; nevertheless, it yielded better patient recommendation performance than OL on the EMR+L and EMR+PL datasets. Furthermore, we observed that LL is sensitive to the choice of KNN or TPN, especially for combined datasets, which implies that balancing KNN and TPN has the potential to optimize overall performance and, at the same time, eliminate bias from idealized neighbor counts and similarity thresholds.
The combination of EMR and literature did not always contribute to optimal performance in patient recommendations. The reason is that biases exist in how physicians and researchers document phenotype-disease associations. In EMRs, each document is recorded from a clinical perspective, based on the individual physician's instinct and experience; in the literature, phenotypes and rare diseases with positive relationships are reported from a biomedical experimental perspective, based on a large number of gene tests. This difference may widen the gap between the two sources. Collaborative filtering with different similarity measurements and neighborhood algorithms can remedy this problem to some extent. In the future, we plan to investigate at the gene level to reduce miscommunication and balance the heterogeneity between different datasets. Beyond literature, it would also be interesting to integrate cross-institutional EMRs with balanced heterogeneity to acquire diagnostic experience and knowledge from multiple hospitals and health care institutions and build a more general system for rare disease diagnostic decision support.
We investigated the application of a patient-based collaborative filtering model on heterogeneous EMRs and literature with different similarity measurements and neighborhood algorithms. Results demonstrated the potential of combining heterogeneous datasets to support diagnostic decision making for rare diseases.
In the future, we are going to fully utilize the graph structure provided by the HPO and leverage its node embeddings [
CDS: clinical decision support
EMR: electronic medical record
FMG: Fager and McGowan coefficient similarity
GARD: Genetic and Rare Diseases Information Center
HPO: Human Phenotype Ontology
KNN: k-nearest neighbors
LL: log likelihood ratio similarity
OL: overlap coefficient similarity
PRAUC: area under the precision-recall curve
RMSE: root mean square error
SNOMED: systematized nomenclature of medicine
TANI: Tanimoto coefficient similarity
TPN: threshold patient neighbor
This work has been supported by the National Institutes of Health grants OT3TR002019, R01EB19403, R01LM011934, R01GM102282, and TR02062.
None declared.