Heterogeneous network embedding for identifying symptom candidate genes

Abstract

Objective: Investigating the molecular mechanisms of symptoms is a vital task in precision medicine to refine disease taxonomy and improve the personalized management of chronic diseases. Although there are abundant experimental studies and computational efforts to obtain the candidate genes of diseases, the identification of symptom genes is rarely addressed. We curated a high-quality benchmark dataset of symptom-gene associations and proposed a heterogeneous network embedding method for identifying symptom genes.

Methods: We proposed a heterogeneous network embedding representation algorithm that constructs a heterogeneous symptom-related network integrating symptom-related associations and applies an embedding representation algorithm to obtain low-dimensional vector representations of the nodes. The relevance between symptoms and genes is measured by the similarity of their vectors, from which the candidate genes of given symptoms are obtained.

Results: A benchmark dataset of 18 270 symptom-gene associations between 505 symptoms and 4549 genes was curated. We compared our method to baseline algorithms (FSGER and PRINCE). The experimental results indicated that our algorithm achieved a significant improvement over the state-of-the-art method, with precision and recall improved by 66.80% (0.844 vs 0.506) and 53.96% (0.311 vs 0.202), respectively, for TOP@3, and association precision improved by 37.71% (0.723 vs 0.525) over PRINCE.

Conclusions: The experimental validation of the algorithms and the literature validation of typical symptoms indicated that our method achieved excellent performance. Hence, we curated a prediction dataset of 17 479 symptom-candidate genes. The benchmark and prediction datasets have the potential to promote investigations of the molecular mechanisms of symptoms and provide candidate genes for validation in experimental settings.
Keywords: heterogeneous network embedding, symptom gene identification, network medicine

Introduction

Symptoms and signs (called symptoms in brief) are the primary evidence for clinical diagnosis and disease classification [1]. As a critical layer connecting exposomes and genomes in the knowledge network, symptoms play an important role in precision medicine to refine disease taxonomy [2]. In recent years, a growing number of phenotype (disease and symptom) databases, such as Human Phenotype Ontology (HPO) [3], Human Disease Ontology (DO) [4], and Orphanet Rare Disease Ontology (Orphanet) [5], have been constructed. Most biomedical researchers focus on analyzing and understanding the molecular mechanisms of disease phenotypes [6–8]. The underlying molecular mechanisms of symptom phenotypes have rarely been investigated, except for disease conditions that overlap with symptom phenotypes, such as obesity [9] and pain [10]. In addition, to advance the study of genomes and phenotypes, the U.S. National Human Genome Research Institute initiated 2 projects: eMERGE [11], which correlates whole genome scans with phenotype data extracted from electronic medical record systems, and PhenX [12], which provides investigators with high-priority, well-established, low-burden standard measures to collect phenotypic and environmental data for large-scale genomic studies. Jyotishman et al [13] adopted multiple standards and biomedical terminologies to promote cross-study pooling of data and the detection of complex genotype-phenotype associations. As with computational approaches for disease-gene prediction, symptom gene identification is a key task for revealing the underlying molecular mechanisms of symptoms.
Gene prediction for given diseases requires extensive experiments to test hundreds of candidate genes in a wet lab [14]. In fact, experimental gene identification for symptoms and diseases is a difficult and time-consuming task [15]. The success of network-based computational methods for identifying disease genes [8,14,16] demonstrated that they are effective for disease gene prediction. Preliminary work [1] indicates that a network propagation approach can feasibly predict the candidate genes of symptoms, although complicated factors influence prediction performance [17]. In addition, the recent, increasing curation of large-scale symptom-related association data, such as disease-gene associations (eg OMIM [18], DisGeNet [19] and Malacards [20]), symptom-disease associations (Disease Ontology [4], HPO [3] and Orphanet [5]) and protein-protein interactions (HPRD [21], BioGRID [22], and IntAct [23]), offers a rare opportunity for the development of computational approaches. However, to substantially promote these efforts, 2 essential tasks remain: curating a high-quality benchmark dataset, and making full use of the heterogeneous, indirect symptom-related association data (symptom-disease associations, disease-gene associations and protein-protein interactions) to improve symptom gene prediction performance. Here, by integrating symptom-disease and disease-gene associations, we curated a benchmark dataset of symptom-gene associations. We proposed a deep embedding representation algorithm on a heterogeneous symptom-related network to identify symptom genes (Figure 1). First, we constructed a heterogeneous symptom-related network, which includes symptom-disease, disease-gene and protein-protein associations. Then, the network embedding representation algorithm was applied to construct a low-dimensional vector representation (LVR) of the nodes (symptoms and genes) in the network.
By measuring the relevance between symptoms and genes as the similarity of their vectors, the candidate genes of symptoms can be obtained. We compared the prediction performance of our algorithm to the baseline algorithms (FSGER and PRINCE). The experimental results indicated that our algorithm achieved a significant improvement over the baseline algorithms. Finally, a high-quality prediction dataset of symptom-candidate gene associations was curated based on the results predicted by our method.

Figure 1. An overview of the LSGER method. First, by integrating disease-symptom, disease-gene, and protein-protein associations (a), a heterogeneous symptom-related network (b) was constructed. Then, the network embedding algorithm was applied to obtain a low-dimensional vector representation of nodes (c). Finally, the relevance between the symptom and gene nodes can be measured by the similarities of their vectors (d). By sorting predicted genes by relevance, the candidate genes of given symptoms can be identified.

Methods

Dataset

Disease-gene associations

Disease-gene associations were collected from the DisGeNet [19] and Malacards [20] databases (Figure 2).
First, we extracted 130 820 curated disease-gene associations between 13 074 diseases with UMLS codes (CUIs) and 8947 genes from the DisGeNet database, which integrates disease-gene associations from the UniProt [24], PsyGeNET [25], ClinVar [26], Orphanet [5], GWAS Catalog [27], CTD [28] and HPO [3] databases. Second, we collected 73 064 disease-gene associations between 6118 diseases with CUIs and 8370 genes from the Malacards database. To unify and integrate the disease terms, we mapped the original disease identifiers of the 2 databases to Unified Medical Language System (UMLS) codes. Finally, the 2 data sources were integrated to obtain 196 397 disease-gene associations that include 16 594 unique diseases and 11 497 unique genes.

Figure 2. A flow chart of data collection and integration. First, 87 442 disease-symptom associations were collected by integrating disease-symptom associations from the DO, HPO and Orphanet databases. We collected and integrated 196 397 disease-gene associations from the DisGeNet and Malacards databases. Then, we selected a set of 1278 symptoms with DP characteristics from the MeSH database and the integrated associations. Finally, a benchmark dataset of 18 270 symptom-gene associations was curated.

Protein-protein interactions

The protein-protein interactions (PPIs) were collected from Menche et al [29] and include 213 888 records with 15 964 unique proteins.
These data are integrated PPI data derived from multiple sources, such as HPRD [21], BioGRID [22], IntAct [23] and PINA [30].

Disease-symptom associations

Disease-symptom associations were collected from the DO [4], HPO [3] and Orphanet [5] databases (Figure 2). To unify the disease terms from the different datasets, we mapped the original disease codes to UMLS codes. We collected 1008 disease-symptom associations between 204 diseases and 417 symptoms from the DO database, 87 442 disease-symptom associations between 4366 diseases and 6176 symptoms from the HPO database, and 35 039 disease-symptom associations between 2391 diseases and 3721 symptoms from the Orphanet database. By integrating the 3 data sources, we finally obtained 100 305 distinct disease-symptom associations (DSA) between 5605 diseases and 6935 symptoms.

Benchmark dataset construction of symptom-gene associations

By integrating symptom-related and gene-related association data, we curated a benchmark dataset of symptom-gene associations (called BDSG) (Figure 2). In particular, to obtain high-quality symptom-gene associations, we exploited “dual phenotypes” (DP), such as obesity, fever, back pain, and vertigo, which are regarded in medicine not only as diseases but also as symptoms. The associated genes of symptoms with DP characteristics can be derived directly from the disease-gene associations, with high quality assurance. To identify phenotype terms with DP characteristics, we used the hierarchical tree codes (eg C08: respiratory tract diseases and C08.618.248: cough) from the MeSH [31] terminology to relate the disease terms in our dataset. First, we collected 1051 symptom terms whose MeSH tree codes start with C23.888. Second, we extracted the disease term list and the symptom term list from DSA and identified the DP symptoms by intersecting the 2 lists.
After taking the union of the aforementioned 2 symptom lists, we curated 1278 symptoms with distinct UMLS CUIs. Then, by intersecting these CUIs with the diseases in the integrated disease-gene associations, we obtained 505 symptoms with DP characteristics, from which we finally curated 18 270 high-quality symptom-gene associations (Supplementary Material S1) between these 505 symptoms and 4549 genes. In addition, to curate a more comprehensive symptom-gene benchmark dataset, we further collected the symptom-gene associations derived from the SEMMED [32] database, which offers semantic predications extracted from the titles and abstracts of the PubMed [33] literature. We extracted the gene-related semantic predications about symptom terminologies and finally obtained 50 907 symptom-gene associations (called SPSG) between 932 symptoms and 9382 genes.

Fisher-based statistics model for symptom gene prediction

Based on the Fisher exact test [34], we proposed a Fisher-based statistical model to predict symptom genes (FSGER) as a baseline method. Using the symptom-disease and disease-gene associations, we treated diseases as a bridge connecting symptoms and genes. In detail, for symptom $s$ and gene $g$, we defined $a$, $b$, $c$ and $d$ as the number of diseases associated with both $s$ and $g$, with $s$ but not $g$, with $g$ but not $s$, and with neither $s$ nor $g$, respectively. The relevance $Rel(s, g)$ between the symptom $s$ and the gene $g$ is defined as follows:

$$Rel(s, g) = 1 - \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}$$

where $n = a + b + c + d$ is the number of all the related diseases. Then, by ranking the predicted genes by relevance, the ranked gene lists of given symptoms can be obtained.
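As an illustration of the FSGER relevance defined above, the following minimal sketch derives the counts a, b, c and d from disease sets and evaluates Rel(s, g). The function names and the toy disease and gene identifiers are hypothetical. The factorial ratio is evaluated through binomial coefficients, an algebraically equivalent form chosen here to avoid very large factorials; this is not necessarily how the authors implemented it.

```python
from math import comb


def contingency(symptom_diseases, gene_diseases, all_diseases):
    """Count diseases linked to both s and g (a), only s (b), only g (c), neither (d)."""
    a = len(symptom_diseases & gene_diseases)
    b = len(symptom_diseases - gene_diseases)
    c = len(gene_diseases - symptom_diseases)
    d = len(all_diseases) - a - b - c
    return a, b, c, d


def fisher_relevance(a, b, c, d):
    """Rel(s, g) = 1 - (a+b)!(c+d)!(a+c)!(b+d)! / (a! b! c! d! n!)."""
    n = a + b + c + d
    # The factorial ratio equals C(a+b, a) * C(c+d, c) / C(n, a+c).
    p_table = comb(a + b, a) * comb(c + d, c) / comb(n, a + c)
    return 1.0 - p_table


def rank_genes(symptom_diseases, gene_to_diseases, all_diseases):
    """Return (gene, relevance) pairs sorted by decreasing relevance."""
    scores = {
        gene: fisher_relevance(*contingency(symptom_diseases, diseases, all_diseases))
        for gene, diseases in gene_to_diseases.items()
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy data: six diseases, one symptom, two genes.
all_diseases = {"d1", "d2", "d3", "d4", "d5", "d6"}
symptom_diseases = {"d1", "d2", "d3"}
gene_to_diseases = {"g1": {"d1", "d2", "d3"}, "g2": {"d3", "d4"}}

ranking = rank_genes(symptom_diseases, gene_to_diseases, all_diseases)
```

In this toy case the gene whose diseases coincide with those of the symptom (g1) is ranked ahead of the gene sharing only one disease (g2).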
Heterogeneous symptom-related network embedding representation

Network embedding representation learning [35] is an effective approach for learning low-dimensional feature vectors of the nodes in a given network while preserving the local and global structure information of the network. Network embedding methods are applicable to many tasks, such as visualization, label classification and link prediction [35]. In this study, we constructed a heterogeneous symptom-related network and applied the network embedding algorithm node2vec [35] to obtain the low-dimensional vector representation of its nodes. The main idea of node2vec is to learn a mapping of nodes to a low-dimensional feature space that maximizes the likelihood of preserving the network neighborhoods of nodes. In detail, for a given network $G = (V, E)$, node2vec aims to learn the mapping function $f: V \to \mathbb{R}^d$ (where $d$ is the number of feature dimensions) from nodes to feature representations. By applying the Skip-Gram architecture to the network [36,37], the objective is to maximize the log-probability of observing the network neighborhood $N_S(u)$ of node $u$ conditioned on its feature representation:

$$\max_f \sum_{u \in V} \log \Pr(N_S(u) \mid f(u))$$

For a node $u \in V$, its network neighborhood $N_S(u)$ is generated through a neighborhood sampling strategy $S$. The authors of node2vec proposed a biased random walk strategy, which can flexibly and efficiently explore the diverse neighborhoods of nodes. Given a source node $u$, a random walk of fixed length $l$ is simulated; the $i$-th node of the walk, $c_i$ (with $c_0 = u$), is generated from the distribution

$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z} & \text{if } (v, x) \in E \\ 0 & \text{otherwise} \end{cases}$$

where $\pi_{vx}$ is the unnormalized transition probability between nodes $v$ and $x$, and $Z$ is the normalizing constant.
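The transition rule above can be sketched as follows. In node2vec, the unnormalized weight π_vx comes from a search bias controlled by a return parameter p and an in-out parameter q, which depends on the node t visited before v. The sketch assumes an unweighted, undirected graph (every edge weight equal to 1); the adjacency dictionary, node identifiers and function names are hypothetical and are not taken from the authors' implementation.

```python
import random


def node2vec_pi(graph, t, v, x, p=1.0, q=1.0):
    """Unnormalized weight of stepping v -> x, given the previous node t."""
    if t is None:          # first step of the walk: no bias yet
        return 1.0
    if x == t:             # stepping back to the previous node
        return 1.0 / p
    if x in graph[t]:      # x is also a neighbour of t
        return 1.0
    return 1.0 / q         # x moves the walk further away from t


def transition_probs(graph, t, v, p=1.0, q=1.0):
    """Return {x: pi_vx / Z} for the neighbours x of v (all other nodes have probability 0)."""
    neighbours = sorted(graph[v])
    weights = [node2vec_pi(graph, t, v, x, p, q) for x in neighbours]
    z = sum(weights)       # normalizing constant Z
    return {x: w / z for x, w in zip(neighbours, weights)}


def random_walk(graph, start, length, p=1.0, q=1.0, seed=0):
    """Simulate a walk of `length` nodes with c_0 = start."""
    rng = random.Random(seed)
    walk, prev = [start], None
    while len(walk) < length:
        cur = walk[-1]
        probs = transition_probs(graph, prev, cur, p, q)
        if not probs:      # isolated node: stop early
            break
        nxt = rng.choices(list(probs), weights=list(probs.values()))[0]
        prev = cur
        walk.append(nxt)
    return walk


# Hypothetical toy network: one symptom, two diseases, two genes.
graph = {
    "s1": {"d1", "d2"},
    "d1": {"s1", "g1"},
    "d2": {"s1", "g2"},
    "g1": {"d1", "g2"},
    "g2": {"d2", "g1"},
}

probs = transition_probs(graph, "s1", "d1", p=2.0, q=0.5)
walk = random_walk(graph, "s1", length=6, p=2.0, q=0.5)
```

With p = 2 and q = 0.5, a walk that arrived at d1 from s1 returns to s1 with weight 1/p = 0.5 and moves on to g1 with weight 1/q = 2, so after normalization by Z = 2.5 the walk is biased towards exploring outward.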
By applying the 2 standard assumptions, conditional independence and symmetry in the feature space, the low-dimensional vector features of nodes can be learned using stochastic gradient ascent over the model. We constructed 2 heterogeneous networks: SDGNet, which integrates symptom-disease and disease-gene associations, and SDGPNet, which integrates symptom-disease, disease-gene, and protein-protein associations. Given a heterogeneous network $G = (V, E)$, $V$ and $E$ represent the nodes and edges of the network. We applied the network embedding representation algorithm to learn the LVR of the nodes, so that each node $v$ is mapped to a low-dimensional vector $N_v$.

LVR-based similarity prediction model to identify symptom genes

The heterogeneous network embedding representation algorithm yields the LVR of the nodes in the given network. These low-dimensional vector features fuse the local structure (the neighbors of nodes) and the global structure information of the network. On this basis, we proposed an LVR-based similarity model for symptom gene prediction (LSGER), in which the relevance between symptom and gene nodes is measured by the similarity of their low-dimensional vectors. Mathematically, given the symptom node $v_s$ and the gene node $v_g$, we measure the relevance $Rel(v_s, v_g)$ between them by the cosine similarity of their vectors $N_{v_s}$ and $N_{v_g}$:

$$Rel(v_s, v_g) = \cos(N_{v_s}, N_{v_g}) = \frac{N_{v_s} \cdot N_{v_g}}{\lVert N_{v_s} \rVert \, \lVert N_{v_g} \rVert}$$

By calculating and sorting the relevance between a query symptom and all candidate genes, we obtain a ranked list of candidate genes for the query symptom. In addition, for the symptom $v_s$, we designed a pre-selection strategy for candidate genes, which takes the genes of the diseases related to $v_s$ as the candidate gene pool, and compared it to a no-selection strategy, which takes all genes as the candidate gene pool. Based on the 2 strategies, 2 variants of the LSGER algorithm were proposed: LSGER-AG (all genes) and LSGER-DG (filtered disease genes).
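The ranking step and the 2 candidate-pool strategies can be sketched as follows, assuming the node embeddings have already been learned. The embedding values, identifiers and function names are hypothetical toy inputs, not the authors' code or data.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def candidate_pool(symptom, all_genes, symptom_diseases, disease_genes, preselect):
    """LSGER-AG uses every gene; LSGER-DG keeps only genes of diseases linked to the symptom."""
    if not preselect:
        return set(all_genes)
    pool = set()
    for disease in symptom_diseases.get(symptom, ()):
        pool |= disease_genes.get(disease, set())
    return pool


def rank_genes(symptom, embeddings, candidates):
    """Sort candidate genes by decreasing cosine similarity to the symptom vector."""
    scores = {g: cosine(embeddings[symptom], embeddings[g]) for g in candidates}
    return sorted(scores, key=lambda g: (-scores[g], g))


# Hypothetical 2-dimensional embeddings for one symptom and three genes.
embeddings = {
    "s1": [1.0, 0.0],
    "g1": [0.9, 0.1],
    "g2": [0.0, 1.0],
    "g3": [1.0, 1.0],
}
genes = ["g1", "g2", "g3"]
symptom_diseases = {"s1": {"d1"}}
disease_genes = {"d1": {"g2", "g3"}}

ranking_ag = rank_genes(
    "s1", embeddings, candidate_pool("s1", genes, symptom_diseases, disease_genes, False)
)
ranking_dg = rank_genes(
    "s1", embeddings, candidate_pool("s1", genes, symptom_diseases, disease_genes, True)
)
```

Here the pre-selection removes g1 from the pool because it is not linked to any disease of s1, which shows how the 2 variants can return different rankings from the same embeddings.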
Experimental setting and evaluation

We constructed 2 benchmark datasets of symptom-gene associations (BDSG and SPSG), which can be used to evaluate the prediction performance of different algorithms. In the experiment, we removed all the known genes of the symptoms in the benchmark dataset and predicted the candidate genes of every test symptom, so that no prior symptom-gene associations were available to any prediction algorithm. Our method was compared to the baseline algorithms FSGER and PRINCE [1]. PRINCE was originally proposed by Vanunu et al [38] to predict disease genes. Li et al [1] extended PRINCE and applied it to the task of symptom gene prediction. In their work, a network propagation method was used on the PPI network to obtain priority scores of candidate genes. The FSGER algorithm is a Fisher-based statistics model that connects disease-symptom and disease-gene associations for symptom gene prediction. We adopted precision (PR), recall (RE), F1-score (F1) [39], association precision (AP) and area under curve (AUC) as the evaluation metrics. Given a test symptom set $S$ with $m$ symptoms, for every test symptom $s \in S$, $T(s)$ represents the test gene set of symptom $s$. Given a ranked list of predicted genes, we selected the top $i$ genes $R_i^s$ of the list ($i = 3, 10$) as candidate genes. The precision, recall and F1-score for TOP@$i$ are defined as follows:

$$\text{Precision} = \frac{1}{m} \sum_{s \in S} \frac{|T(s) \cap R_i^s|}{|R_i^s|}$$

$$\text{Recall} = \frac{1}{m} \sum_{s \in S} \frac{|T(s) \cap R_i^s|}{|T(s)|}$$

$$\text{F1-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Because recall is calculated over only the top 3 or 10 candidate genes, recall values can be low; the same calculation is applied to every prediction algorithm, so the comparison is fair. In addition, for every test symptom $s$, the top $k$ genes $R_k^s$ of the ranked list were also selected, where $k$ equals the number of test genes of symptom $s$.
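A minimal sketch of these TOP@i metrics, together with the association precision that the top-k lists feed into (matched hits summed over all symptoms, divided by the total number of selected genes), is given below. The function names and toy rankings are hypothetical, and the sketch assumes every test symptom has a ranked list and at least one test gene.

```python
def top_i_metrics(ranked, test_genes, i):
    """Precision, recall and F1-score for TOP@i, averaged over the test symptoms."""
    m = len(test_genes)
    precision = recall = 0.0
    for symptom, truth in test_genes.items():
        top = ranked[symptom][:i]                  # R_i^s
        hits = len(truth.intersection(top))        # |T(s) ∩ R_i^s|
        precision += hits / len(top) if top else 0.0
        recall += hits / len(truth) if truth else 0.0
    precision /= m
    recall /= m
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def association_precision(ranked, test_genes):
    """Pooled precision over the top-k lists, with k = |T(s)| for each symptom."""
    hits = total = 0
    for symptom, truth in test_genes.items():
        top_k = ranked[symptom][:len(truth)]       # R_k^s
        hits += len(truth.intersection(top_k))
        total += len(top_k)
    return hits / total if total else 0.0


# Hypothetical rankings and test gene sets for two symptoms.
ranked = {"s1": ["g1", "g2", "g3", "g4"], "s2": ["g5", "g6", "g7"]}
test_genes = {"s1": {"g1", "g3"}, "s2": {"g6"}}

precision, recall, f1 = top_i_metrics(ranked, test_genes, i=3)
ap = association_precision(ranked, test_genes)
```

Precision and recall are averaged per symptom, whereas the association precision pools hits over all symptoms, so symptoms with many test genes weigh more heavily in the latter.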
The association precision is defined as follows:

$$AP = \frac{\sum_{s \in S} |T(s) \cap R_k^s|}{\sum_{s \in S} |R_k^s|}$$

In addition, we used the AUC to evaluate the prediction performance. For every test symptom, we selected the top 100 predicted genes as candidate genes and obtained the predicted scores of the symptom-candidate gene pairs. Then, we ranked all the symptom-candidate gene pairs by score and calculated the AUC values. Because this differs from the AUC calculation used for homogeneous networks in link prediction tasks, the resulting AUC values may not be fully appropriate. Hence, the AUC evaluation is only a supplement to the other metrics.

Results

LVR-based similarity model to predict symptom genes

We compared LSGER to the PRINCE and FSGER algorithms, using precision, recall and F1-score for TOP@3 and TOP@10, association precision and AUC as evaluation metrics. For the LSGER-AG and LSGER-DG algorithms, we used the 2 heterogeneous networks, SDGNet and SDGPNet, as test networks. First, the experimental results (Table 1) on the BDSG dataset show that, compared to the baseline algorithm PRINCE (AP = 0.525; PR = 0.506 and RE = 0.202 for TOP@3), the FSGER algorithm achieved slightly better performance: AP improved by 2.10%; PR and RE improved by 20.55% and 17.33%, respectively, for TOP@3. LSGER-AG with SDGPNet yielded the best performance: compared to PRINCE, AP improved by 37.71%; AUC improved by 21.60%; PR and RE improved by 66.80% and 53.96%, respectively, for TOP@3. All improvements are relative; for example, TOP@3 precision rises from 0.506 to 0.844, and (0.844 − 0.506)/0.506 ≈ 66.80%. Second, the LSGER algorithm performed slightly better with SDGPNet than with SDGNet (LSGER-AG: PR and RE improved by 1.69% and 3.67%, respectively, for TOP@3; LSGER-DG: PR and RE improved by 1.58% and 3.32%, respectively, for TOP@3), which indicated that fusing more gene-related information (the PPI network) improved the prediction performance of the LSGER algorithm. Finally, in terms of precision and recall for TOP@3, LSGER-AG and LSGER-DG had similar prediction performance.
However, in terms of AP, the prediction performance of LSGER-DG was better than that of LSGER-AG (with SDGNet: AP improved by 6.31%; with SDGPNet: AP improved by 9.54%), which indicated that candidate gene pre-selection improved the prediction performance of the LSGER algorithm.

Table 1. Performance comparison of symptom gene prediction algorithms

| Network | Algorithm | AP | AUC | TOP@3 Precision | TOP@3 Recall | TOP@3 F1-score | TOP@10 Precision | TOP@10 Recall | TOP@10 F1-score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| – | PRINCE | 0.525 | 0.736 | 0.506 | 0.202 | 0.211 | 0.420 | 0.371 | 0.296 |
| – | FSGER | 0.536 | 0.564 | 0.610 | 0.237 | 0.252 | 0.486 | 0.422 | 0.344 |
| SDGNet | LSGER-AG | 0.745 | 0.890 | 0.830 | 0.300 | 0.327 | 0.719 | 0.572 | 0.488 |
| SDGNet | LSGER-DG | 0.792 | 0.856 | 0.821 | 0.301 | 0.327 | 0.693 | 0.561 | 0.473 |
| SDGPNet | LSGER-AG | 0.723 | 0.895 | 0.844 | 0.311 | 0.338 | 0.719 | 0.576 | 0.489 |
| SDGPNet | LSGER-DG | 0.792 | 0.853 | 0.834 | 0.311 | 0.336 | 0.698 | 0.568 | 0.478 |

AP represents association precision. In the original table, the best value for each metric (eg AUC, precision and recall) is shown in bold.

Furthermore, we performed comparative experiments with different similarity metrics, reported in the supplementary materials (SM). We selected 3 classical similarity metrics, cosine similarity (Sim_cos), Euclidean distance similarity (Sim_eu) and Pearson similarity (Sim_pea), to measure the vector similarities of symptom and gene nodes. The results predicted by the LSGER-AG algorithm with the SDGNet and SDGPNet networks indicated that the choice of similarity metric had some influence on the prediction performance of our algorithm. For example, in terms of precision (PR) and recall (RE) for TOP@3, the prediction algorithm with Sim_pea (PR = 0.852; RE = 0.314), Sim_eu (PR = 0.871; RE = 0.318) and Sim_cos (PR = 0.844; RE = 0.311) obtained similar recall but different precision.
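For reference, the 3 similarity metrics can be sketched as below. Cosine and Pearson similarity follow their standard definitions. The text does not state how the Euclidean distance is converted into a similarity, so the 1/(1 + distance) form used here is an assumption, and the function bodies are illustrative and not taken from the supplementary materials.

```python
from math import sqrt


def sim_cos(u, v):
    """Cosine similarity (0.0 if either vector has zero norm)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sim_eu(u, v):
    """Euclidean distance turned into a similarity; the 1 / (1 + d) mapping is an assumption."""
    distance = sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
    return 1.0 / (1.0 + distance)


def sim_pea(u, v):
    """Pearson correlation, computed as the cosine similarity of the mean-centred vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return sim_cos([x - mean_u for x in u], [y - mean_v for y in v])


# Hypothetical vectors used only to exercise the three functions.
same_direction = sim_pea([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite_direction = sim_pea([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
identical = sim_eu([1.0, 2.0], [1.0, 2.0])
orthogonal = sim_cos([1.0, 0.0], [0.0, 1.0])
```

Any of these functions could replace the cosine similarity in the ranking step, since each maps a pair of node vectors to a single score by which candidate genes are sorted.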
In the SM section, we also compared the performance of the symptom-gene prediction algorithms on the SPSG dataset. The prediction results indicated that LSGER-DG with SDGPNet still obtained the best performance: compared to the PRINCE algorithm, the recall and F1-score improved by 35.32% and 64.24%, respectively. Compared to the BDSG dataset, with its highly credible symptom-gene associations, the associations derived from SEMMED have low confidence. Therefore, the evaluation results on the BDSG dataset are of greater value than those on the SPSG dataset. Overall, our method outperformed the other prediction algorithms.

Case study: candidate genes of some typical symptoms

To illustrate the performance of the prediction algorithm, we report the prediction performance of LSGER-AG with SDGPNet for several typical symptoms (Table 2), including constipation (CUI: C0009806), nausea (CUI: C0027497), pain (CUI: C0030193), Usher syndromes (C1568248), vision disorders (C0042790), and aphasia (C0003537), which are regarded as DP symptom terms. The top 10 candidate genes of these symptoms are also listed (Table 3). For example, for constipation, the top 9 candidate genes are known genes (PR = 0.9 for TOP@10). In addition, among the candidate genes of pain (Table 3), we found 9 benchmark genes, and the remaining gene, ZNF470 (rank = 5), was related to amyotrophic lateral sclerosis (ALS) [40]. We searched HPO [3] and found that pain is one of the typical symptoms of ALS. Therefore, ZNF470 might be a novel gene for pain.

Table 2. The prediction performance of some specific symptoms

| ID | Symptom (CUI) | Number of hit genes/test genes | TOP@3 Precision | TOP@3 Recall | TOP@3 F1-score | TOP@10 Precision | TOP@10 Recall | TOP@10 F1-score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Constipation (C0009806) | 109/158 | 1.000 | 0.019 | 0.037 | 0.900 | 0.057 | 0.107 |
| 2 | Nausea (C0027497) | 11/17 | 1.000 | 0.176 | 0.300 | 0.800 | 0.471 | 0.593 |
| 3 | Pain (C0030193) | 54/79 | 1.000 | 0.038 | 0.073 | 0.900 | 0.114 | 0.202 |
| 4 | Usher syndromes (C1568248) | 15/18 | 0.667 | 0.111 | 0.190 | 0.700 | 0.389 | 0.500 |
| 5 | Vision disorders (C0042790) | 6/6 | 1.000 | 0.500 | 0.667 | 0.600 | 1.000 | 0.750 |
| 6 | Aphasia (C0003537) | 5/9 | 0.333 | 0.111 | 0.167 | 0.600 | 0.667 | 0.632 |

Table 3. The top 10 candidate genes of some specific symptoms

| Rank | Constipation (C0009806) | Nausea (C0027497) | Pain (C0030193) | Usher syndromes (C1568248) | Vision disorders (C0042790) | Aphasia (C0003537) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | SEMA3C | ETFDH | PROKR1 | USH1G | TSEN54 | LOC643387 |
| 2 | NRTN | ETFB | PON3 | PDZD7 | TSEN2 | ATP1A2 |
| 3 | GPBAR1 | ETFA | PNOC | USH1K | TSEN34 | PSNP2 |
| 4 | HMBS | LPL | DAO | USH1H | TTPA | GRN |
| 5 | MLNR | HMBS | ZNF470 | USH1E | CLN6 | GRIN2A |
| 6 | DUOX2 | IFNA2 | HTR3B | CIB2 | ATXN7 | ADA2 |
| 7 | TRHR | COQ4 | NTSR1 | CDH23 | CNGA3 | REEP1 |
| 8 | MLN | SLC7A7 | TRPA1 | MT-TS2 | GRM6 | MAPT |
| 9 | SCN11A | ACADM | UNC13A | USH1C | PRPH2 | L1CAM |
| 10 | CELIAC8 | TNF | BDKRB1 | WHRN | NR2E3 | NOTCH3 |

In the original table, genes in bold are candidate genes that are known genes of the corresponding symptoms.

We further evaluated the predicted genes of pain through additional validation based on PPI interactions and genetic functional analysis. In particular, we extracted the interactions of the top 49 predicted genes of pain, which include 36 benchmark genes and 13 novel candidate genes, in the context of the whole PPI network and showed their interaction map (Figure 3a). There are dense interactions (95 interactions) between those benchmark genes and the novel candidate genes compared to the interactions with random controls (p-value = 6.82e-68), which indicated that the novel genes are located close to the benchmark genes in the PPI network. Further enrichment analysis (Gene Ontology and Pathway) of the predicted pain genes obtained similar results (Figure 3b).
For example, there are 9 candidate genes and 11 known genes on the neuroactive ligand-receptor interaction pathway (p-value = 9.90E-15). This additional analysis therefore indicated that there are dense interactions among the candidate genes and known genes of pain, which partially supports the plausibility of the prediction results.

Figure 3. PPI interaction and genetic functional analysis of the predicted genes of pain. We extracted the interaction of the top 49 predicted genes of pain in the context of the whole protein-protein interaction (PPI) network and showed their interaction matrix (a), which includes 36 known genes (ie benchmark genes) and 13 novel candidate genes. There are dense interactions (95 interactions) between those benchmark genes and the novel candidate genes compared to those with random controls (p-value = 6.82e-68), which indicated that the novel genes were located close to benchmark genes in the PPI network. Further pathway and Gene Ontology (termed GO) enrichment analysis of the pain predicted genes obtained similar results. The bold and underlined genes are known and candidate genes of pain, respectively.

To fully evaluate the candidate genes that were not recorded in the BDSG dataset, we manually searched recently published biomedical papers to verify the novel candidate genes. For example, for the novel candidate genes of Usher syndromes (PR = 0.7 for TOP@10), we found that Jaworek et al [41] confirmed the locus (chromosome 10p11.21-q21.1) of the USH1K gene (rank = 3) associated with type 1 Usher syndrome. The candidate gene USH1H (rank = 4), investigated by Dad et al [42], is likely to be associated with Usher syndrome as well. In addition, for all 4 novel candidate genes in the top 10 gene list of vision disorders, we found positive validations in recent independent publications. For example, Gootwine et al [43] verified that achromatopsia can be caused by CNGA3 (rank = 7) mutations. Furthermore, the remaining 3 candidate genes, GRM6 (rank = 8), PRPH2 (rank = 9) and NR2E3 (rank = 10), are likely to be associated with subtypes of vision disorders, such as night blindness [44], visual acuity [45], and enhanced S-cone syndrome [46].

Discussion

In real-world clinical settings, symptoms always play an essential role in both the diagnosis and treatment of diseases. Symptoms are the most directly observable manifestations of a disease [47]. Therefore, investigating the underlying molecular mechanisms of symptoms has the potential to propel the refinement of disease taxonomy [48] for precision medicine. In this study, we constructed a benchmark dataset of symptom-gene associations and proposed a heterogeneous symptom-related network embedding algorithm for symptom gene prediction. The experimental results indicated that our algorithm achieved a significant improvement over the state-of-the-art method.
The heterogeneous symptom-related network embedding prediction algorithm that we proposed can make full use of multiple symptom-related information (eg symptom-disease, disease-gene and protein-protein associations). In particular, we integrated the symptom-disease and disease-gene associations to curate a benchmark dataset of symptom-gene associations, which can be used to evaluate the performance of the proposed novel symptom gene prediction algorithms. By systematic checking of the symptom terms (more details in SM), we curated a high-quality prediction dataset that contains 17 479 symptom-candidate genes between 461 symptoms and 3620 genes (Supplementary Material S2). The benchmark and prediction datasets of symptom-gene associations can also be used to further investigate the symptom-related molecular mechanisms in experimental settings. However, due to the lasting period of curation efforts, the general “temporal” lag from state-of-the-art publications exists in most biomedical knowledge databases (eg UMLS and SEMMED). To address the limitation, we conducted the latest literature manual validation to evaluate reliability of the candidate genes. Furthermore, the experimental results indicated more information fusion can improve prediction performance. Therefore, we will consider more heterogeneous data, such as gene ontology and expression data in the next efforts. The symptom terms that were extracted from the UMLS database have hierarchy structures. For example, as a high-level category, vision disorder is the hypernym of cataracts (CUI: C0086543), cortical blindness (CUI: C0155320), and night blindness (CUI: C0028077). We will extract and curate a symptom-gene benchmark with hierarchy structures, which can impel us to design a more reliable prediction algorithm. In addition, the symptom terms from MeSH database are high-quality but with limited number. 
Therefore, we need further collection of various symptom terms contained in the “Clinical Finding” category of SNOMED [49] to expand our dataset. However, the curation of a high-quality symptom-gene benchmark dataset will always be a systematic task that needs to be performed continuously. The semantic prediction of SEMMED would be a high-quality resource to curate the benchmark dataset with wide symptom coverage.

Conclusion

Symptom-gene identification is a primary step towards understanding the molecular mechanism of symptoms and refining the disease taxonomy in precision medicine. In this study, we curated a benchmark dataset of 18 270 symptom-gene associations and proposed a heterogeneous symptom-related network embedding representation algorithm for symptom gene prediction. We compared our method to the baseline algorithms (FSGER and PRINCE), and the results indicated our algorithm achieved a significant improvement. We also curated a high-quality prediction dataset of 17 479 symptom-candidate genes that contains 461 symptoms and 3620 genes. The analysis results of the candidate genes of typical symptoms indicated that the prediction results have the potential to support investigation of the underlying molecular mechanisms of symptoms in experimental settings.

Funding

The work is partially supported by the National Key Research and Development Program (2017YFC1703506), the Fundamental Research Funds for the Central Universities (2017YJS057, 2017JBM020), the Special Programs of Traditional Chinese Medicine (201407001, JDZX2015170 and JDZX2015171), and the National Key Technology R&D Program (2013BAI02B01 and 2013BAI13B04).

Competing interests

None.

Contributors

X.Z. conceived and designed the research. K.Y. performed the experiments, analyzed the data, and drafted the manuscript; N.W., G.L. and R.W. were involved in the data curation and analysis; X.Z., J.C., J.Y. and R.Z. revised the manuscript. All authors read and approved the final manuscript.
SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

References

1. Li X, Zhou X, Peng Y, et al. Network based integrated analysis of phenotype-genotype data for prioritization of candidate symptom genes. Biomed Res Int 2014; 2014: 435853.
2. Hofmann-Apitius M, Alarcón-Riquelme ME, Chamberlain C, et al. Towards the taxonomy of human disease. Nature Reviews Drug Discovery 2015; 14 (2): 75–6.
3. Köhler S, Vasilevsky NA, Engelstad M, et al. The human phenotype ontology in 2017. Nucleic Acids Res 2017; 45 (D1): D865–76.
4. Kibbe WA, Arze C, Felix V, et al. Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Res 2015; 43 (D1): D1071.
5. Rath A, Olry A, Dhombres F, et al. Representation of rare diseases in health information systems: the Orphanet approach to serve a wide range of end users. Human Mutation 2012; 33 (5): 803–8.
6. Lupski JR, Stankiewicz P. Genomic disorders: molecular mechanisms for rearrangements and conveyed phenotypes. PLoS Genet 2005; 1 (6): e49.
7. Zhou H, Skolnick J. A knowledge-based approach for predicting gene-disease associations. Bioinformatics 2016; 32 (18): 2831–8.
8. Zeng X, Liao Y, Liu Y, et al. Prediction and validation of disease genes using HeteSim Scores. IEEE/ACM Trans Comput Biol Bioinf 2017; 14 (3): 687.
9. Locke AE, Kahali B, Berndt SI, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature 2015; 518 (7538): 197–206.
10. de Heer EW, Have MT, Hwj VM, et al. Pain as a risk factor for common mental disorders. Results from the Netherlands Mental Health Survey and Incidence Study-2: a longitudinal, population-based study. Pain 2018; 159: 712–8.
11. McCarty CA, Chisholm RL, Chute CG, et al. The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics 2011; 4 (1): 1–11.
12. Stover PJ, Harlan WR, Hammond JA, et al. PhenX: a toolkit for interdisciplinary genetics research. Curr Opin Lipidol 2010; 21 (2): 136–40.
13. Jyotishman P, Pan H, Wang J, et al. Evaluating phenotypic data elements for genetics and epidemiological research: experiences from the eMERGE and PhenX Network Projects. AMIA Jt Summits Transl Sci Proc 2011; 2011: 41–5.
14. Le DH, Dang VT. Ontology-based disease similarity network for disease gene prediction. Vietnam J Comp Sci 2016; 3 (3): 1–9.
15. Calvo B, López-Bigas N, Furney SJ, et al. A partially supervised classification approach to dominant and recessive human disease gene prediction. Comp Methods Progr Biomed 2007; 85 (3): 229–37.
16. Jiang R. Walking on multiple disease-gene networks to prioritize candidate genes. J Mol Cell Biol 2015; 7 (3): 214.
17. González-Pérez S, Pazos F, Chagoyen M. Factors affecting interactome-based prediction of human genes associated with clinical signs. BMC Bioinformatics 2017; 18 (1): 340.
18. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 2005; 33 (1): 514–7.
19. Pinero J, Queralt-Rosinach N, Bravo A, et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015; 2015: bav028.
20. Rappaport N, Twik M, Plaschkes I, et al. MalaCards: an amalgamated human disease compendium with diverse clinical and genetic annotation and structured search. Nucleic Acids Res 2017; 45 (D1): D877–87.
21. Keshava Prasad TS, Goel R, Kandasamy K, et al. Human Protein Reference Database–2009 update. Nucleic Acids Res 2009; 37 (Database): D767.
22. Chatr-aryamontri A, Breitkreutz BJ, Oughtred R, et al. The BioGRID interaction database: 2015 update. Nucleic Acids Res 2015; 43 (Database issue): D470.
23. Orchard S, Ammari M, Aranda B, et al. The MIntAct project—IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res 2014; 42: 358–63.
24. Apweiler R. Activities at the universal protein resource (UniProt). Nucleic Acids Res 2014; 42 (11): 7486.
25. Gutiérrez-Sacristán A, Grosdidier S, Valverde O, et al. PsyGeNET: a knowledge platform on psychiatric disorders and their genes. Bioinformatics 2015; 31 (18): 3075–3077.
26. Landrum MJ, Lee JM, Riley GR, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 2014; 42 (Database issue): 980–5.
27. Welter D, Macarthur J, Morales J, et al. The NHGRI GWAS catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 2014; 42 (Database issue): 1001–6.
28. Peter DA, Grondin MC, Robin J, et al. The Comparative Toxicogenomics Database: update 2013. Nucleic Acids Res 2011; 39 (Database issue): 1067–72.
29. Menche J, Sharma A, Kitsak M, et al. Disease networks. Uncovering disease-disease relationships through the incomplete interactome. Science 2015; 347 (6224): 1257601.
30. Cowley MJ, Pinese M, Kassahn KS, et al. PINA v2.0: mining interactome modules. Nucleic Acids Res 2012; 40 (Database issue): 862–5.
31. Lipscomb CE. Medical Subject Headings (MeSH). Bull Med Libr Assoc 2000; 88 (3): 265.
32. Kilicoglu H, Fiszman M, Rodriguez A, et al. Semantic MEDLINE: a web application for managing the results of PubMed searches. Proc Smbm. 2008: 69–76.
33. Wheeler DL, Church DM, Lash AE, et al. Database resources of the National Center for Biotechnology Information: 2002 update. Nucleic Acids Res 2002; 30 (1): 13–16.
34. Fisher RA. On the interpretation of χ2 from contingency tables, and the calculation of P. J R Stat Soc 1922; 85 (1): 87–94.
35. Grover A, Leskovec J. Node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, CA, USA; 2016: 855–864.
36. Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space. arXiv 2013. (https://arxiv.org/abs/1301.3781v3)
37. Perozzi B, Al-Rfou R, Skiena S. DeepWalk: Online learning of social representations. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA; 2014: 701–710.
38. Vanunu O, Magger O, Ruppin E, et al. Associating genes and protein complexes with disease via network propagation. PLoS Comput Biol 2010; 6 (1): e1000641.
39. Billsus D, Pazzani MJ. Learning collaborative information filters. In: Proceedings of the 15th International Conference on Machine Learning. San Francisco, CA, USA; 1998: 46–54.
40. Bauer J, Wendland J. Candida albicans Sfl1 suppresses flocculation and filamentation. Eukaryotic Cell 2007; 6 (10): 1736–1744.
41. Jaworek TJ, Bhatti R, Latief N, et al. USH1K, a novel locus for type I Usher syndrome, maps to chromosome 10p11.21-q21.1. J Hum Genet 2012; 57 (10): 633–637.
42. Dad S, Østergaard E, Thykjaer T, et al. Identification of a novel locus for a USH3 like syndrome combined with congenital cataract. Clin Genet 2010; 78 (4): 388–397.
43. Gootwine E, Ofri R, Banin E, et al. Safety and efficacy evaluation of rAAV2tYF-PR1.7-hCNGA3 vector delivered by subretinal injection in CNGA3 mutant achromatopsia sheep. Hum Gene Ther Clin Dev 2017; 28: 96–107.
44. Ma NG, Ad UI, et al. Mutations in GRM6 identified in consanguineous Pakistani families with congenital stationary night blindness. Mol Vis 2015; 21: 1261–1271.
45. Chowers I, Tiosano L, Audo I, et al. Adult-onset foveomacular vitelliform dystrophy: a fresh perspective. Prog Retinal Eye Res 2015; 47: 64–85.
46. Kuniyoshi K, Hayashi T, Sakuramoto H, et al. New truncation mutation of the NR2E3 gene in a Japanese patient with enhanced S-cone syndrome. Jpn J Ophthalmol 2016; 60 (6): 476–485.
47. Zhou XZ, Menche J, Barabási A, et al. Human symptoms–disease network. Nat Commun 2014; 5: 4212.
48. Zhou X, Lei L, Liu J, et al. A systems approach to refine disease taxonomy by integrating phenotypic and molecular networks. EBioMedicine 2018; 31: 79–91.
49. Donnelly K. SNOMED-CT: the advanced terminology and coding system for eHealth. Stud Health Technol Inform 2006; 121: 279.

© The Author(s) 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Journal of the American Medical Informatics Association, Oxford University Press


Publisher
Oxford University Press
ISSN
1067-5027
eISSN
1527-974X
DOI
10.1093/jamia/ocy117

Abstract

Objective

Investigating the molecular mechanisms of symptoms is a vital task in precision medicine to refine disease taxonomy and improve the personalized management of chronic diseases. Although there are abundant experimental studies and computational efforts to obtain the candidate genes of diseases, the identification of symptom genes is rarely addressed. We curated a high-quality benchmark dataset of symptom-gene associations and proposed a heterogeneous network embedding for identifying symptom genes.

Methods

We proposed a heterogeneous network embedding representation algorithm, which constructed a heterogeneous symptom-related network that integrated symptom-related associations and applied an embedding representation algorithm to obtain the low-dimensional vector representation of nodes. By measuring the relevance between symptoms and genes via calculating the similarities of their vectors, the candidate genes of given symptoms can be obtained.

Results

A benchmark dataset of 18 270 symptom-gene associations between 505 symptoms and 4549 genes was curated. We compared our method to baseline algorithms (FSGER and PRINCE). The experimental results indicated our algorithm achieved a significant improvement over the state-of-the-art method, with precision and recall improved by 66.80% (0.844 vs 0.506) and 53.96% (0.311 vs 0.202), respectively, for TOP@3 and association precision improved by 37.71% (0.723 vs 0.525) over PRINCE.

Conclusions

The experimental validation of the algorithms and the literature validation of typical symptoms indicated our method achieved excellent performance. Hence, we curated a prediction dataset of 17 479 symptom-candidate genes. The benchmark and prediction datasets have the potential to promote investigations of the molecular mechanisms of symptoms and provide candidate genes for validation in experimental settings.
Keywords: heterogeneous network embedding, symptom gene identification, network medicine

Introduction

Symptoms and signs (called symptoms in brief) are the primary evidence for clinical diagnosis and disease classification [1]. As a critical layer connecting exposomes and genomes in the knowledge network, symptoms play an important role in precision medicine to refine disease taxonomy [2]. In recent years, an increasing number of phenotype (disease and symptom) databases, such as Human Phenotype Ontology (HPO) [3], Human Disease Ontology (DO) [4], and Orphanet Rare Disease Ontology (Orphanet) [5], have been constructed. Most biomedical researchers focus mainly on analyzing and understanding the molecular mechanisms of disease phenotypes [6–8]. The investigation of the underlying molecular mechanisms of symptom phenotypes has rarely been addressed, except for disease conditions overlapping with symptom phenotypes, such as obesity [9] and pain [10]. In addition, to promote the study of genomes and phenotypes, the U.S. National Human Genome Research Institute initiated 2 projects: eMERGE [11], which correlates whole genome scans with phenotype data extracted from electronic medical record systems, and PhenX [12], which provides investigators with high-priority, well-established, low-burden standard measures to collect phenotypic and environmental data for large-scale genomic studies. Jyotishman et al [13] adopted multiple standards and biomedical terminologies to promote cross-study pooling of data and the detection of complex genotype-phenotype associations. Similar to the computational approaches for disease-gene prediction, symptom gene identification is also a key task for revealing the underlying molecular mechanisms of symptoms.
Gene prediction of given diseases requires extensive experiments to test hundreds of candidate genes in a wet lab [14]. In fact, experimental gene identification for symptoms and diseases is a difficult and time-consuming task [15]. The success of network-based computational methods for identifying disease genes [8,14,16] demonstrated that they are effective for disease gene prediction. Preliminary work [1] indicates that it is feasible to use a network propagation approach to predict the candidate genes of symptoms, and that complicated factors influence prediction performance [17]. In addition, the recent and increasing curation of large-scale symptom-related association data, such as disease-gene associations (eg OMIM [18], DisGeNet [19] and Malacards [20]), symptom-disease associations (Disease Ontology [4], HPO [3] and Orphanet [5]) and protein-protein interactions (HPRD [21], BioGRID [22], and IntAct [23]), offers a rare opportunity for the development of computational approaches. However, to substantially promote these efforts, we still need to address 2 essential tasks: curation of a high-quality benchmark dataset, and making full use of the heterogeneous symptom-related indirect association data, such as symptom-disease associations, disease-gene associations and protein-protein interactions, to improve symptom gene prediction performance.

Here, by integrating symptom-disease and disease-gene associations, we curated a benchmark dataset of symptom-gene associations. We proposed a deep embedding representation algorithm on a heterogeneous symptom-related network to identify symptom genes (Figure 1). First, we constructed a heterogeneous symptom-related network, which includes symptom-disease, disease-gene and protein-protein associations. Then, the network embedding representation algorithm was applied to construct a low-dimensional vector representation (LVR) of the nodes (symptoms and genes) in the network.
By measuring the relevance between symptoms and genes through the similarities of their vectors, the candidate genes of symptoms can be obtained. We compared the prediction performance of our algorithm to the baseline algorithms (FSGER and PRINCE). The experimental results indicated our algorithm achieved a significant improvement over baseline algorithms. Finally, a high-quality prediction dataset of symptom-candidate gene associations was curated based on the results predicted by our method.

Figure 1. An overview of the LSGER method. First, by integrating disease-symptom, disease-gene, and protein-protein associations (a), a heterogeneous symptom-related network (b) was constructed. Then, the network embedding algorithm was applied to obtain a low-dimensional vector representation of nodes (c). Finally, the relevance between the symptom and gene nodes can be measured by the similarities of their vectors (d). By sorting predicted genes by relevance, the candidate genes of given symptoms can be identified.

Methods

Dataset

Disease-gene associations

Disease-gene associations were collected from the DisGeNet [19] and Malacards [20] databases (Figure 2).
First, we extracted 130 820 curated disease-gene associations between 13 074 diseases with UMLS codes (CUIs) and 8947 genes from the DisGeNet database, which integrates disease-gene associations from the UniProt [24], PsyGeNET [25], ClinVar [26], Orphanet [5], GWAS Catalog [27], CTD [28] and HPO [3] databases. Second, we collected 73 064 disease-gene associations between 6118 diseases with CUIs and 8370 genes from the Malacards database. To unify and integrate the disease terms, we mapped the original disease identifiers of the 2 databases to Unified Medical Language System (UMLS) codes. Finally, the 2 data sources were integrated to obtain 196 397 disease-gene associations that include 16 594 unique diseases and 11 497 unique genes.

Figure 2. A flow chart of data collection and integration. First, 87 442 disease-symptom associations were collected by integrating disease-symptom associations from the DO, HPO and Orphanet databases. We collected and integrated 196 397 disease-gene associations from the DisGeNet and Malacards databases. Then, we selected a set of 1278 symptoms with DP characteristics from the MeSH database and the integrated associations. Finally, a benchmark dataset of 18 270 symptom-gene associations was curated.

Protein-protein interactions

The protein-protein interactions (PPIs) were collected from Menche et al [29], and include 213 888 records with 15 964 unique proteins.
These data are integrated PPI data derived from multiple data sources, such as HPRD [21], BioGRID [22], IntAct [23] and PINA [30].

Disease-symptom associations

Disease-symptom associations were collected from the DO [4], HPO [3] and Orphanet [5] databases (Figure 2). To unify the disease terms from the different datasets, we mapped the original disease codes to UMLS codes. We collected 1008 disease-symptom associations between 204 diseases and 417 symptoms from the DO database, 87 442 disease-symptom associations between 4366 diseases and 6176 symptoms from the HPO database, and 35 039 disease-symptom associations between 2391 diseases and 3721 symptoms from the Orphanet database. By integrating the 3 data sources, we finally obtained 100 305 distinct disease-symptom associations (DSA) between 5605 diseases and 6935 symptoms.

Benchmark dataset construction of symptom-gene associations

By integrating symptom-related and gene-related association data, we curated a benchmark dataset of symptom-gene associations (called BDSG) (Figure 2). In particular, to obtain high-quality symptom-gene associations, we exploited the fact that some phenotypes are “Dual Phenotypes” (DP), such as obesity, fever, back pain, and vertigo, which are regarded not only as diseases but also as symptoms in medical fields. The associated genes of symptoms with DP characteristics can be directly derived from the disease-gene associations with high quality assurance. To identify these kinds of phenotype terms with DP characteristics, we utilized the hierarchical tree codes (eg C08: respiratory tract diseases and C08.618.248: cough) from the MeSH [31] terminology to relate the disease terms in our dataset. First, we collected 1051 symptom terms whose MeSH tree codes start with C23.888. Second, we extracted the disease term list and symptom term list from DSA, respectively, and identified the DP symptoms by intersecting the 2 lists.
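The two selection steps just described (filtering terms by the C23.888 tree-code prefix and intersecting the disease and symptom term lists of DSA) can be sketched with plain set operations. This is only an illustration under stated assumptions: the term names stand in for UMLS CUIs, the association pairs are invented toy data, and every tree code other than the cough code quoted in the text is made up.

```python
SYMPTOM_BRANCH = "C23.888"  # MeSH tree-code prefix named in the text

# Hypothetical toy inputs. Only "C08.618.248" (cough) is quoted in the text;
# the other tree codes and all association pairs are invented for illustration.
mesh_tree_codes = {
    "cough": ["C08.618.248", "C23.888.001"],
    "fever": ["C23.888.002"],
    "asthma": ["C08.000.001"],
}
dsa = [  # (disease, symptom) pairs standing in for the integrated DSA
    ("asthma", "cough"),
    ("influenza", "fever"),
    ("obesity", "back pain"),
    ("diabetes", "obesity"),
]


def mesh_symptom_terms(tree_codes, prefix=SYMPTOM_BRANCH):
    """Terms with at least one tree code under the symptom branch."""
    return {
        term
        for term, codes in tree_codes.items()
        if any(code.startswith(prefix) for code in codes)
    }


def dual_phenotypes(associations):
    """Terms that occur both as a disease and as a symptom in the associations."""
    diseases = {disease for disease, _ in associations}
    symptoms = {symptom for _, symptom in associations}
    return diseases & symptoms


mesh_symptoms = mesh_symptom_terms(mesh_tree_codes)
dp_symptoms = dual_phenotypes(dsa)
```

In this toy data, "obesity" is the only term that appears on both sides of the disease-symptom pairs, so it is the only dual phenotype found by the intersection.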
After obtaining the union set of the aforementioned 2 symptom lists, we curated 1278 symptoms with distinct UMLS CUIs. Then, by intersecting these CUIs with those of the diseases in the integrated disease-gene associations, we obtained 505 symptoms with the DP characteristics, from which we finally curated 18 270 high-quality symptom-gene associations (Supplementary Material S1) between these 505 symptoms and 4549 genes. In addition, to curate a more comprehensive symptom-gene benchmark dataset, we further collected the symptom-gene associations derived from the SEMMED [32] database, which offers semantic predictions from the titles and abstracts of the PubMed [33] literature. We extracted the gene-related semantic predictions about symptom terminologies and finally obtained 50 907 symptom-gene associations (called SPSG) between 932 symptoms and 9382 genes.

Fisher-based statistics model for symptom gene prediction

Based on the Fisher exact test [34], we proposed a Fisher-based statistical model to predict symptom genes (FSGER) as a baseline method. Based on the symptom-disease and disease-gene associations, we considered the diseases as a bridge to connect symptoms and genes. In detail, for symptom $s$ and gene $g$, we defined $a$, $b$, $c$ and $d$ to represent the number of diseases associated with both $s$ and $g$, associated with $s$ but not $g$, associated with $g$ but not $s$, and associated with neither $s$ nor $g$, respectively. The relevance $Rel(s, g)$ between the symptom $s$ and the gene $g$ can be defined as follows:

$$Rel(s, g) = 1 - \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}$$

where $n$ represents the number of all the related diseases. Then, by ranking the predicted genes by the relevance, the ranking gene lists of given symptoms can be obtained.
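As a rough illustration of the FSGER relevance score, the sketch below counts a, b, c and d from disease sets and evaluates the formula exactly as written, that is, one minus the probability of the single observed 2×2 table. The function names and the toy disease identifiers are assumptions made for this example; whether the authors computed a point probability or a summed p-value is not stated beyond the formula.

```python
from math import comb


def contingency_counts(symptom_diseases, gene_diseases, all_diseases):
    """Count the diseases in each cell of the 2x2 symptom-by-gene table."""
    a = len(symptom_diseases & gene_diseases)  # linked to both s and g
    b = len(symptom_diseases - gene_diseases)  # linked to s only
    c = len(gene_diseases - symptom_diseases)  # linked to g only
    d = len(all_diseases) - a - b - c          # linked to neither
    return a, b, c, d


def fisher_relevance(a, b, c, d):
    """Rel(s, g) = 1 - (a+b)!(c+d)!(a+c)!(b+d)! / (a! b! c! d! n!).

    The factorial ratio equals C(a+b, a) * C(c+d, c) / C(n, a+c),
    which avoids building the factorials explicitly.
    """
    n = a + b + c + d
    table_probability = comb(a + b, a) * comb(c + d, c) / comb(n, a + c)
    return 1.0 - table_probability


# Hypothetical toy data: 10 diseases, 4 linked to the symptom, 4 to the gene.
all_diseases = {f"D{i}" for i in range(10)}
symptom_diseases = {"D0", "D1", "D2", "D3"}
gene_diseases = {"D0", "D1", "D2", "D4"}

counts = contingency_counts(symptom_diseases, gene_diseases, all_diseases)
relevance = fisher_relevance(*counts)
```

Ranking all genes for a symptom then amounts to sorting them by this score in descending order.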
Heterogeneous symptom-related network embedding representation

Network embedding representation learning [35] is an effective approach for learning low-dimensional feature vectors of the nodes in a given network, and it can effectively preserve the local and global structure information of the network. Network embedding representation methods are applicable in many tasks, such as visualization, label classification and link prediction [35]. In this study, we constructed a heterogeneous symptom-related network, and applied the network embedding algorithm node2vec [35] to obtain the low-dimensional vector representation of the nodes in the network. As a well-known algorithm for network embedding representation, the main idea of node2vec is to learn a mapping of nodes to a low-dimensional space of features that maximizes the likelihood of preserving network neighborhoods of nodes. In detail, for a given network $G = (V, E)$, the aim of node2vec is to learn the mapping function $f: V \rightarrow \mathbb{R}^d$ (parameter $d$ is the number of feature dimensions) from nodes to feature representations. By applying the Skip-Gram architecture to the network [36,37], the objective function can be optimized by maximizing the log-probability of observing the network neighborhood $N_S(u)$ for node $u$ conditioned on its feature representation as follows:

$$\max_f \sum_{u \in V} \log \Pr\left(N_S(u) \mid f(u)\right)$$

For the node $u \in V$, its network neighborhood $N_S(u)$ can be generated through a neighborhood sampling strategy $S$. The authors of node2vec proposed a biased random walk strategy, which can flexibly and efficiently explore the diverse neighborhoods of nodes. Given a source node $u$, a random walk of fixed length $l$ can be simulated, and node $c_i$ (that is, the $i$-th node in the random walk, with $c_0 = u$) is generated by the distribution function:

$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z} & \text{if } (v, x) \in E \\ 0 & \text{otherwise} \end{cases}$$

where $\pi_{vx}$ is the unnormalized transition probability between nodes $v$ and $x$, and $Z$ is the normalizing constant.
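To make the transition rule concrete, here is a minimal sketch of a node2vec-style biased walk on an unweighted toy graph. The return parameter p and in-out parameter q come from the node2vec paper and are not discussed in this article; the graph, node labels and function names are hypothetical, and this is not the authors' implementation.

```python
import random


def transition_weights(graph, prev, cur, p=1.0, q=1.0):
    """Unnormalized pi_vx for every neighbor x of the current node v = cur.

    On an unweighted graph node2vec uses 1/p for stepping back to prev,
    1 for neighbors shared with prev, and 1/q for nodes farther from prev.
    """
    weights = {}
    for nxt in sorted(graph[cur]):
        if prev is None:
            weights[nxt] = 1.0       # first step: uniform over neighbors
        elif nxt == prev:
            weights[nxt] = 1.0 / p   # return to the previous node
        elif nxt in graph[prev]:
            weights[nxt] = 1.0       # stays at distance 1 from prev
        else:
            weights[nxt] = 1.0 / q   # moves to distance 2 from prev
    return weights


def biased_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Simulate one walk that visits `length` nodes, starting at `start`."""
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        prev = walk[-2] if len(walk) > 1 else None
        weights = transition_weights(graph, prev, cur, p, q)
        if not weights:
            break  # dead end: the node has no neighbors
        nodes = list(weights)
        z = sum(weights.values())  # normalizing constant Z
        probabilities = [weights[x] / z for x in nodes]
        walk.append(rng.choices(nodes, weights=probabilities, k=1)[0])
    return walk


# Hypothetical heterogeneous toy network (s: symptom, d: disease, g: gene).
toy_graph = {
    "s:cough": {"d:asthma", "d:bronchitis"},
    "d:asthma": {"s:cough", "g:G1", "g:G2"},
    "d:bronchitis": {"s:cough", "g:G2"},
    "g:G1": {"d:asthma", "g:G2"},
    "g:G2": {"d:asthma", "d:bronchitis", "g:G1"},
}

example_walk = biased_walk(toy_graph, "s:cough", 6, p=2.0, q=0.5)
```

The walks collected from every node would then be fed to a Skip-Gram model as if they were sentences; that training step is left out of this sketch.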
By applying the 2 standard assumptions, conditional independence and symmetry in the feature space, the low-dimensional vector features of nodes can be learned using stochastic gradient ascent over the model. We constructed 2 heterogeneous networks: SDGNet, which integrated symptom-disease and disease-gene associations, and SDGPNet, which integrated symptom-disease, disease-gene, and protein-protein associations. Given a heterogeneous network $G = (V, E)$, $V$ and $E$ represent the nodes and edges of the network. Then, we applied the network embedding representation algorithm to learn the LVR of nodes. Finally, the node $v$ can be mapped to a low-dimensional vector $N_v$.

LVR-based similarity prediction model to identify symptom genes

We can obtain the LVR of the nodes in the given network based on a heterogeneous network embedding representation algorithm. The low-dimensional vector features of nodes fuse the local structure (neighbors of nodes) and global structure information of the network. We then proposed an LVR-based similarity model for symptom gene prediction (LSGER). The relevance between the symptom and gene nodes can be measured by the similarities of their low-dimensional vectors. Mathematically, given the symptom node $v_s$ and the gene node $v_g$, we can measure the relevance $Rel(v_s, v_g)$ between them by calculating the LVR-based cosine similarity $\cos(N_{v_s}, N_{v_g})$ of their vectors $N_{v_s}$ and $N_{v_g}$ as follows:

$$Rel(v_s, v_g) = \cos(N_{v_s}, N_{v_g}) = \frac{N_{v_s} \cdot N_{v_g}}{\lVert N_{v_s} \rVert \, \lVert N_{v_g} \rVert}$$

By calculating and sorting the relevance between the query symptom and all candidate genes, we can obtain a ranking list of candidate genes for the query symptom. In addition, for the symptom $v_s$, we designed a pre-selection strategy for candidate genes, which selects the genes of diseases related to $v_s$ as the candidate gene pool, and compared it to a no-selection strategy, which selects all genes as the candidate gene pool. Based on the 2 strategies, the 2 variants LSGER-AG (all genes) and LSGER-DG (with filtered disease genes) of the LSGER algorithm were proposed.
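The ranking step and the two candidate pools can be sketched in a few lines. The 2-dimensional vectors, gene labels and function names below are invented for illustration; real embeddings would come from the trained network embedding and have many more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank_genes(symptom_vec, gene_vecs, candidate_pool=None):
    """Rank genes by cosine similarity to the symptom vector.

    candidate_pool=None scores every gene (the all-genes variant);
    passing a set restricts scoring to a pre-selected pool, such as the
    genes of diseases related to the symptom (the disease-gene variant).
    """
    if candidate_pool is None:
        pool = gene_vecs
    else:
        pool = {g: gene_vecs[g] for g in candidate_pool if g in gene_vecs}
    scored = [(gene, cosine(symptom_vec, vec)) for gene, vec in pool.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings.
symptom_vector = [1.0, 0.0]
gene_vectors = {
    "G1": [1.0, 0.0],
    "G2": [1.0, 1.0],
    "G3": [0.0, 1.0],
    "G4": [-1.0, 0.0],
}

ranking_all_genes = rank_genes(symptom_vector, gene_vectors)
ranking_disease_genes = rank_genes(symptom_vector, gene_vectors, {"G2", "G4"})
```

Restricting the pool changes only which genes are scored, not the scores themselves, so the disease-gene ranking is always a sub-sequence of the all-genes ranking.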
Experimental setting and evaluation

We constructed 2 benchmark datasets of symptom-gene associations (BDSG and SPSG), which can be used to evaluate the prediction performance of different algorithms. In the experiment, we removed all the known genes of the symptoms in the benchmark dataset and predicted the candidate genes of every test symptom, so that no a priori symptom-gene associations were available to any of the prediction algorithms. Our method was compared to the baseline algorithms FSGER and PRINCE [1]. PRINCE was originally proposed by Vanunu et al [38] to predict disease genes. Li et al [1] extended PRINCE and applied it to the task of symptom gene prediction. In their work, a network propagation method was used in the PPI network to obtain priority scores of candidate genes. The FSGER algorithm is a Fisher-based statistical model that connects disease-symptom and disease-gene associations for symptom gene prediction. We adopted precision (PR), recall (RE), F1-score (F1) [39], association precision (AP) and area under curve (AUC) as the evaluation metrics. Given a test symptom set $S$ with $m$ symptoms, for every test symptom $s \in S$, $T(s)$ represents the test gene set of symptom $s$. Given a ranking list of predicted genes, we selected the top $i$ genes $R_i^s$ of the ranking list ($i = 3, 10$) as candidate genes. The precision, recall and F1-score for TOP@$i$ are defined as follows:

$$\mathrm{Precision} = \frac{1}{m} \sum_{s \in S} \frac{|T(s) \cap R_i^s|}{|R_i^s|}$$

$$\mathrm{Recall} = \frac{1}{m} \sum_{s \in S} \frac{|T(s) \cap R_i^s|}{|T(s)|}$$

$$\mathrm{F1\text{-}score} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

Because recall is calculated over only the top 3 or 10 candidate genes, recall values can be low for symptoms with many test genes. The same calculation is used for every prediction algorithm, so the comparison remains fair. In addition, for every test symptom $s$, the top $k$ genes $R_k^s$ of the ranking list were also selected, where $k$ equals the number of test genes of symptom $s$.
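The TOP@$i$ metrics can be sketched in a few lines. The formula as printed does not say whether F1 is computed per symptom and then averaged, or computed once from the averaged precision and recall. The per-symptom rows of Table 2 follow the per-symptom formula, and the PRINCE row of Table 1 (precision 0.506, recall 0.202, F1 0.211) is not the harmonic mean of its averaged precision and recall (about 0.289), which suggests per-symptom averaging. That reading is an inference, so the sketch below returns both variants. The gene identifiers and the function name are hypothetical; the example only reuses the shape of the constipation row in Table 2 (158 test genes, 3 hits in the top 3, 9 hits in the top 10).

```python
def top_i_metrics(rankings, test_genes, i):
    """Average TOP@i precision, recall and F1 over all test symptoms.

    rankings   : dict symptom -> ranked list of predicted genes
    test_genes : dict symptom -> set of held-out (true) genes
    """
    precisions, recalls, f1s = [], [], []
    for symptom, truth in test_genes.items():
        top = rankings[symptom][:i]
        hits = len(truth.intersection(top))
        p = hits / len(top) if top else 0.0
        r = hits / len(truth) if truth else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    m = len(test_genes)
    precision = sum(precisions) / m
    recall = sum(recalls) / m
    return {
        "precision": precision,
        "recall": recall,
        # Mean of per-symptom F1 (the reading suggested by Tables 1 and 2).
        "f1_macro": sum(f1s) / m,
        # F1 of the averaged precision and recall (the literal formula).
        "f1_of_means": (
            2 * precision * recall / (precision + recall)
            if (precision + recall) else 0.0
        ),
    }


# Hypothetical symptom shaped like the constipation row of Table 2.
truth = {f"g{n}" for n in range(158)}
ranking = [f"g{n}" for n in range(9)] + ["novel_gene"]  # 9 hits in the top 10
m3 = top_i_metrics({"constipation": ranking}, {"constipation": truth}, 3)
m10 = top_i_metrics({"constipation": ranking}, {"constipation": truth}, 10)
print({k: round(v, 3) for k, v in m3.items()})
print({k: round(v, 3) for k, v in m10.items()})
```

With a single symptom the 2 F1 variants coincide; they diverge once symptoms with very different numbers of test genes are averaged together.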
The association precision is defined as follows:

$$\mathrm{AP} = \frac{\sum_{s \in S} |T(s) \cap R_k^s|}{\sum_{s \in S} |R_k^s|}$$

In addition, we used the AUC to evaluate the prediction performance. For every test symptom, we selected the top 100 predicted genes as candidate genes and obtained the predicted scores of the symptom-candidate gene pairs. We then ranked all the symptom-candidate gene pairs by score and calculated the AUC values. Unlike the AUC calculation for homogeneous networks in link prediction tasks, the AUC calculation used here may not fully reflect the quality of the prediction results. Hence, the AUC evaluation is only a supplement to the other metrics.

Results

LVR-based similarity model to predict symptom genes

We compared LSGER to the PRINCE and FSGER algorithms, using precision, recall and F1-score for TOP@3 and TOP@10, association precision, and AUC as evaluation metrics. For the LSGER-AG and LSGER-DG algorithms, we used 2 heterogeneous networks, SDGNet and SDGPNet, as test networks. First, the experimental results (Table 1) on the BDSG dataset show that, compared to the baseline algorithm PRINCE (AP = 0.525; PR = 0.506 and RE = 0.202 for TOP@3), the FSGER algorithm achieved slightly better performance: AP improved by 2.10%; PR and RE improved by 20.55% and 17.33%, respectively, for TOP@3. LSGER-AG with SDGPNet yielded the best performance: compared to PRINCE, AP improved by 37.71%; AUC improved by 21.60%; PR and RE improved by 66.80% and 53.96%, respectively, for TOP@3. Second, the LSGER algorithm obtained slightly higher performance with SDGPNet than with SDGNet (LSGER-AG: PR and RE improved by 1.69% and 3.67%, respectively, for TOP@3; LSGER-DG: PR and RE improved by 1.58% and 3.32%, respectively, for TOP@3), which indicated that fusing more gene-related information (the PPI network) improved the prediction performance of the LSGER algorithm. Finally, in terms of precision and recall for TOP@3, LSGER-AG and LSGER-DG had similar prediction performance.
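The association precision pools hits over all test symptoms instead of averaging per-symptom ratios, so symptoms with many test genes carry more weight. A minimal sketch, with hypothetical gene and symptom identifiers and a hypothetical function name:

```python
def association_precision(rankings, test_genes):
    """Pooled precision where each symptom contributes its top-k genes,
    with k equal to the number of test genes of that symptom."""
    hits = 0
    selected = 0
    for symptom, truth in test_genes.items():
        k = len(truth)
        top_k = rankings[symptom][:k]
        hits += len(truth.intersection(top_k))
        selected += len(top_k)
    return hits / selected if selected else 0.0


# Hypothetical toy data: S1 has 2 test genes, S2 has 3.
test_genes = {"S1": {"G1", "G2"}, "S2": {"G3", "G4", "G5"}}
rankings = {
    "S1": ["G1", "G9", "G2"],        # top 2 contains 1 hit
    "S2": ["G3", "G4", "G8", "G5"],  # top 3 contains 2 hits
}
ap = association_precision(rankings, test_genes)
print(ap)  # (1 + 2) / (2 + 3)
```

Because $k$ equals the number of test genes, a perfect ranking gives AP = 1, which is not possible for TOP@3 recall when a symptom has more than 3 test genes.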
However, in terms of AP, the prediction performance of LSGER-DG was better than that of LSGER-AG (with SDGNet: AP improved by 6.31%; with SDGPNet: AP improved by 9.54%), which indicated that candidate gene pre-selection improved the prediction performance of the LSGER algorithm.

Table 1. The performance comparison of symptom gene prediction algorithms

| Network | Algorithm | AP | AUC | Precision (TOP@3) | Recall (TOP@3) | F1-score (TOP@3) | Precision (TOP@10) | Recall (TOP@10) | F1-score (TOP@10) |
|---|---|---|---|---|---|---|---|---|---|
| – | PRINCE | 0.525 | 0.736 | 0.506 | 0.202 | 0.211 | 0.420 | 0.371 | 0.296 |
| – | FSGER | 0.536 | 0.564 | 0.610 | 0.237 | 0.252 | 0.486 | 0.422 | 0.344 |
| SDGNet | LSGER-AG | 0.745 | 0.890 | 0.830 | 0.300 | 0.327 | **0.719** | 0.572 | 0.488 |
| SDGNet | LSGER-DG | **0.792** | 0.856 | 0.821 | 0.301 | 0.327 | 0.693 | 0.561 | 0.473 |
| SDGPNet | LSGER-AG | 0.723 | **0.895** | **0.844** | **0.311** | **0.338** | **0.719** | **0.576** | **0.489** |
| SDGPNet | LSGER-DG | **0.792** | 0.853 | 0.834 | **0.311** | 0.336 | 0.698 | 0.568 | 0.478 |

The bold values represent the best performance for each metric (eg AUC, precision and recall). AP represents association precision.

Furthermore, we performed comparative experiments with different similarity metrics, reported in the supplementary materials (SM). We selected 3 classical similarity metrics, cosine similarity (Sim_cos), Euclidean distance similarity (Sim_eu) and Pearson similarity (Sim_pea), to measure the vector similarities of symptom and gene nodes. The results predicted by the LSGER-AG algorithm with the SDGNet and SDGPNet networks indicated that the choice of similarity metric had some influence on the prediction performance of our algorithm. For example, in terms of precision (PR) and recall (RE) for TOP@3, the prediction algorithm with Sim_pea (PR = 0.852; RE = 0.314), Sim_eu (PR = 0.871; RE = 0.318) and Sim_cos (PR = 0.844; RE = 0.311) obtained similar recall but different precision.
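The 3 similarity metrics compared above can be sketched as follows. The exact definitions are given in the SM and are not reproduced in this section, so the forms below are assumptions: in particular, mapping Euclidean distance $d$ to a similarity as $1 / (1 + d)$ is one common choice and may differ from the authors' definition. Function names mirror the labels in the text but are otherwise hypothetical.

```python
import math


def sim_cos(a, b):
    """Cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sim_pea(a, b):
    """Pearson correlation: cosine similarity of mean-centred vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sim_cos([x - mean_a for x in a], [y - mean_b for y in b])


def sim_eu(a, b):
    """Euclidean-distance similarity, assumed here to be 1 / (1 + distance)."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)


# Two vectors pointing in the same direction but with different lengths.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]
print(sim_cos(a, b), sim_pea(a, b), sim_eu(a, b))
```

Cosine and Pearson similarity ignore vector length, so `a` and `b` are maximally similar under both, whereas the Euclidean form penalises the difference in magnitude. Differences of this kind are one plausible reason why the 3 metrics produce different rankings.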
In the SM section, we also compared the performance of symptom-gene prediction algorithms on the SPSG dataset. The prediction results indicated that LSGER-DG with SDGPNet still obtained the best performance: compared to the PRINCE algorithm, the recall and F1-score improved by 35.32% and 64.24%, respectively. Compared to the BDSG dataset, with its highly credible symptom-gene associations, the prediction associations offered by SEMMED had low confidence. Therefore, the evaluation results on the BDSG dataset are of greater value than those on the SPSG dataset. Overall, our method outperformed the other prediction algorithms.

Case study: candidate genes of some typical symptoms

To illustrate the performance of the prediction algorithm, we report the prediction performance of LSGER-AG with SDGPNet for several typical symptoms (Table 2), including constipation (CUI: C0009806), nausea (CUI: C0027497), pain (CUI: C0030193), Usher syndromes (C1568248), vision disorders (C0042790), and aphasia (C0003537), which are regarded as DP symptom terms. The top 10 candidate genes of these symptoms are also listed (Table 3), and the bold genes in the table are the known genes of these symptoms. For example, for constipation, the top 9 candidate genes are known genes (PR = 0.9 for TOP@10). In addition, among the candidate genes of pain (Table 3), we found 9 benchmark genes, and the remaining gene, ZNF470 (rank = 5), was related to amyotrophic lateral sclerosis (ALS) [40]. We searched HPO [3] and found that pain is one of the typical symptoms of ALS. Therefore, ZNF470 might be a novel gene for pain.

Table 2. The prediction performance of some specific symptoms

| ID | Symptom (CUI) | Number of hit genes/test genes | Precision (TOP@3) | Recall (TOP@3) | F1-score (TOP@3) | Precision (TOP@10) | Recall (TOP@10) | F1-score (TOP@10) |
|---|---|---|---|---|---|---|---|---|
| 1 | Constipation (C0009806) | 109/158 | 1.000 | 0.019 | 0.037 | 0.900 | 0.057 | 0.107 |
| 2 | Nausea (C0027497) | 11/17 | 1.000 | 0.176 | 0.300 | 0.800 | 0.471 | 0.593 |
| 3 | Pain (C0030193) | 54/79 | 1.000 | 0.038 | 0.073 | 0.900 | 0.114 | 0.202 |
| 4 | Usher syndromes (C1568248) | 15/18 | 0.667 | 0.111 | 0.190 | 0.700 | 0.389 | 0.500 |
| 5 | Vision disorders (C0042790) | 6/6 | 1.000 | 0.500 | 0.667 | 0.600 | 1.000 | 0.750 |
| 6 | Aphasia (C0003537) | 5/9 | 0.333 | 0.111 | 0.167 | 0.600 | 0.667 | 0.632 |

Table 3. The top 10 candidate genes of some specific symptoms

| Rank | Constipation (C0009806) | Nausea (C0027497) | Pain (C0030193) | Usher syndromes (C1568248) | Vision disorders (C0042790) | Aphasia (C0003537) |
|---|---|---|---|---|---|---|
| 1 | SEMA3C | ETFDH | PROKR1 | USH1G | TSEN54 | LOC643387 |
| 2 | NRTN | ETFB | PON3 | PDZD7 | TSEN2 | ATP1A2 |
| 3 | GPBAR1 | ETFA | PNOC | USH1K | TSEN34 | PSNP2 |
| 4 | HMBS | LPL | DAO | USH1H | TTPA | GRN |
| 5 | MLNR | HMBS | ZNF470 | USH1E | CLN6 | GRIN2A |
| 6 | DUOX2 | IFNA2 | HTR3B | CIB2 | ATXN7 | ADA2 |
| 7 | TRHR | COQ4 | NTSR1 | CDH23 | CNGA3 | REEP1 |
| 8 | MLN | SLC7A7 | TRPA1 | MT-TS2 | GRM6 | MAPT |
| 9 | SCN11A | ACADM | UNC13A | USH1C | PRPH2 | L1CAM |
| 10 | CELIAC8 | TNF | BDKRB1 | WHRN | NR2E3 | NOTCH3 |

The genes with bold fonts represent the candidate genes that are known genes of the corresponding symptoms.

We further evaluated the predicted genes of pain by additional validation using PPI interactions and genetic functional analysis. In particular, we extracted the interactions of the top 49 predicted genes of pain in the context of the whole PPI network and showed their interaction map (Figure 3a), which includes 36 benchmark genes and 13 novel candidate genes. There are dense interactions (95 interactions) between those benchmark genes and the novel candidate genes compared to the interactions with random controls (p-value = 6.82e-68), which indicated that the novel genes are located close to benchmark genes in the PPI network. Further enrichment analysis (Gene Ontology and Pathway) of the pain predicted genes obtained similar results (Figure 3b).
For example, there are 9 candidate genes and 11 known genes on the neuroactive ligand-receptor interaction pathway (p-value = 9.90E-15). This additional analysis therefore indicated that there are dense interactions among the candidate genes and known genes of pain, which partially validates the rationality of the prediction results.

Figure 3. PPI interaction and genetic functional analysis of the predicted genes of pain. We extracted the interaction of the top 49 predicted genes of pain in the context of the whole protein-protein interaction (PPI) network and showed interaction matrix of them (a), which includes 36 known genes (ie benchmark genes) and 13 novel candidate genes. There are dense interactions (95 interactions) between those benchmark genes and the novel candidate genes compared to those with random controls (p-value=6.82e-68), which indicated that the novel genes were located close to benchmark genes in the PPI network. Further pathway and Gene Ontology (termed GO) enrichment analysis of the pain predicted genes obtained similar results. The bold and underlined genes are known and candidate genes of pain, respectively.

To fully evaluate the candidate genes that were not recorded in the BDSG dataset, we manually searched recently published biomedical papers to verify the novel candidate genes. For example, for the novel candidate genes of Usher syndromes (PR = 0.7 for TOP@10), we found that Jaworek et al [41] confirmed the locus (chromosome 10p11.21-q21.1) of the USH1K gene (rank = 3) associated with type 1 Usher syndrome. The candidate gene USH1H (rank = 4), which was investigated by Dad et al [42], is likely to be associated with Usher syndrome as well. In addition, for all 4 novel candidate genes in the top 10 gene list of vision disorders, we found positive validations from recent independent publications. For example, Gootwine et al [43] verified that achromatopsia can be caused by CNGA3 (rank = 7) mutations. Furthermore, the remaining 3 candidate genes, GRM6 (rank = 8), PRPH2 (rank = 9) and NR2E3 (rank = 10), were likely to be associated with subtypes of vision disorders, such as night blindness [44], visual acuity [45], and enhanced S-cone syndrome [46].

Discussion

In real-world clinical settings, symptoms always play an essential role in both the diagnosis and treatment of diseases. Symptoms are the most directly observable manifestations of a disease [47]. Therefore, investigating the underlying molecular mechanisms of symptoms has the potential to propel the refinement of disease taxonomy [48] for precision medicine. In this study, we constructed a benchmark dataset of symptom-gene associations and proposed a heterogeneous symptom-related network embedding prediction algorithm for symptom gene prediction. The experimental results indicated our algorithm achieved a significant improvement over the state-of-the-art method.
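The improvement figures quoted in this study are relative changes over the PRINCE baseline, not differences in percentage points. As a quick arithmetic check, the sketch below recomputes them from the reported values of LSGER-AG with SDGPNet and PRINCE (TOP@3 precision 0.844 vs 0.506, TOP@3 recall 0.311 vs 0.202, AP 0.723 vs 0.525, AUC 0.895 vs 0.736). The function and variable names are illustrative only.

```python
def relative_improvement(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# (LSGER-AG with SDGPNet, PRINCE) pairs as reported in the paper.
reported = {
    "precision@3": (0.844, 0.506),
    "recall@3": (0.311, 0.202),
    "AP": (0.723, 0.525),
    "AUC": (0.895, 0.736),
}
improvements = {
    name: round(relative_improvement(new, base), 2)
    for name, (new, base) in reported.items()
}
print(improvements)
```

In absolute terms the same precision gain is 33.8 percentage points (0.844 minus 0.506), which is worth keeping in mind when comparing these figures with results reported elsewhere.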
The heterogeneous symptom-related network embedding prediction algorithm that we proposed can make full use of multiple kinds of symptom-related information (eg symptom-disease, disease-gene and protein-protein associations). In particular, we integrated the symptom-disease and disease-gene associations to curate a benchmark dataset of symptom-gene associations, which can be used to evaluate the performance of newly proposed symptom gene prediction algorithms. By systematically checking the symptom terms (more details in SM), we curated a high-quality prediction dataset that contains 17 479 symptom-candidate genes between 461 symptoms and 3620 genes (Supplementary Material S2). The benchmark and prediction datasets of symptom-gene associations can also be used to further investigate symptom-related molecular mechanisms in experimental settings. However, because curation takes time, most biomedical knowledge databases (eg UMLS and SEMMED) lag behind the latest publications. To address this limitation, we manually validated the candidate genes against recent literature to evaluate their reliability. Furthermore, the experimental results indicated that fusing more information can improve prediction performance. Therefore, we will consider more heterogeneous data, such as gene ontology and expression data, in future work.

The symptom terms that were extracted from the UMLS database have hierarchical structures. For example, as a high-level category, vision disorder is the hypernym of cataracts (CUI: C0086543), cortical blindness (CUI: C0155320), and night blindness (CUI: C0028077). We will extract and curate a symptom-gene benchmark with hierarchical structures, which should help us design a more reliable prediction algorithm. In addition, the symptom terms from the MeSH database are high-quality but limited in number.
Therefore, we need to collect further symptom terms from the “Clinical Finding” category of SNOMED [49] to expand our dataset. However, the curation of a high-quality symptom-gene benchmark dataset will always be a systematic task that needs to be performed continuously. The semantic prediction of SEMMED would be a high-quality resource to curate the benchmark dataset with wide symptom coverage.

Conclusion

Symptom-gene identification is a primary step towards understanding the molecular mechanisms of symptoms and refining the disease taxonomy in precision medicine. In this study, we curated a benchmark dataset of 18 270 symptom-gene associations and proposed a heterogeneous symptom-related network embedding representation algorithm for symptom gene prediction. We compared our method to the baseline algorithms (FSGER and PRINCE), and the results indicated that our algorithm achieved a significant improvement. We also curated a high-quality prediction dataset of 17 479 symptom-candidate genes covering 461 symptoms and 3620 genes. The analysis of the candidate genes of typical symptoms indicated that the prediction results have the potential to support investigation of the underlying molecular mechanisms of symptoms in experimental settings.

Funding

The work is partially supported by the National Key Research and Development Program (2017YFC1703506), the Fundamental Research Funds for the Central Universities (2017YJS057, 2017JBM020), the Special Programs of Traditional Chinese Medicine (201407001, JDZX2015170 and JDZX2015171), and the National Key Technology R&D Program (2013BAI02B01 and 2013BAI13B04).

Competing interests

None.

Contributors

X. Z conceived and designed the research. K. Y performed the experiments, analyzed the data, and drafted the manuscript; N. W, G. L and R. W were involved in the data curation and analysis; X. Z, J. C, J. Y and R. Z revised the manuscript. All authors read and approved the final manuscript.
SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

References

1 Li X, Zhou X, Peng Y, et al. Network based integrated analysis of phenotype-genotype data for prioritization of candidate symptom genes. Biomed Res Int 2014; 2014: 435853.
2 Hofmannapitius M, Alarcónriquelme ME, Chamberlain C, et al. Towards the taxonomy of human disease. Nature Reviews Drug Discovery 2015; 14 (2): 75–6.
3 Köhler S, Vasilevsky NA, Engelstad M, et al. The human phenotype ontology in 2017. Nucleic Acids Res 2017; 45 (D1): D865–76.
4 Kibbe WA, Arze C, Felix V, et al. Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Res 2015; 43 (D1): D1071.
5 Rath A, Olry A, Dhombres F, et al. Representation of rare diseases in health information systems: the Orphanet approach to serve a wide range of end users. Human Mutation 2012; 33 (5): 803–8.
6 Lupski JR, Stankiewicz P. Genomic disorders: molecular mechanisms for rearrangements and conveyed phenotypes. Plos Genet 2005; 1 (6): e49.
7 Zhou H, Skolnick J. A knowledge-based approach for predicting gene-disease associations. Bioinformatics 2016; 32 (18): 2831–8.
8 Zeng X, Liao Y, Liu Y, et al. Prediction and validation of disease genes using HeteSim Scores. IEEE/ACM Trans Comput Biol Bioinf 2017; 14 (3): 687.
9 Locke AE, Kahali B, Berndt SI, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature 2015; 518 (7538): 197–206.
10 de Heer EW, Have MT, Hwj VM, et al. Pain as a risk factor for common mental disorders. Results from the Netherlands Mental Health Survey and Incidence Study-2: a longitudinal, population-based study. Pain 2018; 159: 712–8.
11 Mccarty CA, Chisholm RL, Chute CG, et al. The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics 2011; 4 (1): 1–11.
12 Stover PJ, Harlan WR, Hammond JA, et al. PhenX: a toolkit for interdisciplinary genetics research. Curr Opin Lipidol 2010; 21 (2): 136–40.
13 Jyotishman P, Pan H, Wang J, et al. Evaluating phenotypic data elements for genetics and epidemiological research: experiences from the eMERGE and PhenX Network Projects. AMIA Jt Summits Transl Sci Proc 2011; 2011: 41–5.
14 Le DH, Dang VT. Ontology-based disease similarity network for disease gene prediction. Vietnam J Comp Sci 2016; 3 (3): 1–9.
15 Calvo B, López-Bigas N, Furney SJ, et al. A partially supervised classification approach to dominant and recessive human disease gene prediction. Comp Methods Progr Biomed 2007; 85 (3): 229–37.
16 Jiang R. Walking on multiple disease-gene networks to prioritize candidate genes. J Mol Cell Biol 2015; 7 (3): 214.
17 Gonzálezpérez S, Pazos F, Chagoyen M. Factors affecting interactome-based prediction of human genes associated with clinical signs. BMC Bioinformatics 2017; 18 (1): 340.
18 Ada Hamosh AFS, Amberger JS, Bocchini CA, Victor A. McKusick Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 2005; 33 (1): 514–7.
19 Pinero J, Queralt-Rosinach N, Bravo A, et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015; 2015: bav028.
20 Rappaport N, Twik M, Plaschkes I, et al. MalaCards: an amalgamated human disease compendium with diverse clinical and genetic annotation and structured search. Nucleic Acids Res 2017; 45 (D1): D877–87.
21 Keshava Prasad TS, Goel R, Kandasamy K, et al. Human Protein Reference Database–2009 update. Nucleic Acids Res 2009; 37 (Database): D767.
22 Chatraryamontri A, Breitkreutz BJ, Oughtred R, et al. The BioGRID interaction database: 2015 update. Nucleic Acids Res 2015; 43 (Database issue): D470.
23 Orchard S, Ammari M, Aranda B, et al. The MIntAct project—IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res 2014; 42: 358–63.
24 Apweiler R. Activities at the universal protein resource (UniProt). Nucleic Acids Res 2014; 42 (11): 7486.
25 Gutiérrez-Sacristán A, Grosdidier S, Valverde O, et al. PsyGeNET: a knowledge platform on psychiatric disorders and their genes. Bioinformatics 2015; 31 (18): 3075–3077.
26 Landrum MJ, Lee JM, Riley GR, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 2014; 42 (Database issue): 980–5.
27 Welter D, Macarthur J, Morales J, et al. The NHGRI GWAS catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 2014; 42 (Database issue): 1001–6.
28 Peter DA, Grondin MC, Robin J, et al. The Comparative Toxicogenomics Database: update 2013. Nucleic Acids Res 2011; 39 (Database issue): 1067–72.
29 Menche J, Sharma A, Kitsak M, et al. Disease networks. Uncovering disease-disease relationships through the incomplete interactome. Science 2015; 347 (6224): 1257601.
30 Cowley MJ, Pinese M, Kassahn KS, et al. PINA v2.0: mining interactome modules. Nucleic Acids Res 2012; 40 (Database issue): 862–5.
31 Lipscomb CE. Medical Subject Headings (MeSH). Bull Med Libr Assoc 2000; 88 (3): 265.
32 Kilicoglu H, Fiszman M, Rodriguez A, et al. Semantic MEDLINE: a web application for managing the results of PubMed searches. Proc Smbm 2008: 69–76.
33 Wheeler DL, Church DM, Lash AE, et al. Database resources of the National Center for Biotechnology Information: 2002 update. Nucleic Acids Res 2002; 30 (1): 13–16.
34 Fisher RA. On the interpretation of χ2 from contingency tables, and the calculation of P. J R Stat Soc 1922; 85 (1): 87–94.
35 Grover A, Leskovec J. Node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, CA, USA. 2016: 855–864.
36 Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space. arXiv 2013. (https://arxiv.org/abs/1301.3781v3)
37 Perozzi B, Al-Rfou R, Skiena S. DeepWalk: Online learning of social representations. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA. 2014: 701–710.
38 Vanunu O, Magger O, Ruppin E, et al. Associating genes and protein complexes with disease via network propagation. Plos Comput Biol 2010; 6 (1): e1000641.
39 Billsus D, Pazzani MJ. Learning collaborative information filters. In: Proceedings of the 15th International Conference on Machine Learning. San Francisco, CA, USA. 1998: 46–54.
40 Bauer J, Wendland J. Candida albicans Sfl1 suppresses flocculation and filamentation. Eukaryotic Cell 2007; 6 (10): 1736–1744.
41 Jaworek TJ, Bhatti R, Latief N, et al. USH1K, a novel locus for type I Usher syndrome, maps to chromosome 10p11.21-q21.1. J Hum Genet 2012; 57 (10): 633–637.
42 Dad S, Østergaard E, Thykjaer T, et al. Identification of a novel locus for a USH3 like syndrome combined with congenital cataract. Clin Genet 2010; 78 (4): 388–397.
43 Gootwine E, Ofri R, Banin E, et al. Safety and efficacy evaluation of rAAV2tYF-PR1.7-hCNGA3 vector delivered by subretinal injection in CNGA3 mutant achromatopsia sheep. Hum Gene Ther Clin Dev 2017; 28: 96–107.
44 Ma NG, Ad UI, et al. Mutations in GRM6 identified in consanguineous Pakistani families with congenital stationary night blindness. Mol Vis 2015; 21: 1261–1271.
45 Chowers I, Tiosano L, Audo I, et al. Adult-onset foveomacular vitelliform dystrophy: a fresh perspective. Prog Retinal Eye Res 2015; 47: 64–85.
46 Kuniyoshi K, Hayashi T, Sakuramoto H, et al. New truncation mutation of the NR2E3 gene in a Japanese patient with enhanced S-cone syndrome. Jpn J Ophthalmol 2016; 60 (6): 476–485.
47 Zhou XZ, Menche J, Barabási A, et al. Human symptoms–disease network. Nat Commun 2014; 5: 4212.
48 Zhou X, Lei L, Liu J, et al. A systems approach to refine disease taxonomy by integrating phenotypic and molecular networks. EBioMedicine 2018; 31: 79–91.
49 Donnelly K. SNOMED-CT: the advanced terminology and coding system for eHealth. Stud Health Technol Inform 2006; 121: 279.

© The Author(s) 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Journal

Journal of the American Medical Informatics Association, Oxford University Press

Published: Nov 1, 2018
