iTAGPred: A Two-Level Prediction Model for Identification of Angiogenesis and Tumor Angiogenesis Biomarkers

Hindawi Applied Bionics and Biomechanics, Volume 2021, Article ID 2803147, 15 pages. https://doi.org/10.1155/2021/2803147

Research Article

Khalid Allehaibi (1), Yaser Daanial Khan (2), and Sher Afzal Khan (3)
(1) Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
(2) Department of Computer Science, University of Management and Technology, Lahore, Pakistan
(3) Department of Computer Sciences, Abdul Wali Khan University Mardan, Pakistan
Correspondence should be addressed to Sher Afzal Khan; sher.afzal@awkum.edu.pk
Received 1 June 2021; Accepted 2 September 2021; Published 27 September 2021
Academic Editor: Jose Merodio
Copyright © 2021 Khalid Allehaibi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

A crucial biological process called angiogenesis plays a vital role in the migration, growth, and wound healing of endothelial cells and in other processes that are controlled by chemical signals. Angiogenesis controls the growth of blood vessels within tissues, and angiogenesis proteins play a significant role in its proper working. The balancing of these signals is necessary for the proper working of angiogenesis; unbalanced signals increase blood vessel formation, which causes abnormal growth or several diseases, including cancer. The proposed work focuses on developing a two-layered prediction model using different classifiers: random forest (RF), neural network, and support vector machine. The first level performs in silico identification of angiogenesis proteins based on the primary structure. If the protein is an angiogenesis protein, the second level predicts whether it is linked with tumor angiogenesis or not. The performance of the model is evaluated through various validation techniques: k-fold cross-validation, independent, self-consistency, and jackknife testing. The overall accuracy using an RF classifier was 97.8% for angiogenesis at the first level and 99.5% for tumor angiogenesis at the second level; ANN showed 94.1% accuracy for angiogenesis and 79.9% for tumor angiogenesis; and the accuracy of SVM was 78.8% for angiogenesis and 65.19% for tumor angiogenesis.

1. Introduction

The biological process in which new blood vessels develop from preexisting blood vessels is called angiogenesis [1]. It is a normal process that plays a vital role in the migration, growth, and healing of endothelial cells. Angiogenesis itself is controlled by chemical signals. Usually, the effects of these chemical signals remain balanced, which means that new blood vessels only develop on a need basis. But sometimes these signals can become unbalanced and increase blood vessel formation, which in return causes abnormal growth or diseases [2, 3]. Angiogenesis plays a vital role in the development and growth of cancer cells [4, 5]. Just like normal cells, tumor cells also need oxygen and other nutrients, carried in the blood, to grow and expand. Tumor cells therefore send chemical signals that stimulate the growth of new blood vessels. Without the angiogenesis process, abnormal or tumor cells cannot grow beyond 1–2 mm in size [6, 7]. This abnormal angiogenesis process not only contributes to cancer but is also a precursor of several diseases such as leukemia, other hematologic diseases, muscular degeneration, and eye diseases [8–10].

Cancer is ranked as the leading cause of death in the 21st century around the world. According to a survey report published in 2015 by the World Health Organization (WHO), cancer is the first or second major reason for death before the age of 70 in 91 countries around the globe [7]. Furthermore, according to the 2018 cancer statistics report by the International Agency for Research on Cancer and Cancer Research UK, 9.6 million people around the world die due to cancer [7, 11]. This figure is predicted to increase in the coming years.
Researchers, scientists, and biologists all around the world are searching for techniques to develop different drugs and systems to fight against this deadly disease [12]. Until now, many researchers have contributed their knowledge to develop systems for tumor prediction at different stages of its life cycle. Different strategies have been proposed to control the disease, such as chemotherapy [13, 14], radiation therapy [15, 16], surgery, bone marrow transplant (also known as cord blood transplant), and vaccines [17]. Cancer can also attack the brain, the most crucial part of the human body. The brain has the most delicate and complex structure, so it is difficult to deliver drugs to treat it, although approaches such as high-dose chemotherapy and blood-brain barrier disruption can deliver drugs [18]. Many tumor therapies revolve around the attempt to suppress the tumor angiogenesis process. Scientists have discovered many ligands that can bind to tumor angiogenesis proteins such that their function is inhibited. Hence, identification of angiogenesis and tumor angiogenesis proteins is crucial in finding novel and effective tumor therapies.

Formerly, several mathematical [3] and computational models have been developed for the classification or identification of various proteomic and genomic attributes [19]. The proposed work establishes a computational model based on position and compositional information of a primary sequence that attempts to accurately identify angiogenesis and tumor angiogenesis proteins. Since tumor angiogenesis proteins are also characterized as angiogenesis proteins, the similarity of their obscure features can often lead to an ambiguous outcome. Ambiguity among seemingly similar angiogenesis and tumor angiogenesis proteins is resolved by a two-layer classification model. The initial layer distinguishes between angiogenesis and nonangiogenesis proteins, while the second layer deciphers whether a protein identified as an angiogenesis protein is tumor causing or not. The two-layered model helps alleviate ambiguity and yields more accurate results.

The rest of the paper is organized as follows. Section 2 illuminates the importance of angiogenesis uncovered in previous research and also discusses the state-of-the-art models used for in silico identification of proteomic attributes. Section 3 discusses the methodology adopted for the proposed in silico identification model. Section 4 illustrates the accuracy of the model obtained through well-defined, rigorous testing methodologies. Section 5 provides a general discussion regarding the performance of the proposed model.
1.1. Current State of the Art. The crucial role of angiogenesis in tumor progression was first discovered by Judah Folkman in 1971 [20]. Angiogenesis is a crucial process of vascular system growth through the sprouting and splitting of blood vessels [21]. Tumor cells also require a constant flow of blood for their growth, for which they stimulate the growth of blood vessels through secretion of various tumor angiogenesis proteins or growth factors. Cancer treatment therapies are aimed at finding inhibitors for such growth factors. Identification of angiogenesis and tumor angiogenesis proteins therefore bears enormous significance in cancer research, as these proteins are the targets of such inhibitors [22]. Much cancer research revolves around finding ligands and substances that will bind with tumor angiogenesis proteins and inhibit their role [23]. Scientists use various methodologies for the identification of protein attributes [24–28]. In silico identification techniques have evolved and received acclaim over the past few years as they provide robust and fast results and are cost-effective [29, 30]. Various mathematical and computational models have been used to identify attributes of proteins based on the composition and positioning of amino acid residues [31]. A position-based mathematical model, namely the position-specific scoring matrix (PSSM), was introduced in 1982 [32]. Numerous prediction models have been designed that incorporate PSSM for the identification of proteomic attributes. However, since PSSM does not incorporate composition-related information, it lacks a major aspect that determines proteomic attributes. In 2001, Chou introduced the pseudo amino acid composition (PseAAC) model, which encompasses position as well as composition information and hence provides better results [33]. Many generalizations and variants have since been proposed to provide even better results [31]. The choice of the most appropriate classifier plays a pivotal role in the design of such methodologies. A multitude of classifiers have been engaged for the prediction of posttranslational modification sites, including random forest, support vector machine, neural networks, and deep learning. In [34], the authors incorporate adapted normal distribution biprofile Bayes with PseAAC to formulate a prediction model. The accuracy is further improved using kernel sparse representation classification and the minimum redundancy maximum relevance algorithm [35]. Subsequently, an improved depiction uses a deep learning algorithm formulated by [36]. Deep learning has emerged as an encouraging model for the resolution of a multitude of problems [37–39]. The proposed work presents a two-layered model based on position and composition relative features and statistical moments [31] for the identification of angiogenesis and tumor angiogenesis proteins, probed on various classifiers to accrue the best results.
2. Materials and Methods

Angiogenesis has been identified as a critical process that needs to be subjugated to disrupt the progression of cancer. Angiogenesis proteins, especially the ones that lead to tumor angiogenesis, have a crucial significance in this process. Since they promote the development of new blood vessels within the cancerous tissue, they are considered an important biomarker for early detection of cancer. Tumors use the same process for their growth; however, it is possible to uniquely identify the growth factors that are responsible for it. In terms of proteomic features, angiogenesis and tumor angiogenesis proteins have mutual properties. Therefore, to fulfill the arduous challenge of distinctly identifying tumor angiogenesis proteins, a two-layered approach is adopted, as shown in Figure 1.

Figure 1: Flowchart of the proposed system.

The first layer of the model detects whether or not a protein is an angiogenesis protein, using the primary structure of that protein. If it is an angiogenesis protein, the second layer of the model is invoked to decide whether the angiogenesis protein can potentially cause cancer or not. The proposed workflow is shown in Figure 2 and consists of a five-step approach: initially, a well-reviewed and experimentally tested dataset of angiogenesis proteins is collected and preprocessed to remove redundancies. Next, feature extraction is performed to transform the biological data into an equivalent mathematical matrix. In the third step, the obtained feature matrix is used to train the model for further prediction. In the fourth step, the model is evaluated for its accuracy, sensitivity, specificity, and MCC. In the fifth step, the webserver is developed.

Figure 2: The workflow of the proposed model, comprising five steps: data collection and preprocessing, feature extraction, training, model evaluation, and construction of the webserver.

2.1. Dataset Collection. The dataset was collected from the UniProt database using meticulously designed search parameters. UniProt is the Universal Protein Resource, which contains extensive information about protein sequences and their biological functions [22]. A dataset containing positive samples was composed for both angiogenesis and tumor angiogenesis using the UniProt keyword "Angiogenesis." Similarly, negative samples were also collected. UniProt has no keyword for "tumor angiogenesis" proteins; nonetheless, they lie within the set of angiogenesis proteins, so tumor angiogenesis proteins were manually curated from the acquired dataset. Each sample within the dataset was manually analyzed for annotated proteomic properties and published evidence within the database to form a set of tumor angiogenesis proteins; ambiguous samples were left out. After the collection of data from UniProt, the CD-HIT suite (http://weizhong-lab.ucsd.edu/cdhit_suite/cgi-bin/index.cgi) was used to reduce the homology of data samples. Clustering of the angiogenesis and tumor angiogenesis datasets was performed by setting the sequence identity parameter at 60%. Ultimately, 761 positive and 2776 negative clusters were formed for the angiogenesis dataset; similarly, 256 positive and 448 negative clusters were formed for the tumor angiogenesis dataset. A representative sequence was selected from each cluster to form the final dataset.

A = A^+ \cup A^-    (1)

The above equation shows the benchmark dataset used in this work, where A^+ represents the positive data samples of angiogenesis proteins and A^- the negative ones. Also, the positive tumor angiogenesis samples are represented as T^+ and the negative tumor angiogenesis proteins as T^-, as shown in the equation below:

T = T^+ \cup T^-    (2)
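Before feature extraction, the cluster representatives are merged into the labeled benchmark sets of equations (1) and (2). The sketch below is a minimal illustration of that step; the FASTA file names are hypothetical placeholders, since the paper's cluster-representative files are not distributed with the article.

```python
# Minimal sketch of assembling the benchmark dataset A = A+ U A- after CD-HIT
# clustering. File names are hypothetical placeholders.

def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# A+ : angiogenesis cluster representatives, A- : non-angiogenesis representatives
positives = read_fasta("angiogenesis_representatives.fasta")      # hypothetical path
negatives = read_fasta("non_angiogenesis_representatives.fasta")  # hypothetical path

sequences = [seq for _, seq in positives] + [seq for _, seq in negatives]
labels = [1] * len(positives) + [0] * len(negatives)
print(len(sequences), "sequences,", sum(labels), "positive")
```

The same assembly applies to the tumor angiogenesis set T of equation (2), with the tumor-curated representatives as the positive class.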
2.2. Feature Extraction. A robust and efficient methodology for transforming biological sequences into a numerical notation suitable for a machine learning algorithm is the most pivotal concept in the design of such predictive models [31, 40]. This conversion must keep the original information or features of the sequence intact in some numerical form. For this purpose, each primary sequence within the collected data is converted into a fixed-size vector. A feature vector of static length is formed which represents a primary sequence and remains essentially invariant to the length of the sequence [41]. Incorporation of such a transformation model is ideal, as most state-of-the-art classifiers work with vectors [22, 42, 43]. A vector described in such a model may, however, lose information about the sequence pattern [44]. To address this problem, Chou's PseAAC was proposed, and it has been used by many scientists for the construction of genomic and proteomic prediction models and their applications [45, 46]. Later, this model was improved to provide a better correlation perspective among residues that reflects onto the feature coefficients.

Let P be a protein sequence of length L, represented as

P = R_1 R_2 R_3 \cdots R_{16} R_{17} R_{18} \cdots R_L    (3)

where R_i is an arbitrary residue of a polypeptide chain of length L.

Feature extraction yields a vector with numerous numerical coefficients. This transformation from a variable-length polypeptide chain into a fixed-length feature vector is illustrated in the following equation:

\Delta(P) = [\Psi_1\ \Psi_2 \cdots \Psi_u \cdots \Psi_\Omega]    (4)

where \Delta is the transformation function, \Psi_u is an arbitrary coefficient, and \Omega is the constant length of the feature vector [22, 31].

2.3. Statistical Moments. The proposed methodology builds on the use of statistical moments to form a numerical representation such that the obscured information within the primary structure of proteins stays intact. These moments form a succinct numerical form from which the original data can be reconstructed without any significant loss of information. Moments can be obtained up to several orders; each order provides a deeper perspective into specific aspects of the data such as positioning, eccentricity, skewness, and peculiarity [31]. Mathematicians and statisticians have devised many moment-generating coefficients based on well-defined distribution functions and polynomials [35, 44]. In the proposed work, Hahn moments, raw moments, and central moments are organized to form a feature set. The Hahn moment bears location- and scale-oriented variance and is calculated based on the Hahn polynomial. Central moments carry information regarding asymmetry, mean, and variance; they are computed about the centroid of the collective data, making them scale variant but location invariant. Raw moments, in turn, are scale and location variant and represent properties such as asymmetry, variance, and mean.

A matrix P' with m \times m dimensions is formulated for a two-dimensional residual protein representation, where m = \lceil \sqrt{L} \rceil:

P' =
\begin{bmatrix}
R_{11} & R_{12} & R_{13} & \cdots & R_{1m} \\
R_{21} & R_{22} & R_{23} & \cdots & R_{2m} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
R_{m1} & \cdots & \cdots & \cdots & R_{mm}
\end{bmatrix}    (5)

The vector P is transformed into the matrix P' by a simple mapping function explained in [47]. The primary sequence is fitted into a two-dimensional matrix so that it can be formulated into the Hahn polynomial, which is orthogonal. The same two-dimensional notation is used for deriving the raw and central moments. The Hahn moment is computed using the Hahn polynomial as given below:

H_n^{u,v}(r, N) = (N + u - 1)_n (N - 1)_n \sum_{i=0}^{n} (-1)^i \frac{(-n)_i (-r)_i (2N + v + u - n - 1)_i}{(N + u - 1)_i (N - 1)_i \, i!}    (6)

Central moments are computed using the equation given below:

\mu_{st} = \sum_{p=1}^{k} \sum_{q=1}^{k} (p - \bar{x})^s (q - \bar{y})^t P_{pq}    (7)

The following equation is used to compute the raw moments:

M_{st} = \sum_{p=1}^{k} \sum_{q=1}^{k} p^s q^t P_{pq}    (8)

In equations (7) and (8), s and t represent the order of the moments. The orthogonality of these moments makes their use assiduous, as their inverse functions can be used to reconstruct the data. A detailed explanation and use of these notations can be found in [48].
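The sketch below illustrates the two-dimensional mapping of equation (5) and the raw and central moments of equations (7) and (8). The integer residue encoding is an illustrative assumption (the paper does not specify the exact numerical coding), and the Hahn moments of equation (6) are omitted for brevity.

```python
# Sketch: fit a primary sequence into an m x m matrix and compute raw/central moments.
import math
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
CODE = {aa: idx + 1 for idx, aa in enumerate(AA)}  # assumed encoding: A=1 ... Y=20

def sequence_matrix(seq):
    """Equation (5): m = ceil(sqrt(L)), remaining cells zero padded."""
    m = math.ceil(math.sqrt(len(seq)))
    flat = np.zeros(m * m)
    flat[:len(seq)] = [CODE.get(r, 0) for r in seq]
    return flat.reshape(m, m)

def raw_and_central_moments(P, order=3):
    """Raw moments M_st (eq. 8) and central moments mu_st (eq. 7) for 0 <= s, t <= order."""
    k = P.shape[0]
    p = np.arange(1, k + 1)
    q = np.arange(1, k + 1)
    total = P.sum()
    xbar = (p[:, None] * P).sum() / total      # centroid coordinates
    ybar = (q[None, :] * P).sum() / total
    M, mu = {}, {}
    for s in range(order + 1):
        for t in range(order + 1):
            M[(s, t)] = ((p[:, None] ** s) * (q[None, :] ** t) * P).sum()
            mu[(s, t)] = (((p[:, None] - xbar) ** s) *
                          ((q[None, :] - ybar) ** t) * P).sum()
    return M, mu

M, mu = raw_and_central_moments(sequence_matrix("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
print(M[(0, 0)], mu[(2, 0)], mu[(0, 2)])
```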
2.4. Frequency Vector Determination. The cumulative frequency of occurrence of each specific amino acid residue is furnished into a frequency vector. Information about the distribution of amino acid residues within the primary sequence is summarized in this frequency vector, which is represented as

FV = [f_1, f_2, f_3, \ldots, f_{20}]    (9)

where f_i refers to the frequency of occurrence of an arbitrary distinct amino acid residue.

2.5. Position Relative Incidence Matrix (PRIM) Calculation. The primary sequence of a protein forms the basis for formulating feature vectors of primary structures that are otherwise obscure. Information pertaining to the position-relative incidence of arbitrary protein residues is formulated as a matrix of size 20 \times 20. The Position Relative Incidence Matrix (PRIM) is illustrated as

X_{PRIM} =
\begin{bmatrix}
X_{1,1} & X_{1,2} & \cdots & X_{1,j} & \cdots & X_{1,20} \\
X_{2,1} & X_{2,2} & \cdots & X_{2,j} & \cdots & X_{2,20} \\
\vdots & \vdots & & \vdots & & \vdots \\
X_{i,1} & X_{i,2} & \cdots & X_{i,j} & \cdots & X_{i,20} \\
\vdots & \vdots & & \vdots & & \vdots \\
X_{N,1} & X_{N,2} & \cdots & X_{N,j} & \cdots & X_{N,20}
\end{bmatrix}    (10)

The element X_{i,j} is the sum of the relative positions of the jth residue with respect to the first occurrence of the ith residue. The matrix contains all possible permutations of such occurrences, as explained in [48].

2.6. Determination of Reverse Position Relative Incidence Matrix (RPRIM). More obscure features of the primary sequence are uncovered with the help of the Reverse Position Relative Incidence Matrix (RPRIM), which is obtained by forming the PRIM of the reversed primary sequence. X_{RPRIM} is illustrated as

X_{RPRIM} =
\begin{bmatrix}
R_{1,1} & R_{1,2} & \cdots & R_{1,j} & \cdots & R_{1,20} \\
R_{2,1} & R_{2,2} & \cdots & R_{2,j} & \cdots & R_{2,20} \\
\vdots & \vdots & & \vdots & & \vdots \\
R_{i,1} & R_{i,2} & \cdots & R_{i,j} & \cdots & R_{i,20} \\
\vdots & \vdots & & \vdots & & \vdots \\
R_{N,1} & R_{N,2} & \cdots & R_{N,j} & \cdots & R_{N,20}
\end{bmatrix}    (11)

where R_{i,j} is an arbitrary element of X_{RPRIM}.

2.7. Accumulative Absolute Position Incidence Vector (AAPIV) Calculation. The AAPIV accumulates, for each native amino acid, the sum of all the positions at which it occurs within the primary sequence; hence, it has a length of 20 and is denoted as

AAPIV = [\nu_1, \nu_2, \nu_3, \ldots, \nu_{20}]    (12)

Any ith element of the above vector is computed as

\nu_i = \sum_{k=1}^{n} P_k    (13)

where P_k is the position of an occurrence of the native amino acid and n is its frequency of occurrence.

All the above-defined features are aggregated to form a feature vector. The dimensionality of P', X_{PRIM}, and X_{RPRIM} is reduced by computing their Hahn, central, and raw moments. Ultimately, a fixed-size feature vector is formed to represent primary structures of varied lengths.
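The following sketch computes the frequency vector (equation (9)), AAPIV (equations (12)-(13)), PRIM, and RPRIM from a primary sequence. It follows the textual definitions above; the reading of the PRIM element X_{i,j} as a sum of positions measured relative to the first occurrence of residue i is one simple interpretation, not the authors' reference implementation.

```python
# Illustrative feature components computed directly from a primary sequence.
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AA)}

def frequency_vector(seq):
    """f_i = number of occurrences of each of the 20 native residues (eq. 9)."""
    fv = np.zeros(20, dtype=int)
    for r in seq:
        if r in INDEX:
            fv[INDEX[r]] += 1
    return fv

def aapiv(seq):
    """v_i = sum of the (1-based) positions at which residue i occurs (eqs. 12-13)."""
    v = np.zeros(20, dtype=int)
    for pos, r in enumerate(seq, start=1):
        if r in INDEX:
            v[INDEX[r]] += pos
    return v

def prim(seq):
    """X[i, j] = sum of positions of residue j relative to the first occurrence of
    residue i (one reading of the PRIM definition, eq. 10)."""
    X = np.zeros((20, 20), dtype=int)
    first = {}
    for pos, r in enumerate(seq, start=1):
        if r in INDEX and r not in first:
            first[r] = pos
    for i_res, i_pos in first.items():
        for pos, r in enumerate(seq, start=1):
            if r in INDEX:
                X[INDEX[i_res], INDEX[r]] += pos - i_pos
    return X

def rprim(seq):
    """RPRIM is the PRIM of the reversed sequence (eq. 11)."""
    return prim(seq[::-1])

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print(frequency_vector(seq)[:5], aapiv(seq)[:5], prim(seq).shape, rprim(seq).shape)
```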
3. Prediction Algorithm

After extraction of feature vectors from the positive as well as the negative sequences, the data is used to train classifiers. A diverse set of currently widespread classifiers was used for this purpose, including random forest, neural network, and support vector machine. Comparing the results yielded by each classifier enables identification of the most suitable classifier with the highest accuracy.

3.1. Random Forest. The random forest (RF) classifier was trained at two levels for the prediction of angiogenesis and tumor angiogenesis proteins. At the first level, the classifier was used to separate angiogenesis from nonangiogenesis proteins, while at the second level an identified angiogenesis protein was passed through another classifier to determine whether it is tumor causing or not. The random forest is a very powerful classifier used for classification and regression problems [49, 50]. Initially, it converts the whole data into decision trees [23, 51]; each tree then predicts a class, and the class with the highest number of votes becomes the model's prediction result [41], as illustrated in Figure 3.

Figure 3: Random forest classifier architecture.
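A minimal sketch of the two-level cascade using scikit-learn's RandomForestClassifier is given below. The arrays X_angio, y_angio, X_tumor, and y_tumor stand for the extracted feature matrices and labels of the two benchmark datasets; random dummy data is substituted here so the sketch runs stand-alone, and the hyperparameters are illustrative assumptions.

```python
# Two-level cascade: level 1 decides angiogenesis vs. non-angiogenesis,
# level 2 decides tumor vs. non-tumor angiogenesis.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_angio, y_angio = rng.random((200, 60)), rng.integers(0, 2, 200)   # dummy stand-ins
X_tumor, y_tumor = rng.random((80, 60)), rng.integers(0, 2, 80)     # dummy stand-ins

level1 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_angio, y_angio)
level2 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tumor, y_tumor)

def predict_two_level(x):
    """Classify one feature vector through both layers of the cascade."""
    if level1.predict(x.reshape(1, -1))[0] == 0:
        return "non-angiogenesis"
    if level2.predict(x.reshape(1, -1))[0] == 0:
        return "angiogenesis (non-tumor)"
    return "tumor angiogenesis"

print(predict_two_level(X_angio[0]))
```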
3.2. Artificial Neural Network (ANN). Subsequently, the artificial neural network (ANN) was employed in the same two-level manner. An ANN consists of interconnected layers of neurons [52]. The connectionist architecture of the backpropagation network is illustrated in Figure 4. The ANN mechanism used here is based on a feedforward network and uses the backpropagation algorithm to reduce the error. An input layer is clamped to the input feature vectors. A hidden layer receives a selected number of neurons from the input layer and forms the main processing unit of the whole network. The activation unit of the ANN sums all preceding weighted inputs in addition to bias values [23, 31]. The output of the 3-layer feedforward network with error backpropagation is represented by

O_m = f\left( \sum_{y=1}^{h} W_{ym} \times f\left( \sum_{x=1}^{k} W_{xy} I_x \right) \right)    (14)

where the input layer has k neurons and the hidden layer has h neurons. The partial output calculated by the mth output neuron is denoted by O_m. Supposing that an arbitrary node receives an input I_x, W_{xy} represents the weight of the edge connecting node x to node y; similarly, W_{ym} represents the weight connecting the yth hidden node to an arbitrary output neuron m. The classical sigmoid function that determines the activation of neurons is denoted as f in

f(x) = \frac{1}{1 + e^{-x}}    (15)

Figure 4: Architecture of ANN.

The actual activation levels in the output units are compared with the target output for every training iteration. The error rate hence observed is denoted by \epsilon and is calculated from the difference between the expected output and the actual activated output:

\epsilon = 0.5 \sum_{i=1}^{o} (O_i - P_i)^2    (16)

where O_i is the target output, P_i is the actual output calculated by the network, and o is the number of neurons in the output layer. The gradient descent method is used to minimize the error rate: the error generated at the output layer is sent back towards the input layer. The set of all the weights is represented by a vector V. The backpropagation procedure selects a differential \Delta V such that it lessens the error. This is continued iteratively until convergence is achieved, as shown below:

V(t + 1) = V(t) + \Delta V(t)    (17)

where

\Delta V = \eta \left( -\frac{\partial \epsilon}{\partial W} \right) \bigg|_{V = V(t)}    (18)

This equation shows the change in weights at time t + 1, and the positive constant \eta signifies the learning rate, usually set between 0 and 1. The change in an individual weight is expressed as

\Delta V_{u,v} = -\eta \frac{\partial \epsilon}{\partial W_{u,v}}    (19)

Here, \Delta V_{u,v} is the weight update for the connection between the uth and vth neurons in a given iteration. This procedure is followed in both the backward and forward passes of the input signals. It is a lightweight procedure that consumes little memory, and it is extensively used for the training of ANNs. Patterns are repeatedly presented to the network to train it and to make it capable of minimizing the mean square error (MSE), as shown in

MSE = \frac{1}{2n} \sum_{j=1}^{n} \sum_{i=1}^{k} \left( P_i^o - O_i^o \right)^2    (20)

The actual output received at the ith neuron of the output layer is represented as O_i^o, and P_i^o represents the expected value, where the total number of input samples is n and there are k output neurons.

3.3. Support Vector Machine (SVM). A support vector machine (SVM) is a machine learning classifier used in classification and regression problems. SVM works by attempting to fit a hyperplane in an N-dimensional space, where N is the number of feature elements that represent the samples distinctly. Hyperplanes are decision boundaries that classify the data points lying on either side of them, ideally partitioning the different classes. The hyperplane is optimally adjusted by means of support vectors. Figure 5 illustrates points on either side of the hyperplane belonging to two different classes, namely class A and class B.

Figure 5: Architectural diagram of SVM.
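The three classifiers of this section map naturally onto scikit-learn estimators, which is the framework the study reports using. The hyperparameters below are illustrative assumptions rather than the values used in the paper; dummy data is used so the sketch runs on its own.

```python
# Sketch: the three classifiers as scikit-learn estimators.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

classifiers = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    # logistic (sigmoid) hidden units trained with SGD roughly mirror the
    # backpropagation network of Section 3.2
    "ANN": make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(64,), activation="logistic",
                                       solver="sgd", learning_rate_init=0.01,
                                       max_iter=1000, random_state=0)),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=0)),
}

rng = np.random.default_rng(1)
X, y = rng.random((150, 60)), rng.integers(0, 2, 150)   # dummy feature matrix and labels
for name, clf in classifiers.items():
    clf.fit(X, y)
    print(name, round(clf.score(X, y), 3))
```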
4. Results and Discussion

4.1. Evaluation of the Model. In the current study, the dataset was constructed on two levels. The first level uses 785 positive and 2776 negative samples of angiogenesis proteins, whereas the second level encompasses 256 positive and 448 negative samples of tumor angiogenesis proteins. A feature vector input matrix (FIM) was formed for the angiogenesis and tumor angiogenesis datasets separately. Every row of the FIM is a feature vector that represents a single data sample. An Expected Output Matrix (EOM) was also formed corresponding to the FIM. All the classifiers were trained using both FIM and EOM: the FIM was given as input for training the model, while the EOM was used to compute errors and retrain until convergence was achieved [23, 31, 43, 45].

All the classifiers were implemented in Python 3.6 using the scikit-learn API. The results gathered using this framework were then rigorously analyzed in terms of their performance parameters.

A major issue in the design of a new prediction model is setting up parameters to measure its accuracy. Researchers have predominantly used four descriptive metrics for performance analysis:

(1) Sp measures the specificity, which quantifies the ability of the model to identify negative samples accurately [46]
(2) Sn measures the sensitivity, which represents the accuracy in predicting positive data samples
(3) Acc measures the overall accuracy of the model
(4) MCC measures the stability of the model

The following formulations are used to quantify these metrics:

Specificity (Sp) = \frac{TN}{TN + FP}    (21)

Sensitivity (Sn) = \frac{TP}{TP + FN}    (22)

Accuracy (Acc) = \frac{TP + TN}{TP + FP + TN + FN} \times 100    (23)

MCC = \frac{(TP)(TN) - (FP)(FN)}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}    (24)

where true negatives are represented by TN, true positives by TP, false positives by FP, and false negatives by FN [43, 53, 54]. Unfortunately, the formulation of equations (21)-(24) is somewhat cryptic for biologists [55]. Another, more intuitive format has been suggested by scientists in [56, 57], and its modifiers were introduced in [47]. The symbols used to represent these equations are N^+, N_-^+, N^-, and N_+^-; their explanation is given in Table 1.

Table 1: New symbol description for Chou's fourth step.
Symbol | Explanation
N^+    | The total number of true positive samples in the dataset
N_-^+  | The number of positive samples in the dataset projected incorrectly
N^-    | The total number of true negative samples in the dataset
N_+^-  | The number of negative samples projected incorrectly

Hence, these metrics can also be calculated as

Sn = 1 - \frac{N_-^+}{N^+},\quad
Sp = 1 - \frac{N_+^-}{N^-},\quad
Acc = 1 - \frac{N_-^+ + N_+^-}{N^+ + N^-},\quad
MCC = \frac{1 - \left( N_-^+/N^+ + N_+^-/N^- \right)}{\sqrt{\left(1 + (N_+^- - N_-^+)/N^+\right)\left(1 + (N_-^+ - N_+^-)/N^-\right)}}    (25)
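As a worked illustration, the sketch below is a direct translation of equations (21)-(24); the confusion-matrix counts in the example are the RF self-consistency figures for angiogenesis reported in Table 2.

```python
# Compute Sp, Sn, Acc and MCC from a confusion matrix (equations 21-24).
import math

def metrics(tp, fp, tn, fn):
    sp = tn / (tn + fp)                                   # specificity, eq. (21)
    sn = tp / (tp + fn)                                   # sensitivity, eq. (22)
    acc = (tp + tn) / (tp + fp + tn + fn) * 100           # accuracy,    eq. (23)
    mcc = (tp * tn - fp * fn) / math.sqrt(                # MCC,         eq. (24)
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return sp, sn, acc, mcc

print(metrics(tp=783, fp=0, tn=2784, fn=0))   # -> (1.0, 1.0, 100.0, 1.0)
```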
4.2. Validation Methods. Testing is another important factor for the validation of prediction models [22, 31, 42, 45]. The validation phase encompasses the four most commonly used tests, discussed below.

4.2.1. Self-Consistency. The self-consistency test is the most trivial and intuitive of the tests: a trained model is simply tested on the dataset that was used to train it. The capability of a model to learn from a given dataset is underscored by this basic but useful evaluation benchmark; good results merely indicate that the classifier has the ability to find obscure patterns within the training data. Self-consistency testing was performed on the angiogenesis and tumor angiogenesis datasets upon which the proposed model was trained. The results obtained from the self-consistency tests are illustrated in Table 2, showing the overall performance of the proposed model using the random forest (RF), artificial neural network (ANN), and support vector machine (SVM) classifiers. The results indicate that the random forest classifier has the best capability to learn and decipher the obscure patterns that peculiarly characterize each sample.

Table 2: Self-consistency results for angiogenesis and tumor angiogenesis.
Predictor | Angiogenesis: TP, FP, TN, FN, Acc (%), Sp (%), Sn (%), MCC | Tumor angiogenesis: TP, FP, TN, FN, Acc (%), Sp (%), Sn (%), MCC
RF  | 783, 0, 2784, 0, 100, 100, 100, 1 | 255, 1, 447, 1, 99.7, 99.6, 99.8, 0.9
ANN | 766, 7, 2580, 204, 94.1, 99.1, 92.7, 0.9 | 256, 0, 307, 141, 79.9, 100, 68.5, 0.6
SVM | 31, 752, 2783, 1, 78.9, 4, 100, 0.2 | 12, 244, 447, 1, 65.2, 4.7, 99.8, 0.2

4.2.2. Cross-Validation. The cross-validation technique is used when unknown data for testing is not readily available [45, 58]. The dataset is randomly divided into multiple partitions, or folds, spanning a comprehensive sample space, which renders cross-validation a rigorous test. The partitions are devised so that they are disjoint from each other and comparable in size. One partition is left out while the model is trained on the rest of the data. Once the model is fully trained, the left-out partition is used as unknown data to test the model. These steps are repeated for each fold, and the overall accuracy of the model for the cross-validation test is reported as the mean of the accuracy yielded against each fold. Cross-validation tests were performed by partitioning the benchmark dataset into 5 folds and 10 folds; Table 3 depicts the results. The random forest exhibits the best results at both levels, with an accuracy of 99.7% for the identification of angiogenesis proteins and an accuracy of 99.5% for the identification of tumor angiogenesis proteins.

Table 3: k-fold cross-validation results.
Fold | Predictor | Level 1: TP, FP, TN, FN, Acc (%), Sn (%), Sp (%), MCC | Level 2: TP, FP, TN, FN, Acc (%), Sn (%), Sp (%), MCC
5  | RF  | 723, 60, 2784, 0, 98.1, 92.3, 100, 0.95 | 254, 2, 448, 0, 99.7, 99.2, 100, 0.9
5  | ANN | 653, 130, 2780, 4, 96.2, 83.4, 99.9, 0.8 | 246, 10, 428, 20, 95.7, 96.1, 95.7, 0.9
5  | SVM | 31, 752, 2783, 1, 78.8, 4, 100, 0.2 | 6, 250, 448, 0, 64.5, 2.3, 100, 0.1
10 | RF  | 706, 77, 2784, 0, 97.8, 99.4, 100, 0.9 | 253, 3, 448, 0, 99.5, 98.8, 100, 0.9
10 | ANN | 776, 7, 2580, 240, 94.1, 99.1, 92.7, 0.8 | 256, 0, 307, 141, 79.9, 100, 68.5, 0.7
10 | SVM | 31, 752, 2783, 1, 78.8, 4, 100, 0.2 | 12, 244, 447, 1, 65.19, 4.7, 99.8, 0.2
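The k-fold protocol above can be reproduced with scikit-learn's stratified splitter, as in the following sketch; X and y are placeholders for the extracted benchmark features and labels, and dummy data is substituted so the code runs on its own.

```python
# Sketch of 5-fold and 10-fold cross-validation (Section 4.2.2).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(3)
X, y = rng.random((300, 60)), rng.integers(0, 2, 300)   # dummy stand-ins

clf = RandomForestClassifier(n_estimators=100, random_state=0)
for k in (5, 10):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    print(f"{k}-fold mean accuracy: {scores.mean():.3f}")
```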
4.2.3. Jackknife Testing. Jackknife testing is the most rigorous testing methodology. In each iteration, a single sample is left out while the model is trained on the rest; after sufficient training, the model is tested on the left-out sample. This process proceeds exhaustively for all data samples, so the test is repeated N times, where N is the size of the overall dataset. In every iteration the testing sample is different, and every sample is therefore tested exactly once. This exhaustiveness makes the technique the most rigorous, but also the slowest [59–63]. After successful training and testing, the numbers of true positives, false positives, true negatives, and false negatives were obtained [55]. Since each sample is tested exactly once, the overall accuracy obtained for this test is unique [31, 40, 45, 46]. The RF results for angiogenesis and tumor angiogenesis proteins, illustrated in Table 4, show the highest accuracies in comparison with the other classifiers, reported as 99.3% and 99.7%, respectively.

Table 4: Jackknife results.
Model | Angiogenesis: TP, FP, TN, FN, Acc (%), Sn (%), Sp (%), MCC | Tumor angiogenesis: TP, FP, TN, FN, Acc (%), Sp (%), Sn (%), MCC
RF  | 781, 26, 2784, 0, 99.3, 100, 100, 1 | 255, 1, 447, 1, 99.7, 99.6, 99.8, 0.9
ANN | 653, 130, 2780, 4, 96.3, 83.3, 99.9, 0.8 | 246, 10, 428, 20, 95.7, 96.1, 95.5, 0.9
SVM | 783, 0, 2784, 0, 100, 100, 100, 1 | 6, 250, 448, 0, 64.5, 2.3, 100, 0.1

4.2.4. Independent Set Testing. The independent test evaluates how well a model performs on unknown data. Initially, the data is partitioned such that the larger partition is used for training and the left-out partition is used as unknown data for testing. Once the model is completely trained, independent set testing is performed using the left-out data. The independent set needs to be formulated intelligently, so that the training data encompasses comprehensive obscure patterns and the test data thoroughly queries the ability of the model to decipher these patterns; otherwise, the testing results may be ambiguous. The overall accuracies of the RF, ANN, and SVM classifiers after independent testing are presented in Table 5. The random forest shows the best results compared with the ANN and SVM classifiers at both levels, for the identification of angiogenesis as well as tumor angiogenesis proteins, while the performance of the ANN classifier is better than that of the SVM classifier.

Table 5: Independent set results.
Model | Angiogenesis: TP, FP, TN, FN, Acc (%), Sn (%), Sp (%), MCC | Tumor angiogenesis: TP, FP, TN, FN, Acc (%), Sp (%), Sn (%), MCC
RF  | 211, 27, 833, 0, 94.5, 88.7, 100, 0.9 | 70, 0, 142, 0, 100, 100, 100, 1
ANN | 227, 14, 827, 3, 98.4, 94.2, 99.6, 0.9 | 59, 12, 141, 0, 94.3, 83.1, 100, 0.9
SVM | 3, 238, 833, 7, 77.2, 1.2, 99.2, 0.02 | 5, 66, 131, 10, 64.2, 7.0, 92.9, 0.01
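The jackknife and independent-set protocols of Sections 4.2.3 and 4.2.4 can be sketched as follows; X and y are placeholders, the split proportion is an illustrative assumption, and the dataset is deliberately small because leave-one-out retrains the model once per sample.

```python
# Sketch of jackknife (leave-one-out) and independent-set testing.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, train_test_split

rng = np.random.default_rng(4)
X, y = rng.random((100, 30)), rng.integers(0, 2, 100)   # dummy stand-ins

# Jackknife: every sample is held out exactly once (slow on large datasets).
correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])
print("jackknife accuracy:", correct / len(y))

# Independent set: hold out a fraction of the data entirely for final testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
ind = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print("independent-set accuracy:", ind.score(X_te, y_te))
```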
Working with classification models renders performance measurement an essential task, usually quantified using classification scores. This type of measure is, however, not suitable when dealing with flawed datasets with heavy class imbalance. In such cases, ROC (Receiver Operating Characteristic) curves provide a graphical view along with a quantitative analysis of the overall scenario. ROC is a prevalently used method for evaluating any classification model. The ROC curve is plotted by mapping the True Positive Rate (TPR) against the False Positive Rate (FPR): TPR is plotted along the y-axis, while FPR is plotted along the x-axis. The curve depicts the accuracy with which the model is capable of distinguishing among classes. The estimated area under the curve is a measure of the model's performance: the best possible value is 1, and the worst is 0.5. A good measure of separability means that the model has an area near 1, while an area near 0 indicates the worst measure of separability, and an area of less than 0.5 indicates that the model performs exactly the opposite of what it was intended to do.

Various testing techniques were applied to gauge the effectiveness of the classifiers, as discussed earlier. To prioritize the classifiers based on efficiency, a comparison is depicted through ROC curves. Figures 6–10 present the comparison based on the testing performed in the previous section for the first-level (angiogenesis) classifiers.

Figure 6: Comparison based on self-consistency (ROC curves of SVM, ANN, and random forest on the angiogenesis dataset).
Figure 7: Comparison through 5-fold cross-validation (ROC curves on the angiogenesis dataset).
Figure 8: Comparison based on 10-fold cross-validation (ROC curves on the angiogenesis dataset).
Figure 9: Jackknife testing comparison (ROC curves on the angiogenesis dataset).
Figure 10: Independent testing comparison (ROC curves on the angiogenesis dataset).

Figures 6–10 depict that RF shows the best results in comparison with ANN and SVM. The RF curve encompasses an area close to 1, implying that the model has the best measure of separability. The graphical representations accentuate that RF and ANN both exhibit better results than SVM; however, in the case of jackknife testing, the SVM classifier accuracy is higher than that of ANN, as illustrated in Figure 10.

A similar comparison was performed for the classifiers at the second level, which predicts tumor angiogenesis proteins. Figures 11–15 illustrate the results of the various test techniques performed on the tumor angiogenesis dataset.

Figure 11: Comparison based on self-consistency (ROC curves on the tumor angiogenesis dataset).
Figure 12: Comparison based on 5-fold cross-validation (ROC curves on the tumor angiogenesis dataset).
Figure 13: Comparison based on 10-fold cross-validation (ROC curves on the tumor angiogenesis dataset).
Figure 14: Comparison of jackknife testing (ROC curves on the tumor angiogenesis dataset).
Figure 15: Comparison based on independent testing (ROC curves on the tumor angiogenesis dataset).

These figures connote that the RF classifier exhibits better results in comparison with the ANN and SVM classifiers, supported by the fact that the area under the RF curve approximately approaches 1.
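A ROC comparison of the kind shown in Figures 6–15 can be produced as in the following sketch. The labels and scores here are synthetic placeholders; in practice y_true would be the test labels and y_score the per-class decision scores of each trained classifier.

```python
# Sketch: plot TPR against FPR and report the area under the curve per classifier.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(5)
y_true = rng.integers(0, 2, 500)
scores = {"RF": y_true * 0.6 + rng.random(500) * 0.4,   # synthetic, well separated
          "SVM": rng.random(500)}                        # synthetic, near chance

for name, y_score in scores.items():
    fpr, tpr, _ = roc_curve(y_true, y_score)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--")                 # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```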
5. Webserver

The formulation of a robust dataset and feature extraction methodology forms the foundation of a computationally intelligent model for efficient prediction of uncategorized proteomic sequences. However, the availability of such a tool is also of extreme importance so that the research community can benefit from it [45]. To make a novel predictor accessible to all users and biologists around the globe, there is a need for a user-friendly and publicly accessible webserver; in the final step of Chou's 5-step rule, a webserver is devised for this purpose [48]. The webserver enables scientists and biologists to easily access and utilize such prediction applications without getting into the complex mathematical details. The webserver for the proposed work will soon be made available. Meanwhile, its code has been made available along with a readme file at https://github.com/RabiaKhan-94/Thesis_WebServer.git, which can be easily set up by an intermediate-level Python developer.

6. Discussion and Conclusion

This study proposes a prediction model for the classification of angiogenesis and tumor angiogenesis proteins. A robust, well-defined methodology was adopted for dataset collection: duplicate and redundant data were removed, and sequences with homology above 60% were excluded. Variable-length proteomic sequences were transformed into fixed-length feature vectors using a position- and composition-based technique, and position-relative information was further transmuted into a succinct form using statistical moments. Three classifiers, random forest (RF), artificial neural network (ANN), and support vector machine (SVM), were used to find the best results. All of these algorithms are powerful, robust, and well understood, and both RF and ANN can deal with linear as well as complicated nonlinear problems. The current study reveals that RF showed the best results among these classification approaches. As a result of cross-validation, RF exhibited an accuracy of 97.8% for angiogenesis proteins and 99.5% for tumor angiogenesis, whereas ANN showed an accuracy of 99.1% for angiogenesis and 79.9% for tumor angiogenesis. Additionally, the accuracy of SVM was 78.8% for angiogenesis and 65.19% for tumor angiogenesis. The current study has shown different performances for all approaches and concludes that the results exhibited by RF are better than those of ANN and SVM. Moreover, the random forest takes less time for training than the neural network, and another important strength of RF is that it is less susceptible to overfitting, which is not the case with a neural network. The robustness of the feature extraction technique plays a significant role in the overall accuracy of the model: feature extraction uncovers obscure features pertinent to the composition and sequence of the primary structures, and the meticulously collected data helps the model to produce better results. The in silico nature of the model makes it an alluring opportunity, as it is timely and cost-effective. Biologists and scientists can greatly benefit from the proposed tool for the characterization of proteins and for understanding their role in angiogenesis and tumor angiogenesis processes. Furthermore, the model can prove to be effective in identifying the biomarkers that cause a tumor, and it augments the work of biologists and scientists in research aimed at finding new treatments and discovering new drugs.

Tumor-causing angiogenesis proteins are important biomarkers for the onset of cancer, and timely identification of these proteins can help in the treatment and possible cure of the disease. This study proposes a robust in silico technique for the identification of tumor angiogenesis using a two-level predictor: the first level indicates whether a protein is an angiogenesis protein or not, while the second level identifies whether the given protein is responsible for tumor angiogenesis or not. A mature feature extraction technique was used to gather features for the benchmark dataset, and classifiers like RF, SVM, and ANN were trained using the resultant feature vectors. Once thoroughly trained, the models were rigorously tested using methods such as k-fold cross-validation, self-consistency, independent set testing, and jackknife testing. The random forest classifier showed 99.3% accuracy for angiogenesis and 99.7% for tumor angiogenesis, ANN showed an overall 96.23% accuracy for angiogenesis and 95% for tumor angiogenesis, and SVM showed 78.65% accuracy for angiogenesis and 65.19% for tumor angiogenesis.

7. Future Works

Advanced drug therapies and treatments integrate the use of ligands that target tumor angiogenesis proteins to inhibit them.
Inhibition of these tumor growth factors disrupts tumor growth, and in some cases the tumor even dies out. Tools that help the discovery and identification of tumor angiogenesis proteins greatly help cancer researchers to identify these growth factors in a timely and cost-effective manner. Once such a tumor growth factor has been uncovered, there is an incessant need to identify ligands that can inhibit it. In silico models that simulate ligand binding with tumor growth factors can also greatly enhance tumor research. Further, in the future, the proposed model can be made more adaptive by incorporating updated data and using deep learning features.

Data Availability

The data is available at https://github.com/RabiaKhan-94/Angio_Webserver.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.

Acknowledgments

This project was funded by the Deanship of Scientific Research (DSR), King Abdulaziz University (https://www.kau.edu.sa/), Jeddah (under grant no. G:160-611-1441). The authors, therefore, acknowledge with thanks DSR technical and financial support.

References

[1] J. L. Blanco, A. B. Porto-Pazos, A. Pazos, and C. Fernandez-Lozano, "Prediction of high anti-angiogenic activity peptides in silico using a generalized linear model and feature selection," Scientific Reports, vol. 8, no. 1, pp. 1–11, 2018.
[2] J. Hardy, "Les petites brûlures," Soins, vol. 24, no. 6, pp. 3–5.
[3] H. Shen and X. Wei, "A qualitative analysis of a free boundary problem modeling tumor growth with angiogenesis," Nonlinear Analysis: Real World Applications, vol. 47, pp. 106–126.
[4] National Cancer Institute, "Angiogenesis inhibitors," 2019, https://www.cancer.gov/about-cancer/treatment/types/immunotherapy/angiogenesis-inhibitors-fact-sheet.
[5] V. Laengsri, C. Nantasenamat, N. Schaduangrat, P. Nuchnoi, V. Prachayasittikul, and W. Shoombuatong, "TargetAntiAngio: a sequence-based tool for the prediction and analysis of anti-angiogenic peptides," International Journal of Molecular Sciences, vol. 20, no. 12, p. 2950, 2019.
[6] D. J. Bharali, M. Rajabi, and S. A. Mousa, "Application of nanotechnology to target tumor angiogenesis in cancer therapeutics," in Angiogenesis Strategies in Cancer Therapeutics, Elsevier Inc., 2016.
[7] W. Liang, Y. Zheng, J. Zhang, and X. Sun, "Multiscale modeling reveals angiogenesis-induced drug resistance in brain tumors and predicts a synergistic drug combination targeting EGFR and VEGFR pathways," BMC Bioinformatics, vol. 20, Suppl. 7, 2019.
[8] F. Bray, J. Ferlay, I. Soerjomataram, R. L. Siegel, L. A. Torre, and A. Jemal, "Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries," CA: A Cancer Journal for Clinicians, vol. 68, no. 6, pp. 394–424, 2018.
[9] CancerQuest, "Angiogenesis," 2019.
[10] Y. Feng, Y. Dai, Z. Gong et al., "Association between angiogenesis and cytotoxic signatures in the tumor microenvironment of gastric cancer," OncoTargets and Therapy, vol. 11, pp. 2725–2733, 2018.
[11] R. K. Jain, E. di Tomaso, D. G. Duda, J. S. Loeffler, A. G. Sorensen, and T. T. Batchelor, "Angiogenesis in brain tumours," Nature Reviews Neuroscience, vol. 8, no. 8, pp. 610–622, 2007.
[12] Cancer Research UK, "Worldwide cancer statistics," 2018.
[13] T. A. Elbayoumi and V. P. Torchilin, "Tumor-targeted nanomedicines: enhanced antitumor efficacy in vivo of doxorubicin-loaded, long-circulating liposomes modified with cancer-specific monoclonal antibody," Clinical Cancer Research, vol. 15, no. 6, pp. 1973–1980, 2009.
[14] C. Y. Huang, D. T. Ju, C. F. Chang, P. Muralidhar Reddy, and B. K. Velmurugan, "A review on the effects of current chemotherapy drugs and natural agents in treating non-small cell lung cancer," BioMedicine, vol. 7, no. 4, p. 23, 2017.
[15] S. Baritaki, S. Huerta-Yepez, T. Sakai, D. A. Spandidos, and B. Bonavida, "Chemotherapeutic drugs sensitize cancer cells to TRAIL-mediated apoptosis: up-regulation of DR5 and inhibition of Yin Yang 1," Molecular Cancer Therapeutics, vol. 6, no. 4, pp. 1387–1399, 2007.
[16] R. Baskar, K. A. Lee, R. Yeo, and K. W. Yeoh, "Cancer and radiation therapy: current advances and future directions," International Journal of Medical Sciences, vol. 9, no. 3, pp. 193–199, 2012.
[17] L. Zhang, M. Bochkur Dratver, T. Yazal et al., "Mebendazole potentiates radiation therapy in triple-negative breast cancer," International Journal of Radiation Oncology, Biology, Physics, vol. 103, no. 1, pp. 195–207, 2019.
[18] N. Utku, "New approaches to treat cancer - what they can and cannot do," Biotechnology Healthcare, vol. 8, no. 4, pp. 25–27.
[19] J. Blakeley, "Drug delivery to brain tumors," Current Neurology and Neuroscience Reports, vol. 8, no. 3, pp. 235–241, 2008.
[20] P. Mobadersany, S. Yousefi, M. Amgad et al., "Predicting cancer outcomes from histology and genomics using convolutional networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 115, no. 13, pp. E2970–E2979, 2018.
[21] S. P. S. Baker and A. Korhonen, Cancer Hallmark Text Classification Using ConvNets, BioTxtM, 2016.
[22] W. Hussain, Y. D. Khan, N. Rasool, S. A. Khan, and K. C. Chou, "SPrenylC-PseAAC: a sequence-based model developed via Chou's 5-steps rule and general PseAAC for identifying S-prenylation sites in proteins," Journal of Theoretical Biology, vol. 468, pp. 1–11, 2019.
[23] Y. D. Khan, N. Rasool, W. Hussain, S. A. Khan, and K. C. Chou, "iPhosY-PseAAC: identify phosphotyrosine sites by incorporating sequence statistical moments into PseAAC," Molecular Biology Reports, vol. 45, no. 6, pp. 2501–2509, 2018.
[24] S. Naseer, R. F. Ali, Y. D. Khan, and P. Dominic, "iGluK-Deep: computational identification of lysine glutarylation sites using deep neural networks with general pseudo amino acid compositions," Journal of Biomolecular Structure and Dynamics, pp. 1–14, 2021.
[25] M. K. Mahmood, A. Ehsan, Y. D. Khan, and K.-C. Chou, "iHyd-LysSite (EPSV): identifying hydroxylysine sites in protein using statistical formulation by extracting enhanced position and sequence variant feature technique," Current Genomics, vol. 21, pp. 536–545, 2020.
[26] S. Naseer, W. Hussain, Y. D. Khan, and N. Rasool, "Optimization of serine phosphorylation prediction in proteins by comparing human engineered features and deep representations," Analytical Biochemistry, vol. 615, article 114069, 2021.
[27] S. Naseer, W. Hussain, Y. D. Khan, and N. Rasool, "NPalmitoylDeep-PseAAC: a predictor of N-palmitoylation sites in proteins using deep representations of proteins and PseAAC via modified 5-steps rule," Current Bioinformatics, vol. 16, pp. 294–305, 2021.
[28] A. A. Shah and Y. D. Khan, "Identification of 4-carboxyglutamate residue sites based on position based statistical feature and multiple classification," Scientific Reports, vol. 10, pp. 1–10, 2020.
[29] M. S. Rahman, S. Shatabda, S. Saha, M. Kaykobad, and M. S. Rahman, "DPP-PseAAC: a DNA-binding protein prediction model using Chou's general PseAAC," Journal of Theoretical Biology, vol. 452, pp. 22–34, 2018.
[30] D. S. Cao, Q. S. Xu, and Y. Z. Liang, "propy: a tool to generate various modes of Chou's PseAAC," Bioinformatics, vol. 29, no. 7, pp. 960–962, 2013.
[31] P. Tripathi and P. N. Pandey, "A novel alignment-free method to classify protein folding types by combining spectral graph clustering with Chou's pseudo amino acid composition," Journal of Theoretical Biology, vol. 424, pp. 49–54, 2017.
[32] F. Javed and M. Hayat, "Predicting subcellular localization of multi-label proteins by incorporating the sequence features into Chou's PseAAC," Genomics, vol. 111, no. 6, pp. 1325–1332, 2019.
[33] L. Zhang and L. Kong, "iRSpot-ADPM: identify recombination spots by incorporating the associated dinucleotide product model into Chou's pseudo components," Journal of Theoretical Biology, vol. 441, pp. 1–8, 2018.
[34] C. Huang and J. Q. Yuan, "Predicting protein subchloroplast locations with both single and multiple sites via three different modes of Chou's pseudo amino acid compositions," Journal of Theoretical Biology, vol. 335, no. 22, pp. 205–212, 2013.
[35] K. C. Chou, "Some remarks on protein attribute prediction and pseudo amino acid composition," Journal of Theoretical Biology, vol. 273, no. 1, pp. 236–247, 2011.
[36] K. C. Chou, "Prediction of protein cellular attributes using pseudo-amino acid composition," Proteins: Structure, Function, and Genetics, vol. 43, no. 3, pp. 246–255, 2001.
[37] X. Fu, W. Zhu, B. Liao, L. Cai, L. Peng, and J. Yang, "Improved DNA-binding protein identification by incorporating evolutionary information into the Chou's PseAAC," IEEE Access, vol. 6, pp. 66545–66556, 2018.
[38] J. Jia, Z. Liu, X. Xiao, B. Liu, and K. C. Chou, "pSuc-Lys: predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach," Journal of Theoretical Biology, vol. 394, pp. 223–230, 2016.
[39] Y. D. Khan, F. Ahmed, and S. A. Khan, "Situation recognition using image moments and recurrent neural networks," Neural Computing and Applications, vol. 24, no. 7–8, pp. 1519–1529.
[40] M. A. Akmal, N. Rasool, and Y. D. Khan, "Prediction of N-linked glycosylation sites using position relative features and statistical moments," PLoS One, vol. 12, no. 8, pp. 1–21.
[41] H. Cao, S. Bernard, R. Sabourin, and L. Heutte, "Random forest dissimilarity based multi-view learning for radiomics application," Pattern Recognition, vol. 88, pp. 185–197, 2019.
[42] C. Kathuria, D. Mehrotra, and N. K. Misra, "Predicting the protein structure using random forest approach," Procedia Computer Science, vol. 132, pp. 1654–1662, 2018.
[43] M. Ballings, D. Van den Poel, N. Hespeels, and R. Gryp, "Evaluating multiple classifiers for stock price direction prediction," Expert Systems with Applications, vol. 42, no. 20, pp. 7046–7056, 2015.
[44] S. Muthusamy, L. P. Manickam, V. Murugesan, C. Muthukumaran, and A. Pugazhendhi, "Pectin extraction from Helianthus annuus (sunflower) heads using RSM and ANN modelling by a genetic algorithm approach," International Journal of Biological Macromolecules, vol. 124, pp. 750–758, 2019.
[45] L. Jiang, J. Zhang, P. Xuan, and Q. Zou, "BP neural network could help improve pre-miRNA identification in various species," BioMed Research International, vol. 2016, Article ID 9565689, 11 pages, 2016.
[46] A. S. Ettayapuram Ramaprasad, S. Singh, R. P. S. Gajendra, and S. Venkatesan, "AntiAngioPred: a server for prediction of anti-angiogenic peptides," PLoS One, vol. 10, no. 9, pp. 7–12, 2015.
[47] P. Sudha, D. Ramyachitra, and P. Manikandan, "Enhanced artificial neural network for protein fold recognition and structural class prediction," Gene Reports, vol. 12, pp. 261–275.
[48] J. Ahmad and M. Hayat, "MFSC: multi-voting based feature selection for classification of Golgi proteins by adopting the general form of Chou's PseAAC components," Journal of Theoretical Biology, vol. 463, pp. 99–109, 2019.
[49] A. H. Butt, N. Rasool, and Y. D. Khan, "Predicting membrane proteins and their types by extracting various sequence features into Chou's general PseAAC," Molecular Biology Reports, vol. 45, no. 6, pp. 2295–2306, 2018.
[50] Y. Xu, X. J. Shao, L. Y. Wu, N. Y. Deng, and K. C. Chou, "iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins," PeerJ, vol. 2013, no. 1, pp. 1–18, 2013.
[51] P. M. Feng, H. Ding, W. Chen, and H. Lin, "Naïve Bayes classifier with feature selection to identify phage virion proteins," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 530696, 6 pages, 2013.
[52] A. H. Butt, S. A. Khan, H. Jamil, N. Rasool, and Y. D. Khan, "A prediction model for membrane proteins using moments based features," BioMed Research International, vol. 2016, Article ID 8370132, 7 pages, 2016.
[53] X. Cui, Z. Yu, B. Yu, M. Wang, B. Tian, and Q. Ma, "UbiSitePred: a novel method for improving the accuracy of ubiquitination sites prediction by using LASSO to select the optimal Chou's pseudo components," Chemometrics and Intelligent Laboratory Systems, vol. 184, pp. 28–43, 2019.
[54] M. A. Akmal, W. Hussain, N. Rasool, Y. D. Khan, S. A. Khan, and K. C. Chou, "Using Chou's 5-steps rule to predict O-linked serine glycosylation sites by blending position relative features and statistical moment," IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2020.
[55] A. O. Almagrabi, Y. D. Khan, and S. A. Khan, "iPhosD-PseAAC: identification of phosphoaspartate sites in proteins using statistical moments and PseAAC," Biocell, vol. 45, no. 5, pp. 1287–1298, 2021.
[56] M. Awais, W. Hussain, N. Rasool, and Y. D. Khan, "iTSP-PseAAC: identifying tumor suppressor proteins by using fully connected neural network and PseAAC," Current Bioinformatics, vol. 16, no. 5, pp. 700–709, 2021.
[57] W. Hussain, N. Rasool, and Y. D. Khan, "A sequence-based predictor of Zika virus proteins developed by integration of PseAAC and statistical moments," Combinatorial Chemistry & High Throughput Screening, vol. 23, no. 8, pp. 797–804.
797–804, [44] S. Muthusamy, L. P. Manickam, V. Murugesan, [58] Y. D. Khan, E. Alzahrani, W. Alghamdi, and M. Z. Ullah, C. Muthukumaran, and A. Pugazhendhi, “Pectin extraction from Helianthus annuus (sunflower) heads using RSM and “Sequence-based identification of allergen proteins devel- oped by integration of PseAAC and statistical moments ANN modelling by a genetic algorithm approach,” Interna- tional Journal of Biological Macromolecules, vol. 124, via 5-step rule,” Current Bioinformatics, vol. 15, pp. 1046– 1055, 2020. pp. 750–758, 2019. [45] L. Jiang, J. Zhang, P. Xuan, and Q. Zou, “BP neural network [59] Y. D. Khan, N. S. Khan, S. Naseer, and A. H. Butt, “iSUMOK- could help improve pre-miRNA identification in various spe- PseAAC: prediction of lysine sumoylation sites using statistical cies,” BioMed Research International, vol. 2016, Article ID moments and Chou’s PseAAC,” PeerJ, vol. 9, article e11581, 9565689, 11 pages, 2016. 2021. Applied Bionics and Biomechanics 15 [60] S. J. Malebary, R. Khan, and Y. D. Khan, “ProtoPred: advanc- ing oncological research through identification of proto- oncogene proteins,” IEEE Access, vol. 9, pp. 68788–68797, [61] S. J. Malebary and Y. D. Khan, “Evaluating machine learning methodologies for identification of cancer driver genes,” Scien- tific Reports, vol. 11, no. 1, pp. 1–13, 2021. [62] N. Albugami, “Prediction of Saudi Arabia SARS-COV 2 diver- sifications in protein strain against China strain,” VAWKUM Transactions on Computer Sciences, vol. 8, no. 1, pp. 64–67, [63] S. J. Malebary and Y. Daanial Khan, “Identification of antimi- crobial peptides using Chou’s 5 step rule,” Computers, Mate- rials & Continua, vol. 67, no. 3, pp. 2863–2881, 2021. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Applied Bionics and Biomechanics Hindawi Publishing Corporation

But sometimes these signals can become unbalanced and increase blood vessel formation, which in turn causes abnormal growth or disease [2, 3]. Angiogenesis plays a vital role in the development and growth of cancer cells [4, 5]. Just like normal cells, tumor cells need oxygen and other nutrients to grow and expand, and these elements are delivered by the blood. Tumor cells therefore send chemical signals that stimulate the growth of new blood vessels.

A survey report published in 2015 by the World Health Organization (WHO) lists cancer as the first or second major cause of death before the age of 70 in 91 countries around the globe [7]. Furthermore, according to the 2018 cancer statistics report by the International Agency for Research on Cancer and Cancer Research UK, 9.6 million people around the world die of cancer [7, 11], and this figure is predicted to increase in the coming years.

Researchers, scientists, and biologists all around the world are searching for techniques to develop drugs and systems that fight this deadly disease [12]. Until now, many researchers have contributed their knowledge to develop systems for tumor prediction at different stages of its life cycle. Different strategies have been proposed to control the disease, such as chemotherapy [13, 14], radiation therapy [15, 16], surgery, bone marrow (cord blood) transplant, and vaccines [17]. Cancer can also attack the brain, the most crucial part of the human body; its delicate and complex structure makes it difficult to deliver drugs there, although approaches such as high-dose chemotherapy and blood-brain barrier disruption can be used [18]. Many tumor therapies revolve around the attempt to suppress the tumor angiogenesis process, and scientists have discovered many ligands that can bind to tumor angiogenesis proteins such that their function is inhibited. Hence, identification of angiogenesis and tumor angiogenesis proteins is crucial for finding novel and effective tumor therapies.

Formerly, several mathematical [3] and computational models have been developed for the classification or identification of various proteomic and genomic attributes [19]. The proposed work establishes a computational model, based on positional and compositional information of the primary sequence, that attempts to accurately identify angiogenesis and tumor angiogenesis proteins. Since tumor angiogenesis proteins are also characterized as angiogenesis proteins, the similarity of their obscure features can often lead to an ambiguous outcome. Ambiguity among seemingly similar angiogenesis and tumor angiogenesis proteins is resolved by a two-layer classification model: the initial layer distinguishes between angiogenesis and nonangiogenesis proteins, while the second layer deciphers whether a protein identified as an angiogenesis protein is tumor causing or not. The two-layered model helps alleviate ambiguity and yields more accurate results.

The rest of the paper is organized as follows. Section 2 illuminates the importance of angiogenesis uncovered in previous research and discusses the state-of-the-art models used for in silico identification of proteomic attributes. Section 3 discusses the methodology adopted for the proposed in silico identification model. Section 4 illustrates the accuracy of the model obtained through well-defined, rigorous testing methodologies. Section 5 provides a general discussion of the performance of the proposed model.

1.1. Current State of the Art. The crucial role of angiogenesis in tumor progression was first discovered by Judah Folkman in 1971 [20]. Angiogenesis is a crucial process of vascular system growth through the sprouting and splitting of blood vessels [21]. Tumor cells also require a constant flow of blood for their growth, for which they stimulate the growth of blood vessels through the secretion of various tumor angiogenesis proteins or growth factors. Cancer treatment therapies are therefore aimed at finding inhibitors for such growth factors, and identification of angiogenesis and tumor angiogenesis proteins bears enormous significance in cancer research because they are the targets of such inhibitors [22]. Much of this research revolves around finding ligands and substances that will bind with tumor angiogenesis proteins and inhibit their role [23]. Scientists use various methodologies for the identification of protein attributes [24-28]. In silico identification techniques have evolved and received acclaim over the past few years because they provide robust and fast results and are cost-effective [29, 30]. Various mathematical and computational models have been used to identify attributes of proteins based on the composition and positioning of amino acid residues [31]. A position-based mathematical model, the position-specific scoring matrix (PSSM), was introduced in 1982 [32], and numerous prediction models incorporate PSSM for the identification of proteomic attributes. However, since PSSM does not incorporate composition-relevant information, it lacks a major determinant of proteomic attributes. In 2001, Chou introduced the pseudo amino acid composition (PseAAC) model, which encompasses position as well as composition information and hence provides better results [33]. Many generalizations and variants have since been proposed to provide even better results [31]. The choice of the most appropriate classifier plays a pivotal role in the design of such methodologies. A multitude of classifiers have been engaged for the prediction of posttranslational modification sites, including random forest, support vector machine, neural networks, and deep learning. In [34], the authors incorporate an adapted normal distribution biprofile Bayes approach with PseAAC to formulate a prediction model. The accuracy is further improved using kernel sparse representation classification and the minimum redundancy maximum relevance algorithm [35]. Subsequently, an improved depiction uses a deep learning algorithm formulated by [36], and deep learning has emerged as an encouraging model for the resolution of a multitude of problems [37-39]. The proposed work presents a two-layered model based on position and composition relative features and statistical moments [31] for the identification of angiogenesis and tumor angiogenesis proteins, probed on various classifiers to accrue the best results.
2. Materials and Methods

Angiogenesis has been identified as a critical process that needs to be subjugated to disrupt the progression of cancer. Angiogenesis proteins, especially the ones that lead to tumor angiogenesis, have a crucial significance in this process. Since they promote the development of new blood vessels within the cancerous tissue, they are considered an important biomarker for early detection of cancer. Tumors use the same process for their growth; however, it is possible to uniquely identify the growth factors that are responsible for it. In terms of proteomic features, angiogenesis and tumor angiogenesis proteins have mutual properties. Therefore, to fulfill the arduous challenge of distinctly identifying tumor angiogenesis proteins, a two-layered approach is adopted as shown in Figure 1.

Figure 1: Flowchart of the proposed system.

The first layer of the model detects whether or not a protein is an angiogenesis protein, using the primary structure of that protein. In the case it is an angiogenesis protein, the second layer of the model is invoked to decide whether the angiogenesis protein can potentially cause cancer or not. The proposed workflow, shown in Figure 2, consists of a five-step approach: initially, a well-reviewed and experimentally tested dataset of angiogenesis proteins is collected and preprocessed to remove redundancies. Second, feature extraction is performed to transform the biological data into its equivalent mathematical matrix. In the third step, the obtained feature matrix is used to train the model for further prediction. In the fourth step, the model is evaluated for its correctness, sensitivity, specificity, and MCC. In the fifth step, the webserver is developed.

Figure 2: The workflow of the proposed model, which includes five steps: data collection and its preprocessing, feature extraction, training, model evaluation, and the construction of the webserver.

2.1. Dataset Collection. The dataset was collected from the UniProt database using meticulously designed search parameters. UniProt is a Universal Protein Resource that contains extensive information about protein sequences and their biological functions [22]. A dataset containing positive samples was composed for both angiogenesis and tumor angiogenesis using the UniProt keyword "Angiogenesis." Similarly, negative samples were also collected. UniProt has no keyword for "Tumor Angiogenesis" proteins; nonetheless, they lie within the set of angiogenesis proteins, so tumor angiogenesis proteins were manually curated from the acquired dataset. Each sample within the dataset was manually analyzed for annotated proteomic properties and published evidence within the database to form a set of tumor angiogenesis proteins, and ambiguous samples were left out. After the collection of data from UniProt, the CD-HIT suite (http://weizhong-lab.ucsd.edu/cdhit_suite/cgi-bin/index.cgi) was used to reduce the homology of data samples. Clustering of the angiogenesis and tumor angiogenesis datasets was performed by setting the sequence identity parameter at 60%. Ultimately, 761 positive and 2776 negative clusters were formed for the angiogenesis dataset; similarly, 256 positive and 448 negative clusters were formed for the tumor angiogenesis dataset. A representative sequence was selected from each cluster to form the final dataset:

A = A^{+} \cup A^{-}.   (1)

Equation (1) shows the benchmark dataset used in this work, where A^{+} represents the positive data samples of angiogenesis proteins and A^{-} the negative data. Also, the positive tumor angiogenesis samples are represented as T^{+} and the negative tumor angiogenesis proteins as T^{-}, as shown in the equation below:

T = T^{+} \cup T^{-}.   (2)
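As a rough illustration of how the two benchmark datasets of equations (1) and (2) could be assembled in code, the following minimal sketch reads positive and negative FASTA files assumed to have been exported from UniProt and already clustered with CD-HIT at 60% identity. The file names are hypothetical, and the sketch is not the authors' own preprocessing script.

```python
# Minimal sketch of assembling the two benchmark datasets (A and T).
# File names are placeholders for FASTA exports after CD-HIT clustering.

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

def load_level(positive_path, negative_path):
    """Build a labelled dataset: label 1 for positive, 0 for negative samples."""
    sequences, labels = [], []
    for path, label in [(positive_path, 1), (negative_path, 0)]:
        for _, seq in read_fasta(path):
            sequences.append(seq)
            labels.append(label)
    return sequences, labels

# Level 1 (angiogenesis vs. nonangiogenesis) and level 2 (tumor vs. non-tumor angiogenesis)
angio_seqs, angio_labels = load_level("angiogenesis_pos.fasta", "angiogenesis_neg.fasta")
tumor_seqs, tumor_labels = load_level("tumor_angio_pos.fasta", "tumor_angio_neg.fasta")
```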
2.2. Feature Extraction. A robust and efficient methodology for transforming biological sequences into a numerical notation suitable for a machine learning algorithm is the most pivotal concept in the design of such predictive models [31, 40]. The conversion must keep the original information or features of the sequence intact in some numerical form. For this purpose, each primary sequence within the collected data is converted into a fixed-size vector: a feature vector of static length is formed which represents a primary sequence and remains essentially invariant to the length of the sequence [41]. Such a transformation model is ideal because most state-of-the-art classifiers work with vectors [22, 42, 43]. A vector described in this way may, however, lose information about the order of the sequence pattern [44]. To address this problem, Chou's PseAAC was proposed, which is used by many scientists for the construction of genomic and proteomic prediction models and their applications [45, 46]. Later, this model was improved to provide a better correlation perspective among residues that reflects onto the feature coefficients.

Let P be a protein sequence of length L, represented as

P = R_{1} R_{2} R_{3} \cdots R_{16} R_{17} R_{18} \cdots R_{L},   (3)

where R_{i} is an arbitrary residue of a polypeptide chain of length L. Feature extraction yields a vector with numerous numerical coefficients. The transformation from a variable-length polypeptide chain into a fixed-length feature vector is illustrated in the following equation:

\Delta(P) = [\Psi_{1}\ \Psi_{2}\ \cdots\ \Psi_{u}\ \cdots\ \Psi_{\Omega}],   (4)

where \Delta is the transformation function, \Psi_{u} is an arbitrary coefficient, and \Omega is the constant length of the feature vector [22, 31].

2.3. Statistical Moments. The proposed methodology develops on the use of statistical moments to form a numerical representation such that the obscured information within the primary structure of proteins stays intact. These moments form a succinct numerical form from which the original data can be reconstructed without any significant loss of information. Moments can be obtained up to several orders; each provides a deeper perspective into specific aspects of the data, such as positioning, eccentricity, skewness, and peculiarity [31]. Mathematicians and statisticians have devised many moment-generating coefficients based on well-defined distribution functions and polynomials [35, 44]. In the proposed work, Hahn moments, raw moments, and central moments are organized to form a feature set. The Hahn moment bears location- and scale-oriented variance and is calculated based on the Hahn polynomial. Central moments carry information regarding asymmetry, mean, and variance; they are derived about the centroid of the collective data, making them scale variant and location invariant. Raw moments, in turn, are scale and location variant and represent properties like asymmetry, variance, and mean.

A matrix P' with m \times m dimensions is formulated for a two-dimensional residual protein representation, where m = \lceil \sqrt{L} \rceil:

P' = \begin{bmatrix} R_{11} & R_{12} & R_{13} & \cdots & R_{1m} \\ R_{21} & R_{22} & R_{23} & \cdots & R_{2m} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ R_{m1} & \cdots & \cdots & \cdots & R_{mm} \end{bmatrix}.   (5)

The vector P is easily transformed into the matrix P' by using a simple mapping function explained in [47]. The primary sequence is fitted into a two-dimensional matrix so that it can be formulated into the Hahn polynomial, which is orthogonal. The same two-dimensional notation is used for deriving raw and central moments. The Hahn moment is computed using the Hahn polynomial as given below:

H_{n}^{v,u}(r, N) = (N + u - 1)_{n}(N - 1)_{n} \sum_{i=0}^{n} (-1)^{i} \frac{(-n)_{i}(-r)_{i}(2N + v + u - n - 1)_{i}}{(N + u - 1)_{i}(N - 1)_{i}} \frac{1}{i!}.   (6)

Central moments are computed using the equation given below:

\mu_{st} = \sum_{p=1}^{k} \sum_{q=1}^{k} (p - \bar{x})^{s} (q - \bar{y})^{t} P'_{pq}.   (7)

The following equation is used to compute the raw moments:

M_{st} = \sum_{p=1}^{k} \sum_{q=1}^{k} p^{s} q^{t} P'_{pq}.   (8)

In equations (7) and (8), s and t represent the order of the moments. The orthogonality of these moments renders their use assiduous, as their inverse functions can be used to reconstruct the data. A detailed explanation and use of these notations can be found in [48].
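The following sketch shows how the two-dimensional representation of equation (5) and the raw and central moments of equations (7) and (8) could be computed. The residue-to-number encoding is an assumption made purely for illustration (the paper does not state its numeric mapping), and the Hahn moments of equation (6) are omitted for brevity.

```python
# Sketch of the 2-D matrix P' and its raw and central moments (equations (5), (7), (8)).
import math
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CODE = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # hypothetical numeric encoding

def to_matrix(seq):
    """Fit the sequence into an m x m matrix with m = ceil(sqrt(L)), zero-padded."""
    m = math.ceil(math.sqrt(len(seq)))
    flat = np.zeros(m * m)
    flat[:len(seq)] = [CODE.get(r, 0) for r in seq]
    return flat.reshape(m, m)

def raw_moment(P, s, t):
    """M_st = sum_p sum_q p^s q^t P'_pq with 1-based indices (equation (8))."""
    k = P.shape[0]
    p = np.arange(1, k + 1).reshape(-1, 1)
    q = np.arange(1, k + 1).reshape(1, -1)
    return float(np.sum((p ** s) * (q ** t) * P))

def central_moment(P, s, t):
    """mu_st = sum_p sum_q (p - xbar)^s (q - ybar)^t P'_pq (equation (7))."""
    m00 = raw_moment(P, 0, 0)
    xbar, ybar = raw_moment(P, 1, 0) / m00, raw_moment(P, 0, 1) / m00
    k = P.shape[0]
    p = np.arange(1, k + 1).reshape(-1, 1)
    q = np.arange(1, k + 1).reshape(1, -1)
    return float(np.sum(((p - xbar) ** s) * ((q - ybar) ** t) * P))

P2d = to_matrix("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
moments = [raw_moment(P2d, s, t) for s in range(3) for t in range(3)]
moments += [central_moment(P2d, s, t) for s in range(3) for t in range(3)]
```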
2.4. Frequency Vector Determination. The cumulative frequency of occurrence of each specific amino acid residue is furnished into a frequency vector. Information about the distribution of amino acid residues within the primary sequence is summarized in this frequency vector, which is represented as

FV = [f_{1}, f_{2}, f_{3}, \cdots, f_{20}],   (9)

where f_{i} refers to the frequency of occurrence of an arbitrary distinct amino acid residue.

2.5. Position Relative Incidence Matrix (PRIM) Calculation. The primary sequence of a protein forms the basis of the formulation of feature vectors for primary structures that are otherwise obscure. Information pertaining to the position relative incidence of arbitrary protein residues is formulated as a matrix of size 20 \times 20. The Position Relative Incidence Matrix (PRIM) is illustrated as

X_{\mathrm{PRIM}} = \begin{bmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,j} & \cdots & X_{1,20} \\ X_{2,1} & X_{2,2} & \cdots & X_{2,j} & \cdots & X_{2,20} \\ \vdots & \vdots & & \vdots & & \vdots \\ X_{i,1} & X_{i,2} & \cdots & X_{i,j} & \cdots & X_{i,20} \\ \vdots & \vdots & & \vdots & & \vdots \\ X_{N,1} & X_{N,2} & \cdots & X_{N,j} & \cdots & X_{N,20} \end{bmatrix}.   (10)

The element X_{ij} holds the sum of the relative positions of the jth residue with respect to the first occurrence of the ith residue; the matrix contains all the possible permutations of such occurrences, as explained in [48].

2.6. Determination of Reverse Position Relative Incidence Matrix (RPRIM). More obscure features of the primary sequence are uncovered with the help of the Reverse Position Relative Incidence Matrix (RPRIM), which is obtained by forming the PRIM of the reversed primary sequence. X_{\mathrm{RPRIM}} is illustrated as

X_{\mathrm{RPRIM}} = \begin{bmatrix} R_{1,1} & R_{1,2} & \cdots & R_{1,j} & \cdots & R_{1,20} \\ R_{2,1} & R_{2,2} & \cdots & R_{2,j} & \cdots & R_{2,20} \\ \vdots & \vdots & & \vdots & & \vdots \\ R_{i,1} & R_{i,2} & \cdots & R_{i,j} & \cdots & R_{i,20} \\ \vdots & \vdots & & \vdots & & \vdots \\ R_{N,1} & R_{N,2} & \cdots & R_{N,j} & \cdots & R_{N,20} \end{bmatrix},   (11)

where R_{i,j} is an arbitrary element of X_{\mathrm{RPRIM}}.

2.7. Accumulative Absolute Position Incidence Vector (AAPIV) Calculation. The AAPIV accumulates the sum of all the positions at which each native amino acid occurs within the primary sequence; hence, it has a length of 20 and is denoted as

AAPIV = [\nu_{1}, \nu_{2}, \nu_{3}, \cdots, \nu_{20}].   (12)

Any ith element of the above vector is computed as

\nu_{i} = \sum_{k=1}^{n} P_{k},   (13)

where P_{k} is the position of occurrence of a native amino acid and n is its frequency of occurrence.

All the above-defined features are aggregated to form a feature vector. The dimensionality of P', X_{\mathrm{PRIM}}, and X_{\mathrm{RPRIM}} is reduced by computing their Hahn, central, and raw moments. Ultimately, a fixed-size feature vector is formed to represent primary structures of varied lengths.
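A compact sketch of the composition and position descriptors of equations (9)-(13) is given below. The indexing conventions (1-based positions, and the reading of "relative position with respect to the first occurrence" for PRIM) are one plausible interpretation of the published description rather than the authors' exact code.

```python
# Sketch of the FV, AAPIV, PRIM, and RPRIM descriptors (equations (9)-(13)).
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def frequency_vector(seq):
    """FV = [f1 ... f20]: occurrence count of each of the 20 residues (equation (9))."""
    fv = np.zeros(20)
    for r in seq:
        if r in INDEX:
            fv[INDEX[r]] += 1
    return fv

def aapiv(seq):
    """AAPIV: for each residue type, the sum of its 1-based positions (equations (12)-(13))."""
    v = np.zeros(20)
    for pos, r in enumerate(seq, start=1):
        if r in INDEX:
            v[INDEX[r]] += pos
    return v

def prim(seq):
    """PRIM: X[i, j] sums positions of residue j relative to the first occurrence of residue i."""
    X = np.zeros((20, 20))
    first = {}
    for pos, r in enumerate(seq, start=1):
        if r in INDEX and r not in first:
            first[r] = pos
    for i_res, i_pos in first.items():
        for pos, r in enumerate(seq, start=1):
            if r in INDEX:
                X[INDEX[i_res], INDEX[r]] += pos - i_pos
    return X

def rprim(seq):
    """RPRIM: the PRIM of the reversed sequence (equation (11))."""
    return prim(seq[::-1])
```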
3. Prediction Algorithm

After extraction of feature vectors from positive as well as negative sequences, the data is used to train classifiers. A diverse set of currently widespread classifiers was used for this purpose, including random forest, neural network, and support vector machine. Comparing the results yielded by each classifier enables the identification of the most suitable classifier with the highest accuracy.

3.1. Random Forest. The random forest (RF) classifier was trained at two levels for the prediction of angiogenesis and tumor angiogenesis proteins. At the first level, the classifier was used to separate angiogenesis from nonangiogenesis proteins, while at the second level an angiogenesis protein was passed through another classifier to identify whether the protein is tumor causing or not. The random forest is a very powerful classifier used for classification and regression problems [49, 50]. Initially, it converts the whole dataset into decision trees built on subsets of the data [23, 51]; each tree then predicts a class, and the class with the highest number of votes becomes the model's prediction result [41], as illustrated in Figure 3.

Figure 3: Random forest classifier architecture.
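A minimal sketch of the two-level cascade with random forests is shown below, using scikit-learn (the framework the paper names for its implementation). The featurizer here is a simplified composition stand-in, the hyperparameters are illustrative, and `angio_seqs`, `angio_labels`, `tumor_seqs`, and `tumor_labels` are assumed to come from the dataset-loading sketch earlier.

```python
# Sketch of the two-level prediction scheme with random forests.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_features(seq):
    """Simplified stand-in featurizer: normalized amino acid composition."""
    aa = "ACDEFGHIKLMNPQRSTVWY"
    return np.array([seq.count(a) / max(len(seq), 1) for a in aa])

# Level 1: angiogenesis vs. nonangiogenesis
X1 = np.array([extract_features(s) for s in angio_seqs])
level1 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X1, angio_labels)

# Level 2: tumor vs. non-tumor angiogenesis (trained only on angiogenesis proteins)
X2 = np.array([extract_features(s) for s in tumor_seqs])
level2 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X2, tumor_labels)

def predict(seq):
    """Run the cascade: only proteins flagged as angiogenesis reach the second level."""
    x = extract_features(seq).reshape(1, -1)
    if level1.predict(x)[0] == 0:
        return "non-angiogenesis"
    return "tumor angiogenesis" if level2.predict(x)[0] == 1 else "angiogenesis (non-tumor)"
```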
3.2. Artificial Neural Network (ANN). Subsequently, an artificial neural network (ANN) was employed at the two levels in the same manner. An ANN consists of interconnected layers of neurons [52]; the connectionist architecture of the backpropagation network is illustrated in Figure 4. The ANN used here is a feedforward network trained with the backpropagation algorithm to reduce the error. An input layer is clamped to the input feature vectors, and a hidden layer receives a selected number of neurons from the input layer and forms the main processing unit of the whole network. The activation unit of the ANN sums all preceding weighted inputs in addition to bias values [23, 31]. The output of the 3-layer feedforward network with error backpropagation is represented by

O_{m} = f\left( \sum_{y=1}^{h} W_{ym} \times f\left( \sum_{x=1}^{k} W_{xy} I_{a} \right) \right),   (14)

where the input layer has k neurons and the hidden layer has h neurons. The partial output calculated by the mth output neuron is denoted by O_{m}. Supposing that an arbitrary node receives an input I_{a}, W_{xy} represents the weight of the edge connecting node x to node y, and W_{ym} represents the weight connecting the yth hidden node to an arbitrary output neuron m. The classical sigmoid function that determines the activation of the neurons is denoted as f in

f(x) = \frac{1}{1 + e^{-x}}.   (15)

Figure 4: Architecture of the ANN.

The actual activation levels of the output units are compared with the target output for every training iteration. The error rate hence observed is denoted by \epsilon and is calculated from the difference between the expected output and the actual activated output:

\epsilon = 0.5 \sum_{i=1}^{o} (O_{i} - P_{i})^{2},   (16)

where O_{i} is the target output, P_{i} is the actual output calculated by the network, and o is the number of neurons in the output layer. The gradient descent method is used to minimize the error rate: the error generated at the output layer is sent back towards the input layer. The set of all the weights is represented by a vector V. The backpropagation procedure selects a differential \Delta V such that it lessens the error, and this is continued iteratively until convergence is achieved, as shown below:

V(t + 1) = V(t) + \Delta V(t),   (17)

where

\Delta V = \eta \left( -\frac{\partial \epsilon}{\partial W} \right) \bigg|_{V = V(t)}.   (18)

This equation shows the change in the weights at time t + 1, and the positive constant \eta signifies the learning rate, usually set between 0 and 1. The change in an individual weight is expressed as

\Delta V_{u,v} = -\eta \frac{\partial \epsilon}{\partial W_{u,v}}.   (19)

Here, \Delta V_{u,v} shows the error-minimizing weight change between the uth and vth neurons in a given iteration. This procedure is followed in both the backward and forward passes of the input signals; it is a lightweight procedure that consumes little memory and is extensively used for training ANNs. Patterns are repeatedly presented to the network to train it and to make it capable of minimizing the mean square error (MSE), given by

\mathrm{MSE} = \frac{1}{2n} \sum_{j=1}^{n} \sum_{i=1}^{k} \left( P_{i}^{o} - O_{i}^{o} \right)^{2}.   (20)

The actual output received at the ith neuron of the output layer is represented as O_{i}^{o}, and P_{i}^{o} represents the expected value, where the total number of input samples is n and there are k output neurons.

3.3. Support Vector Machine (SVM). A support vector machine (SVM) is a machine learning classifier that is also used in regression-related problems. SVM works by attempting to fit a hyperplane in an N-dimensional space, where N is the number of feature elements that represent the samples distinctly. Hyperplanes are decision boundaries that classify the data points lying on either side of them, ideally partitioning the different classes. The hyperplane is adjusted most optimally by means of support vectors. Figure 5 illustrates points on either side of the hyperplane belonging to two different classes, namely, class A and class B.

Figure 5: Architectural diagram of the SVM, with class A and class B separated by a hyperplane.
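For comparison with the random forest sketch, the two other classifiers could be set up as follows with scikit-learn, which the paper names as its implementation framework. The hidden-layer size, learning rate, and kernel choice are illustrative assumptions rather than the authors' reported settings, and `X1` and `angio_labels` are reused from the earlier sketches.

```python
# Hedged sketch of the ANN and SVM baselines using scikit-learn.
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64,),
                  activation="logistic",        # sigmoid units, as in equation (15)
                  solver="sgd",                 # gradient-descent backpropagation
                  learning_rate_init=0.1,
                  max_iter=500,
                  random_state=0),
)
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, random_state=0))

ann.fit(X1, angio_labels)
svm.fit(X1, angio_labels)
```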
4. Results and Discussion

4.1. Evaluation of the Model. In the current study, the dataset was constructed on two levels. The first level uses 785 positive and 2776 negative samples of angiogenesis proteins, whereas the second level encompasses 256 positive and 448 negative samples of tumor angiogenesis proteins. A feature vector input matrix (FIM) was formed for the angiogenesis and tumor angiogenesis datasets separately; every row of the FIM is a feature vector that represents a single data sample. An Expected Output Matrix (EOM) was formed corresponding to the FIM. All the classifiers were trained using both FIM and EOM: the FIM was given as input for training the model, and the EOM was used to compute errors and retrain until convergence was achieved [23, 31, 43, 45]. All the classifiers were implemented in Python 3.6 using the scikit-learn API, and the results gathered with this framework were rigorously analyzed in terms of their performance parameters.

A major issue in the design of a new prediction model is setting up parameters to measure its accuracy. Researchers have predominantly used four descriptive metrics for performance analysis:

(1) Specificity (Sp) quantifies the ability of the model to identify negative samples accurately [46]
(2) Sensitivity (Sn) quantifies the ability of the model to identify positive samples accurately
(3) Accuracy (Acc) measures the overall accuracy of the model
(4) The Matthews correlation coefficient (MCC) measures the stability of the model

The following formulations are used to quantify these metrics:

\mathrm{Specificity\ (Sp)} = \frac{TN}{TN + FP},   (21)

\mathrm{Sensitivity\ (Sn)} = \frac{TP}{TP + FN},   (22)

\mathrm{Accuracy\ (Acc)} = \frac{TP + TN}{TP + FP + TN + FN} \times 100,   (23)

\mathrm{MCC} = \frac{(TP)(TN) - (FP)(FN)}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}},   (24)

where TN denotes true negatives, TP true positives, FP false positives, and FN false negatives [43, 53, 54]. Unfortunately, the formulation of equations (21)-(24) is somewhat cryptic for biologists [55]. A more intuitive format has been suggested by the scientists in [56, 57], and its modifiers were introduced in [47]. The symbols used to represent these equations are N^{+}, N_{-}^{+}, N^{-}, and N_{+}^{-}; an explanation of these representations is given in Table 1. Hence, the metrics can also be calculated as

Sn = 1 - \frac{N_{-}^{+}}{N^{+}}, \quad Sp = 1 - \frac{N_{+}^{-}}{N^{-}}, \quad Acc = 1 - \frac{N_{-}^{+} + N_{+}^{-}}{N^{+} + N^{-}}, \quad MCC = \frac{1 - \left( N_{-}^{+}/N^{+} + N_{+}^{-}/N^{-} \right)}{\sqrt{\left( 1 + \frac{N_{+}^{-} - N_{-}^{+}}{N^{+}} \right)\left( 1 + \frac{N_{-}^{+} - N_{+}^{-}}{N^{-}} \right)}}.   (25)

Table 1: New symbol description for Chou's fourth step.
  N^{+}      : the total number of positive samples in the dataset
  N_{-}^{+}  : the number of positive samples in the dataset projected incorrectly
  N^{-}      : the total number of negative samples in the dataset
  N_{+}^{-}  : the number of negative samples projected incorrectly
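A small helper implementing equations (21)-(24) from confusion-matrix counts is sketched below; it is an illustration rather than the authors' evaluation script. The example call uses the random forest self-consistency counts reported for level 1 in Table 2.

```python
# Compute Sp, Sn, Acc, and MCC from confusion-matrix counts (equations (21)-(24)).
import math

def metrics(tp, fp, tn, fn):
    sp = tn / (tn + fp)                       # specificity, equation (21)
    sn = tp / (tp + fn)                       # sensitivity, equation (22)
    acc = (tp + tn) / (tp + fp + tn + fn)     # accuracy, equation (23)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0   # equation (24)
    return {"Sp": sp, "Sn": sn, "Acc": acc * 100, "MCC": mcc}

print(metrics(tp=783, fp=0, tn=2784, fn=0))
```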
4.2. Validation Methods. Testing is another important factor in the validation of prediction models [22, 31, 42, 45]. The validation phase encompasses the four most commonly used tests, discussed below.

4.2.1. Self-Consistency. The self-consistency test is the most trivial and intuitive of the tests: a trained model is simply tested on the dataset that was used to train it. The capability of a model to learn from a given dataset is underscored by this basic but useful benchmark, although good results merely indicate that the classifier has the ability to find obscure patterns within the training data. Self-consistency testing was performed on the angiogenesis and tumor angiogenesis datasets upon which the proposed model was trained. The results, given in Table 2, show the overall performance of the proposed model using the random forest (RF), artificial neural network (ANN), and support vector machine (SVM) classifiers. They indicate that the random forest classifier has the best capability to learn and decipher the obscure patterns that peculiarly characterize each sample.

Table 2: Self-consistency results for angiogenesis and tumor angiogenesis.
           Angiogenesis                                        Tumor angiogenesis
Predictor  TP   FP   TN    FN   Acc(%)  Sp(%)  Sn(%)  MCC     TP   FP   TN   FN   Acc(%)  Sp(%)  Sn(%)  MCC
RF         783  0    2784  0    100     100    100    1       255  1    447  1    99.7    99.6   99.8   0.9
ANN        766  7    2580  204  94.1    99.1   92.7   0.9     256  0    307  141  79.9    100    68.5   0.6
SVM        31   752  2783  1    78.9    4      100    0.2     12   244  447  1    65.2    4.7    99.8   0.2
4.2.2. Cross-Validation. The cross-validation technique is used when unknown data for testing is not readily available [45, 58]. The dataset is randomly divided into multiple partitions, or folds, spanning a comprehensive sample space, which renders cross-validation a rigorous test. The partitions are devised such that they are disjoint from each other and comparable in size. One partition is left out while the model is trained on the rest of the data; once the model is fully trained, the left-out partition is used as unknown data to test it. These steps are repeated for each fold, and the overall accuracy of the model for the cross-validation test is reported as the mean of the accuracies yielded against each fold. Cross-validation tests were performed by partitioning the benchmark dataset into 5 folds and 10 folds; Table 3 lists the results. The random forest exhibits the best results at both levels, with an accuracy of 99.7% for the identification of angiogenesis proteins and an accuracy of 99.5% for the identification of tumor angiogenesis proteins.

Table 3: k-fold cross-validation results.
                 Level 1 (angiogenesis)                                 Level 2 (tumor angiogenesis)
Predictor Folds  TP   FP   TN    FN   Acc(%)  Sn(%)  Sp(%)  MCC        TP   FP   TN   FN   Acc(%)  Sn(%)  Sp(%)  MCC
RF        5      723  60   2784  0    98.1    92.3   100    0.95       254  2    448  0    99.7    99.2   100    0.9
ANN       5      653  130  2780  4    96.2    83.4   99.9   0.8        246  10   428  20   95.7    96.1   95.7   0.9
SVM       5      31   752  2783  1    78.8    4      100    0.2        6    250  448  0    64.5    2.3    100    0.1
RF        10     706  77   2784  0    97.8    99.4   100    0.9        253  3    448  0    99.5    98.8   100    0.9
ANN       10     776  7    2580  240  94.1    99.1   92.7   0.8        256  0    307  141  79.9    100    68.5   0.7
SVM       10     31   752  2783  1    78.8    4      100    0.2        12   244  447  1    65.19   4.7    99.8   0.2

4.2.3. Jackknife Testing. Jackknife testing is the most rigorous testing methodology. In each iteration, a single sample is left out while the model is trained on the rest; after sufficient training, the model is tested on the left-out sample. This process proceeds exhaustively over all data samples, so the test is repeated N times, where N is the size of the overall dataset. In every iteration the testing sample is different, so every sample is tested exactly once; this makes the technique the most rigorous and also the slowest [59-63]. After training and testing, the numbers of true positives, false positives, true negatives, and false negatives were obtained [55]. Since each sample is tested exactly once, the overall accuracy obtained for this test is unique [31, 40, 45, 46]. The RF results illustrated in Table 4 for angiogenesis and tumor angiogenesis proteins portray higher accuracies than the other classifiers, reported as 99.3% and 99.7%, respectively.

Table 4: Jackknife results.
           Angiogenesis                                        Tumor angiogenesis
Model      TP   FP   TN    FN   Acc(%)  Sn(%)  Sp(%)  MCC     TP   FP   TN   FN   Acc(%)  Sp(%)  Sn(%)  MCC
RF         781  26   2784  0    99.3    100    100    1       255  1    447  1    99.7    99.6   99.8   0.9
ANN        653  130  2780  4    96.3    83.3   99.9   0.8     246  10   428  20   95.7    96.1   95.5   0.9
SVM        783  0    2784  0    100     100    100    1       6    250  448  0    64.5    2.3    100    0.1

4.2.4. Independent Set Testing. The independent test evaluates how well a model performs on unknown data. Initially, the data is partitioned such that the larger partition is used for training and the left-out partition is treated as unknown data for testing. Once the model is completely trained, independent set testing is performed using the left-out data. The independent set needs to be formulated intelligibly, such that the training data encompasses comprehensive obscure patterns and the test data thoroughly queries the ability of the model to decipher them; otherwise, the testing results may be ambiguous. The overall accuracies of the RF, ANN, and SVM classifiers after independent testing are presented in Table 5. The random forest shows the best results compared with the ANN and SVM classifiers at both levels, for the identification of angiogenesis as well as tumor angiogenesis proteins, while the performance of the ANN classifier is better than that of the SVM classifier.

Table 5: Independent set results.
           Angiogenesis                                       Tumor angiogenesis
Model      TP   FP   TN   FN   Acc(%)  Sn(%)  Sp(%)  MCC     TP   FP   TN   FN   Acc(%)  Sp(%)  Sn(%)  MCC
RF         211  27   833  0    94.5    88.7   100    0.9     70   0    142  0    100     100    100    1
ANN        227  14   827  3    98.4    94.2   99.6   0.9     59   12   141  0    94.3    83.1   100    0.9
SVM        3    238  833  7    77.2    1.2    99.2   0.02    5    66   131  10   64.2    7.0    92.9   0.01
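The k-fold and jackknife protocols described above map directly onto scikit-learn's splitters, as in the sketch below. The classifier settings are illustrative, and `X1` and `angio_labels` are assumed to come from the earlier feature-extraction sketch; the fold counts follow the text (5 and 10 folds), and the leave-one-out run stands in for jackknife testing.

```python
# Sketch of k-fold cross-validation and jackknife (leave-one-out) evaluation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, LeaveOneOut, cross_val_score

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold and 10-fold cross-validation: mean accuracy over the held-out folds.
for k in (5, 10):
    scores = cross_val_score(
        clf, X1, angio_labels,
        cv=StratifiedKFold(n_splits=k, shuffle=True, random_state=0),
        scoring="accuracy")
    print(f"{k}-fold accuracy: {scores.mean():.3f}")

# Jackknife testing: leave exactly one sample out per iteration (slow on large datasets).
loo_scores = cross_val_score(clf, X1, angio_labels, cv=LeaveOneOut(), scoring="accuracy")
print("Jackknife accuracy:", np.mean(loo_scores))
```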
Working with classification models renders performance measurement an essential task, usually quantified using classification scores. This type of measurement is not suitable, however, when dealing with flawed datasets exhibiting heavy class imbalance. In such cases, ROC (Receiver Operating Characteristic) curves provide a graphical view along with a quantitative analysis of the overall scenario, and ROC analysis is a prevalently used method for evaluating any classification model. The ROC curve is plotted by mapping the True Positive Rate (TPR) against the False Positive Rate (FPR); it depicts how well the model is able to distinguish among classes. TPR is plotted along the y-axis while FPR is plotted along the x-axis, and the area under the curve (AUC) is a measure of the model's performance. The best possible AUC is 1: an AUC near 1 indicates a good measure of separability, an AUC of 0.5 indicates no separability, and an AUC below 0.5 indicates that the model performs the opposite of what it was designed to do.

Various testing techniques were applied to gauge the effectiveness of the classifiers, as discussed earlier. To prioritize the classifiers based on efficiency, a comparison is depicted through ROC curves. Figure 6 represents the comparison based on the testing performed in the previous section. Figures 6-10 show that RF gives the best results in comparison with ANN and SVM: the RF curve encompasses an area close to 1, implying that the model has the best measure of separability. The graphical representations accentuate that RF and ANN both exhibit better results than SVM; however, in the case of jackknife testing, the SVM classifier accuracy is higher than that of the ANN, as illustrated in Figure 10.

Figure 6: Comparison based on self-consistency (ROC curves of RF, ANN, and SVM on the angiogenesis dataset).
Figure 7: Comparison through 5-fold cross-validation (angiogenesis dataset).
Figure 8: Comparison based on 10-fold cross-validation (angiogenesis dataset).
Figure 9: Jackknife testing comparison (angiogenesis dataset).
Figure 10: Independent testing comparison (angiogenesis dataset).

A similar comparison was performed for the classifiers at the second level, which predicts tumor angiogenesis proteins. Figures 11-15 illustrate the results of the various test techniques performed on the tumor angiogenesis dataset. These figures connote that the RF classifier exhibits better results in comparison with the ANN and SVM classifiers, supported by the fact that the area under the RF curve approximately approaches 1.

Figure 11: Comparison based on self-consistency (tumor angiogenesis dataset).
Figure 12: Comparison based on 5-fold cross-validation (tumor angiogenesis dataset).
Figure 13: Comparison based on 10-fold cross-validation (tumor angiogenesis dataset).
Figure 14: Comparison of jackknife testing (tumor angiogenesis dataset).
Figure 15: Comparison based on independent testing (tumor angiogenesis dataset).
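A ROC curve of the kind shown in Figures 6-15 can be produced with scikit-learn as sketched below. The held-out split ratio is an assumption, and `X1` and `angio_labels` are reused from the earlier sketches; this is an illustration of the TPR-versus-FPR plot, not the authors' plotting code.

```python
# Sketch of plotting a ROC curve and computing its AUC for the level-1 classifier.
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X1, angio_labels, test_size=0.25,
                                          stratify=angio_labels, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# The probability of the positive class drives the ROC curve.
scores = clf.predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)
print("AUC:", auc(fpr, tpr))

plt.plot(fpr, tpr, label="Random forest")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```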
5. Webserver

The formulation of a robust dataset and feature extraction methodology forms the foundation of a computationally intelligent model for efficient prediction of uncategorized proteomic sequences. However, the availability of such a tool is also of extreme importance so that the research community can benefit from it [45]. To make a novel predictor available to all users and biologists around the globe, a user-friendly and publicly accessible webserver is needed; devising such a webserver is the final step of Chou's 5-step rule [48]. The webserver enables scientists and biologists to easily access and utilize such prediction applications without getting into the complex mathematical details. The webserver for the proposed work will soon be made available. Meanwhile, its code has been made available along with a readme file at https://github.com/RabiaKhan-94/Thesis_WebServer.git, which can easily be set up by an intermediate-level Python developer.
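For orientation only, the following hypothetical sketch shows what a minimal Flask prediction endpoint for such a two-level predictor could look like. It is not the authors' implementation (their code is at the repository linked above); the endpoint name, payload format, model files, and featurizer are all illustrative assumptions.

```python
# Hypothetical minimal prediction webserver sketch (not the authors' code).
from flask import Flask, jsonify, request
import joblib
import numpy as np

app = Flask(__name__)
level1 = joblib.load("level1_rf.joblib")   # placeholder pickled classifiers
level2 = joblib.load("level2_rf.joblib")

def extract_features(seq):
    """Placeholder featurizer; a deployed server would reuse the paper's feature pipeline."""
    aa = "ACDEFGHIKLMNPQRSTVWY"
    return np.array([[seq.count(a) / max(len(seq), 1) for a in aa]])

@app.route("/predict", methods=["POST"])
def predict():
    seq = request.get_json(force=True).get("sequence", "").upper()
    x = extract_features(seq)
    if level1.predict(x)[0] == 0:
        label = "non-angiogenesis"
    elif level2.predict(x)[0] == 1:
        label = "tumor angiogenesis"
    else:
        label = "angiogenesis (non-tumor)"
    return jsonify({"sequence_length": len(seq), "prediction": label})

if __name__ == "__main__":
    app.run(debug=True)
```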
6. Discussion and Conclusion

This study proposes a prediction model for the classification of angiogenesis and tumor angiogenesis proteins. A robust, well-defined methodology was adopted for dataset collection: duplicate and redundant data were removed, and homologous sequences above 60% identity were excluded. Variable-length proteomic sequences were transformed into fixed-length feature vectors using a position- and composition-based technique, and position relative information was further transmuted into a succinct form using statistical moments. Three classifiers, random forest (RF), artificial neural network (ANN), and support vector machine (SVM), were used to find the best results. All of these algorithms are powerful, robust, and well understood, and both RF and ANN can deal with linear as well as complicated nonlinear problems. The current study reveals that RF showed the best results among these classification approaches. As a result of cross-validation, RF exhibited an accuracy of 97.8% for angiogenesis proteins and an accuracy of 99.5% for tumor angiogenesis, whereas ANN showed an accuracy of 99.1% for angiogenesis and 79.9% for tumor angiogenesis; the accuracy of SVM was 78.8% for angiogenesis and 65.19% for tumor angiogenesis. The study has thus shown different performance for the different approaches and concludes that the results exhibited by RF are better than those of ANN and SVM. The random forest also takes less time to train than the neural network, and another important strength of RF is that it is less susceptible to overfitting, which is not the case with a neural network. The robustness of the feature extraction technique plays a significant role in the overall accuracy of the model: feature extraction uncovers obscure features pertinent to the composition and sequence of the primary structures, and the meticulously collected data helps the model produce better results. The in silico nature of the model makes it an alluring opportunity as it is timely and cost-effective. Biologists and scientists can benefit from the proposed tool for the characterization of proteins and for understanding their role in angiogenesis and tumor angiogenesis processes. Furthermore, the model can prove effective in identifying the biomarkers that cause a tumor, and it augments the work of biologists and scientists in research aimed at finding new treatments and discovering new drugs.

Tumor-causing angiogenesis proteins are important biomarkers for the onset of cancer, and timely identification of these proteins can help in the treatment and possible cure of the disease. This study proposes a robust in silico technique for the identification of tumor angiogenesis using a two-level predictor: the first level indicates whether a protein is an angiogenesis protein or not, while the second level identifies whether the given protein is responsible for tumor angiogenesis or not. A mature feature extraction technique was used to gather features for the benchmark dataset, and classifiers like RF, SVM, and ANN were trained using the resultant feature vectors. Once thoroughly trained, the models were rigorously tested using k-fold cross-validation, self-consistency, independent set testing, and jackknife testing. The random forest classifier showed 99.3% accuracy for angiogenesis and 99.7% for tumor angiogenesis, ANN showed an overall 96.23% accuracy for angiogenesis and 95% for tumor angiogenesis, and SVM showed 78.65% accuracy for angiogenesis and 65.19% for tumor angiogenesis.

7. Future Works

Advanced drug therapies and treatments integrate the use of ligands that target tumor angiogenesis proteins to inhibit them. Inhibition of these tumor growth factors disrupts tumor growth, and in some cases the tumor even dies out. Tools that help the discovery and identification of tumor angiogenesis proteins therefore greatly help cancer researchers to identify these growth factors in a timely and cost-effective manner. Once such a tumor growth factor has been uncovered, there is an incessant need to identify ligands that can inhibit it, and in silico models that simulate ligand binding with tumor growth factors can further enhance tumor research. In the future, the proposed model can be made more adaptive by incorporating updated data and using deep learning features.

Data Availability

Data is available at https://github.com/RabiaKhan-94/Angio_Webserver.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.
Acknowledgments

This project was funded by the Deanship of Scientific Research (DSR), King Abdulaziz University (https://www.kau.edu.sa/), Jeddah (under grant no. G:160-611-1441). The authors, therefore, acknowledge with thanks DSR technical and financial support.

References

[1] J. L. Blanco, A. B. Porto-Pazos, A. Pazos, and C. Fernandez-Lozano, "Prediction of high anti-angiogenic activity peptides in silico using a generalized linear model and feature selection," Scientific Reports, vol. 8, no. 1, pp. 1-11, 2018.
[2] J. Hardy, "Les petites brûlures," Soins, vol. 24, no. 6, pp. 3-5.
[3] H. Shen and X. Wei, "A qualitative analysis of a free boundary problem modeling tumor growth with angiogenesis," Nonlinear Analysis: Real World Applications, vol. 47, pp. 106-126.
[4] N C Institute, "Angiogenesis inhibitors," 2019, https://www.cancer.gov/about-cancer/treatment/types/immunotherapy/angiogenesis-inhibitors-fact-sheet.
[5] V. Laengsri, C. Nantasenamat, N. Schaduangrat, P. Nuchnoi, V. Prachayasittikul, and W. Shoombuatong, "TargetAntiAngio: a sequence-based tool for the prediction and analysis of anti-angiogenic peptides," International Journal of Molecular Sciences, vol. 20, no. 12, p. 2950, 2019.
[6] D. J. Bharali, M. Rajabi, and S. A. Mousa, "Application of nanotechnology to target tumor angiogenesis in cancer therapeutics," in Angiogenesis Strategies in Cancer Therapeutics, Elsevier Inc., 2016.
[7] W. Liang, Y. Zheng, J. Zhang, and X. Sun, "Multiscale modeling reveals angiogenesis-induced drug resistance in brain tumors and predicts a synergistic drug combination targeting EGFR and VEGFR pathways," BMC Bioinformatics, vol. 20, Suppl 7, 2019.
[8] F. Bray, J. Ferlay, I. Soerjomataram, R. L. Siegel, L. A. Torre, and A. Jemal, "Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries," CA: A Cancer Journal for Clinicians, vol. 68, no. 6, pp. 394-424, 2018.
[9] CancerQuest, "Angiogenesis," 2019.
[10] Y. Feng, Y. Dai, Z. Gong et al., "Association between angiogenesis and cytotoxic signatures in the tumor microenvironment of gastric cancer," OncoTargets and Therapy, vol. 11, pp. 2725-2733, 2018.
[11] R. K. Jain, E. di Tomaso, D. G. Duda, J. S. Loeffler, A. G. Sorensen, and T. T. Batchelor, "Angiogenesis in brain tumours," Nature Reviews Neuroscience, vol. 8, no. 8, pp. 610-622, 2007.
[12] Cancer Research UK, "Cancer Research UK," Worldwide cancer statistics, 2018.
[13] T. A. Elbayoumi and V. P. Torchilin, "Tumor-targeted nanomedicines: enhanced antitumor efficacy in vivo of doxorubicin-loaded, long-circulating liposomes modified with cancer-specific monoclonal antibody," Clinical Cancer Research, vol. 15, no. 6, pp. 1973-1980, 2009.
[14] C. Y. Huang, D. T. Ju, C. F. Chang, P. Muralidhar Reddy, and B. K. Velmurugan, "A review on the effects of current chemotherapy drugs and natural agents in treating non-small cell lung cancer," BioMedicine, vol. 7, no. 4, p. 23, 2017.
[15] S. Baritaki, S. Huerta-Yepez, T. Sakai, D. A. Spandidos, and B. Bonavida, "Chemotherapeutic drugs sensitize cancer cells to TRAIL-mediated apoptosis: up-regulation of DR5 and inhibition of Yin Yang 1," Molecular Cancer Therapeutics, vol. 6, no. 4, pp. 1387-1399, 2007.
[16] R. Baskar, K. A. Lee, R. Yeo, and K. W. Yeoh, "Cancer and radiation therapy: current advances and future directions," International Journal of Medical Sciences, vol. 9, no. 3, pp. 193-199, 2012.
[17] L. Zhang, M. Bochkur Dratver, T. Yazal et al., "Mebendazole potentiates radiation therapy in triple-negative breast cancer," International Journal of Radiation Oncology · Biology · Physics, vol. 103, no. 1, pp. 195-207, 2019.
[18] N. Utku, "New approaches to treat cancer - what they can and cannot do," Biotechnology Healthcare, vol. 8, no. 4, pp. 25-27.
[19] J. Blakeley, "Drug delivery to brain tumors," Current Neurology and Neuroscience Reports, vol. 8, no. 3, pp. 235-241, 2008.
[20] P. Mobadersany, S. Yousefi, M. Amgad et al., "Predicting cancer outcomes from histology and genomics using convolutional networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 115, no. 13, pp. E2970-E2979, 2018.
[21] S. P. S. Baker and A. Korhonen, Cancer hallmark text classification using ConvNets, BioTxtM, 2016.
[22] W. Hussain, Y. D. Khan, N. Rasool, S. A. Khan, and K. C. Chou, "SPrenylC-PseAAC: a sequence-based model developed via Chou's 5-steps rule and general PseAAC for identifying S-prenylation sites in proteins," Journal of Theoretical Biology, vol. 468, pp. 1-11, 2019.
[23] Y. D. Khan, N. Rasool, W. Hussain, S. A. Khan, and K. C. Chou, "iPhosY-PseAAC: identify phosphotyrosine sites by incorporating sequence statistical moments into PseAAC," Molecular Biology Reports, vol. 45, no. 6, pp. 2501-2509, 2018.
[24] S. Naseer, R. F. Ali, Y. D. Khan, and P. Dominic, "iGluK-Deep: computational identification of lysine glutarylation sites using deep neural networks with general pseudo amino acid compositions," Journal of Biomolecular Structure and Dynamics, pp. 1-14, 2021.
[25] M. K. Mahmood, A. Ehsan, Y. D. Khan, and K.-C. Chou, "iHyd-LysSite (EPSV): identifying hydroxylysine sites in protein using statistical formulation by extracting enhanced position and sequence variant feature technique," Current Genomics, vol. 21, pp. 536-545, 2020.
[26] S. Naseer, W. Hussain, Y. D. Khan, and N. Rasool, "Optimization of serine phosphorylation prediction in proteins by comparing human engineered features and deep representations," Analytical Biochemistry, vol. 615, article 114069, 2021.
[27] S. Naseer, W. Hussain, Y. D. Khan, and N. Rasool, "NPalmitoylDeep-PseAAC: a predictor of N-palmitoylation sites in proteins using deep representations of proteins and PseAAC via modified 5-steps rule," Current Bioinformatics, vol. 16, pp. 294-305, 2021.
[28] A. A. Shah and Y. D. Khan, "Identification of 4-carboxyglutamate residue sites based on position based statistical feature and multiple classification," Scientific Reports, vol. 10, pp. 1-10, 2020.
[29] M. S. Rahman, S. Shatabda, S. Saha, M. Kaykobad, and M. S. Rahman, "DPP-PseAAC: a DNA-binding protein prediction model using Chou's general PseAAC," Journal of Theoretical Biology, vol. 452, pp. 22-34, 2018.
[30] D. S. Cao, Q. S. Xu, and Y. Z. Liang, "propy: a tool to generate various modes of Chou's PseAAC," Bioinformatics, vol. 29, no. 7, pp. 960-962, 2013.
[31] P. Tripathi and P. N. Pandey, "A novel alignment-free method to classify protein folding types by combining spectral graph clustering with Chou's pseudo amino acid composition," Journal of Theoretical Biology, vol. 424, pp. 49-54, 2017.
[32] F. Javed and M. Hayat, "Predicting subcellular localization of multi-label proteins by incorporating the sequence features into Chou's PseAAC," Genomics, vol. 111, no. 6, pp. 1325-1332, 2019.
[33] L. Zhang and L. Kong, "iRSpot-ADPM: identify recombination spots by incorporating the associated dinucleotide product model into Chou's pseudo components," Journal of Theoretical Biology, vol. 441, pp. 1-8, 2018.
[34] C. Huang and J. Q. Yuan, "Predicting protein subchloroplast locations with both single and multiple sites via three different modes of Chou's pseudo amino acid compositions," Journal of Theoretical Biology, vol. 335, no. 22, pp. 205-212, 2013.
[35] K. C. Chou, "Some remarks on protein attribute prediction and pseudo amino acid composition," Journal of Theoretical Biology, vol. 273, no. 1, pp. 236-247, 2011.
[36] K. C. Chou, "Prediction of protein cellular attributes using pseudo-amino acid composition," Proteins: Structure, Function, and Genetics, vol. 43, no. 3, pp. 246-255, 2001.
[37] X. Fu, W. Zhu, B. Liao, L. Cai, L. Peng, and J. Yang, "Improved DNA-binding protein identification by incorporating evolutionary information into the Chou's PseAAC," IEEE Access, vol. 6, pp. 66545-66556, 2018.
[38] J. Jia, Z. Liu, X. Xiao, B. Liu, and K. C. Chou, "pSuc-Lys: predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach," Journal of Theoretical Biology, vol. 394, pp. 223-230, 2016.
[39] Y. D. Khan, F. Ahmed, and S. A. Khan, "Situation recognition using image moments and recurrent neural networks," Neural Computing and Applications, vol. 24, no. 7-8, pp. 1519-1529.
[40] M. A. Akmal, N. Rasool, and Y. D. Khan, "Prediction of N-linked glycosylation sites using position relative features and statistical moments," PLoS One, vol. 12, no. 8, pp. 1-21.
[41] H. Cao, S. Bernard, R. Sabourin, and L. Heutte, "Random forest dissimilarity based multi-view learning for radiomics application," Pattern Recognition, vol. 88, pp. 185-197, 2019.
[42] C. Kathuria, D. Mehrotra, and N. K. Misra, "Predicting the protein structure using random forest approach," Procedia Computer Science, vol. 132, pp. 1654-1662, 2018.
[43] M. Ballings, D. Van Den Poel, N. Hespeels, and R. Gryp, "Evaluating multiple classifiers for stock price direction prediction," Expert Systems with Applications, vol. 42, no. 20, pp. 7046-7056, 2015.
[44] S. Muthusamy, L. P. Manickam, V. Murugesan, C. Muthukumaran, and A. Pugazhendhi, "Pectin extraction from Helianthus annuus (sunflower) heads using RSM and ANN modelling by a genetic algorithm approach," International Journal of Biological Macromolecules, vol. 124, pp. 750-758, 2019.
[45] L. Jiang, J. Zhang, P. Xuan, and Q. Zou, "BP neural network could help improve pre-miRNA identification in various species," BioMed Research International, vol. 2016, Article ID 9565689, 11 pages, 2016.
[46] A. S. Ettayapuram Ramaprasad, S. Singh, R. P. S. Gajendra, and S. Venkatesan, "AntiAngioPred: a server for prediction of anti-angiogenic peptides," PLoS One, vol. 10, no. 9, pp. 7-12, 2015.
[47] P. Sudha, D. Ramyachitra, and P. Manikandan, "Enhanced artificial neural network for protein fold recognition and structural class prediction," Gene Reports, vol. 12, pp. 261-275.
[48] J. Ahmad and M. Hayat, "MFSC: multi-voting based feature selection for classification of Golgi proteins by adopting the general form of Chou's PseAAC components," Journal of Theoretical Biology, vol. 463, pp. 99-109, 2019.
[49] A. H. Butt, N. Rasool, and Y. D. Khan, "Predicting membrane proteins and their types by extracting various sequence features into Chou's general PseAAC," Molecular Biology Reports, vol. 45, no. 6, pp. 2295-2306, 2018.
[50] Y. Xu, X. J. Shao, L. Y. Wu, N. Y. Deng, and K. C. Chou, "iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins," PeerJ, vol. 2013, no. 1, pp. 1-18, 2013.
[51] P. M. Feng, H. Ding, W. Chen, and H. Lin, "Naïve Bayes classifier with feature selection to identify phage virion proteins," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 530696, 6 pages, 2013.
[52] A. H. Butt, S. A. Khan, H. Jamil, N. Rasool, and Y. D. Khan, "A prediction model for membrane proteins using moments based features," BioMed Research International, vol. 2016, Article ID 8370132, 7 pages, 2016.
[53] X. Cui, Z. Yu, B. Yu, M. Wang, B. Tian, and Q. Ma, "UbiSitePred: a novel method for improving the accuracy of ubiquitination sites prediction by using LASSO to select the optimal Chou's pseudo components," Chemometrics and Intelligent Laboratory Systems, vol. 184, pp. 28-43, 2019.
[54] M. A. Akmal, W. Hussain, N. Rasool, Y. D. Khan, S. A. Khan, and K. C. Chou, "Using Chou's 5-steps rule to predict O-linked serine glycosylation sites by blending position relative features and statistical moment," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 5963, no. c, pp. 1-1, 2020.
[55] A. O. Almagrabi, Y. D. Khan, and S. A. Khan, "iPhosD-PseAAC: identification of phosphoaspartate sites in proteins using statistical moments and PseAAC," Biocell, vol. 45, no. 5, pp. 1287-1298, 2021.
[56] M. Awais, W. Hussain, N. Rasool, and Y. D. Khan, "iTSP-PseAAC: identifying tumor suppressor proteins by using fully connected neural network and PseAAC," Current Bioinformatics, vol. 16, no. 5, pp. 700-709, 2021.
[57] W. Hussain, N. Rasool, and Y. D. Khan, "A sequence-based predictor of Zika virus proteins developed by integration of PseAAC and statistical moments," Combinatorial Chemistry & High Throughput Screening, vol. 23, no. 8, pp. 797-804.
[58] Y. D. Khan, E. Alzahrani, W. Alghamdi, and M. Z. Ullah, "Sequence-based identification of allergen proteins developed by integration of PseAAC and statistical moments via 5-step rule," Current Bioinformatics, vol. 15, pp. 1046-1055, 2020.
[59] Y. D. Khan, N. S. Khan, S. Naseer, and A. H. Butt, "iSUMOK-PseAAC: prediction of lysine sumoylation sites using statistical moments and Chou's PseAAC," PeerJ, vol. 9, article e11581, 2021.
[60] S. J. Malebary, R. Khan, and Y. D. Khan, "ProtoPred: advancing oncological research through identification of proto-oncogene proteins," IEEE Access, vol. 9, pp. 68788-68797.
[61] S. J. Malebary and Y. D. Khan, "Evaluating machine learning methodologies for identification of cancer driver genes," Scientific Reports, vol. 11, no. 1, pp. 1-13, 2021.
[62] N. Albugami, "Prediction of Saudi Arabia SARS-COV 2 diversifications in protein strain against China strain," VAWKUM Transactions on Computer Sciences, vol. 8, no. 1, pp. 64-67.
[63] S. J. Malebary and Y. Daanial Khan, "Identification of antimicrobial peptides using Chou's 5 step rule," Computers, Materials & Continua, vol. 67, no. 3, pp. 2863-2881, 2021.
