Classifying Antimicrobial and Multifunctional Peptides with Bayesian Network Models

Bayesian network models are finding success in characterizing enzyme-catalyzed reactions and slow conformational changes, in predicting enzyme inhibition, and in genomics. In this work, we apply them to statistical modeling of peptides by simultaneously identifying amino acid sequence motifs and using a motif-based model to clarify the role motifs may play in antimicrobial activity. We construct models of increasing sophistication, demonstrating how chemical knowledge of a peptide system may be embedded without requiring new derivation of model fitting equations after changing model structure. These models are used to construct classifiers with good performance (94% accuracy, Matthews correlation coefficient of 0.87) at predicting antimicrobial activity in peptides, while at the same time being built of interpretable parameters. We demonstrate use of these models to identify peptides that are potentially both antimicrobial and antifouling, and show that the background distribution of amino acids could play a greater role in activity than sequence motifs do. This provides an advancement in the type of peptide activity modeling that can be done and the ease with which models can be constructed.

University of Washington, Department of Chemical Engineering, Seattle, WA, USA. Tel: 01 206 616 6509; E-mail: sjiang@uw.edu
University of Rochester, Department of Chemical Engineering, Rochester, NY, USA. Tel: 01 585 276 7395; E-mail: andrew.white@rochester.edu

arXiv:1804.06327v1 [stat.AP] 17 Apr 2018

1 Introduction

Bayesian networks are a statistical modeling framework that is ideally suited for use in combination with a quantitative structure-property relationship (QSPR) modeling framework, due to their ability to encode chemical knowledge and design interpretable models [10,26].
QSPR modeling (alternatively known as QSAR, for “Quantitative Structure-Activity Relationship”) is a term used to refer to a suite of statistical modeling techniques. Its purpose is not always specifically to identify structural features and link them to activity, but to identify and/or exploit trends among the features of a chemical dataset in order to make statistical predictions. This broad class of modeling methods makes use of input data such as chemical descriptors and peptide sequences. Recent reviews may be found in Cherkasov et al., Jaworska et al. [30], and Nongonierma and FitzGerald [40].

The power of Bayesian network models lies in their ability to treat sophisticated models with general training techniques. Due to the generality of the training algorithms, even datasets like the massive ENCODE database can be analyzed. A recent review of Bayesian networks may be found in Ghahramani. Recent examples of applied Bayesian networks include predicting microcanonical melting points, modeling a terrorist network, and extrapolating clinical trial results beyond the original demographic.

When applying a Bayesian network model to small drug-like molecules, one could specify that the molecule must have a molecular weight below a cut-off, and that at least two QSPR descriptors must be in a certain range. Such constraints are difficult to embed into linear discriminant analysis, for example, and require a new derivation of the model fitting procedure. There is no such requirement in Bayesian networks, due to the generality of their training procedures. This generality also means that fast algorithms have been developed that make use of such constraints to reduce the training space. The combination of the constraints and speed allows models to be constructed that are easily interpretable.
In particular, Bayesian models have recently been employed in biological studies of a wide range of important topics, including enzyme-catalyzed reactions [13,14,17,62], slow conformational change [35], and inhibition of HIV-1 reverse transcriptase [2].

Another benefit of Bayesian networks is their ability to do multimodal modeling, that is, to combine multimodal data into a unified model. For example, combining the sequence data and chemical descriptors of a peptide is a challenging task. In a Bayesian network, two parts of the model may deal with the different data types and be connected through probability distributions. This has an advantage over other model combination techniques, such as consensus modeling, in that both models may be trained independently and later combined, rather than through a step-by-step procedure. Applying these types of models to chemistry problems will open new ways of combining data such as bioavailability descriptors, sequence models, and simulation results.

In recent years, a number of studies of antimicrobial peptides (AMPs) have been made using machine learning techniques. The Antimicrobial Peptide Database (APD) was designed to collect known peptides with antimicrobial properties. Past studies involving AMP classification include Fjell et al., who created AMPer using a hidden Markov model approach; Bradshaw et al. [9], who developed AntiBP (later improved by Kumar et al. [33]) to classify antimicrobial peptides using sequence information; and Thomas et al., who created the Collection of Anti-Microbial Peptides, a database of AMPs with built-in tools for prediction and analysis. Finally, Xiao et al. used machine learning techniques to classify AMPs by target (bacteria, viruses, etc.) in addition to antimicrobial activity alone. However, antimicrobial activity in vitro is not necessarily indicative of broad applicability for other uses.
In complex media, fouling (non-specific binding at the surface of a material) leads to a loss of activity. Thus, for an AMP to be of use in applications such as biomedical devices and marine coatings [6,34,46], it is necessary to ensure that the AMP retains both antimicrobial and antifouling properties in complex media. To this end, we have constructed accurate Bayesian models that can predict antimicrobial activity, identify sequence motifs, elucidate important descriptors, and identify potential multifunctional peptides that are both antimicrobial and antifouling.

In this work, Bayesian network models are created to predict antimicrobial activity and identify possible multifunctional peptides. Two datasets are used to train the networks. The first consists of 351 unique peptides from the APD that inhibit growth of gram-positive bacteria. The second is a collection of approximately 3,600 sequence fragments from the surface of human proteins. The second dataset is hypothesized to contain sequences which resist nonspecific interactions in biological systems [56,57]. Bayesian models are well suited to small datasets such as these, as they have demonstrable ability to be accurately trained on datasets with as few as 100 points.

In the materials and methods section, we describe the construction of the datasets, the training procedure used for the models, and the descriptors used. In the results section, we examine model accuracy, compare our models with a simpler traditional machine learning approach, and finally identify two peptides that are predicted to have antimicrobial and antifouling properties.

Figure 1. Diverse human proteins with structures in the Protein Data Bank were used to create a database of 1,162 proteins. The surface was found as described in White et al. Contiguous surface sequences were found (dark gray) and converted into sequence fragments. All fragments with length greater than 4 were used.
2 Materials and Methods

It is critical in the development of a sequence-based statistical prediction method for biological systems that the process be clear and easily reproducible. As suggested by Chou and demonstrated in a series of recent articles from that research group [14,17,59,62], a predictive statistical model for biological systems should follow these guidelines to achieve its goals: (1) construct benchmark datasets for training and testing the predictor, (2) create a suitable mathematical expression that represents the relevant properties for prediction, (3) implement an appropriate model and training algorithm, (4) perform cross-validation tests to evaluate the accuracy of prediction, and (5) establish a web-based public interface for ease of use. In the following, we address each of these steps in turn.

In order to build a classification model to identify peptides that could be both antimicrobial and antifouling, two datasets were used. The first is from the APD [55] as of September 2017 and contains 482 sequences which show activity against gram-positive bacteria. The 482 sequences were reduced to 351 by removing similar sequences. Here, “similar” sequences were defined as those separated by 2 or fewer single-position substitutions (e.g., EDGRT and ADGRS are similar). This definition of sequence similarity has no inherent chemical meaning. It was chosen as a method of removing certain sequences to reduce over-representation of combinatorial studies, where single positions are changed over multiple trials. Although one substitution can be enough to drastically change the activity of a peptide [39,48,60], it is still necessary to avoid including these similar sequences due to the statistical nature of the model. Even if changing one or two amino acids affects the activity, including many peptides with nearly identical sequences would bias the model toward the (potentially very long) unchanged parts of these peptides.
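The similarity filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, sequences of different length are assumed never to count as similar (only substitutions are considered), and the greedy first-come order in which sequences are kept is also an assumption, since the text does not say which member of a similar group was retained.

```python
from typing import List


def substitution_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_similar(a: str, b: str, max_subs: int = 2) -> bool:
    # Assumption: only substitutions count, so sequences of different
    # length are never treated as similar.
    return len(a) == len(b) and substitution_distance(a, b) <= max_subs


def remove_similar(sequences: List[str], max_subs: int = 2) -> List[str]:
    """Greedy filter: keep a sequence only if no already-kept sequence is similar."""
    kept: List[str] = []
    for seq in sequences:
        if not any(is_similar(seq, k, max_subs) for k in kept):
            kept.append(seq)
    return kept


# EDGRT and ADGRS differ at two positions, so only the first is kept.
example = ["EDGRT", "ADGRS", "KKLLK", "EDGR"]
filtered = remove_similar(example)
print(filtered)
```

Because the filter is greedy, the result can depend on input order; a different tie-breaking rule would retain a different representative of each group.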
The second dataset, “Human”, is built upon the protein dataset from White et al. All contiguous amino acid sequences of length greater than 4 present on the surface of proteins from that dataset were tabulated as independent sequences, as depicted in Figure 1. This yields 3,600 unique sequences.

Another aspect of antimicrobial activity that our datasets do not address is post-translational modification (PTM) of peptides. In fact, 1,147 of the 1,755 peptides in the APD are known to undergo PTM before activity. However, the goal of this model is not to explain all factors that lead to antimicrobial activity, but to accurately predict potential antimicrobial activity with as little information as possible, i.e. only sequence and/or chemical descriptor information.

To be of interest for applications like screening and other biochemical experiments, it is crucial to minimize the false positive rate (FPR) of the predictive model. Reducing false positives prevents the waste of experimental time and resources by precluding the investigation of an incorrectly predicted candidate peptide. Furthermore, the ability to reject false positives is an important attribute of the model itself, because it indicates that the model is not over-fit. However, to evaluate performance in this regard, negative data must be either gathered or generated; no one has tabulated a list of peptides which are not antimicrobial. Torrent et al. approached this problem by using sequences not reported to have activity, which may be a good assumption since AMPs are likely rare. We use the same approach here to evaluate model performance.
A decoy dataset was generated by replacing each residue in the APD dataset with a randomly selected amino acid drawn from the distribution of amino acids among all entries in the Protein Data Bank.

Figure 2. Histogram plots of descriptors of the Antimicrobial Peptide Database (APD) and Human datasets. Panels: ALogP*, HB Acceptors, HB Donors, Charged Groups, Polar Groups, Non-polar Groups, Aromatic Groups, and Net Charge; the x-axis of each panel is the score (0 to 100). “HB” stands for hydrogen bond. A ranking of 100 indicates a high value of the descriptor relative to all peptides. ALogP is not calculated for the APD due to its poor accuracy at the long peptide lengths found in the APD dataset. The Human aromatic histogram is skewed because most sequences in the “Human” dataset have no aromatic groups, since they are drawn only from the surface of human proteins.

The following descriptors were considered: ALogP, number of hydrogen bond acceptors, number of hydrogen bond donors, number of charged groups, number of polar groups, number of non-polar groups, number of aromatic groups, and net charge. We investigated these descriptors with the goal of finding data that would distinguish our two datasets from one another and from the space of all peptides in general. All descriptors, except ALogP, were calculated using the peplib R plugin. ALogP was calculated according to Ghose and Crippen [27] as implemented in the Chemistry Development Kit [49]. To estimate the distribution of ALogP values, the calculation was performed on all possible amino acid sequences of lengths from 1 to 3, and on a random sample of 5,000 peptides, drawn uniformly from the datasets found in Sweeney et al. [50] and Chen et al. [11], for each length up to 10.
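The decoy construction described at the start of this passage (each residue replaced by a random draw from a background amino acid distribution) can be sketched as below. This is an illustration under stated assumptions: `make_decoy` is a hypothetical helper, the two input sequences are made up, and a uniform distribution is used purely as a placeholder for the Protein Data Bank frequencies, which are not reproduced here.

```python
import random
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder only: a uniform distribution stands in for the PDB-wide
# amino acid frequencies used in the text.
UNIFORM_BACKGROUND: Dict[str, float] = {
    aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS
}


def make_decoy(sequence: str, background: Dict[str, float],
               rng: random.Random) -> str:
    """Return a decoy of the same length with every residue redrawn."""
    residues = list(background)
    weights = [background[r] for r in residues]
    return "".join(rng.choices(residues, weights=weights, k=len(sequence)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
positives = ["GLLKKLL", "KCKC"]  # made-up sequences for illustration
decoys = [make_decoy(s, UNIFORM_BACKGROUND, rng) for s in positives]
print(decoys)
```

Drawing one decoy per positive sequence keeps the length distribution of the negative set identical to that of the positive set, so length alone cannot separate the two classes.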
Using a procedure described in the Supporting Information, chemical descriptors are first converted into rankings that span 0 to 100. The rankings represent the value of the descriptor relative to all possible peptides in the dataset. Except for the ALogP descriptor, the descriptors used in this work are all additive, i.e. they are cumulative sums of the descriptors of all individual amino acids in a given peptide. The results of these calculations were tabulated as a single CSV file for each dataset, each line of which consists of a peptide (specified as a single-letter amino acid sequence) paired with the list of chemical descriptors calculated for that sequence. In the chemical descriptor model, only the descriptor values are used for training. The motif model uses only the sequences themselves as input data. Withheld testing data was selected as a random 20% of input data for both models.

The descriptor rankings were calculated on the two datasets and are shown in Figure 2. Each box shows the histogram of the descriptor rankings. A flat histogram indicates the ranks are distributed identically to all possible peptides and thus the descriptor is likely unrelated to the activity. As was desired, all the descriptors chosen distinguish both datasets from all possible peptides and from one another. The APD dataset shows an abnormally low number of charged residues and a high number of non-polar groups. Consistent with past analysis of AMPs [8,23,24], the APD dataset has a lower number of charged residues, with a skew toward positively charged residues. The number of charged residues is high on human protein surfaces, as seen previously. This is also reflected in the high water solubility (low ALogP values). There is no dominant net charge in one direction or another for human protein surfaces.
The number of aromatic residues is low, which is expected, since the number of aromatic residues is low across all proteins generally, and hydrophobic aromatic side chains mostly occur on the interior of human proteins.

2.1 Model Description

The chemical descriptor model (henceforth the “QSPR model”) was treated as a one-dimensional, two-state Gaussian mixture with respect to each descriptor. In Gaussian mixture modeling, the underlying distribution of a set of observations is estimated by fitting a function made up of a sum of k Gaussian kernels. The hyperparameters are the means, heights, and variance matrices, which are fitted for each kernel. Some background on this technique can be found in Yu et al. [61] and McNicholas and Murphy [38]. This model is one-dimensional in that each chemical descriptor was fitted to its own separate mixture-of-Gaussians distribution, with no correlation between distributions. The number of Gaussian kernels was varied from 1 to 10, with the best performance given by the 3-kernel set (see discussion and Supporting Information). This portion of the model was implemented using the PyMC3 package for Python 3. This two-state mixture model was used to classify sequences as antimicrobial using the descriptors in Figure 2, and is shown in graphical representation in Figure 3. Three descriptors were chosen based on inspection of Figure 2: net charge, number of non-polar groups, and number of charged groups, as these had the most distinct histograms between the two datasets used.

Figure 3. A graphical representation of a 2-state classifier that fits 3 observed descriptors to either distribution 0 or 1. A class of 1 indicates activity. In this work, the “Trained” nodes are our mixture-of-Gaussians distributions, and the three “Observed” nodes correspond to the three chemical descriptors chosen. (Graph legend: trained, data, observed, deterministic, and stochastic nodes; the Class node takes the value 1 or 0 and selects the descriptor distributions.)
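To make the structure of the two-state classifier concrete, the sketch below scores a peptide's three descriptor rankings under one set of one-dimensional Gaussian mixtures per class and picks the class with the higher likelihood. It is not the PyMC3 model used in this work: all mixture parameters are made-up placeholders on the 0 to 100 ranking scale, the same placeholder mixture is reused for every descriptor, the descriptor keys are hypothetical names, and equal class priors are assumed.

```python
import math
from typing import Dict, List, Tuple

Component = Tuple[float, float, float]  # (weight, mean, standard deviation)
Model = Dict[str, List[Component]]      # one 1-D mixture per descriptor


def mixture_logpdf(x: float, components: List[Component]) -> float:
    """Log density of a one-dimensional Gaussian mixture at x."""
    density = sum(
        w * math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        for w, mu, sd in components
    )
    return math.log(density) if density > 0.0 else float("-inf")


def class_loglik(descriptors: Dict[str, float], model: Model) -> float:
    # Descriptors are treated as independent, as in the text, so their
    # log-likelihoods simply add.
    return sum(mixture_logpdf(descriptors[name], comps)
               for name, comps in model.items())


def classify(descriptors: Dict[str, float], active: Model, inactive: Model) -> int:
    """Return 1 (active) or 0 (inactive), assuming equal class priors."""
    return int(class_loglik(descriptors, active) > class_loglik(descriptors, inactive))


# Made-up 3-kernel mixtures; real values would come from training.
ACTIVE_KERNELS = [(0.6, 80.0, 8.0), (0.3, 60.0, 10.0), (0.1, 30.0, 15.0)]
INACTIVE_KERNELS = [(0.2, 80.0, 15.0), (0.3, 50.0, 15.0), (0.5, 20.0, 10.0)]
NAMES = ["net_charge", "n_charged", "n_nonpolar"]
ACTIVE: Model = {name: ACTIVE_KERNELS for name in NAMES}
INACTIVE: Model = {name: INACTIVE_KERNELS for name in NAMES}

high = {name: 80.0 for name in NAMES}
low = {name: 20.0 for name in NAMES}
print(classify(high, ACTIVE, INACTIVE), classify(low, ACTIVE, INACTIVE))
```

In the full Bayesian treatment the kernel weights, means, and variances are sampled rather than fixed, but the classification step has this same shape: evaluate each class's trained distributions at the observed descriptors and compare.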
Next, a motif model was constructed which classifies sequences based on the existence of motifs in the sequence. Emphasis was placed on keeping the model interpretable to non-experts. For this reason the following attributes were chosen:
(1) There are 0 to k possible motif classes that may be observed.
(2) Each peptide belongs to only one motif class.
(3) Motifs may not be partially expressed.
(4) Non-motif residues are drawn from a “background” distribution that is shared among all peptides in the dataset.
(5) Motifs are of fixed length w.
(6) The probability of a motif starting at a sequence position is independent of the position.
(7) Motif distributions should be sparse, i.e. they should have few non-negligible entries; usually only one amino acid exists in each position.

This model formulation is in part inspired by previous work in motif identification, such as Bailey and Elkan [4] and Schwartz and Gygi [47]. The features that are unique to this description relative to past motif models [3,58] are the regularization of motifs, a tied background distribution, independent motif start probability, and the ability to deal with variable length sequences. The regularization method chosen (L1 regularization, see Section 2.2) forces the motifs to be sparse so that each motif position only has one or two possible residues. This makes motif interpretation more intuitive. The tied background distribution reduces the number of model parameters by (k − 1)(A − 1), where A is the number of amino acids. This background distribution can also be used as the motif model by itself if we let k = 0. This “Background-only” model is a limiting case of the motif model, but is otherwise specified and trained in an identical fashion. Such a change greatly complicates traditional algebraic analysis of the model, but is simple to include with this formulation.
The uniform motif start probability reduces the number of parameters by k(l − 1), where l is the length of the sequences. The ability to deal with variable length sequences without pre-alignment is a significant feature and is what allows modeling of the highly heterogeneous APD. The model description is thus far complex, but the trade-off is that the parameters that are derived from this model are intuitive and few. The motif model was implemented directly in Python as an extension module written in C++. Complete model specifications can be found in the Supporting Information.

The last classifier considered combines features from the previous two. Its graphical representation is shown in Figure 5a. In this model, the “Class” node shown in Figure 3 is connected to the “Member” node shown in Figure 4a. This indicates incorporation of the descriptor distributions shown in Figure 3 into a new “QSPR” block in the graph in Figure 4a. Thus, the model takes in both sequence information and descriptors. The best-performing motif number (k = 8) and motif width (w = 3) from the motif model, and the same descriptors (net charge, number of charged groups, and number of non-polar groups) from the QSPR model, were used. The complete model specification is given in the Supporting Information.

Figure 4. Panel a is the graphical representation of the motif model used. The middle parts of the graph are repeated as many times as necessary to fit the length of a sequence (length 4 depicted). Panel b is an inset showing how the motif indicator governs whether the probability for a given amino acid is drawn from the background or motif distribution. Panel c is the prediction accuracy as a function of the motif width and motif number. A motif length of 0 indicates the performance of the background-only model. The maximum prediction accuracy (79%) of the motif model was the same as the background-only model.
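The motif model attributes can be illustrated with a simplified likelihood calculation. The sketch below is not the C++ extension module described above: it assumes a peptide in a given motif class contains exactly one fully expressed motif of width w, that every admissible start position is equally likely, and that all other residues come from the shared background. The function names, the two sparse motifs, and the uniform background are placeholders chosen for illustration.

```python
import math
from typing import Dict, List

Dist = Dict[str, float]  # amino acid -> probability


def background_likelihood(seq: str, background: Dist) -> float:
    """Likelihood under the background-only (k = 0) limiting case."""
    return math.prod(background[a] for a in seq)


def motif_likelihood(seq: str, motif: List[Dist], background: Dist) -> float:
    """Average over uniformly likely start positions of one full motif."""
    w = len(motif)
    n_starts = len(seq) - w + 1
    if n_starts <= 0:
        return 0.0  # a motif may not be partially expressed
    total = 0.0
    for start in range(n_starts):
        p = 1.0
        for i, a in enumerate(seq):
            if start <= i < start + w:
                p *= motif[i - start].get(a, 0.0)  # residue inside the motif
            else:
                p *= background[a]                 # residue from background
        total += p
    return total / n_starts


def best_motif(seq: str, motifs: List[List[Dist]], background: Dist) -> int:
    """Index of the motif class giving the highest likelihood for seq."""
    scores = [motif_likelihood(seq, m, background) for m in motifs]
    return max(range(len(motifs)), key=scores.__getitem__)


# Placeholder parameters: uniform background and two fully sparse motifs.
bg: Dist = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}
motifs = [
    [{"G": 1.0}, {"L": 1.0}, {"L": 1.0}],  # GLL
    [{"K": 1.0}, {"L": 1.0}, {"L": 1.0}],  # KLL
]
print(best_motif("AGLLK", motifs, bg))
print(motif_likelihood("AGLLK", motifs[0], bg), background_likelihood("AGLLK", bg))
```

Because the start position is marginalized rather than fixed, sequences of any length can be scored without alignment, which is the property the text highlights for the heterogeneous APD.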
Figure 5. Graphical representation and classification performance of the combined QSPR/motif model. (a) Graph of the combined QSPR/motif model. The “omitted motif block” refers to Figure 4a. The “Member” value drawn from the model shown in Figure 3 is used as the membership value in the rest of the model (Figure 4a). (b) Performance of the combined QSPR + motif model on the APD dataset with distributions from the 3-kernel QSPR model and the 8-motif, length-3 motif model. The x-axis is the weight assigned to the motif half of the model (0 is no influence and 1 is total influence).

2.2 Model Training and Validation

The QSPR model was trained using the built-in Metropolis-Hastings sampling algorithm for Hybrid Monte Carlo as implemented in the PyMC3 package (v3.0) for Python 3. See the Supporting Information for complete model specification. The model parameters were initialized uniformly and trained for 3000 steps. In all cases, leave-one-out cross-validation was performed via the PyMC3 built-in implementation of Vehtari et al.

The motif model was trained using Gibbs sampling with constrained, per-coordinate, infinite-horizon stochastic gradient descent using L1 regularization with a squared-difference loss function. An analysis of this method can be found in McMahan and Streeter. A mathematical description follows.

Given a $D$-dimensional vector of amino acid distributions $X^{(t)}$, let $\bar{n}^{(t)} = N X^{(t)}$ be the vector of expected counts from each distribution at timestep $t$, with $N$ the total number of observations, let $m^{(t)}$ be the observed counts given by the observation (draw) made at step $t$, with regularization term $\|x\|_1$, and let $\nu$ represent some uniformly random noise vector. We define the loss function $L$, the learning rate $\eta^{(t)}$, and the regularization term $\|x\|_1$ as:

\[
\begin{aligned}
L &= \left(\bar{n}^{(t)} - m^{(t)}\right)^2 + \|x\|_1, \\
\eta^{(t)} &= \left[\sum_{\tau=0}^{t-1} \left(\frac{\partial L}{\partial X^{(\tau)}}\right)^2\right]^{-1/2}, \quad \text{and} \\
\|x\|_1 &= \lambda \sum_{i=1}^{D} \left|\bar{n}_i\right|.
\end{aligned}
\tag{1}
\]

Using the definitions in Equation 1, and given an initial distribution vector $X^{(0)}$, the update to $X^{(t)}$ at timestep $t$ is given as

\[
\begin{aligned}
X^{(t+1)} &= X^{(t)} - \eta^{(t)} \frac{\partial L}{\partial X^{(t)}} + \nu, \quad \text{with} \\
\frac{\partial L}{\partial X^{(t)}} &= 2N\left(\bar{n}^{(t)} - m^{(t)}\right) + \lambda.
\end{aligned}
\tag{2}
\]

In general, L1 regularization is defined as $\|x\|_1 = \lambda \sum_{i=1}^{D} |y_i - f(x_i)|$, where the $y_i$ term refers to the “target value,” and $\lambda$ is an adjustable parameter. In our case, $y_i = 0 \;\forall i$. This indicates a low belief in any value above zero for our motif distributions, and induces sparsity in the final distributions. The $\nu$ term is a stochastic noise term that uniformly adds an observation at random at each update step. This helps overcome overfitting by exploring more of the sample space.

Typical cross-validation of this motif model is not sufficient to evaluate all aspects of its performance. Its purpose is not only classification, but also identification of motifs among peptides, which are relatively rare. We evaluate prediction accuracy via withheld testing data from the APD dataset, but it is important to also validate the intended motif-capturing behavior of the model separately. However, because the entries in the APD do not have labeled motifs, and are not guaranteed to all contain motifs, it is impossible to validate this model’s ability to capture motif information from these positive cases. Thus, to evaluate the ability to identify motifs, trial runs were performed with arbitrarily constructed, small datasets with imposed motifs. One or more fixed motifs (e.g. QAFR, IEKG, etc.) were selected, and background members consisting of uniformly distributed amino acids were appended and prepended to the motifs randomly. For example, test peptides containing the QAFR motif might be ARQAFROI or IQAFRGMO. During training, these datasets had a random 20% withheld as testing data. The artificially constructed motifs were captured accurately by the model with as few as 500 iterations over the data set.
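The construction of the artificial validation sets described above can be sketched as follows. This is an illustration only: the function names, the fixed peptide length of 8, the restriction to the 20 standard amino acids, and the seeds are assumptions made for the example, not details taken from the text.

```python
import random
from typing import List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed_motif(motif: str, total_length: int, rng: random.Random) -> str:
    """Surround a fixed motif with uniformly random residues on both sides."""
    n_flank = total_length - len(motif)
    if n_flank < 0:
        raise ValueError("total_length is shorter than the motif")
    n_left = rng.randint(0, n_flank)  # random split between prefix and suffix
    left = "".join(rng.choice(AMINO_ACIDS) for _ in range(n_left))
    right = "".join(rng.choice(AMINO_ACIDS) for _ in range(n_flank - n_left))
    return left + motif + right


def make_dataset(motifs: List[str], n_peptides: int, total_length: int,
                 seed: int = 0) -> List[str]:
    rng = random.Random(seed)
    return [embed_motif(rng.choice(motifs), total_length, rng)
            for _ in range(n_peptides)]


def split_train_test(data: List[str], test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[str], List[str]]:
    """Withhold a random fraction of the data as a testing set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


imposed = ["QAFR", "IEKG"]
dataset = make_dataset(imposed, n_peptides=100, total_length=8)
train, test = split_train_test(dataset)
print(len(train), len(test), dataset[:3])
```

Since the imposed motifs are known, a fitted model can be checked directly against them, which is not possible for the unlabeled APD sequences.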
Figures S5-S8 show the fitted motif distributions for the data with the imposed motif ARND.

Combining the two models requires no additional training step because the two halves of the combined model were trained previously. The trained distributions from the two halves of the model were used to evaluate likelihoods for the positive and negative datasets used previously. These likelihoods are normalized by dividing the likelihoods produced by each half of the model by the highest likelihood in that half. Then, weights W ranging from 0 to 1 were assigned to the motif model, with 1 − W being assigned to the QSPR model. The sum of these two weighted likelihoods for a given peptide is the likelihood produced by the combined model for that peptide. For each weight, a receiver operating characteristic (ROC) curve is generated. The ROC curve is generated by calculating the false positive rate (FPR) and true positive rate (TPR) of classification on the training data as the cutoff value for likelihood is varied between 0 and 100% of the maximum likelihood produced by the model. A data point that scores a likelihood above the cutoff value indicates a positive (i.e., the model predicts it to be antimicrobial). The FPR is the fraction of such points from the non-antimicrobial (negative) testing set, and the TPR is the fraction from the antimicrobial (positive) testing set. The accuracy values displayed in Figure 5b are the accuracy calculated on the withheld testing dataset at the optimal cutoff for the ROC curve at each weighting on the x-axis.

3 Results: Bayesian Network Models that Predict Activity

We have created three increasingly sophisticated models for predicting activity. Figure 6 shows example ROC curves for the best parameter sets for the QSPR and motif models. A summary of results using the best parameter sets for the three models is shown in Table 1.

Figure 6. Example receiver operating characteristic (ROC) curves for the two model types. (a) The ROC curve for the 3-kernel Gaussian mixture QSPR model. (b) The ROC curve for the motif model with 8 possible motifs of length 3. TPR is the true positive rate, and FPR is the false positive rate. The best cutoff was defined as the point which minimized $2\,\mathrm{FPR}^2 + (1 - \mathrm{TPR})^2$. This objective function was chosen to put an emphasis on a lower FPR.

Performance of each model was evaluated based on the accuracy produced by the model at the point on the ROC curve (generated using withheld testing data) that minimized the value $2\,\mathrm{FPR}^2 + (1 - \mathrm{TPR})^2$. This choice was made to emphasize a lower FPR. In the case of the Gaussian mixture model, the kernel number was varied between 2 and 10, and accuracy was evaluated for all three of the chosen chemical descriptors with that kernel number. This choice was made to simplify model specification and to limit the number of training sessions. While the same kernel number may not be optimal for each individual descriptor, the model produced sufficient accuracy with this simplification. For the motif model, motif count varied between k = 2 and k = 10, and motif length varied between w = 3 and w = 8. The “background-only” (i.e. k = w = 0) model was also evaluated.

Some QSPR trials with different kernel numbers produced very similar performances. In particular, the 3-kernel and 6-kernel trials both produced accuracy above 80%. Thus, it was necessary to decide which trial’s distributions to use for testing. The 3-kernel data was used after considering the distributions depicted in Figure S9. The 3-kernel and 6-kernel QSPR models performed nearly identically in terms of ROC and prediction accuracy (Figure S9a and S9b), but comparing Figure S9c and S9d shows that the 3-kernel model’s distribution has a shape that is more indicative of the underlying structure of the raw data.

The motif model had optimal performance with k = 8 and w = 3, and the resultant ROC curve is shown in Figure 6b.
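The weighting of the two model halves and the cutoff criterion $2\,\mathrm{FPR}^2 + (1 - \mathrm{TPR})^2$ can be combined in one short sketch. The code below is a hedged illustration rather than the procedure actually run: the function names are hypothetical, the likelihood values and labels are toy numbers, and the cutoff is swept on an evenly spaced grid of 101 points, which is an assumption.

```python
from typing import List, Tuple


def normalize(likelihoods: List[float]) -> List[float]:
    """Divide by the largest likelihood so both halves share a 0-1 scale."""
    top = max(likelihoods)
    return [x / top for x in likelihoods]


def combine(motif_lik: List[float], qspr_lik: List[float], w: float) -> List[float]:
    """Weighted sum: weight w on the motif half, 1 - w on the QSPR half."""
    m, q = normalize(motif_lik), normalize(qspr_lik)
    return [w * a + (1.0 - w) * b for a, b in zip(m, q)]


def rates(scores: List[float], labels: List[int], cutoff: float) -> Tuple[float, float]:
    """Return (FPR, TPR); a score strictly above the cutoff is a positive call."""
    fp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y == 0)
    tp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y == 1)
    return fp / labels.count(0), tp / labels.count(1)


def best_cutoff(scores: List[float], labels: List[int],
                n_steps: int = 100) -> Tuple[float, float, float]:
    """Sweep the cutoff from 0 to the maximum score; minimize 2*FPR^2 + (1-TPR)^2."""
    top = max(scores)
    best = None
    for i in range(n_steps + 1):
        cutoff = top * i / n_steps
        fpr, tpr = rates(scores, labels, cutoff)
        objective = 2.0 * fpr ** 2 + (1.0 - tpr) ** 2
        if best is None or objective < best[0]:
            best = (objective, cutoff, fpr, tpr)
    return best[1], best[2], best[3]


# Toy likelihoods for two positives (label 1) and two negatives (label 0).
motif_lik = [0.8, 0.6, 0.2, 0.1]
qspr_lik = [4.0, 3.0, 1.0, 2.0]
labels = [1, 1, 0, 0]
combined = combine(motif_lik, qspr_lik, w=0.21)
cutoff, fpr, tpr = best_cutoff(combined, labels)
print(combined, cutoff, fpr, tpr)
```

The factor of 2 on the FPR term means a given false positive rate is penalized twice as heavily as the same shortfall in true positive rate, which is how the criterion expresses the stated preference for a low FPR.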
A heatmap of accuracies with different k and w values is shown in Figure 4c. Surprisingly, the model’s optimal accuracy with motifs is identical to its accuracy with 0 motifs (background only): both the k = 8, w = 3 and the k = 0, w = 0 models have an accuracy of 79% at the optimal cutoff. The ROC curve of the background-only motif model is shown in Figure S10. This may indicate that motifs are not important for antimicrobial activity, but merely the overall distribution of amino acids present, or it could indicate that motifs play a more complicated role in antimicrobial behavior than this model is able to capture.

The parameters that gave the best performance for the two individual models were also used for the combined model. The weights assigned to each half of the model were varied continuously from 0 to 1 to determine the optimal hyperparameter. Likelihoods from the two models were re-weighted by dividing all likelihoods for one model type by the highest likelihood produced by that model, to achieve comparable magnitudes from the two models. These results are shown in Figure 5b. At its best performance, the combined model outperformed both of the individual models, with a weight of 21% for the motif part of the model and 79% for the QSPR part producing a classification accuracy of 94%.

The ability to interpret the model may be seen in Figure 7. Figure 7a shows the probability distribution from the second motif from the k = 8, w = 3 model. The regularization operates as expected and the motifs are sparse; only one or two amino acids have non-negligible probability for a given position. Figure 7b shows all the motifs predicted by the model. The “Predict” column shows the count of peptides for which the given motif had the highest likelihood of appearing. This should be interpreted as the “best match” motif for a given peptide.
It does not imply that the peptide contains that motif, but only that the shown motif gives the highest likelihood for that peptide among all motifs predicted by the model. The next column contains the number of sequences which actually contain each motif, obtained by exhaustive analysis. We see the model has correctly assigned each motif to the corresponding sequence based on the close match between the predict and found columns. There are relatively few examples of the motifs discovered by the model, but it does capture some common ones. For example, exhaustive analysis shows that GLL is the third most common 3-letter motif in the APD dataset, and it is captured by the model, as shown in Figure 7b. GGG is also among the 10 most common motifs found from brute-force analysis, and is also captured by the model. However, the model did not capture all the most common motifs.

Figure 7. Panel a shows the probability that a given amino acid appears in each of positions 1 through 3 in the motif “[LK][LK][CP]”. Due to the sparsity from regularization, the majority of the motifs predicted by the model are sparse (only one amino acid with non-negligible probability for that motif position). Panel b is the list of motifs predicted by the model. The “Predict” column is the number of sequences which are more likely to contain the corresponding motif than any other motif. The “Found” column is the number of sequences that actually contain the motif. Panel c is the background distribution of amino acids from the motif model. The y-axis is the probability of observing a randomly chosen amino acid from a random peptide in this dataset (axis range 0.00 to 0.12).

Motifs listed in Figure 7b:

Motif          Predict   Found
CIA            32        2
[LK][LK][CP]   12        32
GCC            75        2
CKC            32        4
GGG            98        13
GLL            19        29
KLL            34        13
LLL            49        1

Finally, the background distribution is shown in Figure 7c. This may be considered the amino acid composition of the APD without the motifs observed in sequences.
The background distribution is not uniform, it differs from that of the Human dataset and, as mentioned above, it contributes significantly to the performance of the classifier. This is not unexpected, since amino acid composition is a well-used descriptor for analyzing peptides and proteins. Furthermore, the identical accuracy of the best-case motif model and the “background-only” model, as well as the failure of the model to capture all the most common motifs, indicates that either motifs are unimportant in the antimicrobial activity of peptides, or the model is insufficient to capture the important aspects of peptide motifs.

In order to compare our model with a traditional machine learning method, we also evaluated the performance of a linear support vector machine (SVM) on the chemical descriptors used in the QSPR model. The datasets used were the same. We utilized the built-in SVM method of the scikit-learn Python 3 package, with 3000 training steps. After convergence, the SVM had an accuracy of 84% with its optimal cutoff, which is lower than that of the QSPR and QSPR + motif models (87% and 94%, respectively).

4 Discussion: Identifying Multi-Functional Peptides

As shown above, it is possible to construct accurate models that can predict peptide activity. These tools can be further used to find peptides which have multiple activities or multiple functions. As stated in the introduction, the Human dataset contains peptides which are likely antifouling. To find a peptide that is both antimicrobial and antifouling, we can identify a peptide from the APD that scores as active according to a model trained on the Human dataset. The opposite procedure is possible using the models trained above, where we find a human protein fragment that is likely antimicrobial. However, there is no experimental evidence that such a fragment is antifouling.
Choosing a peptide from the APD that is human-like will guarantee at a minimum that it has antimicrobial activity, and that the model predicts it is similar in character to sequences found on the surfaces of human proteins. There is no evidence showing that motifs are relevant for antifouling, but past research has shown that strong net neutral partial charges, hydrophilicity, and low self-interaction are [41,58], so the QSPR model described above was used.

Model           FPR     TPR    Accuracy   MCC
QSPR            8.1%    83%    87%        0.75
Motif           17%     75%    79%        0.58
QSPR + Motif    3.4%    90%    94%        0.87

Table 1 Summary of models that best predicted antimicrobial activity. The QSPR model is depicted in Figure 3, the motif model in Figure 4, and the QSPR + motif model in Figure 5a. FPR and TPR are false positive and true positive rates of classification, respectively. MCC is the Matthews correlation coefficient.

The model was fit to the Human dataset with the same procedure as was used on the APD. The 8-kernel QSPR model performed the best for this dataset, with an accuracy of 67%. Overall, the model performance on the Human dataset was worse than on the APD, as can be seen from the ROC curve in Figure S11. This is likely due to the multimodality in the Human dataset, as well as its broad definition of activity (being present on the surface of a human protein). However, the model still achieved a low FPR, as desired. This suggests that the modeling approach remains usable when it is trained against a different dataset. After analyzing the descriptors of the APD against the optimal cutoff for the 8-kernel QSPR model trained on the Human dataset, approximately 30% of the APD peptides were found to be human-like. A peptide was designated as “human-like” if it scored above the cutoff used to produce the optimal accuracy on the ROC curve of the QSPR model fitted to the Human data.
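The cutoff selection and the “human-like” designation just described can be sketched as follows, together with the FPR, TPR, accuracy and MCC quantities reported in Table 1. This is a minimal sketch: all scores and labels are hypothetical toy values, the function names are not from the paper, and treating a score equal to the cutoff as active is an assumption.

```python
import math


def confusion_counts(scores, labels, cutoff):
    """Return (tp, fp, tn, fn) when 'score >= cutoff' is predicted active."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= cutoff
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def summarize(scores, labels, cutoff):
    """FPR, TPR, accuracy and Matthews correlation coefficient at one cutoff."""
    tp, fp, tn, fn = confusion_counts(scores, labels, cutoff)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "FPR": fp / (fp + tn) if fp + tn else 0.0,
        "TPR": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(labels),
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def optimal_cutoff(scores, labels):
    """Observed score that maximizes accuracy when used as the cutoff."""
    return max(sorted(set(scores)),
               key=lambda cutoff: summarize(scores, labels, cutoff)["accuracy"])


def fraction_above(scores, cutoff):
    """Fraction of peptides scoring at or above the cutoff."""
    return sum(score >= cutoff for score in scores) / len(scores)


# Hypothetical scores from a model fitted to one dataset (1 = active).
train_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
train_labels = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
cutoff = optimal_cutoff(train_scores, train_labels)
metrics = summarize(train_scores, train_labels, cutoff)

# Hypothetical scores of peptides from a second dataset under the same model.
other_scores = [0.45, 0.1, 0.2, 0.35, 0.6, 0.05, 0.15, 0.3, 0.25, 0.5]
human_like_fraction = fraction_above(other_scores, cutoff)
```

Reusing the cutoff chosen on the training data, rather than re-optimizing it on the second dataset, is what makes the resulting fraction a statement about how the second dataset looks to the first model.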
The low percentage of human-like peptides shows that AMPs are generally different from human protein surfaces, which are thought to be optimized for minimal nonspecific interactions [56,57].

After omitting sequences less than 30 amino acids long, the most human-like AMPs are WKSESLCTPGCVTGALQTCFLQTLTCNCKISK (APD number AP00206) and ITSISLCTPGCKTGALMGCNMKTATCHCSIHVSK (APD number AP00205). The first is subtilin, an antibiotic produced by the bacterium and model organism Bacillus subtilis [36,55]. The second, nisin A, is produced by the bacterium Lactococcus lactis, a species used in the production of cheeses [44,55]. They are similar to sequences from the “Human” dataset due to their low number of non-polar groups, high number of charged groups, and slightly negative net charge. Due to the connection between low protein adsorption and human protein surfaces, these two sequences may be good candidates for stable (non-fouling) antimicrobial surface coatings. Both of these sequences underwent PTM before antimicrobial behavior was observed, yet the model was still able to predict their antimicrobial nature without this information. This demonstrates that it is possible to predict antimicrobial potential for a given sequence without knowledge of PTMs. Thus, this model has the advantage of needing little information while still providing high classification accuracy, but it also has the disadvantage of being unable to predict whether PTMs are necessary for activity. Although this model cannot indicate whether PTM will be necessary for a given peptide, it could still be used to screen or evaluate candidate sequences for PTM experiments.

5 Conclusions

The application of Bayesian network models to QSPR peptide modeling techniques has been introduced utilizing open-source statistical modeling software. These models are flexible and may encode sophisticated chemical knowledge, as seen from the motif model presented.
This flexibility also allows models to be constructed with easy-to-interpret parameters, as demonstrated by the motif and combined QSPR + motif models, where regularization forced each motif position to contain only one amino acid, as opposed to previous models where motif positions have non-negligible probability assigned to each of the 20 amino acids [3,58]. These models show good classification performance, with a maximum accuracy of 94% at predicting whether a peptide is active against gram-positive bacteria, given only the sequence of amino acids and their chemical descriptors. This is as good as more opaque and complex strategies such as multilayer artificial neural networks and N-gram representation random forest modeling, and better than a linear SVM, with the advantage of chemically meaningful interpretations. Additionally, these models were used to identify potentially multifunctional peptides that are both antifouling and antimicrobial. Finally, due to the identical performance of the best-case motif model and the “background-only” model, we can conclude that either motifs are unimportant to the antimicrobial activity of peptides, or their importance is more complicated than this model is able to capture. Bayesian network models provide a significant advance in the type of peptide activity modeling that can be done, and in the ease with which such models can be constructed and combined.

6 Acknowledgement

This work was supported by the Office of Naval Research (N00014-10-1-0600) and the National Science Foundation (CBET-0854298).

References

[1] Murray Aitkin, Duy Vu, and Brian Francis. Statistical modelling of a terrorist network. J. R. Stat. Soc. Ser. A (Statistics Soc.), 180(3):751–768, jun 2017. ISSN 09641998. doi: 10.1111/rssa.12233. URL http://doi.wiley.com/10.1111/rssa.12233.
[2] I W Althaus, A J Gonzales, J J Chou, D L Romero, M R Deibel, Kuo-Chen Chou, F J Kezdy, L Resnick, M E Busso, and A G So.
The quinoline U-78036 is a potent inhibitor of HIV-1 reverse transcriptase. J. Biol. Chem., 268(20):14875–14880, 1993. URL http://www.jbc.org/content/268/20/14875.
[3] T. L. Bailey and C. Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings International Conference on Intelligent Systems for Molecular Biology; ISMB. International Conference on Intelligent Systems for Molecular Biology, 2:28–36, 1994. ISSN 1553-0833. URL http://view.ncbi.nlm.nih.gov/pubmed/7584402.
[4] Timothy L. Bailey and Charles Elkan. Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization. Machine Learning, 21(1):51–80, October 1995. ISSN 08856125. doi: 10.1023/A:1022617714621. URL http://dx.doi.org/10.1023/A:1022617714621.
[5] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal Machine Learning: A Survey and Taxonomy. pages 1–20, may 2017. URL http://arxiv.org/abs/1705.09406.
[6] Indrani Banerjee, Ravindra C. Pangule, and Ravi S. Kane. Antifouling Coatings: Recent Developments in the Design of Surfaces That Prevent Fouling by Proteins, Bacteria, and Marine Organisms. Adv. Mater., 23(6):690–718, feb 2011. ISSN 09359648. doi: 10.1002/adma.201001215. URL http://doi.wiley.com/10.1002/adma.201001215.
[7] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The protein data bank. Nucleic Acids Res., 28:235–242, 2000. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC102472/.
[8] Sara Bobone, Alessandro Piazzon, Barbara Orioni, Jens Z. Pedersen, Yong H. Nan, Kyung-Soo Hahm, Song Y. Shin, and Lorenzo Stella. The thin line between cell-penetrating and antimicrobial peptides: the case of Pep-1 and Pep-1-K. J. Peptide Sci., 17(5):335–341, May 2011. doi: 10.1002/psc.1340. URL http://dx.doi.org/10.1002/psc.1340.
[9] Jeremy P Bradshaw, BK Sharma, GPS Raghava, BC Schutte, TL Casavant, PB McCray, V Brusic, and VB Bajic. Analysis and prediction of antibacterial peptides. BioDrugs, 17(4):233–240, jul 2003. ISSN 1173-8804. doi: 10.2165/00063030-200317040-00002. URL http://link.springer.com/10.2165/00063030-200317040-00002.
[10] Frank R. Burden and David A. Winkler. Robust QSAR Models Using Bayesian Regularized Neural Networks. J. Med. Chem., 42(16):3183–3187, aug 1999. ISSN 0022-2623. doi: 10.1021/jm980697n. URL http://pubs.acs.org/doi/abs/10.1021/jm980697n.
[11] Xianwen Chen, Lige Ren, Soochong Kim, Nicholas Carpino, James L. Daniel, Satya P. Kunapuli, Alexander Y. Tsygankov, and Dehua Pei. Determination of the substrate specificity of protein-tyrosine phosphatase TULA-2 and identification of Syk as a TULA-2 substrate. The Journal of biological chemistry, 285(41):31268–31276, October 2010. ISSN 1083-351X. doi: 10.1074/jbc.M110.114181. URL http://dx.doi.org/10.1074/jbc.M110.114181.
[12] Artem Cherkasov, Eugene N. Muratov, Denis Fourches, Alexandre Varnek, Igor I. Baskin, Mark Cronin, John Dearden, Paola Gramatica, Yvonne C. Martin, Roberto Todeschini, Viviana Consonni, Victor E. Kuz’Min, Richard Cramer, Romualdo Benigni, Chihae Yang, James Rathman, Lothar Terfloth, Johann Gasteiger, Ann Richard, and Alexander Tropsha. QSAR modeling: Where have you been? Where are you going to? Journal of Medicinal Chemistry, 57(12):4977–5010, 2014. ISSN 15204804. doi: 10.1021/jm4004285.
[13] Jiang Shou-Ping Liu Wei-Min Fee Chih-Hao Chou, Kuo-Chen. Graph theory of enzyme kinetics I. Steady-state reaction systems. SCIENCE CHINA Mathematics, 22(3):341, 1979. doi: 10.1360/ya1979-22-3-341. URL http://math.scichina.com:8081/sciAe/EN/abstract/article_380187.shtml.
[14] Kuo-Chen Chou. Graphic Rules in Steady and Non-steady State Enzyme Kinetics. (20):12074–12079. URL http://www.jbc.org/content/264/20/12074.full.pdf.
[15] Kuo-Chen Chou.
Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol., 273(1):236–247, 2011. ISSN 00225193. doi: 10.1016/j.jtbi.2010.12.024. URL http://www.sciencedirect.com/science/article/pii/S002251931000679X.
[16] Kuo-Chen Chou and David W. Elrod. Protein subcellular location prediction. Protein Eng., 12(2):107–118, February 1999. ISSN 1741-0134. doi: 10.1093/protein/12.2.107. URL http://dx.doi.org/10.1093/protein/12.2.107.
[17] Kuo-Chen Chou and S Forsén. Graphical rules for enzyme-catalysed rate laws. Biochem. J., 187(3):829–35, jun 1980. ISSN 0264-6021. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC1162468.
[18] The ENCODE Project Consortium. An Integrated Encyclopedia of DNA Elements in the Human Genome. Nature, 489:57–74, September 2012. doi: 10.1038/nature11247.
[19] Sergio Davis, Claudia Loyola, and Joaquín Peralta. Bayesian statistical modelling of microcanonical melting times at the superheated regime. pages 1–20, aug 2017. URL http://arxiv.org/abs/1708.05210.
[20] Simon Duane, A.D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Phys. Lett. B, 195(2):216–222, sep 1987. ISSN 03702693. doi: 10.1016/0370-2693(87)91197-X. URL http://linkinghub.elsevier.com/retrieve/pii/037026938791197X.
[21] Christopher D. Fjell, Robert E.W. Hancock, and Artem Cherkasov. AMPer: a database and an automated discovery tool for antimicrobial peptides. Bioinformatics, 23(9):1148–1155, may 2007. ISSN 1460-2059. doi: 10.1093/bioinformatics/btm068. URL https://academic.oup.com/bioinformatics/article/23/9/1148/272556.
[22] Christopher D. Fjell, Håvard Jenssen, Kai Hilpert, Warren A. Cheung, Nelly Panté, Robert E. W. Hancock, and Artem Cherkasov. Identification of Novel Antibacterial Peptides by Chemoinformatics and Machine Learning. J. Med. Chem., 52(7):2006–2015, March 2009. doi: 10.1021/jm8015365. URL http://dx.doi.org/10.1021/jm8015365.
[23] Christopher D. Fjell, Jan A.
Hiss, Robert E. W. Hancock, and Gisbert Schneider. Designing antimicrobial peptides: form follows function. Nat. Rev. Drug. Discov., 11(1):37–51, January 2012. ISSN 1474-1776. doi: 10.1038/nrd3591. URL http://dx.doi.org/10.1038/nrd3591.
[24] V. Frecer, B. Ho, and J. L. Ding. De novo design of potent antimicrobial peptides. Antimicrob. Agents, 48(9):3349–3357, September 2004. ISSN 0066-4804. doi: 10.1128/aac.48.9.3349. URL http://dx.doi.org/10.1128/aac.48.9.3349.
[25] Margaret Gamalo-Siebers, Jasmina Savic, Cynthia Basu, Xin Zhao, Mathangi Gopalakrishnan, Aijun Gao, Guochen Song, Simin Baygani, Laura Thompson, H. Amy Xia, Karen Price, Ram Tiwari, and Bradley P. Carlin. Statistical modeling for Bayesian extrapolation of adult clinical trial information in pediatric drug evaluation. Pharm. Stat., 16(4):232–249, jul 2017. ISSN 15391604. doi: 10.1002/pst.1807. URL http://doi.wiley.com/10.1002/pst.1807.
[26] Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(1):452–459, May 2015. doi: 10.1038/nature14541. URL https://www.nature.com/articles/nature14541.
[27] A. K. Ghose and G. M. Crippen. Atomic physicochemical parameters for three-dimensional-structure-directed quantitative structure-activity relationships. 2. Modeling dispersive and hydrophobic interactions. J. Chem. Inf. Comput. Sci., 27(1):21–35, February 1987. ISSN 0095-2338. URL http://view.ncbi.nlm.nih.gov/pubmed/3558506.
[28] Mark Hewitt, Mark T. D. Cronin, Judith C. Madden, Philip H. Rowe, Clara Johnson, Andrea Obi, and Steven J. Enoch. Consensus QSAR Models: Do the Benefits Outweigh the Complexity? J. Chem. Inf. Model., 47(4):1460–1468, July 2007. doi: 10.1021/ci700016d. URL http://dx.doi.org/10.1021/ci700016d.
[29] Michael M. Hoffman, Jason Ernst, Steven P. Wilder, Anshul Kundaje, Robert S. Harris, Max Libbrecht, Belinda Giardine, Paul M. Ellenbogen, Jeffrey A. Bilmes, Ewan Birney, Ross C. Hardison, Ian Dunham, Manolis Kellis, and William S.
Noble. Integrative annotation of chromatin elements from ENCODE data. Nucl. Acids Res., 41(2):827–841, January 2013. ISSN 1362-4962. doi: 10.1093/nar/gks1284. URL http://dx.doi.org/10.1093/nar/gks1284.
[30] Joanna Jaworska, Nina Nikolova-Jeliazkova, and Tom Aldenberg. QSAR Applicability Domain Estimation by Projection of the Training Set in Descriptor Space: A Review. ATLA, (33):445–459, 2005.
[31] Mikko Koivisto and Kismat Sood. Exact Bayesian Structure Discovery in Bayesian Networks. J. Mach. Learn. Res., 5:549–573, 2004. URL http://www.jmlr.org/papers/volume5/koivisto04a/koivisto04a.pdf.
[32] Matthias Kormaksson, James G Booth, Maria E Figueroa, and Ari Melnick. Integrative model-based clustering of microarray methylation and expression data. Ann. Appl. Stat., 6(3):1327–1347, 2012.
[33] Manish Kumar, Ruchi Verma, Gajendra P. S. Raghava, Y Sugiura, SH Seah, TW Tan, V Brusic, and VB Bajic. AntiBP2: improved version of antibacterial peptide prediction. J. Biol. Chem., 281(9):5357–5363, mar 2006. ISSN 0021-9258. doi: 10.1074/jbc.M511061200. URL http://www.jbc.org/lookup/doi/10.1074/jbc.M511061200.
[34] Qilin Li, Shaily Mahendra, Delina Y. Lyon, Lena Brunet, Michael V. Liga, Dong Li, and Pedro J.J. Alvarez. Antimicrobial nanomaterials for water disinfection and microbial control: Potential applications and implications. Water Res., 42(18):4591–4602, nov 2008. ISSN 0043-1354. doi: 10.1016/J.WATRES.2008.08.015. URL http://www.sciencedirect.com/science/article/pii/S0043135408003333.
[35] S X Lin and K E Neet. Demonstration of a slow conformational change in liver glucokinase by fluorescence spectroscopy. J. Biol. Chem., 265(17):9670–9675, jun 1990. URL http://www.ncbi.nlm.nih.gov/pubmed/2351663.
[36] W Liu and J N Hansen. The antimicrobial effect of a structural variant of subtilin against outgrowing Bacillus cereus T spores and vegetative cells occurs by different mechanisms. Appl. Environ. Microbiol., 59(2):648–51, feb 1993. ISSN 0099-2240.
URL https://www.ncbi.nlm.nih.gov/pubmed/8434932.
[37] H. B. McMahan and M. Streeter. Adaptive Bound Optimization for Online Convex Optimization. ArXiv e-prints, pages 1–19, February 2010. URL https://arxiv.org/pdf/1002.4908.pdf.
[38] Paul David McNicholas and Thomas Brendan Murphy. Parsimonious Gaussian mixture models. Stat. Comput., 18(3):285–296, sep 2008. ISSN 0960-3174. doi: 10.1007/s11222-008-9056-0. URL http://link.springer.com/10.1007/s11222-008-9056-0.
[39] R D Newcomb, P M Campbell, D L Ollis, E Cheah, R J Russell, and J G Oakeshott. A single amino acid substitution converts a carboxylesterase to an organophosphorus hydrolase and confers insecticide resistance on a blowfly. Proc. Natl. Acad. Sci. U. S. A., 94(14):7464–7468, jul 1997. ISSN 0027-8424. URL http://www.ncbi.nlm.nih.gov/pubmed/9207114.
[40] Alice B. Nongonierma and Richard J. FitzGerald. Learnings from quantitative structure-activity relationship (QSAR) studies with respect to food protein-derived bioactive peptides: a review. RSC Adv., 6(79):75400–75413, 2016. ISSN 2046-2069. doi: 10.1039/C6RA12738J. URL http://xlink.rsc.org/?DOI=C6RA12738J.
[41] Ann K. Nowinski, Andrew D. White, Andrew J. Keefe, and Shaoyi Jiang. Biologically Inspired Stealth Peptide-Capped Gold Nanoparticles. Langmuir, 30(7):1864–1870, feb 2014. ISSN 0743-7463. doi: 10.1021/la404980g. URL http://pubs.acs.org/doi/10.1021/la404980g.
[42] Manal Othman, Sujay Ratna, Anant Tewari, Anthony M. Kang, Katherine Du, and Iosif I. Vaisman. Classification and Prediction of Antimicrobial Peptides Using N-gram Representation and Machine Learning. In Proc. 8th ACM Int. Conf. Bioinformatics, Comput. Biol. Heal. Informatics - ACM-BCB ’17, pages 605–605, New York, New York, USA, 2017. ACM Press. ISBN 9781450347228. doi: 10.1145/3107411.3108215. URL http://dl.acm.org/citation.cfm?doid=3107411.3108215.
[43] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R.
Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[44] L A Rogers. THE INHIBITING EFFECT OF STREPTOCOCCUS LACTIS ON LACTOBACILLUS BULGARICUS. J. Bacteriol., 16(5):321–325, nov 1928. ISSN 0021-9193. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC375033.
[45] John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. Probabilistic programming in Python using PyMC3. PeerJ Comput. Sci., 2:e55, apr 2016. ISSN 2376-5992. doi: 10.7717/peerj-cs.55. URL https://peerj.com/articles/cs-55.
[46] Mario Salwiczek, Yue Qu, James Gardiner, Richard A. Strugnell, Trevor Lithgow, Keith M. McLean, and Helmut Thissen. Emerging rules for effective antimicrobial coatings. Trends Biotechnol., 32(2):82–90, feb 2014. ISSN 0167-7799. doi: 10.1016/J.TIBTECH.2013.09.008. URL http://www.sciencedirect.com/science/article/pii/S0167779913002072.
[47] Daniel Schwartz and Steven P Gygi. An iterative statistical approach to the identification of protein phosphorylation motifs from large-scale data sets. Nat. Biotechnol., 23(11):1391–1398, nov 2005. ISSN 1087-0156. doi: 10.1038/nbt1146. URL http://www.nature.com/articles/nbt1146.
[48] D M Stalker, W R Hiatt, and L Comai. A single amino acid substitution in the enzyme 5-enolpyruvylshikimate-3-phosphate synthase confers resistance to the herbicide glyphosate. J. Biol. Chem., 260(8):4724–4728, apr 1985. ISSN 0021-9258. URL http://www.ncbi.nlm.nih.gov/pubmed/2985565.
[49] Christoph Steinbeck, Yongquan Han, Stefan Kuhn, Oliver Horlacher, Edgar Luttmann, and Egon Willighagen. The chemistry development kit (CDK): An Open-Source java library for chemo- and bioinformatics. J. Chem. Inf. Comput. Sci., 43(2):493–500, February 2003. ISSN 0095-2338. doi: 10.1021/ci025584y. URL http://dx.doi.org/10.1021/ci025584y.
[50] Michael C. Sweeney, Anne-Sophie S.
Wavreille, Junguk Park, Jonathan P. Butchar, Susheela Tridandapani, and Dehua Pei. Decoding protein-protein interactions through combinatorial chemistry: sequence specificity of SHP-1, SHP-2, and SHIP SH2 domains. Biochem, 44(45):14932–14947, November 2005. ISSN 0006-2960. doi: 10.1021/bi051408h. URL http://dx.doi.org/10.1021/bi051408h.
[51] Shaini Thomas, Shreyas Karnik, Ram Shankar Barai, V. K. Jayaraman, and Susan Idicula-Thomas. CAMP: a useful resource for research on antimicrobial peptides. Nucleic Acids Res., 38(suppl 1):D774–D780, jan 2010. ISSN 0305-1048. doi: 10.1093/nar/gkp1021. URL http://www.ncbi.nlm.nih.gov/pubmed/19923233.
[52] Marc Torrent, David Andreu, Victòria M. Nogués, and Ester Boix. Connecting Peptide Physicochemical and Antimicrobial Properties by a Rational Prediction Model. PLoS ONE, 6(2):e16968, February 2011. doi: 10.1371/journal.pone.0016968. URL http://dx.doi.org/10.1371/journal.pone.0016968.
[53] Aki Vehtari, Andrew Gelman, and Jonah Gabry. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Stat. Comput., 27(5):1413–1432, sep 2017. ISSN 0960-3174. doi: 10.1007/s11222-016-9696-4. URL http://link.springer.com/10.1007/s11222-016-9696-4.
[54] Guangshun Wang. Post-translational Modifications of Natural Antimicrobial Peptides and Strategies for Peptide Engineering. Curr. Biotechnol., 1(1):72–79, feb 2012. URL http://www.ncbi.nlm.nih.gov/pubmed/24511461.
[55] Guangshun Wang, Xia Li, and Zhe Wang. APD2: the updated antimicrobial peptide database and its application in peptide design. Nucl. Acids Res., 37(suppl 1):D933–D937, January 2009. ISSN 1362-4962. doi: 10.1093/nar/gkn823. URL http://dx.doi.org/10.1093/nar/gkn823.
[56] Andrew D White, Wenjun Huang, and Shaoyi Jiang. Role of Nonspecific Interactions in Molecular Chaperones through Model-based Bioinformatics. Biophys. J., 103:2485–2491, 2012.
[57] Andrew D White, Ann K Nowinski, Wenjun Huang, Andrew J Keefe, Fang Sun, and Shaoyi Jiang. Decoding nonspecific interactions from nature. Chem. Sci., 3:3488–3494, 2012.
[58] Andrew D. White, Andrew J. Keefe, Ann K. Nowinski, Qing Shao, Kyle Caldwell, and Shaoyi Jiang. Standardizing and simplifying analysis of peptide library data. J. Chem. Inf. Model., 53(2):493–499, January 2013. doi: 10.1021/ci300484q. URL http://dx.doi.org/10.1021/ci300484q.
[59] Xuan Xiao, Pu Wang, Wei-Zhong Lin, Jian-Hua Jia, and Kuo-Chen Chou. iAMP-2L: A two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Anal. Biochem., 436(2):168–177, may 2013. ISSN 00032697. doi: 10.1016/j.ab.2013.01.019. URL http://linkinghub.elsevier.com/retrieve/pii/S0003269713000390.
[60] N Yadav, R E McDevitt, S Benard, and S C Falco. Single amino acid substitutions in the enzyme acetolactate synthase confer resistance to the herbicide sulfometuron methyl. Proc. Natl. Acad. Sci. U. S. A., 83(12):4418–4422, jun 1986. ISSN 0027-8424. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC323744/.
[61] G. Yu, G. Sapiro, and S. Mallat. Solving inverse problems with piecewise linear estimators: From gaussian mixture models to structured sparsity. IEEE Trans. Image Process., 21(5):2481–2499, May 2012. ISSN 1057-7149. doi: 10.1109/TIP.2011.2176743.
[62] G P Zhou and M H Deng. An extension of Chou’s graphic rules for deriving enzyme kinetic equations to systems involving parallel reaction pathways. Biochem. J., 222(1):169–176, aug 1984. ISSN 0264-6021. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1144157/.

Classifying Antimicrobial and Multifunctional Peptides
Supporting Information

Rainier Barrett,† Shaoyi Jiang,∗,‡ and Andrew D White∗,†

†Department of Chemical Engineering, University of Rochester, 4001 Wegmans Hall, Rochester, NY 14627, USA
‡Department of Chemical Engineering, University of Washington, 440 Benjamin Hall IRB 616 NE Northlake Pl.
Seattle, WA 98105
E-mail: sjiang@uw.edu; andrew.white@rochester.edu
Phone: 585-276-7395

Department of Chemical Engineering, University of Rochester, 206 Gavett Hall, Rochester, NY 14627, USA. Email: andrew.white@rochester.edu
Department of Chemical Engineering, University of Washington, Box 351750, Seattle, WA 98195, USA
∗To whom correspondence should be addressed

Converting Descriptors into Ranks

In order to use a structural descriptor in a Bayesian network model, it must be converted into a form that may be described by a probability distribution. The probability that a descriptor f(·) equals a value x in a compound c is given by:

$$\Pr(f(c) = x) = \frac{1}{Z}\sum_i w_i\,\delta\bigl(f(c_i) - x\bigr), \qquad Z = \sum_i w_i \tag{1}$$

where Z is the partition coefficient, $w_i$ is the un-normalized probability (weight) of observing the ith compound $c_i$ in the chemical space, and δ is an indicator or delta function. The weighting parameters may be adjusted, for example, to account for synthetic difficulty, recognizing the fact that the experimentally active compounds are likely chosen with bias. In the case of peptide libraries, the weights are unity because peptide libraries have little to no synthetic bias. The partition coefficient for a peptide library is $A^l$, where A is the size of the alphabet (generally 20 for amino acids) and l is the length of the amino acid sequence.

Constructing probability distributions for group-wise additive descriptors follows two approaches. When the chemical space is small (l ≤ 3 for peptides), all descriptor values may be enumerated to create a probability distribution. When the chemical space is large (l > 3 for peptides), the probability distribution may be approximated as a sum of l normal distributions. In the case of peptide libraries, l is the length and the normal distributions are identical.
In the case of combinatorial organic libraries, l is the number of positions that may be exchanged and the normal distributions may not be identical between positions. The approximation is accurate provided the number of 0’s is low (e.g., the number of sulfur atoms in the peptides will not fit into this approximation). Examples of this approximation may be seen in Figures S2-S4. The mean of the normal distributions is the mean ($\mu$) of the descriptor calculated on the combinatorial components (e.g., amino acids) and the variance ($\sigma^2$) is calculated likewise. The sum of the l normal distributions will have a mean of $l\mu$ and a variance of $l\sigma^2$. For non-group-wise descriptors, the probability distribution may be estimated by sampling from the combinatorial library, where the sampling is done according to the weights $w_i$.

Once the probability distribution over the chemical space of the library is calculated, descriptors for the active/training compounds are transformed to incorporate information about this probability distribution. This is done by converting the descriptor into a rank between 0 and 100, where the rank gives the position of the descriptor relative to the chemical space. The ranks come from quantiling the descriptors calculated over the chemical space. For example, quantiling the number of charged groups over a chemical space with 4 quantiles could yield that the bottom 25% are between 0–5 charged groups, the 25–50% are between 5–6 charged groups, the 50–75% are between 6–7 and the top 25% are 7–15. Using this distribution, an active compound with 3 charged groups would be given a rank of 1, because it is in the first quantile. An active compound with 7 charged groups has a rank of 3, and an active compound with 13 charged groups would have a rank of 4, because it is in the top quantile. Notice how the unevenness of the original distribution is removed and the ranks correlate to the ranking over the chemical space. The entire transformation process is depicted in Figure S1.

This descriptor transformation has three benefits.
First, it is immediately obvious if a descriptor is at an extreme value. Second, when examining multiple descriptors, their range corresponds exactly to their span of the entire chemical space. Thus, if a descriptor range is 5–95, it is not significant. If it is within the range of 20–25, then the descriptors occupy a range that only 5% of the chemical space of the library occupies. Third, the effect of length on the peptide descriptors may be removed by only comparing descriptors against uniform-length probability distributions. For example, if there are sequences of lengths 3–10 in a library, the descriptors may be calculated relative only to sequences of the same length. Then a rank of 5 is interpreted as in the bottom 5% relative to sequences of the same length. If this is not desired, only the probability distribution on the longest 2 lengths needs to be calculated, since that corresponds to 99.75% ($1 - 1/20^2$) of the possible values.

[Figure S1 flowchart nodes: f(c); Group Additive (Yes/No); length < 3 (Yes/No); Enumerate; Normal Approximation; Sample; Data; P{f(c)=x}; 0-100 Score]
Figure S1 A flowchart for converting a descriptor, f(c), into a rank from 0–100 that both removes biases from the chemical space of the library and normalizes it for use in Bayesian network models.

[Figure S2 plot: histogram and approximation; x-axis: MW]
Figure S2 The dataset is all combinations of the 20 amino acids with length 3. The approximation is the sum of three identical normal distributions parameterized to the molecular weight of the 20 amino acids. The approximation works well, even at this low length.

[Figure S3 plot: histogram and approximation; x-axis: nAromaticGroups]
Figure S3 The dataset is all combinations of the 20 amino acids with length 3. The approximation is the sum of three identical normal distributions parameterized to an aromatic indicator function on the 20 amino acids (1 for aromatic, 0 for non-aromatic). The approximation does not work well, due to the high number of zero values, as mentioned in the text.
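The comparison between exhaustive enumeration and the normal approximation for a group-wise additive descriptor, as illustrated in Figures S2-S4, can be sketched as below for net charge on tripeptides. This is only a sketch: the per-residue charge assignment (K and R as +1, D and E as −1, all others 0) is an assumption made for illustration, and the function names are hypothetical.

```python
import itertools
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed per-residue net charge, for illustration only.
CHARGE = {aa: 0 for aa in AMINO_ACIDS}
CHARGE.update({"K": 1, "R": 1, "D": -1, "E": -1})


def enumerate_distribution(per_residue, length):
    """Exact Pr(f(c) = x) over all peptides of one length, with unit weights."""
    counts = Counter(
        sum(per_residue[aa] for aa in peptide)
        for peptide in itertools.product(AMINO_ACIDS, repeat=length)
    )
    total = len(AMINO_ACIDS) ** length  # partition coefficient A**l
    return {value: count / total for value, count in sorted(counts.items())}


def normal_parameters(per_residue, length):
    """Mean l*mu and variance l*sigma^2 of the sum of l identical normals."""
    values = [per_residue[aa] for aa in AMINO_ACIDS]
    mu = sum(values) / len(values)
    sigma2 = sum((v - mu) ** 2 for v in values) / len(values)
    return length * mu, length * sigma2


def normal_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(
        2 * math.pi * variance)


exact = enumerate_distribution(CHARGE, 3)
mean, variance = normal_parameters(CHARGE, 3)
# The descriptor takes integer values one unit apart, so the density at each
# value can be compared directly with the enumerated probability.
approx = {value: normal_pdf(value, mean, variance) for value in exact}
```

The enumerated distribution has exactly the mean and variance used by the approximation, so any disagreement between `exact` and `approx` comes from the shape of the distribution, which is where a descriptor dominated by zeros causes trouble.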
[Figure S4 plot: histogram and approximation; x-axis: netCharge, from -3 to 3; y-axis: Probability]
Figure S4 The dataset is all combinations of the 20 amino acids with length 3. The approximation is the sum of three identical normal distributions parameterized to the charge of the 20 amino acids. The approximation works well, even at this low length.

Model specifications

The graphical models are specified in the GitHub repository found at https://github.com/RainierBarrett/pymc3_qspr. This software is made freely available under the GNU General Public License.

Motif Model Verification

Figures S5-S8 are the plots of the motif class distribution trained on the artificial dataset with the imposed motif “ARND”. As described in the main text, the dataset of peptides with imposed motifs was artificially constructed by randomly adding uniformly distributed amino acids before and/or after the imposed motif. Notice the sparseness: the model clearly distinguishes which amino acid is most likely to be in which motif position, with no knowledge of what the motif will be or where it will occur. The model accurately captures the motifs in this simple case. These figures were produced by fitting the motif model with 1000 training steps with the same training method as described in the main text.

Figure S9 shows a comparison between two equally accurate kernel numbers for the QSPR model. The receiver operating characteristic (ROC) curves and the predicted rank distribution for the number of non-polar groups descriptor generated by the model are shown for each case. The Materials and Methods section of the main text details how the ROC curves are generated. We can see that while the ROC curves (and thus, performance) are nearly identical, the three-kernel generated distribution is more representative of the qualitative features of the true histogram of the rankings found from the APD dataset. Figure S10 displays the ROC curve of the background-only motif model.
Note its similarity to the best-case motif model (Figure 6b, main text). Figure S11 shows the ROC curve generated by the QSPR model on the human dataset. Note the dissimilarity between this ROC curve and that of the QSPR model trained on the APD (Figure S9a). Figure S5 Predicted probabilities for amino acid ocurrence in the first position in a single motif class model. 6 Figure S6 Predicted probabilities for amino acid ocurrence in the second position in a single motif class model. Figure S7 Predicted probabilities for amino acid ocurrence in the third position in a single motif class model. 7 Figure S8 Predicted probabilities for amino acid ocurrence in the fourth position in a single motif class model. Figure S10 The ROC curve generated by the motif model with k = 0, w = 0. TPR is true positive rate and FPR is false positive rate for predictions on withheld testing data. The high degree of accuracy with no motif identification may indicate that motifs are not important for antimicrobial activity, or that the model is not equipped to capture the nature of their importance. 8 (a) The ROC Curve for the 3-kernel Gaussian (b) The ROC Curve for the 6-kernel Gaussian mixture QSPR model. mixture QSPR model. (c) The fitted distribution for number of non- (d) The fitted distribution for number of non- polar groups score in the 3-kernel QSPR model polar groups score in the 6-kernel QSPR model compared with the histogram of the raw data. compared with the histogram of the raw data. Figure S9 A comparison of one descriptor’s prediciton histogram with two different kernel num- bers in the QSPR model, and the corresponding ROC curves. TPR is true positive rate and FPR is false positive rate for predictions on withheld testing data. Though both of these kernel num- bers produced similar performance (note ROC curve similarity), the 3-kernel distribution is more indicative of the underlying properties. 9 Figure S11 ROC curve of the 8-kernel QSPR model trained on the human dataset. 
FPR is false positive rate, and TPR is true positive rate for predictions made by this model on withheld testing data from the human dataset. 10 References [1] Hyndman, R. J.; Fan, Y. Am. Stat. 1996, 50, 361–365. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Statistics arXiv (Cornell University)

Classifying Antimicrobial and Multifunctional Peptides with Bayesian Network Models

Statistics , Volume 2018 (1804) – Apr 17, 2018

ISSN
2475-8817
eISSN
ARCH-3347
DOI
10.1002/pep2.24079

Abstract

Bayesian network models are finding success in characterizing enzyme-catalyzed reactions, slow conformational changes, predicting enzyme inhibition, and genomics. In this work, we apply them to statistical modeling of peptides by simultaneously identifying amino acid sequence motifs and using a motif-based model to clarify the role motifs may play in antimicrobial activity. We construct models of increasing sophistication, demonstrating how chemical knowledge of a peptide system may be embedded without requiring new derivation of model fitting equations after changing model structure. These models are used to construct classifiers with good performance (94% accuracy, Matthews correlation coefficient of 0.87) at predicting antimicrobial activity in peptides, while at the same time being built of interpretable parameters. We demonstrate use of these models to identify peptides that are potentially both antimicrobial and antifouling, and show that the background distribution of amino acids could play a greater role in activity than sequence motifs do. This provides an advancement in the type of peptide activity modeling that can be done and the ease with which models can be constructed.

University of Washington, Department of Chemical Engineering, Seattle, WA, USA. Tel: 01 206 616 6509; E-mail: sjiang@uw.edu
University of Rochester, Department of Chemical Engineering, Rochester, NY, USA. Tel: 01 585 276 7395; E-mail: andrew.white@rochester.edu
arXiv:1804.06327v1 [stat.AP] 17 Apr 2018

1 Introduction

Bayesian networks are a statistical modeling framework that is ideally suited for use in combination with a quantitative structure-property relationship (QSPR) modeling framework due to their ability to encode chemical knowledge and design interpretable models [10,26]. QSPR modeling (alternatively known as QSAR for “Quantitative Structure-Activity Relationship”) is a term used to refer to a suite of statistical modeling techniques.
Its purpose is not always specifically to identify structural features and link them to activity, but to identify and/or exploit trends among the features of a chemical dataset in order to make statistical predictions. This broad class of modeling methods makes use of input data such as chemical descriptors and peptide sequences. Recent reviews may be found in Cherkasov et al., Jaworska et al. [30], and Nongonierma and FitzGerald [40].

The power of Bayesian network models lies in their ability to treat sophisticated models with general training techniques. Due to the generality of the training algorithms, even datasets like the massive ENCODE database can be analyzed. A recent review of Bayesian network models may be found in Ghahramani. Recent examples of applied Bayesian networks include predicting microcanonical melting points, modeling a terrorist network, and extrapolating clinical trial results beyond the original demographic. When applying a Bayesian network model to small drug-like molecules, one could specify that the molecule must have a molecular weight below a cut-off, and that at least two QSPR descriptors must be in a certain range. Such constraints are difficult to embed into linear discriminant analysis, for example, and require a new derivation of the model fitting procedure. There is no such requirement in Bayesian networks, due to the generality of their training procedures. This generality also means that fast algorithms have been developed that make use of such constraints to reduce the training space. The combination of the constraints and speed allows models to be constructed that are easily interpretable. In particular, Bayesian models have recently been employed in specifically biological studies of a wide range of important topics, including enzyme-catalyzed reactions [13,14,17,62], slow conformational change [35], and inhibition of HIV-1 reverse transcriptase [2].

Another benefit of Bayesian networks is their ability to do multimodal modeling.
Multimodal modeling is the combination of multimodal data into a unified model. For example, combining the sequence data and chemical descriptors of a peptide is a challenging task. In a Bayesian network, two parts of the model may deal with the different data types and be connected through probability distributions. This has an advantage over other model combination techniques, such as consensus modeling, in that both models may be trained independently and later combined, instead of following a step-by-step procedure. Applying these types of models to chemistry problems will open new ways of combining data such as bioavailability descriptors, sequence models, and simulation results.

In recent years, a number of studies of antimicrobial peptides (AMPs) have been made using machine learning techniques. The Antimicrobial Peptide Database (APD) was designed to collect known peptides with antimicrobial properties. Past studies involving AMP classification include: Fjell et al., who created AMPer using a hidden Markov model approach; Bradshaw et al. [9], who developed AntiBP (later improved by Kumar et al. [33]) to classify antimicrobial peptides using sequence information; and Thomas et al., who created the Collection of Anti-Microbial Peptides, a database of AMPs with built-in tools for prediction and analysis. Finally, Xiao et al. used machine learning techniques to classify AMPs by target (bacteria, viruses, etc.) in addition to antimicrobial activity alone.

However, antimicrobial activity in vitro is not necessarily indicative of broad applicability for other uses. In complex media, fouling (non-specific binding at the surface of a material) leads to a loss of activity. Thus, in order for an AMP to be of use in applications such as biomedical devices and marine coatings [6,34,46], it is necessary to ensure that the AMP retains both antimicrobial and antifouling properties in complex media.
To this end, we have constructed accurate Bayesian models that can predict antimicrobial activity, identify sequence motifs, elucidate important descriptors, and identify potential multi-functional peptides that are both antimicrobial and antifouling.

In this work, Bayesian network models are created to predict antimicrobial activity and identify possible multifunctional peptides. Two datasets are used to train the networks. The first consists of 351 unique peptides from the APD which inhibit growth of gram-positive bacteria. The second is a collection of approximately 3,600 sequence fragments from the surface of human proteins. The second dataset is hypothesized to contain sequences which resist nonspecific interactions in biological systems [56,57]. Bayesian models are well suited to small datasets such as these, as they have demonstrable ability to be accurately trained on datasets with as few as 100 points.

Figure 1 Diverse proteins isolated from humans with structures in the Protein Data Bank were used to create a database of 1,162 proteins. The surface was found as described in White et al. Contiguous surface sequences were found (dark gray) and converted into sequence fragments. All fragments with length greater than 4 were used.

In the materials and methods section, we describe the construction of the datasets, the training procedure used for the models, and the descriptors used. In the results section, we examine model accuracy, compare our models with a simpler traditional machine learning approach, and finally, identify two peptides that are predicted to have antimicrobial and antifouling properties.

2 Materials and Methods

It is critical in the development of a sequence-based statistical prediction method for biological systems that the process be clear and easily reproducible.
As suggested by Chou and demonstrated in a series of recent articles from that research group [14,17,59,62], a predictive statistical model for biological systems should use the following guidelines to achieve its goals: construct benchmark datasets for training and testing the predictor, create a suitable mathematical expression that represents the relevant properties for prediction, implement an appropriate model and training algorithm, perform cross-validation tests to evaluate the accuracy of prediction, and establish a web-based public interface for ease of use. In the following, we address each of these steps in turn.

In order to build a classification model to identify peptides that could be both antimicrobial and antifouling, two datasets were used. The first is from the APD [55] as of September 2017 and contains 482 sequences which show activity against gram-positive bacteria. The 482 sequences were reduced to 351 by removing similar sequences. Here, “similar” sequences were defined as those separated by 2 or fewer single-position substitutions (e.g., EDGRT and ADGRS are similar). This definition of sequence similarity has no inherent chemical meaning. It was chosen as a method of removing certain sequences to reduce over-representation of combinatorial studies, where single positions are changed over multiple trials. Although one substitution can be enough to drastically change the activity of a peptide [39,48,60], it is still necessary to avoid including these similar sequences due to the statistical nature of the model. Even if changing one or two amino acids affects the activity, including many peptides with nearly identical sequences would bias the model toward the (potentially very long) unchanged parts of these peptides.

The second dataset, “Human”, is built upon the protein dataset from White et al.
All contiguous amino acid sequences of length greater than 4 present on the surface of proteins from that dataset were tabulated as independent sequences, as depicted in Figure 1. This yields 3,600 unique sequences.

Another aspect of antimicrobial activity that our datasets do not address is post-translational modification (PTM) of peptides. In fact, 1147 out of 1755 of the peptides in the APD database are known to undergo PTM before activity. However, the goal of this model is not to explain all factors that lead to antimicrobial activity, but to accurately predict potential antimicrobial activity with as little information as possible, i.e. only sequence and/or chemical descriptor information.

To be of interest for applications like screening and other biochemical experiments, it is crucial to minimize the false positive rate (FPR) of the predictive model. Reducing false positives prevents the waste of experimental time and resources by precluding the investigation of an incorrectly predicted candidate peptide. Furthermore, the ability to reject false positives is an important attribute of the model itself, because it indicates that the model is not over-fit. However, to evaluate performance in this regard, negative data must be either gathered or generated; no one has tabulated a list of peptides which are not antimicrobial. Torrent et al. approached this problem by using sequences not reported to have activity, which may be a good assumption since AMPs are likely rare. We use the same approach here to evaluate model performance.
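The similarity filter used earlier in this section to reduce the APD set (sequences separated by 2 or fewer single-position substitutions count as similar) can be illustrated with a short sketch. This is not the authors' code: the function names, the greedy "keep the first sequence seen" policy, and the choice to treat sequences of different lengths as never similar are all assumptions made for illustration.

```python
def substitutions_apart(a, b):
    """Count differing positions; return None when lengths differ (assumption:
    such pairs are never treated as similar)."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def remove_similar(sequences, max_substitutions=2):
    """Greedily keep a sequence only if it is not similar to one already kept."""
    kept = []
    for seq in sequences:
        similar = False
        for other in kept:
            d = substitutions_apart(seq, other)
            if d is not None and d <= max_substitutions:
                similar = True
                break
        if not similar:
            kept.append(seq)
    return kept


# EDGRT and ADGRS differ at two positions, so only the first one is kept.
filtered = remove_similar(["EDGRT", "ADGRS", "KKLLG"])
```

A greedy pass depends on input order; a different policy (for example, clustering and keeping one representative) would also satisfy the definition given in the text.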
A decoy dataset was generated by replacing each residue in the APD dataset with a randomly selected amino acid drawn from the distribution of amino acids among all entries in the Protein Data Bank.

Figure 2 Histogram plots of descriptors of the Antimicrobial Peptide Database (APD) and Human datasets. Panels: ALogP*, HB Acceptors, HB Donors, Charged Groups, Polar Groups, Non-polar Groups, Aromatic Groups, and Net Charge; the x-axis of each panel is the score from 0 to 100. “HB” stands for hydrogen bond. A ranking of 100 indicates a high value of the descriptor relative to all peptides. ALogP is not calculated for the APD due to its poor accuracy at the long peptide lengths found in the APD dataset. The Human aromatic histogram is skewed because most sequences in the “Human” dataset have no aromatic groups, as they are drawn only from the surface of human proteins.

The following descriptors were considered: ALogP, number of hydrogen bond acceptors, number of hydrogen bond donors, number of charged groups, number of polar groups, number of non-polar groups, number of aromatic groups, and net charge. We investigated these descriptors with the goal of finding data that would distinguish our two datasets from one another and from the space of all peptides in general. All descriptors, except ALogP, were calculated using the peplib R plugin. ALogP was calculated according to Ghose and Crippen [27] as implemented in the Chemistry Development Kit [49]. To estimate the distribution of ALogP values, the calculation was performed on all possible amino acid sequences of lengths from 1 to 3, and on a random sample of 5,000 peptides, drawn uniformly from the datasets found in Sweeney et al. [50] and Chen et al. [11] for each length up to 10.
Using a procedure described in the Supporting Information, chemical descriptors are first converted into rankings that span 0 to 100. The rankings represent the value of the descriptor relative to all possible peptides in the dataset. Except for the ALogP descriptor, the descriptors used in this work are all additive, i.e. they are cumulative sums of the descriptors of all individual amino acids in a given peptide. The results of these calculations were tabulated as a single csv file for each dataset, each line of which consists of a peptide (specified as a single-letter amino acid sequence) paired with the list of chemical descriptors calculated for that sequence. In the chemical descriptor model, only the descriptor values are used for training. The motif model uses only the sequences themselves as input data. Withheld testing data was selected as a random 20% of input data for both models.

The descriptor rankings were calculated on the two datasets and are shown in Figure 2. Each box shows the histogram of the descriptor rankings. A flat histogram indicates the ranks are distributed identically to all possible peptides and thus the descriptor is likely unrelated to the activity. As was desired, all the descriptors chosen distinguish both datasets from all possible peptides and from one another. The APD dataset shows an abnormally low number of charged residues and a high number of non-polar groups. Consistent with past analysis of AMPs [8,23,24], the APD dataset has a lower number of charged residues with a skew toward positively charged. The number of charged residues is high on human protein surfaces, as seen previously. This is also reflected in the high water solubility (low ALogP values). There is no dominant net charge in one direction or another for human protein surfaces.
The number of aromatic residues is low, which is expected, since the number of aromatic residues is low across all proteins generally, and hydrophobic aromatic side chains mostly occur on the interior of human proteins.

2.1 Model Description

The chemical descriptor model (henceforth the “QSPR model”) was treated as a one-dimensional, two-state Gaussian mixture with respect to each descriptor. In Gaussian mixture modeling, the underlying distribution of a set of observations is estimated by fitting a function made up of a sum of k Gaussian kernels. The hyperparameters are the means, heights, and variance matrices, which are fitted for each kernel. Some background on this technique can be found in Yu et al. [61] and McNicholas and Murphy [38]. This model is one-dimensional in that each chemical descriptor was fitted to its own separate mixture-of-Gaussians distribution, with no correlation between distributions. The number of Gaussian kernels was varied from 1 to 10, with the best performance given by the 3-kernel set (see discussion and Supporting Information).

Figure 3 A graphical representation of a 2-state classifier that fits 3 observed descriptors to either distribution 0 or 1. A class of 1 indicates activity. In this work, the “Trained” nodes are our mixture-of-Gaussians distributions, and the three “Observed” nodes correspond to the three chemical descriptors chosen. (Diagram labels: Class = 1, Class = 0, Trained, Data, Observed, Deterministic, Stochastic, descriptor distributions.)

This portion of the model was implemented using the PyMC3 package for Python 3. This two-state mixture model was used to classify sequences as antimicrobial using the descriptors in Figure 2, and is shown in graphical representation in Figure 3. Three descriptors were chosen based on inspection of Figure 2: net charge, number of non-polar groups, and number of charged groups, as these had the most distinct histograms between the two datasets used.
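To make the two-state descriptor classifier concrete, here is a minimal sketch of how a peptide's descriptor rankings could be scored against one mixture of Gaussians per descriptor and per state, with descriptors treated as independent as the text describes. The kernel parameters below are invented for illustration and are not fitted values from this work; the function and variable names are hypothetical, and the actual model is specified in PyMC3 rather than in plain Python.

```python
import math


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, kernels):
    """Density of a 1-D mixture; kernels is a list of (weight, mean, std)."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in kernels)


def log_likelihood(ranks, state_mixtures):
    """Sum of per-descriptor log densities (no correlation between descriptors)."""
    return sum(math.log(mixture_pdf(ranks[name], kernels))
               for name, kernels in state_mixtures.items())


def classify(ranks, active, inactive):
    """Return 1 if the active-state mixtures explain the rankings better, else 0."""
    return 1 if log_likelihood(ranks, active) > log_likelihood(ranks, inactive) else 0


# Hypothetical parameters on the 0-100 ranking scale (not fitted values).
ACTIVE = {
    "net_charge": [(0.7, 80.0, 10.0), (0.3, 55.0, 15.0)],
    "nonpolar_groups": [(1.0, 75.0, 12.0)],
    "charged_groups": [(1.0, 25.0, 12.0)],
}
INACTIVE = {name: [(1.0, 50.0, 30.0)] for name in ACTIVE}

amp_like = {"net_charge": 80.0, "nonpolar_groups": 75.0, "charged_groups": 25.0}
other = {"net_charge": 50.0, "nonpolar_groups": 45.0, "charged_groups": 60.0}
labels = [classify(amp_like, ACTIVE, INACTIVE), classify(other, ACTIVE, INACTIVE)]
```

Because each descriptor has its own one-dimensional mixture, changing the kernel number for one descriptor does not require touching the others.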
Next, a motif model was constructed which classifies sequences based on the existence of motifs in the sequence. Emphasis was placed on keeping the model interpretable to non-experts. For this reason the following attributes were chosen:

(1) There are 0 to k possible motif classes that may be observed.
(2) Each peptide belongs to only one motif class.
(3) Motifs may not be partially expressed.
(4) Non-motif residues are drawn from a “background” distribution that is shared among all peptides in the dataset.
(5) Motifs are of fixed length w, and motif distributions are sparse, i.e. usually only one amino acid exists in each position.
(6) The probability of a motif starting at a sequence position is independent of the position.
(7) Motif distributions should be sparse, i.e. they should have few non-negligible entries.

This model formulation is in part inspired by previous work in motif identification, such as Bailey and Elkan [4] and Schwartz and Gygi [47]. The features that are unique to this description relative to past motif models [3,58] are the regularization of motifs, a tied background distribution, independent motif start probability, and the ability to deal with variable length sequences. The regularization method chosen (L1 regularization, see Section 2.2) forces the motifs to be sparse so that each motif position only has one or two possible residues. This makes motif interpretation more intuitive. The tied background distribution reduces the number of model parameters by (k − 1)(A − 1), where A is the number of amino acids. This background distribution can also be used as the motif model by itself if we let k = 0. This “background-only” model is a limiting case of the motif model, but is otherwise specified and trained in an identical fashion. Such a change greatly complicates traditional algebraic analysis of the model, but is simple to include with this formulation.
The uniform motif start probability reduces the number of parameters by k(l − 1), where l is the length of the sequences. The ability to deal with variable length sequences without pre-alignment is a significant feature and is what allows modeling of the highly heterogeneous APD. The model description is thus far complex, but the trade-off is that the parameters that are derived from this model are intuitive and few. The motif model was implemented directly in Python as an extension module written in C++. Complete model specifications can be found in the Supporting Information.

The last classifier considered combines features from the previous two. Its graphical representation is shown in Figure 5a. In this model, the “Class” node shown in Figure 3 is connected to the “Member” node shown in Figure 4a. This indicates incorporation of the descriptor distributions shown in Figure 3 into a new “QSPR” block in the graph in Figure 4a. Thus, the model takes in both sequence information and descriptors. The best-performing motif number (k = 8) and motif width (w = 3) from the motif model, and the same descriptors (net charge, number of charged groups, and number of non-polar groups) from the QSPR model were used. The complete model specification is given in the Supporting Information.

Figure 4 Panel a is the graphical representation of the motif model used. The middle parts of the graph are repeated as many times as necessary to fit the length of a sequence (length 4 depicted). Panel b is an inset showing how the motif indicator governs whether the probability for a given amino acid is drawn from the background or motif distribution. Panel c is the prediction accuracy as a function of the motif width and motif number. A motif length of 0 indicates the performance of the background-only model. The maximum prediction accuracy (79%) of the motif model was the same as that of the background-only model.
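The motif model attributes listed in Section 2.1 (a shared background distribution, fixed-width motifs that are never partially expressed, and a start probability independent of position) suggest a simple form for the likelihood of one sequence under one motif class. The sketch below is an assumption-based illustration of that idea, not the specification from the Supporting Information; the function names and the toy distributions are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def background_likelihood(seq, background):
    """Likelihood under the background-only (k = 0) limiting case."""
    p = 1.0
    for aa in seq:
        p *= background[aa]
    return p


def motif_likelihood(seq, motif, background):
    """Average over all equally likely start positions. Residues inside the
    motif window use the per-position motif distributions; all other residues
    use the shared background distribution."""
    w = len(motif)
    n_starts = len(seq) - w + 1
    if n_starts <= 0:
        return 0.0  # assumption: a motif cannot be partially expressed
    total = 0.0
    for start in range(n_starts):
        p = 1.0
        for i, aa in enumerate(seq):
            if start <= i < start + w:
                p *= motif[i - start].get(aa, 0.0)
            else:
                p *= background[aa]
        total += p
    return total / n_starts


# Toy parameters: uniform background and a fully sparse "GLL" motif.
uniform_background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
gll_motif = [{"G": 1.0}, {"L": 1.0}, {"L": 1.0}]

with_motif = motif_likelihood("AGLLK", gll_motif, uniform_background)
without_motif = motif_likelihood("AAAAA", gll_motif, uniform_background)
baseline = background_likelihood("AGLLK", uniform_background)
```

With sparse motif distributions, a sequence that contains the motif scores far above the background-only likelihood, while a sequence without it scores zero under that motif class.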
Figure 5 Graphical representation and classification performance of the combined QSPR/motif model. (a) Graph of the combined QSPR/motif model. The “omitted motif block” refers to Figure 4a. The “Member” value drawn from the model shown in Figure 3 is used as the membership value in the rest of the model (Figure 4a). (b) Performance of the combined QSPR+motif model on the APD dataset with distributions from the 3-kernel QSPR model and the 8-motif, length-3 motif model. The x-axis is the weight assigned to the motif half of the model (0 is no influence and 1 is total influence).

2.2 Model Training and Validation

The QSPR model was trained using the built-in Metropolis-Hastings sampling algorithm for Hybrid Monte Carlo as implemented in the PyMC3 package (v 3.0) for Python 3. See the Supporting Information for complete model specification. The model parameters were initialized uniformly and trained for 3000 steps. In all cases, leave-one-out cross-validation was performed via the PyMC3 built-in implementation of Vehtari et al.

The motif model was trained using Gibbs sampling with constrained, per-coordinate infinite horizon stochastic gradient descent using L1 regularization with a squared-difference loss function. An analysis of this method can be found in McMahan and Streeter. A mathematical description follows. Given a D-dimensional vector of amino acid distributions, let $\bar{n}^{(t)} = N X^{(t)}$ be the vector of expected counts from each distribution at timestep t, with N the total number of observations, and let $m^{(t)}$ be the observed counts given by the observation (draw) made at step t, with regularization term $\|x\|_1$, and let $\nu$ represent some uniformly random noise vector. We define the loss function $L$, the learning rate $\eta^{(t)}$, and the regularization term $\|x\|_1$ as:

$$
\begin{aligned}
L &= \left(\bar{n}^{(t)} - m^{(t)}\right)^{2} + \|x\|_1, \\
\eta^{(t)} &= \left[\sum_{\tau=0}^{t-1} \left(\frac{\partial L}{\partial X^{(\tau)}}\right)^{2}\right]^{-1/2}, \text{ and} \\
\|x\|_1 &= \lambda \sum_{i=1}^{D} \left|\bar{n}_i^{(t)}\right|.
\end{aligned}
\qquad (1)
$$

Using the definitions in Equation 1, and given an initial distribution vector $X^{(0)}$, the update to $X^{(t)}$ at timestep t is given as

$$
\begin{aligned}
X^{(t+1)} &= X^{(t)} - \eta^{(t)} \frac{\partial L}{\partial X^{(t)}} + \nu, \text{ with} \\
\frac{\partial L}{\partial X^{(t)}} &= 2N\left(\bar{n}^{(t)} - m^{(t)}\right) + \lambda.
\end{aligned}
\qquad (2)
$$

In general, L1 regularization is defined as $\|x\|_1 = \lambda \sum_{i} |y_i - f(x_i)|$, where the $y_i$ term refers to the “target value,” and $\lambda$ is an adjustable parameter. In our case, $y_i = 0 \;\forall i$. This indicates a low belief in any value above zero for our motif distributions, and induces sparsity in the final distributions. The $\nu$ term is a stochastic noise term that uniformly adds an observation at random at each update step. This helps overcome overfitting by exploring more of the sample space.

Typical cross-validation of this motif model is not sufficient to evaluate all aspects of its performance. Its purpose is not only classification, but also identification of motifs among peptides, which are relatively rare. We evaluate prediction accuracy via withheld testing data from the APD dataset, but it is important to also validate the intended motif-capturing behavior of the model separately. However, because the entries in the APD do not have labeled motifs, and are not guaranteed to all contain motifs, it is impossible to validate this model’s ability to capture motif information from these positive cases. Thus, to evaluate the ability to identify motifs, trial runs were performed with arbitrarily constructed, small datasets with imposed motifs. One or more fixed motifs (e.g. QAFR, IEKG, etc.) were selected, and background members consisting of uniformly distributed amino acids were appended and prepended to the motifs randomly. For example, test peptides containing the QAFR motif might be ARQAFROI, or IQAFRGMO. During training, these datasets had a random 20% withheld as testing data. The artificially constructed motifs were captured accurately by the model with as few as 500 iterations over the data set.
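The construction of the artificial validation datasets can be sketched as follows. This is only an illustration under stated assumptions: the flank length limit `max_flank`, the restriction to the 20 standard amino acids, the seed handling, and all function names are hypothetical choices, since the text specifies only that uniformly distributed residues are randomly prepended and appended and that a random 20% is withheld.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: 20 standard residues only


def random_flank(rng, max_flank):
    """Uniformly distributed residues; the flank may be empty."""
    length = rng.randint(0, max_flank)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def imposed_motif_peptide(motif, rng, max_flank=4):
    """Embed a fixed motif between random leading and trailing residues."""
    return random_flank(rng, max_flank) + motif + random_flank(rng, max_flank)


def make_dataset(motifs, n_peptides, seed=0, test_fraction=0.2):
    """Return (training, testing) lists with a random fraction withheld."""
    rng = random.Random(seed)
    peptides = [imposed_motif_peptide(rng.choice(motifs), rng)
                for _ in range(n_peptides)]
    rng.shuffle(peptides)
    n_test = int(round(test_fraction * n_peptides))
    return peptides[n_test:], peptides[:n_test]


train, test = make_dataset(["QAFR", "IEKG"], n_peptides=50, seed=1)
```

Because every generated peptide contains a known motif at a known position, the fitted motif distributions can be checked directly against the imposed ones.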
Figures S5-S8 show the fitted motif distributions for the data with the imposed motif ARND.

Combining the two models requires no additional training step because the two halves of the combined model were trained previously. The trained distributions from the two halves of the model were used to evaluate likelihoods for the positive and negative datasets used previously. These likelihoods are normalized by dividing the likelihoods produced by each half of the model by the highest likelihood in that half. Then, weights W ranging from 0 to 1 were assigned to the motif model, with 1 − W being assigned to the QSPR model. The sum of these two weighted likelihoods for a given peptide is the likelihood produced by the combined model for that peptide. For each weight, a receiver operating characteristic (ROC) curve is generated. The ROC curve is generated by calculating the false positive rate (FPR) and true positive rate (TPR) of classification on the withheld testing data as the cutoff value for likelihood is varied between 0 and 100% of the maximum likelihood produced by the model. A data point that scores a likelihood above the cutoff value indicates a positive (i.e., the model predicts it to be antimicrobial). The FPR is the fraction of such points from the non-antimicrobial (negative) testing set, and the TPR is the fraction from the antimicrobial (positive) testing set. The accuracy values displayed in Figure 5b are the accuracy calculated on the withheld testing dataset at the optimal cutoff for the ROC curve at each weighting on the x-axis.

3 Results: Bayesian Network Models that Predict Activity

We have created three increasingly sophisticated models for predicting activity. Figure 6 shows example ROC curves for the best parameter sets for the QSPR and motif models. A summary of results using the best parameter sets for the three models is shown in Table 1.

Figure 6 Example receiver operating characteristic (ROC) curves for the two model types. (a) The ROC curve for the 3-kernel Gaussian mixture QSPR model. (b) The ROC curve for the motif model with 8 possible motifs of length 3. TPR is the true positive rate, and FPR is the false positive rate. The best cutoff was defined as the point which minimized $2\,\mathrm{FPR}^2 + (1 - \mathrm{TPR})^2$. This objective function was chosen to put an emphasis on a lower FPR.

Performance of each model was evaluated based on the accuracy produced by the model at the point on the ROC curve (generated using withheld testing data) that minimized the value $2\,\mathrm{FPR}^2 + (1 - \mathrm{TPR})^2$. This choice was made to emphasize a lower FPR. In the case of the Gaussian mixture model, the kernel number was varied between 2 and 10, and accuracy was evaluated for all three of the chosen chemical descriptors with that kernel number. This choice was made to simplify model specification and to limit the number of training sessions. While the same kernel number may not be optimal for each individual descriptor, the model produced sufficient accuracy with this simplification. For the motif model, motif count varied between k = 2 and k = 10, and motif length varied between w = 3 and w = 8. The “background-only” (i.e. k = w = 0) model was also evaluated.

Some QSPR trials with different kernel numbers produced very similar performances. In particular, the 3-kernel and 6-kernel trials both produced accuracy above 80%. Thus, it was necessary to decide which trial’s distributions to use for testing. The 3-kernel data was used after considering the distributions depicted in Figure S9. The 3-kernel and 6-kernel QSPR models performed nearly identically in terms of ROC and prediction accuracy (Figure S9a and S9b), but comparing Figure S9c and S9d shows us that the 3-kernel model’s distribution has a shape that is more indicative of the underlying structure of the raw data.

The motif model had optimal performance with k = 8 and w = 3, and the resultant ROC curve is shown in Figure 6b.
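The evaluation procedure (normalize each model's likelihoods by its maximum, mix them with weight W, scan cutoffs from 0 to 100% of the maximum score, and pick the cutoff that minimizes the FPR-weighted objective) can be sketched in a few lines. The likelihood values below are made-up numbers, and the function names and the 101-point cutoff grid are assumptions for illustration rather than details taken from this work.

```python
def normalize(likelihoods):
    """Divide by the highest likelihood produced by that model."""
    top = max(likelihoods)
    return [v / top for v in likelihoods]


def combine(motif_scores, qspr_scores, w):
    """Weight w on the motif half, 1 - w on the QSPR half."""
    return [w * m + (1.0 - w) * q for m, q in zip(motif_scores, qspr_scores)]


def roc_points(pos, neg, n_cutoffs=101):
    """(cutoff, FPR, TPR) as the cutoff goes from 0 to the maximum score."""
    top = max(pos + neg)
    points = []
    for i in range(n_cutoffs):
        cutoff = top * i / (n_cutoffs - 1)
        tpr = sum(s > cutoff for s in pos) / len(pos)
        fpr = sum(s > cutoff for s in neg) / len(neg)
        points.append((cutoff, fpr, tpr))
    return points


def best_cutoff(points):
    """Minimize 2*FPR^2 + (1 - TPR)^2, which penalizes false positives more."""
    return min(points, key=lambda p: 2.0 * p[1] ** 2 + (1.0 - p[2]) ** 2)


def evaluate(motif_lik, qspr_lik, labels, w):
    scores = combine(normalize(motif_lik), normalize(qspr_lik), w)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    cutoff, fpr, tpr = best_cutoff(roc_points(pos, neg))
    accuracy = (tpr * len(pos) + (1.0 - fpr) * len(neg)) / len(labels)
    return cutoff, fpr, tpr, accuracy


# Made-up likelihoods for two positive and two negative test peptides.
result = evaluate([0.8, 0.6, 0.2, 0.1], [4.0, 3.0, 1.0, 2.0], [1, 1, 0, 0], w=0.21)
```

Repeating `evaluate` over a grid of `w` values would give an accuracy-versus-weight curve of the kind plotted in Figure 5b.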
A heatmap of accuracies with different k and w values is shown in Figure 4c. Surprisingly, the model’s optimal performance with motifs is of identical accuracy with 0 motifs (background only), with both the k = 8, w = 3 and k = 0, w = 0 models having an accuracy of 79% with optimal cutoff. The ROC curve of the background-only motif model is shown in Figure S10. This may indicate that motifs are not important for antimicrobial activity, but merely the overall distribution of amino acids present, or it could indicate that motifs play a more complicated role in antimicrobial behavior than this model is able to capture.

The parameters that gave the best performance for the two individual models were also used for the combined model. The weights assigned to each half of the model were varied continuously from 0 to 1 to determine the optimal hyperparameter. Likelihoods from the two models were re-weighted by dividing all likelihoods for one model type by the highest likelihood produced by that model, to achieve comparable magnitudes from the two models. These results are shown in Figure 5b. With its best performance, the combined model outperformed both of the individual models, with a weight of 21% for the motif part of the model and 79% for the QSPR part producing a classification accuracy of 94%.

The ability to interpret the model may be seen in Figure 7. Figure 7a shows the probability distribution from the second motif from the k = 8, w = 3 model. The regularization operates as expected and the motifs are sparse; only one or two amino acids have non-negligible probability for a given position. Figure 7b shows all the motifs predicted by the model. The “Predict” column shows the count of peptides for which the given motif had the highest likelihood of appearing. This should be interpreted as the “best match” motif for a given peptide.
Being the best match does not imply that the peptide contains that motif, only that the shown motif gives the highest likelihood for that peptide among all motifs predicted by the model. The next column contains the number of sequences which actually contain each motif, obtained by exhaustive analysis. We see the model has correctly assigned each motif to the corresponding sequence based on the close match between the “Predict” and “Found” columns. There are relatively few examples of the motifs discovered by the model, but it does capture some common ones. For example, exhaustive analysis shows that GLL is the third most common 3-letter motif in the APD dataset, and it is captured by the model, as shown in Figure 7b. GGG is also among the 10 most common motifs found from brute-force analysis, and is also captured by the model. However, the model did not capture all the most common motifs.

[Figure 7, panel b]

Motif         Predict  Found
CIA           32       2
[LK][LK][CP]  12       32
GCC           75       2
CKC           32       4
GGG           98       13
GLL           19       29
KLL           34       13
LLL           49       1

[Figure 7, panel c: “Background Distribution”; y-axis: probability, 0.00 to 0.12]

Figure 7 Panel a shows the probability that a given amino acid appears in each of positions 1 through 3 in the motif “[LK][LK][CP]”. Due to regularization, the majority of the motifs predicted by the model are sparse (only one amino acid with non-negligible probability for each motif position). Panel b is the list of motifs predicted by the model. The “Predict” column is the number of sequences which are more likely to contain the corresponding motif than any other motif. The “Found” column is the number of sequences that actually contain the motif. Panel c is the background distribution of amino acids from the motif model. The y-axis is the probability of observing a randomly chosen amino acid from a random peptide in this dataset.

Finally, the background distribution is shown in Figure 7c. This may be considered the amino acid composition of the APD without the motifs observed in sequences.
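Exhaustive motif counting of the kind used for the “Found” column could be implemented as below. This is a hedged sketch: the function name and the toy sequences are hypothetical, and each motif is counted once per sequence that contains it.

```python
from collections import Counter


def count_sequences_containing(sequences, width=3):
    """Count, for every k-mer of the given width, how many sequences contain it."""
    counts = Counter()
    for seq in sequences:
        # A set ensures a motif repeated within one sequence is counted once.
        kmers = {seq[i:i + width] for i in range(len(seq) - width + 1)}
        counts.update(kmers)
    return counts


# Toy sequences, invented purely for illustration.
toy_sequences = ["GLLKGLL", "KLLG", "GGGA", "AGLL"]
toy_counts = count_sequences_containing(toy_sequences, width=3)
most_common_motif = toy_counts.most_common(1)[0]
```

Ranking the resulting counts with `most_common` gives the list of most frequent 3-letter motifs against which model-predicted motifs can be compared.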
The background distribution is not uniform, differs from that of the Human dataset and, as mentioned above, contributes significantly to the performance of the classifier. This is not unexpected, since amino acid composition is a well-used descriptor for analyzing peptides and proteins. Furthermore, the identical accuracy of the best-case motif model and the “background-only” model, as well as the failure of the model to capture all the most common motifs, indicates that either motifs are unimportant in the antimicrobial activity of peptides, or the model is insufficient to capture the important aspects of peptide motifs.

In order to compare our model with a traditional machine learning method, we also evaluated the performance of a linear support vector machine (SVM) on the chemical descriptors used in the QSPR model. The datasets used were the same. We used the built-in SVM method of the scikit-learn Python 3 package, with 3000 training steps. After convergence, the SVM had an accuracy of 84% with its optimal cutoff, which is lower than the 87% and 94% achieved by the QSPR and QSPR + motif models, respectively.

4 Discussion: Identifying Multi-Functional Peptides

As shown above, it is possible to construct accurate models that can predict peptide activity. These tools can be further used to find peptides which have multiple activities or multiple functions. As stated in the introduction, the Human dataset contains peptides which are likely antifouling. To find a peptide that is both antimicrobial and antifouling, we can identify a peptide from the APD that scores as active according to a model trained on the Human dataset. The opposite procedure is possible using the models trained above, where we find a human protein fragment that is likely antimicrobial. However, there is no experimental evidence that such a fragment is antifouling. Choosing a peptide from the APD that is human-like will guarantee at a minimum that it has antimicrobial activity, and that the model predicts it is similar in character to sequences found on the surfaces of human proteins.

Model         FPR    TPR   Accuracy  MCC
QSPR          8.1%   83%   87%       0.75
Motif         17%    75%   79%       0.58
QSPR + Motif  3.4%   90%   94%       0.87

Table 1 Summary of models that best predicted antimicrobial activity. The QSPR model is depicted in Figure 3, the motif model in Figure 4, and the QSPR + motif model in Figure 5a. FPR and TPR are the false positive and true positive rates of classification, respectively. MCC is the Matthews correlation coefficient.

There is no evidence showing that motifs are relevant for antifouling, but past research has shown that strong net neutral partial charges, hydrophilicity, and low self-interaction are [41,58], so the QSPR model described above was used. The model was fit to the Human dataset with the same procedure as was used on the APD. The 8-kernel QSPR model performed the best for this dataset, with an accuracy of 67%. Overall, the model performance on the Human dataset was worse than on the APD, as can be seen by the ROC curve in Figure S11. This is likely due to the multimodality in the Human dataset, as well as its broad definition of activity (being present on the surface of a human protein). However, the model still achieved a low FPR, as desired. This indicates that the modeling procedure can be applied to datasets other than the one it was developed on.

After analyzing the descriptors of the APD against the optimal cutoff for the 8-kernel QSPR model trained on the Human dataset, approximately 30% of the APD peptides were found to be human-like. A peptide was designated as “human-like” if it scored above the cutoff used to produce the optimal accuracy on the ROC curve of the QSPR model fitted to the Human data.
This low percentage shows that AMPs are generally different from human protein surfaces, which are thought to be optimized for minimal nonspecific interactions [56,57].

After omitting sequences shorter than 30 amino acids, the most human-like AMPs are WKSESLCTPGCVTGALQTCFLQTLTCNCKISK (APD number AP00206) and ITSISLCTPGCKTGALMGCNMKTATCHCSIHVSK (APD number AP00205). The first is subtilin, an antibiotic produced by the bacterium and model organism Bacillus subtilis [36,55]. The second, nisin A, is produced by the bacterium Lactococcus lactis, a species used in the production of cheeses [44,55]. They are similar to sequences from the “Human” dataset due to their low number of non-polar groups, high number of charged groups, and slightly negative net charge. Due to the connection between low protein adsorption and human protein surfaces, these two sequences may be good candidates for stable (non-fouling) antimicrobial surface coatings. Both of these sequences underwent post-translational modification (PTM) before antimicrobial behavior was observed, yet the model was still able to predict their antimicrobial nature without this information. This demonstrates that it is possible to predict antimicrobial potential for a given sequence without knowledge of PTMs. Thus, this model has the advantage of needing little information while still providing high classification accuracy, but it also has the disadvantage of being unable to predict whether PTMs are necessary for activity. Although this model cannot indicate whether PTM will be necessary for a given peptide, it could still be used to screen or evaluate candidate sequences for PTM experiments.

5 Conclusions

The application of Bayesian network models to QSPR peptide modeling techniques has been introduced utilizing open-source statistical modeling software. These models are flexible and may encode sophisticated chemical knowledge, as seen from the motif model presented.
This flexibility also allows models to be constructed with easy-to-interpret parameters, as demonstrated by the motif and combined QSPR + motif models, where regularization forced each motif position to contain only one amino acid, as opposed to previous models where motif positions have non-negligible probability assigned to each of the 20 amino acids [3,58]. These models show good classification performance, with a maximum accuracy of 94% at predicting whether a peptide is active against Gram-positive bacteria, given only the sequence of amino acids and their chemical descriptors. This is as good as more opaque and complex strategies such as multilayer artificial neural networks and N-gram representation random forest modeling, and better than a linear SVM, with the advantage of chemically meaningful interpretations. Additionally, these models were used to identify potentially multifunctional peptides that are both antifouling and antimicrobial. Finally, due to the identical performance of the best-case motif model and the “background-only” model, we can conclude that either motifs are unimportant to the antimicrobial activity of peptides, or their importance is more complicated than this model is able to capture. Bayesian network models provide a significant advance in the type of peptide activity modeling that can be done, and the ease with which such models can be constructed and combined.

6 Acknowledgement

This work was supported by the Office of Naval Research (N00014-10-1-0600) and the National Science Foundation (CBET-0854298).

References

[1] Murray Aitkin, Duy Vu, and Brian Francis. Statistical modelling of a terrorist network. J. R. Stat. Soc. Ser. A (Statistics Soc.), 180(3):751–768, jun 2017. ISSN 09641998. doi: 10.1111/rssa.12233. URL http://doi.wiley.com/10.1111/rssa.12233.
[2] I W Althaus, A J Gonzales, J J Chou, D L Romero, M R Deibel, Kuo-Chen Chou, F J Kezdy, L Resnick, M E Busso, and A G So.
The quinoline U-78036 is a potent inhibitor of HIV-1 reverse transcriptase. J. Biol. Chem., 268(20):14875–14880, 1993. URL http://www.jbc.org/content/268/20/14875.
[3] T. L. Bailey and C. Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings International Conference on Intelligent Systems for Molecular Biology; ISMB. International Conference on Intelligent Systems for Molecular Biology, 2:28–36, 1994. ISSN 1553-0833. URL http://view.ncbi.nlm.nih.gov/pubmed/7584402.
[4] Timothy L. Bailey and Charles Elkan. Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization. Machine Learning, 21(1):51–80, October 1995. ISSN 08856125. doi: 10.1023/A:1022617714621. URL http://dx.doi.org/10.1023/A:1022617714621.
[5] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal Machine Learning: A Survey and Taxonomy. pages 1–20, may 2017. URL http://arxiv.org/abs/1705.09406.
[6] Indrani Banerjee, Ravindra C. Pangule, and Ravi S. Kane. Antifouling Coatings: Recent Developments in the Design of Surfaces That Prevent Fouling by Proteins, Bacteria, and Marine Organisms. Adv. Mater., 23(6):690–718, feb 2011. ISSN 09359648. doi: 10.1002/adma.201001215. URL http://doi.wiley.com/10.1002/adma.201001215.
[7] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The protein data bank. Nucleic Acids Res., 28:235–242, 2000. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC102472/.
[8] Sara Bobone, Alessandro Piazzon, Barbara Orioni, Jens Z. Pedersen, Yong H. Nan, Kyung-Soo Hahm, Song Y. Shin, and Lorenzo Stella. The thin line between cell-penetrating and antimicrobial peptides: the case of Pep-1 and Pep-1-K. J. Peptide Sci., 17(5):335–341, May 2011. doi: 10.1002/psc.1340. URL http://dx.doi.org/10.1002/psc.1340.
[9] Jeremy P Bradshaw, BK Sharma, GPS Raghava, BC Schutte, TL Casavant, PB McCray, V Brusic, and VB Bajic. Analysis and prediction of antibacterial peptides. BioDrugs, 17(4):233–240, jul 2003. ISSN 1173-8804. doi: 10.2165/00063030-200317040-00002. URL http://link.springer.com/10.2165/00063030-200317040-00002.
[10] Frank R. Burden and David A. Winkler. Robust QSAR Models Using Bayesian Regularized Neural Networks. J. Med. Chem., 42(16):3183–3187, aug 1999. ISSN 0022-2623. doi: 10.1021/jm980697n. URL http://pubs.acs.org/doi/abs/10.1021/jm980697n.
[11] Xianwen Chen, Lige Ren, Soochong Kim, Nicholas Carpino, James L. Daniel, Satya P. Kunapuli, Alexander Y. Tsygankov, and Dehua Pei. Determination of the substrate specificity of protein-tyrosine phosphatase TULA-2 and identification of Syk as a TULA-2 substrate. The Journal of biological chemistry, 285(41):31268–31276, October 2010. ISSN 1083-351X. doi: 10.1074/jbc.M110.114181. URL http://dx.doi.org/10.1074/jbc.M110.114181.
[12] Artem Cherkasov, Eugene N. Muratov, Denis Fourches, Alexandre Varnek, Igor I. Baskin, Mark Cronin, John Dearden, Paola Gramatica, Yvonne C. Martin, Roberto Todeschini, Viviana Consonni, Victor E. Kuz’Min, Richard Cramer, Romualdo Benigni, Chihae Yang, James Rathman, Lothar Terfloth, Johann Gasteiger, Ann Richard, and Alexander Tropsha. QSAR modeling: Where have you been? Where are you going to? Journal of Medicinal Chemistry, 57(12):4977–5010, 2014. ISSN 15204804. doi: 10.1021/jm4004285.
[13] Jiang Shou-Ping Liu Wei-Min Fee Chih-Hao Chou, Kuo-Chen. Graph theory of enzyme kinetics i. steady-state reaction systems. SCIENCE CHINA Mathematics, 22(3):341, 1979. doi: 10.1360/ya1979-22-3-341. URL http://math.scichina.com:8081/sciAe/EN/abstract/article_380187.shtml.
[14] Kuo-Chen Chou. Graphic Rules in Steady and Non-steady State Enzyme Kinetics. (20):12074–12079. URL http://www.jbc.org/content/264/20/12074.full.pdf.
[15] Kuo-Chen Chou.
Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol., 273(1):236–247, 2011. ISSN 00225193. doi: 10.1016/j.jtbi.2010.12.024. URL http://www.sciencedirect.com/science/article/pii/S002251931000679X.
[16] Kuo-Chen Chou and David W. Elrod. Protein subcellular location prediction. Protein Eng., 12(2):107–118, February 1999. ISSN 1741-0134. doi: 10.1093/protein/12.2.107. URL http://dx.doi.org/10.1093/protein/12.2.107.
[17] Kuo-Chen Chou and S Forsén. Graphical rules for enzyme-catalysed rate laws. Biochem. J., 187(3):829–35, jun 1980. ISSN 0264-6021. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC1162468.
[18] The ENCODE Project Consortium. An Integrated Encyclopedia of DNA Elements in the Human Genome. Nature, 489:57–74, September 2012. doi: 10.1038/nature11247.
[19] Sergio Davis, Claudia Loyola, and Joaquín Peralta. Bayesian statistical modelling of microcanonical melting times at the superheated regime. pages 1–20, aug 2017. URL http://arxiv.org/abs/1708.05210.
[20] Simon Duane, A.D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Phys. Lett. B, 195(2):216–222, sep 1987. ISSN 03702693. doi: 10.1016/0370-2693(87)91197-X. URL http://linkinghub.elsevier.com/retrieve/pii/037026938791197X.
[21] Christopher D. Fjell, Robert E.W. Hancock, and Artem Cherkasov. AMPer: a database and an automated discovery tool for antimicrobial peptides. Bioinformatics, 23(9):1148–1155, may 2007. ISSN 1460-2059. doi: 10.1093/bioinformatics/btm068. URL https://academic.oup.com/bioinformatics/article/23/9/1148/272556.
[22] Christopher D. Fjell, Håvard Jenssen, Kai Hilpert, Warren A. Cheung, Nelly Panté, Robert E. W. Hancock, and Artem Cherkasov. Identification of Novel Antibacterial Peptides by Chemoinformatics and Machine Learning. J. Med. Chem., 52(7):2006–2015, March 2009. doi: 10.1021/jm8015365. URL http://dx.doi.org/10.1021/jm8015365.
[23] Christopher D. Fjell, Jan A.
Hiss, Robert E. W. Hancock, and Gisbert Schneider. Designing antimicrobial peptides: form follows function. Nat. Rev. Drug. Discov., 11(1):37–51, January 2012. ISSN 1474-1776. doi: 10.1038/nrd3591. URL http://dx.doi.org/10.1038/nrd3591.
[24] V. Frecer, B. Ho, and J. L. Ding. De novo design of potent antimicrobial peptides. Antimicrob. Agents, 48(9):3349–3357, September 2004. ISSN 0066-4804. doi: 10.1128/aac.48.9.3349. URL http://dx.doi.org/10.1128/aac.48.9.3349.
[25] Margaret Gamalo-Siebers, Jasmina Savic, Cynthia Basu, Xin Zhao, Mathangi Gopalakrishnan, Aijun Gao, Guochen Song, Simin Baygani, Laura Thompson, H. Amy Xia, Karen Price, Ram Tiwari, and Bradley P. Carlin. Statistical modeling for Bayesian extrapolation of adult clinical trial information in pediatric drug evaluation. Pharm. Stat., 16(4):232–249, jul 2017. ISSN 15391604. doi: 10.1002/pst.1807. URL http://doi.wiley.com/10.1002/pst.1807.
[26] Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(1):452–459, May 2015. doi: 10.1038/nature14541. URL https://www.nature.com/articles/nature14541.
[27] A. K. Ghose and G. M. Crippen. Atomic physicochemical parameters for three-dimensional-structure-directed quantitative structure-activity relationships. 2. Modeling dispersive and hydrophobic interactions. J. Chem. Inf. Comput. Sci., 27(1):21–35, February 1987. ISSN 0095-2338. URL http://view.ncbi.nlm.nih.gov/pubmed/3558506.
[28] Mark Hewitt, Mark T. D. Cronin, Judith C. Madden, Philip H. Rowe, Clara Johnson, Andrea Obi, and Steven J. Enoch. Consensus QSAR Models: Do the Benefits Outweigh the Complexity? J. Chem. Inf. Model., 47(4):1460–1468, July 2007. doi: 10.1021/ci700016d. URL http://dx.doi.org/10.1021/ci700016d.
[29] Michael M. Hoffman, Jason Ernst, Steven P. Wilder, Anshul Kundaje, Robert S. Harris, Max Libbrecht, Belinda Giardine, Paul M. Ellenbogen, Jeffrey A. Bilmes, Ewan Birney, Ross C. Hardison, Ian Dunham, Manolis Kellis, and William S.
Noble. Integrative annotation of chromatin elements from ENCODE data. Nucl. Acids Res., 41(2):827–841, January 2013. ISSN 1362-4962. doi: 10.1093/nar/gks1284. URL http://dx.doi.org/10.1093/nar/gks1284.
[30] Joanna Jaworska, Nina Nikolova-Jeliazkova, and Tom Aldenberg. QSAR Applicability Domain Estimation by Projection of the Training Set in Descriptor Space: A Review. ATLA, (33):445–459, 2005.
[31] Mikko Koivisto and Kismat Sood. Exact Bayesian Structure Discovery in Bayesian Networks. J. Mach. Learn. Res., 5:549–573, 2004. URL http://www.jmlr.org/papers/volume5/koivisto04a/koivisto04a.pdf.
[32] Matthias Kormaksson, James G Booth, Maria E Figueroa, and Ari Melnick. Integrative model-based clustering of microarray methylation and expression data. Ann. Appl. Stat., 6(3):1327–1347, 2012.
[33] Manish Kumar, Ruchi Verma, Gajendra P. S. Raghava, Y Sugiura, SH Seah, TW Tan, V Brusic, and VB Bajic. AntiBP2: improved version of antibacterial peptide prediction. J. Biol. Chem., 281(9):5357–5363, mar 2006. ISSN 0021-9258. doi: 10.1074/jbc.M511061200. URL http://www.jbc.org/lookup/doi/10.1074/jbc.M511061200.
[34] Qilin Li, Shaily Mahendra, Delina Y. Lyon, Lena Brunet, Michael V. Liga, Dong Li, and Pedro J.J. Alvarez. Antimicrobial nanomaterials for water disinfection and microbial control: Potential applications and implications. Water Res., 42(18):4591–4602, nov 2008. ISSN 0043-1354. doi: 10.1016/J.WATRES.2008.08.015. URL http://www.sciencedirect.com/science/article/pii/S0043135408003333.
[35] S X Lin and K E Neet. Demonstration of a slow conformational change in liver glucokinase by fluorescence spectroscopy. J. Biol. Chem., 265(17):9670–9675, jun 1990. URL http://www.ncbi.nlm.nih.gov/pubmed/2351663.
[36] W Liu and J N Hansen. The antimicrobial effect of a structural variant of subtilin against outgrowing Bacillus cereus T spores and vegetative cells occurs by different mechanisms. Appl. Environ. Microbiol., 59(2):648–51, feb 1993. ISSN 0099-2240.
URL https://www.ncbi.nlm.nih.gov/pubmed/8434932.
[37] H. B. McMahan and M. Streeter. Adaptive Bound Optimization for Online Convex Optimization. ArXiv e-prints, pages 1–19, February 2010. URL https://arxiv.org/pdf/1002.4908.pdf.
[38] Paul David McNicholas and Thomas Brendan Murphy. Parsimonious Gaussian mixture models. Stat. Comput., 18(3):285–296, sep 2008. ISSN 0960-3174. doi: 10.1007/s11222-008-9056-0. URL http://link.springer.com/10.1007/s11222-008-9056-0.
[39] R D Newcomb, P M Campbell, D L Ollis, E Cheah, R J Russell, and J G Oakeshott. A single amino acid substitution converts a carboxylesterase to an organophosphorus hydrolase and confers insecticide resistance on a blowfly. Proc. Natl. Acad. Sci. U. S. A., 94(14):7464–7468, jul 1997. ISSN 0027-8424. URL http://www.ncbi.nlm.nih.gov/pubmed/9207114.
[40] Alice B. Nongonierma and Richard J. FitzGerald. Learnings from quantitative structure-activity relationship (QSAR) studies with respect to food protein-derived bioactive peptides: a review. RSC Adv., 6(79):75400–75413, 2016. ISSN 2046-2069. doi: 10.1039/C6RA12738J. URL http://xlink.rsc.org/?DOI=C6RA12738J.
[41] Ann K. Nowinski, Andrew D. White, Andrew J. Keefe, and Shaoyi Jiang. Biologically Inspired Stealth Peptide-Capped Gold Nanoparticles. Langmuir, 30(7):1864–1870, feb 2014. ISSN 0743-7463. doi: 10.1021/la404980g. URL http://pubs.acs.org/doi/10.1021/la404980g.
[42] Manal Othman, Sujay Ratna, Anant Tewari, Anthony M. Kang, Katherine Du, and Iosif I. Vaisman. Classification and Prediction of Antimicrobial Peptides Using N-gram Representation and Machine Learning. In Proc. 8th ACM Int. Conf. Bioinformatics, Comput. Biol. Heal. Informatics - ACM-BCB ’17, pages 605–605, New York, New York, USA, 2017. ACM Press. ISBN 9781450347228. doi: 10.1145/3107411.3108215. URL http://dl.acm.org/citation.cfm?doid=3107411.3108215.
[43] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R.
Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[44] L A Rogers. THE INHIBITING EFFECT OF STREPTOCOCCUS LACTIS ON LACTOBACILLUS BULGARICUS. J. Bacteriol., 16(5):321–325, nov 1928. ISSN 0021-9193. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC375033.
[45] John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. Probabilistic programming in Python using PyMC3. PeerJ Comput. Sci., 2:e55, apr 2016. ISSN 2376-5992. doi: 10.7717/peerj-cs.55. URL https://peerj.com/articles/cs-55.
[46] Mario Salwiczek, Yue Qu, James Gardiner, Richard A. Strugnell, Trevor Lithgow, Keith M. McLean, and Helmut Thissen. Emerging rules for effective antimicrobial coatings. Trends Biotechnol., 32(2):82–90, feb 2014. ISSN 0167-7799. doi: 10.1016/J.TIBTECH.2013.09.008. URL http://www.sciencedirect.com/science/article/pii/S0167779913002072.
[47] Daniel Schwartz and Steven P Gygi. An iterative statistical approach to the identification of protein phosphorylation motifs from large-scale data sets. Nat. Biotechnol., 23(11):1391–1398, nov 2005. ISSN 1087-0156. doi: 10.1038/nbt1146. URL http://www.nature.com/articles/nbt1146.
[48] D M Stalker, W R Hiatt, and L Comai. A single amino acid substitution in the enzyme 5-enolpyruvylshikimate-3-phosphate synthase confers resistance to the herbicide glyphosate. J. Biol. Chem., 260(8):4724–4728, apr 1985. ISSN 0021-9258. URL http://www.ncbi.nlm.nih.gov/pubmed/2985565.
[49] Christoph Steinbeck, Yongquan Han, Stefan Kuhn, Oliver Horlacher, Edgar Luttmann, and Egon Willighagen. The chemistry development kit (CDK): An Open-Source java library for chemo- and bioinformatics. J. Chem. Inf. Comput. Sci., 43(2):493–500, February 2003. ISSN 0095-2338. doi: 10.1021/ci025584y. URL http://dx.doi.org/10.1021/ci025584y.
[50] Michael C. Sweeney, Anne-Sophie S.
Wavreille, Junguk Park, Jonathan P. Butchar, Susheela Tridandapani, and Dehua Pei. Decoding protein-protein interactions through combinatorial chemistry: sequence specificity of SHP-1, SHP-2, and SHIP SH2 domains. Biochem, 44(45):14932–14947, November 2005. ISSN 0006-2960. doi: 10.1021/bi051408h. URL http://dx.doi.org/10.1021/bi051408h.
[51] Shaini Thomas, Shreyas Karnik, Ram Shankar Barai, V. K. Jayaraman, and Susan Idicula-Thomas. CAMP: a useful resource for research on antimicrobial peptides. Nucleic Acids Res., 38(suppl 1):D774–D780, jan 2010. ISSN 0305-1048. doi: 10.1093/nar/gkp1021. URL http://www.ncbi.nlm.nih.gov/pubmed/19923233.
[52] Marc Torrent, David Andreu, Victòria M. Nogués, and Ester Boix. Connecting Peptide Physicochemical and Antimicrobial Properties by a Rational Prediction Model. PLoS ONE, 6(2):e16968, February 2011. doi: 10.1371/journal.pone.0016968. URL http://dx.doi.org/10.1371/journal.pone.0016968.
[53] Aki Vehtari, Andrew Gelman, and Jonah Gabry. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Stat. Comput., 27(5):1413–1432, sep 2017. ISSN 0960-3174. doi: 10.1007/s11222-016-9696-4. URL http://link.springer.com/10.1007/s11222-016-9696-4.
[54] Guangshun Wang. Post-translational Modifications of Natural Antimicrobial Peptides and Strategies for Peptide Engineering. Curr. Biotechnol., 1(1):72–79, feb 2012. URL http://www.ncbi.nlm.nih.gov/pubmed/24511461.
[55] Guangshun Wang, Xia Li, and Zhe Wang. APD2: the updated antimicrobial peptide database and its application in peptide design. Nucl. Acids Res., 37(suppl 1):D933–D937, January 2009. ISSN 1362-4962. doi: 10.1093/nar/gkn823. URL http://dx.doi.org/10.1093/nar/gkn823.
[56] Andrew D White, Wenjun Huang, and Shaoyi Jiang. Role of Nonspecific Interactions in Molecular Chaperones through Model-based Bioinformatics. Biophys. J., 103:2485–2491, 2012.
[57] Andrew D White, Ann K Nowinski, Wenjun Huang, Andrew J Keefe, Fang Sun, and Shaoyi Jiang. Decoding nonspecific interactions from nature. Chem. Sci., 3:3488–3494, 2012.
[58] Andrew D. White, Andrew J. Keefe, Ann K. Nowinski, Qing Shao, Kyle Caldwell, and Shaoyi Jiang. Standardizing and simplifying analysis of peptide library data. J. Chem. Inf. Model., 53(2):493–499, January 2013. doi: 10.1021/ci300484q. URL http://dx.doi.org/10.1021/ci300484q.
[59] Xuan Xiao, Pu Wang, Wei-Zhong Lin, Jian-Hua Jia, and Kuo-Chen Chou. iAMP-2L: A two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Anal. Biochem., 436(2):168–177, may 2013. ISSN 00032697. doi: 10.1016/j.ab.2013.01.019. URL http://linkinghub.elsevier.com/retrieve/pii/S0003269713000390.
[60] N Yadav, R E McDevitt, S Benard, and S C Falco. Single amino acid substitutions in the enzyme acetolactate synthase confer resistance to the herbicide sulfometuron methyl. Proc. Natl. Acad. Sci. U. S. A., 83(12):4418–4422, jun 1986. ISSN 0027-8424. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC323744/.
[61] G. Yu, G. Sapiro, and S. Mallat. Solving inverse problems with piecewise linear estimators: From gaussian mixture models to structured sparsity. IEEE Trans. Image Process., 21(5):2481–2499, May 2012. ISSN 1057-7149. doi: 10.1109/TIP.2011.2176743.
[62] G P Zhou and M H Deng. An extension of Chou’s graphic rules for deriving enzyme kinetic equations to systems involving parallel reaction pathways. Biochem. J., 222(1):169–176, aug 1984. ISSN 0264-6021. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1144157/.

Classifying Antimicrobial and Multifunctional Peptides
Supporting Information

Rainier Barrett†, Shaoyi Jiang∗,‡, and Andrew D White∗,†

†Department of Chemical Engineering, University of Rochester, 4001 Wegmans Hall, Rochester, NY 14627, USA
‡Department of Chemical Engineering, University of Washington, 440 Benjamin Hall IRB 616 NE Northlake Pl.
Seattle, WA 98105
E-mail: sjiang@uw.edu; andrew.white@rochester.edu
Phone: 585-276-7395

Department of Chemical Engineering, University of Rochester, 206 Gavett Hall, Rochester, NY 14627, USA. Email: andrew.white@rochester.edu
Department of Chemical Engineering, University of Washington, Box 351750, Seattle, WA 98195, USA
∗To whom correspondence should be addressed

Converting Descriptors into Ranks

In order to use a structural descriptor in a Bayesian network model, it must be converted into a form that may be described by a probability distribution. The probability that a descriptor f(·) equals a value x in a compound c is given by:

$$\Pr(f(c) = x) = \sum_i w_i\,\delta(f(c) - x), \qquad Z = \sum_i w_i \tag{1}$$

where Z is the partition coefficient, $w_i$ is the un-normalized probability (weight) of observing the ith compound in the chemical space, and δ is an indicator or delta-function. The weighting parameters may be adjusted, for example, to account for synthetic difficulty, recognizing the fact that the experimentally active compounds are likely chosen with bias. In the case of peptide libraries, the weights are unity because peptide libraries have little to no synthetic bias. The partition coefficient for a peptide library is $A^l$, where A is the size of the alphabet (generally 20 for amino acids) and l is the length of the amino acid sequence.

Constructing probability distributions for group-wise additive descriptors follows two approaches. When the chemical space is small (l ≤ 3 for peptides), all descriptor values may be enumerated to create a probability distribution. When the chemical space is large (l > 3 for peptides), the probability distribution may be approximated as a sum of l normal distributions. In the case of peptide libraries, l is the length and the normal distributions are identical.
In the case of combinatorial organic libraries, l is the number of positions that may be exchanged and the normal distributions may not be identical between positions. The approximation is accurate provided the number of 0’s is low (e.g., the number of sulfur atoms in the peptides will not fit into this approximation). Examples of this approximation may be seen in Figures S2-S4. The mean of the normal distributions is the mean ($\mu$) of the descriptor calculated on the combinatorial components (e.g., amino acids) and the variance ($\sigma^2$) is calculated likewise. The sum of the l normal distributions will have a mean of $l\mu$ and variance $l\sigma^2$. For descriptors that are not group-wise additive, the probability distribution may be estimated by sampling from the combinatorial library, where the sampling is done according to the weights $w_i$.

Once the probability distribution over the chemical space of the library is calculated, descriptors for the active/training compounds are transformed to incorporate information about this probability distribution. This is done by converting the descriptor into a rank between 0 and 100, where the rank is that of the descriptor relative to the chemical space. The ranks come from quantiling the descriptors calculated over the chemical space. For example, quantiling the number of charged groups over a chemical space with 4 quantiles could yield that the bottom 25% are between 0–5 charged groups, the 25–50% are between 5–6 charged groups, the 50–75% are between 6–7 and the top 25% are between 7–15. Using this distribution, an active compound with 3 charged groups would be given a rank of 1, because it is in the first quantile. An active compound with 7 charged groups has a rank of 3 and an active compound with 13 charged groups would also have a rank of 3. Notice how the unevenness of the original distribution is removed and the ranks correlate to the ranking over the chemical space. The entire transformation process is depicted in Figure S1.

This descriptor transformation has three benefits.
First, it is immediately obvious if a descriptor is at an extreme value. Second, when examining multiple descriptors, their range corresponds exactly to their span of the entire chemical space. Thus, if a descriptor range is 5–95, it is not significant. If it is within the range of 20–25, then the descriptors occupy a range that only 5% of the chemical space of the library occupies. Third, the effect of length on the peptide descriptors may be removed by only comparing descriptors against uniform length probability distributions. For example, if there are sequences from lengths 3–10 in a library, the descriptors may be calculated relative only to sequences of the same length. Then a rank of 5 is interpreted as in the bottom 5% relative to sequences of the same length. If this is not desired, only the probability distribution on the longest 2 lengths needs to be calculated, since that corresponds to 99.75% ($1 - 1/20^2$) of the possible values.

[Flowchart nodes: f(c); Group Additive (Yes/No); length < 3 (Yes/No); Enumerate; Normal Approximation; Sample; Data; P{f(c) = x}; 0-100 Score]

Figure S1 A flowchart for converting a descriptor, f(c), into a rank from 0–100 that both removes biases from the chemical space of the library and normalizes it for use in Bayesian network models.

[Plot legend: Histogram, Approximation; x-axis: MW, 0 to 400]

Figure S2 The dataset is all combinations of the 20 amino acids with length 3. The approximation is the sum of three identical normal distributions parameterized to the molecular weight of the 20 amino acids. The approximation works well, even at this low length.

[Plot legend: Histogram, Approximation; x-axis: nAromaticGroups, 0 to 3]

Figure S3 The dataset is all combinations of the 20 amino acids with length 3. The approximation is the sum of three identical normal distributions parameterized to an aromatic indicator function on the 20 amino acids (1 for aromatic, 0 for non-aromatic). The approximation does not work well, due to the high number of zero values, as mentioned in the text.
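The normal approximation and rank conversion described in this section can be sketched as follows. It assumes unit weights, a group-wise additive descriptor, and a simplified per-residue charge table (K and R as +1, D and E as −1, all others 0) chosen only for illustration; the function names are hypothetical.

```python
import math


def component_stats(values):
    """Mean and population variance of a descriptor over the alphabet."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, var


def normal_rank(descriptor_value, component_values, length):
    """Rank (0 to 100) of a descriptor under the normal approximation.

    The summed descriptor of a length-l sequence is treated as normal
    with mean l*mu and variance l*sigma^2.
    """
    mu, var = component_stats(component_values)
    total_mu = length * mu
    total_sd = math.sqrt(length * var)
    z = (descriptor_value - total_mu) / total_sd
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return 100.0 * cdf


# Assumed per-residue charges for the 20 amino acids (illustrative only).
toy_charges = [1, 1, -1, -1] + [0] * 16
rank_neutral = normal_rank(0, toy_charges, length=10)
rank_positive = normal_rank(2, toy_charges, length=10)
```

Under these assumptions a net charge of 0 in a 10-residue peptide sits at the middle of the library distribution, while a net charge of +2 lands in roughly the top tenth.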
[Figure S4 plot: histogram and approximation, probability versus netCharge]
Figure S4 The dataset is all combinations of the 20 amino acids with length 3. The approximation is the sum of three identical normal distributions parameterized to the charge of the 20 amino acids. The approximation works well, even at this low length.

Model specifications

The graphical models are specified in the GitHub repository found at https://github.com/RainierBarrett/pymc3_qspr. This software is made freely available under the GNU General Public License.

Motif Model Verification

Figures S5-S8 are plots of the motif class distribution trained on the artificial dataset with the imposed motif "ARND". As described in the main text, the dataset of peptides with imposed motifs was artificially constructed by randomly adding uniformly distributed amino acids before and/or after the imposed motif. Notice the sparseness: the model clearly distinguishes which amino acid is most likely to be in which motif position, with no knowledge of what the motif will be or where it will occur. The model accurately captures the motifs in this simple case. These figures were produced by fitting the motif model for 1000 training steps with the same training method as described in the main text.

Figure S9 shows a comparison between two equally accurate kernel numbers for the QSPR model. The receiver operating characteristic (ROC) curves and the predicted rank distribution for the number of non-polar groups descriptor generated by the model are shown for each case. The Materials and Methods section of the main text details how the ROC curves are generated. We can see that while the ROC curves (and thus performance) are nearly identical, the three-kernel generated distribution is more representative of the qualitative features of the true histogram of the rankings found from the APD dataset. Figure S10 displays the ROC curve of the background-only motif model.
Note its similarity to the best-case motif model (Figure 6b, main text). Figure S11 shows the ROC curve generated by the QSPR model on the human dataset. Note the dissimilarity between this ROC curve and that of the QSPR model trained on the APD (Figure S9a).

Figure S5 Predicted probabilities for amino acid occurrence in the first position in a single motif class model.

Figure S6 Predicted probabilities for amino acid occurrence in the second position in a single motif class model.

Figure S7 Predicted probabilities for amino acid occurrence in the third position in a single motif class model.

Figure S8 Predicted probabilities for amino acid occurrence in the fourth position in a single motif class model.

Figure S10 The ROC curve generated by the motif model with k = 0, w = 0. TPR is true positive rate and FPR is false positive rate for predictions on withheld testing data. The high degree of accuracy with no motif identification may indicate that motifs are not important for antimicrobial activity, or that the model is not equipped to capture the nature of their importance.

(a) The ROC curve for the 3-kernel Gaussian mixture QSPR model.
(b) The ROC curve for the 6-kernel Gaussian mixture QSPR model.
(c) The fitted distribution for the number of non-polar groups score in the 3-kernel QSPR model compared with the histogram of the raw data.
(d) The fitted distribution for the number of non-polar groups score in the 6-kernel QSPR model compared with the histogram of the raw data.
Figure S9 A comparison of one descriptor's prediction histogram with two different kernel numbers in the QSPR model, and the corresponding ROC curves. TPR is true positive rate and FPR is false positive rate for predictions on withheld testing data. Though both of these kernel numbers produced similar performance (note the ROC curve similarity), the 3-kernel distribution is more indicative of the underlying properties.

Figure S11 ROC curve of the 8-kernel QSPR model trained on the human dataset.
FPR is false positive rate, and TPR is true positive rate for predictions made by this model on withheld testing data from the human dataset.

References

[1] Hyndman, R. J.; Fan, Y. Am. Stat. 1996, 50, 361–365.

Journal

Statistics, arXiv (Cornell University)

Published: Apr 17, 2018
