Tetchner et al.: Coevolution-derived contacts for protein structure prediction

The prospect of identifying contacts in protein structures purely from aligned protein sequences has lured researchers for a long time, but progress was modest until recently. Here, we review the most successful methods for identifying structural contacts from sequence, describe how these methods differ, and make an initial assessment of the overlap between the contacts predicted by alternative approaches. We then discuss the limitations of these methods and the possibilities for future development, and highlight recent applications of predicted contacts in tertiary structure prediction, in identifying the residues at the interfaces of protein-protein interactions, and in disentangling alternative conformational states. Finally, we identify the current challenges in the field of contact prediction, concentrating on the limitations imposed by available data, dependencies on the sequence alignments, and possible future developments.

Keywords: contact prediction; correlated mutation analysis; protein structure prediction.

DOI 10.1515/bams-2014-0013
Received August 11, 2014; accepted October 15, 2014

*Corresponding author: David T. Jones, Department of Computer Science, University College London, London WC1E 6BT, UK, E-mail: email@example.com
Stuart Tetchner: Department of Computer Science, University College London, London, UK
Tomasz Kosciolek: Department of Computer Science, University College London, London, UK. http://orcid.org/0000-0002-9915-7387

Introduction

The function and interactions of a protein are fundamentally tied to a unique three-dimensional (3D) structure encoded by its amino acid sequence. During the folding process, nonadjacent pairs of residues come into close proximity and form contacts that stabilize and maintain the protein fold. Depending on the number of intervening residues between interacting positions, contacts are generally categorized into short range (between 4 and 8), medium range (between 9 and 23), and long range (>23). Short-range contacts provide information relating to local structure (e.g., secondary structure features), whereas medium- and long-range contacts relate to the fold of a protein.

The overall structure of a protein is more conserved than the underlying sequence, with many diverse homologous proteins encoding the same global fold [3, 4]. Nonsynonymous mutations will alter the local physicochemical environment surrounding the mutated residue. In some cases, these alterations may be substantial and may disrupt the typical protein structure. More commonly, the overall effect on the structure will be minimal, permitting the large diversity of sequences that adopt the same fold. To maintain the overall structure, mutations can be accommodated by the concerted mutation of other residues in the local structure. These so-called "correlated mutations" etch into the sequence a record of the local interresidue interactions of a protein. By comparing homologous sequences in a multiple-sequence alignment, it is possible to identify pairs of covarying amino acids. With the knowledge that these pairings can be the result of structural accommodation, correlated mutations can provide a bridge between sequence positions and residues in close proximity within the folded structure. In theory at least, if a sufficient number of long-range contacts can be correctly inferred, it should be possible to reconstruct the native 3D structure of the protein, albeit an average structure for the family as a whole.

Extracting correlated mutations indicative of structural contacts is complicated by two predominant sources of noise. Homologous protein sequences, by definition, arise from common ancestors. Due to the relatedness of homologous sequences, there is a background level of correlation among all pairs of sites, in addition to random noise. This first effect has been termed "phylogenetic bias".
Phylogenetic bias can cause spurious observations of correlated mutations due to the underlying uneven distribution of aligned homologues (Figure 1). Methods that attempt to reduce the effect of phylogeny and the background interdependence among sites have been developed, improving the ability to identify directly interacting residues. The second issue, "correlation chaining", has proven to be more difficult to reduce. The mutations of pairs of residues may display apparent correlation without being in close proximity due to the transitive nature of correlation. In the simplest case with three residues, A, B, and C, if residues A and B are interacting and therefore coevolving (a "direct interaction") and B and C are interacting and coevolving (also a direct interaction), then there will be an apparent (although not necessarily genuine) correlation between A and C (an "indirect interaction") regardless of whether these residues interact (Figure 2). This additional correlation appears purely as a result of the two real interactions sharing residue B. It might be thought that these indirect effects can be eliminated by simply filtering out low-strength couplings, but this is insufficient, as indirect couplings may exhibit stronger covariation than those that are direct. It is only within the last 5 years that there has been significant progress in tackling the latter issue of indirect chaining effects, and as a result, there has been a huge resurgence of interest in the development of interresidue contact-prediction methods. The excitement in the field comes not just from the algorithmic breakthroughs but also from the more prosaic effects of having an abundance of protein sequence data to analyze.
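The chaining effect can be reproduced with a toy simulation. The sketch below is an illustration only and is not any of the published methods: it replaces alignment columns with three continuous Gaussian variables (an assumption made for simplicity), lets A influence B and B influence C, and then compares the ordinary correlation between A and C with their partial correlation given B. The function names `pearson`, `partial_corr`, and `simulate_chain` are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(r_xz, r_xy, r_yz):
    """Correlation of x and z after removing the linear effect of y."""
    return (r_xz - r_xy * r_yz) / math.sqrt((1 - r_xy ** 2) * (1 - r_yz ** 2))


def simulate_chain(n=5000, seed=0):
    """A drives B and B drives C; A and C never interact directly."""
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [x + rng.gauss(0, 1) for x in a]
    c = [x + rng.gauss(0, 1) for x in b]
    return a, b, c


a, b, c = simulate_chain()
r_ab, r_bc, r_ac = pearson(a, b), pearson(b, c), pearson(a, c)
r_ac_given_b = partial_corr(r_ac, r_ab, r_bc)

print(f"corr(A, B)     = {r_ab:.2f}  (direct)")
print(f"corr(B, C)     = {r_bc:.2f}  (direct)")
print(f"corr(A, C)     = {r_ac:.2f}  (indirect, via B)")
print(f"corr(A, C | B) = {r_ac_given_b:.2f}  (chaining removed)")
```

In this toy setting the A-C correlation is substantial even though A and C share no direct link, and it vanishes (up to sampling noise) once B is conditioned on. Thresholding the raw A-C correlation would not have removed it.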
With the level of revived interest in predicting residue contacts from sequence alignments, recent reviews have outlined the underlying link between selective pressure and sequence coevolution along with some of the different methods that have been developed to tackle the effects of phylogeny and covariation chaining [20, 21]. However, in all the excitement, there has been relatively little attention given to the obvious limitations of these new methods and where research focus should be directed next. Here, we discuss the different approaches taken by the most successful methods to date and their inherent limitations. We also discuss the current uses of contact predictions: de novo prediction of globular and transmembrane protein structures, prediction of the interactions and structures of protein complexes, and determination of alternative conformational states of allosteric proteins. Finally, we put recent developments into context with a discussion of possible future developments.

Figure 1. Influence of underlying phylogeny on the number of correlated mutations in an alignment. (A) Correlated mutation analysis attempts to infer structural pairings (e.g., orange and blue, purple and green) from patterns of coevolving positions in multiple-sequence alignments. (B) The simplest model assumes that all sequences have independently diverged from a single common ancestor over a long evolutionary period, detecting two correlated mutations. Depending on the true underlying phylogeny, the number of correlated mutations required to produce the observed sequences may be 1 (in case C) or 2 (in case D). Gray stars represent correlated mutations.

Figure 2. Effect of covariance chaining. (A) A toy example of a protein structure, with two residues (green and pink) coevolving with a third one (blue) to maintain a property (e.g., side-chain packing) of the protein. Observing these coevolving residues in a multiple-sequence alignment, one might infer that all three pairs are in contact. (B) In light of the atomic coordinates of the three residues, it is clear that there is no physical interaction between the pink and green amino acids.

Summary of current top-performing contact prediction methods

The concept of using covarying sequence positions to identify structural contacts has been explored since the early 1990s [10, 12, 22–25]. Although Benner and Gerloff's work on modeling the catalytic domain of protein kinases is perhaps better known for triggering a man-versus-machine debate on how best to predict protein secondary structure, its frequent mentions of using observed covariation of sites in the alignment to constrain the predicted fold are really quite prescient. However, all these early attempts to extract structural contacts from sequence were plagued with low accuracies because of limited sequence data at the time and because those methods were unable to effectively address phylogenetic and chained correlation biases.

Most early covariation methods were based on mutual information (MI), a quantity from information theory. MI measures the interdependence of two variables (e.g., positions in an alignment). It is calculated by comparing the frequency with which a pair of residues occurs together against the frequencies of each amino acid separately, which allows MI to identify putative residue-residue contacts. However, although the concept behind MI is attractively simple, this approach considers every pair of positions in isolation and therefore suffers heavily from both phylogenetic and indirect interaction effects, yielding only moderate contact prediction accuracy.
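As a concrete illustration of the MI calculation described above, the following minimal sketch computes MI between pairs of columns in a tiny made-up alignment. It is a plain textbook MI using natural logarithms, with no pseudocounts, sequence weighting, or gap handling, all of which are simplifying assumptions. The function name `mutual_information` and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """MI (in nats) between two alignment columns of equal length."""
    n = len(col_i)
    f_i = Counter(col_i)                 # single-site counts, position i
    f_j = Counter(col_j)                 # single-site counts, position j
    f_ij = Counter(zip(col_i, col_j))    # pair counts
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi


# Toy alignment: one sequence per row, four aligned positions.
alignment = ["AKLE", "AKLD", "GRLE", "GRLD", "AKLE", "GRLD"]
columns = list(zip(*alignment))

mi_01 = mutual_information(columns[0], columns[1])  # perfectly covarying
mi_02 = mutual_information(columns[0], columns[2])  # column 2 is conserved
mi_03 = mutual_information(columns[0], columns[3])  # weakly related

print(f"MI(0,1) = {mi_01:.3f}")
print(f"MI(0,2) = {mi_02:.3f}")
print(f"MI(0,3) = {mi_03:.3f}")
```

Note that each pair of columns is scored without reference to any other column, which is exactly why this kind of local statistic cannot separate direct from indirect couplings.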
The major breakthrough in the accuracy of contact prediction methods was achieved through the use of global statistical methods, that is, methods that attempt to build a model of the whole protein sequence alignment rather than analyzing pairs of sites separately. Lapedes et al. provided a theoretical framework for disentangling indirect from direct couplings using a maximum entropy approach. However, the proposed calculation required considerable computation; consequently, the approach was not widely adopted. This work also suffered from simply being ahead of its time, because it predated the massive growth of sequence data banks that occurred in the latter half of the 2000s.

The direct coupling analysis (DCA) method was the first demonstrably successful implementation of a global method, building on the theoretical but computationally intractable work of Lapedes et al. established 10 years earlier [14, 26]. DCA is based on the observation that protein sequences in multiple-sequence alignments can be represented as a 21-state Potts model (20 states representing the standard proteinogenic amino acids and 1 state representing a gap in the alignment), which can disentangle direct from indirect couplings, something that simple pairwise MI and other local statistical techniques could not do. Each Potts model represents the probability of observing an amino acid sequence:

$$P(A) = \frac{1}{Z}\exp\left(\sum_{i=1}^{N} h_i(A_i) + \sum_{1 \le i < j \le N} J_{ij}(A_i, A_j)\right) \qquad (1)$$

where $A = (A_1, A_2, \ldots, A_N)$ is an amino acid sequence of length $N$, $Z$ is a normalizing constant (known as the partition function), $h_i(A_i)$ is a single-body term, and $J_{ij}(A_i, A_j)$ is a two-body term. The two-body terms in the context of protein sequence data can be interpreted as residue-residue interactions (or couplings), whereas the single-body terms $h_i(A_i)$ account for the conservation of individual residues.
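To make the roles of the single-body terms, the two-body terms, and the partition function concrete, here is a small sketch that evaluates the exponent of the Potts model for a toy three-residue sequence and obtains Z by brute-force enumeration. The parameter values are invented for illustration, and the names `potts_log_weight`, `h`, and `J` as Python objects are hypothetical. Enumeration is feasible only because N = 3 here; the number of sequences grows as 21^N.

```python
import math
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus gap = 21 states
Q = len(ALPHABET)
N = 3  # toy sequence length


def potts_log_weight(seq, h, J):
    """Sum of single-body and two-body terms, i.e. log(Z * P(A))."""
    idx = [ALPHABET.index(ch) for ch in seq]
    total = sum(h[i][a] for i, a in enumerate(idx))
    for i in range(len(idx)):
        for j in range(i + 1, len(idx)):
            total += J[(i, j)][idx[i]][idx[j]]
    return total


# Invented parameters: all zero except one preference and one coupling.
h = [[0.0] * Q for _ in range(N)]
J = {(i, j): [[0.0] * Q for _ in range(Q)]
     for i in range(N) for j in range(i + 1, N)}
h[1][ALPHABET.index("G")] = 0.5                           # position 1 prefers G
J[(0, 2)][ALPHABET.index("K")][ALPHABET.index("E")] = 2.0  # K...E pairing
J[(0, 2)][ALPHABET.index("E")][ALPHABET.index("K")] = 2.0  # E...K pairing

# Partition function by exhaustive enumeration of all 21**3 sequences.
Z = sum(math.exp(potts_log_weight(s, h, J))
        for s in product(ALPHABET, repeat=N))

p_kge = math.exp(potts_log_weight("KGE", h, J)) / Z
p_kgk = math.exp(potts_log_weight("KGK", h, J)) / Z
print(f"P(KGE) = {p_kge:.6f}, P(KGK) = {p_kgk:.6f}")
```

Real applications face the inverse problem: the parameters must be inferred from the alignment, and Z cannot be enumerated, which is why the approximations discussed next are needed.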
Computing couplings from Potts models requires an inference method and an approximation of the normalizing constant (Z), because the partition function Z is not directly computable. In the original DCA approach, the model was inferred using a maximum-entropy principle (i.e., choosing the probability distribution that is as uniform as possible subject to the empirically observed constraints), and Z was approximated using an iterative message passing approach under the Bethe-Peierls approximation (Table 1). This approach had some significant drawbacks, namely a dependency on many auxiliary variables and a high computational cost [$O(21^2 N^4)$], which required the authors to preprocess the data using MI to limit their computations to 60 positions.

Sometime later, the DCA idea was rendered practical for longer alignments by the use of a mean-field approximation (in mfDCA and EVfold [27, 28]; see Table 1). In mfDCA/EVfold, as in the original DCA approach, a maximum entropy principle is applied to the Potts model. Then, using a mean-field approximation, the probability distribution is calculated and direct information (DI; in analogy to MI) is computed. DI is expressed as

$$DI_{ij} = \sum_{A_i, A_j} P_{ij}^{(dir)}(A_i, A_j)\,\ln\frac{P_{ij}^{(dir)}(A_i, A_j)}{f_i(A_i)\,f_j(A_j)} \qquad (2)$$

where $P_{ij}^{(dir)}$ are predicted direct interaction pair probabilities and $f_i$, $f_j$ are frequency counts of amino acids $A_i$, $A_j$ at positions $i$, $j$ in a multiple-sequence alignment. The purpose of introducing DI was to estimate the coupling strength of residue pairs and help rank all possible pairs of residues in terms of their probability of being observed.

Table 1. Outline of the different methods used to distinguish between direct and indirect couplings.
| Method | Model | Approximation | Sequence weighting | Regularization | Scoring | Ref. |
|---|---|---|---|---|---|---|
| DCA | Inverse Potts model: maximum entropy | Iterative Hamiltonian gradient descent/Bethe-Peierls approximation | Inverse neighborhood density (effective sequences) | MI pre-filtering | Direct Information (DI) | |
| EVcouplings/mfDCA | Inverse Potts model: maximum entropy | Mean-field | As in DCA at 80% identity | Implicit: real counts and pseudo-counts in correlation matrix (mfDCA) | DI | [27, 28] |
| plmDCA | Inverse Potts model: maximum likelihood | Pseudo-likelihood maximization | Effective sequences at 90% identity | L2-norm | Frobenius norm adjusted by APC | [29–31] |
| GREMLIN | Inverse Potts model: maximum likelihood | Pseudo-likelihood using conditional correlations | Effective sequences at 90% identity | L2-norm + Gaussian priors | Frobenius norm adjusted by APC | |
| PSICOV | Sparse inverse covariance estimation | Multivariate Gaussian | BLOSUM-style weighting at 62% identity | L1-norm | L1-norm adjusted by APC | |
| RLS | Regularized Least Squares inverse covariance | | k-mer-based similarity kernel | Effective sequence regularization | L1-norm adjusted by APC | |
| mdMI | Multi-dimensional Mutual Information | None | None | None | MI adjusted by APC and ZRES/ZPX | |
| gaussDCA | Continuous maximum entropy inverse Potts model | Multivariate Gaussian with Bayesian priors | As in PSICOV with preprocessing to estimate similarity threshold | Normal-inverse Wishart prior | Frobenius norm and DI adjusted by APC | |

At the same time, a parallel approach, precise structural contact prediction using sparse inverse covariance (PSICOV), was developed. Notably, PSICOV approaches the problem from a multivariate statistical viewpoint rather than a statistical physics one.
Instead of trying to compute the direct couplings in a Potts model, PSICOV simply builds upon the well-established fact that, for a system of Gaussian variables, the inverse of the covariance matrix (i.e., the precision or concentration matrix) contains information about the degree of independence between the variables considered as a complete set. In other words, it derives partial correlations for each pair of variables rather than the hybrid of direct and indirect correlations present in the observed covariance matrix. Indeed, these techniques were already well established in the gene network reconstruction field, where similar problems of inferring direct versus indirect effects occur. Rather than trying in vain to directly invert the empirical covariance matrix, PSICOV approximates the calculation of the covariance matrix inverse through the use of the graphical least absolute shrinkage and selection operator (LASSO) technique under the assumption of a multivariate Gaussian distribution [13, 36, 37].

As single domains contain very few contacts in comparison with the number of all possible contacts, this sparsity can be exploited for both computational efficiency and prediction accuracy by imposing a sparsity constraint on calculations. The requirement for solutions to be sparse is an important characteristic of the PSICOV approach and distinguishes it significantly from the mean-field approaches (e.g., mfDCA/EVfold). The PSICOV technique applies an L1-regularization parameter to the inferred precision matrix, which is iteratively computed by minimizing the objective function:

$$\sum_{i,j=1}^{d} S_{ij}\Theta_{ij} - \log\det\Theta + \rho\sum_{i,j=1}^{d}\lvert\Theta_{ij}\rvert \qquad (3)$$

where $S_{ij}$ is an empirical covariance matrix, $\Theta_{ij}$ is the inverse covariance matrix, $\rho$ is the L1-regularization parameter, and $d$ is the dimension of the covariance matrix.
The first two terms in Equation (3) can be interpreted as the negative log-likelihood of the inverse covariance matrix under the assumption that the data distribution is a multivariate Gaussian, and the last term is the L1-regularization term, which constrains the solution to be sparse. Constraining the solution to be sparse not only speeds up the calculation by allowing null rows to be skipped in the graphical LASSO procedure but also constrains the statistical model to be as simple as possible. In terms of compressed sensing, which is the general field in which all these modern covariation algorithms fit, by building the simplest model that effectively reconstructs the observed signal (the observed interresidue correlations), we greatly reduce the effects of overfitting to the noise. In terms of performance, we would expect this to result in PSICOV making a higher proportion of correct predictions near the top of the ranked list of contacts compared to non-sparsity-constrained methods; indeed, this is broadly what is seen in comparative benchmarks.

To predict residue contacts, PSICOV combines the individual amino acid terms for each pair of positions into a single norm and then attempts to reduce the effects of phylogenetic bias by applying average product correction (APC), which was originally proposed for improving MI calculations.

An alternative method for exploiting the covariance matrix inversion, broadly similar to PSICOV, was proposed by the Smale group through the use of a computationally efficient regularized least squares (RLS) approach along with a refinement to the sequence weighting scheme.

The most recent development in this area has been the use of pseudo-likelihood maximization (plm), which had already been shown to be a better approach to inferring Potts models than mean-field approaches.
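The average product correction mentioned above can be sketched in a few lines. In its usual formulation, the background expected for a pair (i, j) is the product of the mean scores of positions i and j divided by the overall mean score, and this background is subtracted from the raw score. The example below applies that formula to an invented 8-position score matrix in which position 0 has a uniformly inflated background and only the pair (2, 3) carries extra signal. The names `apc_correct` and `top_pair` and all numbers are hypothetical.

```python
def apc_correct(scores):
    """Subtract the average-product background from a symmetric score matrix."""
    n = len(scores)
    # Mean score of each position with all other positions (diagonal ignored).
    row_mean = [sum(scores[i][j] for j in range(n) if j != i) / (n - 1)
                for i in range(n)]
    overall = sum(row_mean) / n  # mean over all off-diagonal entries
    corrected = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                corrected[i][j] = scores[i][j] - row_mean[i] * row_mean[j] / overall
    return corrected


def top_pair(matrix):
    """Highest-scoring pair (i, j) with i < j."""
    n = len(matrix)
    return max(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda p: matrix[p[0]][p[1]])


# Invented scores: multiplicative background, inflated for position 0.
n = 8
background = [3.0] + [1.0] * (n - 1)
raw = [[0.0 if i == j else 0.1 * background[i] * background[j]
        for j in range(n)] for i in range(n)]
raw[2][3] += 0.15  # genuine extra covariation between positions 2 and 3
raw[3][2] += 0.15

corrected = apc_correct(raw)
print("top raw pair:      ", top_pair(raw))
print("top corrected pair:", top_pair(corrected))
```

Before correction the highest-scoring pairs all involve the noisy position 0. After correction the pair (2, 3) ranks first, which is the intended behavior of the correction in this contrived case.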
Two groups have published plm-based approaches: plmDCA/gplmDCA and GREMLIN. plmDCA/gplmDCA uses plm to infer couplings and single-body terms (in gplmDCA, separate gap energies are also introduced to the Potts model) and then uses the Frobenius norm instead of the previously popular DI [as used in DCA and mfDCA/EVfold; Equation (2)]. GREMLIN, in contrast, uses conditional correlations to cancel out the global partition function and makes the local partition functions ($Z_i$) directly computable. This formulation allows GREMLIN to easily make use of priors to account for structural features predicted from other sources (e.g., secondary structure); however, the results in the paper seem to show that these priors have a negligible impact on overall prediction accuracy.

Interestingly, MI has also seen a recent revival with the development of its global statistical version, multidimensional MI (mdMI). The idea here is simply to correct each observed MI value by the average MI computed for all remaining pairs or triplets of sites. Indeed, the approach can be generalized to take into account all higher-order terms (creating variants such as 3D_MI and 4D_MI) but only at the expense of greatly increased computation time. This can be broadly thought of as analogous to the direct computation of partial correlations, though without convenient shortcuts such as covariance matrix inversion. Even 3D_MI seems to perform on par with the latest DCA methods, but higher-order calculations suffer badly from unfavorable scaling with the dimensions of the MSA.

Some hybrid approaches have also been proposed, such as gaussDCA, which is similar to PSICOV in that it assumes an underlying multivariate Gaussian distribution while still trying to approximate a maximum entropy continuous Potts model.
It is not clear what benefit such a hybrid approach offers, however, as it arguably loses the simplicity of the PSICOV approach and still only provides results similar to the ones achieved by PSICOV itself. The key features of all the different methods are summarized in Table 1.

Do different methods predict different contacts or the same ones?

With multiple available methods capable of predicting contacts, it is interesting to see whether the different underlying approximations and assumptions produce different predicted contacts. It is evident that contact prediction accuracy is affected by the number and diversity of aligned sequences, although the details of this relationship are unknown and still under investigation. Therefore, we compared three methods based on disparate principles (PSICOV, mfDCA/EVfold, and plmDCA) using alignment files generated from a previous study for 141 Pfam domains with a highly resolved X-ray structure. Assessing each method individually, plmDCA is the most precise overall and predicts the greatest number of true contacts in both cases (Figure 3). However, looking at the predictions from each method, 20% to 30% of predicted contacts are apparently unique to each method. Of the remaining predictions, 40% to 50% are common to all three. For all cases, precision at the intersection of all methods is greater than the precision of any single method or pair of methods (equally precise in one case). These observations hold regardless of the number of contacts assessed (e.g., top-L/5 and top-L/2).

As with many other prediction problems, this complementarity between methods points toward there being a great deal of benefit from constructing a metapredictor that combines prediction methods applied to the same alignment. The PconsC method is one of the first such metapredictors, though currently based on just two combined methods (PSICOV and plmDCA).
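The kind of overlap analysis described above can be sketched as follows. The code selects the top-L/5 scored pairs from each method, intersects them, and computes precision against a set of native contacts. All scores and contacts are invented toy data, and they are unrelated to the 141-domain benchmark. The function names and the choice of a minimum sequence separation of 5 (i.e., separation greater than 4) are assumptions made for illustration.

```python
def top_contacts(scored_pairs, seq_len, frac=5, min_sep=5):
    """Top L/frac pairs (by score) with sequence separation >= min_sep."""
    eligible = [(score, pair) for pair, score in scored_pairs.items()
                if abs(pair[0] - pair[1]) >= min_sep]
    eligible.sort(reverse=True)
    return {pair for _, pair in eligible[: seq_len // frac]}


def precision(predicted, native):
    """Fraction of predicted contacts that are native contacts."""
    return len(predicted & native) / len(predicted) if predicted else 0.0


L = 20  # toy sequence length, so top-L/5 means four contacts
native = {(1, 10), (2, 15), (3, 18), (5, 12), (4, 16)}

# Invented scores for three hypothetical methods.
method_a = {(1, 10): 0.9, (2, 15): 0.8, (3, 18): 0.7, (6, 19): 0.6, (5, 12): 0.1}
method_b = {(1, 10): 0.9, (2, 15): 0.7, (5, 12): 0.6, (7, 14): 0.5, (3, 18): 0.2}
method_c = {(1, 10): 0.8, (2, 15): 0.9, (4, 16): 0.6, (0, 9): 0.5, (5, 12): 0.3}

top_a = top_contacts(method_a, L)
top_b = top_contacts(method_b, L)
top_c = top_contacts(method_c, L)
consensus = top_a & top_b & top_c

for name, preds in [("A", top_a), ("B", top_b), ("C", top_c),
                    ("A and B and C", consensus)]:
    print(f"{name}: {len(preds)} contacts, precision {precision(preds, native):.2f}")
```

In this toy data each method has some predictions that the others lack, and the three-way intersection is smaller but more precise than any single list, mirroring the qualitative pattern reported for the benchmark.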
Although the complementarity between methods can clearly be useful in a metapredictor, it does nonetheless raise some interesting questions about how such differences in predicted contacts arise. Although the technical aspects of the three methods differ, the overarching task being tackled in each case is very similar, that is, the breaking of correlation chains that interfere with the accuracy of true contact prediction. If that is all each method is doing, then it might be expected that the predictions would be a lot more similar than they are. Perhaps these methods are doing something more than simply breaking correlation chains? One possible explanation is that these methods are simply acting as generic compressed sensing algorithms. In compressed sensing, an observed signal (in this case, the observed frequencies of amino acids and amino acid pairs in a protein family) is compressed into a minimal statistical model that can reconstruct the original signal as accurately as possible. Viewed in the context of compressed sensing, it is quite reasonable that the three coevolution approaches should generate different predictions, as the underlying statistical models are mathematically distinct. These observations suggest that more work might need to be done to truly understand exactly how these new methods are achieving the excellent results that they are. Probably, the explanation will be partially based on the idea of breaking correlation chains, but there is perhaps a more complex underlying story. Once we properly understand the behaviors of these methods, we may be able to design even better algorithms.

Figure 3. Distribution of 2610 all-distance [all, sequence separation >4 residues (A)] and 2420 long-range [LR, sequence separation >23 residues (B)] correct contacts as predicted by three distinct methods (PSICOV, mfDCA/EVfold, and plmDCA) on a set of 141 Pfam domains. Overall precision values across all targets are given below the name of each method. Inside the diagram, a precision value is given for the respective set and the number of correctly predicted contacts in parentheses. EVfold refers to the original EVfold method published by Marks et al., as implemented by FreeContact. L relates to the number of amino acids in a query sequence. The depth of alignments used for these analyses ranged from 1,164 to 65,535 sequences.

Applications of coevolution methods in protein structure prediction

Intrachain contacts

Coevolving positions relate to local residue pairings; therefore, an obvious application is to use contacts for tertiary structure prediction. Thus far, two alternative approaches have been presented.

The first group of methods relies entirely on contact-derived distance constraints and refinement using simulated annealing aided by secondary structure predictions [27, 40, 41]. In this case, the contacts serve as restraints on the topology. This is a very purist approach to protein structure prediction in that 3D information is derived directly from the aligned sequences, with no need for a knowledge base of experimentally determined protein structures. As a demonstration of the power of covariation in 3D modeling, this is clearly ideal. However, in real-world applications, sticking with such limitations is probably not advisable. Impressive as it is to be able to reconstruct a protein's native 3D structure from analysis of sequence alignments, these folding algorithms still make use of knowledge-based approaches to predict secondary structure, for example. If we are happy to use some knowledge-based information to generate 3D structures from covariation analysis, then why not use even more knowledge-based information to improve on results?

The second approach, therefore, is to use predicted contacts merely as pseudo-energy terms to supplement existing knowledge-based potentials. In this case, predicted contacts from PSICOV were combined with the knowledge-based potentials employed in the FRAGFOLD fragment-assembly program. By appropriately weighting the contributions of both predicted contacts from PSICOV and the empirical potentials, gaps in the predicted contact map can essentially be filled. Such hybrid approaches will clearly be the only way of handling larger protein domains where none of the covariation approaches are able to infer enough long-range contacts on their own.

Even where contacts have been accurately predicted, there is still no guarantee that a viable 3D protein structure can be calculated, as conformational sampling problems may prevent the native fold from being found. Fragment-assembly methods, such as FRAGFOLD, are less susceptible to incorrect predicted contacts, as they supplement contact prediction inaccuracies with knowledge-based potentials to find a correct structure. However, fragment-assembly approaches to simulating protein folding struggle with complex protein folds, especially β-rich protein folds with high contact order (e.g., β-sandwich/immunoglobulin-like folds that incorporate long crossover connections between separate sheets). In these cases, fragments inserted into one part of the chain can completely disrupt the hydrogen bonding of another, very distant part of the structure. As a result, the conformational space of a complex fold may not be sampled sufficiently, or the search process may simply end up locked in a local minimum. An alternative to fragment assembly is to use distance geometry to generate an approximate starting structure coupled with molecular dynamics to sample the conformational space around the starting model. The EVfold method, for example, uses the Crystallography and NMR System modeling software [44, 45] to generate models.
Although in theory better able to deal with complex folds of high contact order, distance geometry methods suffer badly in situations where the distance restraints are sparse and noisy, and they are also limited in the lengths of protein chain that can be handled successfully.

To try to get a handle on these issues, several studies have examined the ideal case of using true rather than predicted contacts, and some of the requirements necessary for accurately modeling small proteins have been determined [46, 47]. Although large numbers of contacts obviously constrain greater portions of the protein chain, even surprisingly sparse sets of contacts can be sufficient for determining the overall protein fold. Kim et al. demonstrated that a knowledge of just 8% of all native contacts was sufficient to retrieve the correct fold, provided they are broadly distributed along the chain. However, in a more realistic case where predicted contacts are less uniformly distributed along the chain, a greater number of contacts is required for accurate modeling. The authors also demonstrated that the ability to correctly fold a protein is highly sensitive to noise introduced by incorrect contacts. For predicting the structures of large and complex protein folds, therefore, even the latest covariation methods may not provide sufficiently accurate restraints.

However, contact prediction is not just limited to de novo protein modeling. Limited numbers of (true) contacts allow the refinement of initial models generated by template-based modeling. Correctly identified contacts also have the potential to aid in model quality assessment when the decoys have been generated starting from distantly related templates. Covariation techniques can also provide information about other aspects of structure prediction. Using contacts calculated from PSICOV, methods have also been developed to predict protein topology, β-sheet topology, and even domain boundaries.
In summary, the ability to recall even a small number of correct native contacts greatly reduces the number of possible conformations that the native protein chain might adopt. All these methods, however, highlight the importance of high accuracy in predicted contacts. It is clearly much more important to predict a small subset of contacts at high accuracy than to predict a much larger number of contacts at lower accuracy, even though, in the latter case, the absolute number of correctly predicted contacts is higher. As is often the case, quality ultimately wins out over quantity.

Protein-protein interactions

Proteins perform their functions by interacting with partners through the formation of protein complexes. However, experimentally determining the structure of protein complexes is more challenging than it is for monomeric proteins; consequently, complexes are underrepresented in the Protein Data Bank. If the structures of the two interacting proteins are known, protein docking can be performed. Protein docking samples multiple conformations (or poses) in an attempt to determine the one most likely to represent the structure of the complex through the use of various energy terms. Coevolving residue pairs are also present at the interfaces of interacting proteins, similar to those found within single proteins, where they maintain a favorable binding interface [8, 57, 58]. Identification of interacting residues across the protein interfaces can be used to limit the search space in docking simulations. Indeed, one study demonstrates that just four correct interprotein contacts can be sufficient to correctly dock unbound proteins with a success rate above 80%.

As it happens, the problem of predicting protein-protein interactions was the subject of the original DCA paper.
In this case, DCA predicted coevolving residues at the interfaces of the histidine kinase-response regulator (HK-RR) complexes found in the bacterial two-component regulatory system. These proteins were selected due to the high number of homologous variants found within the bacterial two-component regulatory system, providing a sufficiently large alignment for use with the method (with more than 2500 HK-RR sequence pairs found in approximately 300 bacterial genomes). Using coevolution-based contacts to predict the effect of mutagenesis studies on the binding of two proteins could also potentially help guide the design of novel protein interactions.
More recently, groups have focused on predicting the interaction of proteins that are not as ubiquitous as the HK-RR interaction [57, 62]. By pairing and concatenating interacting sequences in a "paired" alignment and treating the two interacting protein chains as an extended single sequence, covariation methods are capable of identifying coevolving residue pairs between two proteins. To create these paired alignments, the structure of the operon has been exploited to avoid the incorporation of paralogues. However, the reliance on the operon structure to identify cognate pairs greatly limits the scope of these techniques, as only a small percentage of interacting proteins are organized in this manner. Future development of novel approaches to efficiently pair interacting proteins without relying on known operon structure or proxy information such as chromosomal adjacency will greatly increase the scope of these analyses.
Multiple conformational states
Flexible regions of protein structures sometimes exhibit clusters of highly coevolving residues that maintain the ability of the protein to undergo functional conformational changes. In structures that have two distinct biologically relevant conformations (e.g., the "open" and "closed" forms of enzymes and binding proteins), predicted contacts can contain signals relating to these states [40, 65, 66]. Provided the structure of one conformer is known, the nonmatching predicted contacts may be able to provide insight into the functional movement of the protein chain, guiding the modeling of its alternate conformation. However, without prior knowledge of the mobility and structure of a protein, this complicates de novo modeling. Where nothing is known about multiple conformers, any contact maps that are generated can essentially comprise overlapping contact information from the unknown multiple conformers. Using the contact data to generate a single model for the protein can therefore fail because the contact data arises from multiple incompatible structures. How to recognize this situation and interpret the contact data is clearly an important subject for future research.
Current challenges in contact predictions
There has clearly been a great deal of progress made in deducing coevolutionary signals from aligned families of protein sequences. Nevertheless, some of the most significant developments have only arisen in the past 5 or so years, so the field is relatively young. Unsurprisingly, therefore, current methods still have a number of limitations that will require further exploration and subsequent technological development if they are to become a staple tool in protein modelling.
Limitations of available data
All current covariation methods rely on having large and diverse alignments to reliably identify the covariation events indicative of residues in close proximity. The methods display a strong correlation between the precision of predicted contacts and the number of sequences available in the alignment [13, 32]. As a rule of thumb, most methods recommend a minimum of 1000 nonredundant sequences in a query alignment to achieve reasonable results. However, although it is clear that the latest generation of methods can accurately produce high-quality contacts from large alignments (from the original papers as well as Figure 3), it is not clear what the true lower bound for these methods is. Determining the lowest practical limit for the use of these methods is important, as many of the largest protein families typically have a structure available for template-based modeling [21, 32], and there is clearly more interest in applying covariation analysis to small- and intermediate-sized families. Proteins from higher organisms are a particular problem in terms of available data, as they suffer from there being far fewer sequenced genomes from different species than for bacterial proteins. The majority of successes in coevolution-based protein structure prediction have come from proteins that share common ancestry with bacterial proteins, and the comparative lack of available eukaryotic sequences limits the ability to apply covariation methods to families within higher organisms.
Although some recent covariation methods have claimed improved performance on smaller families, the evidence is far from incontrovertible. The main problem here is that these claims are often based on average precision scores calculated over a large benchmark set. In making use of these methods, we are really only interested in predictions that achieve very high precision. For example, let us suppose two methods have been applied to a small family of sequences, and it is found that Method A has a correct contact precision of 0.2 (i.e., 20% of predicted contacts agree with the native structure) and Method B has a precision of 0.4. Obviously, Method B is statistically superior to Method A in this case, but the true conclusion is that both methods are effectively useless, because both precisions are too low to be useful in 3D modelling.
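As a minimal sketch of how the precision values in this example are obtained, assume that predicted and native contacts are each given as pairs of residue indices. The function name `contact_precision` and the toy contact lists are hypothetical illustrations and are not taken from any published tool or data set.

```python
from typing import Iterable, Set, Tuple

Contact = Tuple[int, int]


def _normalise(pair: Contact) -> Contact:
    """Order a residue pair so that (i, j) and (j, i) compare equal."""
    i, j = pair
    return (i, j) if i <= j else (j, i)


def contact_precision(predicted: Iterable[Contact],
                      native: Iterable[Contact]) -> float:
    """Fraction of predicted residue pairs that are also native contacts."""
    predicted_set: Set[Contact] = {_normalise(p) for p in predicted}
    native_set: Set[Contact] = {_normalise(p) for p in native}
    if not predicted_set:
        return 0.0  # no predictions: define precision as zero
    return len(predicted_set & native_set) / len(predicted_set)


# Hypothetical toy data: five native contacts, five predictions per method.
native = [(1, 20), (2, 19), (3, 18), (4, 17), (5, 16)]
method_a = [(1, 20), (7, 30), (8, 31), (9, 32), (10, 33)]  # 1 of 5 correct
method_b = [(20, 1), (2, 19), (7, 30), (8, 31), (9, 32)]   # 2 of 5 correct

precision_a = contact_precision(method_a, native)
precision_b = contact_precision(method_b, native)

print(f"Method A precision: {precision_a:.1f}")  # Method A precision: 0.2
print(f"Method B precision: {precision_b:.1f}")  # Method B precision: 0.4
```

In this toy case, Method B doubles the precision of Method A, yet three of its five predicted contacts are still wrong, which illustrates why a relative improvement at low precision is of little practical use for modelling.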
What we really need is a method that is capable of calculating contact maps from 100 homologous sequences that are as accurate as those computed from 1000 sequences. Currently, we are a long way from achieving this, and it may indeed be an impossible task due to the systematic biases that arise from missing data. One avenue for improvement that could expand the applicability of methods to smaller families is to incorporate more sophisticated phylogenetic methods in both sequence weighting and sampling. At present, every method employs some kind of sequence weighting scheme, but generally these are only simplistic methods such as BLOSUM weighting. To extend the reach of coevolution to smaller families, it is going to be necessary to build an explicit evolutionary model of the family being studied to better determine what the true interrelationships are between the selection pressures at each pair of sites.
Of course, one solution to the data problem is simply to wait until a particular target sequence family has grown to a suitable size through the normal growth of sequence data bank size. However, the fact that a protein family is currently too small probably indicates that the family occupies a rare niche or has recently evolved. Therefore, waiting around until the right genomes happen to be sequenced could take a very long time indeed. Possibly, as the per-base cost of sequencing becomes even lower, it will be possible to rapidly target particular genomes with a specific aim of expanding sequence families, but there still remains the difficulty of identifying and isolating the appropriate organisms to allow their genomes to be sequenced.
Dependence on large and accurate alignments
The quality of an input alignment underpins the generated results, as this is the sole source of information that coevolution-based methods exploit. Current contact prediction methods rely on using profile hidden Markov models (HMMs) to generate alignments, either obtaining precomputed results for a protein family from Pfam or generating their own with HHblits or jackHMMer. Recently, a new type of method employing Markov random fields (MRFs) was shown to achieve superior alignment accuracy and better detection of homologous sequences than HMM-based methods, although MRF-based alignments have yet to be used for contact prediction. However, with the ability to detect very divergent sequences, it is difficult to balance between stringent selection criteria that further reduce the limited sequences available and being more lenient, at the risk of adding noise, in the form of nonhomologous sequences (e.g., by homologous overextension). Typically, the detection of suitable homologous sequences is performed using simple coverage and E-value criteria. Refining the protocols for sequence selection and subsequent alignment generation is likely to hugely benefit contact prediction in the future.
Recently, the problem of refining the input alignments has been tackled with a belt-and-braces machine learning approach. Rather than trying to build one single accurate alignment, these authors used eight different alignments (e.g., from different HHblits and jackHMMer E-value thresholds) for a single protein sequence and combined these with the predictions from PSICOV and plmDCA by means of a random forest classifier. Although this approach is apparently effective, we can hope that further developments in alignment methodology will render such strategies obsolete.
Conclusions
There has been recent progress in predicting contacts from protein sequence data alone, stemming from the DCA paper in 2009, which reawakened the field to the ideas first put forward by Lapedes et al. in 1999. The original DCA method was practically limited to analyzing couplings within 60 preselected sequence positions, with a run-time of 4 days. With simplified statistical models, better algorithms, and use of parallel computing hardware, the latest generation of tools can calculate more precise couplings for proteins containing many hundreds of residues in just minutes [13, 35, 38] or even seconds. With the latest advances in accuracy and calculation speed of the newest generation of methods, limiting factors for producing contacts in this way have shifted toward having sufficiently large, high-quality sequence alignments. With more available sequences, additional protein families will be candidates for analysis using these methods, allowing for further insight into evolutionarily constrained residues that may not be observable through existing experimental techniques. We expect that, in the future, analyzing covarying residue pairs within alignments will be as routine as analyzing ordinary sequence conservation is today.
The methods used to apply covariation-derived contacts to structure prediction are still in their infancy, yet proof-of-principle studies have demonstrated that they can guide the modeling of both globular and transmembrane protein structures [27, 40, 42, 75, 76]. The improvements in modeling for transmembrane proteins, which remain very difficult to study by experimental methods, are particularly remarkable. It can also be envisaged that, in the future, the synergistic use of covariation methods and biophysical techniques will play a key role in modeling challenging structural targets such as protein complexes and large macromolecules. Correlated mutation-based methods can also help highlight the diverse range of pressures on the protein sequence and may even give us new insight into how particular folds evolve. Coevolution carefully balances the structure, interactions, and even movements a chain makes while sampling alternative viable combinations of amino acids, permitting long-term change [77, 78].
With such diverse information contained purely in the sequences of the protein chain, and new methods now capable of reliably extracting it, it is truly an exciting time in both studying protein evolution and predicting protein structure from sequence.
Acknowledgments: The authors thank Domenico Cozzetto for useful discussions. ST and TK were supported by the Wellcome Trust (studentship numbers 096622/Z/11/Z and 096624/Z/11/Z, respectively).
Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
Research funding: None declared.
Employment or leadership: None declared.
Honorarium: None declared.
Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication.
6. Williams SG, Lovell SC. The effect of sequence evolution on protein structural divergence. Mol Biol Evol 2009;26:1055–65.
7. Poon A, Chao L. The rate of compensatory mutation in the DNA bacteriophage φX174. Genetics 2005;170:989–99.
8. Goh C-S, Bogan AA, Joachimiak M, Walther D, Cohen FE. Coevolution of proteins with their interaction partners. J Mol Biol 2000;299:283–93.
9. Altschuh D, Lesk A, Bloomer A, Klug A. Correlation of co-ordinated amino acid substitutions with function in viruses related to tobacco mosaic virus. J Mol Biol 1987;193:693–707.
10. Göbel U, Sander C, Schneider R, Valencia A. Correlated mutations and residue contacts in proteins. Prot Struct Funct Bioinf 1994;18:309–17.
11. Vernet T, Tessier DC, Khouri HE, Altschuh D. Correlation of co-ordinated amino acid changes at the two-domain interface of cysteine proteases with protein stability. J Mol Biol 1992;224:501–9.
12. Neher E. How frequent are correlated changes in families of protein sequences? Proc Natl Acad Sci 1994;91:98–102.
13. Jones DT, Buchan DW, Cozzetto D, Pontil M. PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics 2012;28:184–90.
14. Lapedes AS, Giraud BG, Liu L, Stormo GD. Correlated mutations in models of protein sequences: phylogenetic and structural effects. Lecture Notes Monogr Ser 1999:236–56.
15. Pollock D, Taylor W. Effectiveness of correlation analysis in identifying protein residues undergoing correlated evolution. Protein Eng 1997;10:647–57.
16. Dunn SD, Wahl LM, Gloor GB. Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics 2008;24:333–40.
17. Little DY, Chen L. Identification of coevolving residues and coevolution potentials emphasizing structure, bond formation and catalytic coordination in protein evolution. PLoS One 2009;4:e4762.
18. Gloor GB, Tyagi G, Abrassart DM, Kingston AJ, Fernandes AD, Dunn SD, et al. Functionally compensating coevolving positions are neither homoplasic nor conserved in clades. Mol Biol Evol 2010;27:1181–91.
19. Giraud B, Heumann JM, Lapedes AS. Superadditive correlation. Phys Rev E 1999;59:4983.
20. de Juan D, Pazos F, Valencia A. Emerging methods in protein coevolution. Nat Rev Genet 2013;14:249–61.
21. Taylor WR, Hamilton RS, Sadowski MI. Prediction of contacts from correlated sequence substitutions. Curr Opin Struct Biol 2013;23:473–9.
22. Korber B, Farber RM, Wolpert DH, Lapedes AS. Covariation of mutations in the V3 loop of human immunodeficiency virus type 1 envelope protein: an information theoretic analysis. Proc Natl Acad Sci 1993;90:7176–80.
23. Shindyalov I, Kolchanov N, Sander C. Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Protein Eng 1994;7:349–58.
24. Taylor WR, Hatrick K. Compensating changes in protein multiple sequence alignments. Protein Eng 1994;7:341–8.
25. Benner SA, Gerloff D. Patterns of divergence in homologous proteins as indicators of secondary and tertiary structure: a prediction of the structure of the catalytic domain of protein kinases. Adv Enzyme Regul 1991;31:121–81.
Bio-Algorithms and Med-Systems – de Gruyter
Published: Dec 19, 2014