Protein intrachain contact prediction with most interacting residues (MIR)

The transition state ensemble during the folding process of globular proteins occurs when a sufficient number of intrachain contacts are formed, mainly, but not exclusively, due to hydrophobic interactions. These contacts are related to the folding nucleus, and they contribute to the stability of the native structure, although they may disappear after the energetic barrier of transition states has been passed. A number of structure and sequence analyses, as well as protein engineering studies, have shown that the signature of the folding nucleus is surprisingly present in the native three-dimensional structure, in the form of closed loops, and also in the early folding events. These findings support the idea that the residues of the folding nucleus become buried in the very first folding events, therefore helping the formation of closed loops that act as anchor structures, speed up the process, and overcome the Levinthal paradox. We present here a review of an algorithm intended to simulate in a discrete space the early steps of the folding process. It is based on a Monte Carlo simulation where perturbations, or moves, are randomly applied to residues within a sequence. In contrast with many technically similar approaches, this model does not intend to fold the protein but to calculate the number of non-covalent neighbors of each residue during the early steps of the folding process. Amino acids along the sequence are categorized as most interacting residues (MIRs) or least interacting residues. The MIR method can be applied under a variety of circumstances.
In the cases tested thus far, MIR has successfully identified the exact residue whose mutation causes a switch in conformation. This is consistent with the idea that MIR identifies residues that are important in the folding process. Most MIR positions correspond to hydrophobic residues; correspondingly, MIRs have zero or very low accessible surface area. Alongside the review of the MIR method, we present a new postprocessing method called smoothed MIR (SMIR), which refines the original MIR method by exploiting the knowledge of residue hydrophobicity. We review known results and present new ones, focusing on the ability of MIR to predict structural changes and secondary structure, and on the improved precision obtained with the SMIR method.

Keywords: globular proteins; hydrophobic core; prediction method; protein folding nucleus; protein structure.

DOI 10.1515/bams-2014-0015
Received September 11, 2014; accepted October 16, 2014

*Corresponding author: Zoé Lacroix, Scientific Data Management Laboratory, School of Electrical, Computer and Energy Engineering (ECEE), Arizona State University, Tempe, AZ 85282-5706, USA, E-mail: zoe.lacroix@asu.edu
Ruben Acuña: Scientific Data Management Laboratory, School of Electrical, Computer and Energy Engineering (ECEE), Arizona State University, Tempe, AZ 85282-5706, USA
Nikolaos Papandreou: Genetics Department, Agricultural University of Athens, Athens, Greece
Jacques Chomilier: Protein Structure Prediction group, IMPMC, Sorbonne University, UPMC, CNRS, MNHN, IRD, Paris, France; and RPBS, Paris, France

Introduction

Since the pioneering work of Anfinsen on the protein-folding problem [1, 2], large advances have occurred; however, it still remains an unresolved scientific problem. From the viewpoint of theory, a dynamical simulation is not practical because it can produce trajectories of microseconds at best, whereas typical folding kinetics are on the order of milliseconds.
This is why "toy models" in a discrete space are useful for addressing this issue. A discrete space reduces the spatial complexity and makes it easy to determine neighbors. Potential energies used in this type of simulation can either come from an analysis of the native structure of the studied chain [3], or from statistical derivation from a database of structures [4]. In previous works, we developed such a simulation on a cubic lattice, and focused on the formation of a complete globular protein [5]. It appeared that during initial folding, residues tend to locally aggregate to form fragments. The sequence boundaries of these fragments are rather well fixed, although the fragments can be displaced in three-dimensional (3D) space. Subsequently, the number of fragments regularly decreases up to the formation of a single globule of contiguous fragments that collapse. The boundaries of two successive fragments are commonly located in loops, and almost never in regular secondary structures (RSSs). The algorithm is sensitive to the modification of the secondary structure, such as the introduction of a bulge in FK506 binding protein (FKBP) [5]. The limitation of this simulation is that it provides no information on the type of regular structure. Moreover, a simple cubic lattice only allows angles of 90° between two successive α-carbons.

Acuña et al.: Prediction of the protein folding nucleus with most interacting residues

The method has evolved with time, both technically and conceptually. From a technical point of view, a significant improvement was introduced by the use of a (2, 1, 0) lattice, based on the work of Skolnick and coworkers [6–8]. This kind of lattice provides large gains in chain flexibility and far more realistic local chain configurations. More precisely, we used a simplified version of the Skolnick model, without side chains and with a distance-independent, standard Miyazawa-Jernigan potential.
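To make the geometry of the (2, 1, 0) lattice concrete, the following minimal sketch enumerates its bond vectors and the angles that two consecutive bonds can form. It assumes that bond vectors are all signed permutations of (2, 1, 0) in units of the underlying cubic lattice and that a virtual bond spans 3.8 Å; the function and constant names are illustrative and are not taken from the MIR implementation. The enumeration also lists 0° (immediate back-folding) and 180°, and it does not attempt to reproduce which subset of angles the model actually permits.

```python
import itertools
import math


def lattice_210_vectors():
    """Return the bond vectors that are signed permutations of (2, 1, 0)."""
    vectors = set()
    for perm in itertools.permutations((2, 1, 0)):
        for signs in itertools.product((1, -1), repeat=3):
            vectors.add(tuple(s * c for s, c in zip(signs, perm)))
    return sorted(vectors)


def bond_angle(v1, v2):
    """Angle in degrees at the bead joining bond v1 (incoming) and v2 (outgoing)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1) * sum(b * b for b in v2))
    # The angle at the bead is measured between -v1 and v2.
    cos_theta = max(-1.0, min(1.0, -dot / norm))
    return math.degrees(math.acos(cos_theta))


VECTORS = lattice_210_vectors()
ANGLES = sorted({round(bond_angle(v, w), 1) for v in VECTORS for w in VECTORS})

# Assumption: one bond of squared length 5 lattice units spans 3.8 angstroms.
CA_CA_DISTANCE = 3.8
LATTICE_SPACING = CA_CA_DISTANCE / math.sqrt(5)

print(len(VECTORS))               # number of first neighbors of a node
print(ANGLES)                     # distinct angles between consecutive bonds
print(round(LATTICE_SPACING, 2))  # cubic lattice spacing in angstroms
```

Under these assumptions each node has 24 first neighbors, against 6 on the simple cubic lattice, and the angle between consecutive bonds is no longer restricted to right angles.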
The main evolution of our approach was guided by the observation that all simulated polypeptide chains tended to form a succession of local fragments, indicating that some residues act as anchors and "attract" more spatial neighbors in the early steps of the calculation. The identification of these residues and their comparison with critical positions, provided by other sequence- or structure-based approaches, was a meaningful research path, as it could result in a reliable ab initio identification of some positions that might be included in the folding nucleus; however, we do not claim to define the nucleus exhaustively. For this latter purpose, applying the method to a set of aligned sequences of similar fold seems promising for detecting some aspects of the conservation of the folding nucleus, where it is conserved. As we are interested in the early steps, we will focus on a set of residues able to interact with each other. This is related to the concept of "contact order," which, in essence, quantifies the mean length of the sequence between two interacting residues. This approach has been the object of many publications over the years, and various names have been given to the sequences joining the contacting residues: foldons, closed loops, building blocks, tightened end fragments (TEFs), and so on. Therefore, the newly developed most interacting residues (MIR) algorithm periodically records the number of non-covalent neighbors (NCNs) [9]. The time mean value of the NCNs is calculated at the end of the simulation, and a diagram is produced as an output of the algorithm. This distribution presents maxima that have been called MIRs, and minima known as LIRs (for least interacting residues). In the few cases tested thus far, the MIRs correspond to some extent to the residues involved in the protein-folding core [10]. They have also been compared to the energetic stability of positions when subjected to a single mutation.
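The NCN bookkeeping described above can be sketched as follows. This is a minimal illustration, not the MIR implementation: it assumes that snapshots are lists of α-carbon coordinates in Å, that only sequence-adjacent residues count as covalently bonded, and that the classification uses the 7.2 Å contact cutoff and the MIR and LIR thresholds quoted in Materials and methods (treating the MIR threshold as "at least 6" is an assumption). All names are illustrative.

```python
import math

CONTACT_CUTOFF = 7.2  # angstroms, cutoff quoted in Materials and methods
MIR_THRESHOLD = 6.0   # assumed inclusive: mean NCN >= 6 gives an MIR
LIR_THRESHOLD = 2.0   # mean NCN < 2 gives an LIR


def count_ncn(coords, cutoff=CONTACT_CUTOFF):
    """Count, for each residue, the non-covalent neighbors within the cutoff."""
    n = len(coords)
    counts = [0] * n
    for i in range(n):
        # Start at i + 2 so that the covalently bonded neighbor i + 1 is skipped.
        for j in range(i + 2, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                counts[i] += 1
                counts[j] += 1
    return counts


def mean_ncn(snapshots):
    """Average the per-residue NCN over all recorded snapshots."""
    totals = [0] * len(snapshots[0])
    for coords in snapshots:
        for k, c in enumerate(count_ncn(coords)):
            totals[k] += c
    return [t / len(snapshots) for t in totals]


def classify(mean_counts):
    """Label each position as MIR, LIR, or neither ('-')."""
    labels = []
    for value in mean_counts:
        if value >= MIR_THRESHOLD:
            labels.append("MIR")
        elif value < LIR_THRESHOLD:
            labels.append("LIR")
        else:
            labels.append("-")
    return labels


# Toy example: four beads on a square of side 3.8 angstroms.
square = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
print(count_ncn(square))
print(classify(mean_ncn([square])))
```

In the real simulation the snapshots would come from the Monte Carlo trajectory, and the average would additionally run over all initial conformations.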
The comparison between MIRs and mutational stability is enabled by a web server offering several algorithms that calculate the variation of the free energy due to a mutation. There is a statistical correspondence between MIRs and stable positions [11]. The MIR algorithm is now available online as part of the Structural Prediction for pRotein fOlding UTility System (SPROUTS) [12], hosted by Ressource Parisienne en Bioinformatique Structurale (RPBS) [13]. It may be accessed at http://sprouts.rpbs.univ-paris-diderot.fr/. The SPROUTS web server comprises a database and a submission server integrating various resources in order to predict the impact of mutations on protein stability. It contains a web interface for interacting with the results of the MIR simulation and a submission server that automatically analyzes sequence data with MIR.

This article aims to address the following questions: which residues are likely to play a critical role in the early steps of the folding of the protein, and which residues, when mutated, may influence its structure? The article is organized as follows. We first describe the MIR method and introduce the SMIR. We then report on known findings about the MIR method and present new results, in particular on the ability of MIR to predict structural changes and secondary structure, and on the improved precision obtained with the smoothed MIR (SMIR) method. We conclude with current research projects expected to benefit greatly from the use of MIR.

Materials and methods

We now give an outline of our method for simulating the early steps of protein folding. First, a random initial conformation is produced for an α-carbon-only simplified representation of the polypeptide chain. Each α-carbon is placed at random on the nodes of a lattice as illustrated in Figure 1. An extension of a cubic lattice, namely (2, 1, 0), originally proposed by Skolnick and coworkers [6–8], is used.
Compared with the simple cubic lattice, the (2, 1, 0) lattice allows a wider range of backbone angles, from 64° to 143°. The number of first neighbors is also higher (24 instead of 6). Side chains are discarded in our implementation of the method. Folding is simulated by randomly selecting one amino acid and submitting it to one of two available moves: an end move for the N- or C-terminal positions, or a corner move otherwise. Crankshaft moves (two neighbor residues displaced together to a pair of empty nodes) are no longer permitted with the (2, 1, 0) lattice. We then perform potential perturbations, or moves, upon randomly selected residues within the sequence. A new position can be used when it is not occupied by another residue. The new conformation energy is computed by use of a statistical potential of mean force, taken from the literature [4]. The Metropolis criterion is applied to accept or reject the new conformation based on its energy change. This is performed until some number of simulation steps has been taken, based on the length of the sequence.

Figure 1. Details of a (2, 1, 0) move, with respect to the underlying cubic lattice (lengths labeled in the figure: 3.8 Å and 1.7 Å).

A limit in the length of the sequences submitted to the simulation has been fixed at 500 amino acids. One reason is to limit the CPU time, and the second one is that folding makes sense as long as one deals with domains, not complete chains. Thus far, there is no domain separation of the query sequences, and this limit corresponds to the longest known single domain. The process is stopped when roughly 10^5 to 10^6 Monte Carlo steps have been completed, depending on the length of the input protein sequence. The required number of steps has been calibrated on a few sequences, by trying to assess the minimum number of steps beyond which the obtained MIR set does not fluctuate. The process is repeated 100 times, starting from 100 initial conformations.

Intermediate models are analyzed to determine, for each residue, how many other residues are close enough to interact while not being covalently bonded. This gives the number of NCNs at each residue, which is recorded so that the data can be averaged later. Two non-covalently bound residues are considered to interact (i.e., be NCNs) if the distance between their respective α-carbons does not exceed 7.2 Å. The mean NCN is calculated at the end of the process and averaged for all the initial conformations. The distribution of NCNs along the sequence presents maxima and minima. The maxima are called MIRs. A residue is accepted as an MIR if the NCN at this position is ≥6. MIRs are mainly hydrophobic, that is, one of the seven amino acids FILMVWY; this set is used because it is more frequently found in RSSs than in loops [14]. If the NCN at one position is <2, this position is then assigned as a LIR. An optional smoothing process based on the Pascal triangle may then be applied to the resulting data to refine the MIRs identified. The concepts and implementation of the MIR algorithm are described in detail in the Appendix.

Postprocessing

One limitation of the MIR algorithm is the large number of residues identified as MIRs. The SMIR smoothing method is aimed at selecting, with a neighborhood analysis, the most interacting hydrophobic residue. Our method uses the Pascal triangle and adjusts the maxima that are identified in the smoothed graph to nearby (within three residues) hydrophobic positions, if any, based on our accepted precision of the algorithm. We continue to identify the minima with a threshold; however, there is no constraint on the nature of the LIR. Our smoothing algorithm has four parts: Pascal-triangle-based smoothing, identifying the maxima in the smoothed data, identifying the minima in the smoothed data, and, lastly, moving the identified maxima to the nearest hydrophobic residue, if any. We classify the amino acids FILMVWY as hydrophobic in the following discussion. The algorithm introduces the notion of smoothed NCN, from which SMIR and smoothed LIR (SLIR) are deduced.

The input to the smoothing process is the average number of NCNs as determined by the MIR simulation. To provide greater fidelity in the smoothing, the raw NCN count is multiplied by a factor of 10; it is later renormalized to match the initial values. Our technique uses the 10th row of Pascal's triangle. The numbers are used to calculate a weighted average (which we call Pascal value, or PV) at each residue by centering on it. As this technique requires 10 surrounding values, we first smooth the first and last five residues by using only the Pascal numbers that correspond to valid residues. We then label the maxima in the smoothed NCN data. If a residue has a higher PV than each of its adjacent neighbors and has a PV of at least 4.95 (NCN), we label it as a maximum. The next step is to label the minima in the smoothed NCN data. If a residue has a lower PV than each of its adjacent neighbors and has a PV of at most 3.04 (NCN), we label it as a minimum. Once the positions with the maxima have been labeled, the labeled positions may be moved to nearby hydrophobic residues. The nearest hydrophobic residue up to three positions away will be used. If there are multiple choices, the residue possessing the highest PV will be selected. If none of the neighbor residues is hydrophobic, the maximum will be lost. The maxima may now be considered SMIRs. The minima are not reallocated and so are taken to be the SLIRs. At this point, postprocessing is complete.

Results

In this section, we report the results obtained by running MIR in a variety of circumstances. The standalone MIR 1.0 implementation was first made available online as a part of the RPBS server in 2005 [13]. The SPROUTS submission server [12] uses the MIR2.2 implementation (in Fortran) for server-side simulation, with a browser client-side extension for the SMIR3.10 algorithm, and provides a JavaScript front end for interactive analysis and smoothing [15]. It may be accessed at http://sprouts.rpbs.univ-paris-diderot.fr/mir.html. The basic interface is shown in Figure 2.

Figure 2. SPROUTS online interface for MIR. Background is the submission window on the server. In front, the output of the protein of PDB code 1asu, in the smoothed option. In light blue are the NCNs; dark blue are the SMIRs; brown are the LIRs; and the orange zones are the TEFs.

When a user submits a list of Protein Data Bank (PDB) IDs (or retrieval codes, used for private data), SPROUTS will immediately return the corresponding MIR data if they have been computed already. New PDB IDs will be submitted for processing on the SPROUTS server; at completion time, the user will receive a notification message if the e-mail box was filled out at submission time (this is optional). An alternative input can be a custom sequence in FASTA format [16]. The MIR results are stored in the SPROUTS database together with general information retrieved with the PDB code, when available. Alternatively, custom sequence data may be uploaded together with a retrieval code that provides access to the submitter. By populating a database automatically upon user submission, SPROUTS enables the quick retrieval of MIR results while allowing users to grow the database of MIR results through their own PDB code submissions. Data are presented with a dynamic interface implemented in JavaScript with D3 [17] for visualization. This interface allows one to apply the smoothing algorithm dynamically to the displayed data. The interface has been primarily tested in Google Chrome 35. Firefox 21 and Safari 6 have also been tested. Our browser-based implementation allows users to retrieve the new SMIR analysis for existing proteins without the need to resubmit the entry to our submission server. By clicking on the appropriate link, a file with the results in the comma-separated values (CSV) format can be downloaded.

The following statistical analyses were performed using the live (as of June 30, 2014) set of proteins available in the SPROUTS database. Only entries corresponding to public entries in the PDB were used. This comprised 1498 chains (containing 32,655 MIRs). These analyses were performed on the unsmoothed MIR results. In future work, we plan to strengthen these results by extending these analyses to PDB IDs that are representative of the 1200 folds contained in the structural classification of proteins (SCOP) [18].

The value of applying a smoothing procedure can be seen in Figure 3, in the case of the fibronectin type III (PDB code 1ten), chosen because an extensive folding nucleus determination has been performed [19]. Smoothing reduces the concentration of MIRs in neighboring positions (8 and 10, for instance) owing to statistical fluctuations. The smoothed curve of NCN presents several maxima at positions I8, T21, Y36, I48, I59, L72, and M79, if we skip the two SMIRs at both ends of the sequence. According to Hamill et al. [19], the folding nucleus of this protein contains only four residues, namely I20, Y36, I59, and V70. If we assume an accuracy of ±1, our prediction is correct for three of these four positions, and one is shifted by two residues (V70). Nevertheless, the overprediction still needs improvement, which is partly accomplished when the stability prediction is considered, as will be shown later.

Figure 3. Raw (left) and smoothed (right) NCNs on the protein of PDB code 1ten (x axis: sequence; y axis: NCN). Dark blue bars indicate either MIR or SMIR, and dark red bars the smoothed LIR.

TEF correspondence

For a number of years, the analysis of the 3D structures of globular proteins at the subdomain level has provided evidence of protein fragments whose two ends are very close in 3D space. Their extremities are mainly located in the cores of the globules, and are also mainly occupied by conserved hydrophobic residues [20–22]. They are called either closed loops or TEFs. Because of this property of burying, they are presented on the graphs of the 2D outputs of SPROUTS. In Figure 2, the length of the sequence covered by the TEF is in orange, and statistically their limits are close to MIRs [21].

Stability

As MIRs should correspond to residues that are important in the transition states of protein folding, mutations at those positions should also affect the stability of the protein. Protein stability can be examined in terms of ΔΔG values, representing the change in free energy difference between a wild type and a mutant. One feature of SPROUTS is a repository of ΔΔG values predicted by a suite of eight tools for every possible amino acid substitution at every position of a protein [12]. SPROUTS includes the ability to graph this in terms of a so-called impact count. For each predictive tool, the impact count is equal to the number of mutations at each position that give a mutant with a ΔΔG >2.00 kcal/mol or <−2.00 kcal/mol; it therefore represents the number of mutations with a non-neutral ΔΔG. As a score of 1 is attributed to any mutation that modifies the ΔΔG beyond this threshold of ±2 kcal/mol, the scale goes from 0 to 19, the maximum being reached if all possible mutations at a given position produce a strong destabilization or stabilization. This threshold has been proposed because it is estimated as the best accuracy, based on an evaluation of prediction tools [23] with calculations of the free energy. Positions of a protein with high impact count are positions that are sensitive to substitutions, either stabilizing or destabilizing. An example is shown in Figure 4, with an immunoglobulin fold from the PDB entry 1ten. Not all available tools can be run for all the PDB codes, for data integrity reasons.

Figure 4. Stability of the protein of PDB code 1ten. Left: five algorithms predicting the ΔΔG for a point mutation: FoldX [24], MUpro [25], DFIRE [26], and versions 2 and 3 of I-mutant [27] (panel title: Free energy analysis of stability impact for each residue of 1TEN (A); x axis: sequence; y axis: impact count). Right: SMIR distribution.

As the impact count shows positions that have a non-neutral ΔΔG under mutation, we expect that they have some correspondence to MIRs and SMIRs. In Figure 4, one can see the disagreement between the five tools used on this example. Nevertheless, most of them are in agreement for the positions where the impact count is maximum. This means that, although the prediction of stability is not highly accurate, there is a reasonable agreement between the algorithms on the positions where mutation can affect the structure, either toward stability or toward instability. One can see that the maxima in the smoothed distribution of the NCN are in rather qualitative agreement with the maxima in the impact count distribution. In particular, the highest values are compatible with the maxima of the smoothed NCN curves at the following positions: T21, Y36, I48, I59, and L72. Therefore, compared with the results presented with MIR calculation alone (see Figure 3), the false positives at positions 8 and 79 disappear when they are crossed with the stability prediction.
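The impact count, and the idea of crossing it with SMIR positions, can be illustrated with a short sketch. It assumes that predicted ΔΔG values for one tool are available as a mapping from position to per-substitution values in kcal/mol; the data layout, the function names, the `min_count` and `tolerance` parameters, and the numbers in the example are all invented for illustration and do not come from SPROUTS. In particular, the crossing rule is only one possible formalization of the qualitative comparison made in the text.

```python
THRESHOLD = 2.0  # kcal/mol, the +/-2 threshold used for the impact count


def impact_count(ddg_by_position, threshold=THRESHOLD):
    """Count, per position, the substitutions whose |ddG| exceeds the threshold."""
    return {
        position: sum(1 for ddg in substitutions.values() if abs(ddg) > threshold)
        for position, substitutions in ddg_by_position.items()
    }


def cross_with_smir(smir_positions, counts, min_count, tolerance=1):
    """Keep the SMIRs lying within `tolerance` residues of a sensitive position.

    A position is called sensitive here when its impact count reaches
    `min_count`; both parameters are hypothetical tuning knobs.
    """
    sensitive = {p for p, c in counts.items() if c >= min_count}
    return [
        p for p in smir_positions
        if any(abs(p - q) <= tolerance for q in sensitive)
    ]


# Invented ddG values for three positions and three substitutions each.
example_ddg = {
    8: {"A": 0.3, "G": 1.9, "V": -0.5},
    21: {"A": 2.5, "G": 3.1, "V": 0.4},
    36: {"A": -2.4, "G": 2.2, "V": 2.0},
}
counts = impact_count(example_ddg)
print(counts)
print(cross_with_smir([8, 21, 36], counts, min_count=2, tolerance=0))
```

With a complete set of 19 substitutions per position, the count ranges from 0 to 19, as described above.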
SMIRs are thus able to capture some of the physics that is at the core of the structural stability of proteins [10].

Predicting structural change

Aimed at demonstrating the minimal key residues coding for a fold, several attempts have been published that are based on the same scenario. Starting from two short sequences of different folds, a set of mutations is performed on each one with the goal of obtaining the closest sequences possible while preserving the respective folds. Orban's group has demonstrated how a single point mutation could have a transformative impact on protein fold [28–30]. They engineered two proteins that are available in the PDB under the codes 2LHC (GA98) and 2LHD (GB98). These proteins contain 58 residues that give a 98% sequence similarity. 2LHC contains three α-helices, whereas 2LHD contains four β-strands and one α-helix, as illustrated in Figure 5. A single mutation at the 45th residue (leucine toward tyrosine) changes the folded conformation. As seen in the 2LHC SMIR curve presented in Figure 6, the 45th residue is initially identified as a SMIR, when it is a leucine. In the 2LHD SMIR curve in Figure 6, we see that the SMIR at that position has disappeared when it is mutated to another hydrophobic residue. Although the difference in the smoothed NCN distributions is small, this mutation is locally sufficient to perturb the NCN curve such that the maximum vanishes with the tyrosine instead of the leucine.

Figure 5. PDB structures for 2LHC (GA98) (left) and 2LHD (GB98) (right). The colored dot indicates the position of the 45th residue in both sequences.

Figure 6. Smoothed NCN distributions for 2LHC (GA98) (left) and for 2LHD (GB98) (right) (x axis: sequence; y axis: NCN). SMIRs are indicated by dark blue positions. The SMIR comparison analysis shows that the residue at position 45 shown as an SMIR in 2LHC (left) is no longer an SMIR in 2LHD. This corresponds to the mutation that caused a dramatic structural change as illustrated in Figure 5.

Regular secondary structure distribution

Of the 32,314 MIRs existing in the data set available with DSSP formatted [31] data, 20,573 (63.67%) were found in RSSs. Comparatively, the fraction of arbitrary residues that were found in RSSs was 51.07% in the full dataset. Of MIRs in RSSs, 11,732 (57.03%) were found in helix structures, whereas 8841 (42.97%) were found in strands. For both kinds of structures, MIRs tend to cluster at the ends. Figures 7 and 8 show how MIRs are distributed within RSSs. For clarity, the graphs omit those positions with <30 entries. Secondary structure data were extracted from the DSSP files associated with each PDB entry. We consider helices to be composed of the H, G, and I conformational states, whereas strands are E and B, according to the DSSP classification.

In Figure 7, we see a pattern at the N-terminal end of helices, where MIRs peak at the N1 and N4 positions. This seems to correspond roughly to the number of residues in a turn of an α-helix (3.6) projected on our lattice model. Surprisingly, this feature, which one may assume is linked to the periodicity of the α-helix, is not seen at the C-terminal end. One assumption is that the capping effect at this C end is stronger than at the N cap.

Figure 7. MIR distribution in helices (x axis: position, N1 to N12 then C12 to C1; y axis: occurrences). All helices of different lengths are reported in this figure. The first position at the N (C) terminus is called N1 (C1), and successive numbers are used up to the center of the helix.
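The position labeling used for these histograms can be sketched as follows. The sketch assumes a per-residue DSSP state string and 0-based MIR indices, treats any uninterrupted run of H, G, or I states as one helix and any run of E or B states as one strand, and assigns the central residue of an odd-length element to the N side; these conventions and all names are illustrative choices, not a description of the scripts used for Figures 7 and 8.

```python
from collections import Counter
from itertools import groupby

HELIX_STATES = set("HGI")
STRAND_STATES = set("EB")


def element_class(state):
    """Map a DSSP state to 'helix', 'strand', or None for everything else."""
    if state in HELIX_STATES:
        return "helix"
    if state in STRAND_STATES:
        return "strand"
    return None


def end_labels(dssp):
    """Map each residue index inside an RSS to (class, label such as 'N1' or 'C2')."""
    labels = {}
    index = 0
    for cls, run in groupby(dssp, key=element_class):
        length = len(list(run))
        if cls is not None:
            for k in range(length):
                from_n, from_c = k + 1, length - k
                # Count from the nearer terminus; ties go to the N side.
                label = f"N{from_n}" if from_n <= from_c else f"C{from_c}"
                labels[index + k] = (cls, label)
        index += length
    return labels


def mir_histogram(dssp, mir_indices, cls):
    """Histogram of end-relative labels for the MIRs found in elements of `cls`."""
    labels = end_labels(dssp)
    return Counter(
        labels[i][1] for i in mir_indices if i in labels and labels[i][0] == cls
    )


# Toy example: a five-residue helix and a four-residue strand.
toy_dssp = "--HHHHH-EEEE-"
toy_mirs = [2, 4, 5, 9, 12]
print(mir_histogram(toy_dssp, toy_mirs, "helix"))
print(mir_histogram(toy_dssp, toy_mirs, "strand"))
```

Summing such histograms over all chains gives distributions of the kind plotted in the figures, before any filtering of sparsely populated positions.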
A stronger capping effect at the C terminus would mean that the helix is slightly shrunk at this end, modifying the periodicity of the hydrophobic-hydrophilic distribution. This is coherent with the tendency to form 3₁₀ helices at the C terminus of α-helices. In the central part of the figure, one cannot obtain any information, as the mean length of an α-helix in the PDB is 12 amino acids; thus, the positions from N6 to C6 are occupied less often, which weakens the statistics.

The features in Figure 8 are less distinct with strands but seem to show that a pair of MIRs may cluster at either end of a strand. That one does not observe a succession of maxima and minima in this distribution may be attributed to the fact that completely buried strands, in contrast to strands at the border of the globule, are occupied by hydrophobic residues on both sides of the strand. Conclusions about the periodicity of the MIRs in strands are not as easy to draw as in the case of helices because the mean length of a strand in the PDB is five amino acids, corresponding to a numbering that jumps from N3 to C2.

Figure 8. MIR distribution among strands (x axis: position, N1 to N12 then C12 to C1; y axis: occurrences). All strands of different lengths are reported in this figure. The first position at the N (C) terminus is called N1 (C1), and successive numbers are used up to the center of the strand.

Accessible surface area

For this analysis, we started with the basic set of 1498 entries in the SPROUTS database and then narrowed it down to 848 entries (the final number used) based on those that could be exactly aligned with their associated PDB data. The set of proteins analyzed contained a total of 17,211 MIRs. Among them, 11,104 (64.52%) are buried if one applies a classical threshold of 25 Å². As the sequences retrieved from the atomic coordinates may differ from the one in the sequence tab of a given PDB file, not all proteins could be exactly aligned. Surface Racer [32] was used to compute the accessible surface area for each of the protein pairs that were exactly aligned. Surface Racer failed to run on several of these submissions; hence, a total of only 848 were processed. The default probe size (1.4 Å) and van der Waals radii set were used. Figure 9 is a histogram of the solvent-accessible area of the 17,211 MIRs. The histogram excludes the 66 MIRs with surface area >200 Å². Moreover, >93% of MIRs are located on hydrophobic residues (see Table 1). As MIRs are typically hydrophobic residues, we expect, and see, that most are deeply buried in the protein core.

Figure 9. Frequency of MIR accessible area (x axis: accessible area in Å², 0 to 200; y axis: occurrences).

Table 1. Frequency of hydrophobic residues in SPROUTS.

General                  MIR                      SMIR
Residue  Frequency (%)   Residue  Frequency (%)   Residue  Frequency (%)
Leu      8.27            Leu      35.08           Leu      32.58
Ala      8.05            Ile      16.40           Val      19.26
Gly      7.71            Phe      15.64           Ile      15.58
Val      6.94            Val      13.36           Phe      15.25
Ser      6.42            Met      5.18            Met      7.45
Lys      6.03            Tyr      4.58            Tyr      6.34
Glu      6.03            Trp      3.66            Trp      2.86
Asp      5.90            Cys      2.85            Gln      0.10
Thr      5.85            Ala      1.49            Glu      0.10
Arg      5.14            His      0.44            Asp      0.09
Ile      5.13            Thr      0.37            Thr      0.09
Asn      4.80            Gly      0.37            Lys      0.09
Pro      4.32            Ser      0.16            Ser      0.07
Gln      3.98            Arg      0.13            Pro      0.05
Phe      3.73            Pro      0.13            Ala      0.03
Tyr      3.56            Gln      0.05            His      0.02
His      2.26            Glu      0.04            Gly      0.02
Cys      2.18            Asn      0.03            Arg      0.01
Met      2.12            Asp      0.03            Cys      0.01
Trp      1.58            Xaa      0.00            Xaa      0.00
Xaa      0.02            Lys      0.00            Asn      0.00

The left column of the table displays the frequency of residues from all protein chains available in SPROUTS (1,498 protein chains at time of retrieval). The middle column reports on the distribution of the MIRs found for these 1,498 protein chains, whereas the right column reports on the SMIRs. Nearly a third of the overall residues found in SPROUTS (31.33%) were hydrophobic (left). A total of 32,656 MIRs were identified in those 1,498 protein chains, 93.91% of which are located on hydrophobic residues (middle). The SMIR method only identified 12,917 residues, which is 60% fewer than the number identified as MIRs (right).

Discussion

The knowledge of the residues involved in intrachain contacts during the folding process is important, for instance in the annotation of misfolding-related pathologies. The amino acids involved in several interactions are key residues in achieving the folding and in determining the fold, and are very sensitive to mutation because of their strong contribution to the structure. Prediction is, at this time, a valuable complementary approach because the experimental determination of the transition state ensemble is very difficult [33, 34]. Published literature admits that the number of residues involved in the folding nucleus can vary from a few percent up to half the sequence length, roughly one-third of the hydrophobic residues. Conservation of the folding nucleus among members of a given fold is also the object of large debates, and in the case of the immunoglobulin fold, can be as low as four residues [19, 35]. To contribute to this goal, we have developed an algorithm aimed at predicting from the sequence the set of residues involved in a large number of intrachain contacts, known as MIRs.
However, the energy related to a stabilizing contact comes, in fact, from a somewhat wider sequence range than a single residue, as shown in the article by Berezovsky et al., where the ends of the closed loops were composed of three to five residues [20]. Therefore, the prediction can be improved with a smoothing of the curve of NCNs as presented in SMIR. This drastically reduces the number of MIRs and avoids local fluctuations due to the presence of hydrophobic residues close together along the sequence. In protein sequence comparison, one is faced with two types of conserved positions. On the one hand, residues defining the active site are extremely conserved at the chemical level because they are responsible for the activity of the enzyme. On the other hand, some residues are important for the structure, and their physical property must be conserved. In other words, one must keep the hydrophobicity or hydrophilicity. This results in an effective amino acid alphabet that is degenerate, composed of at least two classes. We are concerned here with this second type of conservation, due to structural contingencies. Mutations within one of these classes are allowed; otherwise, they are forbidden because the structure would be destroyed, and consequently the function would be lost. It has been demonstrated that for each protein sharing a common fold where the nucleus is highly conserved, residues identified as MIRs constitute a nontrivial subset of the hydrophobic residues. Among such families of proteins (several sequences per family, same structure, potentially different functions, and very divergent sequences), MIRs occupy equivalent positions in the multiple alignments [10]. Therefore, a small number of hydrophobic positions are conserved as hydrophobic. They are compulsory for the folding to occur; they are deeply buried.
For these reasons, it seems reasonable to question whether they constitute or belong to the folding nucleus of the various folds. We are not at the stage of giving a definite answer yet; however, one can estimate it on the immunoglobulin-like fold (56 structures of divergent sequences), where the MIRs reconstruct most of the common folding nucleus, which is highly conserved [19]. These results concern a very small number of families; however, experimental evidence on the folding nucleus is not obvious and can show strong biases [36, 37]. When analyzing the proteins constructed by Orban's group, MIR successfully identified the exact residue whose mutation causes a switch between two folds. The curve generated by the MIR simulation that gives the mean number of non-covalent neighbors can also be shown to partially correspond to the curve (in terms of maxima and minima) of the ΔΔG impact count (representing the number of mutations that give a non-neutral ΔΔG). This is consistent with the idea that the MIR method identifies residues that are important in the folding process. An analysis of the correspondence between MIR positions and secondary structures shows that MIRs appear to follow the pattern of matching to one side of α-helices. As expected from their preference for hydrophobic residues, most MIRs have zero or very low accessible surface area. An a posteriori justification of the selection of the seven amino acids we made for the hydrophobic class is the fact that they have top-ranked frequencies among the MIRs. Finally, a brief discussion on the model used in the MIR/SMIR approach is useful. It has been argued that such simplified models may not lead to a realistic prediction of a folded protein without introducing strong biases toward the desired topology. However, the simple, bias-free, rather standard model used here is, in our opinion, well suited for the job of simulating the first steps, departing from the unfolded state.
The non-compact nature of protein topology at these steps is compatible with the discrete space, side-chain-free lattice geometry, with a mean-field-type, distance-independent potential. To provide open access to the community, a web server has been developed that performs the calculation of the MIR positions. If the query is a PDB code, then it verifies if this protein is available in a dataset where previous calculations are stored; otherwise, the prediction is computed. It also allows submitting a sequence in the FASTA format, and data can be retrieved from the collection of results, provided one code is given at the submission time. MIR calculation is part of the SPROUTS server that was originally dedicated to the prediction of the stability of a position in a sequence, under the effect of a point mutation. On several examples, there is a good agreement between positions predicted as involved in a number of intrachain contacts (SMIR) and the susceptibility of a change in the energy level if a mutation would occur. It seems to indicate that the SMIR method may identify the most critical residues out of the most interacting residues (40% of MIRs).

Conclusions

At the time of submission, October 2014, MIR prediction has been performed for >1900 PDB codes, and they are stored in the SPROUTS database (see http://sprouts.rpbs.univ-paris-diderot.fr/) hosted at the French RPBS, a portal devoted to services supporting analyses of protein structures [13]. It can be queried directly at http://sprouts.rpbs.univ-paris-diderot.fr/mir.html. The output of the SPROUTS server was designed to be user-friendly and intuitive. It presents on the same 2D graph the expected impact of all possible mutations at one position and the predicted number of intrachain contacts. It then gives a precise idea of positions where a mutation that would change the hydrophobicity of the side chain is likely to have important consequences on the structure.
MIR and its extension SMIR are also integrated in the whole SPROUTS analysis system, where the results can be compared to stability analyses [10, 38, 39]. MIR is listed as a service of the Mobyle portal for bioinformatics analyses (http://mobyle.rpbs.univ-paris-diderot.fr/), developed jointly by the Institut Pasteur Biology IT Center and RPBS that supports linear workflows integrating various services and databanks [40]. MIR is also listed as a resource in the Semantic Map for Structural Bioinformatics [41], where methods and tools are expressed in the terms of an ontology and displayed in a conceptual graph [42]. There are several leads that are expected to improve the method. For instance, Kolinski et al. [43] have proposed extending lattice models with flexible distances. Thus far, the algorithm treats a sequence on an "as it is" basis, and an improvement would be to first determine the domains and do the calculation on them instead of the full chain. Another extension would be to use a family of sequences instead of a single one, for instance through a query to PFAM, that would allow giving an insight into the conservation of the contacts in a family of proteins. While we improve the method, we are also exploring how to integrate the method to enhance other analyses of protein structures. A first approach consists of using signal processing methods such as mapping protein sequences to time-domain waveforms. We will explore how MIR can be used as an index to guide the alignment of proteins, therefore identifying areas of similarities that typically are not captured by traditional alignment methods [44]. A second promising area resides in integrating MIR with stability analyses to better predict the impact of mutations on protein structure [10, 38, 39].
We will investigate how the consensus method consisting of the average of various stability analyses currently made available in SPROUTS [12] can be improved for the prediction of the dramatic impact of mutation of protein structures with MIR.

Acknowledgments: We acknowledge Pierre Tufféry for his help on using the RPBS resources. Mathieu Lonquety and Christophe Legendre contributed to the SPROUTS database where SMIR results are stored. They are all thanked for their help. We also wish to acknowledge our collaborators at ASU: Antonia Papandreou-Suppappola and Anna Malin who have worked on an alternative MIR method, and Banu Ozkan for evaluating SPROUTS functionalities and discussing future improvement.

Author contributions: All authors have accepted responsibility for the entire content of the submitted manuscript and approved the submission.

Research funding: This work was partially supported by the National Science Foundation (grants IIS 0431174, IIS 0551444, IIS 0612273, IIS 0738906, IIS 0832551, IIS 0944126, and CNS 0849980) and by an invitation of the Université Pierre et Marie Curie.

Employment or leadership: None declared.

Honorarium: None declared.

Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication. Any opinion, finding, and conclusion or recommendation expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Appendix

Lattice geometry

We model a protein as a chain of evenly spaced Cα atoms placed on a lattice [14, 45]. We define a lattice unit (lu) to be 1.7 Å. Hence, Cα atoms are connected by vectors of the form (2, 1, 0). These vectors are √5 lu in length, which corresponds to 3.8 Å, the mean distance between adjacent Cα atoms in proteins. This results in 24 immediate neighbor positions for each point in the lattice. This represents the intersection of a 4×4×4-segmented cube with a sphere of radius 3.8 Å (√5 lu), as shown in Figure 10. Our model does not take into account the presence of side chains; therefore, the required separation is modeled with a 3.8 Å minimum distance requirement.

Figure 10 Vectors resulting from intersection of lattice with sphere at origin (points mark possible alpha carbon positions).

On the basis of chain geometry, we limit the angle between some Cαs at positions i, i+1, and i+2 in a sequence by requiring the distance between them to be from 4.1 to 7.2 Å (or from √6 to √18 lu). This corresponds to angles from 66° to 143° [14, 43], which are closer to the real angles in α-helices and β-strand conformations than previous cubic lattice methods [5]. This is demonstrated in Figure 11. Here, some residue i is fixed at (0, 0, 0); we then show all 24 possible positions for residue i+1 (black vectors). For each i+2 residue, after excluding the occupied origin, there is a choice of 23 possible vectors: for clarity, only one (0, 1, 2) (the green and red vectors) is shown. Red vectors are those that violate the distance (angle) restriction and will not be permitted by the method.

Figure 11 Subset of i to i+2 residue positions. Angle restriction: vectors parallel or producing sharp corners, thus violating the angle constraint, are shown in red. The sixth invalid position is not visible because it overlaps with the origin.

To initiate the simulations, 100 different starting conformations (or models) within the lattice are used. Figure 12 displays a sample of these models as a comprehensive plot.
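The lattice geometry described in this appendix can be checked with a short script. The following is a minimal Python sketch, not the authors' Fortran 90 implementation; the function names (`lattice_vectors`, `is_valid_angle`) are hypothetical, and the angle restriction is assumed to be the stated limit on the i to i+2 distance, expressed as a squared distance between 6 and 18 lu².

```python
from itertools import permutations, product


def lattice_vectors():
    """Return all distinct signed permutations of (2, 1, 0)."""
    vectors = set()
    for perm in permutations((2, 1, 0)):
        for signs in product((1, -1), repeat=3):
            # The sign applied to the zero component has no effect,
            # so the set removes the resulting duplicates.
            vectors.add(tuple(s * c for s, c in zip(signs, perm)))
    return sorted(vectors)


def squared_length(vector):
    """Squared Euclidean length in lattice units (lu^2)."""
    return sum(c * c for c in vector)


def is_valid_angle(bond_a, bond_b):
    """Assumed reading of the angle rule: 6 <= |i -> i+2|^2 <= 18 (lu^2)."""
    span = tuple(a + b for a, b in zip(bond_a, bond_b))
    return 6 <= squared_length(span) <= 18


VECTORS = lattice_vectors()
# Bond vectors allowed for i+1 -> i+2 when the bond i -> i+1 is (0, 1, 2).
valid_next = [v for v in VECTORS if is_valid_angle((0, 1, 2), v)]
print(len(VECTORS), len(valid_next))
```

Under these assumptions the script lists 24 bond vectors, all of squared length 5, and rejects six continuations of the bond (0, 1, 2), one of which returns to the origin.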
These starting conformations were taken from offline computations for chains of 1100 residues. The only requirement is that these randomly computed seed conformations have some level of non-compactness [14, 45]. Starting from the first backbone residue located at position (0, 0, 0), the first n positions in the seed model will be used for an input model with n residues in its sequence. When placing an input protein into the lattice for the first time, each residue in the protein is positioned based on the respective residue in the seed model, i.e., the first residue of the given protein is assigned the position of the first residue in the seed model, the second residue of the given protein is assigned the position of the second residue in the seed model, and so on. If the input sequence is shorter than the seed model, then the positions of the final residues in the seed will not be used. As will be discussed later, a number of possible perturbations may be applied to each of the residues in the model. The model is stored as a set of relative vectors between Cαs, representing their distance relative to the previous Cα. This is useful as we may, before knowing that a move is valid, change the position of a single residue without having to update every other residue that follows it in the chain. For the purposes of energy and neighbor calculations, a displacement vector must be calculated for the absolute position of each residue. We create new conformations by working along the relative positions of residues and accumulating their positions in space, in a neighbor-to-neighbor fashion. Hence, every residue following the one being changed will be translated by the distance between the initial and final positions of the perturbed residue. The resulting models are verified by checking that no non-adjacent residues have come into close proximity (closer than √5 lu) with other residues (which should be disallowed by our move set) and that no invalid bond angles have formed.
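As an illustration of the relative-vector storage and the proximity check described above, here is a minimal Python sketch. It is not the authors' code: the helper names (`absolute_positions`, `has_clash`) are hypothetical, and the proximity rule is assumed to mean that two non-adjacent residues may not be closer than √5 lu (squared distance below 5).

```python
def absolute_positions(relative_vectors, origin=(0, 0, 0)):
    """Accumulate bond vectors into absolute lattice positions."""
    positions = [origin]
    for vector in relative_vectors:
        last = positions[-1]
        positions.append(tuple(p + d for p, d in zip(last, vector)))
    return positions


def has_clash(positions, min_squared_distance=5):
    """True if any two non-adjacent residues are closer than sqrt(5) lu."""
    count = len(positions)
    for i in range(count):
        for j in range(i + 2, count):  # skip covalently bonded neighbors
            d2 = sum((a - b) ** 2 for a, b in zip(positions[i], positions[j]))
            if d2 < min_squared_distance:
                return True
    return False


open_chain = absolute_positions([(2, 1, 0), (1, 0, 2), (0, 1, 2)])
folded_back = absolute_positions([(2, 1, 0), (-2, -1, 0)])
print(open_chain, has_clash(open_chain), has_clash(folded_back))
```

Because only one relative vector (or a swapped pair) changes per move, a trial conformation can be rebuilt by a single accumulation pass like the one above before it is checked.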
If these checks fail, the perturbation that proposed the new conformation will be abandoned. At the end of each simulation, the algorithm internally produces a model that has collected all of the perturbations that were accepted, based on an energy criterion described later. It then uses the topology of the model to determine the NCN count at each residue. Two of these resulting models are shown in Figure 13 for the PDB code 1asu. Both of the MIR models follow the rough pattern of the starting model; however, one can observe the same globular regions starting to form at the same places (e.g., ends). In previous works, these were called proto fragments [5].

Figure 12 Left: first five initial models (initial structures 0 to 4). Right: all 100 initial models.

Figure 13 Models resulting from the first two simulations for the protein with PDB code 1asu. The blue models are the results, and the green model is the seed model used.

Energy model

Although the protein chain is only modeled as a sequence of Cαs, the effect of side chains is included in energy terms associated with each pair of residue interactions. We assume that inter-residue energies are significant when the distance between them is between 3.8 and 5.88 Å (√5 to √12 lu). The lattice model requires a minimum distance of 3.8 Å for side-chain separation. For this work, we use the distance-independent statistical pair potential by Miyazawa and Jernigan [4, 46]. This is a 20×20 symmetric matrix where solvent effects are implicit.
We take $E_R(i)$ to be the energy at the ith residue in a sequence and calculate it as

$$E_R(i) = \sum_{j \neq i \pm 1} E_I(i, j),$$

where the energy interaction $E_I$ is calculated according to the distance between residues (i, j) and the energy matrix $P_E$ corresponding to the residue-residue energy interaction [4, 14] that is a function of the type of residue (one of the 20 amino acids). We use type(i) to denote a function from residue index to residue type. Let dist(i, j) compute the Euclidean distance between two points on the lattice.

$$E_I(i, j) = \begin{cases} P_E(\mathrm{type}(i), \mathrm{type}(j)), & \text{if } \sqrt{5} \le \mathrm{dist}(i, j) \le \sqrt{12} \\ 0, & \text{otherwise} \end{cases}$$

Monte Carlo algorithm

The core of the MIR algorithm is a Monte Carlo simulation where possible perturbations (moves) are repeatedly applied to an existing conformation and accepted on the basis of the standard Metropolis criterion. In the following sections, we detail the concrete MIR implementation. The algorithm is implemented in Fortran 90 and is used in a Linux environment. All random numbers are computed with the "Keep It Simple, Stupid" method [47]. Each time the algorithm is run, the same initial seed value is used; together with the precomputed random initial conformation, this enables reproducible results. The Monte Carlo simulation is used to generate a total of 100 models. The limiting number of Monte Carlo steps for each simulation is given by:

$$\mathrm{MClimit}(L) = \begin{cases} 10^6, & \text{if } L < 50 \\ 10^6 \left( \dfrac{L}{50} \right)^2, & \text{otherwise} \end{cases}$$

where L is the sequence length. During a simulation, we record the state (snapshot) of residue interactions every MClimit/$10^4$ steps, for a minimum of every $10^2$ steps.

$$\mathrm{MChop} = \frac{\mathrm{MClimit}}{10^4}$$

The same process is followed for each simulation: first, we calculate the MClimit value. We then calculate how often an interaction snapshot should be taken. The overall initial energy of the model is then computed. From this point, we run the main simulation function to create new conformations until the model has performed MClimit number of steps. Let E be the sum of all residue energies for an entire model. For each change to the existing model, we compute $\Delta E = E_{new} - E_{existing}$. A new conformation is accepted if $\Delta E < 0$; otherwise, it may be accepted with probability

$$b = e^{-\Delta E / RT}$$

where R is the gas constant and T is the temperature such that RT = 1.5, corresponding to the optimized test previously performed [5]. According to [6], acceptance happens with probability:

$$\frac{b}{1 + b}.$$

Regardless of its acceptance, each attempted conformation represents an MC step. For each position, the number of NCNs is periodically recorded (every MChop steps). After the simulations have completed, we calculate the NCN count for each residue as an average based on the number of snapshots ($10^4$). Let N(i, j) be the total number of times i and j are non-covalent neighbors over all snapshots. Thus:

$$\mathrm{NCN}(i) = \frac{1}{10^4} \sum_{j \neq i \pm 1} N(i, j)$$

Residues with NCN of at least 6 are marked as MIRs; any residues with NCN of no more than 2 are marked as LIRs. To reduce statistical fluctuations that can produce successive positions attributed as an MIR, but without physical meaning, a smoothing procedure is implemented on the web server of SPROUTS. On the basis of a Pascal triangle algorithm, it produces a smoothed distribution of NCNs, and the maxima are then considered as SMIRs.

Figure 14 Possible perturbations for residue i at position (0, 1, 2). Residue i-1 is always located at (0, 0, 0), which is the bottom intersection of the bright colored lines. Residues i+1 are located at the ends of the top black vectors. Green are √10, brown is √14, and orange are √18.

Simulation step

A model is evolving while we have not reached MClimit and can still select perturbations, which change energy between some residues. When we begin processing the model, we randomly select an unseen residue in the sequence to perturb. Call it residue i.
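The step budget, snapshot interval, acceptance rule, and MIR/LIR thresholds of the Monte Carlo algorithm can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' Fortran 90 code: the function names are hypothetical, the step limit is read as 10^6 for L < 50 and 10^6·(L/50)^2 otherwise, the snapshot interval as MClimit/10^4, and the acceptance rule as "always accept when ΔE < 0, otherwise accept with probability b/(1+b), where b = exp(-ΔE/RT) and RT = 1.5".

```python
import math

RT = 1.5  # value of RT quoted in the text


def mc_limit(length):
    """Assumed step budget: 1e6 below 50 residues, scaled by (L/50)^2 above."""
    if length < 50:
        return 10**6
    return int(10**6 * (length / 50) ** 2)


def mc_hop(length):
    """Assumed snapshot interval, giving 1e4 snapshots per simulation."""
    return mc_limit(length) // 10**4


def acceptance_probability(delta_e, rt=RT):
    """b / (1 + b) with b = exp(-delta_e / rt)."""
    b = math.exp(-delta_e / rt)
    return b / (1.0 + b)


def accept(delta_e, uniform_sample, rt=RT):
    """One possible reading: downhill moves always pass, others pass by chance.

    uniform_sample is a number drawn uniformly from [0, 1) by the caller.
    """
    if delta_e < 0:
        return True
    return uniform_sample < acceptance_probability(delta_e, rt)


def classify(ncn):
    """Thresholds from the text: NCN >= 6 is an MIR, NCN <= 2 is an LIR."""
    if ncn >= 6:
        return "MIR"
    if ncn <= 2:
        return "LIR"
    return "neither"


print(mc_limit(100), mc_hop(100), classify(6.3))
```

Passing the random draw in as an argument keeps the acceptance test deterministic, which fits the paper's emphasis on reproducible runs from a fixed seed.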
We calculate the angle between residues i-1, i, and i+1. We limit the angle between these residues by restricting the distance between them from 4.1 to 7.2 Å (or from √6 to √18 lu). On the basis of the angle between the residues adjacent to i, a number of different perturbations (or moves) may be possible. Below, we will consider one of the six groups of angle perturbations that are possible. In the illustration of Figure 14, we simplify the possible placements by assuming that residue i-1 is located at the origin of the coordinate system and that residue i is located at (0, 1, 2). The following perturbations were originally presented by Skolnick and coworkers [6-8]. We use the term perturbation vectors to refer to the use of the difference vectors defined by the 24 possible positions between covalently bonded residues in the lattice. For a simple example of a perturbation, consider those perturbations of length √10, √14, and √18. These three distances each define a family of vectors (along some rotational symmetry) that can each serve as a perturbation of the model. For these so-called corner moves, the perturbation takes the form of an exchange of the relative vectors between residues i-1 to i with i to i+1 (see Figure 1; the continuous and dashed lines are swapped). Figure 14 shows the results of these perturbations to an arbitrary position for residue i (midpoint of bold lines). The nine positions possible, after removing unnatural bond angles, for residue i+1 are indicated in black. For each possibility from these nine initial positions, the dashed green lines indicate the associated result of the perturbation. Notice that for each case, residues i-1 and i+1 do not change position. For distances of 0, √6, √8, √12, and √16, the case is more complex: the algorithm locates multiple position perturbations. For 0, i.e., the N-terminal residue, the algorithm selects any of the 23 perturbation vectors that do not overlap with the model (the so-called end move).
For other distances, the algorithm also randomly selects one of the generated perturbation vectors that do not overlap with the model. However, it will have fewer valid choices than for the residues at the limits. The process of randomly selecting a perturbation continues until one that generates a valid model is found. When a position is problematic (i.e., it generates a model with residue overlap or unnatural angles), it is marked as seen so that it is not retried during the particular simulation step. If the new model is valid, the change is applied and the energy value is calculated for the residue that was moved. This local energy result is passed to the energy acceptance function to probabilistically determine if this new model should be accepted. If the model is invalid or is not accepted, then we reset to the previous model. In this case, one begins the process of seeking a residue to perturb again.

Bio-Algorithms and Med-Systems, de Gruyter
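The corner move described in the simulation step, an exchange of the two bond vectors around residue i, can be sketched as follows. This is an illustrative Python fragment with hypothetical names (`corner_move`, `absolute_positions`); it covers only the simple vector-swap case and leaves out the overlap, angle, and energy checks that the full method applies afterwards.

```python
def absolute_positions(relative_vectors):
    """Accumulate bond vectors into absolute positions, starting at the origin."""
    positions = [(0, 0, 0)]
    for vector in relative_vectors:
        last = positions[-1]
        positions.append(tuple(p + d for p, d in zip(last, vector)))
    return positions


def corner_move(relative_vectors, i):
    """Return a copy in which the bonds (i-1 -> i) and (i -> i+1) are exchanged.

    Residues are numbered from 0; relative_vectors[k] joins residue k to k+1,
    so the move is only defined for interior residues.
    """
    if not 0 < i < len(relative_vectors):
        raise ValueError("corner moves apply to interior residues only")
    moved = list(relative_vectors)
    moved[i - 1], moved[i] = moved[i], moved[i - 1]
    return moved


chain = [(2, 1, 0), (1, 0, 2), (0, 1, 2)]
before = absolute_positions(chain)
after = absolute_positions(corner_move(chain, 1))
changed = [k for k in range(len(before)) if before[k] != after[k]]
print(before, after, changed)
```

In this example residues 0 and 2 are √14 lu apart, one of the three distances named for corner moves, and only residue 1 changes position, which matches the remark that residues i-1 and i+1 stay in place.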

Protein intrachain contact prediction with most interacting residues (MIR)

Publisher
de Gruyter
Copyright
Copyright © 2014 by the
ISSN
1895-9091
eISSN
1896-530X
DOI
10.1515/bams-2014-0015

Abstract

The transition state ensemble during the folding process of globular proteins occurs when a sufficient number of intrachain contacts are formed, mainly, but not exclusively, due to hydrophobic interactions. These contacts are related to the folding nucleus, and they contribute to the stability of the native structure, although they may disappear after the energetic barrier of transition states has been passed. A number of structure and sequence analyses, as well as protein engineering studies, have shown that the signature of the folding nucleus is surprisingly present in the native three-dimensional structure, in the form of closed loops, and also in the early folding events. These findings support the idea that the residues of the folding nucleus become buried in the very first folding events, therefore helping the formation of closed loops that act as anchor structures, speed up the process, and overcome the Levinthal paradox. We present here a review of an algorithm intended to simulate in a discrete space the early steps of the folding process. It is based on a Monte Carlo simulation where perturbations, or moves, are randomly applied to residues within a sequence. In contrast with many technically similar approaches, this model does not intend to fold the protein but to calculate the number of non-covalent neighbors of each residue, during the early steps of the folding process. Amino acids along the sequence are categorized as most interacting residues (MIRs) or least interacting residues. The MIR method can be applied under a variety of circumstances. 
In the cases tested thus far, MIR has successfully identified the exact residue whose mutation causes a switch in conformation. This follows with the idea that MIR identifies residues that are important in the folding process. Most MIR positions correspond to hydrophobic residues; correspondingly, MIRs have zero or very low accessible surface area. Alongside the review of the MIR method, we present a new postprocessing method called smoothed MIR (SMIR), which refines the original MIR method by exploiting the knowledge of residue hydrophobicity. We review known results and present new ones, focusing on the ability of MIR to predict structural changes, secondary structure, and the improved precision with the SMIR method.

Keywords: globular proteins; hydrophobic core; prediction method; protein folding nucleus; protein structure.

DOI 10.1515/bams-2014-0015

Received September 11, 2014; accepted October 16, 2014

*Corresponding author: Zoé Lacroix, Scientific Data Management Laboratory, School of Electrical, Computer and Energy Engineering (ECEE), Arizona State University, Tempe, AZ 85282-5706, USA, E-mail: zoe.lacroix@asu.edu
Ruben Acuña: Scientific Data Management Laboratory, School of Electrical, Computer and Energy Engineering (ECEE), Arizona State University, Tempe, AZ 85282-5706, USA
Nikolaos Papandreou: Genetics Department, Agricultural University of Athens, Athens, Greece
Jacques Chomilier: Protein Structure Prediction group, IMPMC, Sorbonne University, UPMC, CNRS, MNHN, IRD, Paris, France; and RPBS, Paris, France

Introduction

Since the pioneering work of Anfinsen on the protein-folding problem [1, 2], large advances have occurred; however, it still remains an unresolved scientific problem. From the viewpoint of theory, a dynamical simulation is not amenable because it is able to produce results up to microseconds at best, whereas the typical folding kinetics is on the order of milliseconds.
This is why "toy models" are useful for solving this issue in a discrete space. This reduces the spatial complexity and has the ability to determine neighbors easily. Potential energies used in this type of simulation can either come from an analysis of the native structure of the studied chain [3], or from statistical derivation from a database of structures [4]. In previous works, we developed such a simulation on a cubic lattice, and focused on the formation of a complete globular protein [5]. It appeared that during initial folding, residues tend to locally aggregate to form fragments. The limits of the sequence of these fragments in the protein sequence are rather well fixed, although they can be displaced in the three-dimensional (3D) space. Subsequently, the number of fragments regularly decreases up to the formation of a single globule of contiguous fragments that collapse. The boundaries of two successive fragments are commonly located in loops, and almost never in regular secondary structures (RSSs). The algorithm is sensitive to the modification of the secondary structure, such as the introduction of a bulge in FK506 binding protein (FKBP) [5]. The bottleneck of this simulation is that it provides no information on the type of regular structure. Moreover, a single cubic lattice only allows angles of 90° between two successive α-carbons. The method has evolved with time, both technically and conceptually. From a technical point of view, a significant improvement was introduced by the use of a (2, 1, 0) lattice, based on the work of Skolnick and coworkers [6-8]. This kind of lattice provides large gains in chain flexibility and far more realistic local chain configurations. More precisely, we used a simplified version of the Skolnick model, without side chains and with a distance-independent, standard Miyazawa-Jernigan potential.
The main evolution of our approach was guided by the observation that all simulated polypeptide chains tended to form a succession of local fragments, indicating that some residues act as anchors and "attract" more spatial neighbors in the early steps of the calculation. The identification of these residues and their comparison with critical positions, provided by other sequence- or structure-based approaches, was a meaningful research path, as it could result in a reliable ab initio identification of some position that might be included in the folding nucleus; however, we do not pretend to exhaustively define it. For this latter purpose, the application of this method to a set of aligned sequences of similar fold seems to be promising to detect some aspects of the conservation of the folding nucleus, when it is the case. As we are interested in the early steps, we will focus on a set of residues able to interact with each other. This is related to the concept of "contact order," which, in essence, quantifies the mean length of the sequence between two interacting residues. This approach has been the object of many publications over the years, and various names have been given to the sequences joining the contacting residues: foldons, closed loops, building blocks, tightened end fragments (TEFs), and so on. Therefore, the newly developed most interacting residues (MIR) algorithm periodically records the number of non-covalent neighbors (NCNs) [9]. The time mean value of the NCNs is calculated at the end of the simulation, and a diagram is produced as an output of the algorithm. This distribution presents maxima that have been called MIR, and minima known as LIR (for least interacting residues). In a few cases tested thus far, the MIRs correspond to some extent to the residues involved in the protein-folding core [10]. They have also been compared to the energetic stability of positions when subjected to a single mutation. 
This comparison is enabled by the development of a web server proposing several algorithms that are applied to the calculation of the variation of the free energy due to the mutation. There is a statistical correspondence between MIRs and stable positions [11]. The MIR algorithm is now available online as part of Structural Prediction for pRotein fOlding UTility System (SPROUTS) [12], hosted by Ressource Parisienne en Bioinformatique Structurale (RPBS) [13]. It may be accessed at http://sprouts.rpbs.univ-paris-diderot.fr/. The web server SPROUTS comprises a database and submission server integrating various resources in order to predict the impact of protein stability under mutation. It contains a web interface for interacting with the results of the MIR simulation and a submission server that automatically analyzes sequence data with MIR. This article aims to address the following questions: what are the residues likely to play a critical role in the early steps of the folding of the protein, and whose mutation may influence its structure? The article is organized as follows. We first describe the MIR method and introduce the SMIR. We then report on known findings about the MIR method and present new results, in particular on the ability of MIR to predict structural changes, secondary structure, and the improved precision with the smoothed MIR (SMIR) method. We conclude with current research projects expected to benefit greatly from the use of MIR.

Materials and methods

We now give an outline of our method for simulating the early steps of protein folding. First, a random initial conformation is produced for an α-carbon-only simplified representation of the polypeptide chain. Each α-carbon is placed at random on the nodes of a lattice as illustrated in Figure 1. An extension of a cubic lattice, namely (2, 1, 0), originally proposed by Skolnick and coworkers [6-8], is used.
Compared with the simple cubic lattice, it allows a wider range of backbone angles, from 66° to 143°. The number of first neighbors is also higher (24 instead of 6). Side chains are discarded in our implementation of the method. Folding is simulated by randomly selecting one amino acid and submitting it to one of two available moves: an end move for the N- or C-terminal positions, or a corner move otherwise. Crankshaft moves (two neighbor residues displaced together to a pair of empty nodes) are no longer permitted with the (2, 1, 0) lattice. We then perform potential perturbations, or moves, upon randomly selected residues within the sequence. A new position can be used when it is not occupied by another residue. The energy of the new conformation is computed with a statistical potential of mean force taken from the literature [4]. The Metropolis criterion is applied to accept or reject the new conformation based on its energy change. This is repeated until a number of simulation steps, set according to the length of the sequence, has been taken.

The length of the sequences submitted to the simulation is limited to 500 amino acids. One reason is to limit the CPU time; the second is that folding makes sense as long as one deals with domains, not complete chains. Thus far, there is no domain separation of the query sequences, and this limit corresponds to the longest known single domain. The process is stopped when roughly 10⁵–10⁶ Monte Carlo steps have been completed, depending on the length of the input protein sequence. The required number of steps has been calibrated on a few sequences, by assessing the minimum number of steps beyond which the obtained MIR set does not fluctuate. The process is repeated 100 times, starting from 100 initial conformations.

Intermediate models are analyzed to determine, for each residue, how many other residues are close enough to interact while not being covalently bonded. This gives the number of NCNs at each residue, which is recorded so that the data can be averaged later. Two non-covalently bound residues are considered to interact (i.e., be NCNs) if the distance between their respective α-carbons does not exceed 7.2 Å. The mean NCN is calculated at the end of the process and averaged over all the initial conformations. The distribution of NCNs along the sequence presents maxima and minima. The maxima are called MIRs. A residue is accepted as an MIR if the NCN at this position is ≥6. They are mainly hydrophobic, i.e., one of the seven amino acids FILMVWY; this set was chosen because it is more frequently found in regular secondary structures (RSSs) than in loops [14]. If the NCN at one position is <2, this position is then assigned as a LIR. An optional smoothing process based on the Pascal triangle may then be applied to the resulting data to refine the MIRs identified. The concepts and implementation of the MIR algorithm are described in detail in the Appendix.

Acuña et al.: Prediction of the protein folding nucleus with most interacting residues

Figure 1: Details of a (2, 1, 0) move, with respect to the underlying cubic lattice (distances labeled in the figure: 3.8 Å and 1.7 Å).

Postprocessing

One limitation of the MIR algorithm is the number of residues identified as MIRs. The SMIR smoothing method is aimed at selecting, with a neighborhood analysis, the most interacting hydrophobic residue. Our method uses the Pascal triangle and adjusts the maxima identified in the smoothed graph to nearby (within three residues) hydrophobic positions, if any, based on our accepted precision of the algorithm. We continue to identify the minima with a threshold; however, there is no constraint on the nature of the LIR. Our smoothing algorithm has four parts: Pascal-triangle-based smoothing, identifying the maxima in the smoothed data, identifying the minima in the smoothed data, and, lastly, moving the identified maxima to the nearest hydrophobic residue, if any. We classify the amino acids FILMVWY as hydrophobic in the following discussion. The algorithm introduces the notion of smoothed NCN, from which SMIR and smoothed LIR (SLIR) are deduced.

The input to the smoothing process is the average number of NCNs as determined by the MIR simulation. To provide greater fidelity in the smoothing, the raw NCN count is multiplied by a factor of 10; it is later renormalized to match the initial values. Our technique uses the 10th row of Pascal's triangle. The numbers are used to calculate a weighted average (which we call Pascal value, or PV) at each residue by centering on it. As this technique requires 10 surrounding values, we first smooth the first and last five residues by using only the Pascal numbers that correspond to valid residues. We then label the maxima in the smoothed NCN data. If a residue has a higher PV than each of its adjacent neighbors and has a PV of at least 4.95 (NCN), we label it as a maximum. The next step is to label the minima in the smoothed NCN data. If a residue has a lower PV than each of its adjacent neighbors and has a PV of at most 3.04 (NCN), we label it as a minimum. Once the positions with the maxima have been labeled, the labeled positions may be moved to nearby hydrophobic residues. The nearest hydrophobic residue up to three positions away will be used. If there are multiple choices, the residue possessing the highest PV will be selected. If none of the neighbor residues is hydrophobic, the maximum will be lost. The maxima may now be considered SMIRs. The minima are not reallocated and so are taken to be the SLIRs. At this point, postprocessing is complete.

filled out at submission time (this is optional). An alternative input can be a custom sequence in FASTA format [16].
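The four postprocessing steps (Pascal-triangle smoothing, labeling of maxima, labeling of minima, relocation of maxima to hydrophobic residues) can be illustrated with a short Python sketch. It is not the SPROUTS implementation: all function names are hypothetical, the ×10 scaling is omitted because floating-point averages make it unnecessary, the minimum test is taken as "at most 3.04", the two chain ends are never labeled as extrema, and ties between equally distant hydrophobic candidates are broken by the higher PV. These choices are assumptions made for illustration only.

```python
from math import comb

HYDROPHOBIC = set("FILMVWY")
# Row 10 of Pascal's triangle: 1 10 45 120 210 252 210 120 45 10 1
WEIGHTS = [comb(10, k) for k in range(11)]


def pascal_smooth(ncn):
    """Return the Pascal value (PV) of each residue: a weighted average
    centered on the residue. Near the chain ends, only the weights that
    fall on existing residues are used (and the sum is renormalized)."""
    n = len(ncn)
    smoothed = []
    for i in range(n):
        num = den = 0.0
        for k, w in enumerate(WEIGHTS):
            j = i + k - 5  # residue covered by weight k
            if 0 <= j < n:
                num += w * ncn[j]
                den += w
        smoothed.append(num / den)
    return smoothed


def label_extrema(pv, max_thr=4.95, min_thr=3.04):
    """Label strict local maxima with PV >= max_thr and strict local
    minima with PV <= min_thr (the latter direction is an assumption)."""
    maxima, minima = [], []
    for i in range(1, len(pv) - 1):
        if pv[i] > pv[i - 1] and pv[i] > pv[i + 1] and pv[i] >= max_thr:
            maxima.append(i)
        elif pv[i] < pv[i - 1] and pv[i] < pv[i + 1] and pv[i] <= min_thr:
            minima.append(i)
    return maxima, minima


def relocate_maxima(maxima, pv, sequence, reach=3):
    """Move each maximum to the nearest hydrophobic residue within
    `reach` positions; a maximum with no such residue is dropped."""
    smirs = set()
    for i in maxima:
        lo, hi = max(0, i - reach), min(len(sequence) - 1, i + reach)
        candidates = [j for j in range(lo, hi + 1) if sequence[j] in HYDROPHOBIC]
        if candidates:
            # nearest first; among equally near candidates, highest PV
            smirs.add(min(candidates, key=lambda j: (abs(j - i), -pv[j])))
    return sorted(smirs)


def smir(ncn, sequence):
    """Return (SMIR positions, SLIR positions), both 0-based."""
    pv = pascal_smooth(ncn)
    maxima, slirs = label_extrema(pv)
    return relocate_maxima(maxima, pv, sequence), slirs
```

Here `smir(ncn, sequence)` takes the per-residue mean NCN values and the one-letter sequence of the same length. Positions are 0-based in this sketch, whereas the figures number residues from 1.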
The MIR results are stored in the SPROUTS database together with general information retrieved with the PDB code, when available. Alternatively, custom sequence data may be uploaded together with a retrieval code that gives the submitter access to the results. By populating a database automatically upon user submission, SPROUTS enables the quick retrieval of MIR results while allowing users to grow the database of PDB code MIR results with their own submissions. Data are presented with a dynamic interface implemented in JavaScript with D3 [17] for visualization. This interface allows one to apply the smoothing algorithm dynamically to the displayed data. The interface has been primarily tested in Google Chrome 35; Firefox 21 and Safari 6 have also been tested. Our browser-based implementation allows users to retrieve the new SMIR analysis for existing proteins without the need to resubmit the entry to our submission server. By clicking on the appropriate link, a file with the results in the comma-separated values (CSV) format can be downloaded. The following statistical analyses were performed using the live (as of June 30, 2014) set of proteins available in the SPROUTS database. Only entries corresponding to public entries in the PDB were used. This comprised 1498 chains (containing 32,655 MIRs). These analyses were performed on the unsmoothed MIR results. In future work, we plan

Results

In this section, we report the results obtained by running MIR in a variety of circumstances. The standalone MIR 1.0 implementation was first made available online as a part of the RPBS server in 2005 [13]. The SPROUTS submission server [12] uses the MIR 2.2 implementation (in Fortran) for server-side simulation, with a browser client-side extension for the SMIR 3.10 algorithm, and provides a JavaScript front end for interactive analysis and smoothing [15]. It may be accessed at http://sprouts.rpbs.univ-paris-diderot.fr/mir.html. The basic interface is shown in Figure 2.
When a user submits a list of Protein Data Bank (PDB) IDs (or retrieval codes, used for private data), SPROUTS immediately returns the corresponding MIR data if they have already been computed. New PDB IDs are submitted for processing on the SPROUTS server; at completion time, the user will receive a notification message if the e-mail box was

Figure 2: SPROUTS online interface for MIR. Background is the submission window on the server. In front, the output for the protein of PDB code 1asu, with the smoothed option. In light blue are the NCNs; dark blue are the SMIRs; brown are the LIRs; and the orange zones are the TEFs.

to strengthen these results by extending these analyses to PDB IDs that are representative of the 1200 folds contained in the structural classification of proteins (SCOP) [18]. The value of applying a smoothing procedure can be seen in Figure 3, in the case of the fibronectin type III domain (PDB code 1ten), chosen because an extensive folding nucleus determination has been performed [19]. Smoothing reduces the concentration of MIRs in neighboring positions (8 and 10, for instance) owing to statistical fluctuations. The smoothed curve of NCN presents several maxima at positions I8, T21, Y36, I48, I59, L72, and M79, if we skip the two SMIRs on both ends of the sequence. According to Hamill et al. [19], the folding nucleus of this protein contains only four residues, namely I20, Y36, I59, and V70. If we assume an accuracy of ±1, our prediction is correct for three of these four positions, and one is shifted by two residues (V70). Nevertheless, the overprediction still needs improvement, which is partly accomplished when the stability prediction is considered, as will be shown later.

sequence covered by the TEF is in orange, and statistically their limits are close to MIR [21].
Stability

As MIRs should correspond to residues that are important in the transition states of protein folding, mutations at those positions should also affect the stability of the protein. Protein stability can be examined in terms of ΔΔG values, representing the change in free energy difference between a wild type and a mutant. One feature of SPROUTS is a repository of ΔΔG values predicted by a suite of eight tools for every possible amino acid substitution at every position of a protein [12]. SPROUTS includes the ability to graph this in terms of a so-called impact count. For each predictive tool, the impact count is equal to the number of mutations at each position that give a mutant with a ΔΔG >2.00 kcal/mol or <−2.00 kcal/mol. As a score of 1 is attributed to any mutation that modifies the ΔΔG beyond this threshold of ±2 kcal/mol, the scale goes from 0 to 19, the maximum being reached if all possible mutations at a given position produce a strong destabilization or stabilization. This threshold has been proposed because it is estimated as the best accuracy, based on an evaluation of prediction tools [23] with calculations of the free energy. The impact count thus represents the number of mutations with a non-neutral ΔΔG. Positions of a protein with a high impact count are positions that are sensitive to substitutions, either stabilizing or destabilizing. An example is shown in Figure 4, with an immunoglobulin fold from the PDB entry 1ten. Not all available tools can be run for all the PDB codes, for data integrity reasons. As the impact count shows positions

TEF correspondence

For a number of years, the analysis of the 3D structures of globular proteins at the subdomain level has provided evidence of protein fragments whose two ends are very close in 3D space. Their extremities are mainly located in the cores of the globules, and also mainly occupied by conserved hydrophobic residues [20–22]. They are called either closed loops or TEFs.
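The impact count defined in the Stability section reduces to a simple threshold count, sketched below in Python. The function names and the dictionary-based input layout are hypothetical (they do not describe the SPROUTS data model), and the ΔΔG numbers in the usage note are made up for illustration.

```python
def impact_count(ddg_by_substitution, threshold=2.0):
    """Count the substitutions at one position whose predicted ddG
    (kcal/mol) lies strictly beyond +/- threshold, for one tool.

    ddg_by_substitution maps the substituted residue (one-letter code)
    to the predicted ddG of that point mutation."""
    return sum(1 for ddg in ddg_by_substitution.values() if abs(ddg) > threshold)


def impact_profile(ddg_table, threshold=2.0):
    """Apply impact_count to every position of a protein.

    ddg_table is a list indexed by sequence position; each entry is a
    mapping as described in impact_count."""
    return [impact_count(row, threshold) for row in ddg_table]
```

For example, `impact_count({"A": 2.5, "G": -3.1, "V": 0.4, "L": 2.0})` counts the first two substitutions only. With all 19 possible substitutions supplied for a position, the result ranges from 0 to 19, as stated in the text.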
Because of this property of burying, they are presented on the graphs of the 2D outputs of SPROUTS. In Figure 2, the length of the

Figure 3: Raw (left) and smoothed (right) NCNs on the protein of PDB code 1ten (NCN, 0 to 8, plotted against sequence position, 0 to 90). Dark blue bars indicate either MIR or SMIR, and dark red bars the smoothed LIR.

Figure 4: Stability of the protein of PDB code 1ten (panel title: "Free energy analysis of stability impact for each residue of 1TEN"). Left: impact count (0 to 20) along the sequence for five algorithms predicting the ΔΔG for a point mutation: FoldX [24], MUpro [25], DFIRE [26], and versions 2 and 3 of I-mutant [27]. Right: SMIR distribution.

that have a non-neutral ΔΔG under mutation, we expect that they have some correspondence to MIRs and SMIRs. In Figure 4, one can see the disagreement between the five tools used on this example. Nevertheless, most of them agree on the positions where the impact count is maximum. This means that, although the prediction of stability is not highly accurate, there is reasonable agreement between the algorithms on the positions where a mutation can affect the structure, either toward stability or toward instability. One can see that the maxima in the smoothed distribution of the NCN are in rather qualitative agreement with the maxima in the impact count distribution. In particular, the highest values are compatible with the maxima of the smoothed NCN curves at the following positions: T21, Y36, I48, I59, and L72. Therefore, compared with the results presented with the MIR calculation alone (see Figure 3), the false positives at positions 8 and 79 disappear when they are crossed with the stability prediction.
In other words, SMIRs are able to capture some of the physics that is at the core of the structural stability of the proteins [10].

Predicting structural change

Aimed at demonstrating the minimal key residues coding for a fold, several attempts have been published that are based on the same scenario. Starting from two short sequences of different folds, a set of mutations is performed on each one with the goal of obtaining the closest sequences possible while preserving the respective folds. Orban's group has demonstrated how a single point mutation could have a transformative impact on protein fold [28–30]. They engineered two proteins that are available in the PDB under the codes 2LHC (GA98) and 2LHD (GB98). These proteins contain 58 residues that give a 98% sequence similarity. 2LHC contains three α-helices, whereas 2LHD contains four β-strands and one α-helix, as illustrated in Figure 5. A single mutation at the 45th residue (leucine toward tyrosine) changes the folded conformation. As seen in the 2LHC SMIR curve presented in Figure 6, the 45th residue is initially identified as a SMIR, when it is a leucine. In the 2LHD SMIR curve in Figure 6, we see that the SMIR at that position has disappeared when it is mutated to another hydrophobic residue. Although the difference in the smoothed NCN distributions is small, this mutation is locally sufficient to perturb

Figure 5: PDB structures for 2LHC (GA98) (left) and 2LHD (GB98) (right). The colored dot indicates the position of the 45th residue in both sequences.

Figure 6: Smoothed NCN distributions for 2LHC (GA98) (left) and for 2LHD (GB98) (right), with NCN (0 to 8) plotted against sequence position (0 to 55). SMIRs are indicated by dark blue positions.
The SMIR comparison analysis shows that the residue at position 45 shown as an SMIR in 2LHC (left) is no longer an SMIR in 2LHD. This corresponds to the mutation that caused a dramatic structural change as illustrated in Figure 5.

the NCN curve such that the maximum vanishes with the tyrosine instead of the leucine.

Regular secondary structure distribution

Of the 32,314 MIRs in the data set for which DSSP-formatted data [31] were available, 20,573 (63.67%) were found in RSSs. Comparatively, the fraction of arbitrary residues that were found in RSSs was 51.07% in the full dataset. Of MIRs in RSSs, 11,732 (57.03%) were found in helix structures, whereas 8841 (42.97%) were found in strands. For both kinds of structures, MIRs tend to cluster at the ends. Figures 7 and 8 show how MIRs are distributed within RSSs. For clarity, the graphs omit those positions with <30 entries. Secondary structure data were extracted from the DSSP files associated with each PDB entry. We consider helices to be composed of the H, G, and I conformational states, whereas strands are E and B, according to the DSSP classification.

Figure 7: MIR distribution in helices (occurrences, 0 to 1200, by position from N1 to N12 and from C12 to C1). All helices of different lengths are reported in this figure. The first position at the N (C) terminus is called N1 (C1), and successive numbers are used up to the center of the helix.

In Figure 7, we see a pattern on the N-terminal end of the sequence where MIRs peak at the N1 and N4 positions of helices. This seems to correspond roughly to the number of residues in a turn of an α-helix (3.6) projected on our lattice model. Surprisingly, this feature, which one may assume is linked to the periodicity of the α-helix, is not seen on the C-terminal end. One assumption is that the capping effect at this C end is stronger than at the N cap.
It would mean that at this end, the helix is slightly shrunk, modifying the periodicity of the hydrophobic-hydrophilic distribution. This is consistent with the tendency to form 3₁₀ helices at the C terminus of the α-helices. In the central part of the figure, one cannot obtain any information, as the mean length of an α-helix in the PDB is 12 amino acids; thus, from N6 to C6, fewer positions are occupied, which weakens the statistics. The features in Figure 8 are less distinct with strands but seem to show that a pair of MIRs may cluster at either end of a strand. That one does not observe a succession of maxima and minima in this distribution may be attributed to the fact that completely buried strands, as opposed to strands at the border of the globule, are occupied by hydrophobic residues on both sides of the strand. Conclusions about the periodicity of the MIRs in strands are not as easy to draw as in the case of helices because the mean length of a strand in the PDB is five amino acids, corresponding to a numbering that jumps from N3 to C2.

those that could be exactly aligned with their associated PDB data. The set of proteins analyzed contained a total of 17,211 MIRs. Among them, 11,104 (64.52%) are buried if one applies a classical threshold of 25 Å². As the sequences retrieved from the atomic coordinates may differ from the one in the sequence tab of a given PDB file, not all proteins could be exactly aligned. Surface Racer [32] was used to compute the accessible surface area for each of the protein pairs that were exactly aligned. Surface Racer failed to run on several of these submissions; hence, a total of only 848 were processed. The default probe size (1.4 Å) and van der Waals radii set were used. Figure 9 is a histogram of the solvent-accessible area of the 17,211 MIRs. The histogram excludes the 66 MIRs with surface area >200 Å². Moreover, >93% of MIRs are located on hydrophobic residues (see Table 1).
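The bookkeeping behind the regular secondary structure analysis (helices as DSSP states H, G, I; strands as E, B; positions numbered N1, N2, ... from the N terminus and C1, C2, ... from the C terminus up to the center of the segment) can be sketched in Python as follows. This is only an illustration with hypothetical function names. It assumes the DSSP assignment is available as one character per residue, it merges adjacent helix states of different types into a single segment, and it assigns a residue exactly at the center of a segment to the N side; none of these choices is specified in the text.

```python
HELIX_STATES = frozenset("HGI")
STRAND_STATES = frozenset("EB")


def segments(dssp, states):
    """Return inclusive (start, end) index pairs of the maximal runs of
    residues whose DSSP code belongs to `states`."""
    runs, start = [], None
    for i, code in enumerate(dssp):
        if code in states:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:  # a run that reaches the end of the chain
        runs.append((start, len(dssp) - 1))
    return runs


def terminal_label(position, segment):
    """Label a residue by its distance to the nearer end of its segment
    (N1 is the first residue, C1 the last one)."""
    start, end = segment
    from_n = position - start + 1
    from_c = end - position + 1
    return f"N{from_n}" if from_n <= from_c else f"C{from_c}"


def label_mirs(dssp, mir_positions, states):
    """Map each MIR position lying inside a segment to its N/C label;
    MIRs outside the selected structures are left out."""
    labels = {}
    for seg in segments(dssp, states):
        for p in mir_positions:
            if seg[0] <= p <= seg[1]:
                labels[p] = terminal_label(p, seg)
    return labels
```

Counting the labels returned by `label_mirs` over many chains would give histograms of the kind shown in Figures 7 and 8.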
As MIRs are typically hydrophobic residues, we expect, and see, that most are deeply buried in the protein core.

Discussion

The knowledge of the residues involved in intrachain contacts during the folding process is important, for instance in the annotation of misfolding-related pathologies. The amino acids involved in several interactions are key residues in achieving the folding and in determining the fold, and are very sensitive to mutation because of their strong contribution to the structure. Prediction is, at this time, a valuable complementary approach because the experimental determination of the transition state ensemble is very difficult [33, 34]. Published literature admits

Accessible surface area

For this analysis, we started with the basic set of 1498 entries in the SPROUTS database and then narrowed it down to 848 entries (the final number used) based on

Figure 8: MIR distribution among strands (occurrences, 0 to 1800, by position from N1 to N12 and from C12 to C1). All strands of different lengths are reported in this figure. The first position at the N (C) terminus is called N1 (C1), and successive numbers are used up to the center of the strand.

Figure 9: Frequency of MIR accessible area (occurrences, 0 to 5000, against accessible area from 0 to 200 Å²).

Table 1: Frequency of hydrophobic residues in SPROUTS.
General                   MIR                       SMIR
Residue   Frequency (%)   Residue   Frequency (%)   Residue   Frequency (%)
Leu       8.27            Leu       35.08           Leu       32.58
Ala       8.05            Ile       16.40           Val       19.26
Gly       7.71            Phe       15.64           Ile       15.58
Val       6.94            Val       13.36           Phe       15.25
Ser       6.42            Met       5.18            Met       7.45
Lys       6.03            Tyr       4.58            Tyr       6.34
Glu       6.03            Trp       3.66            Trp       2.86
Asp       5.90            Cys       2.85            Gln       0.10
Thr       5.85            Ala       1.49            Glu       0.10
Arg       5.14            His       0.44            Asp       0.09
Ile       5.13            Thr       0.37            Thr       0.09
Asn       4.80            Gly       0.37            Lys       0.09
Pro       4.32            Ser       0.16            Ser       0.07
Gln       3.98            Arg       0.13            Pro       0.05
Phe       3.73            Pro       0.13            Ala       0.03
Tyr       3.56            Gln       0.05            His       0.02
His       2.26            Glu       0.04            Gly       0.02
Cys       2.18            Asn       0.03            Arg       0.01
Met       2.12            Asp       0.03            Cys       0.01
Trp       1.58            Xaa       0.00            Xaa       0.00
Xaa       0.02            Lys       0.00            Asn       0.00

The left column of the table displays the frequency of residues from all protein chains available in SPROUTS (1,498 protein chains at time of retrieval). The middle column reports on the distribution of the MIRs found for these 1,498 protein chains, whereas the right column reports on the SMIRs. Nearly a third of the overall residues found in SPROUTS (31.33%) were hydrophobic (left). A total of 32,656 MIRs were identified in those 1,498 protein chains, 93.91% of which are located on hydrophobic residues (middle). The SMIR method identified only 12,917 residues, 60% fewer than the number identified as MIRs (right).

that the number of residues involved in the folding nucleus can vary from a few percent up to half the sequence length, roughly one-third of the hydrophobic residues. Conservation of the folding nucleus among members of a given fold is also the object of large debates, and in the case of the immunoglobulin fold, can be as low as four residues [19, 35]. To contribute to this goal, we have developed an algorithm aimed at predicting from the sequence the set of residues involved in a large number of intrachain contacts, known as MIR.
However, the energy related to a stabilizing contact comes, in fact, from a somewhat wider sequence range than a single residue, as shown in the article by Berezovsky et al., where the ends of the closed loops were composed of three to five residues [20]. Therefore, the prediction can be improved by smoothing the curve of NCNs, as presented in SMIR. This drastically reduces the number of MIRs and avoids local fluctuations due to the presence of hydrophobic residues close to each other along the sequence. In protein sequence comparison, one is faced with two types of conserved positions. On the one hand, residues defining the active site are extremely conserved at the chemical level because they are responsible for the activity of the enzyme. On the other hand, some residues are important for the structure, and their physical property must be conserved. In other words, one must keep the hydrophobicity or hydrophilicity. This results in an effective amino acid alphabet that is degenerate, composed of at least two classes. We are concerned here with this second type of conservation, due to structural constraints. Mutations within one of these classes are allowed; otherwise, they are forbidden because the structure would be destroyed, and consequently the function would be lost. It has been demonstrated that for each protein sharing a common fold where the nucleus is highly conserved, residues identified as MIRs constitute a nontrivial subset of the hydrophobic residues. Among such families of proteins (several sequences per family, same structure, potentially different functions, and very divergent sequences), MIRs occupy equivalent positions in the multiple alignments [10]. Therefore, a small number of hydrophobic positions are conserved as hydrophobic. They are compulsory for the folding to occur; they are deeply buried.
For these reasons, it seems reasonable to ask whether they constitute or belong to the folding nucleus of the various folds. We are not yet at the stage of giving a definite answer; however, one can estimate it on the immunoglobulin-like fold (56 structures of divergent sequences), where the MIRs reconstruct most of the common folding nucleus, which is highly conserved [19]. These results concern a very small number of families; however, experimental evidence on the folding nucleus is not clear-cut and can show strong biases [36, 37]. When analyzing the proteins constructed by Orban's group, MIR successfully identified the exact residue whose mutation causes a switch between two folds. The curve generated by the MIR simulation, which gives the mean number of non-covalent neighbors, can also be shown to partially correspond (in terms of maxima and minima) to the curve of the ΔΔG impact count (representing the number of mutations that give a non-neutral ΔΔG). This is consistent with the idea that the MIR method identifies residues that are important in the folding process. An analysis of the correspondence between MIR positions and secondary structures shows that MIRs appear to follow the pattern of matching to a side of α-helices. As expected from their preference for hydrophobic residues, most MIRs have zero or very low accessible surface area. An a posteriori justification of our selection of the seven amino acids for the hydrophobic class is the fact that they have top-ranked frequencies among the MIRs. Finally, a brief discussion on the model used in the MIR/SMIR approach is useful. It has been argued that such simplified models may not lead to a realistic prediction of a folded protein without introducing strong biases toward the desired topology. However, the simple, bias-free, rather standard model used here is, in our opinion, well suited for the job of simulating the first steps, departing from the unfolded state.
The non-compact nature of protein topology at these steps is compatible with the discrete-space, side-chain-free lattice geometry, with a mean-field-type, distance-independent potential. To provide open access to the community, a web server has been developed that performs the calculation of the MIR positions. If the query is a PDB code, then it verifies whether this protein is available in a dataset where previous calculations are stored; otherwise, the prediction is computed. It also allows submitting a sequence in the FASTA format, and data can be retrieved from the collection of results, provided a code is given at submission time. MIR calculation is part of the SPROUTS server, which was originally dedicated to predicting the stability of a position in a sequence under the effect of a point mutation. On several examples, there is good agreement between positions predicted as involved in a number of intrachain contacts (SMIR) and the susceptibility to a change in the energy level if a mutation were to occur. It seems to indicate that the SMIR method may identify the most critical residues out of the most interacting residues (40% of MIRs).

Conclusions

At the time of submission, October 2014, MIR prediction has been performed for >1900 PDB codes, and they are stored in the SPROUTS database (see http://sprouts.rpbs.univ-paris-diderot.fr/) hosted at the French RPBS, a portal devoted to services supporting analyses of protein structures [13]. It can be queried directly at http://sprouts.rpbs.univ-paris-diderot.fr/mir.html. The output of the SPROUTS server was designed to be user-friendly and intuitive. It presents on the same 2D graph the expected impact of all possible mutations at one position and the predicted number of intrachain contacts. It thus gives a precise idea of positions where a mutation that would change the hydrophobicity of the side chain is likely to have important consequences on the structure.
MIR and its extension SMIR are also integrated in the whole SPROUTS analysis system, where the results can be compared to stability analyses [10, 38, 39]. MIR is listed as a service of the Mobyle portal for bioinformatics analyses (http://mobyle.rpbs.univ-paris-diderot.fr/), developed jointly by the Institut Pasteur Biology IT Center and RPBS, which supports linear workflows integrating various services and databanks [40]. MIR is also listed as a resource in the Semantic Map for Structural Bioinformatics [41], where methods and tools are expressed in the terms of an ontology and displayed in a conceptual graph [42]. There are several leads that are expected to improve the method. For instance, Kolinski et al. [43] have proposed extending lattice models with flexible distances. Thus far, the algorithm treats a sequence on an "as it is" basis, and an improvement would be to first determine the domains and do the calculation on them instead of the full chain. Another extension would be to use a family of sequences instead of a single one, for instance through a query to PFAM, which would give an insight into the conservation of the contacts in a family of proteins. While we improve the method, we are also exploring how to integrate the method to enhance other analyses of protein structures. A first approach consists of using signal processing methods such as mapping protein sequences to time-domain waveforms. We will explore how MIR can be used as an index to guide the alignment of proteins, thereby identifying areas of similarity that typically are not captured by traditional alignment methods [44]. A second promising area resides in integrating MIR with stability analyses to better predict the impact of mutations on protein structure [10, 38, 39].
We will investigate how the consensus method, which averages the various stability analyses currently made available in SPROUTS [12], can be improved with MIR to predict mutations that have a dramatic impact on protein structure.

Acknowledgments: We acknowledge Pierre Tufféry for his help on using the RPBS resources. Mathieu Lonquety and Christophe Legendre contributed to the SPROUTS database where SMIR results are stored. They are all thanked for their help. We also wish to acknowledge our collaborators at ASU: Antonia Papandreou-Suppappola and Anna Malin, who have worked on an alternative MIR method, and Banu Ozkan for evaluating SPROUTS functionalities and discussing future improvement.

Author contributions: All authors have accepted responsibility for the entire content of the submitted manuscript and approved the submission.

Research funding: This work was partially supported by the National Science Foundation (grants IIS 0431174, IIS 0551444, IIS 0612273, IIS 0738906, IIS 0832551, IIS 0944126, and CNS 0849980) and by an invitation of the Université Pierre et Marie Curie.

Employment or leadership: None declared.

Honorarium: None declared.

Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication. Any opinion, finding, and conclusion or recommendation expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Appendix

Lattice geometry

We model a protein as a chain of evenly spaced Cα atoms placed on a lattice [14, 45]. We define a lattice unit (lu) to be 1.7 Å. Hence, Cα atoms are connected by vectors of the form (2, 1, 0). These vectors are $\sqrt{5}$ lu in length, which corresponds to 3.8 Å, the mean distance between adjacent Cα atoms in proteins. This results in 24 immediate neighbor positions for each point in the lattice, which represent the intersection of a 4×4×4-segmented cube with a sphere of radius 3.8 Å ($\sqrt{5}$ lu), as shown in Figure 10. Our model does not take into account the presence of side chains; therefore, the required separation is modeled with a 3.8 Å minimum distance requirement.

Figure 10: Vectors resulting from the intersection of the lattice with a sphere centered at the origin.

On the basis of chain geometry, we limit the angle formed by the Cα atoms at positions i, i+1, and i+2 in a sequence by requiring the distance between residues i and i+2 to be from 4.1 to 7.2 Å (or from $\sqrt{6}$ to $\sqrt{18}$ lu). This corresponds to angles from 66° to 143° [14, 43], which are closer to the real angles in α-helix and β-strand conformations than those of previous cubic lattice methods [5]. This is demonstrated in Figure 11. Here, some residue i is fixed at (0, 0, 0); we then show all 24 possible positions for residue i+1 (black vectors). For each i+2 residue, after excluding the occupied origin, there is a choice of 23 possible vectors; for clarity, only those for one i+1 position, (0, 1, 2), are shown (the green and red vectors). Red vectors are those that violate the distance (angle) restriction and will not be permitted by the method.

Figure 11: Subset of i to i+2 residue positions. Angle restriction: vectors parallel or producing sharp corners, thus violating the angle constraint, are shown in red. The sixth invalid position is not visible because it overlaps with the origin.

To initiate the simulations, 100 different starting conformations (or models) within the lattice are used. Figure 12 displays a sample of these models as a comprehensive plot.
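The lattice geometry described above can be checked numerically. The following Python sketch is only an illustration and is not the authors' Fortran 90 code; the names `bond_vectors`, `angle_ok`, and `bond_angle_deg` are hypothetical. It enumerates the signed permutations of (2, 1, 0), applies the i to i+2 distance window, and converts the window limits into bond angles with the law of cosines.

```python
from itertools import permutations, product
import math

LU = 1.7  # angstroms per lattice unit, as defined in the text


def bond_vectors():
    """Return all signed permutations of (2, 1, 0) as sorted integer tuples."""
    vecs = set()
    for perm in permutations((2, 1, 0)):
        for signs in product((1, -1), repeat=3):
            vecs.add(tuple(s * c for s, c in zip(signs, perm)))
    return sorted(vecs)


def angle_ok(v1, v2):
    """True if the squared i to i+2 distance lies in [6, 18] lu^2.

    v1 is the bond vector i -> i+1 and v2 the bond vector i+1 -> i+2,
    so the i -> i+2 displacement is simply v1 + v2.
    """
    d2 = sum((a + b) ** 2 for a, b in zip(v1, v2))
    return 6 <= d2 <= 18


def bond_angle_deg(d2):
    """Angle at residue i+1 for two bonds of squared length 5 lu^2.

    Law of cosines: d2 = 5 + 5 - 2 * 5 * cos(theta).
    """
    return math.degrees(math.acos((10 - d2) / 10))


vectors = bond_vectors()
valid_after_012 = [v for v in vectors if angle_ok((0, 1, 2), v)]

if __name__ == "__main__":
    print(len(vectors), "bond vectors of length", LU * math.sqrt(5), "angstroms")
    print(len(valid_after_012), "valid continuations after (0, 1, 2)")
    print(bond_angle_deg(6), bond_angle_deg(18))
```

With the reference vector (0, 1, 2), the filter keeps 18 of the 24 candidates, which agrees with the six invalid positions mentioned in the caption of Figure 11, and the window limits map to roughly 66° and 143°.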
These starting conformations were taken from offline computations for chains of 1100 residues. The only requirement is that these randomly computed seed conformations have some level of non-compactness [14, 45]. Starting from the first backbone residue located at position (0, 0, 0), the first n positions in the seed model are used for an input sequence of n residues. When an input protein is placed into the lattice for the first time, each residue is positioned according to the respective residue in the seed model, i.e., the first residue of the given protein is assigned the position of the first residue in the seed model, the second residue is assigned the position of the second residue in the seed model, and so on. If the input sequence is shorter than the seed model, then the positions of the final residues in the seed are not used. As will be discussed later, a number of possible perturbations may be applied to each of the residues in the model.

The model is stored as a set of relative vectors between Cα atoms, each representing the position of a Cα relative to the previous one. This is useful because we may, before knowing that a move is valid, change the position of a single residue without having to update every other residue that follows it in the chain. For the purposes of energy and neighbor calculations, a displacement vector must be calculated for the absolute position of each residue. We create new conformations by working along the relative positions of residues and accumulating their positions in space, in a neighbor-to-neighbor fashion. Hence, every residue following the one being changed is translated by the displacement between the initial and final positions of the perturbed residue. The resulting models are verified by checking that no non-adjacent residues have come into close proximity ($\sqrt{5}$ lu) with each other (which should be disallowed by our move set) and that no invalid bond angles have formed.
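A minimal Python sketch of this relative-vector representation follows. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and "close proximity" is assumed here to mean a distance strictly below $\sqrt{5}$ lu between non-adjacent residues.

```python
from itertools import accumulate


def absolute_positions(rel_vectors, origin=(0, 0, 0)):
    """Accumulate relative bond vectors into absolute lattice coordinates."""
    def add(point, vec):
        return tuple(a + b for a, b in zip(point, vec))
    # The first residue sits at the origin; each bond vector adds one residue.
    return list(accumulate(rel_vectors, add, initial=origin))


def dist2(p, q):
    """Squared Euclidean distance in lattice units."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def is_valid(rel_vectors):
    """Check the angle window and the minimum separation of a conformation."""
    pos = absolute_positions(rel_vectors)
    n = len(pos)
    # Angle constraint: squared i to i+2 distance must lie in [6, 18] lu^2.
    for k in range(n - 2):
        if not 6 <= dist2(pos[k], pos[k + 2]) <= 18:
            return False
    # Assumed excluded-volume rule: non-adjacent residues closer than sqrt(5) lu.
    for i in range(n):
        for j in range(i + 2, n):
            if dist2(pos[i], pos[j]) < 5:
                return False
    return True


chain = [(2, 1, 0), (1, 2, 0), (0, 1, 2)]
moved = [(2, 1, 0), (0, 2, 1), (0, 1, 2)]  # only the second bond vector changed

if __name__ == "__main__":
    print(absolute_positions(chain))
    print(absolute_positions(moved))
    print(is_valid(chain), is_valid(moved))
```

Changing a single relative vector leaves the stored vectors of all later residues untouched; their absolute positions shift only when the coordinates are accumulated again, which is the property the text relies on for cheap trial moves.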
If these checks fail, the perturbation that proposed the new conformation is abandoned.

At the end of each simulation, the algorithm internally produces a model that has collected all of the perturbations that were accepted, based on an energy criterion described later. It then uses the topology of the model to determine the NCN count at each residue. Two of these resulting models are shown in Figure 13 for the PDB code 1asu. Both of the MIR models follow the rough pattern of the starting model; however, one can observe the same globular regions starting to form at the same places (e.g., the ends). In previous works, these were called proto fragments [5].

Figure 12: Left: first five initial models (legend: Initial structure 0 to Initial structure 4). Right: all 100 initial models.

Figure 13: Models resulting from the first two simulations for the protein with PDB code 1asu. The blue models are the results, and the green model is the seed model used.

Energy model

Although the protein chain is only modeled as a sequence of Cα atoms, the effect of side chains is included in energy terms associated with each pair of residue interactions. We assume that inter-residue energies are significant when the distance between the residues is between 3.8 and 5.88 Å ($\sqrt{5}$ to $\sqrt{12}$ lu). The lattice model requires a minimum distance of 3.8 Å for side-chain separation. For this work, we use the distance-independent statistical pair potential by Miyazawa and Jernigan [4, 46]. This is a symmetric 20×20 matrix where solvent effects are implicit.
We take $E_R(i)$ to be the energy at the ith residue in a sequence and calculate it as

$$E_R(i) = \sum_{j \neq i \pm 1} E_I(i, j),$$

where the energy interaction $E_I$ is calculated according to the distance between residues (i, j) and the energy matrix $P_E$ corresponding to the residue-residue energy interaction [4, 14], which is a function of the type of residue (one of the 20 amino acids). We use type(i) to denote a function from residue index to residue type. Let dist(i, j) compute the Euclidean distance between two points on the lattice. Then

$$E_I(i, j) = \begin{cases} P_E(\mathrm{type}(i), \mathrm{type}(j)), & \text{if } \sqrt{5} \le \mathrm{dist}(i, j) \le \sqrt{12} \\ 0, & \text{otherwise} \end{cases}$$

Monte Carlo algorithm

The core of the MIR algorithm is a Monte Carlo simulation where possible perturbations (moves) are repeatedly applied to an existing conformation and accepted on the basis of the standard Metropolis criterion. In the following sections, we detail the concrete MIR implementation. The algorithm is implemented in Fortran 90 and is used in a Linux environment. All random numbers are computed with the "Keep It Simple, Stupid" method [47]. Each time the algorithm is run, the same initial seed value is used; together with the precomputed random initial conformation, this enables reproducible results.

The Monte Carlo simulation is used to generate a total of 100 models. The limiting number of Monte Carlo steps for each simulation is given by

$$\mathrm{MClimit}(L) = \begin{cases} 10^{6}, & \text{if } L < 50 \\ 10^{6} \left( \dfrac{L}{50} \right)^{2}, & \text{otherwise} \end{cases}$$

where L is the sequence length. During a simulation, we record the state (snapshot) of residue interactions every MChop steps, with a minimum of every $10^{2}$ steps:

$$\mathrm{MChop} = \frac{\mathrm{MClimit}}{10^{4}}$$

The same process is followed for each simulation: first, we calculate the MClimit value. We then calculate how often an interaction snapshot should be taken. The overall initial energy of the model is then computed. From this point, we run the main simulation function to create new conformations until the model has performed MClimit steps.

Let E be the sum of all residue energies for an entire model. For each change to the existing model, we compute $\Delta E = E_{\mathrm{new}} - E_{\mathrm{existing}}$. A new conformation is accepted if $\Delta E < 0$; otherwise, it may be accepted with a probability based on

$$b = e^{-\Delta E / RT},$$

where R is the gas constant and T is the temperature such that RT = 1.5, corresponding to the optimized test previously performed [5]. According to [6], acceptance happens with probability

$$\frac{b}{1 + b}.$$

Regardless of its acceptance, each attempted conformation represents an MC step. For each position, the number of NCNs is periodically recorded (every MChop steps). After the simulations have completed, we calculate the NCN count for each residue as an average based on the number of snapshots ($10^{4}$). Let N(i, j) be the total number of times i and j are non-covalent neighbors over all snapshots. Thus:

$$\mathrm{NCN}(i) = \frac{1}{10^{4}} \sum_{j \neq i \pm 1} N(i, j)$$

Residues with an NCN of at least 6 are marked as MIRs; any residues with an NCN of no more than 2 are marked as LIRs. To reduce statistical fluctuations that can mark successive positions as MIRs without physical meaning, a smoothing procedure is implemented on the web server of SPROUTS. On the basis of a Pascal algorithm, it produces a smoothed distribution of NCNs, and the maxima are then considered as SMIRs.

Figure 14: Possible perturbations for residue i at position (0, 1, 2). Residue i-1 is always located at (0, 0, 0), which is the bottom intersection of the bright colored lines. Residues i+1 are located at the ends of the top black vectors. Green are $\sqrt{10}$, brown is $\sqrt{14}$, and orange are $\sqrt{18}$.

Simulation step

A model keeps evolving while we have not reached MClimit and can still select perturbations, which change the energy between some residues. When we begin processing the model, we randomly select an unseen residue in the sequence to perturb. Call it residue i.
We calculate the angle between residues i-1, i, and i+1. We limit the angle between these residues by restricting the distance between residues i-1 and i+1 to the range from 4.1 to 7.2 Å (or from $\sqrt{6}$ to $\sqrt{18}$ lu). On the basis of the angle between the residues adjacent to i, a number of different perturbations (or moves) may be possible. Below, we will consider one of the six groups of angle perturbations that are possible. In the illustration of Figure 14, we simplify the possible placements by assuming that residue i-1 is located at the origin of the coordinate system and that residue i is located at (0, 1, 2).

The following perturbations were originally presented by Skolnick and coworkers [6-8]. We use the term perturbation vectors to refer to the difference vectors defined by the 24 possible positions between covalently bonded residues in the lattice. For a simple example of a perturbation, consider those perturbations of length $\sqrt{10}$, $\sqrt{14}$, and $\sqrt{18}$ lu. These three distances each define a family of vectors (along some rotational symmetry) that can each serve as a perturbation of the model. For these so-called corner moves, the perturbation takes the form of an exchange of the relative vector between residues i-1 and i with the relative vector between residues i and i+1 (see Figure 1; the continuous and dashed lines are swapped). Figure 14 shows the results of these perturbations applied to an arbitrary position for residue i (midpoint of bold lines). The nine positions possible for residue i+1, after removing unnatural bond angles, are indicated in black. For each of these nine initial positions, the dashed green lines indicate the associated result of the perturbation. Notice that in each case, residues i-1 and i+1 do not change position.

For distances of 0, $\sqrt{6}$, $\sqrt{8}$, $\sqrt{12}$, and $\sqrt{16}$ lu, the case is more complex: the algorithm locates multiple position perturbations. For 0, i.e., the N-terminal residue, the algorithm selects any of the 23 perturbation vectors that do not overlap with the model (the so-called end move).
For other distances, the algorithm also randomly selects one of the generated perturbation vectors that do not overlap with the model. However, it will have fewer valid choices than for the residues at the chain ends. The process of randomly selecting a perturbation continues until one that generates a valid model is found. When a position is problematic (i.e., it generates a model with residue overlap or unnatural angles), it is marked as seen so that it is not retried during the particular simulation step. If the new model is valid, the change is applied and the energy value is calculated for the residue that was moved. This local energy result is passed to the energy acceptance function to determine probabilistically whether the new model should be accepted. If the model is invalid or is not accepted, then we reset to the previous model. In this case, the process of seeking a residue to perturb begins again.
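The bookkeeping around a simulation step can be sketched in Python as follows. This is a hedged illustration, not the authors' Fortran 90 code: all function names are hypothetical, the pair-energy table holds placeholder values rather than the Miyazawa and Jernigan matrix, and the acceptance rule reflects one reading of the text (always accept when the energy decreases, otherwise accept with probability b/(1+b) where b = exp(-ΔE/RT) and RT = 1.5).

```python
import math
import random

RT = 1.5  # value of RT given in the text


def mc_limit(length):
    """Limiting number of Monte Carlo steps for a sequence of given length."""
    if length < 50:
        return 10**6
    return round(10**6 * (length / 50) ** 2)


def mc_hop(length):
    """Snapshot interval: MClimit / 10^4, never below 10^2 steps."""
    return max(mc_limit(length) // 10**4, 10**2)


def residue_energy(i, positions, types, pair_energy):
    """Sum pair energies over non-bonded residues within sqrt(5)..sqrt(12) lu."""
    total = 0.0
    for j, pos_j in enumerate(positions):
        if abs(i - j) <= 1:  # skip the residue itself and covalent neighbors
            continue
        d2 = sum((a - b) ** 2 for a, b in zip(positions[i], pos_j))
        if 5 <= d2 <= 12:
            total += pair_energy[tuple(sorted((types[i], types[j])))]
    return total


def acceptance_probability(delta_e, rt=RT):
    """Assumed rule: 1 if the energy decreases, otherwise b / (1 + b)."""
    if delta_e < 0:
        return 1.0
    b = math.exp(-delta_e / rt)
    return b / (1.0 + b)


def accept(delta_e, rng=random):
    """Draw a uniform random number and compare it with the probability."""
    return rng.random() < acceptance_probability(delta_e)


def classify(ncn):
    """Label a residue from its average number of non-covalent neighbors."""
    if ncn >= 6:
        return "MIR"
    if ncn <= 2:
        return "LIR"
    return "neither"


# Placeholder energies for illustration only (not Miyazawa-Jernigan values).
TOY_ENERGY = {("L", "V"): -1.0, ("A", "L"): -0.5, ("A", "V"): -0.3}

if __name__ == "__main__":
    positions = [(0, 0, 0), (2, 1, 0), (3, 3, 0), (2, 2, 1)]
    print(mc_limit(100), mc_hop(100))
    print(residue_energy(0, positions, "LALV", TOY_ENERGY))
    print(acceptance_probability(0.5), classify(6.3))
```

Because MClimit is never below 10^6, the snapshot interval is at least 100 steps, so each simulation yields on the order of 10^4 snapshots regardless of sequence length.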

Journal

Bio-Algorithms and Med-Systems, de Gruyter

Published: Dec 19, 2014

References