Smotifs as structural local descriptors of supersecondary elements: classification, completeness and applications

Protein structures are made up of periodic and aperiodic structural elements (i.e., α-helices, β-strands, and loops). Despite the apparent lack of regular structure, loops have specific conformations and play a central role in the folding, dynamics, and function of proteins. In this article, we review our previous work on protein loops as local supersecondary structural motifs, or Smotifs. We reexamine our work on the structural classification of loops (ArchDB) and its application to loop structure prediction (ArchPRED), including the assessment of the limits of knowledge-based loop structure prediction methods. We close by focusing on the modular nature of proteins: the concept of Smotifs provides a convenient and practical way to decompose proteins into strings of concatenated Smotifs, and we discuss how this can be used in computational protein design and protein structure prediction.

Keywords: loop structure prediction; protein loops; protein secondary structures; protein structure design; protein structure prediction; protein supersecondary structures.

DOI 10.1515/bams-2014-0016
Received September 10, 2014; accepted October 15, 2014

Introduction

Proteins are the workhorses of cells, mediating most of the cellular processes that make life possible. The function of a given protein is dictated by its three-dimensional (3D) structure, which is composed of secondary structure elements. These fall into two major groups: secondary structures with regular patterns of atomic interactions and a translational symmetry (α-helices and β-strands) and nonrepetitive, nonregular secondary structures (loops). Loops account for a substantial part of proteins, up to 47% of all residues [1]; they are generally exposed and play an important role in the function, structure, and dynamics of proteins.
At the structural level, loops play an important role in the folding and dynamics of proteins. Loops can act as hinges that facilitate the folding/unfolding process [2–4], given their intrinsically flexible nature. It has also been shown that long-range loop-loop interactions are important in the folding of proteins [5]. Loop size has been related to the stability of proteins [6–9]; in the most extreme case, a single substitution in a loop can destabilize the entire protein [10].

Loops also play a central role in the function of proteins and in their associations with other biomolecules. Loops often define the functional specificity of protein families [11], binding pockets for substrates and cofactors (e.g., the P-loop [12] or the calcium-binding EF-Hand motifs [14]; Figure 1), and catalytic sites (e.g., serine proteases [15] or serine/threonine kinases [16]). Given their flexible nature, loops play an important role in the conformational changes of enzymes and are often responsible for the correct positioning of catalytic residues [17, 18] and for activation mechanisms [19–22]. Finally, loops are important in protein-protein [23–25] and protein-nucleic acid [26] associations and in related functions (e.g., serving as complementarity determining regions in immunoglobulins [27] or recognizing motifs in signaling pathways [28, 29]).

*Corresponding author: Narcis Fernandez-Fuentes, Structural Bioinformatics Group (GRIB), Department of Experimental and Life Sciences, University Pompeu Fabra, C. Doctor Aiguader, 88, Barcelona 08003, Catalonia, Spain, E-mail: narcis.fernandez@gmail.com
Jaume Bonet and Baldo Oliva: Structural Bioinformatics Group (GRIB), Department of Experimental and Life Sciences, University Pompeu Fabra, Barcelona, Catalonia, Spain
Andras Fiser: Albert Einstein College of Medicine, Department of Systems and Computational Biology, Bronx, NY, USA

Bonet et al.: Smotifs as structural local descriptors of supersecondary elements

Figure 1. EF-Hand motifs. Cartoon representation of the calmodulin-IQ domain-containing protein G complex, which contains four EF-Hand motifs [13]. Calcium ions are represented as black spheres.

In this article, we review our own and other related work on protein loops. We present our results on the automatic structural classification of loops [30–35] and show how this information can be used to predict the structure of protein loops [33, 36, 37]. We also discuss the limitations of the methodology [38, 39]. Finally, we describe how our loop classification can be applied to protein structure prediction using a hybrid ab initio approach [40] and to de novo structure-based protein design [41]. The common underlying concept that connects the different aspects of our research in this field is our definition of protein loops. We define loops as supersecondary structural motifs, or Smotifs, composed of the loop itself and its flanking regular secondary structures. Defining loops as Smotifs allows us to describe the local structural arrangement of the regions flanking the loop, that is, the geometry of the motif (Figure 2). In loop classification, geometry is a useful descriptor to group loops and speed up the clustering process. In loop structure prediction and protein design, the geometry allows an optimized hashing and look-up of loop conformations given a set of geometrical restraints.
Finally, in protein structure prediction and design, Smotifs provide a convenient and coherent scheme to break down proteins into a sum of supersecondary elements.

Figure 2. Definition of Smotifs and geometry. Smotifs are defined by the loop (green) and the two flanking regular secondary structures: an α-helix (Nt secondary structure in blue) and a β-strand (Ct secondary structure in red). The geometry of a Smotif is defined by four internal coordinates: a distance and three angles: δ, θ, and ρ (from top left and clockwise).

Structural classification of protein loops

It might seem counterintuitive that loops, largely known for their intrinsic flexibility and structural diversity, would follow conserved structural patterns. Thus, it is not surprising that in the early days of structural biology loops remained largely unclassified and were sometimes believed to have a random coil conformation. However, loops do follow structural patterns and can be classified.

Early work

The first types of loops to be classified were short, geometrically highly restricted loops (e.g., β-hairpins). One of the first classifications was proposed by Venkatachalam, who described several categories for four-residue-long turns [42]. The classification of β-turns was subsequently extended and revisited several times [43–45] to include reversed turns or γ-turns [46–48] and the subdivision of β-hairpins into different classes [49–52]. With the rapid increase in the number of known protein structures, and therefore of loop conformations, it became possible to uncover conserved structural patterns for longer and less geometrically restricted loops. Several studies identified commonly occurring structural patterns, often linked to sequences and atomic interaction patterns, for loops connecting not only β-turns but also β-strands and α-helices [53, 54].
Albeit important, these studies relied mainly on expert, manual classification; therefore, automatic approaches were needed to derive structural classifications that could cope with the ever-increasing amount of structural data [55, 56]. These include our work, described in more detail in the next section [31, 32, 34, 35], together with other approaches that classify loops using structural alphabets [57, 58].

Table 1. Historical growth of the ArchDB classification.

ArchDB  Method  Source  PDB     Smotifs  Subclasses  Classes
1997    DS      25      233     3005     121         56
2004    DS+rc   40      2310    12,665   1496        451
2007    DS+rc   95      5472    36,153   4023        2142
2007    DS+rc   40      3640    16,957   2550        1119
2007    DS+rc   EC      2349    20,260   2686        1338
2014    DS      40      17,961  129,280  13,198      5362
2014    MCL     40      17,961  187,117  12,240      9728

Automated classification of protein loops based on geometry and conformation (ArchDB)

The geometry of Smotifs can be exploited for the purpose of classifying loops. The geometry of Smotifs is defined by four internal variables that determine the relative position of the two secondary structures flanking the loop: the distance (d) and three angles: hoist (δ), packing (θ), and meridian (ρ) [34] (Figure 2). The classifying attributes of Smotifs are the geometry as defined by these four variables, the type of flanking secondary structures (i.e., α-α, α-β, β-α, and β-β; the latter further divided into β-hairpins and β-links), and the length of the loop (i.e., the number of residues composing the loop region of the Smotif).

The first classification of Smotifs [34] was generated from 233 high-quality X-ray crystallographic structures obtained from the Protein Data Bank (PDB) [59] (resolution better than 2.5 Å) after removing redundancy at a 25% sequence identity cutoff. For each of these, α-helices and β-strands were defined using DSSP [60], which produced a total of 3005 Smotifs.
Subsequently, Smotifs were clustered according to their geometrical properties using a density search (DS) algorithm, which is a variant of the single-linkage clustering method [61]. Briefly, a network is built in which the nodes are the Smotifs and the edges are defined by the similarity of the classifying attributes of the Smotifs. The DS algorithm detects regions within that network with a high density of Smotifs around a centroid defined by the classifying attributes of the Smotif. In this clustering, loops that belong to the same cluster have a length variation of ±1, similar flanking secondary structures, and similar [φ,ψ] angles (identified by a consensus conformation). Each cluster is required to have at least three Smotifs. The whole process generated 121 structural subclasses that were further grouped according to their Ramachandran map patterns into 56 classes (Table 1).

Table 1 legend: List of the major ArchDB updates according to the different criteria applied in each new version. Method refers to the clustering algorithm used: DS, reclustering (rc), or MCL. Source refers to the criteria used to select the PDB entries from which the Smotifs were generated: 25%, 40%, or 95% homology filter (25, 40, and 95) or enzymes (EC). PDB gives the number of PDB entries that fulfilled those criteria. Smotifs gives the number of classified Smotifs in that version of ArchDB. Subclasses and Classes give the final number of subclasses and classes for each classification, respectively.

ArchDB is organized in a hierarchical fashion: the first two levels of the hierarchy correspond to the flanking secondary structures (type) and the length of the loop (length). The third level of the classification corresponds to classes, which are formed of subclasses with similar Ramachandran map patterns but different geometry. The lowest level of the hierarchy is the subclass, which is the structural cluster of Smotifs (i.e., Smotifs with the same loop conformation and geometry).
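As a rough illustration of the network view of clustering described above (nodes are Smotifs, edges join Smotifs with similar classifying attributes), the following Python sketch applies plain single-linkage grouping and keeps only clusters with at least three members. It is not the DS algorithm of [61]: the `Smotif` record, the `similar` rule, and the distance and angle tolerances are assumptions made purely for illustration; only the ±1 length variation and the minimum cluster size of three come from the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Smotif:
    """Hypothetical record holding the classifying attributes of one Smotif."""
    name: str
    sstype: str      # flanking secondary structures, e.g. "HE"
    length: int      # number of loop residues
    d: float         # distance between flanking elements (angstroms)
    hoist: float     # angles in degrees
    packing: float
    meridian: float


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def similar(a, b, d_tol=2.0, angle_tol=30.0):
    """Illustrative edge rule; the tolerances are assumptions."""
    return (
        a.sstype == b.sstype
        and abs(a.length - b.length) <= 1          # +/-1 length variation
        and abs(a.d - b.d) <= d_tol
        and angle_diff(a.hoist, b.hoist) <= angle_tol
        and angle_diff(a.packing, b.packing) <= angle_tol
        and angle_diff(a.meridian, b.meridian) <= angle_tol
    )


def single_linkage(smotifs, min_size=3):
    """Group Smotifs connected by 'similar' edges (union-find)."""
    parent = list(range(len(smotifs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(smotifs)), 2):
        if similar(smotifs[i], smotifs[j]):
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, s in enumerate(smotifs):
        groups[find(i)].append(s.name)
    # Keep only clusters with at least `min_size` members.
    return sorted(sorted(g) for g in groups.values() if len(g) >= min_size)
```

Single linkage chains neighbours together, so two Smotifs can share a cluster without being directly similar; a density-based variant such as DS is presumably designed to limit that effect.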
This schema is used in all versions of ArchDB regardless of the particularities of the clustering algorithm applied (Figure 3).

The first update of ArchDB [31] made the database available online and introduced minor changes to the classification, such as the maximum sequence identity between source structures (40%; ArchDB40), the upper limit on resolution (3.0 Å), and the minimum number of loops in a cluster (2). A reclustering algorithm was applied after the first clustering to merge subclasses with shared loops, resulting in an optimized partition of the conformational space (Table 1). Furthermore, the update included references to gene ontology (GO) [62] and enzyme [63] annotations. The enzyme annotation was further exploited for the analysis of kinase superfamilies and their relation to Smotifs [30]. The number of subclasses and classes increased significantly (Table 1).

The third release of ArchDB included two new sets: ArchDB95, a redundant set, and ArchDB-EC, a classification derived from enzymes [32]. The new release included extensive functional annotations and cross-references to major biological databases, as well as an increase in the number of classified Smotifs, classes, and subclasses.

Figure 3. Hierarchy of ArchDB: type, length, class, and subclass. The first two letters of the code represent the type of Smotif, and the following three digits represent the length, class, and subclass, respectively. Consensus Ramachandran and geometry patterns are shown between curly and square brackets, respectively.

The new database was used both for modeling loops (ArchDB40) and for studying relevant structure-function features in loops (ArchDB95 and ArchDB-EC sets).
It also included a comprehensive study of the statistical correlation between ArchDB subclasses and GO [62], EC [63], and SCOP [64] annotations, as well as atomic interactions with cocrystallized cofactors and additional functional annotation extracted from the PDB [65].

The last update of ArchDB [35] is limited to the 40% sequence identity threshold, removes the ±1 length variation used in the DS classification, and includes a new classification method, the Markov clustering algorithm (MCL) [66]. The algorithm simulates a flow of information within the graph, enhancing the flow where the current is strong and hindering it where the current is weak. In MCL, the flow is controlled by expanding and inflating the stochastic Markov matrix that represents the graph. Contrary to DS, the loop length is not taken into account in this classification. To make the method computationally feasible, loops were grouped according to their lengths: short loops (length between 0 and 3), short-medium loops (length between 4 and 6), long-medium loops (length between 7 and 13), long loops (length between 14 and 20), and very long loops (more than 20). Each of these groups was classified separately (Table 1). In addition, in the latest release of ArchDB, five new Smotif types were included by considering the 3₁₀ helix as a regular secondary structure; previously, these helices were considered part of the loop regions, as α-helices were considered only if they exceeded five residues. Thus, the new Smotifs include 3₁₀ helix-3₁₀ helix, 3₁₀ helix-α-helix, 3₁₀ helix-β-strand, β-strand-3₁₀ helix, and α-helix-3₁₀ helix.

Historical perspective of classification of Smotifs

As shown in Table 1, the number of classified Smotifs, classes, and subclasses has increased from release to release. However, the changes in the criteria used to select, build, and classify the Smotifs limit our ability to truly explore the emergence of new Smotifs.
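The expansion and inflation steps of MCL mentioned above for the latest ArchDB release can be illustrated with a minimal, pure-Python sketch. This is a textbook-style toy version and not the implementation used to build ArchDB; the function names, the self-loop convention, the inflation value, the fixed iteration count, and the way clusters are read from the final matrix are all assumptions.

```python
def normalize_columns(m):
    """Rescale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[r][c] for r in range(n)) for c in range(n)]
    return [[m[r][c] / sums[c] if sums[c] else 0.0 for c in range(n)]
            for r in range(n)]


def expand(m):
    """Expansion: square the matrix, letting flow spread along longer paths."""
    n = len(m)
    return [[sum(m[r][k] * m[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def inflate(m, power):
    """Inflation: raise entries to a power and renormalize each column,
    strengthening strong flow and weakening weak flow."""
    return normalize_columns([[v ** power for v in row] for row in m])


def mcl(adjacency, inflation=2.0, iterations=50, eps=1e-6):
    """Toy MCL on a square adjacency matrix; returns clusters of node indices."""
    n = len(adjacency)
    # Self-loops are added to every node (a common convention, assumed here).
    m = [[float(adjacency[r][c]) + (1.0 if r == c else 0.0) for c in range(n)]
         for r in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Each row with non-zero entries lists the nodes attracted to that row.
    clusters = set()
    for r in range(n):
        members = frozenset(c for c in range(n) if m[r][c] > eps)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)
```

In such a sketch, a larger inflation value tends to produce smaller, tighter clusters, which is the knob that plays the role of granularity control.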
To further understand how the coverage of Smotifs in the PDB has evolved, we mapped the last version of ArchDB [35] backwards in time, starting from 1972. We selected the same day of the year for all years and mapped the Smotifs over all protein structures existing in the PDB at that time (considering the official release day of the PDB entries). One of the main objectives was to evaluate not only how many Smotifs were known as the PDB grew but also how many of these were new (i.e., added at the current release). We relied on the MCL classification of ArchDB. In each version of the PDB, we recognized a new subclass of Smotifs if at least three different entries were present. From that point forward, new Smotifs that belonged to the same subclass were not describing new 3D conformations but populating already known entries. After the year 2000, more than half of the new Smotifs could already be assigned to an existing subclass, and this fraction reached 70% in 2014 (Figure 4). It is clear that, despite the growing number of Smotif subclasses, there is a tendency toward a fully classified population of Smotifs, which has a clear impact on the improvement of predictions by knowledge-based methods.

Loop structure prediction

Whereas the regular structural elements of proteins (i.e., α-helices and β-strands) are usually well resolved by either X-ray crystallography or nuclear magnetic resonance (NMR), loops present a number of challenges that sometimes make it necessary to resort to computational tools to predict their structure. Moreover, comparative modeling, the most accurate and successful structure prediction approach (assuming a suitable template is available [67]), often requires loop modeling because there can be regions that are not present in the template(s) [68, 69], even at high sequence identity levels between the template(s) and the target sequence.
Moreover, several structure-based approaches in protein and drug design require loop structure prediction to understand protein-ligand interactions if templates do not include cocrystallized ligands (e.g., [70]).

There are two major approaches to predicting the conformations of loops: knowledge-based (also known as database search) methods and ab initio or de novo methods. Combined methods, which usually apply a knowledge-based approach followed by an ab initio one, have also been proposed [71]. In this article, we focus on knowledge-based approaches. A comprehensive review of loop prediction methods and the theory behind them can be found elsewhere [71].

Figure 4. Mapping of Smotifs in the PDB over the years. Data are shown for each Smotif type (as described in the text) and for all Smotifs (top left double graphic). For each graph, the top section represents the total number of Smotifs (in thousands) mapped over the PDB in that particular year (red), the number of those belonging to subclasses (blue), and the number of those that have at least one geometrically similar Smotif (orange). The bottom section represents the percentage of Smotifs clustered (blue) or with a similar one (orange). The pie charts represent the percentage of the different Smotifs, according to their flanking secondary structures, found in 1994, 2004, and 2014. HH, HG, GH, HE, EH, GG, EG, GE, BK, and BN Smotif types stand for α-helix-α-helix, α-helix-3₁₀ helix, 3₁₀ helix-α-helix, α-helix-β-strand, β-strand-α-helix, 3₁₀ helix-3₁₀ helix, β-strand-3₁₀ helix, 3₁₀ helix-β-strand, β-hairpins, and β-links, respectively.

Knowledge-based approaches for loop modeling

As the name suggests, knowledge-based prediction of loop conformations is done by searching among potentially thousands of conformations extracted from known protein structures.
The target loop is flanked by so-called stem residues (i.e., the residues of known structure that precede and follow the loop but are not part of it). The search consists of fitting candidate loops to the restraints imposed by the stem residues, followed by ranking based on geometric criteria and/or sequence similarity. Finally, the selected loops are superposed and annealed onto the stem regions.

Seminal research in knowledge-based loop structure prediction was initiated in the early 1980s by Greer [72], followed by Jones and Thirup [73]. These works represent the first attempts to use libraries of loops in loop structure prediction. The applicability of such methods was limited by the incompleteness of the PDB [74], although remarkable success was achieved when predicting loops that follow canonical conformations, such as the ones forming complementarity determining regions [75]. As the number of protein structures in the PDB grew, the number of known loop conformations increased; thus, the range of applicability and the success of such approaches improved. The improvement of prediction methods ran in parallel with the development of the first loop structure classification databases [55, 56, 76, 77], including ArchDB (see above). The information included in these databases was used to derive profiles and sequence signatures from the structural clusters, which were then used to align the target loop sequences [77–79]. Work in this area includes our contribution to the prediction of the H3 loop of immunoglobulins [33] and the use of hidden Markov models derived from ArchDB classes and subclasses to predict loop structures [36].

The use of sequence profiles derived from structural clusters to predict the structure of loops was complemented by the use of large libraries of fragments. Developments in this area include our own work (ArchPRED, see below) and that of others, including Michalsky et al. [80], Heuser et al. [81], and Choi and Deane [82].
ArchPRED [37, 83] is described in more detail in the "ArchPRED, a knowledge-based approach" section. To optimize the selection of suitable fragments from a nonredundant set, a neural network classifier was devised that showed an improvement for long loops, although the coverage remained low [84]. Choi and Deane revisited the loop structure prediction performance of a knowledge-based method, FREAD, showing a substantial increase in the accuracy of predictions for loops up to 20 residues long, with a coverage of up to 50% [82]. More recently, a combination of fragment assembly and analytical loop closure has shown great potential even for the so-called superlong loops [85].

Limitations of knowledge-based approaches: completeness of the database

Knowledge-based approaches aim to capitalize on existing main-chain loop conformations; thus, their major limitation would be the lack of suitable fragments for a given target sequence or, in other words, the completeness of the database (i.e., a method would fail if suitable conformations are not present in the database). This question was first studied by Fidelis et al. [86], who explored the frequency distribution of repeat conformations and the clustering of structurally related fragments. They concluded that knowledge-based approaches were only suitable for short loops of up to four residues. A similar study by Lessel and Schomburg [87] largely confirmed these observations. With the exponential growth in the number of protein structures, largely due to the automation and improvement of X-ray and NMR techniques within the framework of large-scale structural genomics initiatives [88, 89], the sampling of loop conformations has notably improved. The first evidence of this was provided by Du et al. [90], who showed that, even for loops up to 15 residues long, there was a high probability (>90%) of finding a suitable, nonhomologous structural fragment [i.e., <2 Å root mean square deviation (RMSD)].
We addressed the question of database completeness in a more comprehensive manner by studying the fraction of loops extracted from all known protein sequences that are indeed represented by loops from known protein structures [38]. The structures of loops were clustered after an all-versus-all comparison, and sequence identity cutoffs for different loop lengths were identified to ensure the limits of structural similarity (e.g., two loops with a sequence identity of 50% are expected to have an RMSD smaller than 2 Å; Figure 5). As expected, the required sequence similarity cutoff was more demanding for longer loops, with a generally sharp transition between low and high RMSD values at 45% to 55% sequence identity for all loop lengths. This indicates that 50% sequence identity guaranteed structural similarity (Figure 5), with very few notable exceptions (i.e., cases where high sequence identity was observed between dissimilar structures) [38, 91, 92]. Once the structural similarity cutoffs were correlated with sequence similarity, the sequences of loops extracted from protein sequences were compared with the sequences of loops extracted from protein structures. The results showed that loops of up to 10 residues could be associated with a loop of known structure (i.e., they shared at least 50% sequence identity), and only 20% and 10% of loops of sizes 13 and 14 did not have a match at this sequence identity level (Figure 5, inset). The study was repeated with a backdated database of loop structures to study the effect of the growth of the PDB and the saturation of conformations. It showed that, from about the year 2001, no unique conformations were deposited in the PDB for loops of up to 12 residues [38]. Indeed, our study demonstrated that the limitations of knowledge-based approaches were not related to the sampling of suitable conformations in the library of Smotifs but rather to the successful scoring and identification of suitable Smotifs. These findings were corroborated by a recent study by Choi and Deane [82] and, most importantly, by the data presented in the historical review of the classification of Smotifs in ArchDB (see the "Historical perspective of classification of Smotifs" section).

Figure 5. Database completeness. The relationship between structural similarity (RMSD, Å) and sequence identity (%) for loops of length 8–14, shown in red, blue, orange, green, dark blue, brown, and black, respectively. The inset shows the cumulative frequency distribution of loops that can be matched up between all known sequences and the available structural conformations at a given sequence identity. Adapted from [38].

ArchPRED, a knowledge-based approach

ArchPRED is an example of a knowledge-based loop structure prediction methodology [37]. ArchPRED relies on a library of Smotifs and features a filtering, selection, and ranking algorithm to select the most suitable conformation for a given target loop sequence (Figure 6). The library of Smotifs is organized by loop type (as defined by the flanking regular secondary structures; i.e., α-α, α-β, β-α, and β-β loops), size, and geometry. As described previously, the geometry of Smotifs is determined by four internal variables: the distance (d) and three angles: hoist (δ), packing (θ), and meridian (ρ) [34].
The selection of Smotifs from the library is based on the geometrical restraints imposed by the secondary structures bracing the missing loop (i.e., Smotifs are selected if their geometry is similar, falling within the range of tolerance: 2 Å in the case of d and 30°, 30°, and 45° in the case of the δ, θ, and ρ angles, respectively). The clear advantage of using the geometry rather than distances and/or structural fitting of stem residues is that the search space is reduced dramatically and the geometry-based filtering is very fast. On 50 randomly chosen examples, a distance matching criterion selects 1534, 683, and 430 suitable Smotifs for loops of sizes 4, 8, and 12, respectively. The number of Smotifs selected after geometrical filtering was only 181, 85, and 25, respectively. More importantly, this selection does not imply that good Smotifs are discarded: comparing the average RMSD of the best fragment between loops selected by distances and loops selected by geometry, the differences were <0.05, 0.09, and 0.11 Å for loops of sizes 4, 8, and 12, respectively.

Subsequently, a filtering step discards unsuitable Smotifs based on the structural matching of stem residues (RMSDstems) and on unfavorable interactions between Smotifs and the new protein environment (i.e., steric clashes). The RMSDstems was shown to correlate with the quality of prediction for both filtering and scoring purposes [80, 81]. However, we found that the correlation is less pronounced in the case of long loops (i.e., loops longer than 8 residues); therefore, ArchPRED uses a dynamic cutoff as a function of loop size. Finally, the filtering step evaluates the fitting of Smotifs in the new environment. This aspect is particularly important, as the native structural environment of a Smotif could be very different from the one in the target protein. The last step in the prediction process is the ranking of the remaining Smotifs.
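The geometry-based selection step described above can be sketched in a few lines of Python. This is a minimal illustration and not ArchPRED's code: the `Geometry` record and the function names are hypothetical, the tolerances (2 Å, 30°, 30°, 45°) are taken from the text and interpreted here as maximum absolute deviations, and treating the angles as periodic is an additional assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Geometry:
    """Hypothetical container for the four internal variables of a Smotif."""
    d: float         # distance, in angstroms
    hoist: float     # delta, in degrees
    packing: float   # theta, in degrees
    meridian: float  # rho, in degrees


# Tolerances quoted in the text, read as maximum absolute deviations.
TOLERANCE = Geometry(d=2.0, hoist=30.0, packing=30.0, meridian=45.0)


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def within_tolerance(query, candidate, tol=TOLERANCE):
    """True if the candidate geometry matches the query within tolerance."""
    return (
        abs(query.d - candidate.d) <= tol.d
        and angle_diff(query.hoist, candidate.hoist) <= tol.hoist
        and angle_diff(query.packing, candidate.packing) <= tol.packing
        and angle_diff(query.meridian, candidate.meridian) <= tol.meridian
    )


def select_candidates(query, library):
    """Return the names of library Smotifs whose geometry fits the query.

    `library` is an iterable of (name, Geometry) pairs.
    """
    return [name for name, geom in library if within_tolerance(query, geom)]
```

A linear scan is shown for clarity; binning the four variables into a hash table keyed by geometry would make the look-up independent of library size, in the spirit of the hashing mentioned earlier in the article.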
The scoring function is composed of a sequence similarity term (computed using the K3 substitution matrix [94]) and an amino acid [φ,ψ] dihedral angle propensity term [95]. Because sequence and propensity scores have different dimensions, they are converted into dimensionless statistical Z-scores, which are obtained in reference to randomly generated sequences and [φ,ψ] dihedral angles. The scoring function is then a composite Z-score combining the sequence and [φ,ψ] dihedral angle propensity Z-scores.

Figure 6. Overview of the ArchPRED prediction pipeline. Suitable Smotifs are selected based on the geometry of the missing loop, after which two filtering steps discard unsuitable Smotifs. The remaining Smotifs are ranked based on a composite scoring function. Adapted from [93].

ArchPRED was benchmarked on a nonredundant data set of protein loops of lengths ranging from 4 to 14 residues by comparing it with a competitive ab initio method (ModLoop) [96] and with the theoretical minimum RMSD (i.e., selecting the best Smotif candidate). In general, ArchPRED was competitive, and the library of loops contained suitable Smotifs for loops of all lengths (Figure 7). The latter observation reinforces earlier studies about the completeness of conformations in current databases, as discussed in the previous section. As the number of protein structures and loop conformations continues to grow, so do the accuracy and applicability of ArchPRED. Finally, we used the benchmarking of ArchPRED and the relationship between Z-scores, accuracy, and coverage to identify confidence Z-score thresholds for loops of all sizes (Table 2).

ArchPRED is available as a web application [83]. Users are required to provide the atomic coordinates of the query structure and define the location, sequence, and flanking secondary structures of the missing loop.
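Returning to the scoring function described above, the conversion of raw scores into dimensionless Z-scores and their combination can be sketched as follows. The source does not state how the two Z-scores are combined, so the unweighted sum used here is an assumption, as are the function names and the use of the population standard deviation of the random reference scores.

```python
from statistics import mean, pstdev


def z_score(value, reference):
    """Z-score of `value` relative to a reference distribution of scores
    (e.g. scores obtained for randomly generated sequences or angles)."""
    sigma = pstdev(reference)
    if sigma == 0:
        raise ValueError("reference scores have zero variance")
    return (value - mean(reference)) / sigma


def composite_z(seq_score, seq_reference, prop_score, prop_reference):
    """Combine sequence and dihedral-propensity Z-scores.

    The unweighted sum is an assumption made for illustration only.
    """
    return z_score(seq_score, seq_reference) + z_score(prop_score, prop_reference)


def rank_candidates(candidates, seq_reference, prop_reference):
    """Sort (name, seq_score, prop_score) tuples by decreasing composite Z-score."""
    scored = [
        (composite_z(seq, seq_reference, prop, prop_reference), name)
        for name, seq, prop in candidates
    ]
    return [name for _, name in sorted(scored, reverse=True)]
```

Because both terms are expressed in units of standard deviations from a random background, a threshold on the composite value can be interpreted uniformly across loop lengths, which is what makes per-length confidence cutoffs such as those in Table 2 possible.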
The number of returned predictions and the Z-score threshold are tunable parameters. As an optional postprediction optimization, users can select whether to graft the predicted conformation into the protein using Modeller [97]. The grafting of the loop in the structure involves (1) the optimization of the side chains of predicted loops and (2) a limited minimization to anneal the stems in the protein framework. The server is accessible at http://www.bioinsilico.org/ARCHPRED.

Figure 7 Prediction accuracy as a function of loop size (axes: loop length vs. RMSD, Å). For different loop sizes, the predicted loops were compared with the native structure, and the average RMSD values were computed. The black and green lines represent the accuracy of the theoretical limit (i.e., selecting the best Smotifs in the library) and the random prediction (i.e., random selection of Smotifs with the same loop size), respectively. The blue and red lines depict ArchPRED and ModLoop accuracy, respectively. Adapted from [37].

Table 2 Accuracy of prediction and coverage for different loop lengths.

Loop length   Z-score(a)   Average RMSD(b), Å   Coverage(c), %
4             1            0.22                 98
5             1            0.15                 96
6             1            0.34                 98
7             1            0.93                 94
8             2            1.38                 78
9             3            1.93                 60
10            3            2.11                 46
11            3            2.30                 44
12            4            2.47                 28
13            4            2.85                 4
14            4            2.88                 6

(a) Z-score cutoff to get an accuracy better than or equal to ModLoop. (b) Average RMSD for the given Z-score cutoff. (c) Percentage of query loops that are modelable (i.e., a suitable Smotif can be found) at the given Z-score threshold. Adapted from [93].

Smotifs as building blocks in structure-based protein design

Structure-based protein design is a rational approach to alter the properties of proteins by employing structural information [41]. Such changes include changes to improve protein stability (e.g., thermostability), alter or modify protein function, or modify the binding to substrates or other biomolecules (see reviews in [98, 99] and the references therein). The properties of proteins can be altered by changes that range from individual residues (i.e., point mutations) to the entire remodeling of short fragments. Initial works in the field focused on single amino acid changes with null or very limited flexibility to avoid the complexity and the limitations in the modeling of a flexible main-chain backbone (e.g., [100]). However, the remodeling of short regions of proteins, or de novo design, which usually materializes in the loops, presents a number of advantages as illustrated in recent publications [70, 101]. The remodeling of the protein backbone seeks to accommodate specific interactions and restraints (e.g., a catalytic amino acid) [101] or to create new structural elements to accommodate certain functionalities [102]. The main hurdle, however, is the inherent complexity that makes the systematic sampling of large insertions very difficult. There are, however, some examples described in the literature (reviewed in [103]) based on the recombination of modular, short fragments to diversify or graft novel functions and to improve a desired feature of proteins (e.g., thermostability). We can mention, for example, the reconstruction of a beta propeller via the assembly of short fragments [104] or the design of a highly stable protein [105]. Our definition of Smotifs is particularly suitable to account for the modularity of protein segments, so we developed Frag'r'Us, a method designed to sample the main-chain conformation by means of Smotifs [41].

Saturation of Smotifs

In the previous sections, we reviewed our and other works addressing the subject of saturation of loop conformation and the database completeness and quality of loop structure predictions. However, our definition of loops as Smotifs also includes the structural arrangement of the flanking secondary structure or geometry. Therefore, in a different development, we studied the sampling of the geometrical space of Smotifs [39]. To this end, we studied the emergence of different geometries across backdated sets of protein structures. We found that, as early as 1997, for all Smotif types (i.e., α-α, α-β, β-α, and β-β), all geometrical bins were fully sampled (Figure 8). In addition, we found that, as early as the third edition of the Critical Assessment of Structure Prediction meetings [106], any new protein submitted in the new-fold category could have been reconstructed using Smotifs from known folds. Likewise, proteins classified as new folds in the SCOP database [64] releases 1.75 and 1.73 did not contain Smotifs that had not been observed previously. Together, the study of Fernandez-Fuentes et al. [39] and our previous studies [38] concluded that the sampling of both loop conformations and geometries of Smotifs is rather comprehensive and thus opened possible new avenues in the fields of protein structure prediction and computational design.

Figure 8 Saturation of geometry of Smotifs (axes: year, 1976 to 1999, vs. cumulative frequency). Cumulative frequency distribution as a function of time for the four Smotif types (shown in red, blue, green, and black). Adapted from [39].

Structure-based protein design using Smotifs (Frag'r'Us)

Loops sharing the same geometry can have different conformations; in other words, the gap between two fixed flanking secondary structures can be spanned by loops having different conformations.
If we assume that the sampling of loop conformations and geometries in current databases is rather comprehensive, the question is whether our collection of Smotifs can be used to sample the conformation of short regions amenable to de novo protein design. To answer this question, we developed Frag'r'Us, a tool designed to sample loop conformations in a modular fashion using the geometrical constraints of Smotifs [41]. The sampling of the conformation is done at the geometry level. Upon defining the geometry of the Smotif of interest (i.e., the region to be remodeled), the library of Smotifs is queried using the geometrical values of the query Smotif. In its current implementation, Frag'r'Us considers that two Smotifs have a similar geometry if the differences between their corresponding geometry parameters d, δ, θ, and ρ are less than or equal to 0.5 Å, 5°, 5°, and 10°, respectively. As a result of this comparison, a list of Smotifs with equivalent geometry but different conformation is generated. These Smotifs constitute the starting point of the downstream computational design process.

Frag'r'Us was benchmarked against a data set of pairs of proteins (wild-type and redesigned), where the structures of both the initial protein prior to the design (i.e., scaffold) and the designed protein were known. The test set included a wide range of examples in de novo protein design, including a Diels-Alderase [70], a human guanine deaminase [101], the catalytic loops in (βα)8-barrel scaffolds [107], two retro-aldol enzymes [108, 109], a triosephosphate isomerase [110], and the motifs to target antibody b12 [111]. A detailed description of each individual case, the regions, and information about scaffold and designed proteins can be found in the supplementary material of our publication describing this work [41]. The remodeled regions of the designed proteins were compared with the loop conformations of Smotifs matching the geometry of the flanking secondary structures.
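The geometry query described above can be sketched as a simple tolerance filter. This is an illustrative sketch rather than the Frag'r'Us code: the `Geometry` class, the function names, and the dictionary-based library are hypothetical, and treating the angles as periodic (so that 355° and 3° are 8° apart) is an assumption the text does not state. Only the four tolerances (0.5 Å for the distance and 5°, 5°, and 10° for the hoist, packing, and meridian angles) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Geometry:
    d: float         # distance between the flanking secondary structures, in Å
    hoist: float     # degrees
    packing: float   # degrees
    meridian: float  # degrees


# Tolerances quoted in the text: 0.5 Å, 5°, 5°, and 10°.
TOLERANCE = Geometry(d=0.5, hoist=5.0, packing=5.0, meridian=10.0)


def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (periodicity assumed)."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def similar_geometry(query, candidate, tol=TOLERANCE):
    """True if all four geometry parameters differ by no more than the tolerances."""
    return (abs(query.d - candidate.d) <= tol.d
            and angle_diff(query.hoist, candidate.hoist) <= tol.hoist
            and angle_diff(query.packing, candidate.packing) <= tol.packing
            and angle_diff(query.meridian, candidate.meridian) <= tol.meridian)


def query_library(query, library):
    """Return the identifiers of library Smotifs whose geometry matches the query."""
    return [name for name, geom in library.items() if similar_geometry(query, geom)]
```

The returned candidates share the geometry of the query but may differ in loop conformation, which is the property the design step relies on.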
In all cases except one, the conformation of Smotifs extracted from backdated libraries could recapitulate the conformation of the designed fragment. A particularly challenging case was the remodeling of a 24-residue-long insertion flanked by a β-hairpin [70]. In the original work, the remodeling of this region was achieved by combining Rosetta [112] and crowdsourcing (the Foldit game) [113]. Besides the size of the fragment, the remodeled region contained a helix-turn-helix motif. Even in this case, Frag'r'Us was able to provide a satisfactory Smotif whose loop conformation closely resembled the conformation of the designed version by using just the geometrical restraints (Figure 9).

Protein structure prediction using Smotifs and experimental data

Smotifs can be used for protein structure prediction following a fragment assembly approach. The underlying hypothesis is that patterns of indirect structural data characterizing the connecting loop region in a Smotif will determine the relative orientation of flanking secondary structures and thus will be informative for the selection of an entire supersecondary structure element (Smotif). This hypothesis emerges from the success of the inverse application (see above), in which the conformations of loops were successfully modeled by the fit of the corresponding Smotif, specifically the flanking secondary structure residues, in the relevant structural environments in the template and target structures [36, 37]. One possible avenue to make a significant advance in protein structure modeling is to incorporate indirect structural data from high-throughput experiments [114]. A growing number of methods incorporate a variety of easily obtainable NMR data as restraints to guide protein structure modeling or simulation. Many of these methods focus on backbone NMR chemical shift (CS) assignments. Obtaining CS is a necessary and straightforward first step in the classic NMR structure determination process.
Figure 9 Example of Frag'r'Us prediction. Ribbon representation of the superposed structures of the scaffold (PDB code 3i1c, chain A) and engineered (PDB code 3u0s, chain A) proteins and the best matching candidate Smotif (PDB code 2p3p, chain A, residues 242–271). A and B indicate the start and end points of the remodeled loop. Green, blue, and red depict the native protein, the engineered protein, and the Smotif, respectively. Adapted from [41].

Within the framework of developing the TALOS program, it was shown that CS data can guide the selection of tripeptide segments with similar conformations and provide preferences/restraints for main-chain dihedral angles [115, 116]. The highly successful Rosetta ab initio fragment assembly program [117] combines CS data and sparse NOE restraints (approximately one per residue) to steer the selection and filtering of three- and nine-residue fragments, besides taking into account sequence similarity measures of these fragments [118]. In a similar approach by Gong et al. [119], experimentally determined CS and sequence patterns were used to search the protein database for consecutively overlapping six-residue-long backbone fragments, which were then "stitched" together using Monte Carlo simulation. In more recent applications, CS-Rosetta was shown to be successful in delivering high-quality models when using CS data in combination with sequence information [120, 121]. Similar ideas are implemented in the CHESHIRE method, which first predicts secondary structures of three- and nine-residue fragments using CS data and then combines these fragments into larger ones by matching sequence information, secondary structures, and CS patterns [122]. NMR CS data have also been converted into forces in molecular dynamics simulations and successfully used to fold short polypeptide chains or to refine partially unfolded structures [123, 124].
An important advance for that work was the development of the CamShift method [125], which quickly predicts CS values from structures. Besides CamShift, several other approaches are available that calculate theoretical CS values for a given structure, such as SHIFTX2 [126], SPARTA+ [127], and PROSHIFT [128]. GENMR [129] is a very fast modeling implementation that combines homology models with CS and/or NOE data. The component of GENMR that relies on structure calculation using CS and sequence information without NOE data is CS23D [130]. CS23D incorporates various other methods, such as threading, homology modeling, or small fragment assembly using the Rosetta program.

SmotifCS algorithm for hybrid modeling of protein structures

Because Smotifs are fragments defined by their backbone only, a relation needs to be established between a target sequence and the backbone-only library of Smotifs. One possible way to do this is hybrid modeling, where a limited amount of easily obtainable, indirect experimental data is used to select Smotifs for structure modeling. We used CS assignments from NMR studies for the target protein and developed a novel algorithm to predict the structure of a protein by combining Smotifs and CS information: SmotifCS [40]. The schematic representation of the SmotifCS algorithm, described in more detail below, is illustrated in Figure 10.

Figure 10 Overview of the SmotifCS prediction pipeline. Smotifs and NMR data are combined to select the combination of Smotifs to predict the structure of the protein [40].

In this application, we first need to precalculate the theoretical CS of all backbone atoms (N, HN, Hα, Cα, Cβ, C′) for all Smotifs in our library using SPARTA+ [127]. Next, the structure prediction algorithm relies on another precalculated database that contains the relative weights of structural information conveyed by a given normalized CS. Predicted CS values aggregated from all library Smotifs were divided into groups based on atom type, residue type, and preceding residue type, resulting in 6 × 20 × 20 = 2400 categories. For each category, CS values were normalized by subtracting the random coil value. The relative weight of structural information conveyed by a given CS (categorized by atom type, residue type, and preceding residue type) is calculated as the difference between the statistical propensities of the "most favored" and "second most favored" secondary structural conformations.

To identify the relative orientation of regular secondary structures within a Smotif, we analyzed the CS patterns of the loop segments and the three flanking secondary structure residues on each side of the loop. To select candidate Smotifs from our library, we compared the experimental CS of each query Smotif and the theoretical CS of available Smotifs in our library. Theoretical (φ, ψ) angles predicted by TALOS+ [116] (also obtained from the experimental CS) were used to assign each loop residue of the query Smotif to one of the 11 possible locations within the Ramachandran map [37]. The string of Ramachandran map sublocations constituted the "fingerprint" of loop segments that was compared to similar fingerprints derived from the Smotifs in our library. The best matching Smotif fingerprints were then ranked by their CS match "score", calculated as the sum of weighted squared differences between the CS of the query and library Smotifs.

After a suitable set of candidates has been selected for each putative Smotif in the query structure, a full enumeration of the structures is carried out by joining every possible combination of these Smotifs. The lengths of the secondary structures of the sampled Smotifs are extended or shortened as necessary to fit the query sequence. In the process of joining Smotifs, a limited number of steric clashes are allowed.
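The CS bookkeeping described above can be illustrated with a short sketch. It is not the SmotifCS code: the atom labels, function names, and the flat list representation of shifts are assumptions made for illustration, and the sketch assumes that a lower score means a closer match. It shows how the 2400 categories arise, how a shift is normalized against its random coil value, and how a sum of weighted squared differences is computed.

```python
from itertools import product

# Six backbone atom types and the 20 standard residue types.
# The atom labels are illustrative placeholders.
ATOM_TYPES = ("N", "HN", "HA", "CA", "CB", "C")
RESIDUE_TYPES = tuple("ACDEFGHIKLMNPQRSTVWY")

# One category per (atom type, residue type, preceding residue type).
CATEGORIES = list(product(ATOM_TYPES, RESIDUE_TYPES, RESIDUE_TYPES))


def normalize(shift, random_coil_value):
    """Normalize a chemical shift by subtracting its random coil value."""
    return shift - random_coil_value


def cs_match_score(query_cs, library_cs, weights):
    """Sum of weighted squared differences between aligned, normalized shifts."""
    if not (len(query_cs) == len(library_cs) == len(weights)):
        raise ValueError("shift and weight lists must have the same length")
    return sum(w * (q - l) ** 2
               for q, l, w in zip(query_cs, library_cs, weights))
```

In this sketch each weight would be looked up from the category of the corresponding shift, so that shifts carrying more structural information contribute more to the score.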
The candidate structures resulting from the full enumeration are evaluated using a linear scoring function with the following components: radius of gyration using Cα carbons, a distance-dependent statistical potential function [131–133], an implicit solvation potential [134], and a knowledge-based long-range backbone hydrogen-bonding potential [135]. All components were converted into statistical Z-scores before combining them with weights optimized on a set of decoy structures. The best 200 structures from this ranking were relaxed using Modeller [136] to resolve steric clashes and maintain stereochemistry. The accuracy of the final models was evaluated using RMSD and GDT_TS scores [137] with respect to the experimental solution structure. The method was tested on a data set of 102 proteins obtained from the Biological Magnetic Resonance Data Bank [138]. The test set is currently the largest nonredundant data set of experimentally known structures for which CS data are publicly available and where all structures represent a different SCOP fold category [139]. The results are presented as a distribution of GDT_TS scores [137] for the superposed backbone atoms of the experimental structure compared with the top ranked model for the entire length of the protein (Figure 11). The top ranked models have GDT_TS scores in the range of 20% to 80%. The number of proteins for which the best sampled models reach the 50% GDT_TS mark is 47. This means that, for about half of the cases, a high-quality homology model is generated and, for all cases, at least a topologically correct fold is produced. Although the SmotifCS method is unique in its approach, as it does not use any sequence information, we compared SmotifCS to CS-Rosetta on a randomly selected subset of 15 proteins. This comparison normally would not be completely relevant, as our approach does not use sequence information at all.
Smotifs are used with their backbone geometries, and we generated backbone-only models, whereas Rosetta relies on fragments collected from sequentially related structures. To establish comparable conditions, we purged from the Rosetta fragment database all homologous PDB templates that were detected for a target protein using HHblits [140] and PsiBlast [141]. This eliminated, on average, 0.82% of the three-residue fragments and 1.43% of the nine-residue fragments that CS-Rosetta could use in modeling. In a head-to-head comparison on the 15 randomly picked test cases, the two methods show competitive performance, with average GDT_TS scores of 52.07 ± 3.08 and 55.07 ± 3.16 (mean ± SEM) for SmotifCS and CS-Rosetta, respectively. CS-Rosetta outperforms SmotifCS in five cases and SmotifCS outperforms CS-Rosetta in seven cases, with both performing comparably in three cases. In terms of required computational time, CS-Rosetta takes about an order of magnitude longer to perform the calculations for the same proteins, and this difference increases rapidly with protein size. The fact that SmotifCS, due to the large chunks of supersecondary structures it uses, does not scale exponentially with increasing protein size makes it a promising approach to model larger proteins for which CS data can be collected.

Figure 11 Accuracy of models. Distribution of GDT_TS scores as a function of secondary structure assignment accuracy from CS data. The entire data set (102 proteins, black columns) is split into two: in 50 proteins, at least one secondary structure is incorrectly assigned (red), whereas, in 52 others, all Smotifs are captured correctly (green).

Individual modeling cases

One of the better performances for a fold with a mixed composition of strands and helices (which usually poses a more difficult challenge) is observed in the case of 1khm [142], with an overall GDT_TS score of 68.57 (Figure 12A).
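For readers unfamiliar with the two accuracy measures used in these modeling cases, the sketch below computes both from per-residue distances between a model and the experimental structure. It is a simplified illustration with hypothetical function names: it assumes the two structures have already been superposed once, whereas the standard GDT_TS procedure searches for the best superposition at each distance cutoff, so the value obtained here is only a lower-bound approximation.

```python
import math


def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Approximate GDT_TS (in %) from per-residue distances in Å.

    Averages the fraction of residues within 1, 2, 4, and 8 Å,
    using a single fixed superposition (a simplifying assumption).
    """
    if not distances:
        raise ValueError("no distances given")
    n = len(distances)
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


def rmsd(distances):
    """Root mean square deviation from per-residue distances in Å."""
    if not distances:
        raise ValueError("no distances given")
    return math.sqrt(sum(d * d for d in distances) / len(distances))
```

Because RMSD squares each deviation, a few badly placed loop residues can dominate it, whereas GDT_TS only counts how many residues fall within each cutoff. Computing `rmsd` over all residues and then over the core residues alone shows how much of the deviation comes from the loops.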
The general tendency that Smotifs are typically sampled from a range of unrelated folds underlines the algorithmic concept, in which large modular building blocks are identified that are shared between unrelated folds that do not show any overall homology. It has been observed in similar studies that proteins with long loops pose the most difficult challenge for modeling. We also showed that Smotifs with long loops are less well sampled and, in general, harder to model. 2jya (Figure 12B) has particularly long loops (the longest loop is 22 residues, and the total loop content is 72%). It is clear that, whereas the core of the protein made of regular secondary structures is well captured, the two long loops are poorly modeled, resulting in an overall GDT_TS score of 39.53. If we calculate the RMSD for the whole model, we obtain 9.46 Å; if we calculate the RMSD of the structured core only, it is 1.50 Å. Finally, we explored the modeling of a designed protein (PDB code 2kl8 [143]). This case presents no bias with respect to the other already known experimental structures or topologies for sampling Smotifs. For 2kl8, we obtained a high-quality model with a GDT_TS score of 50.33 (Figure 12C). By definition, since 2kl8 is a unique fold, all Smotifs used to build this model come from unrelated proteins; in addition, all five Smotifs come from five unique folds.

Figure 12 Examples of modeling cases. Structural superposition of the top ranked model (in pink) with the solution structures (in blue). PDB codes and overall GDT_TS scores (in brackets) are shown. The templates from which the Smotifs are sampled are shown in gray, with the Smotifs themselves colored according to their secondary structures. The PDB code, chain, and residues contributing to the Smotif template, the SCOP identifier of the template (if available), and the RMSD between the template and the native Smotif are shown.

Conclusions and future perspective

The folding of a protein follows several steps, of which one of the first is the transition from a random coil structure to the molten globule state. The exact role of supersecondary structures at this point is unclear, because secondary structures are not yet completely defined, although interactions between residues separated by several peptide bonds are already formed while undergoing a hydrophobic collapse. The folding of loops constitutes a mini-folding problem, where restraints on the stem flanks for the location of secondary structures play their role in the protein folding but not necessarily in the loop conformation. This is shown by the classification of Smotifs, as similar conformations are found in different types and dispositions of supersecondary structures, whereas the same geometry of a supersecondary structure can have more than one conformation. We have shown that the information on protein structures stored in the last 20 years has completed many of the standard loop conformations formed by <8 or 10 residues. Consequently, algorithms that model loops using knowledge-based approaches have become more and more successful. Loop modeling accuracies can be substantially improved by restraining the stem position of both flanking secondary structures and by focusing on specific residues within the loop sequences. A special application of fragments and suitably selected restraints is illustrated in a protein design example using Frag'r'Us, which compares favorably with other design approaches such as Rosetta. Another illustration of how to take advantage of the nearly complete Smotif classes is the application of hybrid ab initio protein fold prediction (SmotifCS). It can be applied to reconstruct the protein fold by pieces, as long as we have enough restraints to locate Smotifs in the protein scaffold.
The major challenge remains the accurate modeling of the conformations of long loops (with more than 10 residues), as their sampling is still limited in the current databases. We expect a growing role of Smotif classification in the modeling of proteins in three directions: (1) modeling of nonregular secondary structures, (2) ab initio and new-fold prediction, and (3) modeling of multimeric complexes. Within the topic of modeling of long loops, we also expect that the classification of Smotifs will help select new protein structure designs with modified loop conformations, whereas for new-fold prediction it will be crucial to find information or accurate mechanisms for the prediction of residue-residue contacts between sequentially distant residues in the protein. Furthermore, swapping loop conformations while restraining the flanking secondary tails will help increase the combinatorial possibilities of protein design.

Acknowledgments: This article is partially based on our previous publications [30–39, 41]. NFF acknowledges support from ACCIO, Generalitat of Catalunya under the TecnioSpring Program, project number TECSPR13-1-0008, REA grant agreement 600388. This work was supported by NIH grants GM094665 and GM096041 to AF. JB and BO acknowledge support from the Spanish Ministry of Economy under grant BIO2011-22568.

Author contributions: All authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.

Research funding: None declared.

Employment or leadership: None declared.

Honorarium: None declared.

Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication.

Bio-Algorithms and Med-Systems, de Gruyter

Smotifs as structural local descriptors of supersecondary elements: classification, completeness and applications

Publisher
de Gruyter
Copyright
Copyright © 2014 by the
ISSN
1895-9091
eISSN
1896-530X
DOI
10.1515/bams-2014-0016

Abstract

Protein structures are made up of periodic and aperiodic structural elements (i.e., α-helices, β-strands and loops). Despite the apparent lack of regular structure, loops have specific conformations and play a central role in the folding, dynamics, and function of proteins. In this article, we reviewed our previous works in the study of protein loops as local supersecondary structural motifs or Smotifs. We reexamined our works about the structural classification of loops (ArchDB) and its application to loop structure prediction (ArchPRED), including the assessment of the limits of knowledge-based loop structure prediction methods. We finalized this article by focusing on the modular nature of proteins and how the concept of Smotifs provides a convenient and practical approach to decompose proteins into strings of concatenated Smotifs and how this can be used in computational protein design and protein structure prediction.

Keywords: loop structure prediction; protein loops; protein secondary structures; protein structure design; protein structure prediction; protein supersecondary structures.

DOI 10.1515/bams-2014-0016
Received September 10, 2014; accepted October 15, 2014

Introduction

Proteins are the workhorses of cells, mediating most of the cellular processes that make life possible. The function of a given protein is dictated by its three-dimensional (3D) structure, which is composed of secondary structure elements. These can be assigned to two major groups: secondary structures with regular patterns of atomic interactions and a translational symmetry (α-helices and β-strands) and nonrepetitive and nonregular secondary structures (loops). Loops account for a substantial part of proteins, up to 47% [1] of all residues, which are generally exposed and play an important role in the function, structure, and dynamics of proteins. At the structural level, loops play an important role in the folding and dynamics of proteins.
Loops can act as hinges facilitating the folding/unfolding process [2–4], given their intrinsically flexible nature. It has also been shown that long-range loop-loop interactions are important in the folding of proteins [5]. The size of the loops has been related to the stability of proteins [6–9]; in the most extreme case, a single substitution in a loop can cause the destabilization of the entire protein [10]. Loops also play a central role in the function of proteins and in their associations with other biomolecules. Loops often define the functional specificity of protein families [11], binding pockets of substrates and cofactors (e.g., the P-loop [12] or the calcium-binding EF-Hand motifs [14]; Figure 1), and catalytic sites (e.g., serine proteases [15] or serine/threonine kinases [16]). Given their flexible nature, loops play an important role in the conformational changes of enzymes and often are responsible for the correct positioning of catalytic residues [17, 18] and for activation mechanisms [19–22]. Finally, loops are important in protein-protein [23–25] and protein-nucleic acid [26] associations and in related functions (e.g., serving as complementarity determining regions in immunoglobulins [27] or recognizing motifs in signaling pathways [28, 29]).

*Corresponding author: Narcis Fernandez-Fuentes, Structural Bioinformatics Group (GRIB), Department of Experimental and Life Sciences, University Pompeu Fabra, C. Doctor Aiguader, 88, Barcelona 08003, Catalonia, Spain, E-mail: narcis.fernandez@gmail.com
Jaume Bonet and Baldo Oliva: Structural Bioinformatics Group (GRIB), Department of Experimental and Life Sciences, University Pompeu Fabra, Barcelona, Catalonia, Spain
Andras Fiser: Albert Einstein College of Medicine, Department of Systems and Computational Biology, Bronx, NY, USA

Figure 1 EF-Hand motifs. Cartoon representation of calmodulin-IQ domain-containing protein G complex that contains four EF-Hand motifs [13]. Calcium ions are represented as black spheres.

In this article, we reviewed our and other related research works in the study of protein loops. We presented our results in the automatic structural classification of loops [30–35] and how this information can be used to predict the structure of protein loops [33, 36, 37]. We also discussed the limitations of the methodology [38, 39]. Finally, we described how our loop classification could be applied to protein structure prediction using a hybrid ab initio approach [40] and de novo structure-based protein design [41]. The common underlying concept that connects the different aspects of our research in this field is our definition of protein loops. We define loops as supersecondary structural motifs, or Smotifs, composed of the loop itself and its flanking regular secondary structures. The definition of loops as Smotifs allows us to define the local structural arrangement of the flanking regions of the loop or the geometry of the motif (Figure 2). In loop classification, geometry is a useful descriptor to group loops and speed up the clustering process. In loop structure prediction and protein design, the geometry allows an optimized hashing and look-up of loop conformations given a set of geometrical restraints. Finally, in protein structure prediction and design, Smotifs provide a convenient and coherent scheme to break down proteins into a sum of supersecondary elements.

Figure 2 Definition of Smotifs and geometry. Smotifs are defined by the loop (green) and the two flanking regular secondary structures: an α-helix (Nt secondary structure in blue) and a β-strand (Ct secondary structure in red). The geometry of a Smotif is defined by four internal coordinates: a distance and three angles: δ, θ, and ρ (from top left and clockwise).

Structural classification of protein loops

It might seem counterintuitive that loops, largely known for their intrinsic flexibility and structural diversity, would follow conserved structural patterns. Thus, it is not surprising that in the early days of structural biology loops remained largely unclassified and were sometimes believed to have random coil conformation. However, loops do follow structural patterns and can be classified.

Early work

The first types of loops to be classified were short, geometrically highly restricted loops (e.g., β-hairpins). One of the first classifications was proposed by Venkatachalam, who described several categories for four-residue-long turns [42]. The classification of β-turns was subsequently extended and revisited several times [43–45] to include reversed turns or γ-turns [46–48] and the subdivision of β-hairpins into different classes [49–52]. With the rapid increase of known protein structures and therefore of loop conformations, it was possible to uncover conserved structural patterns for longer and less geometrically restricted loops. Several studies identified commonly occurring structural patterns, often linked to sequences and atomic interaction patterns, for loops connecting not only β-turns but also β-strands and α-helices [53, 54]. Albeit important, these studies relied mainly on expert, manual classification; therefore, automatic approaches were needed to derive structural classifications to cope with the ever-increasing amount of structural data [55, 56]. This includes our work described in more detail in the next section [31, 32, 34, 35] together with other approaches to classify loops using structural alphabets [57, 58].

Table 1 Historical growth of the ArchDB classification.
ArchDB Method Source 1997 2004 2007 2007 2007 2014 2014 DS DS+rc DS+rc DS+rc DS+rc DS MCL 25 40 95 40 EC 40 40 PDB Smotifs Subclasses Classes 121 1496 4023 2550 2686 13,198 12,240 56 451 2142 1119 1338 5362 9728 233 3005 2310 12,665 5472 36,153 3640 16,957 2349 20,260 17,961 129,280 17,961 187,117 Automated classification of protein loops based on geometry and conformation (ArchDB) The geometry of Smotifs can be exploited for the purpose of classifying loops. The geometry of Smotifs is defined by four internal variables that determine the relative position between the two flanking secondary structures of the loop: the distance (d) and three angles: hoist (), packing (), and meridian () [34] (Figure 2). The geometry of the Smotif as defined by the aforementioned features, the type of flanking secondary structures (i.e., -, -, -, and -; the latter further divided between -hairpins and -link), and the length of the loop (i.e., the number of residues composing the loop region of the Smotif) are the classifying attributes of Smotifs. The first classification of Smotifs [34] was generated from 233 high-quality crystallographic X-ray structures obtained from the Protein Data Bank (PDB) [59] (resolution better than 2.5 Å) after removing redundancy at 25% sequence identity cutoff. For each of these, -helices and -strands were defined using DSSP [60], which produced a total of 3005 Smotifs. Subsequently, Smotifs were clustered according to their geometrical properties using a density search (DS) algorithm, which is a variant of the single-linkage clustering method [61]. Briefly, a network is built in which the nodes are the Smotifs and the edges are defined by the similarity of the classifying attributes of the Smotifs. The DS algorithm detects regions within that network with high density of Smotifs around a centroid defined by the classifying attributes of the Smotif. 
In this clustering, loops that belong to the same cluster have a length variation of ±1, similar flanking secondary structures, and similar (φ, ψ) angles (identified by a consensus conformation). Each cluster is required to have at least three Smotifs. The whole process generated 121 structural subclasses that were further grouped according to their Ramachandran map patterns into 56 classes (Table 1).

Table 1 lists the major ArchDB updates according to the different criteria applied in each new version. The method relates to the clustering algorithm used: DS, reclustering (rc), or MCL. The source refers to the criteria used to select the PDB entries from which the Smotifs were generated: a 25%, 40%, or 95% homology filter (25, 40, and 95) or enzymes (EC). PDB counts the number of PDB entries that fulfilled those criteria. Smotifs gives the number of classified Smotifs in that version of ArchDB. Subclasses and classes show the final number of subclasses and classes for each classification, respectively.

ArchDB is organized in a hierarchical fashion: the first two levels of the hierarchy correspond to the flanking secondary structures (type) and the length of the loop (length). The third level of the classification corresponds to classes, which are formed of subclasses with similar Ramachandran map patterns but different geometry. The lowest level of the hierarchy is the subclass, which is the structural cluster of Smotifs (i.e., Smotifs with the same loop conformation and geometry). This schema is used in all versions of ArchDB regardless of the particularities of the algorithm applied for clustering (Figure 3). The first update of ArchDB [31] made the database available online and introduced minor changes to the classification criteria, such as the maximum sequence identity between source structures (40%; ArchDB40), the upper limit on resolution (3.0 Å), and the minimum number of loops in a cluster (2).
A reclustering algorithm was applied after the first clustering to merge subclasses with shared loops, resulting in an optimized partition of the conformational space (Table 1). Furthermore, the update included references to gene ontology (GO) [62] and enzyme [63] annotations. The enzyme annotation was further exploited for the analysis of kinase superfamilies and their relation to Smotifs [30]. The number of subclasses and classes increased substantially, from 121 subclasses and 56 classes to 1496 and 451, respectively (Table 1). The third release of ArchDB included two new sets: ArchDB95, a redundant set, and ArchDB-EC, a classification derived from enzymes [32]. The new release included extensive functional annotations, cross-references to major biological databases, and an increase in the number of classified Smotifs, classes, and subclasses. The new database was used both for the modeling of loops (ArchDB40) and for the study of relevant structure-function features in loops (ArchDB95 and ArchDB-EC sets). It also included a comprehensive study of the statistical correlation between ArchDB subclasses and GO [62], EC [63], and SCOP [64] annotations as well as atomic interactions with cocrystallized cofactors and additional functional annotation extracted from the PDB [65].

Figure 3 Hierarchy of ArchDB: type, length, class, and subclasses. The first two letters of the code represent the type of Smotifs, and the following three digits represent the length, class, and subclass, respectively. Consensus Ramachandran and geometry patterns are shown between curly and square brackets, respectively.

The last update of ArchDB [35] is limited to the 40% sequence identity threshold, removes the ±1 length variation used in the DS classification, and includes a new classification method, the Markov clustering algorithm (MCL) [66].
The algorithm simulates a flow of information within the graph, enhancing the flow where the current is strong and hindering it where the current is weak. In MCL, the flow is controlled by expanding and inflating the stochastic Markov matrix that represents the graph. Contrary to the DS classification, the loop length is not taken into account in this classification. To make the method more computationally feasible, loops were grouped according to their lengths: short loops (length between 0 and 3), short-medium loops (length between 4 and 6), long-medium loops (length between 7 and 13), long loops (length between 14 and 20), and very long loops (length greater than 20). Each of these groups was classified separately (Table 1). In addition, in the latest release of ArchDB, five new Smotif types were included by considering the 3₁₀ helix as a regular secondary structure; previously, these helices were considered part of the loop regions, as α-helices were considered only if they exceeded five residues. Thus, the new Smotifs include 3₁₀ helix-3₁₀ helix, 3₁₀ helix-α-helix, 3₁₀ helix-β-strand, β-strand-3₁₀ helix, and α-helix-3₁₀ helix.

Historical perspective of classification of Smotifs

As shown in Table 1, the number of classified Smotifs, classes, and subclasses has been increasing across releases throughout the years. The changes in the criteria used to select, build, and classify the Smotifs limit our ability to truly explore the emergence of new Smotifs. To further understand how the coverage of Smotifs in the PDB has evolved, we mapped the last version of ArchDB [35] backwards in time, starting from 1972. We selected the same day of the year for all years and mapped the Smotifs over all protein structures existing in the PDB at that date (considering the official release day of the PDB entries). One of the main objectives was to evaluate not only how many Smotifs were known as the PDB grew but also how many of these were new (i.e., added at the current release). We relied on the MCL classification of ArchDB.
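As an illustration of the MCL step and the length grouping described earlier in this section, the following minimal Python sketch alternates expansion and inflation on a column-stochastic matrix until it stops changing. It is not the ArchDB implementation: the function names, the added self-loops, the inflation value of 2.0, and the way clusters are read from the rows of the converged matrix are all assumptions made for illustration.

```python
def loop_length_group(length):
    """Assign a loop length to one of the five groups listed in the text."""
    if length <= 3:
        return "short"
    if length <= 6:
        return "short-medium"
    if length <= 13:
        return "long-medium"
    if length <= 20:
        return "long"
    return "very long"


def normalize_columns(matrix):
    """Rescale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster a small graph given as a square adjacency (similarity) matrix."""
    n = len(adjacency)
    # Assumption: self-loops are added before normalizing, a common MCL convention.
    m = normalize_columns(
        [[1.0 if i == j else float(adjacency[i][j]) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iter):
        # Expansion: square the matrix, letting flow spread along longer paths.
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: raise entries to a power and renormalize, so that strong
        # flow is enhanced and weak flow is hindered.
        inflated = normalize_columns(
            [[x ** inflation for x in row] for row in expanded]
        )
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Assumption: each row with non-zero entries defines one cluster.
    clusters = {frozenset(j for j, x in enumerate(row) if x > 1e-6) for row in m}
    return sorted(sorted(c) for c in clusters if c)
```

In this sketch, loops would first be split with `loop_length_group`, and `mcl` would then be run separately on the similarity graph of each group, mirroring the separate classification of each length group described above.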
In each version of the PDB, we recognized a new subclass of Smotifs if at least three different entries were present. From that point forward, new Smotifs that belonged to the same subclass were not describing new 3D conformations but populating already known entries. After the year 2000, more than half of the new Smotifs could already be assigned to an existing subclass, a fraction that reached 70% in 2014 (Figure 4). It is clear that, despite the growing number of Smotif subclasses, there is a tendency toward a fully classified population of Smotifs, which has a clear impact on the improvement of predictions by knowledge-based methods.

Loop structure prediction

Whereas the regular structural elements of proteins (i.e., α-helices and β-strands) are usually well resolved by either X-ray crystallography or nuclear magnetic resonance (NMR), loops present a number of challenges that sometimes make it necessary to resort to computational tools to predict their structure. Moreover, comparative modeling, the most accurate and successful structure prediction approach (assuming a suitable template is available [67]), often requires loop modeling because there can be regions that are not present in the template(s) [68, 69], even at high levels of sequence identity between the template(s) and the target sequence. In addition, several structure-based approaches in protein and drug design require loop structure prediction to understand protein-ligand interactions if templates do not include cocrystallized ligands (e.g., [70]). There are two major approaches to predict the conformations of loops: knowledge-based (also known as database search) and ab initio or de novo methods. Combined methods, which usually apply a knowledge-based approach followed by an ab initio one, have also been proposed [71]. In this article, we focused on knowledge-based approaches.
A comprehensive review of loop prediction methods and the theory behind them can be found elsewhere [71].

Figure 4 Mapping of Smotifs in the PDB over the years. Data are shown for each Smotif type (as described in the text) and for all Smotifs (top left double graphic). For each graph, the top section represents the total number of Smotifs (in thousands) mapped over the PDB in that particular year (red), the number of those belonging to subclasses (blue), and the number of those that have at least one geometrically similar Smotif (orange). The bottom section represents the percentage of Smotifs clustered (blue) or with a similar one (orange). The pie charts represent the percentage of the different Smotifs, according to their flanking secondary structures, found in 1994, 2004, and 2014. HH, HG, GH, HE, EH, GG, EG, GE, BK, and BN Smotif types stand for α-helix-α-helix, α-helix-3₁₀ helix, 3₁₀ helix-α-helix, α-helix-β-strand, β-strand-α-helix, 3₁₀ helix-3₁₀ helix, β-strand-3₁₀ helix, 3₁₀ helix-β-strand, β-hairpins, and β-links, respectively.

Knowledge-based approaches for loop modeling

As the name suggests, knowledge-based prediction of loop conformations is done by searching among potentially thousands of conformations extracted from known protein structures. The target loop is flanked by so-called stem residues (i.e., the residues with known structure that precede and follow the loop but are not part of it). The search involves identifying candidate loops that fit the restraints imposed by the stem residues, followed by ranking based on geometric criteria and/or sequence similarity. Finally, selected loops are superposed and annealed onto the stem regions. Seminal research in knowledge-based loop structure prediction was initiated in the early 1980s by Greer [72], followed by Jones and Thirup [73]. These works represent the first attempts to use libraries of loops in loop structure prediction.
The applicability of such methods was limited by the incompleteness of the PDB [74], although remarkable success was achieved when predicting loops that follow canonical conformations such as the ones forming complementarity determining regions [75]. As the number of protein structures in the PDB grew, the number of known loop conformations increased; thus, the range of applicability and success of such approaches improved. The improvement of prediction methods ran in parallel with the development of the first loop structure classification databases [55, 56, 76, 77], including ArchDB (see above). The information included in these databases was used to derive profiles and sequence signatures from the structural clusters, which were then used to align the target loop sequences [77–79]. Work in this area includes our contribution to the prediction of the H3 loop of immunoglobulins [33] and the use of hidden Markov models derived from ArchDB classes and subclasses to predict loop structures [36]. The use of sequence profiles derived from structural clusters to predict the structure of loops was complemented by the use of large libraries of fragments. Developments in this area include our own work (ArchPRED, see below) and that of others, including Michalsky et al. [80], Heuser et al. [81], and Choi and Deane [82]. ArchPRED [37, 83] is described in more detail in the "ArchPRED, a knowledge-based approach" section. To optimize the selection of suitable fragments from a nonredundant set, a neural network classifier was devised that showed an improvement for long loops, although the coverage remained low [84]. Choi and Deane revisited the loop structure prediction performance of a knowledge-based method, FREAD, showing a substantial increase in the accuracy of predictions for loops up to 20 residues long with a coverage of up to 50% [82]. More recently, a combination of fragment assembly and analytical loop closure has shown great potential even for the so-called superlong loops [85].
Limitations of knowledge-based approaches: completeness of the database

Knowledge-based approaches aim to capitalize on existing main-chain loop conformations; thus, the major limitation would be the lack of suitable fragments for a given target sequence or, in other words, the completeness of the database (i.e., a method would fail if suitable conformations are not present in the database). This question was first studied by Fidelis et al. [86], who explored the frequency distribution of repeat conformations and the clustering of structurally related fragments. They concluded that knowledge-based approaches were only suitable for short loops of up to four residues. A similar study by Lessel and Schomburg [87] largely confirmed these observations. With the exponential growth in the number of protein structures, largely due to the automation and improvement of X-ray and NMR techniques within the framework of large-scale structural genomics initiatives [88, 89], the sampling of loop conformations has notably improved. The first evidence of this was provided by Du et al. [90], who showed that, even for loops up to 15 residues long, there was a high probability (>90%) of finding a suitable, nonhomologous structural fragment [i.e., <2 Å root mean square deviation (RMSD)]. We addressed the question of database completeness in a more comprehensive manner by studying the fraction of loops extracted from all known protein sequences that are indeed represented by loops from known protein structures [38]. The structures of loops were clustered after an all-versus-all comparison, and sequence identity cutoffs for different loop lengths were identified to ensure the limits of structural similarity (e.g., two loops with a sequence identity of 50% are expected to have an RMSD smaller than 2 Å; Figure 5).
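The two quantities related in this analysis, the sequence identity between two loops and their RMSD, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not taken from [38]: identity is computed position by position on ungapped loops of equal length, the coordinates are assumed to be already superposed (no fitting is performed), and all function names are hypothetical.

```python
import math


def sequence_identity(seq_a, seq_b):
    """Percentage of identical positions between two equal-length loop sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("loops must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def rmsd(coords_a, coords_b):
    """RMSD (in the units of the input) between two pre-superposed atom lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (xa - xb) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for xa, xb in zip(point_a, point_b)
    )
    return math.sqrt(total / len(coords_a))


def is_suitable_fragment(coords_a, coords_b, cutoff=2.0):
    """Apply the <2 Å RMSD criterion quoted in the text."""
    return rmsd(coords_a, coords_b) < cutoff
```

Plotting `rmsd` against `sequence_identity` for all pairs of loops of a given length would give the kind of relationship summarized in Figure 5.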
As expected, the required sequence similarity cutoff was more demanding for longer loops, with a generally sharp transition between low and high RMSD values at 45% to 55% sequence identity for all loop lengths, indicating that 50% sequence identity guaranteed structural similarity (Figure 5), with very few notable exceptions (i.e., cases where high sequence identity was observed between dissimilar structures) [38, 91, 92].

Figure 5 Database completeness. The relationship between structural similarity (RMSD, Å) and sequence identity (%) for loops of length 8–14, shown in red, blue, orange, green, dark blue, brown, and black, respectively. The inset shows the cumulative frequency distribution of loops that can be matched up between all known sequences and the available structural conformations at a given sequence identity. Adapted from [38].

Once the structural similarity cutoffs were correlated with sequence similarity, the sequences of loops extracted from protein sequences were compared with the sequences of loops extracted from protein structures. The results showed that loops of up to 10 residues could be associated with a loop of known structure (i.e., they shared at least 50% sequence identity), and only 20% and 10% of loops of sizes 13 and 14 did not have a match at this sequence identity level (Figure 5, inset). The study was repeated by generating a backdated database of loop structures to study the effect of the growth of the PDB and the saturation of conformations, showing that, from about the year 2001, no unique conformations were deposited in the PDB for loops up to 12 residues long [38].
Indeed, our study demonstrated that the limitations of knowledge-based approaches were not related to the sampling of suitable conformations in the library of Smotifs but rather to the successful scoring and identification of suitable Smotifs. These findings were corroborated in a recent study by Choi and Deane [82] and, most importantly, by the data presented in the historical review of the classification of Smotifs in ArchDB (see the "Historical perspective of classification of Smotifs" section).

ArchPRED, a knowledge-based approach

ArchPRED is an example of a knowledge-based loop structure prediction methodology [37]. ArchPRED relies on a library of Smotifs and features a filtering, selection, and ranking algorithm to select the most suitable conformation for a given target loop sequence (Figure 6). The library of Smotifs is organized by loop type (as defined by the flanking regular secondary structures; i.e., α-α, α-β, β-α, and β-β loops), size, and geometry. As described previously, the geometry of Smotifs is determined by four internal variables: the distance (d) and three angles, hoist (δ), packing (θ), and meridian (ρ) [34]. The selection of Smotifs from the library is based on the geometrical restraints imposed by the secondary structures bracing the missing loop (i.e., Smotifs are selected if their geometry falls within the range of tolerance: 2 Å for d and 30°, 30°, and 45° for the δ, θ, and ρ angles, respectively). The clear advantage of using the geometry rather than distances and/or structural fitting of stem residues is that the search space is reduced dramatically and the geometry-based filtering is very fast. In 50 randomly chosen examples, a distance-matching criterion selected 1534, 683, and 430 suitable Smotifs for loops of sizes 4, 8, and 12, respectively. After geometrical filtering, only 181, 85, and 25 Smotifs were selected, respectively.
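To make the size of this reduction concrete, the counts quoted above can be turned into the percentage of candidates retained and the fold reduction of the search space. The short Python sketch below does only this arithmetic; the counts come from the text, whereas the function and variable names are illustrative.

```python
# Counts quoted in the text for loops of sizes 4, 8, and 12.
SELECTED_BY_DISTANCE = {4: 1534, 8: 683, 12: 430}
SELECTED_BY_GEOMETRY = {4: 181, 8: 85, 12: 25}


def reduction_table(before, after):
    """Return (loop size, % of candidates retained, fold reduction) per loop size."""
    rows = []
    for size in sorted(before):
        retained = 100.0 * after[size] / before[size]
        fold = before[size] / after[size]
        rows.append((size, round(retained, 1), round(fold, 1)))
    return rows


for size, retained, fold in reduction_table(SELECTED_BY_DISTANCE, SELECTED_BY_GEOMETRY):
    print(f"loop size {size:2d}: {retained:4.1f}% retained, {fold:4.1f}-fold smaller search space")
```

With these numbers, geometrical filtering keeps roughly 6% to 12% of the candidates selected by distance matching, that is, an 8-fold to 17-fold smaller search space.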
More importantly, this selection does not imply that good Smotifs are discarded; comparing the average RMSD of the best fragment between loops selected by distances or by geometry, the differences were <0.05, 0.09, and 0.11 Å for loops of sizes 4, 8, and 12, respectively. Subsequently, a filtering step discards unsuitable Smotifs based on the structural matching of stem residues (RMSDstems) and on unfavorable interactions between Smotifs and the new protein environment (i.e., steric clashes). RMSDstems was shown to correlate with the quality of prediction for both filtering and scoring purposes [80, 81]. However, we found that the correlation is less pronounced in the case of long loops (i.e., above 8 residues); therefore, ArchPRED uses a dynamic cutoff as a function of loop size. Finally, the filtering step evaluates the fit of Smotifs in the new environment. This aspect is particularly important, as the native structural environment of Smotifs could be very different from the one in the target protein. The last step in the prediction process is the ranking of the remaining Smotifs. The scoring function is composed of a sequence similarity term (computed using the K3 substitution matrix [94]) and an amino acid (φ, ψ) dihedral angle propensity [95] term. Because sequence and propensity scores have different dimensions, they are converted into dimensionless statistical Z-scores, which are obtained in reference to randomly generated sequences and (φ, ψ) dihedral angles. The scoring function is then a composite Z-score combining the sequence and (φ, ψ) dihedral angle propensity Z-scores.

Figure 6 Overview of the ArchPRED prediction pipeline. Suitable Smotifs are selected based on the geometry of the missing loop, upon which two filtering steps discard unsuitable Smotifs. The remaining Smotifs are ranked based on a composite scoring function. Adapted from [93].

ArchPRED was benchmarked on a nonredundant data set of protein loops of lengths ranging from 4 to 14 residues by comparing it to a competitive ab initio method (ModLoop) [96] and to the theoretical minimum RMSD (i.e., selecting the best Smotif candidate). In general, ArchPRED was competitive, and the library of loops contained suitable Smotifs for loops of all lengths (Figure 7). The latter observation reinforces earlier studies about the completeness of conformations in current databases, as discussed in the previous section. As the number of protein structures and loop conformations continues to grow, so do the accuracy and applicability of ArchPRED. Finally, we used the benchmarking of ArchPRED and the relationship between Z-scores and accuracy and coverage to identify confidence Z-score thresholds for loops of all sizes (Table 2).

ArchPRED is available as a web application [83]. Users are required to provide the atomic coordinates of the query structure and define the location, sequence, and flanking secondary structures of the missing loop. The number of returned predictions and the Z-score threshold are tunable parameters. As an optional postprediction optimization, users can select whether to graft the predicted conformation into the protein using Modeller [97]. The grafting of the loop into the structure involves (1) the optimization of the side chains of predicted loops and (2) a limited minimization to anneal the stems in the protein framework. The server is accessible at http://www.bioinsilico.org/ARCHPRED.

Figure 7 Prediction accuracy (RMSD, Å) as a function of loop size. For different loop sizes, the predicted loops were compared with the native structure, and the average RMSD values were computed.
The black and green lines represent the accuracy of the theoretical limit (i.e., selecting the best Smotifs in the library) and the random prediction (i.e., random selection of Smotifs with the same loop size), respectively. The blue and red lines depict ArchPRED and ModLoop accuracy, respectively. Adapted from [37].

Table 2 Accuracy of prediction and coverage for different loop lengths.

Loop length          4     5     6     7     8     9     10    11    12    13    14
Z-score (a)          1     1     1     1     2     3     3     3     4     4     4
Average RMSD (b), Å  0.22  0.15  0.34  0.93  1.38  1.93  2.11  2.30  2.47  2.85  2.88
Coverage (c), %      98    96    98    94    78    60    46    44    28    4     6

(a) Z-score cutoff to get an accuracy better than or equal to ModLoop. (b) Average RMSD for the given Z-score cutoff. (c) Percentage of query loops that are modelable (i.e., a suitable Smotif can be found) at the given Z-score threshold. Adapted from [93].

Smotifs as building blocks in structure-based protein design

Structure-based protein design is a rational approach to alter the properties of proteins by employing structural information [41]. Such changes aim to improve protein stability (e.g., thermostability), alter or modify protein function, or modify the binding to substrates or other biomolecules (see the reviews in [98, 99] and the references therein). Changes in the properties of proteins can be achieved by means ranging from changing individual residues (i.e., point mutations) to remodeling entire short fragments. Initial work in the field focused on single amino acid changes with no or very limited backbone flexibility to avoid the complexity and the limitations of modeling a flexible main chain (e.g., [100]). However, the remodeling of short regions of proteins, or de novo design, which usually takes place in the loops, presents a number of advantages, as illustrated in recent publications [70, 101]. The remodeling of the protein backbone seeks to accommodate specific interactions and restraints (e.g., catalytic amino acids) [101] or to create new structural elements to accommodate certain functionalities [102]. The main hurdle, however, is the inherent complexity that makes the systematic sampling of large insertions very difficult. There are, however, some examples described in the literature (reviewed in [103]) based on the recombination of modular, short fragments to diversify or graft novel functions and to improve a desired feature of proteins (e.g., thermostability). We can mention, for example, the reconstruction of a beta propeller via the assembly of short fragments [104] or the design of a highly stable protein [105]. Our definition of Smotifs is particularly suitable to account for the modularity of protein segments, so we developed Frag'r'Us, a method designed to sample main-chain conformations by means of Smotifs [41].

Saturation of Smotifs

In the previous sections, we reviewed our own and other works addressing the saturation of loop conformations, the completeness of the database, and the quality of loop structure predictions. However, our definition of loops as Smotifs also includes the structural arrangement of the flanking secondary structures, or geometry. Therefore, in a different development, we studied the sampling of the geometrical space of Smotifs [39]. To this end, we studied the emergence of different geometries across backdated sets of protein structures. We found that, as early as 1997, all geometrical bins were fully sampled for all Smotif types (i.e., α-α, α-β, β-α, and β-β) (Figure 8). In addition, we found that, as early as the third edition of the Critical Assessment of Structure Prediction meetings [106], any new protein submitted in the new-fold category could have been reconstructed using Smotifs from known folds. Likewise, proteins classified as new folds in releases 1.75 and 1.73 of the SCOP database [64] did not contain Smotifs that had not been observed previously. Together, the study of Fernandez-Fuentes et al. [39] and our previous studies [38] concluded that the sampling of both loop conformations and geometries of Smotifs is rather comprehensive and thus opened possible new avenues in the fields of protein structure prediction and computational design.

Figure 8 Saturation of geometry of Smotifs. Cumulative frequency distribution as a function of time (years 1976 to 1999) for α-α (red), α-β (blue), β-α (green), and β-β (black) Smotifs. Adapted from [39].

Structure-based protein design using Smotifs (Frag'r'Us)

Loops sharing the same geometry can have different conformations; in other words, the gap between two fixed flanking secondary structures can be spanned by loops having different conformations. If we assume that the sampling of loop conformations and geometries in current databases is rather comprehensive, then the question is whether our collection of Smotifs can be used to sample the conformation of short regions amenable to de novo protein design. To answer this question, we developed Frag'r'Us, a tool designed to sample loop conformations in a modular fashion using the geometrical constraints of Smotifs [41]. The sampling of the conformation is done at the geometry level. Upon defining the geometry of the Smotif of interest (i.e., the region to be remodeled), the library of Smotifs is queried using the geometrical values of the query Smotif. In its current implementation, Frag'r'Us considers that two Smotifs have a similar geometry if the differences between their corresponding geometry parameters (d, δ, θ, and ρ) are less than or equal to 0.5 Å and 5°, 5°, and 10°, respectively.
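The geometry comparison just described can be sketched as a simple tolerance check. The Python code below is an illustration rather than the Frag'r'Us implementation: the class and function names are hypothetical, and treating the angles as periodic (so that 355° and 3° are 8° apart) is an assumption, since the text does not state how angular differences are computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmotifGeometry:
    d: float         # distance between flanking secondary structures, in Å
    hoist: float     # δ, in degrees
    packing: float   # θ, in degrees
    meridian: float  # ρ, in degrees


# Thresholds quoted in the text for Frag'r'Us: 0.5 Å, 5°, 5°, and 10°.
FRAGRUS_TOLERANCE = SmotifGeometry(d=0.5, hoist=5.0, packing=5.0, meridian=10.0)


def angular_difference(a, b):
    """Smallest absolute difference between two angles (assumed periodic, 360°)."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def similar_geometry(query, candidate, tol=FRAGRUS_TOLERANCE):
    """True if every geometry parameter differs by no more than its tolerance."""
    return (
        abs(query.d - candidate.d) <= tol.d
        and angular_difference(query.hoist, candidate.hoist) <= tol.hoist
        and angular_difference(query.packing, candidate.packing) <= tol.packing
        and angular_difference(query.meridian, candidate.meridian) <= tol.meridian
    )


def query_library(query, library, tol=FRAGRUS_TOLERANCE):
    """Return the names of library Smotifs whose geometry matches the query."""
    return [name for name, geometry in library.items()
            if similar_geometry(query, geometry, tol)]
```

The same check could be run with the looser ArchPRED tolerances quoted earlier in the article by passing `tol=SmotifGeometry(2.0, 30.0, 30.0, 45.0)`.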
As a result of this comparison, a list of Smotifs with equivalent geometry but different conformation is generated. These Smotifs constitute the starting point of the downstream computational design process. Frag'r'Us was benchmarked against a data set of pairs of proteins (wild-type and redesigned), where the structures of both the initial protein prior to the design (i.e., the scaffold) and the designed protein were known. The test set included a wide range of examples of de novo protein design, including a Diels-Alderase [70], a human guanine deaminase [101], the catalytic loops in (βα)8-barrel scaffolds [107], two retro-aldol enzymes [108, 109], a triosephosphate isomerase [110], and the motifs to target antibody b12 [111]. A detailed description of each individual case, the regions, and information about the scaffold and designed proteins can be found in the supplementary material of our publication describing this work [41]. The remodeled regions of the designed proteins were compared with the loop conformations of Smotifs matching the geometry of the flanking secondary structures. In all cases except one, the conformation of Smotifs extracted from backdated libraries could recapitulate the conformation of the designed fragment. A particularly challenging case was the remodeling of a 24-residue-long insertion flanked by a β-hairpin [70]. In the original work, the remodeling of this region was achieved by combining Rosetta [112] and crowdsourcing (the Foldit game) [113]. Besides the size of the fragment, the remodeled region contained a helix-turn-helix motif. Even in this case, Frag'r'Us was able to provide a satisfactory Smotif whose loop conformation closely resembled the conformation of the designed version by using just the geometrical restraints (Figure 9).

Protein structure prediction using Smotifs and experimental data

Smotifs can be used for protein structure prediction following a fragment assembly approach.
The underlying hypothesis is that patterns of indirect structural data characterizing the connecting loop region in a Smotif will determine the relative orientation of the flanking secondary structures and thus will be informative for the selection of an entire supersecondary structure element (Smotif). This hypothesis arises from the success of the inverse application (see above), in which the conformations of loops were successfully modeled by the fit of the corresponding Smotif, specifically the flanking secondary structure residues, in the relevant structural environments in the template and target structures [36, 37].

Figure 9 Example of Frag'r'Us prediction. Ribbon representation of the superposed structures of the scaffold (PDB code 3i1c, chain A) and engineered (PDB code 3u0s, chain A) proteins and of the best matching candidate Smotif (PDB code 2p3p, chain A, residues 242–271). A and B indicate the start and end points of the remodeled loop. Green, blue, and red depict the native protein, the engineered protein, and the Smotif, respectively. Panel labels: Initial-final, Final-frags'r'Us, and Loop 1. Adapted from [41].

One possible avenue to make a significant advance in protein structure modeling is to incorporate indirect structural data from high-throughput experiments [114]. A growing number of methods incorporate a variety of easily obtainable NMR data as restraints to guide protein structure modeling or simulation. Many of these methods focus on backbone NMR chemical shift (CS) assignments. Obtaining CS is a necessary and straightforward first step in the classic NMR structure determination process. Within the framework of developing the TALOS program, it was shown that CS data can guide the selection of tripeptide segments with similar conformations and provide preferences/restraints for main-chain dihedral angles [115, 116].
The highly successful Rosetta ab initio fragment assembly program [117] combines CS data and sparse NOE restraints (approximately one per residue) to steer the selection and filtering of three- and nine-residue fragments, besides taking into account sequence similarity measures of these fragments [118]. In a similar approach by Gong et al. [119], experimentally determined CS and sequence patterns were used to search the protein database for consecutively overlapping six-residue-long backbone fragments, which were then "stitched" together using Monte Carlo simulation. In more recent applications, CS-Rosetta was shown to be successful in delivering high-quality models when using CS data in combination with sequence information [120, 121]. Similar ideas are implemented in the CHESHIRE method, which first predicts the secondary structures of three- and nine-residue fragments using CS data and then combines these fragments into larger ones by matching sequence information, secondary structures, and CS patterns [122]. NMR CS data have also been converted into forces in molecular dynamics simulations and successfully used to fold short polypeptide chains or to refine partially unfolded structures [123, 124]. An important advance for that work was the development of the CamShift method [125], which quickly predicts CS values from structures. Besides CamShift, several other approaches are available that calculate theoretical CS values for a given structure, such as SHIFTX2 [126], SPARTA+ [127], and PROSHIFT [128]. GENMR [129] is a very fast modeling implementation that combines homology models with CS and/or NOE data. The component of GENMR that relies on structure calculation using CS and sequence information without NOE data is CS23D [130]. CS23D incorporates various other methods, such as threading, homology modeling, or small fragment assembly using the Rosetta program.
SmotifCS algorithm for hybrid modeling of protein structures

As Smotifs are fragments defined by their backbone only, a relation needs to be established between a target sequence and the backbone-only library of Smotifs. One possible way to do this is hybrid modeling, where a limited amount of easily obtainable, indirect experimental data is used to select Smotifs for structure modeling. We used CS assignments from NMR studies of the target protein and developed a novel algorithm that predicts the structure of a protein by combining Smotifs and CS information: SmotifCS [40]. A schematic representation of the SmotifCS algorithm, described in more detail below, is shown in Figure 10. In this application, we first precalculate the theoretical CS of all backbone atoms (N, HN, Hα, Cα, Cβ, C) for all Smotifs in our library using SPARTA+ [127]. Next, the structure prediction algorithm relies on another precalculated database that contains the relative weights of structural information conveyed by a given normalized CS. Predicted CS values aggregated from all library Smotifs were divided into groups based on atom type, residue type, and preceding residue type, resulting in 6 × 20 × 20 = 2400 categories. For each category, CS values were normalized by subtracting the random coil value. The relative weight of structural information conveyed by a given CS (categorized by atom type, residue type, and preceding residue type) is calculated as the difference between the statistical propensities of the "most favored" and "second most favored" secondary structural conformations.

Figure 10 Overview of the SmotifCS prediction pipeline. Smotifs and NMR data are combined to select the combination of Smotifs to predict the structure of the protein [40].
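The categorization and weighting of chemical shifts described above can be illustrated with a minimal Python sketch. It is not the SmotifCS implementation: the atom labels, the three-state secondary structure alphabet in the usage example, and the definition of propensity as the frequency of a state within a CS bin divided by its background frequency are assumptions made here for illustration, and all function names are hypothetical.

```python
from itertools import product

# Six backbone atom types and the 20 standard amino acids (one-letter codes).
ATOMS = ("N", "HN", "HA", "CA", "CB", "C")
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# One category per (atom type, residue type, preceding residue type).
CATEGORIES = list(product(ATOMS, AMINO_ACIDS, AMINO_ACIDS))


def secondary_shift(cs, random_coil):
    """Normalize a chemical shift by subtracting the random coil value."""
    return cs - random_coil


def propensities(bin_counts, background_counts):
    """Propensity of each conformational state within one CS bin.

    Assumption: propensity = (state frequency in the bin) /
    (state frequency in the whole category).
    """
    bin_total = sum(bin_counts.values())
    bg_total = sum(background_counts.values())
    if bin_total == 0 or bg_total == 0:
        raise ValueError("counts must not be empty")
    return {
        state: (bin_counts.get(state, 0) / bin_total)
        / (background_counts[state] / bg_total)
        for state in background_counts
    }


def information_weight(props):
    """Difference between the most and second most favored propensities."""
    ranked = sorted(props.values(), reverse=True)
    return ranked[0] - ranked[1]
```

In this sketch, a CS bin dominated by a single conformational state receives a large weight, whereas a bin in which two states are almost equally likely receives a weight close to zero, so that uninformative shifts contribute little to the subsequent matching.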
To identify the relative orientation of regular secondary structures within a Smotif, we analyzed the CS patterns of the loop segments and the three flanking secondary structure residues on each side of the loop. To select candidate Smotifs from our library, we compared the experimental CS of each query Smotif with the theoretical CS of the available Smotifs in our library. Backbone (φ, ψ) angles predicted by TALOS+ [116] (also obtained from the experimental CS) were used to assign each loop residue of the query Smotif to one of the 11 possible locations within the Ramachandran map [37]. The string of Ramachandran map sublocations constituted the "fingerprint" of a loop segment, which was compared with similar fingerprints derived from the Smotifs in our library. The best matching Smotif fingerprints were then ranked by their CS match "score", calculated as the sum of weighted squared differences between the CS of the query and library Smotifs. After a suitable set of candidates has been selected for each putative Smotif in the query structure, a full enumeration of the structures is carried out by joining every possible combination of these Smotifs. The lengths of the secondary structures of the sampled Smotifs are extended or shortened as necessary to fit the query sequence. In the process of joining Smotifs, a limited number of steric clashes is allowed. The candidate structures resulting from the full enumeration are evaluated using a linear scoring function with the following components: radius of gyration using Cα carbons, a distance-dependent statistical potential function [131-133], an implicit solvation potential [134], and a knowledge-based long-range backbone hydrogen-bonding potential [135]. All components were converted into statistical Z-scores before combining them with weights optimized on a set of decoy structures. The best 200 structures from this ranking were relaxed using Modeller [136] to resolve steric clashes and maintain stereochemistry.
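The selection, enumeration, and scoring steps described above can be sketched in Python as follows. This is a simplified illustration under several assumptions that are not taken from the SmotifCS implementation: fingerprints are compared by exact string identity, chemical shifts are stored in dictionaries keyed by hypothetical (position, atom) pairs, a weight of 1.0 is used when no weight is available, and Z-scores are computed over the enumerated models themselves. The geometric joining of Smotifs and the individual energy terms are not modeled.

```python
from itertools import product
from statistics import mean, pstdev


def cs_match_score(query_cs, library_cs, weights):
    """Sum of weighted squared CS differences over shared (position, atom) keys."""
    shared = query_cs.keys() & library_cs.keys()
    return sum(
        weights.get(key, 1.0) * (query_cs[key] - library_cs[key]) ** 2
        for key in shared
    )


def rank_candidates(query_fp, query_cs, library, weights, top_n=3):
    """Keep library Smotifs whose fingerprint equals the query fingerprint
    (a simplifying assumption) and rank them by ascending CS match score."""
    matches = [s for s in library if s["fingerprint"] == query_fp]
    matches.sort(key=lambda s: cs_match_score(query_cs, s["cs"], weights))
    return matches[:top_n]


def enumerate_models(candidates_per_smotif):
    """Full enumeration: one candidate per query Smotif, every combination."""
    return list(product(*candidates_per_smotif))


def z_scores(values):
    """Convert raw values of one scoring term into Z-scores."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]


def combined_scores(term_table, term_weights):
    """Linear combination of Z-scored terms.

    term_table maps a term name to a list with one raw value per model.
    """
    n_models = len(next(iter(term_table.values())))
    z = {name: z_scores(vals) for name, vals in term_table.items()}
    return [
        sum(term_weights[name] * z[name][i] for name in term_table)
        for i in range(n_models)
    ]
```

The number of enumerated models is the product of the numbers of candidates retained per Smotif, which is why the sketch caps each candidate list with `top_n`.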
The accuracy of the final models was evaluated using RMSD and GDT_TS scores [137] with respect to the experimental solution structure. The method was tested on a data set of 102 proteins obtained from the Biological Magnetic Resonance Data Bank [138]. The test set is currently the largest nonredundant data set of experimentally known structures for which CS data are publicly available and in which every structure represents a different SCOP fold category [139]. The results are presented as a distribution of GDT_TS scores [137] for the superposed backbone atoms of the experimental structure compared with the top ranked model over the entire length of the protein (Figure 11). The top ranked models have GDT_TS scores in the range of 20% to 80%. The number of proteins for which the best sampled models have GDT_TS ≥ 50% is 47. This means that, for about half of the cases, a high-quality homology model is generated and, for all cases, at least a topologically correct fold is produced. Although the SmotifCS method is unique in its approach, as it does not use any sequence information, we compared SmotifCS with CS-Rosetta on a randomly selected subset of 15 proteins. This comparison would normally not be completely relevant: Smotifs are used with their backbone geometries only, and we generated backbone-only models, whereas Rosetta relies on fragments collected from sequentially related structures. To establish comparable conditions, we purged from the Rosetta fragment database all homologous PDB templates that were detected for a target protein using HHblits [140] and PsiBlast [141]. This eliminated, on average, 0.82% of the three-residue fragments and 1.43% of the nine-residue fragments that CS-Rosetta could use in modeling. In a head-to-head comparison on the 15 randomly picked test cases, the two methods show competitive performance, with an average GDT_TS of 52.07 ± 3.08 and 55.07 ± 3.16 (SEM) for SmotifCS and CS-Rosetta, respectively.
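For readers unfamiliar with the accuracy measures quoted above, the following Python sketch illustrates how a GDT_TS value and a mean with its standard error (SEM) can be computed. It is a simplification: the model is assumed to be already superposed on the experimental structure, whereas the standard GDT_TS procedure searches for the superposition that maximizes the number of residues within each distance cutoff. The function names are hypothetical, and the coordinates in any usage are placeholders rather than data from this study.

```python
import math
from statistics import mean, stdev


def gdt_ts(model_ca, native_ca, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average percentage of C-alpha atoms within 1, 2, 4, and 8 angstroms.

    Assumes both coordinate lists are in the same residue order and that
    the model has already been superposed on the native structure.
    """
    if not model_ca or len(model_ca) != len(native_ca):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    distances = [math.dist(m, n) for m, n in zip(model_ca, native_ca)]
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances) for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)


def mean_and_sem(values):
    """Mean and standard error of the mean (sample standard deviation / sqrt(n))."""
    return mean(values), stdev(values) / math.sqrt(len(values))
```

Because the sketch skips the superposition search, it can only underestimate the GDT_TS that a full implementation would report for the same pair of structures.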
CS-Rosetta outperforms SmotifCS in five cases and SmotifCS outperforms CS-Rosetta in seven cases, with both performing comparably in three cases. In terms of required computational time, CS-Rosetta takes about an order of magnitude longer to perform the calculations for the same proteins, and this difference increases rapidly with protein size. The fact that SmotifCS, owing to the large chunks of supersecondary structure it uses, does not scale exponentially with increasing protein size makes it a promising approach to model larger proteins for which CS data can be collected.

Figure 11 Accuracy of models. Distribution of GDT_TS scores as a function of secondary structure assignment accuracy from CS data. The entire data set (102 proteins, black columns) is split into two: in 50 proteins, at least one secondary structure is incorrectly assigned (red), whereas, in the 52 others, all Smotifs are captured correctly (green).

Individual modeling cases

One of the better performances for a fold with a mixed composition of strands and helices (which usually poses a more difficult challenge) is observed in the case of 1khm [142], with an overall GDT_TS score of 68.57 (Figure 12A). The general tendency that Smotifs are typically sampled from a range of unrelated folds underlines the algorithmic concept, in which large modular building blocks are identified that are shared between unrelated folds that do not show any overall homology. It has been observed in similar studies that proteins with long loops pose the most difficult challenge for modeling. We also showed that Smotifs with long loops are less well sampled and, in general, harder to model. 2jya (Figure 12B) has particularly long loops (the longest loop is 22 residues long, and the total loop content is 72%).
It is clear that, whereas the core of the protein made of regular secondary structures is well captured, the two long loops are poorly modeled, resulting in an overall GDT_TS score of 39.53. The RMSD calculated for the whole model is 9.46 Å, whereas the RMSD of the structured core only is 1.50 Å. Finally, we explored the modeling of a designed protein (PDB code 2kl8 [143]). This case presents no bias with respect to the other already known experimental structures or topologies for sampling Smotifs. For 2kl8, we obtained a high-quality model with a GDT_TS score of 50.33 (Figure 12C). By definition, since 2kl8 is a unique fold, all Smotifs used to build this model come from unrelated proteins; in addition, all five Smotifs come from five unique folds.

Figure 12 Examples of modeling cases. Structural superposition of the top ranked model (in pink) with the solution structures (in blue). PDB codes and overall GDT_TS score (in brackets) are shown. The templates from which the Smotifs are sampled are shown in gray, with the Smotifs themselves colored according to their secondary structures. The PDB code, chain, and residues contributing to the Smotif template, the SCOP identifier of the template (if available), and the RMSD between the template and the native Smotif are shown.

Conclusions and future perspective

The folding of a protein follows several steps, of which one of the first is the transition from a random coil structure to the molten globule state. The exact role of supersecondary structures at this point is unclear, because secondary structures are not yet completely defined, although interactions between residues separated by several peptide bonds are already formed while undergoing a hydrophobic collapse.
The folding of loops constitutes a mini-folding problem, where restraints on the stem flanks for the location of secondary structures play their role in protein folding but not necessarily in the loop conformation. This is shown by the classification of Smotifs, as similar conformations are found in different types and dispositions of supersecondary structures, whereas the same geometry of a supersecondary structure can have more than one conformation. We have shown that the information on protein structures stored in the last 20 years has completed many of the standard loop conformations formed by <8 or 10 residues. Consequently, algorithms that model loops using knowledge-based approaches have become more and more successful. Loop modeling accuracies can be substantially improved by restraining the stem position of both flanking secondary structures and by focusing on specific residues within the loop sequences. A special application of fragments and suitably selected restraints is illustrated in a protein design example using Frag'r'Us, which compares favorably with other design approaches such as Rosetta. Another illustration of how to take advantage of the nearly complete Smotif classes is the application to hybrid ab initio protein fold prediction (SmotifCS). It can be applied to reconstruct the protein fold piece by piece, as long as we have enough restraints to locate Smotifs in the protein scaffold. The major challenge remains the accurate modeling of the conformations of long loops (with more than 10 residues), as their sampling is still limited in the current databases. We expect a growing role of Smotif classification in the modeling of proteins in three directions: (1) modeling of nonregular secondary structures, (2) ab initio and new-fold prediction, and (3) modeling of multimeric complexes.
Within the topic of modeling of long loops, we also expect that the classification of Smotifs will help select new protein structure designs with modified loop conformations, whereas for new-fold prediction it will be crucial to find information or accurate mechanisms for the prediction of residue-residue contacts between sequentially distant residues in the protein. Furthermore, swapping loop conformations while restraining the flanking secondary tails will help increase the combinatorial possibilities of protein design.

Acknowledgments: This article is partially based on our previous publications [30-39, 41]. NFF acknowledges support from ACCIO, Generalitat of Catalunya under the TecnioSpring Program, project number TECSPR13-1-0008, REA grant agreement 600388. This work was supported by NIH grants GM094665 and GM096041 to AF. JB and BO acknowledge support from the Spanish Ministry of Economy under grant BIO2011-22568.

Author contributions: All authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.

Research funding: None declared.

Employment or leadership: None declared.

Honorarium: None declared.

Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication.

Journal

Bio-Algorithms and Med-Systems, de Gruyter

Published: Dec 19, 2014

References