The divergence entropy D_KL(O/T), measuring the distance between the observed and theoretical hydrophobicity distributions, and D_KL(O/R), measuring the distance between the observed and random (uniform) distributions, was applied to identify the category of protein structures with respect to the hydrophobic core in protein molecules. The naive interpretation treats proteins with O/T < O/R as molecules whose hydrophobic core is accordant with the theoretically assumed one, while proteins with O/T > O/R are treated as representing a hydrophobic core not accordant with the assumed one. Large-scale computing was performed (the complete PDB data set) to reveal whether a relation other than this simple inequality should be used for the identification. Cluster analysis was applied to verify the relation of O/T versus O/R as the discrimination factor classifying proteins with respect to the structural form of their hydrophobic core.

KEYWORDS: cluster analysis, divergence entropy, protein structure, hydrophobic core

Introduction

Identification of the hydrophobic core in a protein is important for the analysis of tertiary structure stabilization. Many models have been applied to define and identify the hydrophobic core, the most important being the "oil drop" model introduced by Kauzmann [1]. The "fuzzy oil drop" model, which introduces a 3-D Gauss function to express the hydrophobicity distribution in the protein body, was described in detail in [2]. A quantitative estimation of the differences between the theoretical (idealized) hydrophobicity distribution and the empirically observed one (based on the location of hydrophobic and hydrophilic residues in the protein molecule) is necessary to reveal and differentiate protein molecules with respect to hydrophobic core formation.

Methods

Model description

The observed hydrophobicity distribution is calculated according to the function introduced by Levitt [3]: each residue collects all hydrophobic interactions within a distance of 9 Å (the cutoff for hydrophobic interaction), and the resultant hydrophobic interaction is attributed to an "effective atom" (the mean position of the side-chain atoms). The theoretical distribution, accordant with the Gauss function, can be calculated for any point in the protein body after a special orientation in the coordinate system: the geometric center is located at the origin, the longest diagonal of the molecule is oriented along one axis (say, the X-axis), and the longest diagonal between points on the plane perpendicular to that axis (say, the YZ plane) fixes the remaining directions. For such an orientation, the size of the ellipsoid, expressed by a sigma parameter for each direction (each axis), can be calculated. The theoretical hydrophobicity is then calculated at the positions of the "effective atoms". The difference between the two distributions can be calculated since both (theoretical and empirical) are standardized. The Kullback-Leibler divergence entropy has been introduced to measure the distance between the idealized and empirical hydrophobicity distributions [4]. The divergence entropy is calculated according to the following equation:

D_{KL}(O/T) = \sum_{i=1}^{N} O_i \log_2 (O_i / T_i)

where the sum runs over all N residues, O_i represents the observed hydrophobicity of the i-th residue and T_i the theoretical hydrophobicity at the i-th residue. However, the value of D_KL(O/T) calculated for a particular protein cannot be interpreted on its own, due to the relative character of entropy in general. This is why a second target distribution has been introduced: the constant one (R), which assumes equal hydrophobicity for each residue in the protein molecule.
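To make the two measures concrete, here is a minimal sketch (not the authors' code; the profile values and the function name are illustrative assumptions) computing D_KL(O/T) against the Gauss-based target and D_KL(O/R) against the constant target for standardized per-residue profiles:

```python
import numpy as np

def kl_divergence(observed, target):
    """Kullback-Leibler divergence: sum_i O_i * log2(O_i / T_i).

    Both profiles must be standardized to sum to 1; zero observed
    entries contribute nothing (0 * log 0 := 0 by convention)."""
    o = np.asarray(observed, dtype=float)
    t = np.asarray(target, dtype=float)
    mask = o > 0
    return float(np.sum(o[mask] * np.log2(o[mask] / t[mask])))

# Hypothetical per-residue hydrophobicity profiles, each summing to 1.
observed = np.array([0.10, 0.30, 0.40, 0.15, 0.05])
theoretical = np.array([0.08, 0.27, 0.45, 0.15, 0.05])    # 3-D Gauss based
random_ref = np.full(len(observed), 1.0 / len(observed))  # constant R

d_ot = kl_divergence(observed, theoretical)  # O/T distance
d_or = kl_divergence(observed, random_ref)   # O/R distance

# Naive interpretation: O/T < O/R means the hydrophobic core is
# accordant with the assumed (Gauss-shaped) model.
print(f"O/T = {d_ot:.5f}, O/R = {d_or:.5f}, accordant: {d_ot < d_or}")
```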
The comparison of the two entropies, D_KL(O/T) and D_KL(O/R), may be interpreted as follows: the first is the distance between the observed and theoretical distributions, and the second the distance between the observed and constant distributions [5]. The comparison of these two values suggests a naive interpretation of the hydrophobic core structure in a protein: the relation O/T < O/R suggests accordance between the observed and theoretical distributions, while a protein with O/T > O/R is treated as a molecule with no hydrophobic core accordant with the assumed one.

Databases

Redundant databases were generated taking all proteins present in the PDB. Two databases were created, one representing tertiary structure (individual chains) and the other representing quaternary structure (protein complexes). Since quaternary structure is quite frequent, the O/T and O/R values were calculated for complexes and for individual chains independently.

Statistical analysis

The investigation started with the k-means clustering algorithm, used to split each dataset into two clusters. It is expected that similarity among members of a cluster should be high, while similarity between different clusters should be low. In this research, Euclidean distance was chosen as the distance measure. Additionally, a v-fold cross-validation algorithm (with v=10) was applied to compare the costs of choosing only two groups with the lowest costs calculated by that algorithm. The lowest costs correspond to the optimal number of clusters, recognized as the point where a further decrease in costs was less than 5% with respect to the previous model. Statistica Data Miner offers two types of clustering algorithm, k-means and EM, and the measure of costs depends on the algorithm applied: for k-means clustering, costs are calculated as the average distance of the observations in the testing sample from their cluster centers; for EM clustering, the appropriate measure is the average negative log-likelihood of the cross-validation data [6].

The general idea of the v-fold cross-validation algorithm is to divide the whole dataset into v random samples. The observations belonging to v-1 folds create the training sample, and the remaining data the testing sample. The analysis is performed on the training sample, and its results are applied to the testing sample to compute an index of predictive validity. These steps are repeated v times, and the results of the v replications are averaged to yield a single measure of the stability of the model. In this way we obtain the validity of the model for predicting new observations [6].
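The clustering step can be sketched as follows (a minimal illustration using scikit-learn rather than Statistica Data Miner; the synthetic data and the exact mapping of the 5% stopping rule are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

# Hypothetical input: one row per protein, columns [O/T, O/R].
rng = np.random.default_rng(0)
X = rng.random((1000, 2))

# Split into two clusters with Euclidean k-means, as in the study.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

def cv_cost(X, k, v=10):
    """v-fold cross-validated k-means cost: average distance of the
    testing-sample observations to the nearest center fitted on the
    training folds."""
    costs = []
    for train_idx, test_idx in KFold(n_splits=v, shuffle=True,
                                     random_state=0).split(X):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[train_idx])
        nearest = np.min(km.transform(X[test_idx]), axis=1)
        costs.append(nearest.mean())
    return float(np.mean(costs))

# Recognize the optimal k: stop once costs improve by less than 5%
# with respect to the previous model.
prev = cv_cost(X, 2)
for k in range(3, 12):
    cur = cv_cost(X, k)
    if (prev - cur) / prev < 0.05:
        break
    prev = cur
print("optimal number of clusters:", k - 1)
```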
The next step of the study focused on the application of classification tree algorithms to detect criteria for dividing the whole datasets into the two determined classes; the results of the cluster analysis were treated as the point of reference. The traditional statistical methods commonly applied to discover a discriminant pattern were excluded due to their distributional assumptions, since normality is not fulfilled for either the O/T or the O/R variable. In such a situation, classification trees are a very attractive alternative. They belong to the supervised divisive hierarchical methods and are situated on the boundary between predictive and descriptive methods [7]. In this research two types of classification tree were used: C&RT (Classification and Regression Trees) and CHAID (Chi-squared Automatic Interaction Detector). Two basic differences between these algorithms can be pointed out. The first concerns the splitting criterion applied to obtain the best separation at each node: C&RT uses the Gini index, while CHAID uses the chi-squared test. The second is that a C&RT model is always binary, i.e., each node can be split into two child nodes only; that restriction does not apply to CHAID trees [7,8]. In this study, the misclassification error was used as the pruning condition for both types of model, while the stopping criteria were a mixture of rules: the algorithm was stopped if the minimum number of cases in a node fell below 2.5% of all observations, and it was additionally decided that a tree should have no more than 10 levels and no more than 1000 nodes. Furthermore, v-fold cross-validation (with v=10) was applied again, but this time to prevent overfitting the data and, in consequence, to allow the model to generalize to new items. Gains charts, which show the percentage of observations correctly classified for a given cluster, were used to compare the results of the two tree models.
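A CART-style tree with these stopping rules can be sketched as follows (a rough scikit-learn analogue of the Statistica setup; the toy data and the mapping of the 2.5% node-size rule to min_samples_split are assumptions, and misclassification-error pruning is omitted):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features [O/T, O/R] and toy stand-in cluster labels.
rng = np.random.default_rng(1)
X = rng.random((2000, 2))
y = (X[:, 0] > 0.4).astype(int)

tree = DecisionTreeClassifier(
    criterion="gini",                        # C&RT splitting criterion
    max_depth=10,                            # no more than 10 levels
    max_leaf_nodes=1000,                     # no more than 1000 nodes
    min_samples_split=int(0.025 * len(X)),   # node holds >= 2.5% of cases
    random_state=0,
).fit(X, y)

# The learned thresholds play the same role as those reported in the
# Results below (e.g. O/T <= 0.401192 for the complex file).
print(export_text(tree, feature_names=["O/T", "O/R"]))
```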
Results

Figure 1 presents the clusters obtained with the k-means analysis for both sets of data; the left panel corresponds to the PDB complex file and the right one to the individual chains. The data from the first set was divided mostly by the O/T distance. The average value of the O/R distance is nearly the same for both clusters: 0.281447±0.080308 and 0.293941±0.067792 for cluster 1 and cluster 2, respectively. Items with an O/T distance above ca. 0.42 were assigned to cluster 1 (N=25,330; 37.22%) and the remaining data to cluster 2 (N=42,726; 62.78%). Moreover, cluster 1 is characterized not only by a higher average O/T value but also by a higher dispersion (0.550814±0.127473) than cluster 2 (0.253306±0.081126); for more detail see the descriptive statistics in Table 3. In the case of the data belonging to the PDB chains file, the basic role is played by the O/R distance, as opposed to the O/T distance for the PDB complex file. The average value of the O/T distance does not differ very much between clusters 1 and 2, whereas the difference between the means of the O/R distance is quite high: in cluster 1 the average and standard deviation of the latter variable are 0.375013±0.049692, and in cluster 2 they are 0.265442±0.040381.

Figure 1. k-means analysis results for the PDB complex file (left) and the PDB chains file (right)

Assuming an arbitrary division of the datasets into two clusters only, we have to accept higher costs than if we chose the optimal number of clusters generated by v-fold cross-validation. The optimal number was found to be 8 for the first set of data (with costs equal to 0.0312) and 9 for the second (with costs equal to 0.0373). The costs are almost twice as high when the data are split into two groups only: 0.0496 for the PDB complex file and 0.06 for the PDB chains file. It should be noted, however, that they are still at a very low level. The costs and the corresponding numbers of clusters are presented in Figure 2.

Figure 2. Costs and the corresponding numbers of clusters

Figures 3 and 4 show the C&RT and CHAID trees, respectively, obtained for the PDB complex file. Two features of these trees are easy to see: both have a very simple structure, and in both models the variable separating the data most clearly is the O/T distance. The C&RT model, as a binary tree, has two leaves only. The root contains 68,056 individuals, of which 25,330 (37.22%) belong to cluster 1 and 42,726 (62.78%) to cluster 2. The threshold value for the O/T distance is 0.401192: an item is classified as cluster 2 if its O/T distance is below the threshold and as cluster 1 if it is above. The good classification rate is 99.90% for the first leaf and 99.43% for the second.

In the CHAID model the root is divided into six child nodes, but only five of them are treated as leaves. Note that the first and the last of them contain correctly classified elements only: the first leaf (on the left) contains 40,836 items belonging to cluster 2 (O/T distance lower than 0.388033), while the last one contains 20,414 objects classified as cluster 1 (O/T distance above 0.442988). Incorrectly assigned elements can occur only when the O/T distance lies between these values (0.388033 to 0.442988). Following Figure 4, in such a situation we have to take a closer look at the O/T distance within this band: an item is assigned to cluster 2 if its distance is lower than 0.403148 (the third leaf), with a bad classification rate of ca. 11%, while an item with a distance above that value is classified as cluster 1. Note that the good classification rate differs between the range 0.403148 to 0.419587 and values above 0.419587: in the second situation we can be absolutely sure that assigning the item to cluster 1 is the right choice, whereas in the first one we can be only ca. 97% confident of the selection.

Figure 3. C&RT model for the PDB complex file

Figure 4. CHAID model for the PDB complex file

Gains charts (Figure 5) and Table 1 show that Classification and Regression Trees are more effective than the Chi-squared Automatic Interaction Detector, although the good classification rate of both models is at a very high level: 99.72% for the C&RT tree and 99.58% for the CHAID tree. Hence, the bad classification rate is 0.28% for the first model and 0.42% for the second, meaning that the classification rules obtained from the C&RT model failed for 189 objects, while the rules derived from the CHAID model failed for 285.

Table 1. Results of classification trees (PDB complex file)

| Observed | C&RT: cluster 1 | C&RT: cluster 2 | CHAID: cluster 1 | CHAID: cluster 2 | Row total |
|---|---|---|---|---|---|
| cluster 1 | 25,287 (37.16%) | 43 (0.06%) | 25,111 (36.90%) | 219 (0.32%) | 25,330 (37.22%) |
| cluster 2 | 146 (0.21%) | 42,580 (62.57%) | 66 (0.10%) | 42,660 (62.68%) | 42,726 (62.78%) |
| All groups | 25,433 (37.54%) | 42,623 (62.63%) | 25,177 (36.99%) | 42,879 (63.01%) | 68,056 (100.00%) |

Figure 5. Gains chart for clusters 1 and 2 (PDB complex file)
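The published thresholds amount to one-line decision rules. The helper below (hypothetical function names; the thresholds are those reported for Figures 3 and 4) encodes the complex-file rules:

```python
def classify_complex_cart(d_ot: float) -> int:
    """C&RT rule for the PDB complex file: the single threshold
    0.401192 on the O/T distance separates the two clusters."""
    return 2 if d_ot <= 0.401192 else 1

def classify_complex_chaid(d_ot: float) -> int:
    """CHAID rule for the PDB complex file, following Figure 4:
    certain below 0.388033 (cluster 2) and above 0.442988 (cluster 1);
    in the band between, the finer threshold 0.403148 decides."""
    if d_ot <= 0.388033:
        return 2
    if d_ot > 0.442988:
        return 1
    return 2 if d_ot <= 0.403148 else 1

# Example: a strongly discordant complex (high O/T) lands in cluster 1.
assert classify_complex_cart(0.55) == 1 and classify_complex_chaid(0.25) == 2
```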
Figures 6 and 7 present the classification rules for the data belonging to the PDB chains file, obtained by means of classification trees. The tree in Figure 6, presenting the results of the C&RT model, has a very simple structure, with three levels and three leaves only. The root contains 303,207 individuals, among which 117,046 objects (38.60%) belong to cluster 1 and 186,161 (61.40%) to cluster 2. The best separation of the data is provided by the O/R variable. The first threshold value is 0.318626: an item is classified as cluster 2 when its O/R distance is below that threshold, with a good classification rate (GCR) of 99.06%. The second threshold, applied to the observations with an O/R distance above 0.318626, is 0.323500 and concerns the same measure; objects on both sides of it are classified as cluster 1, but the good classification rate differs, amounting to 70.96% below this value and 99.09% above it (cf. Table 2, where all 118,784 items above the first threshold are predicted as cluster 1).

The tree made by means of the CHAID algorithm is also uncomplicated: it has only two levels, with 5 leaves. The variable that separates the data most clearly is again the O/R distance, with a threshold value of 0.317296: an item should be recognized as a component of cluster 2 if the measure is below the threshold, and as a component of cluster 1 if it is above. Such a division suggests there should be only two leaves, not the five present in the model; the higher number of leaves results from the different good classification rates, which are 100%, 97.65%, 83.97%, 99.75% and 99.99% for leaves 1 to 5, respectively.

Figure 6. C&RT model for the PDB chains file

Figure 7. CHAID model for the PDB chains file

To sum up this part, we obtained better results using the C&RT algorithm than the CHAID algorithm, but only for classification to cluster 1. The left gains chart in Figure 8 shows that the area under the line corresponding to the C&RT tree is larger than that corresponding to the CHAID model. Following the rules obtained from the first kind of tree we make a mistake in 3,472 cases (of 186,161; 1.87%), while following the second one in 4,939 cases (2.65%; Table 2). On the other hand, for items classified as cluster 2 while actually belonging to cluster 1, the situation is reversed: the CHAID algorithm gives better results than the C&RT one. In general, 98.28% of all items were assigned to the right cluster according to the C&RT rules and 98.14% according to the CHAID model. The percentage of badly classified objects is 1.72% for the first model, i.e., 5,206 individuals assigned to the wrong cluster; for the second model it is 1.86%, corresponding to 5,653 individuals.

Table 2. Results of classification trees (PDB chains file)

| Observed | C&RT: cluster 1 | C&RT: cluster 2 | CHAID: cluster 1 | CHAID: cluster 2 | Row total |
|---|---|---|---|---|---|
| cluster 1 | 115,312 (38.03%) | 1,734 (0.57%) | 116,332 (38.37%) | 714 (0.24%) | 117,046 (38.60%) |
| cluster 2 | 3,472 (1.15%) | 182,689 (60.25%) | 4,939 (1.63%) | 181,222 (59.77%) | 186,161 (61.40%) |
| All groups | 118,784 (39.18%) | 184,423 (60.82%) | 121,271 (40.00%) | 181,936 (60.00%) | 303,207 (100.00%) |

Figure 8. Gains chart for clusters 1 and 2 (PDB chains file)
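The rates quoted above follow directly from the confusion matrices in Tables 1 and 2, as a quick check confirms (a verification sketch, not part of the original study):

```python
import numpy as np

# Confusion matrices from Tables 1 and 2
# (rows: observed cluster 1/2, columns: predicted cluster 1/2).
tables = {
    "complex C&RT":  np.array([[25287, 43], [146, 42580]]),
    "complex CHAID": np.array([[25111, 219], [66, 42660]]),
    "chains C&RT":   np.array([[115312, 1734], [3472, 182689]]),
    "chains CHAID":  np.array([[116332, 714], [4939, 181222]]),
}

for name, cm in tables.items():
    total = cm.sum()
    correct = np.trace(cm)  # correctly classified objects (diagonal)
    print(f"{name}: GCR = {correct / total:.2%}, "
          f"misclassified = {total - correct}")
# Reproduces 99.72%/189, 99.58%/285, 98.28%/5206 and 98.14%/5653.
```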
Table 3. Descriptive statistics for the data belonging to the PDB complex file

| Group | Variable | N | Mean | SD | 95% CI | Median | Min–Max | Q1–Q3 |
|---|---|---|---|---|---|---|---|---|
| Both groups | O/T | 68,056 | 0.36404 | 0.17568 | 0.36272–0.36536 | 0.33778 | 0.00000–1.58998 | 0.22443–0.47406 |
| Both groups | O/R | 68,056 | 0.28929 | 0.07295 | 0.28874–0.28984 | 0.27897 | 0.00003–1.78128 | 0.24926–0.31239 |
| Cluster 1 | O/T | 25,330 | 0.55081 | 0.12747 | 0.54924–0.55238 | 0.51938 | 0.39828–1.58998 | 0.45592–0.61137 |
| Cluster 1 | O/R | 25,330 | 0.28145 | 0.08031 | 0.28046–0.28244 | 0.27059 | 0.04911–1.78128 | 0.24364–0.30068 |
| Cluster 2 | O/T | 42,726 | 0.25331 | 0.08113 | 0.25254–0.25408 | 0.24789 | 0.00000–0.41593 | 0.18566–0.32196 |
| Cluster 2 | O/R | 42,726 | 0.29394 | 0.06779 | 0.29330–0.29458 | 0.28448 | 0.00003–0.85110 | 0.25313–0.31888 |

Table 4. Descriptive statistics for the data belonging to the PDB chains file

| Group | Variable | N | Mean | SD | 95% CI | Median | Min–Max | Q1–Q3 |
|---|---|---|---|---|---|---|---|---|
| Both groups | O/T | 30,321 | 0.24875 | 0.11976 | 0.24832–0.24918 | 0.21938 | 0.03380–1.97670 | 0.16561–0.29992 |
| Both groups | O/R | 30,321 | 0.30774 | 0.06928 | 0.30749–0.30799 | 0.30142 | 0.02773–0.80264 | 0.26333–0.34671 |
| Cluster 1 | O/T | 11,705 | 0.23244 | 0.10734 | 0.23182–0.23305 | 0.20483 | 0.05038–1.97670 | 0.16166–0.27258 |
| Cluster 1 | O/R | 11,705 | 0.37501 | 0.04969 | 0.37473–0.37530 | 0.36153 | 0.31337–0.80264 | 0.33778–0.39829 |
| Cluster 2 | O/T | 18,616 | 0.25901 | 0.12586 | 0.25843–0.25958 | 0.23077 | 0.03380–1.89201 | 0.16908–0.31595 |
| Cluster 2 | O/R | 18,616 | 0.26544 | 0.04038 | 0.26526–0.26563 | 0.27222 | 0.02773–0.37248 | 0.24595–0.29537 |

Conclusions

The standard clustering procedure suggests a number of clusters higher than 2. The model, however, assumes distinguishing solely two clusters, which is related to the practical application expecting the selection of proteins with a hydrophobic core accordant with the assumed model (the 3-D Gauss function). An interesting result is that both clusters of individual chains appeared to represent the relation O/T < O/R, which means that the general mechanism of individual chain folding seems to be the creation of a hydrophobic core. Quaternary structure generation, in contrast, seems to split into two quite separate groups: one of them follows the assumed model, collecting the protein complexes with a common hydrophobic core accordant with the 3-D Gauss-shaped form, while the second collects the complexes with a hydrophobic core highly discordant with the assumed one.

The results strongly support the "fuzzy oil drop" model as a model for protein structure generation. It may be treated as a general model for both individual chain folding and protein complexation; however, the latter seems to define two distinct groups of protein complexes, which suggests at least two different mechanisms for the generation of protein-protein complexes. Further analysis linking quaternary structure generation with the biological activity of the complex, its cellular localization, or the number of units in a multi-molecular complex is planned in the near future.

This paper presents the analysis of a redundant set of proteins (the complete PDB). An analysis of a nonredundant protein database is planned for the near future.

Acknowledgements

The project was financially supported by the Jagiellonian University Medical College grant system, grant # K/ZDS/001531.
Bio-Algorithms and Med-Systems – de Gruyter
Published: Jan 1, 2012