Phylogenetic aspects of the concept of intelligent life design

This paper presents a new treatment of the molecular evolutionary model as a product of intelligent changes. The aim of this paper is to obtain a life design system that draws on processes occurring in nature, regardless of explanations of the origins of life. The idea of intelligent design and molecular relationship is considered as a basic concept of the intelligent life design system, using some analogies taken from molecular evolutionary models. Three steps of the life design system are outlined; however, the main subject is an attempt to find similar effects between the design system processes and the processes simulated with basic evolutionary substitution models: Jukes-Cantor; Felsenstein; and Hasegawa, Kishino, and Yano (HKY). An idea of gene reduction has been applied, from more complex biological systems (taking into account information density) to less complex, specialised biological systems. Two steps have been taken into consideration: a test stage in the virtual world and a finishing adaptation process after running the systems in the real world. Two algorithms have been applied. The first applies similarity related to an accommodation process to required conditions in both the virtual and the real world. The second applies accommodation to required conditions separately (expressed as amino acid substitution) in the first step, using a convenient criterion, and further accommodation (similar to the observable one) in the real world. A phylogenetic tree, similar to a real one, has been calculated using the above method for mammals, for mtDNA, with the maximum likelihood method, and with the aid of PhyML for the HKY model. This paper is an introduction showing an aspect of the life design system related to phylogenetic relationships.

*Corresponding author: Zbigniew Krajewski, Faculty of Biomedical Engineering, Silesian University of Technology, ul.
Roosevelta 40, Zabrze 41-800, Poland, Phone: +48 32 2777463, E-mail: zkrajewski@polsl.pl

Introduction

In plain words, the evolutionary model of genetic change is based on the assumption of isolated incidents of modification of individual DNA bases, such as substitutions, deletions, and additions. It is assumed that both deletions and additions are exceptionally rare, because they can damage the coding frame. More common are synonymous substitutions, which do not change the coded amino acid; substitutions that do change it occur more rarely, because the modified protein must still match its function. It is assumed that for individual genes (proteins), the genetic change process may be stable over time, and on the basis of a specific number of changes in individual positions, it is possible to establish the topology and the divergence times of a hypothetical phylogenetic tree. Most often, for the purpose of establishing the evolutionary time, only substitutions are taken into account, whereas additions and deletions are omitted, using sequence-aligning methods.

The models used for the creation of phylogenetic trees are most often based on the Markov chain concept. A stochastic process is a family of random variables indexed by $t$, $\{X(t,\omega),\ t \in T,\ \omega \in \Omega\}$. In a Markov process, given the past of the process $X(t,\omega)$, the future of the process, for $t > t_0$, is characterised by its current state $X(t_0,\omega)$ [1]. A Markov chain is a Markov process with $X(t,\omega) \in S$, for which the state space $S$ is discrete. In other words, the Markov chain represents random transitions between discrete states, for which the future state depends only on the current state (the previous state), which can be illustrated with the following equation [1]:

$$ P(X_{k+1}=j \mid X_k=i,\ X_{k-1}=i_1,\ X_{k-2}=i_2,\ \ldots) = P(X_{k+1}=j \mid X_k=i), \tag{1} $$

where $P(X_{k+1}=j \mid X_k=i)=p_{ij}$ is the probability of transition from state $i$ to state $j$.
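As a minimal sketch of Eq. (1), the following Python fragment draws a path of a four-state chain in which each new state is sampled from the row of the current state only. The transition probabilities and the names `P`, `next_state`, and `sample_path` are illustrative assumptions, not values or code from this study.

```python
import random

STATES = "ACGT"

# Hypothetical one-step transition probabilities p_ij (illustrative values only).
# Each row describes where the chain may go from one state and must sum to 1.
P = {
    "A": {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    "C": {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01},
    "G": {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    "T": {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
}


def next_state(current, rng):
    """Draw X_{k+1} from the row of the current state X_k only (Markov property)."""
    row = P[current]
    return rng.choices(STATES, weights=[row[s] for s in STATES])[0]


def sample_path(start, steps, rng):
    """Return the visited states X_0 ... X_steps as a string."""
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], rng))
    return "".join(path)


rng = random.Random(0)
print(sample_path("A", 20, rng))
```

Earlier states never enter `next_state`, which is exactly the conditioning stated in Eq. (1).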
For gene substitution methods, Markov chains with a finite number of states are used, with the transition probabilities expressed in a matrix:

$$ P = \begin{pmatrix} p_{AA} & p_{AC} & p_{AG} & p_{AT} \\ p_{CA} & p_{CC} & p_{CG} & p_{CT} \\ p_{GA} & p_{GC} & p_{GG} & p_{GT} \\ p_{TA} & p_{TC} & p_{TG} & p_{TT} \end{pmatrix}, \tag{2} $$

where A, C, G, and T represent adenine, cytosine, guanine, and thymine, respectively. For two K-long sequences, $\binom{n+k-1}{k} = 10$ combinations with repetitions of nucleotide pairs are possible, where $n=4$, $k=2$, i.e. 4 for which no change occurs (A-A, C-C, G-G, T-T) and 6 for which a change occurs, independently of the direction (A-C, A-G, A-T, C-G, C-T, G-T). The probability of the occurrence of any of the above arrangements for two K-long nucleotide sequences is described with the multinomial distribution:

$$ P(k_{A-A}, k_{C-C}, k_{G-G}, k_{T-T}, k_{A-C}, k_{A-G}, k_{A-T}, k_{C-G}, k_{C-T}, k_{G-T}) = \frac{K!}{k_{A-A}!\,k_{C-C}!\,k_{G-G}!\,k_{T-T}!\,k_{A-C}!\,k_{A-G}!\,k_{A-T}!\,k_{C-G}!\,k_{C-T}!\,k_{G-T}!}\; q_{A-A}^{k_{A-A}}\, q_{C-C}^{k_{C-C}}\, q_{G-G}^{k_{G-G}}\, q_{T-T}^{k_{T-T}}\, q_{A-C}^{k_{A-C}}\, q_{A-G}^{k_{A-G}}\, q_{A-T}^{k_{A-T}}\, q_{C-G}^{k_{C-G}}\, q_{C-T}^{k_{C-T}}\, q_{G-T}^{k_{G-T}}, \tag{3} $$

where $k_{A-A}, k_{C-C}, k_{G-G}, k_{T-T}, k_{A-C}, k_{A-G}, k_{A-T}, k_{C-G}, k_{C-T}$, and $k_{G-T}$ are the numbers of pairs A-A, C-C, G-G, T-T, A-C, A-G, A-T, C-G, C-T, and G-T in the investigated sequences, and $q_{A-A}, q_{C-C}, q_{G-G}, q_{T-T}, q_{A-C}, q_{A-G}, q_{A-T}, q_{C-G}, q_{C-T}$, and $q_{G-T}$ are their respective probabilities of occurrence. By allocating numbers to the A, C, G, and T bases (1, 2, 3, and 4, respectively), it is possible to obtain the following likelihood function $L$ [1]:

$$ L = \sum_{i<j} k_{ij} \ln q_{ij} + \sum_{i} k_{ii} \ln q_{ii}. \tag{4} $$

By maximising the likelihood function, the appropriate parameters $q_{ij}$ and $q_{ii}$ are determined for an established experimental result $k_{ij}$, $k_{ii}$. When solving the equation, we may estimate the parameters [1]

$$ \hat q_{ii} = \frac{k_{ii}}{K} \quad\text{and}\quad \hat q_{ij} = \frac{k_{ij}}{K}. \tag{5} $$

For a stationary probability distribution, the probability of occurrence of individual states may be estimated as follows:

$$ \hat\pi_i = \frac{2k_{ii} + \sum_{j \ne i} k_{ij}}{2K}. \tag{6} $$

[Figure 1: One step model. A hypothetical ancestor is connected by two branches of length t to Sequence 1 and Sequence 2.]

Assuming that a substitution occurs at one step (Figure 1), the probabilities of occurrence of states $ii$ and $ij$ are $q_{ii} = \pi_i p_{ii}$ and $q_{ij} = \pi_i p_{ij} + \pi_j p_{ji}$. Assuming chain reversibility, i.e. $\pi_i p_{ij} = \pi_j p_{ji}$, we obtain

$$ \hat p_{ii} = \frac{\hat q_{ii}}{\hat\pi_i} = \frac{2k_{ii}}{2k_{ii} + \sum_{j \ne i} k_{ij}} \quad\text{and}\quad \hat p_{ij} = \frac{\hat q_{ij}}{2\hat\pi_i} = \frac{k_{ij}}{2k_{ii} + \sum_{j \ne i} k_{ij}} \quad\text{for } i \ne j. \tag{7} $$

Due to very minor changes in the genetic material that occur in one generation, it is more convenient to use the continuous model. Using the Chapman-Kolmogorov equation as follows (Figure 2):

$$ \begin{aligned} P(x_2=j,\ x_0=i) &= \sum_n P(x_2=j,\ x_1=n,\ x_0=i) = \sum_n P(x_2=j,\ x_0=i \mid x_1=n)\,P(x_1=n) \\ &= \sum_n P(x_2=j \mid x_0=i,\ x_1=n)\,P(x_0=i \mid x_1=n)\,P(x_1=n) \\ &= \sum_n P(x_2=j \mid x_1=n)\,P(x_0=i \mid x_1=n)\,P(x_1=n) = \sum_n p_{nj}\,p_{ni}\,\pi_n. \end{aligned} \tag{8} $$

For the continuous Markov process

$$ q_{ij} = P(x_2=j,\ x_0=i) + P(x_2=i,\ x_0=j) = 2\sum_n p_{nj}(t)\,p_{ni}(t)\,\pi_n \quad\text{for } i<j, \tag{9} $$

$$ q_{ii} = P(x_2=i,\ x_0=i) = \sum_n p_{ni}(t)^2\,\pi_n. \tag{10} $$

[Figure 2: Chapman-Kolmogorov equation scheme. State x0=i passes through the intermediate states x1=1, x1=2, x1=3, ..., x1=n to state x2=j.]

Using the reversibility property $\pi_n p_{ni} = \pi_i p_{in}$ and the Chapman-Kolmogorov equation:

$$ p_{ij}(s+t) = \sum_n P(X_{s+t}=j,\ X_s=n \mid X_0=i) = \sum_n P(X_{s+t}=j \mid X_s=n)\,P(X_s=n \mid X_0=i) = \sum_n p_{in}(s)\,p_{nj}(t). \tag{11} $$

As a result, we obtain

$$ q_{ij} = 2\sum_n p_{nj}(t)\,p_{ni}(t)\,\pi_n = 2\sum_n \pi_i\,p_{in}(t)\,p_{nj}(t) = 2\pi_i \sum_n p_{in}(t)\,p_{nj}(t) = 2\pi_i\,p_{ij}(2t) \quad\text{for } i<j, \tag{12} $$

similarly

$$ q_{ii} = \pi_i\,p_{ii}(2t). \tag{13} $$

Therefore, we may estimate the transition functions on the basis of the following relations:

$$ \hat p_{ij}(2t) = \frac{\hat q_{ij}}{2\hat\pi_i} \tag{14} $$

and

$$ \hat p_{ii}(2t) = \frac{\hat q_{ii}}{\hat\pi_i}. \tag{15} $$

The probability of occurrence of any of the states $ij$ is determined on the basis of the sum of probabilities $q_{ij}$,

$$ d = \sum_{i<j} q_{ij}(2t) = \sum_{i \ne j} P(X_0=i,\ X_1=j)(2t) = \sum_{i \ne j} \pi_i\,p_{ij}(2t). \tag{16} $$

With the assumption of reversibility of the Markov chain, $\pi_i p_{ij}(2t) = \pi_j p_{ji}(2t)$,

$$ d = 2\sum_{i<j} \pi_i\,p_{ij}(2t) = 1 - \sum_i \pi_i\,p_{ii}(2t). \tag{17} $$

Using the Chapman-Kolmogorov equation, it is possible to define the derivative of the transition matrix $P(t)$ (forward):

$$ P'(t) = \lim_{\Delta t \to 0} \frac{P(t+\Delta t) - P(t)}{\Delta t} = \lim_{\Delta t \to 0} \frac{P(t)P(\Delta t) - P(t)}{\Delta t} = P(t) \lim_{\Delta t \to 0} \frac{P(\Delta t) - P(0)}{\Delta t} = P(t)\,P'(0), \tag{18} $$

and $P'(t) = P'(0)P(t)$ (backward). By setting $Q = P'(0)$, the above equation can be written as $P'(t) = QP(t)$, where matrix $Q$ is called the "intensity matrix of the time-continuous Markov chain."

The most commonly used gene substitution models are the Jukes-Cantor (J-C); Felsenstein; and Hasegawa, Kishino, and Yano (HKY) models. Below, transition matrices $P$ are presented for these discrete-time models, and matrices $Q$ for the relevant time-continuous models [1, 2]. For the J-C model, the probabilities of substitution between each type of nucleotide are the same, and are represented by the $P$ matrix for the discrete model and $Q$ for the time-continuous model as follows:

$$ P = \begin{pmatrix} 1-3\alpha & \alpha & \alpha & \alpha \\ \alpha & 1-3\alpha & \alpha & \alpha \\ \alpha & \alpha & 1-3\alpha & \alpha \\ \alpha & \alpha & \alpha & 1-3\alpha \end{pmatrix}, \quad Q = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}. \tag{19} $$

This simplest model requires the stationary distribution to be uniform; hence, the probability $\pi_i$ of occurrence of a particular nucleotide in the DNA chain is the same, 0.25. The $P$ matrix represents the probability of substitutions of particular nucleotides in one step, while the $Q$ matrix represents the parameters of a matrix differential equation [see Eq. (18)] for the time-continuous model with a time-dependent solution derived below [see Eq. (22)]. The more convenient Felsenstein model allows a stationary distribution $\pi = [\pi_A, \pi_C, \pi_G, \pi_T]$ to be set up, and the $P$ and $Q$ matrices are as follows:

$$ P = \begin{pmatrix} 1-u\pi_{CTG} & u\pi_C & u\pi_G & u\pi_T \\ u\pi_A & 1-u\pi_{ATG} & u\pi_G & u\pi_T \\ u\pi_A & u\pi_C & 1-u\pi_{ACT} & u\pi_T \\ u\pi_A & u\pi_C & u\pi_G & 1-u\pi_{ACG} \end{pmatrix}, \quad Q = \begin{pmatrix} -u\pi_{CTG} & u\pi_C & u\pi_G & u\pi_T \\ u\pi_A & -u\pi_{ATG} & u\pi_G & u\pi_T \\ u\pi_A & u\pi_C & -u\pi_{ACT} & u\pi_T \\ u\pi_A & u\pi_C & u\pi_G & -u\pi_{ACG} \end{pmatrix}, \tag{20} $$

where

$$ \pi_{CTG} = \pi_C + \pi_T + \pi_G, \quad \pi_{ATG} = \pi_A + \pi_T + \pi_G, \quad \pi_{ACT} = \pi_A + \pi_C + \pi_T, \quad \pi_{ACG} = \pi_A + \pi_C + \pi_G. $$

The most flexible of these three models, the HKY model, is used to adjust the substitution model to the differences occurring between the transversion and transition probabilities [3]. The stationary distribution is as for the Felsenstein model; hence, the J-C model is a special case of the Felsenstein model (for a uniform stationary distribution), and the Felsenstein model is a special case of the HKY model (for equal probabilities of transversion and transition occurrence):

$$ P = \begin{pmatrix} 1-\pi_{CTG} & v\pi_C & u\pi_G & v\pi_T \\ v\pi_A & 1-\pi_{ATG} & v\pi_G & u\pi_T \\ u\pi_A & v\pi_C & 1-\pi_{ACT} & v\pi_T \\ v\pi_A & u\pi_C & v\pi_G & 1-\pi_{ACG} \end{pmatrix}, \quad Q = \begin{pmatrix} -\pi_{CTG} & v\pi_C & u\pi_G & v\pi_T \\ v\pi_A & -\pi_{ATG} & v\pi_G & u\pi_T \\ u\pi_A & v\pi_C & -\pi_{ACT} & v\pi_T \\ v\pi_A & u\pi_C & v\pi_G & -\pi_{ACG} \end{pmatrix}, \tag{21} $$

where, for this model,

$$ \pi_{CTG} = u\pi_G + v(\pi_C + \pi_T), \quad \pi_{ATG} = u\pi_T + v(\pi_A + \pi_G), \quad \pi_{ACT} = u\pi_A + v(\pi_C + \pi_T), \quad \pi_{ACG} = u\pi_C + v(\pi_A + \pi_G). $$

For individual models, it is necessary to obtain a solution to the matrix differential equation in the form $P(t) = e^{Qt}$. With diagonalisation of matrix $Q$ [4-7], $Q = UDU^{-1}$, and using the expansion $e^{Qt} = \sum_{k=0}^{\infty} \frac{(Qt)^k}{k!}$, we obtain $P(t)$ as follows:

$$ P(t) = e^{Qt} = U e^{Dt} U^{-1}. \tag{22} $$

Therefore, it is necessary to diagonalise matrix $Q$, calculating the eigenvalues for matrix $D$ and the eigenvectors for matrix $U$ and matrix $U^{-1}$. Having determined the eigenvalues and eigenvectors for the matrices in each model, we obtain [1] for the J-C model:

$$ U = \begin{pmatrix} 1 & -1 & -1 & -1 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix}, \quad D = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & -4\alpha & 0 & 0 \\ 0 & 0 & -4\alpha & 0 \\ 0 & 0 & 0 & -4\alpha \end{pmatrix}, \quad U^{-1} = \begin{pmatrix} 0.25 & 0.25 & 0.25 & 0.25 \\ -0.25 & 0.75 & -0.25 & -0.25 \\ -0.25 & -0.25 & 0.75 & -0.25 \\ -0.25 & -0.25 & -0.25 & 0.75 \end{pmatrix}. \tag{23} $$

Calculating the respective probabilities for the $P(t)$ matrix [Eq. (22)], with the use of the above matrices [Eq. (23)], we obtain

$$ p_{ij}(t) = 0.25 - 0.25\,e^{-4\alpha t} \ \text{ for } i \ne j \quad\text{and}\quad p_{ii}(t) = 0.25 + 0.75\,e^{-4\alpha t}, \tag{24} $$

where

$$ d = 1 - \sum_i \pi_i\,p_{ii}(2t) = 1 - \frac{1}{4} - \frac{3}{4}e^{-8\alpha t} = \frac{3}{4} - \frac{3}{4}e^{-8\alpha t}, \tag{25} $$

i.e.

$$ \alpha\hat t = -\frac{1}{8} \ln\!\left(1 - \frac{4}{3}\hat d\right), \tag{26} $$

where $\hat d$ is estimated as follows:

$$ \hat d = \sum_i \sum_{j<i} \frac{k_{ij}}{K}. \tag{27} $$

For the Felsenstein model:

$$ U = \begin{pmatrix} 1 & -\frac{\pi_C}{\pi_A} & -\frac{\pi_G}{\pi_A} & -\frac{\pi_T}{\pi_A} \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix}, \quad D = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & -u & 0 & 0 \\ 0 & 0 & -u & 0 \\ 0 & 0 & 0 & -u \end{pmatrix}, \quad U^{-1} = \begin{pmatrix} \pi_A & \pi_C & \pi_G & \pi_T \\ -\pi_A & 1-\pi_C & -\pi_G & -\pi_T \\ -\pi_A & -\pi_C & 1-\pi_G & -\pi_T \\ -\pi_A & -\pi_C & -\pi_G & 1-\pi_T \end{pmatrix}. \tag{28} $$

Calculating the respective probabilities for the $P(t)$ matrix, we obtain [1]

$$ p_{ij}(t) = \pi_j\left(1 - e^{-ut}\right) \ \text{ for } i \ne j \quad\text{and}\quad p_{ii}(t) = \pi_i + (1 - \pi_i)\,e^{-ut}. \tag{29} $$

The probability of occurrence of any of the $ij$ states is

$$ \hat d = 1 - \sum_i \pi_i\,p_{ii}(2\hat t) = 1 - \sum_i \pi_i^2 - e^{-2u\hat t} \sum_i \pi_i (1 - \pi_i), \tag{30} $$

$$ u\hat t = -\frac{1}{2} \ln\!\left(1 - \frac{\hat d}{2\sum_i \sum_{j<i} \pi_i \pi_j}\right), \tag{31} $$

where $\hat d$ is calculated as for J-C [Eq. (27)]. For the HKY model, the respective matrices are as follows:

$$ U = \begin{pmatrix} 1 & -\frac{\pi_C+\pi_T}{\pi_A+\pi_G} & 0 & -\frac{\pi_G}{\pi_A} \\ 1 & 1 & -\frac{\pi_T}{\pi_C} & 0 \\ 1 & -\frac{\pi_C+\pi_T}{\pi_A+\pi_G} & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}, \quad D = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & -v & 0 & 0 \\ 0 & 0 & -v(\pi_A+\pi_G) - u(\pi_C+\pi_T) & 0 \\ 0 & 0 & 0 & -v(\pi_C+\pi_T) - u(\pi_A+\pi_G) \end{pmatrix}, \tag{32} $$

$$ U^{-1} = \begin{pmatrix} \pi_A & \pi_C & \pi_G & \pi_T \\ -\pi_A & \frac{\pi_C(\pi_A+\pi_G)}{\pi_C+\pi_T} & -\pi_G & \frac{\pi_T(\pi_A+\pi_G)}{\pi_C+\pi_T} \\ 0 & -\frac{\pi_C}{\pi_C+\pi_T} & 0 & \frac{\pi_C}{\pi_C+\pi_T} \\ -\frac{\pi_A}{\pi_A+\pi_G} & 0 & \frac{\pi_A}{\pi_A+\pi_G} & 0 \end{pmatrix}, \tag{33} $$

where $P(t)$ below is a probability transition matrix as a solution of a differential equation, derived with the eigenvalue method [see Eqs. (22), (32), and (33)]:

$$ P(t) = \begin{pmatrix}
a_1\pi_A + a_2\frac{\pi_A(\pi_C+\pi_T)}{\pi_A+\pi_G} + a_4\frac{\pi_G}{\pi_A+\pi_G} & (a_1-a_2)\pi_C & a_1\pi_G + a_2\frac{\pi_G(\pi_C+\pi_T)}{\pi_A+\pi_G} - a_4\frac{\pi_G}{\pi_A+\pi_G} & (a_1-a_2)\pi_T \\
(a_1-a_2)\pi_A & a_1\pi_C + a_2\frac{\pi_C(\pi_A+\pi_G)}{\pi_C+\pi_T} + a_3\frac{\pi_T}{\pi_C+\pi_T} & (a_1-a_2)\pi_G & a_1\pi_T + a_2\frac{\pi_T(\pi_A+\pi_G)}{\pi_C+\pi_T} - a_3\frac{\pi_T}{\pi_C+\pi_T} \\
a_1\pi_A + a_2\frac{\pi_A(\pi_C+\pi_T)}{\pi_A+\pi_G} - a_4\frac{\pi_A}{\pi_A+\pi_G} & (a_1-a_2)\pi_C & a_1\pi_G + a_2\frac{\pi_G(\pi_C+\pi_T)}{\pi_A+\pi_G} + a_4\frac{\pi_A}{\pi_A+\pi_G} & (a_1-a_2)\pi_T \\
(a_1-a_2)\pi_A & a_1\pi_C + a_2\frac{\pi_C(\pi_A+\pi_G)}{\pi_C+\pi_T} - a_3\frac{\pi_C}{\pi_C+\pi_T} & (a_1-a_2)\pi_G & a_1\pi_T + a_2\frac{\pi_T(\pi_A+\pi_G)}{\pi_C+\pi_T} + a_3\frac{\pi_C}{\pi_C+\pi_T}
\end{pmatrix}, \tag{34} $$

where

$$ a_1 = a_1(t) = e^{0}, \quad a_2 = a_2(t) = e^{-vt}, \quad a_3 = a_3(t) = e^{-vt[(\pi_A+\pi_G) + k(\pi_C+\pi_T)]}, \quad a_4 = a_4(t) = e^{-vt[(\pi_C+\pi_T) + k(\pi_A+\pi_G)]}, \quad k = \frac{u}{v}. $$

Similarly

$$ \hat d = 1 - \sum_i \pi_i\,p_{ii}(2\hat t) = 1 - \sum_i \pi_i^2 - a_2(2\hat t)\left[\frac{(\pi_A^2+\pi_G^2)(\pi_C+\pi_T)}{\pi_A+\pi_G} + \frac{(\pi_C^2+\pi_T^2)(\pi_A+\pi_G)}{\pi_C+\pi_T}\right] - 2a_3(2\hat t)\frac{\pi_C\pi_T}{\pi_C+\pi_T} - 2a_4(2\hat t)\frac{\pi_A\pi_G}{\pi_A+\pi_G}, \tag{35} $$

where $\hat d$ is calculated as for J-C [Eq. (27)], and $\hat t$ is calculated numerically.

Computer methods and theory

The outline of the system concept described in this study is a proposal of a system with which it would be possible to plan, develop, and further adapt life on an uninhabited planet. As the evolutionary model of life on Earth (with a lengthy period of development and the difficult conditions under which it occurred) makes it impossible to plan and further control the development of life, it was assumed that the system would implement the genetic solutions known on Earth, but the stage of development and testing of organisms would take place in the virtual world. The inheritance of characteristics and further adaptation take place already at the stage of implementation of life. Considering the complexity of life and of the conditions for its origins on Earth, only a simplified model was planned, facilitating the project and its virtual implementation. The main objective of the project is a full implementation of the model in three separate stages:

1. The stage of life planning and preparation of appropriate environmental tools;
2. The stage of life development and adaptation according to the planned assumptions;
3. The stage of implementation, with account taken of the mechanisms of adaptation to the variable conditions of the natural environment.

Stage 1 assumes the possibility of planning and simulating the origination of life and the natural environment, with the use of known elements and mechanisms of life occurring in nature, although certain modifications are possible. At this stage, assumptions and appropriate tools are prepared. It is believed that this stage will allow the planning of life as a complete set of independent systems meeting the objectives of the origination and sustainment of life, as a whole, in a complete form.

The purpose of stage 2 is to test and adapt the organisms designed at stage 1 to the planned conditions of their functioning. This stage allows for distribution of the information provided at stage 1, useful for each of the environmental niches. It is expected that adaptation will take place with the use of the virtual model of the natural environment.

Stage 3 is related to the implementation of the designed, adapted, and tested organisms, as a realisation of the complete natural environment that allows for the sustainment of life as a whole. This stage should facilitate the implementation of mechanisms of adaptation to changes in the natural environment.

The aim of this study is the creation of a tool showing the possibilities for simulation of stage 2. The simulation is regarded in the context of possible execution and implementation of gene substitution methods. According to the evolutionary model of creation and development of life on Earth, organisms developed from the least complex to more complex forms, through a series of changes in the process of continuous adaptation to the changing environmental conditions. As the processes occurring over a long period of time can be extremely difficult to reconstruct, it is assumed that a designer of a biological system will prepare, implement, and control the processes of life origination and development. This model uses an analogy with computer systems, where hardware and software are distinguished.
The hardware will represent genes coding appropriate proteins. The software will represent information related to the intensity of expression of individual genes or complexes thereof. The model used in this case will feature the distribution and modification of hardware information (selection and modification of individual genes), from a system with the highest level of hardware complexity to subsystems performing specialised tasks. The manner in which the hardware is used is controlled by the software; in our model, this is the information responsible for the gene expression level. In order for the adaptation process to take place in an appropriately short time span, both the distribution and the modification of individual genes should be virtual, to simulate the occurrence of complex environmental conditions for individual stages of adaptation. The process of gene distribution is conditional on gene expression: the chances of further distribution increase for genes showing expression and decrease for genes not showing it. It is assumed that in the adaptation process, an important role is played by appropriate receptors activating individual genes, depending on the environmental conditions established for the tests. This model assumes that the hardware (ready sets of genes) is not only distributed but also modified, i.e. adapted to the modified gene complex and to the level of gene expression. The level of expression of particular genes in the adaptation process is crucial for the complex of used genes. In order to simplify the model, it is presumed that the level of modification of particular genes will depend mainly on the level of their reduction in the adaptation process. In the application simulating stage 2, it is assumed that the original set of genes will comprise N=2000 systems (sets of dependent genes) with M=1000 genes each (Figure 3).
Each gene has a fixed expression value, determined at 1000 (the UPE value; the abbreviation is taken from the similarity to upstream promoter elements). The relevant environmental conditions for individual tests are simulated with a predetermined template containing a map of systems and genes, by assigning fixed expression values (UPE values). The adaptation process is conducted by adding the difference between the template UPE value and the system UPE value with a determined weight:

$$ \mathrm{tab\_gen\_UPE} = \mathrm{tab\_gen\_UPE} + \mathrm{sigma}\,(\mathrm{tab\_templ\_UPE} - \mathrm{tab\_gen\_UPE}), \tag{36} $$

where tab_gen_UPE is the UPE value array for the adapted gene array, tab_templ_UPE is the UPE value array for the template, and sigma is an adaptation coefficient within the range (0,1). A function for the adaptation assessment is calculated, summed over the elements of the arrays:

$$ E = \sum (\mathrm{tab\_gen\_UPE} - \mathrm{tab\_templ\_UPE})^2. \tag{37} $$

The template contains genes specific for a particular organism, as well as genes of common systems that can be used by numerous organisms. Different examples of the adaptation process algorithms are possible; however, they will not be discussed in this paper, as the application is intended to illustrate the functioning of the assumed stage 2 model. A gradual reduction and changes in the intensity of the characteristics of the test environment at individual stages of adaptation (gene distribution) are simulated through a gradual reduction of systems and genes in the template. The template is modified in such a way as to ensure that an appropriate set of modified genes from the previous stage constitutes a basis for the further reduction and modification of genes at the next stage. This guarantees the continuity of distribution and the homology of the modified genes. In the test process, individual gene systems are combined into larger systems (similarly to biological systems, described as chromosomes; Figure 4).
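The adaptation rule of Eq. (36) and the assessment function of Eq. (37) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, E is assumed to be summed over all genes, the value sigma = 0.5 is chosen arbitrarily within (0,1), and the three template values are the example UPE values shown in Figure 3.

```python
def adapt_step(gen_upe, templ_upe, sigma):
    """Apply Eq. (36) once to every gene: move each UPE value towards the template."""
    if not 0.0 < sigma < 1.0:
        raise ValueError("sigma must lie in the open interval (0, 1)")
    return [g + sigma * (t - g) for g, t in zip(gen_upe, templ_upe)]


def adaptation_error(gen_upe, templ_upe):
    """Adaptation assessment E of Eq. (37), assumed here to be summed over all genes."""
    return sum((g - t) ** 2 for g, t in zip(gen_upe, templ_upe))


# Every gene starts at UPE = 1000, as stated in the text.
gen = [1000.0, 1000.0, 1000.0]
templ = [981.0, 138.0, 279.0]  # example template values taken from Figure 3
for step in range(5):
    print(step, round(adaptation_error(gen, templ), 2))
    gen = adapt_step(gen, templ, sigma=0.5)  # sigma = 0.5 is an assumption
```

Under this rule every difference shrinks by the factor (1 - sigma) per step, so E shrinks by (1 - sigma)^2 per step.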
It is also assumed that during the tests, it is possible to obtain favourable conditions for a combination of a complex of systems created at a certain stage into a new system, giving a new functionality to an organism and allowing it to adapt to the environmental conditions. In the application, the chromosome combination is simulated by a random combination of several chromosomes, with a concurrent random modification of the template (UPE value) for genes in the new chromosome (the process of adaptation of the new system will be aimed at the development of a new functionality).

[Figure 3: Model of stage II. Panels: Original system set (System 1, System 2, System 3, ..., System N); Template (System 1, System 2, System 3, ..., System N; example values UPE=981, UPE=138, UPE=279); Destination system set (System 1, System 2, System 3, ..., System L).]

[Figure 4: Gene and chromosome distribution scheme. Systems 1 to L are grouped into chromosomes K1 to KP on levels 1 to P.]

In the model presented above, the level of modification depends primarily on the level of gene reduction. Additions, deletions, and substitutions depend on the progress of the adaptation process. Let us assume that the simulations of gene reduction, besides certain similarities to the genetic systems encountered in nature, should take into account the evolutionary model of gene substitution used in the construction of the phylogenetic tree. For this purpose, we will conduct some simple computational experiments, using substitution procedures and the J-C, Felsenstein, and HKY models (see Introduction). The general algorithm formula used for simulation purposes is presented in Figure 5.
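The loop of Figure 5 (for every cycle, visit every DNA site and substitute the base with the model probability) might look as follows for the J-C model. This is a sketch under assumptions, not the application used in the study: the function names are hypothetical, alpha = 2e-3 and 50 cycles are chosen only to keep the run short, and the J-C estimate of alpha·t from the fraction d of differing sites, -(1/8) ln(1 - 4d/3), is used to read the elapsed time back from two simulated descendants.

```python
import math
import random

BASES = "ACGT"


def substitute_cycle(seq, alpha, rng):
    """One cycle: each site changes to one of the three other bases with probability 3*alpha."""
    out = []
    for base in seq:
        if rng.random() < 3.0 * alpha:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return out


def evolve(seq, alpha, cycles, rng):
    """Run the outer 'loop cycles' of Figure 5."""
    for _ in range(cycles):
        seq = substitute_cycle(seq, alpha, rng)
    return seq


def jc_time(seq1, seq2):
    """J-C estimate of alpha*t from the fraction d of differing sites (valid for d < 0.75)."""
    d = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
    return -math.log(1.0 - 4.0 * d / 3.0) / 8.0


rng = random.Random(1)
ancestor = [rng.choice(BASES) for _ in range(10_000)]  # uniform stationary distribution
alpha, cycles = 2e-3, 50  # assumed values, larger than in the paper to shorten the run
seq1 = evolve(ancestor, alpha, cycles, rng)
seq2 = evolve(ancestor, alpha, cycles, rng)
print("estimated alpha*t:", jc_time(seq1, seq2), "applied:", alpha * cycles)
```

Both descendants evolve independently from the same ancestor, which corresponds to the one step model with two branches of equal length.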
The simulation was run for three different sets of parameters, to verify the operation of the model, using each of the substitution methods. For the purposes of this simulation, a randomly generated sequence of 10,000 nucleotides, with a stationary distribution appropriate for the model, was used. In the first experiment, a simulation was run for the J-C model, with the following parameters: $\alpha$=2e-5, $\pi_{A,C,G,T}$=0.25, cycle time $t_c$=1. As the J-C model is a special case as compared with the other models, for verification purposes it was assumed that $u_{felsen} = v_{HKY} = \alpha/\pi_{1,2,3,4} = \alpha/0.25$ and $k=1$ (see Introduction). In the tests, a simple phylogenetic tree was used with normalised branch lengths of 5 (Figure 6), which, multiplied by the number of cycles and the cycle time, correspond to the evolutionary times for the model concerned (actual time). The actual time $t_a$ is set as

$$ t_a = t\; t_c\; n_{cycles}, \tag{38} $$

where $t$ is the length of branches, $t_c$ is the time of one cycle, and $n_{cycles}$ is the number of cycles. In the second step, the simulation procedure was tested for the Felsenstein model, with a change in the probabilities of occurrence of individual nucleotides. Only the parameter values $\pi_A$=0.1, $\pi_C$=0.6, $\pi_G$=0.1, and $\pi_T$=0.2 were changed. In the third test, the HKY model was used; the Felsenstein model parameters were retained, and the transition/transversion ratio was changed to $k$=100.

The above gene substitution algorithm shows the course of mutation unrestricted by the genetic selection pressure, allowing for a change of any DNA position strictly in accordance with the planned substitution model. However, under natural conditions, such a substitution with respect to the coding regions occurs only in synonymous mutations. With respect to mutations that change the "sense" of the coded protein, a restriction occurs, mainly in connection with the preservation of its functionality.
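The nesting of the three models used in these tests can be checked numerically. The sketch below builds an HKY-type intensity matrix (rate u·π_j for transitions A↔G and C↔T, v·π_j for transversions), evaluates P(t) = e^{Qt} with a truncated power series, and compares the result with the closed forms of the J-C model, p_ii(t) = 0.25 + 0.75 e^{-4αt}, and the Felsenstein model, p_ij(t) = π_j(1 - e^{-ut}). The function names, the choice t = 5000, and the setting u = v = α/0.25 are assumptions made for illustration; the series is adequate only for small Qt.

```python
import math

ORDER = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky_rate_matrix(pi, u, v):
    """Intensity matrix Q: u*pi_j for transitions, v*pi_j for transversions."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(ORDER):
        for j, b in enumerate(ORDER):
            if i != j:
                q[i][j] = (u if (a, b) in TRANSITIONS else v) * pi[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0 here, so each row sums to zero
    return q


def mat_mul(a, b):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transition_matrix(q, t, terms=30):
    """P(t) = exp(Qt) from the power series sum_k (Qt)^k / k! (small Qt only)."""
    qt = [[x * t for x in row] for row in q]
    result = [[float(i == j) for j in range(4)] for i in range(4)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        result = [[r + s for r, s in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result


alpha, t = 2e-5, 5000.0          # alpha as in the first experiment; t is an assumption
u = v = alpha / 0.25             # k = u/v = 1 reduces HKY to Felsenstein
p_jc = transition_matrix(hky_rate_matrix([0.25] * 4, u, v), t)
print(p_jc[0][0], 0.25 + 0.75 * math.exp(-4 * alpha * t))

p_fe = transition_matrix(hky_rate_matrix([0.1, 0.6, 0.1, 0.2], u, v), t)
print(p_fe[0][1], 0.6 * (1 - math.exp(-u * t)))
```

A diagonalisation-based evaluation gives the same matrix; the series is used here only because it needs no external library.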
[Figure 5: An algorithm of the substitution simulation. Loop over cycles; loop over DNA sites; if a DNA base substitution is required with probability P, substitute the single base; end of loop over DNA sites; end of loop over cycles.]

[Figure 6: Tested tree, t=5 (2+2x3 and 5). Nodes DNA0, DNA1, DNA2, and DNA3; branch labels 2, 5, 3, and 3.]

In order to illustrate the conservative effect of the pressure to preserve the protein functionality, a similarity criterion was used for the coded amino acids, with the application of relevant biochemical properties, such as hydrophobicity, hydrophilicity, and mass of side chain, but also pK1 (constant of the -COOH group), pK2 (constant of the -NH3+ group), and pI (isoelectric point of the amino acid at 25 °C) [8-13]. According to this criterion, a change of a relevant nucleotide may occur if the similarity measure between the amino acid currently coded and the new one is less than the assumed value of the similarity coefficient multiplied by rand (see Figure 7). The rand randomness factor was introduced to illustrate the differences related to genetic and environmental variability (although a very similar effect can be obtained without the randomness factor, with an appropriately reduced value of the coefficient). The following amino acid similarity criterion was used [9, 11, 14, 15]:

$$ J_{i,j} = \frac{1}{k} \sum_{l=1}^{k} \left[ h_l(R_i) - h_l(R_j) \right]^2, \tag{39} $$

where $J_{i,j}$ is the similarity measure for amino acids $i$, $j$; $h_l(R_i)$ is any biochemical value describing amino acid $R_i$; and $k$ is the number of factors $h_l(R_i)$. Before using the relevant biochemical properties, their standardisation must be carried out according to the formula below, where $h_l^0$ denotes the value before standardisation:

$$ h_l(R_i) = \frac{h_l^0(R_i) - \frac{1}{20}\sum_{m=1}^{20} h_l^0(R_m)}{\sqrt{\frac{1}{20}\sum_{m=1}^{20} \left[h_l^0(R_m)\right]^2 - \left(\frac{1}{20}\sum_{m=1}^{20} h_l^0(R_m)\right)^2}}. \tag{40} $$

[Figure 7: An algorithm of the substitution simulation for a restricted model. Loop over cycles; loop over DNA sites; if a DNA base substitution is required with probability P and Similarity(aminoacid, new_aminoacid) is less than the similarity coefficient multiplied by rand, substitute the single base; end of loop over DNA sites; end of loop over cycles.]

The above evolutionary models, where uninhibited or limited nucleotide change occurs according to the set parameters, allow for the determination of the evolutionary times of the phylogenetic tree. In the gene reduction model, the tree will show the process of adaptation (similarly to the evolution process), where the length of individual branches will reflect the intensity of changes, irrespective of the period in which these changes took place. In this model, we assume that any of the so-called hardware changes are related to the construction, i.e. the set of genes used, whereas the so-called software changes determine their manner of use, i.e. expression. It is expected that in the adaptation process, the ultimate manner of use of the target set of genes is not as important as the determination of the hardware part. In this model, as at the stage of selection of the appropriate set of hardware, in the adaptation process (tests), a relevant system of genes is selected for the purpose of their appropriate further use, i.e. for the purpose of adjustment of appropriate gene systems and their expression. According to this model, at the stage of adaptation, the structure of coded proteins is adjusted to the new system of genes. Therefore, it is assumed that nucleotide substitutions may take place at the design stage, and their intensity depends on the changes in the structure of an organism at the adaptation stage, i.e. the number of reduced genes:

$$ t = c, \quad P(c) = e^{Qc}, \tag{41} $$

where $c$ is the number of reduced genes.

Results and discussion

Simulation of the evolution process

Test 1. Figure 8 shows the time line, depending on the number of cycles, for the phylogenetic tree for the J-C, Felsenstein, and HKY models.

[Figure 8: Time between respective branches of the tree (see Figure 6) as a function of the number of cycles for the J-C, Felsenstein, and HKY models.]
The time presented in the three implemented substitution methods fluctuates around the time applied in the simulation procedure (the real time was divided by the number of cycles in order to obtain a fixed value, as for the length of the tree branches in Figure 6). At the final stage, the indications become increasingly inaccurate; the probability d approximates 0.75 (Figure 9), showing minimal variations in the final phase. All models show the same value.

[Figure 9: d-Value between respective branches as a function of the number of cycles.]

Test 2. Following a change of the parameters of the distribution of particular nucleotides (see "Computer methods and theory" section), correct indications are registered for the Felsenstein and HKY models (Figures 10 and 11). The probability d (Figure 12) for the preset parameters $\pi_{A,C,G,T}$ approximates $d = 1 - \sum_i \pi_i^2 = 0.58$ (see "Test 1").

[Figure 10: Time between respective branches for the Felsenstein and HKY models, as a function of the number of cycles.]

[Figure 11: Time between respective branches for the J-C model, as a function of the number of cycles.]

[Figure 12: d-Value between respective branches, as a function of the number of cycles.]

Test 3. The last test is carried out following a change of parameter k (see "Computer methods and theory" section). Only the diagram for the HKY model shows the correct values (Figures 13-15). The limit of the probability d as t increases approximates 0.58, as for the Felsenstein model, but it is approached much more slowly [see Eq. (35) and Figure 16].

[Figure 13: Time between respective branches for the HKY model, as a function of the number of cycles.]

[Figure 14: Time between respective branches for the J-C model, as a function of the number of cycles.]

[Figure 15: Time between respective branches for the Felsenstein model, as a function of the number of cycles.]

Gene reduction model simulation

In the gene reduction model, the HKY substitution model was used (the tree simulated in "Test 3"). It was assumed that the original number of genes was 2e6 (2000 systems with 1000 genes each). Next, the system was reduced to 12e5 genes, and divided into 20,000 and 30,000 gene systems. The second branch was reduced directly to 25,000 genes. Thus, the value of reduction C is

$$ C = 5\,\frac{N_S - N_R}{N_S}. \tag{42} $$

The system was normalised in such a manner as to ensure that the initial value is 5, as for the tree in Figure 6. The values of parameters C1, C2, C3, and C4 are 2.95, 2.92, 4.93, and 2, respectively (see Figure 17). In this simulation, the estimated values C12, C13, and C23 for the HKY model were 3.13, 5.08, and 5.05, respectively.

Simulation of the functionality preservation pressure

As with all mutations of the coding sequence nucleotides, changing the amino acids in the protein chain may result in a change or a loss of the protein functionality. In this simulation, the uninhibited change of nucleotides will be limited mainly to synonymous changes or changes to amino acids with similar biochemical properties. Unlike with other tests, the plan or the lengths of the branches of the phylogenetic tree will not be determined (although a discussion could be interesting), and the main focus will be on the determination of the number of nucleotide substitutions in the tested protein chain.

From the designer's point of view, if experimental methods were used, the designed model of reduction in the adaptation process would be time consuming and would probably require enormous resources allowing for experiments to eliminate the relevant genes. It will be better to use virtual methods, and following elimination, to carry out implementation and adaptation in the real world. Therefore, it would be necessary to simulate substitution of genes that may be the products of these two processes.
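The restriction discussed in this section, in which a substitution is accepted only when the newly coded amino acid stays close to the current one, might be sketched as follows. Several points are assumptions and not taken from the study: only one property is used (the Kyte-Doolittle hydropathy index, as a stand-in for the set of biochemical properties), the standardisation is taken to be the usual zero-mean, unit-variance scaling over the 20 amino acids, synonymous changes are always accepted, and all function names are hypothetical.

```python
import math
import random

# Kyte-Doolittle hydropathy index, used here as a single stand-in property (k = 1).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardise(raw):
    """Scale one property to zero mean and unit population variance over 20 amino acids."""
    mean = sum(raw.values()) / len(raw)
    var = sum(x * x for x in raw.values()) / len(raw) - mean * mean
    return {aa: (x - mean) / math.sqrt(var) for aa, x in raw.items()}


def similarity(aa1, aa2, properties):
    """Similarity measure J: mean squared difference over the k standardised properties."""
    return sum((p[aa1] - p[aa2]) ** 2 for p in properties) / len(properties)


def substitution_allowed(aa_old, aa_new, properties, coefficient, rng):
    """Restricted-model test: accept if J < coefficient * rand (synonymous always passes)."""
    if aa_old == aa_new:
        return True  # assumption: synonymous substitutions are never blocked
    return similarity(aa_old, aa_new, properties) < coefficient * rng.random()


props = [standardise(HYDROPATHY)]
rng = random.Random(0)
print("J(I,L) =", similarity("I", "L", props))
print("J(I,R) =", similarity("I", "R", props))
print("I -> L allowed:", substitution_allowed("I", "L", props, 0.3, rng))
```

A smaller J means more similar amino acids, so lowering the coefficient makes the filter more conservative; with a coefficient of 0 only synonymous changes pass.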
It was assumed that unrestricted substitutions alone would occur with the probability given by the assumed model; however, the process is disturbed by the functionality preservation pressure. For this purpose, tests were conducted to measure only the number of the relevant substitutions depending on the similarity coefficient, the number of cycles in relation to the set similarity coefficient, and for synonymous substitutions. As in study [3], tables were provided for position 3, and for positions 1 and 2.

[Figure 16: d-value between respective branches, as a function of the number of cycles.]
[Figure 17: Tree for the gene reduction model, with branches C1, C2, C3, and C4.]

Test determining the relation of the number of substitutions and the variable value of the similarity coefficient for a fixed number of cycles

The test was carried out for the HKY model, for the following set of parameters: k = 100, rate parameter 5e-7, and $\pi_A, \pi_C, \pi_G, \pi_T$ = 0.1, 0.6, 0.1, and 0.2, respectively. Substitutions were tested for 10 measurements, for a growing value of the similarity coefficient and a fixed number of cycles. Table 1 shows that transitions for position 3 are at a similar level for substitutions between DNA chains 1 and 3, 2 and 3, and 1 and 2. The other substitutions, for similarity coefficient 0.1, occur sporadically. Results of measurements across the entire range of values for transitions in position 3, and in positions 1 and 2, are shown in Figures 18 and 19. Transitions in position 3 remain at a similar level, slightly above 100. For positions 1 and 2, the number of transitions grows with the similarity coefficient from approximately 0 to more than 200.
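The counts reported in the substitution tables separate transitions from transversions and codon positions 1 and 2 from position 3. A minimal sketch of such a count is given below; it is not the author's program, the function name is hypothetical, and it assumes that both sequences are aligned, equally long, and that the reading frame starts at the first nucleotide.

```python
PURINES = {"A", "G"}

def count_substitutions(seq1, seq2):
    """Count transitions and transversions, split by codon position (1-2 vs 3)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    counts = {
        "transitions": {"pos12": 0, "pos3": 0},
        "transversions": {"pos12": 0, "pos3": 0},
    }
    for idx, (a, b) in enumerate(zip(seq1, seq2)):
        if a == b:
            continue
        # A transition keeps the chemical class (purine<->purine or
        # pyrimidine<->pyrimidine); every other change is a transversion.
        kind = "transitions" if (a in PURINES) == (b in PURINES) else "transversions"
        pos = "pos3" if idx % 3 == 2 else "pos12"
        counts[kind][pos] += 1
    return counts

example = count_substitutions("ATGCCA", "CTACTA")
print(example)
```

In the example, A to C at the first position is a transversion in positions 1 and 2, G to A at the third position is a transition in position 3, and C to T at the fifth position is a transition in positions 1 and 2.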
Test determining the relation of the number of substitutions and the number of cycles for synonymous substitutions (similarity coefficient=0)

The occurrence of synonymous substitutions in connection with an increase in the number of cycles was tested for the following set of parameters: k = 1e4, rate parameter 1e-7, and $\pi_A = \pi_C = \pi_G = \pi_T = 0.25$. Figure 20 shows that with an increase in the number of cycles, the number of transitions grows to approximately 100 (the number of transitions for positions 1 and 2 is very small, and therefore it was not shown). Similarly, for parameters k = 1, rate parameter 5e-4, and $\pi_A = \pi_C = \pi_G = \pi_T = 0.25$, and a large number of cycles, the number of transitions for position 3 is approximately 100 (Table 2).

Table 1: Transversions and transitions are presented in the lower left part and the upper right part, respectively, for sequence of 1000 nucleotides.

        DNA1      DNA2      DNA3
DNA1    0 (0)     2 (2)     1 (4)
DNA2    2 (100)   0 (0)     1 (6)
DNA3    4 (118)   6 (112)   0 (0)

The first numbers relate to substitutions in the 1st and 2nd positions of codon; the numbers in parentheses relate to the 3rd position of codon.

[Figure 18: Number of transitions between respective branches of the tree (see Figure 5) in the 3rd position as a function of the similarity coefficient.]
[Figure 19: Number of transitions between respective branches of the tree in the 1st and the 2nd position as a function of the similarity coefficient.]
[Figure 20: Number of transitions in the 3rd position of codon, as a function of the number of cycles.]

Table 2: Table of substitutions for a new set of parameters (see the text).

        DNA1      DNA2      DNA3
DNA1    -         15 (94)   12 (96)
DNA2    8 (97)    -         17 (100)
DNA3    9 (104)   11 (103)  -

Test determining the relation of the number of substitutions and the number of cycles for a fixed value of the similarity coefficient

The following set of parameters was adopted for the simulation, for the fixed similarity coefficient: k = 1e4, rate parameter 1e-7, $\pi_A = \pi_C = \pi_G = \pi_T = 0.25$, and similarity coefficient 0.3. The number of transitions for position 3, and for positions 1 and 2, between DNA chains 1 and 3, and 2 and 3, increases with the number of cycles and reaches approximately 160 and 120, respectively (Figures 21 and 22).

[Figure 21: Number of transitions for the 3rd position as a function of the number of cycles; series Gen13, Gen23, Gen12.]
[Figure 22: Number of transitions for positions 1 and 2 as a function of the number of cycles; series Gen13, Gen23, Gen12.]

Simulation of mitochondrial DNA substitution

This simulation was run as a combination of two processes. The first one was related to the adjustment stage, and the second one to the adaptation stage in the process of implementation. It was assumed that the nucleotide substitution in the implementation process would be consistent with the model in "Test determining the relation of the number of substitutions and the number of cycles for a fixed value of the similarity coefficient". The process related to the adjustment stage is unknown and was created ad hoc (see below). The algorithm (Figure 23) uses two amino acid chains with the same initial sequence corresponding to the genetic information of the nucleotide chain. The first amino acid chain serves only as a template simulating the direction of changes. The amino acids in the template chain are substituted at random, with the fixed probability p.
The nucleotide chain substitution is done in such a manner as to ensure that the similarity of the template amino acid to the new amino acid coded by a codon of the nucleotide chain is lower than the preset similarity coefficient, and to meet the following minimisation criterion:

N N 80 - 30 = u 3 - 1 + N v 3 2 + v 12 - , N u 12 N u 12 80 (43)

where $N_{u3}$ is the number of transversions in position 3, $N_{u12}$ is the number of transversions in positions 1 and 2, $N_{v3}$ is the number of transitions in position 3, and $N_{v12}$ is the number of transitions in positions 1 and 2. It was assumed that the ratio $N_{v12}/N_{u12}$ should equal $(80-30)/80$ (for humans, after subtraction of the fixed value of approximately 30 from the maximum value $N_{v12}$, to the maximum value $N_{u12}$, i.e. approximately 80; see Table 1 from Ref. [3]). For the purpose of value calculation, the indicative values of hardware distance between tree nodes (Figure 24 in this study and Figure 1 from Ref. [3]) were used:

1-2        3.90
2-3        20.67
3-4        3.12
4-5        4.68
5-6        1.17
6-man      2.73
6-chimp    2.73
5-gorilla  3.9
4-orang    8.58
3-gibbon   11.70
2-bovine   32.37
1-mouse    36.27

The results of the algorithm are shown in Table 3.

[Figure 23: Algorithm of the simulation of mtDNA substitutions at stage II. Flowchart: loop over hardware distance; loop over protein size; decision "Require amino acid substitution with p prob."; decision "Similarity(amino acid, new_aminoacid) < preset value"; "Choose amino acid codon with the best rule fulfilment"; end of loop over DNA size; "Substitute amino acid with the best codon selected"; end of loop over distance.]

[Figure 24: Hypothetical tree of mammalians with indicative branch lengths. Tips: Man, Mouse, Bovine, Gibbon, Orangutan, Gorilla, Chimpanzee.]

Table 3: Substitution table for simulations for mammalians of stage II.

          Mouse     Bovine    Gibbon    Orang     Gorilla   Chimp     Man
Mouse     -         78 (68)   79 (70)   79 (68)   79 (68)   78 (68)   77 (67)
Bovine    62 (2)    -         73 (65)   72 (64)   72 (64)   71 (63)   70 (63)
Gibbon    65 (3)    57 (1)    -         31 (29)   28 (27)   28 (26)   27 (25)
Orang     65 (3)    57 (1)    25 (0)    -         21 (20)   20 (19)   19 (18)
Gorilla   64 (3)    57 (1)    23 (0)    17 (0)    -         7 (7)     6 (6)
Chimp     64 (3)    56 (1)    23 (0)    17 (0)    6 (0)     -         3 (3)
Man       63 (3)    56 (1)    22 (0)    16 (0)    5 (0)     3 (0)     -

The process of further adjustment (adaptation) at the implementation stage could take place, e.g. according to the algorithm in "Test determining the relation of the number of substitutions and the number of cycles for synonymous substitutions (similarity coefficient=0)". The sample distribution of transitions and transversions, from the same initial sequence, is shown in Table 4. The algorithm is used to calculate the distance from the amino acid template and the DNA determined at the adaptation stage. The final result of the application of both algorithms is shown in Table 5.

Table 4: Substitution table for simulations for mammalians of stage III.

          Mouse     Bovine    Gibbon    Orang     Gorilla   Chimp     Man
Mouse     -         0 (0)     0 (0)     0 (0)     0 (0)     0 (0)     0 (0)
Bovine    27 (62)   -         0 (0)     0 (0)     0 (0)     0 (0)     0 (0)
Gibbon    29 (55)   28 (57)   -         0 (0)     0 (0)     0 (0)     0 (0)
Orang     27 (58)   26 (60)   27 (57)   -         0 (0)     0 (0)     0 (0)
Gorilla   29 (61)   27 (64)   28 (59)   27 (59)   -         0 (0)     0 (0)
Chimp     27 (60)   26 (60)   27 (59)   24 (59)   27 (61)   -         0 (0)
Man       28 (58)   25 (59)   27 (56)   26 (61)   27 (61)   25 (61)   -

Table 5: Substitution table for simulations for mammalians after stage II and III.

          Mouse     Bovine    Gibbon    Orang     Gorilla   Chimp     Man
Mouse     -         78 (68)   79 (70)   79 (68)   79 (68)   78 (68)   77 (67)
Bovine    80 (49)   -         73 (65)   72 (64)   72 (64)   71 (63)   70 (63)
Gibbon    83 (47)   78 (49)   -         31 (29)   28 (27)   28 (26)   27 (25)
Orang     84 (46)   76 (48)   50 (48)   -         21 (20)   20 (19)   19 (18)
Gorilla   82 (50)   77 (50)   47 (52)   45 (54)   -         7 (7)     6 (6)
Chimp     82 (47)   75 (46)   49 (46)   42 (49)   32 (50)   -         3 (3)
Man       80 (50)   74 (51)   46 (49)   42 (52)   31 (56)   28 (52)   -

The phylogenetic tree for the DNA chains obtained as a result of these processes is shown in Figure 25. The tree was created with PhyML3.1, with the application of the maximum likelihood (ML) method for the HKY model, and Mega 5.2.2 [16]. The scale is only indicative, determined from the point of divergence between mouse and bovine [3].

[Figure 25: Phylogenetic tree for mammalians after stage II and III. Tips: Mouse, Bovine, Gibbon, Orangutan, Gorilla, Man, Chimpanzee; time scale in Myr.]

Conclusions

The concept described here should not only give the opportunity to design life, but should also use similar genetic processes. Its results should be similar to the system occurring in nature. The aim of this study was to show the similarities between these systems in terms of nucleotide substitution processes, and a similar distribution of substitutions was obtained for both systems.

For the purpose of the first simulation, it was assumed that one process was responsible for the distribution of substitutions at both stages (adaptation and implementation), which could be simulated with the Markov process. This simulation was run with the application of the HKY model. It is evident that, in this case, the assumption that the substitution occurred on average at a steady speed over time would be untrue, as it is assumed that mutation changes in genes depend on the new system of genes, and the adaptation process is virtual. The implementation process would be a continuation of the adaptation process.

For the purpose of the second simulation, both processes were divided. At the adaptation stage, the mutation changes depended on the changes in the gene system. An algorithm realising a simple dependency function with the minimisation criterion was used. Amino acid substitutions were simulated by random changes of the template amino acids, depending on the changes in the gene system.
The implementation stage is consistent with the inheritance mechanisms, and is based on the assumption of the occurrence of random substitutions, stable over time. For the nucleotide substitution simulation procedure, the HKY model was adopted. The algorithm restricting substitutions to synonymous substitutions, or to ones changing the amino acids within a specified scope of similarity, disturbs this process. The result of the simulation is a distribution of substitutions resembling the one in Ref. [3] (this study was not intended to show the exact similarity, and both the algorithm and certain input variable values were selected indicatively). The obtained nucleotide chains, with the distribution of substitutions presented in this paper, made it possible to create the phylogenetic tree. The ML method was used with the HKY model. Therefore, it was shown that the occurrence of nucleotide substitutions depending on the number of reduced genes may give results similar to those in evolutionary processes. The system requires a simulation of the life origination process at each of the specified stages, and presents an interesting concept of the design and creation of life with the use of a computer.

Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
Research funding: None declared.
Employment or leadership: None declared.
Honorarium: None declared.
Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication.

8. Cai Y, Zhou G, Chou K. Support vector machines for predicting membrane protein types by using functional domain composition. Biophys J 2003;84:3257–63.
9. Chou K, Cai Y. Prediction of protease types in a hybridization space. Biochem Biophys Res Commun 2006;339:1015–20.
10. Zhang G, Li H, Gao J, Fang B. Predicting lipase types by improved Chou's pseudo-amino acid composition. Protein Pept Lett 2008;15:1132–7.
11. Chou K, Cai Y. Predicting protein quaternary structure by pseudo amino acid composition. Proteins Struct Funct Genet 2003;53:282–9.
12. Krajewski Z, Tkacz E. Feature selection of protein structural classification using SVM classifier. Biocybern Biomed Eng 2013;33:47–61.
13. Krajewski Z, Tkacz E. Protein structural classification based on pseudo amino acid composition using SVM classifier. Biocybern Biomed Eng 2013;33:77–87.
14. Chou K. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins Struct Funct Bioinform 2001;43:246–55.
15. Chou K, Cai Y. Predicting subcellular localization of proteins by hybridizing functional domain composition and pseudo-amino acid composition. J Cell Biochem 2004;91:1197–1203.
16. Hall BG. Phylogenetic trees made easy. University of Rochester, Emeritus and Bellingham Research Institute. Sunderland, MA: Sinauer Associates Inc., 2001.

Bio-Algorithms and Med-Systems, de Gruyter

Phylogenetic aspects of the concept of intelligent life design

Bio-Algorithms and Med-Systems, Volume 11 (3) – Sep 1, 2015

Publisher
de Gruyter
Copyright
Copyright © 2015 by the
ISSN
1895-9091
eISSN
1896-530X
DOI
10.1515/bams-2015-0018

Abstract

This paper presents a new treatment of the molecular evolutionary model as a product of intelligent changes. Its aim is to obtain a life design system that draws on processes occurring in nature, regardless of explanations of the origins of life. The idea of intelligent design and molecular relationship is taken as the basic concept of the intelligent life design system, using some analogies from molecular evolutionary models. Three steps of the life design system are outlined; however, the main subject is an attempt to find similar effects in the design system processes and in the processes simulated with basic evolutionary substitution models: Jukes-Cantor; Felsenstein; and Hasegawa, Kishino, and Yano (HKY). An idea of gene reduction has been applied, from more complex biological systems (taking into account information density) to less complex, specialised biological systems. Two steps have been taken into consideration: a test stage in the virtual world and a finishing adaptation process after running the systems in the real world. Two algorithms have been applied. The first applies similarity related to a process of accommodation to required conditions in both the virtual and the real world. The second applies accommodation to required conditions separately (expressed as amino acid substitution) in the first step, using a convenient criterion, and further accommodation (similar to the observable one) in the real world. A phylogenetic tree, similar to a real one, has been calculated using the above method for mammals, for mtDNA, with the maximum likelihood method, and with the aid of PhyML for the HKY model. This paper is an introduction showing an aspect of the life design system related to phylogenetic relationships. *Corresponding author: Zbigniew Krajewski, Faculty of Biomedical Engineering, Silesian University of Technology, ul. 
Roosevelta 40, Zabrze 41-800, Poland, Phone: +48 32 2777463, E-mail: zkrajewski@polsl.pl

Introduction

In plain words, the evolutionary model of genetic change is based on the assumption of isolated incidents of modification of individual DNA bases, such as substitutions, deletions, and additions. It is assumed that both deletions and additions are exceptionally rare, due to the possibility of damage to the coding frame. More common are substitutions: either synonymous ones, which do not change the amino acid encoded in the protein, or, more rarely, ones that do change it, limited by how well the modified protein still matches its function. It is assumed that for individual genes (proteins), the genetic change process may be stable over time, and that on the basis of a specific number of changes in individual positions, it is possible to establish the topology and the divergence times of a hypothetical phylogenetic tree. Most often, for the purpose of establishing the evolutionary time, only substitutions are taken into account, whereas additions and deletions are omitted, using sequence-aligning methods.

The models used for the creation of phylogenetic trees are most often based on the Markov chain concept. A stochastic process is a family of random variables indexed by t, $\{X(t, \omega),\ t \in T,\ \omega \in \Omega\}$. In a Markov process, given the past of $X(t, \omega)$, the future of the process for $t > t_0$ is characterised by the current state $X(t_0, \omega)$ alone [1]. The Markov chain is a Markov process where $X(t, \omega) \in S$, for which the state space S is discrete. In other words, the Markov chain represents random transitions between discrete states, for which the future state depends only on the current state, which can be illustrated with the following equation [1]:

$$ P(X_{k+1} = j \mid X_k = i,\ X_{k-1} = i_1,\ X_{k-2} = i_2,\ \ldots) = P(X_{k+1} = j \mid X_k = i), \tag{1} $$

where $P(X_{k+1} = j \mid X_k = i) = p_{ij}$ is the probability of transition from state i to state j.
For gene substitution methods, Markov chains with a finite number of states are used, with the transition probabilities expressed in a matrix:

$$ P = \begin{pmatrix} p_{AA} & p_{AC} & p_{AG} & p_{AT} \\ p_{CA} & p_{CC} & p_{CG} & p_{CT} \\ p_{GA} & p_{GC} & p_{GG} & p_{GT} \\ p_{TA} & p_{TC} & p_{TG} & p_{TT} \end{pmatrix}, \tag{2} $$

where A, C, G, and T represent adenine, cytosine, guanine, and thymine, respectively. For two K-long sequences, $\binom{n+k-1}{k} = 10$ combinations with repetitions of nucleotide pairs are possible, where n=4, k=2, i.e. 4 for which no change occurs (A-A, C-C, G-G, T-T) and 6 for which a change occurs, independently of the direction (A-C, A-G, A-T, C-G, C-T, G-T). The probability of the occurrence of any of the above arrangements for two K-long nucleotide sequences is described with the multinomial distribution:

$$ P(k_{A-A}, k_{C-C}, k_{G-G}, k_{T-T}, k_{A-C}, k_{A-G}, k_{A-T}, k_{C-G}, k_{C-T}, k_{G-T}) = \frac{K!}{k_{A-A}!\,k_{C-C}!\,k_{G-G}!\,k_{T-T}!\,k_{A-C}!\,k_{A-G}!\,k_{A-T}!\,k_{C-G}!\,k_{C-T}!\,k_{G-T}!}\; q_{A-A}^{k_{A-A}}\, q_{C-C}^{k_{C-C}}\, q_{G-G}^{k_{G-G}}\, q_{T-T}^{k_{T-T}}\, q_{A-C}^{k_{A-C}}\, q_{A-G}^{k_{A-G}}\, q_{A-T}^{k_{A-T}}\, q_{C-G}^{k_{C-G}}\, q_{C-T}^{k_{C-T}}\, q_{G-T}^{k_{G-T}}, \tag{3} $$

where $k_{A-A}, k_{C-C}, k_{G-G}, k_{T-T}, k_{A-C}, k_{A-G}, k_{A-T}, k_{C-G}, k_{C-T}$, and $k_{G-T}$ are the numbers of pairs A-A, C-C, G-G, T-T, A-C, A-G, A-T, C-G, C-T, and G-T in the investigated sequences, and $q_{A-A}, \ldots, q_{G-T}$ are their respective probabilities of occurrence. By allocating numbers to the A, C, G, and T bases (1, 2, 3, and 4, respectively), it is possible to obtain the following likelihood function L [1]:

$$ L = \sum_{i<j} k_{ij} \ln q_{ij} + \sum_{i} k_{ii} \ln q_{ii}. \tag{4} $$

By maximising the likelihood function, the parameters $q_{ij}$ and $q_{ii}$ are determined for an established experimental result $k_{ij}$, $k_{ii}$. When solving the equation, we may estimate the parameters [1]

$$ \hat{q}_{ii} = \frac{k_{ii}}{K} \quad \text{and} \quad \hat{q}_{ij} = \frac{k_{ij}}{K}. \tag{5} $$

For a stationary probability distribution, the probability of occurrence of individual states may be estimated as follows:

$$ \hat{\pi}_i = \frac{2k_{ii} + \sum_{j \ne i} k_{ij}}{2K}. \tag{6} $$

Assuming that a substitution occurs at one step (Figure 1), the probabilities of occurrence of states ii and ij are

$$ q_{ii} = \pi_i p_{ii} \quad \text{and} \quad q_{ij} = \pi_i p_{ij} + \pi_j p_{ji}. \tag{7} $$

[Figure 1: One-step model. A hypothetical ancestor diverges into Sequence 1 and Sequence 2, each over time t.]

Assuming chain reversibility, i.e. $\pi_i p_{ij} = \pi_j p_{ji}$, we obtain

$$ \hat{p}_{ii} = \frac{\hat{q}_{ii}}{\hat{\pi}_i} = \frac{2k_{ii}}{2k_{ii} + \sum_{j \ne i} k_{ij}} \quad \text{and} \quad \hat{p}_{ij} = \frac{\hat{q}_{ij}}{2\hat{\pi}_i} = \frac{k_{ij}}{2k_{ii} + \sum_{j \ne i} k_{ij}} \quad \text{for } i \ne j. \tag{8} $$

Due to the very minor changes in the genetic material that occur in one generation, it is more convenient to use the continuous model. Using the Chapman-Kolmogorov equation as follows (Figure 2):

$$ \begin{aligned} P(x_2 = j \cap x_0 = i) &= \sum_n P(x_2 = j \cap x_1 = n \cap x_0 = i) \\ &= \sum_n P(x_2 = j \cap x_0 = i \mid x_1 = n)\, P(x_1 = n) \\ &= \sum_n P(x_2 = j \mid x_0 = i \cap x_1 = n)\, P(x_0 = i \mid x_1 = n)\, P(x_1 = n) \\ &= \sum_n P(x_2 = j \mid x_1 = n)\, P(x_0 = i \mid x_1 = n)\, P(x_1 = n) = \sum_n p_{nj}\, p_{ni}\, \pi_n. \end{aligned} \tag{9} $$

[Figure 2: Chapman-Kolmogorov equation scheme. Paths from $x_0 = i$ to $x_2 = j$ through the intermediate states $x_1 = 1, 2, 3, \ldots, n$.]

For the continuous Markov process

$$ q_{ij} = P(x_2 = j \cap x_0 = i) + P(x_2 = i \cap x_0 = j) = 2 \sum_n p_{nj}(t)\, p_{ni}(t)\, \pi_n \quad \text{for } i < j, \qquad q_{ii} = P(x_2 = i \cap x_0 = i) = \sum_n p_{ni}^2(t)\, \pi_n. \tag{10} $$

Using the reversibility property $\pi_n p_{ni} = \pi_i p_{in}$ and the Chapman-Kolmogorov equation

$$ p_{ij}(s + t) = \sum_n P(X_{s+t} = j \cap X_s = n \mid X_0 = i) = \sum_n P(X_{s+t} = j \mid X_s = n)\, P(X_s = n \mid X_0 = i) = \sum_n p_{in}(s)\, p_{nj}(t), \tag{11} $$

we obtain

$$ q_{ij} = 2 \sum_n p_{nj}(t)\, p_{ni}(t)\, \pi_n = 2 \sum_n \pi_i\, p_{in}(t)\, p_{nj}(t) = 2 \pi_i \sum_n p_{in}(t)\, p_{nj}(t) = 2 \pi_i\, p_{ij}(2t) \quad \text{for } i < j, \tag{12} $$

and similarly

$$ q_{ii} = \pi_i\, p_{ii}(2t). \tag{13} $$

Therefore, we may estimate the transition functions on the basis of the following correlations:

$$ \hat{p}_{ij}(2t) = \frac{\hat{q}_{ij}}{2\hat{\pi}_i} \tag{14} $$

and

$$ \hat{p}_{ii}(2t) = \frac{\hat{q}_{ii}}{\hat{\pi}_i}. \tag{15} $$

The probability of occurrence of any of the states ij with $i \ne j$ is determined on the basis of the sum of the probabilities $q_{ij}$,

$$ d = \sum_{i<j} q_{ij}(2t) = \sum_{i \ne j} P(X_0 = i \cap X_1 = j)(2t) = \sum_{i \ne j} \pi_i\, p_{ij}(2t). \tag{16} $$

With the assumption of reversibility of the Markov chain, $\pi_i p_{ij}(2t) = \pi_j p_{ji}(2t)$,

$$ d = 2 \sum_{i<j} \pi_i\, p_{ij}(2t) = 1 - \sum_i \pi_i\, p_{ii}(2t). \tag{17} $$

Using the Chapman-Kolmogorov equation, it is possible to define the derivative of the transition matrix P(t) (forward):

$$ P'(t) = \lim_{\Delta t \to 0} \frac{P(t + \Delta t) - P(t)}{\Delta t} = \lim_{\Delta t \to 0} \frac{P(t)\,P(\Delta t) - P(t)}{\Delta t} = P(t) \lim_{\Delta t \to 0} \frac{P(\Delta t) - P(0)}{\Delta t} = P(t)\,P'(0), \tag{18} $$

and $P'(t) = P'(0)\,P(t)$ (backward). By setting $Q = P'(0)$, the above equation can be written out as $P'(t) = Q\,P(t)$, where matrix Q is called the "intensity matrix of the time-continuous Markov chain."

The most commonly used gene substitution models are the Jukes-Cantor (J-C); Felsenstein; and Hasegawa, Kishino, and Yano (HKY) models. Below, transition matrices P are presented for these discrete-time models, and matrices Q for the relevant continuous models [1, 2]. For the J-C model, the probabilities of substitution between each type of nucleotide are the same, and are represented by the P matrix for the discrete model and Q for the time-continuous model as follows:

$$ P = \begin{pmatrix} 1-3\alpha & \alpha & \alpha & \alpha \\ \alpha & 1-3\alpha & \alpha & \alpha \\ \alpha & \alpha & 1-3\alpha & \alpha \\ \alpha & \alpha & \alpha & 1-3\alpha \end{pmatrix}, \quad Q = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}. \tag{19} $$

This simplest model requires the stationary distribution to be uniform; hence, the probability $\pi_i$ of occurrence of each nucleotide in the DNA chain is the same, 0.25. The P matrix represents the probability of substitutions of particular nucleotides in one step, while the Q matrix represents the parameters of a matrix differential equation [see Eq. (18)] for the time-continuous model, with a time-dependent solution derived below [see Eq. (22)].

A more convenient Felsenstein model allows a stationary distribution $\pi = [\pi_A, \pi_C, \pi_G, \pi_T]$ to be set up, and the P and Q matrices are as follows:

$$ P = \begin{pmatrix} 1-u\pi_{CTG} & u\pi_C & u\pi_G & u\pi_T \\ u\pi_A & 1-u\pi_{ATG} & u\pi_G & u\pi_T \\ u\pi_A & u\pi_C & 1-u\pi_{ACT} & u\pi_T \\ u\pi_A & u\pi_C & u\pi_G & 1-u\pi_{ACG} \end{pmatrix}, \quad Q = \begin{pmatrix} -u\pi_{CTG} & u\pi_C & u\pi_G & u\pi_T \\ u\pi_A & -u\pi_{ATG} & u\pi_G & u\pi_T \\ u\pi_A & u\pi_C & -u\pi_{ACT} & u\pi_T \\ u\pi_A & u\pi_C & u\pi_G & -u\pi_{ACG} \end{pmatrix}, \tag{20} $$

where

$$ \pi_{CTG} = \pi_C + \pi_T + \pi_G, \quad \pi_{ATG} = \pi_A + \pi_T + \pi_G, \quad \pi_{ACT} = \pi_A + \pi_C + \pi_T, \quad \pi_{ACG} = \pi_A + \pi_C + \pi_G. $$

The most flexible of these three models, the HKY model, was used to adjust the substitution model to the differences occurring between the transversion and transition probabilities [3]. The stationary distribution is as for the Felsenstein model; hence, the J-C model is a special case of the Felsenstein model (for a uniform stationary distribution), and the Felsenstein model is a special case of the HKY model (for equal probabilities of transversion and transition occurrence):

$$ P = \begin{pmatrix} 1-\pi_{CTG} & v\pi_C & u\pi_G & v\pi_T \\ v\pi_A & 1-\pi_{ATG} & v\pi_G & u\pi_T \\ u\pi_A & v\pi_C & 1-\pi_{ACT} & v\pi_T \\ v\pi_A & u\pi_C & v\pi_G & 1-\pi_{ACG} \end{pmatrix}, \quad Q = \begin{pmatrix} -\pi_{CTG} & v\pi_C & u\pi_G & v\pi_T \\ v\pi_A & -\pi_{ATG} & v\pi_G & u\pi_T \\ u\pi_A & v\pi_C & -\pi_{ACT} & v\pi_T \\ v\pi_A & u\pi_C & v\pi_G & -\pi_{ACG} \end{pmatrix}, \tag{21} $$

where, for this model,

$$ \pi_{CTG} = u\pi_G + v(\pi_C + \pi_T), \quad \pi_{ATG} = u\pi_T + v(\pi_A + \pi_G), \quad \pi_{ACT} = u\pi_A + v(\pi_C + \pi_T), \quad \pi_{ACG} = u\pi_C + v(\pi_A + \pi_G). $$

For individual models, it is necessary to obtain a solution to the matrix differential equation in the form $P(t) = e^{Qt}$. With the diagonalisation of matrix Q [4–7], $Q = UDU^{-1}$, and using the expansion $e^{Qt} = \sum_{k=0}^{\infty} \frac{(Qt)^k}{k!}$, we obtain P(t) as follows:

$$ P(t) = e^{Qt} = U e^{Dt} U^{-1}. \tag{22} $$

Therefore, it is necessary to diagonalise matrix Q, calculating the eigenvalues for matrix D and the eigenvectors for matrices U and $U^{-1}$. Having determined the eigenvalues and eigenvectors for the matrices in each model, we obtain [1] for the J-C model:

$$ U = \begin{pmatrix} 1 & -1 & -1 & -1 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix}, \quad D = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & -4\alpha & 0 & 0 \\ 0 & 0 & -4\alpha & 0 \\ 0 & 0 & 0 & -4\alpha \end{pmatrix}, \quad U^{-1} = \begin{pmatrix} 0.25 & 0.25 & 0.25 & 0.25 \\ -0.25 & 0.75 & -0.25 & -0.25 \\ -0.25 & -0.25 & 0.75 & -0.25 \\ -0.25 & -0.25 & -0.25 & 0.75 \end{pmatrix}. \tag{23} $$

Calculating the respective probabilities for the P(t) matrix [Eq. (22)], with the use of the above matrices [Eq. (23)], we obtain

$$ p_{ij}(t) = 0.25 - 0.25\,e^{-4\alpha t} \quad \text{for } i \ne j \quad \text{and} \quad p_{ii}(t) = 0.25 + 0.75\,e^{-4\alpha t}, \tag{24} $$

where

$$ d = 1 - \sum_i \pi_i\, p_{ii}(2t) = 1 - \frac{1}{4} - \frac{3}{4} e^{-8\alpha t} = \frac{3}{4} - \frac{3}{4} e^{-8\alpha t}, \tag{25} $$

i.e.

$$ \alpha \hat{t} = -\frac{1}{8} \ln\left(1 - \frac{4}{3}\hat{d}\right), \tag{26} $$

where $\hat{d}$ is estimated as follows:

$$ \hat{d} = \sum_i \sum_{j<i} \frac{k_{ij}}{K}. \tag{27} $$

For the Felsenstein model:

$$ U = \begin{pmatrix} 1 & -\frac{\pi_C}{\pi_A} & -\frac{\pi_G}{\pi_A} & -\frac{\pi_T}{\pi_A} \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix}, \quad D = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & -u & 0 & 0 \\ 0 & 0 & -u & 0 \\ 0 & 0 & 0 & -u \end{pmatrix}, \quad U^{-1} = \begin{pmatrix} \pi_A & \pi_C & \pi_G & \pi_T \\ -\pi_A & 1-\pi_C & -\pi_G & -\pi_T \\ -\pi_A & -\pi_C & 1-\pi_G & -\pi_T \\ -\pi_A & -\pi_C & -\pi_G & 1-\pi_T \end{pmatrix}. \tag{28} $$

Calculating the respective probabilities for the P(t) matrix, we obtain [1]

$$ p_{ij}(t) = \pi_j \left(1 - e^{-ut}\right) \quad \text{for } i \ne j \quad \text{and} \quad p_{ii}(t) = \pi_i + (1 - \pi_i)\, e^{-ut}. \tag{29} $$

The probability of occurrence of any of the ij states is

$$ \hat{d} = 1 - \sum_i \pi_i\, p_{ii}(2\hat{t}) = 1 - \sum_i \pi_i^2 - e^{-2u\hat{t}} \sum_i \pi_i (1 - \pi_i), \tag{30} $$

$$ u\hat{t} = -\frac{1}{2} \ln\left(1 - \frac{\hat{d}}{2 \sum_i \sum_{j<i} \pi_i \pi_j}\right), \tag{31} $$

where $\hat{d}$ is calculated as for J-C [Eq. (27)].

For the HKY model, the respective matrices are as follows:

$$ U = \begin{pmatrix} 1 & -\frac{\pi_C + \pi_T}{\pi_A + \pi_G} & 0 & -\frac{\pi_G}{\pi_A} \\ 1 & 1 & -\frac{\pi_T}{\pi_C} & 0 \\ 1 & -\frac{\pi_C + \pi_T}{\pi_A + \pi_G} & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}, \quad D = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & -v & 0 & 0 \\ 0 & 0 & -v(\pi_A + \pi_G) - u(\pi_C + \pi_T) & 0 \\ 0 & 0 & 0 & -v(\pi_C + \pi_T) - u(\pi_A + \pi_G) \end{pmatrix}, \tag{32} $$

$$ U^{-1} = \begin{pmatrix} \pi_A & \pi_C & \pi_G & \pi_T \\ -\pi_A & \frac{\pi_C(\pi_A + \pi_G)}{\pi_C + \pi_T} & -\pi_G & \frac{\pi_T(\pi_A + \pi_G)}{\pi_C + \pi_T} \\ 0 & -\frac{\pi_C}{\pi_C + \pi_T} & 0 & \frac{\pi_C}{\pi_C + \pi_T} \\ -\frac{\pi_A}{\pi_A + \pi_G} & 0 & \frac{\pi_A}{\pi_A + \pi_G} & 0 \end{pmatrix}, \tag{33} $$

where P(t) below is a probability transition matrix obtained as a solution of the differential equation, derived with the eigenvalue method [see Eqs. (22), (32), and (33)]. Writing $\pi_R = \pi_A + \pi_G$ and $\pi_Y = \pi_C + \pi_T$ for brevity,

$$ P(t) = \begin{pmatrix} a_1\pi_A + a_2\frac{\pi_A \pi_Y}{\pi_R} + a_4\frac{\pi_G}{\pi_R} & (a_1 - a_2)\pi_C & a_1\pi_G + a_2\frac{\pi_G \pi_Y}{\pi_R} - a_4\frac{\pi_G}{\pi_R} & (a_1 - a_2)\pi_T \\ (a_1 - a_2)\pi_A & a_1\pi_C + a_2\frac{\pi_C \pi_R}{\pi_Y} + a_3\frac{\pi_T}{\pi_Y} & (a_1 - a_2)\pi_G & a_1\pi_T + a_2\frac{\pi_T \pi_R}{\pi_Y} - a_3\frac{\pi_T}{\pi_Y} \\ a_1\pi_A + a_2\frac{\pi_A \pi_Y}{\pi_R} - a_4\frac{\pi_A}{\pi_R} & (a_1 - a_2)\pi_C & a_1\pi_G + a_2\frac{\pi_G \pi_Y}{\pi_R} + a_4\frac{\pi_A}{\pi_R} & (a_1 - a_2)\pi_T \\ (a_1 - a_2)\pi_A & a_1\pi_C + a_2\frac{\pi_C \pi_R}{\pi_Y} - a_3\frac{\pi_C}{\pi_Y} & (a_1 - a_2)\pi_G & a_1\pi_T + a_2\frac{\pi_T \pi_R}{\pi_Y} + a_3\frac{\pi_C}{\pi_Y} \end{pmatrix}, \tag{34} $$

where

$$ a_1 = a_1(t) = e^{0}, \quad a_2 = a_2(t) = e^{-vt}, \quad a_3 = a_3(t) = e^{-vt[(\pi_A + \pi_G) + k(\pi_C + \pi_T)]}, \quad a_4 = a_4(t) = e^{-vt[(\pi_C + \pi_T) + k(\pi_A + \pi_G)]}, \quad k = \frac{u}{v}. $$

Similarly

$$ \hat{d} = 1 - \sum_i \pi_i\, p_{ii}(2\hat{t}) = 1 - \sum_i \pi_i^2 - a_2(2\hat{t}) \left[ (\pi_A^2 + \pi_G^2)\frac{\pi_C + \pi_T}{\pi_A + \pi_G} + (\pi_C^2 + \pi_T^2)\frac{\pi_A + \pi_G}{\pi_C + \pi_T} \right] - 2a_3(2\hat{t})\frac{\pi_C \pi_T}{\pi_C + \pi_T} - 2a_4(2\hat{t})\frac{\pi_A \pi_G}{\pi_A + \pi_G}, \tag{35} $$

where $\hat{d}$ is calculated as for J-C [Eq. (27)], and $\hat{t}$ is calculated numerically.

Computer methods and theory

The outline of the system concept described in this study is a proposal of a system with which it would be possible to plan, develop, and further adapt life on an uninhabited planet. As the evolutionary model of life on Earth (with a lengthy period of development and the difficult conditions under which it occurred) makes it impossible to plan and further control the development of life, it was assumed that the system would implement the genetic solutions known on Earth, but the stage of development and testing of organisms would take place in the virtual world. The inheritance of characteristics and further adaptation take place already at the stage of implementation of life. Considering the complexity of life and of the conditions for its origins on Earth, only a simplified model was planned, facilitating the project and its virtual implementation. The main objective of the project is a full implementation of the model in three separate stages:
1. The stage of life planning and preparation of appropriate environmental tools;
2. The stage of life development and adaptation according to the planned assumptions;
3. The stage of implementation, with account taken of the mechanisms of adaptation to the variable conditions of the natural environment.

Stage 1 assumes the possibility of planning and simulating the origination of life and the natural environment, with the use of known elements and mechanisms of life occurring in nature, although certain modifications are possible. At this stage, assumptions and appropriate tools are prepared. It is believed that this stage will allow the planning of life as a complete set of independent systems meeting the objectives of the origination and sustainment of life, as a whole, in a complete form.

The purpose of stage 2 is to test and adapt the organisms designed at stage 1 to the planned conditions of their functioning. This stage allows for distribution of the information provided at stage 1, useful for each of the environmental niches. It is expected that adaptation will take place with the use of the virtual model of the natural environment.

Stage 3 is related to the implementation of the designed, adapted, and tested organisms, as a realisation of the complete natural environment that allows for the sustainment of life as a whole. This stage should facilitate the implementation of mechanisms of adaptation to changes in the natural environment.

The aim of this study is the creation of a tool showing the possibilities for simulation of stage 2. The simulation is regarded in the context of possible execution and implementation of gene substitution methods. According to the evolutionary model of creation and development of life on Earth, organisms developed from the least complex to more complex forms, through a series of changes in the process of continuous adaptation to the changing environmental conditions. As the processes occurring over a long period of time can be extremely difficult to reconstruct, it is assumed that a designer of a biological system will prepare, implement, and control the processes of life origination and development. This model uses an analogy with computer systems, where hardware and software are distinguished.
The hardware will represent genes coding appropriate proteins. The software will represent information related to the intensity of expression of individual genes or complexes thereof. The model used in this case will feature the distribution and modification of hardware information (selection and modification of individual genes), from a system with the highest level of hardware complexity to subsystems performing specialised tasks. The manner in which the hardware is used is controlled by the software; in our model, this is the information responsible for the gene expression level. In order for the adaptation process to take place in an appropriately short time span, both the distribution and the modification of individual genes should be virtual, to simulate the occurrence of complex environmental conditions for individual stages of adaptation. Gene distribution is conditional on expression: genes that show expression have an increased chance of further distribution, and genes that do not have a reduced one. It is assumed that in the adaptation process, an important role is played by appropriate receptors activating individual genes, depending on the environmental conditions established for the tests. This model assumes that the hardware (ready sets of genes) is not only distributed but also modified, i.e. adapted to the modified gene complex and to the level of gene expression. The level of expression of particular genes in the adaptation process is crucial for the complex of used genes. In order to simplify the model, it is presumed that the level of modification of particular genes will depend mainly on the level of their reduction in the adaptation process. In the application simulating stage 2, it is assumed that the original set of genes will comprise N=2000 systems (sets of dependent genes) with M=1000 genes (Figure 3). 
Each gene has a fixed expression value, determined at 1000 (the UPE value, an abbreviation chosen by analogy with upstream promoter elements). The relevant environmental conditions for individual tests are simulated with a predetermined template, containing a map of systems and genes, by assigning fixed expression values (UPE values). The adaptation process is conducted by adding the difference between the template UPE value and the system UPE value with a determined weight:

$$\mathrm{tab\_gen\_UPE} = \mathrm{tab\_gen\_UPE} + \sigma\,(\mathrm{tab\_templ\_UPE} - \mathrm{tab\_gen\_UPE}), \tag{36}$$

where tab_gen_UPE is the UPE value array for the adapted gene array, tab_templ_UPE is the UPE value array for the template, and sigma is an adaptation coefficient within the range (0,1). A function for the adaptation assessment is calculated:

$$E = (\mathrm{tab\_gen\_UPE} - \mathrm{tab\_templ\_UPE})^{2}. \tag{37}$$

The template contains genes specific for a particular organism, as well as genes of common systems that can be used by numerous organisms. Different adaptation algorithms are possible; however, they will not be discussed in this paper, as the application is intended only to illustrate the functioning of the assumed stage 2 model.

A gradual reduction and change in the intensity of the characteristics of the test environment at individual stages of adaptation (gene distribution) is simulated through a gradual reduction of systems and genes in the template. The template is modified in such a way as to ensure that an appropriate set of modified genes from the previous stage constitutes a basis for the further reduction and modification of genes at the next stage. This guarantees the continuity of distribution and the homology of the modified genes. In the test process, individual gene systems are combined into larger systems (similarly to biological systems, described as chromosomes; Figure 4).
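The adaptation update of Eq. (36) and the assessment function of Eq. (37) can be illustrated with a minimal Python sketch. It is not the application described in the paper: the function names, the random template, the scaled-down array sizes, and the choice to sum the squared differences over all genes in Eq. (37) are assumptions made only for illustration.

```python
import numpy as np

def adapt(tab_gen_upe, tab_templ_upe, sigma):
    """One adaptation step, Eq. (36): move the gene UPE values towards the template."""
    return tab_gen_upe + sigma * (tab_templ_upe - tab_gen_upe)

def adaptation_error(tab_gen_upe, tab_templ_upe):
    """Adaptation assessment, Eq. (37); summing over all genes is an assumption."""
    return float(np.sum((tab_gen_upe - tab_templ_upe) ** 2))

rng = np.random.default_rng(0)
n_systems, n_genes = 20, 10                       # scaled down from N=2000, M=1000
gen = np.full((n_systems, n_genes), 1000.0)       # every gene starts at UPE = 1000
templ = rng.integers(0, 1001, size=(n_systems, n_genes)).astype(float)  # hypothetical template

errors = []
for _ in range(50):
    errors.append(adaptation_error(gen, templ))
    gen = adapt(gen, templ, sigma=0.1)
```

With a constant sigma, each step scales every difference by (1 - sigma), so the assessment value E shrinks by a factor of (1 - sigma)^2 per step under these assumptions.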
It is also assumed that during the tests, it is possible to obtain favourable conditions for a combination of a complex of systems created at a certain stage into a new system, giving a new functionality to an organism and allowing it to adapt to the environmental conditions. In the application, the chromosome combination is simulated by a random combination of several chromosomes, with a concurrent random modification of the template (UPE value) for genes in the new chromosome (the process of adaptation of the new system will be aimed at the development of a new functionality).

Figure 3: Model of stage II (original system set, template with UPE values, and destination system set).

Figure 4: Gene and chromosome distribution scheme.

In the model presented above, the level of modification depends primarily on the level of gene reduction. Additions, deletions, and substitutions depend on the progress of the adaptation process. Let us assume that the simulations of gene reduction, besides certain similarities to the genetic systems encountered in nature, should take into account the evolutionary model of gene substitution used in the construction of the phylogenetic tree. For this purpose, we will conduct some simple computational experiments, using substitution procedures and the J-C, Felsenstein, and HKY models (see Introduction). The general algorithm used for the simulation is presented in Figure 5.
The simulation was run for three different sets of parameters, to verify the operation of the model, using each of the substitution methods. For the purposes of this simulation, a randomly generated sequence of 10,000 nucleotides, with the stationary distribution appropriate for the model, was used. In the first experiment, a simulation was run for the J-C model, with the following parameters: a rate parameter of 2e-5, π_A = π_C = π_G = π_T = 0.25, and cycle time tc = 1. As the J-C model is a special case of the other models, for verification purposes it was assumed that u_felsen = v_HKY, π_{1,2,3,4} = 0.25, and k = 1 (see Introduction). In the tests, a simple phylogenetic tree was used with normalised branch lengths of 5 (Figure 6), which, multiplied by the number of cycles and the cycle time, correspond to the evolutionary times for the model concerned (actual time). The actual time t_a is set as

$$t_a = t\, t_c\, n_{cycles}, \tag{38}$$

where t is the length of branches, t_c is the time of one cycle, and n_cycles is the number of cycles.

In the second step, the simulation procedure was tested for the Felsenstein model, with a change in the probabilities of occurrence of individual nucleotides. Only the nucleotide frequencies were changed, to π_A = 0.1, π_C = 0.6, π_G = 0.1, and π_T = 0.2. In the third test, the HKY model was used; the Felsenstein model parameters were retained, and only the transition/transversion ratio was changed, to k = 100.

The above gene substitution algorithm shows the course of mutation unrestricted by genetic selection pressure, allowing for a change of any DNA position strictly in accordance with the planned substitution model. However, under natural conditions, such a substitution in the coding regions occurs only for synonymous mutations. For mutations that change the "sense" of the coded protein, a restriction occurs, mainly in connection with the preservation of its functionality.
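The unrestricted substitution loop (cycles, then DNA positions, then a substitution with a given probability) can be sketched in Python as follows. This is an illustrative sketch rather than the author's implementation: the HKY rate parametrisation (rate to base j proportional to π_j, multiplied by k for transitions), the first-order step matrix P = I + Q·tc, the function names, and the rate value 0.01 (chosen only so that the example runs quickly) are all assumptions.

```python
import numpy as np

# Base order A, C, G, T; transitions are A<->G and C<->T.
TRANSITIONS = {(0, 2), (2, 0), (1, 3), (3, 1)}

def hky_step_matrix(pi, kappa, mu, tc=1.0):
    """Per-cycle substitution probabilities, first-order approximation P = I + Q * tc."""
    pi = np.asarray(pi, dtype=float)
    q = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            if i != j:
                q[i, j] = mu * pi[j] * (kappa if (i, j) in TRANSITIONS else 1.0)
        q[i, i] = -q[i].sum()          # rows of Q sum to zero
    p = np.eye(4) + q * tc
    assert (p >= 0).all(), "mu * tc is too large for the first-order step"
    return p

def evolve(seq, p, n_cycles, rng):
    """Loop over cycles; in each cycle every position may be substituted according to p."""
    seq = seq.copy()
    for _ in range(n_cycles):
        new = seq.copy()
        for b in range(4):
            idx = np.flatnonzero(seq == b)
            new[idx] = rng.choice(4, size=idx.size, p=p[b])
        seq = new
    return seq

def actual_time(t, tc, n_cycles):
    """Actual time of Eq. (38)."""
    return t * tc * n_cycles

rng = np.random.default_rng(1)
pi = [0.1, 0.6, 0.1, 0.2]                       # frequencies of the second test
p = hky_step_matrix(pi, kappa=1.0, mu=0.01)     # kappa = 1 gives the Felsenstein case
root = rng.choice(4, size=10_000, p=pi)         # random sequence with stationary distribution
dna1 = evolve(root, p, 1000, rng)
dna2 = evolve(root, p, 1000, rng)
d = float(np.mean(dna1 != dna2))                # fraction of differing positions
```

After many cycles the two descendants become effectively independent draws from the stationary distribution, so d should level off near 1 - Σ π_i², which is 0.58 for these frequencies and 0.75 for equal frequencies.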
In order to illustrate the conservative effect of the pressure to preserve the protein functionality, a similarity criterion was used for the coded amino acids, with the application of relevant biochemical properties, such as hydrophobicity, hydrophilicity, and mass of side chain, but also pK1 (constant of the -COOH group), pK2 (constant of the -NH3+ group), and pI (isoelectric point of the amino acid at 25 °C) [8-13]. According to this criterion, a nucleotide may change if the similarity measure between the amino acid currently coded and the new amino acid is less than the assumed similarity coefficient multiplied by rand (see Figure 7). The randomness factor rand was introduced to illustrate the differences related to genetic and environmental variability (although a very similar effect can be obtained without the randomness factor, with an appropriately reduced value of the coefficient). The following amino acid similarity criterion was used [9, 11, 14, 15]:

$$J_{i,j} = \frac{1}{k}\sum_{l=1}^{k}\left[h_l(R_i) - h_l(R_j)\right]^{2}, \tag{39}$$

where J_{i,j} is the similarity measure for amino acids i, j; h_l(R_i) is any biochemical value describing amino acid R_i; and k is the number of factors h_l(R_i). Before the relevant biochemical properties are used, they must be standardised according to the formula below:

$$h_l(R_i) = \frac{h_l^{0}(R_i) - \sum_{i=1}^{20}\frac{h_l^{0}(R_i)}{20}}{\sqrt{\dfrac{\sum_{i=1}^{20}\left[h_l^{0}(R_i) - \sum_{i=1}^{20}\frac{h_l^{0}(R_i)}{20}\right]^{2}}{20}}}.$$

Figure 5: An algorithm of the substitution simulation (for each cycle and each DNA position, a single base is substituted with probability P).

Figure 6: Tested tree, t=5 (2+2x3 and 5).

Figure 7: An algorithm of the substitution simulation for a restricted model (as in Figure 5, but a base is substituted only if the similarity between the current and the new amino acid is below the threshold).

The above evolutionary models, where uninhibited or limited nucleotide change occurs according to the set parameters, allow for the determination of the evolutionary times of the phylogenetic tree. In the gene reduction model, the tree will show the process of adaptation (similarly to the evolution process), where the length of individual branches will reflect the intensity of changes, irrespective of the period in which these changes took place. In this model, we assume that any of the so-called hardware changes are related to the construction, i.e. the set of genes used, whereas the so-called software changes determine their manner of use, i.e. expression. It is expected that in the adaptation process, the ultimate manner of use of the target set of genes is not as important as the determination of the hardware part. In this model, as at the stage of selection of the appropriate set of hardware, in the adaptation process (tests), a relevant system of genes is selected for the purpose of their appropriate further use, i.e. for the purpose of adjustment of appropriate gene systems and their expression. According to this model, at the stage of adaptation, the structure of coded proteins is adjusted to the new system of genes. Therefore, it is assumed that nucleotide substitutions may take place at the design stage, and their intensity depends on the changes in the structure of an organism at the adaptation stage, i.e. the number of reduced genes:

$$t = c, \tag{40}$$

$$P(c) = e^{Qc}, \tag{41}$$

where c is the number of reduced genes.

Results and discussion

Simulation of the evolution process

Figure 8: Time between respective branches of the tree (see Figure 6) as a function of the number of cycles for the J-C, Felsenstein, and HKY models.

Test 1. Figure 8 shows the time line, depending on the number of cycles, for the phylogenetic tree for the J-C, Felsenstein, and HKY models.
The time presented in the three implemented substitution methods fluctuates around the time applied in the simulation procedure (the real time was divided by the number of cycles in order to obtain a fixed value, as for the length of the tree branches in Figure 6). At the final stage, the indications become increasingly inaccurate; the probability d approximates 0.75 (Figure 9), showing minimal variations in the final phase. All models show the same value.

Figure 9: d-Value between respective branches as a function of the number of cycles.

Test 2. Following a change of the parameters of the distribution of particular nucleotides (see "Computer methods and theory" section), correct indications are registered for the Felsenstein and HKY models (Figures 10 and 11). The probability d (Figure 12) for the preset parameters π_A, π_C, π_G, and π_T approximates d = 1 - Σ_i π_i² = 1 - (0.1² + 0.6² + 0.1² + 0.2²) = 0.58 (see "Test 1").

Figure 10: Time between respective branches for the Felsenstein and HKY models.

Figure 11: Time between respective branches for the J-C model.

Figure 12: d-Value between respective branches.

Test 3. The last test is carried out following a change of parameter k (see "Computer methods and theory" section). Only the diagram for the HKY model shows the correct values (Figures 13-15). The limiting value of the probability d approximates 0.58, as for the Felsenstein model, but is approached much more slowly [see Eq. (35) and Figure 16].

Figure 13: Time between respective branches for the HKY model.

Figure 14: Time between respective branches for the J-C model.

Figure 15: Time between respective branches for the Felsenstein model.

Gene reduction model simulation

In the gene reduction model, the HKY substitution model was used (the tree simulated in "Test 3"). It was assumed that the original number of genes was 2e6 (2000 systems with 1000 genes each). Next, the system was reduced to 12e5 genes, and divided into 20,000 and 30,000 gene systems. The second branch was reduced directly to 25,000 genes. Thus, the value of reduction C is

$$C = 5\,\frac{N_S - N_R}{N_S}. \tag{42}$$

The system was normalised in such a manner as to ensure that the initial value is 5, as for the tree in Figure 6. The values of parameters C1, C2, C3, and C4 are 2.95, 2.92, 4.93, and 2, respectively (see Figure 17). These values are consistent with relating the number of genes removed along each branch to the original 2e6 genes, e.g. C4 = 5(2e6 - 12e5)/2e6 = 2 and C1 = 5(12e5 - 2e4)/2e6 = 2.95. In this simulation, the estimated values C12, C13, and C23 for the HKY model were 3.13, 5.08, and 5.05, respectively.

Simulation of the functionality preservation pressure

As with all mutations of the coding sequence nucleotides, changing the amino acids in the protein chain may result in a change or a loss of the protein functionality. In this simulation, the uninhibited change of nucleotides will be limited mainly to synonymous changes or changes to amino acids with similar biochemical properties. Unlike in the other tests, the plan or the lengths of the branches of the phylogenetic tree will not be determined (although a discussion could be interesting), and the main focus will be on the determination of the number of nucleotide substitutions in the tested protein chain.

From the designer's point of view, if experimental methods were used, the designed model of reduction in the adaptation process would be time-consuming and would probably require enormous resources for the experiments needed to eliminate the relevant genes. It would be better to use virtual methods and, following elimination, to carry out implementation and adaptation in the real world. Therefore, it would be necessary to simulate the substitution of genes that may be the products of these two processes.
It was assumed that unrestricted substitutions alone would occur with the probability given by the assumed model; however, the process is disturbed by the pressure to preserve functionality. For this purpose, tests were conducted to measure only the number of the relevant substitutions depending on the similarity coefficient, the number of cycles in relation to the set similarity coefficient, and for synonymous substitutions. As in study [3], tables were provided for position 3, and for positions 1 and 2.

Figure 16: d-Value between respective branches.

Figure 17: Tree for the gene reduction model (branches C1, C2, C3, and C4).

Test determining the relation of the number of substitutions and the variable value of the similarity coefficient for a fixed number of cycles

The test was carried out for the HKY model, for the following set of parameters: k=100, a rate parameter of 5e-7, and π_A, π_C, π_G, π_T = 0.1, 0.6, 0.1, and 0.2, respectively. Substitutions were tested in 10 measurements, for a growing value of the similarity coefficient and a fixed number of cycles. Table 1 shows that transitions for position 3 are at a similar level for substitutions between DNA chains 1 and 3, 2 and 3, and 1 and 2. The other substitutions, for similarity coefficient 0.1, occur sporadically. Results of measurements across the entire range of values for transitions in position 3, and in positions 1 and 2, are shown in Figures 18 and 19. It is clear that transitions in position 3 remain at a similar level, slightly above 100. For positions 1 and 2, with an increase in the similarity coefficient, the number of transitions clearly grows from approximately 0 to >200.
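The restricted substitution rule used in these tests (Figure 7 and Eq. (39)) can be sketched in Python as below. This is only an illustration under stated assumptions: a single property is used (the Kyte-Doolittle hydropathy scale, swapped in here in place of the six biochemical properties listed in the paper), the names `standardise`, `similarity`, and `accept` are hypothetical, and treating a synonymous change as always accepted is an assumption made so that a similarity coefficient of 0 leaves only synonymous substitutions.

```python
import numpy as np

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
# Kyte-Doolittle hydropathy values, in the order of AMINO_ACIDS (stand-in property).
KD = [1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5,
      3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2]

def standardise(raw):
    """Standardise each property column over the 20 amino acids (zero mean, unit spread)."""
    raw = np.asarray(raw, dtype=float)            # shape (20, k)
    return (raw - raw.mean(axis=0)) / raw.std(axis=0)

def similarity(h, i, j):
    """Eq. (39): mean squared difference of the k standardised properties."""
    return float(np.mean((h[i] - h[j]) ** 2))

def accept(h, cur, new, coeff, rng):
    """Accept a substitution if it is synonymous or if J < coeff * rand."""
    if cur == new:                                # synonymous change (assumption: always allowed)
        return True
    return bool(similarity(h, cur, new) < coeff * rng.random())

H = standardise(np.array(KD).reshape(-1, 1))      # k = 1 property in this sketch
rng = np.random.default_rng(0)
ile, leu, arg = (AMINO_ACIDS.index(a) for a in "ILR")
j_close = similarity(H, ile, leu)                 # two hydrophobic residues
j_far = similarity(H, ile, arg)                   # hydrophobic versus charged residue
```

A larger similarity coefficient admits substitutions between increasingly dissimilar amino acids, which matches the growth in the number of transitions in codon positions 1 and 2 reported for Figure 19.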
Test determining the relation of the number of substitutions and the number of cycles for synonymous substitutions (similarity coefficient=0)

The occurrence of synonymous substitutions in connection with an increase in the number of cycles was tested for the following set of parameters: k=1e4, a rate parameter of 1e-7, and π_A, π_C, π_G, π_T = 0.25. Figure 20 shows that with an increase in the number of cycles, the number of transitions grows to approximately 100 (the number of transitions for positions 1 and 2 is very small, and therefore it was not shown). Similarly, for parameters k=1, a rate parameter of 5e-4, and π_A, π_C, π_G, π_T = 0.25, and a large number of cycles, the number of transitions for position 3 is approximately 100 (Table 2).

Table 1: Transversions and transitions are presented in the lower left part and the upper right part, respectively, for a sequence of 1000 nucleotides. The first numbers relate to substitutions in the 1st and 2nd positions of the codon; the numbers in parentheses relate to the 3rd position of the codon.

        DNA1      DNA2      DNA3
DNA1    0 (0)     2 (100)   4 (118)
DNA2    2 (2)     0 (0)     6 (112)
DNA3    1 (4)     1 (6)     0 (0)

Table 2: Table of substitutions for a new set of parameters (see the text).

        DNA1      DNA2      DNA3
DNA1              8 (97)    9 (104)
DNA2    15 (94)             11 (103)
DNA3    12 (96)   17 (100)

Figure 18: Number of transitions between respective branches of the tree (see Figure 6) in the 3rd position as a function of the similarity coefficient.

Figure 19: Number of transitions between respective branches of the tree in the 1st and the 2nd position as a function of the similarity coefficient.

Figure 20: Number of transitions in the 3rd position of codon as a function of the number of cycles.

Test determining the relation of the number of substitutions and the number of cycles for a fixed value of the similarity coefficient

The following set of parameters was adopted for the simulation with a fixed similarity coefficient: k=1e4, a rate parameter of 1e-7, π_A, π_C, π_G, π_T = 0.25, and a similarity coefficient of 0.3. The number of transitions for position 3, and for positions 1 and 2, between DNA chains 1 and 3 and between chains 2 and 3, increases with the number of cycles and reaches approximately 160 and 120, respectively (Figures 21 and 22).

Figure 21: Number of transitions for the 3rd position as a function of the number of cycles (Gen13, Gen23, Gen12).

Figure 22: Number of transitions for positions 1 and 2 as a function of the number of cycles (Gen13, Gen23, Gen12).

Simulation of mitochondrial DNA substitution

This simulation was run as a combination of two processes. The first one was related to the adjustment stage, and the second one to the adaptation stage in the process of implementation. It was assumed that the nucleotide substitution in the implementation process would be consistent with the model in "Test determining the relation of the number of substitutions and the number of cycles for a fixed value of the similarity coefficient". The process related to the adjustment stage is unknown and was created ad hoc (see below). The algorithm (Figure 23) uses two amino acid chains with the same initial sequence, corresponding to the genetic information of the nucleotide chain. The first amino acid chain serves only as a template simulating the direction of changes. The amino acids in the template chain are substituted at random, with the fixed probability p.
The nucleotide chain substitution is done in such a manner as to ensure that the similarity of the template amino acid to the new amino acid coded by a codon of the nucleotide chain is lower than the preset value of the similarity coefficient, and to meet the following minimisation criterion:

$$\min\left( \left|\frac{N_{u3}}{N_{u12}} - 1\right| + N_{v3}^{2} + \left|\frac{N_{v12}}{N_{u12}} - \frac{80-30}{80}\right| \right), \tag{43}$$

where N_u3 is the number of transversions in position 3, N_u12 is the number of transversions in positions 1 and 2, N_v3 is the number of transitions in position 3, and N_v12 is the number of transitions in positions 1 and 2. It was assumed that the ratio N_v12/N_u12 should equal (80-30)/80 (for humans, the ratio of the maximum value of N_v12, after subtraction of the fixed value of approximately 30, to the maximum value of N_u12, i.e. approximately 80; see Table 1 from Ref. [3]). For the purpose of value calculation, the following indicative values of hardware distance between tree nodes (Figure 24 in this study and Figure 1 from Ref. [3]) were used:

1-2: 3.90
2-3: 20.67
3-4: 3.12
4-5: 4.68
5-6: 1.17
6-man: 2.73
6-chimp: 2.73
5-gorilla: 3.9
4-orang: 8.58
3-gibbon: 11.70
2-bovine: 32.37
1-mouse: 36.27

The results of the algorithm are shown in Table 3. The process of further adjustment (adaptation) at the implementation stage could take place, e.g. according to the algorithm in "Test determining the relation of the number of substitutions and the number of cycles for synonymous substitutions (similarity coefficient=0)". The sample distribution of transitions and transversions, from the same initial sequence, is shown in Table 4. The algorithm is used to calculate the distance from the amino acid template and the DNA determined at the adaptation stage. The final result of the application of both algorithms is shown in Table 5. The phylogenetic tree for the DNA chains obtained as a result of these processes is shown in Figure 25. The tree was created with PhyML 3.1, with the application of the maximum likelihood (ML) method for the HKY model, and Mega 5.2.2 [16]. The scale is only indicative, determined from the point of divergence between mouse and bovine [3].

Figure 23: Algorithm of the simulation of mtDNA substitutions at stage II (for each unit of hardware distance and each amino acid position, a substitution is required with probability p; if the similarity condition is met, the codon with the best rule fulfilment is chosen and the amino acid is substituted).

Figure 24: Hypothetical tree of mammalians with indicative branch lengths.

Figure 25: Phylogenetic tree for mammalians after stage II and III (scale in Myr).

Table 3: Substitution table for simulations for mammalians of stage II.

          Mouse     Bovine    Gibbon    Orang     Gorilla   Chimp     Man
Mouse               62 (2)    65 (3)    65 (3)    64 (3)    64 (3)    63 (3)
Bovine    78 (68)             57 (1)    57 (1)    57 (1)    56 (1)    56 (1)
Gibbon    79 (70)   73 (65)             25 (0)    23 (0)    23 (0)    22 (0)
Orang     79 (68)   72 (64)   31 (29)             17 (0)    17 (0)    16 (0)
Gorilla   79 (68)   72 (64)   28 (27)   21 (20)             6 (0)     5 (0)
Chimp     78 (68)   71 (63)   28 (26)   20 (19)   7 (7)               3 (0)
Man       77 (67)   70 (63)   27 (25)   19 (18)   6 (6)     3 (3)

Table 4: Substitution table for simulations for mammalians of stage III.

          Mouse     Bovine    Gibbon    Orang     Gorilla   Chimp     Man
Mouse               27 (62)   29 (55)   27 (58)   29 (61)   27 (60)   28 (58)
Bovine    0 (0)               28 (57)   26 (60)   27 (64)   26 (60)   25 (59)
Gibbon    0 (0)     0 (0)               27 (57)   28 (59)   27 (59)   27 (56)
Orang     0 (0)     0 (0)     0 (0)               27 (59)   24 (59)   26 (61)
Gorilla   0 (0)     0 (0)     0 (0)     0 (0)               27 (61)   27 (61)
Chimp     0 (0)     0 (0)     0 (0)     0 (0)     0 (0)               25 (61)
Man       0 (0)     0 (0)     0 (0)     0 (0)     0 (0)     0 (0)

Table 5: Substitution table for simulations for mammalians after stage II and III.

          Mouse     Bovine    Gibbon    Orang     Gorilla   Chimp     Man
Mouse               80 (49)   83 (47)   84 (46)   82 (50)   82 (47)   80 (50)
Bovine    78 (68)             78 (49)   76 (48)   77 (50)   75 (46)   74 (51)
Gibbon    79 (70)   73 (65)             50 (48)   47 (52)   49 (46)   46 (49)
Orang     79 (68)   72 (64)   31 (29)             45 (54)   42 (49)   42 (52)
Gorilla   79 (68)   72 (64)   28 (27)   21 (20)             32 (50)   31 (56)
Chimp     78 (68)   71 (63)   28 (26)   20 (19)   7 (7)               28 (52)
Man       77 (67)   70 (63)   27 (25)   19 (18)   6 (6)     3 (3)

Conclusions

The concept described here should not only give the opportunity to design life, but it should also use similar genetic processes. Its results should be similar to the system occurring in nature. The aim of this study was to show the similarities between these systems in terms of nucleotide substitution processes, and a similar distribution of substitutions was obtained for both systems.

For the purpose of the first simulation, it was assumed that one process was responsible for the distribution of substitutions at both stages (adaptation and implementation), which could be simulated with the Markov process. This simulation was run with the application of the HKY model. It is evident that, in this case, the assumption that the substitution occurred on average at a steady speed over time would be untrue, as it is assumed that mutation changes in genes depend on the new system of genes, and the adaptation process is virtual. The implementation process would be a continuation of the adaptation process.

For the purpose of the second simulation, both processes were divided. At the adaptation stage, the mutation changes depended on the changes in the gene system. An algorithm realising a simple dependency function with the minimisation criterion was used. Amino acid substitutions were simulated by random changes of the template amino acids, depending on the changes in the gene system.
The implementation stage is consistent with the inheritance mechanisms, and is based on the assumption of the occurrence of random substitutions, stable over time. For the nucleotide substitution simulation procedure, the HKY model was adopted. This process is disturbed by the algorithm restricting substitutions to synonymous substitutions or to ones changing the amino acids within a specified scope of similarity. The result of the simulation is a distribution of substitutions resembling the one in Ref. [3] (this study was not intended to show the exact similarity, and both the algorithm and certain input variable values were selected indicatively). The obtained nucleotide chains, with the distribution of substitutions presented in this paper, facilitated the creation of the phylogenetic tree. The ML method was used with the HKY model. Therefore, it was shown that the occurrence of nucleotide substitutions depending on the number of reduced genes may give results similar to those in evolutionary processes. The system requires a simulation of the life origination process at each of the specified stages, and presents an interesting concept of the design and creation of life with the use of a computer.

Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
Research funding: None declared.
Employment or leadership: None declared.
Honorarium: None declared.
Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication.

8. Cai Y, Zhou G, Chou K. Support vector machines for predicting membrane protein types by using functional domain composition. Biophys J 2003;84:3257-63.
9. Chou K, Cai Y. Prediction of protease types in a hybridization space. Biochem Biophys Res Commun 2006;339:1015-20.
10. Zhang G, Li H, Gao J, Fang B. Predicting lipase types by improved Chou's pseudo-amino acid composition. Protein Pept Lett 2008;15:1132-7.
11. Chou K, Cai Y. Predicting protein quaternary structure by pseudo amino acid composition. Proteins Struct Funct Genet 2003;53:282-9.
12. Krajewski Z, Tkacz E. Feature selection of protein structural classification using SVM classifier. Biocybern Biomed Eng 2013;33:47-61.
13. Krajewski Z, Tkacz E. Protein structural classification based on pseudo amino acid composition using SVM classifier. Biocybern Biomed Eng 2013;33:77-87.
14. Chou K. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins Struct Funct Bioinform 2001;43:246-55.
15. Chou K, Cai Y. Predicting subcellular localization of proteins by hybridizing functional domain composition and pseudo-amino acid composition. J Cell Biochem 2004;91:1197-1203.
16. Hall BG. Phylogenetic trees made easy. University of Rochester, Emeritus and Bellingham Research Institute. Sunderland, MA: Sinauer Associates Inc., 2001.

Journal

Bio-Algorithms and Med-Systems, de Gruyter

Published: Sep 1, 2015
