Distribution of Base Pair Alternations in a Periodic DNA Chain: Application of Polya Counting to a Physical System

Malcolm Hillebrand; Guy Paterson-Jones; George Kalosakas; Charalampos Skokos

doi:10.1134/S1560354718020016

ISSN 1560-3547, Regular and Chaotic Dynamics, 2018, Vol. 23, No. 2, pp. 1–16. © Pleiades Publishing, Ltd., 2018.

Malcolm Hillebrand (1, *), Guy Paterson-Jones (1), George Kalosakas (2), and Charalampos Skokos (1)

(1) Department of Mathematics and Applied Mathematics, University of Cape Town, Rondebosch, Cape Town 7701, South Africa
(2) Department of Materials Science, University of Patras, Rio GR-26504, Greece
(*) E-mail: malcolm.hillebrand@gmail.com

arXiv:1805.06245v1 [math.CO] 16 May 2018

Received October 13, 2017; accepted December 11, 2017

Abstract. In modeling DNA chains, the number of alternations between Adenine-Thymine (AT) and Guanine-Cytosine (GC) base pairs can be considered as a measure of the heterogeneity of the chain, which in turn could affect its dynamics. A probability distribution function of the number of these alternations is derived for circular or periodic DNA. Since there are several symmetries to account for in the periodic chain, necklace counting methods are used. In particular, Pólya's Enumeration Theorem is extended to the case of a group action that preserves partitioned necklaces. This, along with the treatment of generating functions as formal power series, allows for the direct calculation of the number of possible necklaces with a given number of AT base pairs, GC base pairs and alternations. The theoretically obtained probability distribution functions of the number of alternations are accurately reproduced by Monte Carlo simulations and fitted by Gaussians. The effect of the number of base pairs on the characteristics of these distributions is also discussed, as well as the effect of the ratios of the numbers of AT and GC base pairs.

MSC2010 numbers: 05A15, 92D20

Keywords: DNA models, Pólya's Counting Theorem, Heterogeneity, Necklace Combinatorics

1. Introduction

Single circular DNA molecules are abundant in nature. The whole genome in a typical bacterium is usually contained in a closed DNA molecule, while in eukaryotes the organelle DNA, inside the mitochondria and chloroplasts, is also found in the same form [1, 23]. Plasmids, whether naturally found in bacteria or used as vectors in gene cloning, are also smaller circular DNA segments. Apart from these cases, in considering the dynamics and other properties of DNA chains, it is often useful to model the chain using periodic boundary conditions in order to avoid finite-size or edge effects. For example, periodic boundary conditions have been used to study denaturation bubbles and the melting behavior of DNA [2, 6, 13, 37, 39, 43], probability distributions of thermal openings in the double strand [7, 18], bubble opening profiles in promoter regions which regulate gene transcription [3–5, 11, 12, 16, 20], binding sites of DNA-associated proteins [26, 38], various dynamical and nonlinear properties of DNA [21, 27, 28, 40, 41, 44], as well as charge transport in DNA [10, 14, 17, 19, 33].

A DNA chain consists of a series of base pairs, where each base pair is either Adenine-Thymine (AT) or Guanine-Cytosine (GC). Currently, we are investigating the influence of different factors on the chaoticity of periodic DNA chains [36]. One of the examined quantities is the number of base pair alternations, which can be considered as a quantifier of the system's heterogeneity. In this work we focus on the rigorous mathematical treatment of alternation counting in periodic DNA sequences. To study periodic DNA, we will consider the DNA necklace associated with a DNA chain, in which the first and the last base pairs of the chain become neighbors.
This periodicity presents some modeling challenges: two distinct DNA chains may still correspond to the same necklace, as one may be merely a rotation or reflection of the other. Such symmetries need to be addressed if any conclusions are to be drawn about the structure and the dynamics of DNA necklaces. In particular, we are concerned with the number $\alpha$ of base pair alternations in the necklace, where an alternation is defined to be a point at which an AT base pair neighbors a GC base pair or vice versa. Consider, for instance, the DNA chain shown in Fig. 1. Representing a GC base pair (black bead) with a 0 and an AT base pair (white bead) with a 1, the chain can be written in the form $(1)000\bar{0}\bar{1}\bar{0}1\bar{1}0\bar{0}\bar{1}(0)$. Here, we have given the leftmost base pair at each alternation point an overbar, and used brackets to denote the fact that in the corresponding DNA necklace the first and last base pairs are neighbors. This necklace is illustrated in Fig. 2, and counting the number of overbars we see that there are $\alpha = 6$ alternations.

Fig. 1. An example of a DNA chain, 0 0 0 0 1 0 1 1 0 0 1. GC base pairs are represented by black beads and the number 0, while AT base pairs are represented by white beads and the number 1. In the DNA necklace corresponding to this chain, the AT base pair at the far right neighbors the GC base pair at the far left.

Fig. 2. The DNA necklace corresponding to the chain of Fig. 1. This necklace has $\alpha = 6$ alternations.

It is worth noting that a base pair alternation corresponds to the appearance of the particular sequences (often referred to as "words") 01 or 10 in a DNA chain. Word occurrence probabilities have already been studied in the literature (see e.g. [22, 24, 30–32, 34, 35] and references therein), with emphasis on the appearance of patterns with unexpectedly high or low frequencies, as well as on repeating sequences.
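The alternation count $\alpha$ defined above for the chain of Fig. 1 can be checked with a few lines of Python. This is only an illustrative sketch: the function name `count_alternations` and the encoding of a chain as a string of `0` (GC) and `1` (AT) characters are assumptions made here, not part of the paper.

```python
def count_alternations(chain: str) -> int:
    """Count AT/GC alternations in the necklace obtained by closing `chain`.

    `chain` is a string of '0' (GC) and '1' (AT); position i is compared with
    its right-hand neighbor, and the last position wraps around to the first.
    """
    n = len(chain)
    return sum(chain[i] != chain[(i + 1) % n] for i in range(n))


if __name__ == "__main__":
    # The chain of Fig. 1, read from left to right.
    print(count_alternations("00001011001"))
```

Because the comparison wraps around, the result is the same for every rotation or reversal of the input string, which is the symmetry the necklace description is meant to capture.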
However, these studies concern linear DNA segments, or in other words DNA chains with fixed boundary conditions. The periodic boundary conditions we consider make the problem of counting alternations (or, more generally, the appearance of specific words) in circular DNA segments much more complicated than in the linear case, because rotations and reflections impose additional symmetries on the DNA structures.

Each base pair in a DNA necklace can contribute at most 2 alternations, depending on which neighbors it differs from. Supposing that the numbers of AT and GC base pairs in the necklace are $N_{AT}$ and $N_{GC}$ respectively, this yields the restriction $0 \le \alpha \le \min\{2N_{AT}, 2N_{GC}\}$. We note that in the extreme case of a homogeneous chain composed of base pairs of the same kind $\alpha = 0$, while if both types of base pairs are present in the DNA chain the smallest possible number of alternations is $\alpha = 2$. The latter corresponds to a chain having all AT (and consequently all GC) base pairs grouped together. Furthermore, if we traverse the necklace pair by pair until we end up where we started, we must necessarily switch between AT and GC base pairs an even number of times. Thus $\alpha = 2M$ for some $M \in \mathbb{N}$.

Now the natural question is: what is the probability that a random DNA necklace with a specified number of AT and GC base pairs, $N_{AT}$ and $N_{GC}$ respectively, has a specified number of alternations $\alpha$? In other words, how many possible combinations of such base pairs yield $\alpha$ alternations once the cyclic and reflective symmetries are taken into account? In what follows we answer these questions and provide an algorithm for computing the number of distinct DNA necklaces satisfying these constraints.

The paper is organized in the following way: In Sect. 2, the mathematical background is laid out, leading to a Pólya Enumeration Theorem for bipartite sets. In Sect. 3 an explicit algorithm for calculating the number of distinct DNA necklaces with given values of $\alpha$, $N_{AT}$ and $N_{GC}$ is described, while in Sect. 4 we compare the theoretical results to those obtained from Monte Carlo simulations and investigate the effect of the $N_{AT}$ and $N_{GC}$ values on the characteristics of the probability distribution function (pdf) of $\alpha$. Finally, in Sect. 5 we summarize our results, while in the Appendix we provide a Python computer code implementing the algorithm of Sect. 3.

2. Theoretical Treatment

Our problem can be neatly related to the combinatorics of necklaces. Effectively, we are interested in the number of distinct necklaces with $N = N_{AT} + N_{GC}$ beads, where $N_{AT}$ of the beads are white, $N_{GC}$ of the beads are black, and there are $\alpha$ alternations between the colors. We consider necklaces to be the same if they can be reflected or rotated into one another, and beads of the same color are treated as indistinguishable. Because of this, we can equivalently think of a necklace with $\alpha$ alternations as a necklace of $\alpha$ containers, where each container carries some number of beads of a single color, and adjacent containers have different colors. This idea is illustrated in Fig. 3.

Fig. 3. The necklace of containers corresponding to the DNA necklace of Fig. 2. The numbers in each container represent the number of consecutive black or white beads in that segment of the necklace.

We will refer to containers carrying black beads as black containers, and similarly for white containers. Counting the number of distinct necklaces with the given constraints can thus be reformulated as the problem of assigning numbers of beads to $\alpha$ containers, such that the totals of the numbers in the black and white containers are equal to $N_{GC}$ and $N_{AT}$ respectively. Two such assignments will be considered equivalent if the containers can be rotated or reflected into one another in such a way as to preserve both the colors and the numbers of beads they contain. Enumerating such assignments is simpler than enumerating necklaces, as we have one fewer constraint: the number of alternations is now implicit in the formulation of the problem.

To perform this enumeration we will require some tools from Pólya counting theory. In particular, we will need a version of the Pólya Enumeration Theorem for sets partitioned into two parts, which we will refer to as bipartite sets. For completeness' sake, we present this material below.

2.1. Group Actions

Let $A$ be a set. Then we define the symmetric group on $A$ to be the set of permutations of $A$:

$$S_A = \{\varphi : A \to A \mid \varphi \text{ is a bijection}\}. \tag{2.1}$$

A cycle is a permutation $\varphi \in S_A$ such that there exist distinct elements $x_1, x_2, \ldots, x_k \in A$ and:

$$\varphi(x) = \begin{cases} x_{i+1} & \text{if } x = x_i \text{ for some } 1 \le i < k, \\ x_1 & \text{if } x = x_k, \\ x & \text{otherwise.} \end{cases} \tag{2.2}$$

We denote such a cycle suggestively as $(x_1\, x_2\, \ldots\, x_k)$, and say that $\varphi \in S_A$ is a $k$-cycle if $\varphi = (x_1\, x_2\, \ldots\, x_k)$ for some $x_i \in A$. Two cycles $(x_1\, x_2\, \ldots\, x_k)$ and $(y_1\, y_2\, \ldots\, y_l)$ are said to be disjoint if the sets $\{x_1, x_2, \ldots, x_k\}$ and $\{y_1, y_2, \ldots, y_l\}$ are disjoint.

If $A$ is a finite set, every element of $S_A$ can be written as a composition of cycles; in general, however, this cannot be done uniquely. On the other hand, we have the following fundamental structure theorem for elements of finite symmetric groups (see for example [15]):

Theorem (Cycle Decomposition Theorem). If $A$ is a finite set, then every element $\varphi \in S_A$ can be written as a product of pairwise disjoint cycles, unique up to the order of the cycles:

$$\varphi = (x_{11}\, x_{12}\, \ldots\, x_{1k_1}) \cdots (x_{n1}\, x_{n2}\, \ldots\, x_{nk_n}).$$

Given a group $G$ and a set $A$, a group action of $G$ on $A$ is a homomorphism $\Gamma_G : G \to S_A$. In other words, elements of $G$ are identified with permutations of $A$ in a manner that preserves the group structure. To simplify the notation, we will write $gx$ instead of $\Gamma_G(g)(x)$ for the action of $g \in G$ on some $x \in A$.

The orbit of an element $x \in A$ under the group action $\Gamma_G$ is defined to be the set $\mathrm{Orb}_x = \{gx \mid g \in G\}$, and its stabilizer is given by the subgroup $\mathrm{Stab}_x = \{g \in G \mid gx = x\}$. Given some $g \in G$, we denote its set of fixed points by $\mathrm{Fix}_g = \{x \in A \mid gx = x\}$.

2.2. Pólya's Counting Theory

One can often rephrase counting problems in terms of computing the number of distinct orbits of some group action. Pólya's counting theory can be thought of as a tool for making these computations systematic and expedient. A fundamental lemma on which this theory is built is the following [9]:

Lemma 1 (Burnside's Lemma). The number of distinct orbits in a group action of a finite group $G$ on $A$ is given by the average number of fixed points of elements of $G$:

$$\#\mathrm{Orbits} = \frac{1}{|G|} \sum_{g \in G} |\mathrm{Fix}_g|. \tag{2.3}$$

A basic problem in combinatorics is the following. Suppose one has a finite set of objects $A$, and one wishes to color them with colors from another set $\Omega$. How many distinct ways are there of coloring the objects up to some kind of symmetry? This can be recast in the language of group actions. The set of possible colorings is given by $\Omega^A = \{\varphi : A \to \Omega \mid \varphi \text{ a function}\}$, and the symmetry is given by a group action $\Gamma_G$ on $A$. This group action passes naturally to a group action $\tilde{\Gamma}_G$ on $\Omega^A$, defined by $g\varphi : x \mapsto \varphi(gx)$. The question now reduces to counting the number of distinct orbits of this latter action.

In this simplified case, Burnside's lemma is often sufficient to answer the question. We can generalize this problem slightly, however. Suppose that each color has an associated weight, given by a function $\omega : \Omega \to \mathbb{N}$. Given a coloring $\varphi : A \to \Omega$ of the objects, we define its total weight to be the sum:

$$|\varphi| = \sum_{x \in A} \omega \circ \varphi(x). \tag{2.4}$$

How many distinct colorings of $A$ with a given total weight are there, up to symmetries given by some group action $\tilde{\Gamma}_G$? Note that the total weight of any coloring in a given orbit is the same, as elements of $G$ merely permute the set $A$. Thus, the problem boils down to calculating the number of distinct orbits with a given total weight. Pólya identified two necessary ingredients for a systematic answer to this question: generating functions, and an understanding of the cycle structure of elements of $G$ [29].

Definition (Generating Function). Let $\omega : \Omega \to \mathbb{N}$ be an assignment of weights to some set $\Omega$. Suppose further that there are at most a finite number of elements of any given weight, that is, $|\omega^{-1}(n)|$ is finite for every $n \in \mathbb{N}$. Then the generating function of $\omega$ is given by the formal power series:

$$f_\omega(x) = \sum_{i=0}^{\infty} |\omega^{-1}(i)|\, x^i. \tag{2.5}$$

Generating functions are useful as they encode combinatorial data, in this case the number of colors of a given weight, as algebraic objects. In particular, we will need the following lemma:

Lemma 2. Let $\omega_1 : \Omega_1 \to \mathbb{N}$ and $\omega_2 : \Omega_2 \to \mathbb{N}$ be assignments of weights to the sets $\Omega_1$ and $\Omega_2$ respectively. Define an assignment of weights to the set $\Omega_1 \times \Omega_2$ by $\omega : (x_1, x_2) \mapsto \omega_1(x_1) + \omega_2(x_2)$. Then $f_\omega(x) = f_{\omega_1}(x) \cdot f_{\omega_2}(x)$.

Given a group action $\Gamma_G$ and an element $g \in G$, we denote by $C_k(g)$ the number of $k$-cycles in the unique disjoint cycle decomposition of $\Gamma_G(g)$. We can now encode information about the cycle structure of elements of $G$ in the following multivariate polynomial:

Definition (Cycle Index). Let $G$ be a finite group. Then the cycle index of a group action $\Gamma_G$ on a finite set $A$ of cardinality $n$ is given by the polynomial [8]:

$$Z_G(x_1, x_2, \ldots, x_n) = \frac{1}{|G|} \sum_{g \in G} x_1^{C_1(g)} x_2^{C_2(g)} \cdots x_n^{C_n(g)}. \tag{2.6}$$

This cycle index will allow us to efficiently compute the number of distinct orbits of the group action.
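As a hedged illustration of Burnside's lemma and of the cycle counts $C_k(g)$ that enter the cycle index, the sketch below counts unweighted colorings of $n$ positions on a ring under the full dihedral symmetry. The function names (`dihedral_permutations`, `cycle_type`, `count_colorings`) and the representation of a permutation as a tuple are choices made for this example only. It relies on the fact that a coloring is fixed by $g$ exactly when it is constant on every cycle of $g$, so $g$ fixes $c^{\#\text{cycles}(g)}$ colorings when $c$ colors are available.

```python
def dihedral_permutations(n):
    """All 2n rotations and reflections of n ring positions.

    Each symmetry is a tuple p with p[i] = image of position i.
    """
    rotations = [tuple((i + s) % n for i in range(n)) for s in range(n)]
    reflections = [tuple((s - i) % n for i in range(n)) for s in range(n)]
    return rotations + reflections


def cycle_type(perm):
    """Return {k: number of k-cycles} for the disjoint cycle decomposition."""
    seen, counts = set(), {}
    for start in range(len(perm)):
        if start in seen:
            continue
        length, x = 0, start
        while x not in seen:      # walk around the cycle containing `start`
            seen.add(x)
            x = perm[x]
            length += 1
        counts[length] = counts.get(length, 0) + 1
    return counts


def count_colorings(n, n_colors):
    """Burnside count of colorings of n ring positions up to dihedral symmetry."""
    group = dihedral_permutations(n)
    fixed = sum(n_colors ** sum(cycle_type(p).values()) for p in group)
    return fixed // len(group)    # average number of fixed colorings


if __name__ == "__main__":
    print(cycle_type((1, 2, 0)))      # a single 3-cycle
    print(count_colorings(6, 2))      # two-color rings of six positions
```

Replacing the plain count `n_colors ** (number of cycles)` by a product of generating functions, one factor $f_\omega(x^k)$ per $k$-cycle, is exactly the step from Burnside's lemma to the weighted enumeration discussed next.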
With this in mind, we are now in a position to state a version of the Pólya counting theorem, answering the generalized problem given earlier:

Theorem (Pólya Enumeration Theorem). Let $A$ be a finite set of objects, $\Omega$ a set of colors, $\omega : \Omega \to \mathbb{N}$ an assignment of weights to the colors with generating function $f_\omega$, and $\Gamma_G$ a group action of a finite group $G$ on $A$. Then $\Gamma_G$ passes naturally to a group action $\tilde{\Gamma}_G$ on $\Omega^A$, and a generating function by total weight for the number of distinct orbits of $\tilde{\Gamma}_G$ is given by:

$$\mathrm{Orbits}_{\tilde{\Gamma}_G}(x) = Z_G\big(f_\omega(x), f_\omega(x^2), \ldots, f_\omega(x^n)\big). \tag{2.7}$$

2.3. Pólya Enumeration Theorem for Bipartite Sets

By considering multivariate generating functions, the Pólya enumeration theorem can be generalized to the case where the colors take vector-valued weights in $\mathbb{N}^k$. We will generalize the theorem in a different direction, however. Suppose we have a partition of $A$ into two parts, $A = X \sqcup Y$, and a group action $\Gamma_G$ on $A$. We would like to consider the problem of counting distinct colorings of $A$ under this symmetry, with the additional constraint that we color elements of $X$ from a set $\Omega_X$, and elements of $Y$ from a set $\Omega_Y$. To this end, we will say that a coloring $\varphi : A \to \Omega_X \sqcup \Omega_Y$ is valid if $\varphi(x) \in \Omega_X \iff x \in X$ and $\varphi(x) \in \Omega_Y \iff x \in Y$.

There is an obstruction to this, however: the group action may map elements in $X$ to elements in $Y$ or vice versa. In this case, the extension of $\Gamma_G$ to the set of possible colorings is no longer well-defined, as there is no natural way to compare the sets of colors $\Omega_X$ and $\Omega_Y$. Fortunately, this is the only obstruction to proving a Pólya-type theorem for this problem. This motivates the following definition:

Definition (Partition-Preserving Group Action). Let $A = X \sqcup Y$, and let $\Gamma_G$ be a group action on $A$. Then we say that $\Gamma_G$ is partition-preserving if for every $g \in G$, $gx \in X \iff x \in X$ and $gx \in Y \iff x \in Y$.

The importance of this property is as follows. Suppose we have a group action $\Gamma_G$ on $A = X \sqcup Y$, and some element $g \in G$.
Then $\Gamma_G(g)$ has a unique disjoint cycle decomposition given by $\Gamma_G(g) = C_1 \cdot C_2 \cdots C_k$. If $\Gamma_G$ is partition-preserving then each cycle $C_i$ is contained entirely in either $X$ or $Y$, and $\Gamma_G$ is in fact partition-preserving if and only if this is the case for every $g \in G$. If $\Gamma_G$ is partition-preserving, then we define $C_k^X(g)$ to be the number of $k$-cycles in the disjoint cycle decomposition of $\Gamma_G(g)$ that are contained in $X$, and we define $C_k^Y(g)$ analogously.

We will now define an analogue of the cycle index polynomial for the case of partition-preserving group actions. This will allow us to keep track of the cycle structure of elements of the group as well as which partition part each cycle acts on:

Definition (Bipartite Cycle Index). Let $G$ be a finite group and $A = X \sqcup Y$ a finite set of cardinality $n$. Then the bipartite cycle index of a partition-preserving group action $\Gamma_G$ on $A$ is defined to be the polynomial:

$$\tilde{Z}_G(x_1, \ldots, x_n, y_1, \ldots, y_n) = \frac{1}{|G|} \sum_{g \in G} x_1^{C_1^X(g)} \cdots x_n^{C_n^X(g)}\, y_1^{C_1^Y(g)} \cdots y_n^{C_n^Y(g)}. \tag{2.8}$$

We can now generalize Pólya's theorem to the case of partition-preserving group actions. We note that this theorem is used implicitly in [29] without proof.

Theorem 1 (Bipartite Pólya Enumeration Theorem). Let $\Gamma_G$ be a partition-preserving group action of a finite group $G$ on a finite set $A = X \sqcup Y$. Let $\Omega = \Omega_X \sqcup \Omega_Y$ be a set of colors, and let $\omega_X : \Omega_X \to \mathbb{N}^+$ and $\omega_Y : \Omega_Y \to \mathbb{N}^+$ be their assigned weights with respective generating functions $f_X$ and $f_Y$. If $\Phi$ is the set of valid colorings of $A$, then $\Gamma_G$ passes naturally to a group action $\tilde{\Gamma}_G$ on $\Phi$, and a generating function by total weight for the number of orbits of $\tilde{\Gamma}_G$ is given by:

$$\mathrm{Orbits}_{\tilde{\Gamma}_G}(x) = \tilde{Z}_G\big(f_X(x), \ldots, f_X(x^k), f_Y(x), \ldots, f_Y(x^k)\big). \tag{2.9}$$

Proof. We pass to a group action $\tilde{\Gamma}_G$ on $\Phi$ as follows. Given a valid coloring $\varphi \in \Phi$ and an element $g \in G$, we define the action of $g$ on $\varphi$ by $g\varphi : x \mapsto \varphi(gx)$.
To compute a generating function for the number of orbits of $\tilde{\Gamma}_G$ by total weight, we will determine the generating functions for the number of fixed points of each $g \in G$ by total weight.

Consider some $g \in G$. As $A$ is finite, there exists a unique disjoint cycle decomposition $\Gamma_G(g) = C_1 \cdot C_2 \cdots C_k$, where each $C_i$ is a cycle in the symmetric group $S_A$. Now suppose that $g$ fixes some valid coloring $\varphi \in \Phi$; that is, $g\varphi = \varphi$. Then, writing the cycle as $C_i = (x_1\, x_2\, \ldots\, x_{k_i})$ for some $x_j \in A$, we have by definition that $\varphi(x_j) = (g\varphi)(x_j) = \varphi(gx_j) = \varphi(x_{j+1})$, and hence every element in the cycle must have the same color under $\varphi$. The number of colorings of $C_i$ that are fixed by $g$ is thus given by the generating function $f_X(x^{k_i})$ if $C_i$ lies in $X$, and $f_Y(x^{k_i})$ if $C_i$ lies in $Y$. We note that one of these two cases must occur for every cycle, as $\Gamma_G$ is partition-preserving. By Lemma 2, then, the number of valid colorings of $A$ that are fixed by $g$ is given by the generating function:

$$\mathrm{Fix}_g(x) = f_X(x)^{C_1^X(g)} \cdots f_X(x^k)^{C_k^X(g)}\, f_Y(x)^{C_1^Y(g)} \cdots f_Y(x^k)^{C_k^Y(g)}. \tag{2.10}$$

By Burnside's lemma, the number of orbits of $\tilde{\Gamma}_G$ of a particular weight is given by the average number of fixed colorings of that weight by elements $g \in G$. Applying Burnside's lemma for each possible weight, the number of orbits of $\tilde{\Gamma}_G$ is thus given by the generating function:

$$\begin{aligned} \mathrm{Orbits}_{\tilde{\Gamma}_G}(x) &= \frac{1}{|G|} \sum_{g \in G} \mathrm{Fix}_g(x) \\ &= \frac{1}{|G|} \sum_{g \in G} f_X(x)^{C_1^X(g)} \cdots f_X(x^k)^{C_k^X(g)}\, f_Y(x)^{C_1^Y(g)} \cdots f_Y(x^k)^{C_k^Y(g)} \\ &= \tilde{Z}_G\big(f_X(x), \ldots, f_X(x^k), f_Y(x), \ldots, f_Y(x^k)\big). \end{aligned} \tag{2.11}$$

We note that as a corollary of this proof, we can recover a bivariate generating function from this expression, where the coefficient of $x^a y^b$ represents the number of distinct colorings with total weight $a$ in $\Omega_X$, and total weight $b$ in $\Omega_Y$:

Corollary. A bivariate generating function by total weight in $\Omega_X$ and $\Omega_Y$, for the number of distinct colorings of $A$, is given by:

$$\mathrm{Orbits}_{\tilde{\Gamma}_G}(x, y) = \tilde{Z}_G\big(f_X(x), \ldots, f_X(x^k), f_Y(y), \ldots, f_Y(y^k)\big). \tag{2.12}$$

2.4. The Dihedral Group, its Cycle Index and its Extension

To apply these results to the problem of counting distinct DNA necklaces, we will need to describe the relevant group action and compute its (bipartite) cycle index. The set of elements acted on by the group is given by the $\alpha$ containers in the DNA necklace, and this set can be partitioned into two parts: containers of black beads and containers of white beads. We consider two DNA necklaces to be the same if one can be rotated or reflected into the other. These symmetries can be described by an action of the dihedral group, which we will denote by $D_{2M}$, where we have $\alpha = 2M$. The rotational and reflective symmetries are what distinguish the case of periodic DNA chains from the linear, fixed boundary condition chains studied in [31] and elsewhere.

A fundamental fact about $D_{2M}$ is that it is generated by two elements $r$ and $s$, where $r$ is a reflection satisfying $r^2 = 1$, and $s$ is a rotation of order $M$. Therefore, to describe a group action of $D_{2M}$ on a DNA necklace it suffices to give the action of $r$ and $s$. In Fig. 4 the action of such a rotation on the necklace is illustrated, while in Figs. 5 and 6 the action of a reflection is illustrated for the cases where $M$ is odd and even respectively. It is clear that the resulting group action is partition-preserving.

Fig. 4. The action of a rotation $s \in D_{2M}$ on the DNA necklace.

To compute the bipartite cycle index of this group action, we will treat reflections and rotations separately. To begin with, we can see from Fig. 4 that rotations act symmetrically on the black and white containers in the DNA necklace. Thus, the terms of the cycle index polynomial corresponding to rotations will be symmetric in the $x_i$ and $y_i$. The natural action of the cyclic group $C_M$ on the $M$ containers in one part of the partition has the cycle index [25]:

$$Z_{C_M}(x_1, \ldots, x_M) = \frac{1}{M} \sum_{d \mid M} \varphi(d)\, x_d^{M/d}, \tag{2.13}$$

where $\varphi(d)$ is defined to be the number of natural numbers not exceeding $d$ that are coprime to it (the Euler totient function). Note that 1 is considered to be coprime to all natural numbers, and so in particular $\varphi(d) > 0$. Exactly half of the elements of $D_{2M}$ are rotations, and thus the rotational part of the bipartite cycle index $\tilde{Z}_{D_{2M}}$ is given by $\frac{1}{2M} \sum_{d \mid M} \varphi(d)\, x_d^{M/d} y_d^{M/d}$.

The reflective part of the group $D_{2M}$, on the other hand, acts differently depending on the parity of $M$. Suppose first that $M$ is odd, in which case a typical reflection is illustrated in Fig. 5. Each of the $M$ possible reflections occurs across an axis passing through one black container and one white container, both of which are fixed by the reflection. The rest of the containers are split into 2-cycles, and thus the bipartite cycle index $\tilde{Z}_{D_{2M}}$ for odd $M$ is given by:

$$\tilde{Z}_{D_{2M}}(x_1, \ldots, x_M, y_1, \ldots, y_M) = \frac{1}{2M} \sum_{d \mid M} \varphi(d)\, x_d^{M/d} y_d^{M/d} + \frac{1}{2}\, x_1 y_1\, x_2^{(M-1)/2} y_2^{(M-1)/2}. \tag{2.14}$$

If $M$ is even, a typical reflection is illustrated in Fig. 6. In this case, each possible reflection occurs across an axis passing through either two white containers or two black containers. The rest of the containers again split into 2-cycles. Thus the bipartite cycle index $\tilde{Z}_{D_{2M}}$ for even $M$ is given by:

$$\tilde{Z}_{D_{2M}}(x_1, \ldots, x_M, y_1, \ldots, y_M) = \frac{1}{2M} \sum_{d \mid M} \varphi(d)\, x_d^{M/d} y_d^{M/d} + \frac{1}{4}\, x_1^2\, x_2^{(M-2)/2} y_2^{M/2} + \frac{1}{4}\, y_1^2\, y_2^{(M-2)/2} x_2^{M/2}. \tag{2.15}$$

Fig. 5. The action of a reflection $r \in D_{2M}$ on the DNA necklace, for the case where $M$ is odd.

Fig. 6. The action of a reflection $r \in D_{2M}$ on the DNA necklace, for the case where $M$ is even.

2.5. Generating Functions as Formal Power Series

In our particular application of Pólya theory, the elements we are coloring are the $\alpha$ containers in the DNA necklace and the color of a particular container is defined to be the number of black or white beads it contains.
As each container must contain at least one bead, the set of colors is given by $\mathbb{N}^+$. We are interested in the total number of black and white beads, so the weight of each color will be given quite simply by $\omega(n) = n$ for each $n \in \mathbb{N}^+$. This weighting corresponds to the generating function (2.5) $f_\omega(x) = x + x^2 + x^3 + \cdots$.

To compute the number of distinct DNA necklaces with $N_{AT}$ white beads and $N_{GC}$ black beads, we need to calculate the coefficient of $x^{N_{AT}} y^{N_{GC}}$ in (2.12), where the bivariate cycle index is given by the appropriate $\tilde{Z}(D_{2M})$ from Sect. 2.4 and the weight generating function is given by $f_\omega(x)$. This requires us to calculate the coefficients of specific terms in $f_\omega^n(x) = (x + x^2 + x^3 + \cdots)^n$ for potentially large $n$. However, doing this expansion naively requires a number of computing steps that grows exponentially fast as $n$ increases, so this approach is impractical.

Fortunately, there exists a way to bypass this problem: treating $f_\omega(x)$ as a formal power series, we can manipulate it into a form that makes such computations significantly faster. An introduction to the theory of formal power series can be found, for instance, in [42]. For our purposes, we will only need the fact that a form of the binomial theorem holds in this setting:

Lemma 3. Letting $(1 - x)^{-n}$ denote the formal inverse of $(1 - x)^n$, we have:

$$(1 - x)^{-n} = \sum_{k=0}^{\infty} \binom{n + k - 1}{n - 1} x^k. \tag{2.16}$$

This implies the following useful lemma regarding powers of $f_\omega(x)$:

Lemma 4. As a formal power series $f_\omega^n(x)$ can be written as $f_\omega^n(x) = \sum_{k=0}^{\infty} \binom{n+k-1}{n-1} x^{n+k}$.

Proof. Note that $x f_\omega(x) = x^2 + x^3 + \cdots = f_\omega(x) - x$. Rearranging this for $f_\omega(x)$, we see that $f_\omega(x) = x(1 - x)^{-1}$, and hence $f_\omega^n(x) = x^n (1 - x)^{-n}$. The result now follows from Lemma 3.

In contrast to naively expanding powers of $f_\omega(x)$, computing binomial coefficients is computationally inexpensive, taking at most a linear number of steps in $n$. We now list a few results that will come in handy later, when we describe an explicit algorithm for computing the number of distinct DNA necklaces with the given constraints.

Lemma 5. The coefficient of $x^r$ in $f_\omega^b(x^a)$ is given by:

$$\left[ f_\omega^b(x^a) \right]_r = \begin{cases} 1 & \text{if } b = 0 \text{ and } r = 0, \\ 0 & \text{if } b = 0 \text{ and } r > 0, \\ 0 & \text{if } b > 0 \text{ and } (a \nmid r \text{ or } r < ab), \\ \binom{r/a - 1}{b - 1} & \text{otherwise.} \end{cases} \tag{2.17}$$

Lemma 6. The coefficient of $x^r$ in $f_\omega^{b_1}(x^{a_1}) \cdot f_\omega^{b_2}(x^{a_2})$ is given by:

$$\left[ f_\omega^{b_1}(x^{a_1}) \cdot f_\omega^{b_2}(x^{a_2}) \right]_r = \sum_{k=0}^{r} \left[ f_\omega^{b_1}(x^{a_1}) \right]_k \left[ f_\omega^{b_2}(x^{a_2}) \right]_{r-k}. \tag{2.18}$$

3. The Algorithm for Computing the Number of Distinct Valid Necklaces

Now we are able to evaluate the number of distinct necklaces which correspond to a particular number of alternations $\alpha$. The algorithm is fairly straightforward and efficient. Its implementation requires the following steps:

a) Set the constraint parameters $N_{AT}$, $N_{GC}$, and $\alpha = 2M$.

b) Choose the partitioned cycle index polynomial of the dihedral group based on the parity of $M$. If $M$ is odd, use (2.14), while for $M$ even use (2.15).

c) By the corollary to Pólya's Enumeration Theorem (2.12), we know that the number of necklaces, up to symmetry, is given by

$$\mathrm{Orbits}_{\tilde{\Gamma}_G}(x, y) = \tilde{Z}_G\big(f_X(x), \ldots, f_X(x^k), f_Y(y), \ldots, f_Y(y^k)\big). \tag{3.1}$$

If $M$ is odd, using the outcome of the previous step we get

$$\mathrm{Orbits}(x, y) = \frac{1}{2M} \sum_{d \mid M} \varphi(d)\, f^{M/d}(x^d)\, f^{M/d}(y^d) + \frac{1}{2}\, f(x) f(y)\, f^{(M-1)/2}(x^2)\, f^{(M-1)/2}(y^2). \tag{3.2}$$

If $M$ is even, then we have

$$\mathrm{Orbits}(x, y) = \frac{1}{2M} \sum_{d \mid M} \varphi(d)\, f^{M/d}(x^d)\, f^{M/d}(y^d) + \frac{1}{4}\, f^2(x)\, f^{(M-2)/2}(x^2)\, f^{M/2}(y^2) + \frac{1}{4}\, f^2(y)\, f^{(M-2)/2}(y^2)\, f^{M/2}(x^2). \tag{3.3}$$

d) Every term in the polynomial produced by (3.1) will be of the form in (2.17) or (2.18). The number of necklaces with $N_{AT}$ white beads and $N_{GC}$ black beads is given by the coefficient of the term $x^{N_{AT}} y^{N_{GC}}$.
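Steps a) to d) can be sketched in Python as follows. This is a minimal illustration written for this text under stated assumptions, not the code of the paper's Appendix: the function names `totient`, `coeff`, `coeff_product` and `count_necklaces` are hypothetical, the coefficient rules are those of (2.17) and (2.18) with the zero-power case read as "1 if $r = 0$", and exact rational arithmetic via `fractions.Fraction` is an implementation choice made here so that the prefactors $1/2M$, $1/2$ and $1/4$ are handled without rounding.

```python
from fractions import Fraction
from math import comb, gcd


def totient(d):
    """Euler totient: number of 1 <= k <= d with gcd(k, d) == 1."""
    return sum(1 for k in range(1, d + 1) if gcd(k, d) == 1)


def coeff(r, a, b):
    """Coefficient of x**r in f(x**a)**b, with f(x) = x + x**2 + ... (2.17)."""
    if b == 0:
        return 1 if r == 0 else 0
    if r % a != 0 or r < a * b:
        return 0
    return comb(r // a - 1, b - 1)


def coeff_product(r, a1, b1, a2, b2):
    """Coefficient of x**r in f(x**a1)**b1 * f(x**a2)**b2 (2.18)."""
    return sum(coeff(k, a1, b1) * coeff(r - k, a2, b2) for k in range(r + 1))


def count_necklaces(n_at, n_gc, alpha):
    """Distinct necklaces with n_at white beads, n_gc black beads and alpha
    alternations, counted up to rotation and reflection."""
    if alpha == 0:
        # Homogeneous necklace: requires exactly one bead type to be present.
        return 1 if (n_at == 0) != (n_gc == 0) else 0
    if alpha < 0 or alpha % 2:
        return 0                      # alpha must be even, alpha = 2M
    m = alpha // 2
    # Rotational part, common to (3.2) and (3.3).
    rotations = sum(
        totient(d) * coeff(n_at, d, m // d) * coeff(n_gc, d, m // d)
        for d in range(1, m + 1) if m % d == 0
    )
    total = Fraction(rotations, 2 * m)
    if m % 2:                         # odd M: reflective part of (3.2)
        h = (m - 1) // 2
        total += Fraction(1, 2) * (
            coeff_product(n_at, 1, 1, 2, h) * coeff_product(n_gc, 1, 1, 2, h)
        )
    else:                             # even M: reflective part of (3.3)
        h = (m - 2) // 2
        total += Fraction(1, 4) * (
            coeff_product(n_at, 1, 2, 2, h) * coeff(n_gc, 2, m // 2)
            + coeff_product(n_gc, 1, 2, 2, h) * coeff(n_at, 2, m // 2)
        )
    assert total.denominator == 1     # an orbit count must be an integer
    return int(total)


if __name__ == "__main__":
    print(count_necklaces(8, 6, 10))
```

The final integrality assertion is a cheap consistency check on the cycle index terms: a slip in any prefactor or exponent usually shows up as a non-integer orbit count.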
To calculate the total number of necklaces, simply sum over each of these terms appearing in the polynomial. A Python computer code implementing this algorithm is presented in the Appendix.

In order to illustrate the application of this algorithm let us consider a simple, but not trivial, case: we set $\alpha = 2M = 10$, $N_{AT} = 8$, $N_{GC} = 6$. Clearly $M = 5$ is odd, so identifying white beads with AT base pairs and black beads with GC base pairs, we have the cycle index

$$\tilde{Z}(D_{10}) = \frac{1}{2} \tilde{Z}(C_5) + \frac{1}{2}\, x_1 y_1 (x_2)^2 (y_2)^2 = \frac{1}{2 \cdot 5} \sum_{d \mid 5} \varphi(d)\, (x_d)^{5/d} (y_d)^{5/d} + \frac{1}{2}\, x_1 y_1 (x_2)^2 (y_2)^2. \tag{3.4}$$

Now the partitioned Pólya Enumeration Theorem tells us that we can put the generating functions $f_W(x^d)$ and $f_B(y^d)$ in place of the $x_d$ and $y_d$ respectively to find the generating function of the orbits. So we have

$$\begin{aligned} \mathrm{Orbits}(x, y) = {} & \frac{1}{2 \cdot 5} \Big[ 1 \cdot (x + x^2 + x^3 + \ldots)^5 (y + y^2 + y^3 + \ldots)^5 + 4\, (x^5 + x^{10} + x^{15} + \ldots)(y^5 + y^{10} + y^{15} + \ldots) \Big] \\ & + \frac{1}{2} (x + x^2 + \ldots)(x^2 + x^4 + \ldots)^2 (y + y^2 + \ldots)(y^2 + y^4 + \ldots)^2. \end{aligned} \tag{3.5}$$

Let us first look at the cyclic part. Since 5 is prime, the only two integers that divide it are 1 and 5, so this polynomial will be

$$\frac{1}{2 \cdot 5} \Big[ 1 \cdot (x + x^2 + x^3 + \ldots)^5 (y + y^2 + y^3 + \ldots)^5 + 4\, (x^5 + x^{10} + x^{15} + \ldots)(y^5 + y^{10} + y^{15} + \ldots) \Big].$$

Now we extract the coefficients of the terms that are allowed. These are the terms in $x^{N_{AT}}$ and $y^{N_{GC}}$, and we can use (2.17) to calculate these coefficients directly. In this case, there will be no contribution from the second term, as it contains no terms in $x^8$ and $y^6$. So the total cyclic contribution will be (with $r = 8$ and $r = 6$ for the respective cases and $a = 1$, $b = 5$ for both)

$$\frac{1}{10} \binom{N_{GC} - 1}{5 - 1} \binom{N_{AT} - 1}{5 - 1} = \frac{1}{10} \binom{5}{4} \binom{7}{4} = \frac{175}{10}.$$

Then the same coefficient identifying process can be followed for the reflective part. Now the polynomial is given by

$$\frac{1}{2} (x + x^2 + \ldots)(x^2 + x^4 + \ldots)^2 (y + y^2 + \ldots)(y^2 + y^4 + \ldots)^2.$$

So for both $x$ and $y$ the coefficients come from the product of two series, one of them squared, and the relevant terms are given by (2.18) with the individual coefficients taken from (2.17). In $y$ the sum of coefficients contracts to a single term, coming from $y^2 \cdot y^4$, whose contribution is simply $\binom{1}{0} \binom{1}{1} = 1$. In $x$, however, there will be terms from $x^2 \cdot x^6$ as well as $x^4 \cdot x^4$, so the sum will be

$$\binom{1}{0} \binom{2}{1} + \binom{3}{0} \binom{1}{1} = 3.$$

Since the $x$ and $y$ factors multiply, the total is $\frac{1}{2}(1 \cdot 3) + \frac{175}{10} = 19$. Thus there are 19 distinct DNA necklaces with 8 AT base pairs, 6 GC base pairs and 10 alternations.

4. Numerical Results

The developed algorithm for calculating the number of distinct DNA chains having $\alpha$ alternations can be used to produce the pdf of $\alpha$, $P(\alpha)$, which can then be compared to pdfs obtained numerically from Monte Carlo (MC) simulations. In Figs. 7(a) and (b) we present such pdfs for a DNA chain containing $N = 100$ base pairs. In particular, we consider the case of $N_{AT} = 40$, $N_{GC} = 60$ in Fig. 7(a) and the case of $N_{AT} = 50$, $N_{GC} = 50$ in Fig. 7(b). From Figs. 7(a) and (b) we clearly see that the results obtained by the algorithm presented in Sect. 3 (empty circles) and by MC simulations of DNA chains with $N = 100$ base pairs (filled stars) agree very well.

Fig. 7. Comparison of the pdf $P(\alpha)$ of the number of alternations $\alpha$, obtained by the algorithm presented in Sect. 3 [empty circles in panels (a) and (b)] and by randomly created DNA chains of $N = 100$ base pairs through MC simulations [filled stars in panels (a) and (b)]. The pdfs for $N_{AT} = 40$, $N_{GC} = 60$ and $N_{AT} = 50$, $N_{GC} = 50$ are presented in panels (a) and (b) respectively. The number of MC simulations used in (a) and (b) is $N_{MC} = 20000$. (c) The evolution of the average total absolute difference $\langle d \rangle$ between the theoretically and the numerically obtained pdfs as a function of $N_{MC}$ for the case of $N_{AT} = 50$, $N_{GC} = 50$. The values of $\langle d \rangle$ are obtained as the average of the quantity (4.1) evaluated for 5 different sets of $N_{MC}$ runs. The error bars denote the corresponding standard deviations.

The slight differences between the two sets of results are to be expected, as the number of possible chains is generally very large. For instance, in the case of $N_{AT} = 50$, $N_{GC} = 50$ and $\alpha = 50$, the leading term of (3.2), $\binom{49}{24}^2 / 50$, is already roughly $8 \times 10^{25}$. Thus, in general, the number of performed MC simulations cannot get close to the actual total number of possible chains. Nevertheless, although the results of Figs. 7(a) and (b) were obtained by only $N_{MC} = 20000$ MC simulations, they manage to capture the theoretically obtained pdf quite accurately.

Of course it is expected that increasing the number of MC simulations will improve the accuracy of the numerical results. As a measure of this accuracy we can consider the total absolute difference

$$d(N_{MC}) = \sum_{\alpha} \left| P_{MC}(N_{MC}, \alpha) - P(\alpha) \right| \tag{4.1}$$

between the two distributions. In (4.1) $P_{MC}(N_{MC}, \alpha)$ is the probability of $\alpha$ alternations obtained by $N_{MC}$ MC simulations, $P(\alpha)$ is the one obtained theoretically, and the sum is performed over all possible values of $\alpha$. From the results of Fig. 7(c), where we plot the value of $d(N_{MC})$ averaged over 5 sets of $N_{MC}$ MC simulations as a function of $N_{MC}$, we see that as the number of simulations increases, the numerical results get closer to the theoretical ones.

The results of Fig. 7 clearly show that in order to study the dynamical properties of DNA chains, statistical analysis performed over a few thousand MC generated random chains (even of the order of 5000) would suffice, as such numbers of MC simulations are enough for capturing quite accurately the influence of alternations on the system's dynamics.

The shape of the pdfs in Figs. 7(a) and (b) suggests that they could be fitted by Gaussian distributions. This is indeed the case, as we can see from the results of Fig. 8, where we performed such a fit for the theoretically obtained pdf of Fig. 7(b).

Fig. 8. Fitting by a Gaussian of the theoretical pdf of Fig. 7(b) (empty circles) with $N_{AT} = 50$, $N_{GC} = 50$. The mean of the Gaussian is $\alpha_0 = 50.5$ and the standard deviation is $\sigma_\alpha = 5.1$.

The Gaussian approximation of the pdfs has several advantages, as it allows us to easily quantify the influence of different variables on the number of alternations. Let us first look at the effect of increasing the number of only one type of base pair, keeping constant the number of the other type of base pair. In Fig. 9 we present some pdfs of $\alpha$ for $N_{AT} = 100$ and increasing values of $N_{GC}$ from 25 up to 2500.

Fig. 9. Pdfs of $\alpha$ for a fixed number of AT base pairs ($N_{AT} = 100$) and increasing values of $N_{GC}$ ($N_{GC} = 25$, 50, 75, 100, 500 and 2500). Points correspond to the theoretically obtained values of the pdfs, while curves correspond to the Gaussian fits of these points. Note that even for long DNA chains the value of $\alpha$ cannot exceed $\alpha = 200$.

Starting from
2 2018 P(α) P(α) Po´lya Counting in Periodic DNA Chains 13 small values of N , we ﬁnd a very “lopsided” and narrow distribution which as N increases GC GC becomes gradually more symmetric and spreads out, up to a value of N = 200. Then, increasing GC N further, as the numbers of diﬀerent types of base pairs become more dissimilar we again ﬁnd GC gradually more unbalanced pdfs with sharp peaks. The very “lopsided” base pair distributions are obtained when the minority base pairs are signiﬁcantly less than the majority ones and therefore are spread out and isolated among the others. In this case the distribution is sharply peaked around the corresponding maximum possible number of alternations. For the N = 100, N = 25 case AT GC this number is α = 50, while for the N = 100, N = 2500 case it is α = 200. AT GC 250 8 0.35 (b) (c) (a) 7 0.30 6 0.25 5 0.20 α σ 4 0.15 3 0.10 0 2 0.05 0 500 1000 1500 2000 2500 0 500 1000 1500 2000 2500 0 500 1000 1500 2000 2500 N N N GC GC GC Fig. 10. The eﬀect of increasing the number N of the GC base pairs for a ﬁxed number of AT base pairs GC (N = 100) on the Gaussian ﬁt P (α) of the pdf values of α, and in particular on (a) the mean value α , AT G 0 (b) the standard deviation σ and (c) the maximum probability max [P (α)]. Some of these pdfs are shown α G in Fig. 9. These changes of the distributions are quantitatively presented in Fig. 10 through the variations of the ﬁtted Gaussian characteristics. The increase of the mean value α of the Gaussian ﬁts as the number N increases is shown in Fig. 10(a). The upper limit of α is 200, when N becomes GC 0 GC much larger than N . The dependence of the width (standard deviation) σ of the Gaussian ﬁts AT α on N is depicted in Fig. 10(b). The initial increase with N corresponds to the spreading out of GC GC the distributions when the numbers of base pairs become more similar. 
Further increase of the N GC values pushes the pdfs to the other extreme and the lopsidedness comes through again, resulting in narrower distributions (see Fig. 9). This results in the decrease of σ for large values of N . α GC Finally in Fig. 10(c) we observe that as N increases the maximum probability of the pdfs initially GC decreases rapidly and then increases slowly, in accordance with the results of Fig. 9 and of course with the fact that it is inversely proportional to the standard deviation of the Gaussian ﬁt. Let us now focus our attention on the eﬀect of the increment of the total number of base pairs N = N + N , i.e. the total ‘length’ of the DNA chain, when the ratio N : N is kept AT GC GC AT constant. Such cases are presented in Fig. 11, where we plot several pdfs for diﬀerent values of N but for ﬁxed ratios N : N . In particular, the values of the ratios N : N are 1 : 1 in panel GC AT GC AT (a) (b) (c) 0.200.20 0.20 N = 1000 N = 900 N = 1050 N :N = 2 : 1 N :N = 6 : 1 GC AT GC AT N :N = 1 : 1 GC AT N = 400 N = 450 N = 700 0.150.15 0.15 N = 200 N = 150 N = 350 0.100.10 0.10 0.050.05 0.05 0.000.00 0.00 100 200 300 400 500 600 100 200 300 400 500 600 100 200 300 400 500 600 α α α Fig. 11. Pdfs of α for ﬁxed ratios N : N = 1 : 1 (a), 2 : 1 (b) and 6 : 1 (c). Points correspond to the GC AT theoretically obtained values of the pdfs, while curves correspond to the Gaussian ﬁts of these points. (a), 2 : 1 in (b) and 6 : 1 in (c). In all cases the pdfs are ﬁtted by appropriate Gaussian distributions REGULAR AND CHAOTIC DYNAMICS Vol. 23 No. 2 2018 P(α) max[P ( )] G 14 Hillebrand et al. 500 18 0.9 (a) Ratio 6:1 Ratio 6:1 Ratio 6:1 (b) 16 0.8 (c) Ratio 2:1 Ratio 2:1 Ratio 2:1 14 0.7 Ratio 1:1 Ratio 1:1 Ratio 1:1 12 0.6 10 0.5 α σ 8 0.4 6 0.3 4 0.2 2 0.1 0 0 0.0 0 200 400 600 800 1000 0 200 400 600 800 1000 0 200 400 600 800 1000 N N N Fig. 12. 
The eﬀect of increasing the total number of base pairs N for ﬁxed ratios N : N on the parameters GC AT of the Gaussian ﬁt P (α) of the pdf for α: (a) the mean value α , (b) the standard deviation σ and (c) the G 0 α maximum probability max [P (α)]. Some of these pdfs are shown in Fig. 11. whose characteristics are plotted in Fig. 12 as a function of N. From the results of Figs. 11 and 12 we see that as the total number N of base pairs increases the pdfs become more broad, and consequently their maximum value decreases. This means that for large N more α values have a relatively high probability to appear in a randomly created DNA chain. In addition, increasing the ratio N : N results in a decrease of the spreading, as evidenced by the lower standard GC AT deviation in Fig. 12(b) and the higher maximum probability in Fig. 12(c). A linear relationship between N and the mean α is observed for all ratios, with the slope of the line inﬂuenced by the ratio. The slope m for each case is: m = 0.25 for ratio 6 : 1, m = 0.45 for 2 : 1 and m = 0.5 for 1 : 1. 5. Conclusions Motivated by the possibility that the number α of base pair alternations in a circular or periodic DNA chain might aﬀect the dynamics of the system, we have found a probability distribution for this number. Algorithms for such distributions are known for linear DNA sequences with ﬁxed boundary conditions [31]. The introduction of the periodic boundary conditions we consider in our study makes the counting of alternations a much more complicated problem due to the appearance of additional rotational and reﬂectional symmetries. To account for the additional complexity arising from these symmetries we have implemented Po´lya counting theory. 
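The role of these symmetries can be checked directly on small cases. The following minimal sketch (the function names are illustrative and are not part of the authors' code) enumerates every arrangement of AT beads (1) and GC beads (0) on a ring, keeps those with the requested number of alternations, and identifies arrangements related by a rotation or a reflection by reducing each one to a canonical representative. Its cost grows exponentially with the chain length, so it is only meant as a cross-check for short chains, such as the example of Sect. 3 with 8 AT base pairs, 6 GC base pairs and 10 alternations.

```python
from itertools import combinations


def alternations(necklace):
    """Count cyclically adjacent positions whose beads differ."""
    n = len(necklace)
    return sum(necklace[i] != necklace[(i + 1) % n] for i in range(n))


def canonical(necklace):
    """Smallest representative among all rotations and reflections."""
    n = len(necklace)
    forms = []
    for seq in (necklace, necklace[::-1]):
        for shift in range(n):
            forms.append(seq[shift:] + seq[:shift])
    return min(forms)


def brute_force_count(n_at, n_gc, alpha):
    """Number of distinct necklaces with n_at AT beads (1), n_gc GC beads (0)
    and alpha alternations, identifying rotations and reflections."""
    n = n_at + n_gc
    classes = set()
    for at_positions in combinations(range(n), n_at):
        beads = [0] * n
        for i in at_positions:
            beads[i] = 1
        necklace = tuple(beads)
        if alternations(necklace) == alpha:
            classes.add(canonical(necklace))
    return len(classes)


print(brute_force_count(8, 6, 10))  # prints 19
```

The value can be compared with `necklace_count(5, 6, 8)` from the Appendix, which evaluates the same case through the cycle index.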
In particular, extending Pólya's Enumeration Theorem for a partition-preserving group action on a partitioned set, we have constructed a well defined algorithm for calculating the number of DNA chains having a given number of alternations for particular values of the number of AT ($N_{AT}$) and GC ($N_{GC}$) base pairs.

The obtained theoretical results were compared with numerically constructed pdfs through MC simulations. We found that, in general, by creating a few thousand random DNA chains (around 5000) through MC simulations we can approximate quite accurately the theoretical pdf of $\alpha$. This means that a statistical analysis of these DNA chains will suffice to uncover the potential influence of heterogeneity on the dynamic behavior of the considered DNA model. In addition, approximating the obtained pdfs by Gaussians we investigated the effect of the number of the two base pairs, as well as their ratio, on various characteristics of the pdfs, like their mean value, their standard deviation and their maximum.

APPENDIX

Here we present a Python computer code implementing the algorithm of Sect. 3. The function necklace_count(n, B, W) returns the total number of possible necklaces under the symmetry constraints with 2n alternations (n >= 1), B black beads and W white beads.

```python
from math import gcd


# Compute binomial coefficients in linear time.
def binomial(n, k):
    if k > n or k < 0:
        return 0
    if k == 0:
        return 1
    if k > n // 2:
        return binomial(n, n - k)
    return (n * binomial(n - 1, k - 1)) // k


# Compute the Euler totient function \phi(n), which
# gives the number of integers 0 < d <= n that are
# relatively prime to n.
def totient(n):
    count = 0
    for d in range(1, n + 1):
        if gcd(d, n) == 1:
            count += 1
    return count


# Get the x^r coefficient of our weight generating functions f(x^m)^n,
# where:
#   f(x) = x + x^2 + x^3 + ...
def weight_gf(r, m, n):
    if n == 0:
        if r == 0:
            return 1
        return 0
    if r % m != 0:
        return 0
    if (r // m) < n:
        return 0
    return binomial((r // m) - 1, n - 1)


# Get the x^r coefficient of a binary product of weight generating
# functions f(x^m1)^n1 * f(x^m2)^n2, where:
#   f(x) = x + x^2 + x^3 + ...
# The sum runs over i = 0, ..., r inclusive, so that a factor with
# exponent 0 (which equals 1) contributes its constant term; this is
# needed for n = 1 and n = 2 in necklace_count.
def binary_weight_gf(r, m1, n1, m2, n2):
    total = 0
    for i in range(r + 1):
        total += weight_gf(i, m1, n1) * weight_gf(r - i, m2, n2)
    return total


# Compute the number of necklaces up to dihedral symmetry with
# 2n alternations, B black beads and W white beads.
def necklace_count(n, B, W):
    # First we count the contributions from the cyclic part
    # of the cycle index.
    count = 0
    for d in range(1, n + 1):
        if n % d != 0:
            continue
        count += totient(d) * weight_gf(B, d, n // d) * weight_gf(W, d, n // d)
    # Next we count the contributions from the dihedral part
    # of the cycle index.
    if n % 2 == 0:
        count += (weight_gf(B, 2, n // 2)
                  * binary_weight_gf(W, 1, 2, 2, (n - 2) // 2)
                  * (n // 2))
        count += (weight_gf(W, 2, n // 2)
                  * binary_weight_gf(B, 1, 2, 2, (n - 2) // 2)
                  * (n // 2))
    else:
        count += (binary_weight_gf(B, 1, 1, 2, (n - 1) // 2)
                  * binary_weight_gf(W, 1, 1, 2, (n - 1) // 2)
                  * n)
    return count // (2 * n)
```

Acknowledgements

M.H. and G.P-J. acknowledge financial assistance from the National Research Foundation (NRF) of South Africa towards this research. G.K. and Ch.S. were supported by the Erasmus+/International Credit Mobility KA107 program. Ch.S.
acknowledges support by the NRF of South Africa (IFRR and CPRR Programmes), the UCT (URC Conference Travel Grant) and thanks Hans-Peter Kunzi for useful discussions.

REFERENCES

1. Alberts, B., Bray, D., Hopkin, K., Johnson, A., Lewis, J., Raff, M., Roberts, K., Walter, P., Essential Cell Biology, 2nd ed., Garland Science, 2004.
2. Alexandrov, B.S., Gelev, V., Monisova, Y., Alexandrov, L.B., Bishop, A.R., Rasmussen, K.Ø., Usheva, A., Nucleic Acids Res. 37, 2405 (2009).
3. Alexandrov, B.S., Gelev, V., Yoo, S.W., Bishop, A.R., Rasmussen, K.Ø., Usheva, A., PLoS Comput. Biol. 5, e1000313 (2009).
4. Alexandrov, B.S., Gelev, V., Yoo, S.W., Alexandrov, L.B., Fukuyo, Y., Bishop, A.R., Rasmussen, K.Ø., Usheva, A., Nucleic Acids Res. 38, 1790 (2010).
5. Apostolaki, A., Kalosakas, G., Phys. Biol. 8, 026006 (2011).
6. Ares, S., Voulgarakis, N.K., Rasmussen, K.Ø., Bishop, A.R., Phys. Rev. Lett. 94, 035504 (2005).
7. Ares, S., Kalosakas, G., Nano Lett. 7, 307 (2007).
8. Brualdi, R.A., Pólya Counting. In: Introductory Combinatorics, 5th ed., Upper Saddle River, NJ: Prentice Hall, 2010.
9. Burnside, W., Theory of Groups of Finite Order, Cambridge: Cambridge University Press, 1897.
10. Chetverikov, A.P., Ebeling, W., Lakhno, V.D., Shigaev, A.S., Velarde, M.G., Eur. Phys. J. B 89, 101 (2016).
11. Choi, C.H., Kalosakas, G., Rasmussen, K.Ø., Hiromura, M., Bishop, A.R., Usheva, A., Nucleic Acids Res. 32, 1584 (2004).
12. Choi, C.H., Rapti, Z., Gelev, V., Hacker, M.R., Alexandrov, B.S., Park, E.J., Park, J.S., Horikoshi, N., Smerzi, A., Rasmussen, K.Ø., Bishop, A.R., Usheva, A., Biophys. J. 95, 597 (2008).
13. Dauxois, T., Peyrard, M., Bishop, A.R., Phys. Rev. E 47, 684 (1993).
14. Hennig, D., Eur. Phys. J. B 30, 211 (2002).
15. Herstein, I.N., Abstract Algebra, 3rd ed., Wiley, 1999.
16. Huang, H.-H., Lindblad, P., J. Biol. Eng. 7, 10 (2013).
17. Kalosakas, G., Phys. Rev. E 84, 051905 (2011).
18. Kalosakas, G., Ares, S., J. Chem. Phys. 130, 235104 (2009).
19. Kalosakas, G., Ngai, K.L., Flach, S., Phys. Rev. E 71, 061901 (2005).
20. Kalosakas, G., Rasmussen, K.Ø., Bishop, A.R., Choi, C.H., Usheva, A., Europhys. Lett. 68, 127 (2004).
21. Kalosakas, G., Rasmussen, K.Ø., Bishop, A.R., Chem. Phys. Lett. 432, 291 (2006).
22. Kolpakov, R., Bana, G., Kucherov, G., Nucleic Acids Res. 31, 3672 (2003).
23. Lewin, B., Genes VIII, Pearson Prentice Hall, 2004.
24. Li, W., Computers Chem. 21, 257 (1997).
25. van Lint, J.H., Wilson, R.M., Pólya theory of counting. In: A Course in Combinatorics, Cambridge: Cambridge University Press, 1992.
26. Nowak-Lovato, K., Alexandrov, L.B., Banisadr, A., Bauer, A.L., Bishop, A.R., Usheva, A., Mu, F., Hong-Geller, E., Rasmussen, K.Ø., Hlavacek, W.S., Alexandrov, B.S., PLoS Comput. Biol. 9, e1002881 (2013).
27. Peyrard, M., Nonlinearity 17, R1 (2004).
28. Peyrard, M., Farago, J., Physica A 288, 199 (2000).
29. Pólya, G., Read, R.C., Chemical Compounds. In: Combinatorial Enumeration of Groups, Graphs, and Chemical Compounds, New York: Springer, 1987.
30. Régnier, M., Disc. App. Math. 104, 259 (2000).
31. Robin, S., Daudin, J.J., Journ. Appl. Prob. 36, 179 (1999).
32. Robin, S., Schbath, S., Journ. Comp. Biol. 8, 349 (2001).
33. Tabi, C.B., Dang Koko, A., Oumarou Doko, R., Ekobena Fouda, H.P., Kofane, T.C., Physica A 442, 498 (2016).
34. Schbath, S., ESAIM: Probability and Statistics 1, 1 (1995).
35. Schbath, S., Prum, B., de Turckheim, E., Journ. Comp. Biol. 2, 417 (1995).
36. Skokos, Ch., Hillebrand, M., Schwellnus, A., Kalosakas, G., in preparation (2018).
37. Tapia-Rojo, R., Mazo, J.J., Falo, F., Phys. Rev. E 82, 031916 (2010).
38. Tapia-Rojo, R., Mazo, J.J., Hernandez, J.A., Peleato, M.L., Fillat, M.F., Falo, F., PLoS Comput. Biol. 10, e1003835 (2014).
39. Theodorakopoulos, N., Phys. Rev. E 77, 031919 (2008).
40. Voulgarakis, N.K., Kalosakas, G., Rasmussen, K.Ø., Bishop, A.R., Nano Lett. 4, 629 (2004).
41. Yakushevich, L.V., Nonlinear Physics of DNA, 2nd ed., Wiley-VCH, 2004.
42. Zariski, O., Samuel, P., Polynomial and Power Series Rings. In: Commutative Algebra, Graduate Texts in Mathematics, vol. 29, Berlin: Springer, 1960.
43. Zoli, M., J. Phys.: Condens. Matter 24, 195103 (2012).
44. Zoli, M., J. Theor. Biol. 354, 95 (2014).
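A possible way to reproduce the Monte Carlo comparison of Sect. 4 is sketched below. It is an illustration only: the sampling scheme (uniform shuffling of a chain with fixed numbers of AT and GC base pairs), the function names and the seeds are assumptions, since the text does not specify how the random chains were generated. The function `total_absolute_difference` evaluates the sum of Eq. (4.1) for two pdfs given as dictionaries; here two independent MC estimates are compared, as a stand-in for the comparison with the theoretical pdf.

```python
import random
from collections import Counter


def count_alternations(chain):
    """Alternations in a periodic chain: the last bead neighbours the first."""
    n = len(chain)
    return sum(chain[i] != chain[(i + 1) % n] for i in range(n))


def monte_carlo_pdf(n_at, n_gc, n_mc, seed=0):
    """Estimate P(alpha) from n_mc randomly shuffled periodic chains.

    Assumption: chains are drawn by shuffling n_at ones (AT) and
    n_gc zeros (GC) uniformly at random.
    """
    rng = random.Random(seed)
    chain = [1] * n_at + [0] * n_gc
    counts = Counter()
    for _ in range(n_mc):
        rng.shuffle(chain)
        counts[count_alternations(chain)] += 1
    return {alpha: c / n_mc for alpha, c in sorted(counts.items())}


def total_absolute_difference(p_a, p_b):
    """Eq. (4.1): sum over alpha of |P_a(alpha) - P_b(alpha)|."""
    alphas = set(p_a) | set(p_b)
    return sum(abs(p_a.get(a, 0.0) - p_b.get(a, 0.0)) for a in alphas)


# Two independent estimates for N_AT = N_GC = 50 with different N_MC.
pdf_small = monte_carlo_pdf(50, 50, 2000, seed=1)
pdf_large = monte_carlo_pdf(50, 50, 20000, seed=2)
print(max(pdf_large, key=pdf_large.get))  # most probable alpha in the large run
print(total_absolute_difference(pdf_small, pdf_large))
```

Replacing one of the two dictionaries by a theoretical pdf built from `necklace_count` of the Appendix would give the quantity $d(N_{MC})$ of Eq. (4.1).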

ISSN 1560-3547, Regular and Chaotic Dynamics, 2018, Vol. 23, No. 2, pp. 1–16. c Pleiades Publishing, Ltd., 2018. Distribution of Base Pair Alternations in a Periodic DNA Chain: Application of Po´lya Counting to a Physical System 1* 1 Malcolm Hillebrand , Guy Paterson-Jones , 2 1 George Kalosakas , and Charalampos Skokos Department of Mathematics and Applied Mathematics, University of Cape Town, Rondebosch, Cape Town 7701, South Africa Department of Materials Science, University of Patras, Rio GR-26504, Greece Received October 13 2017; accepted December 11, 2017 Abstract—In modeling DNA chains, the number of alternations between Adenine-Thymine (AT) and Guanine-Cytosine (GC) base pairs can be considered as a measure of the heterogeneity of the chain, which in turn could aﬀect its dynamics. A probability distribution function of the number of these alternations is derived for circular or periodic DNA. Since there are several symmetries to account for in the periodic chain, necklace counting methods are used. In particular, Po´lya’s Enumeration Theorem is extended for the case of a group action that preserves partitioned necklaces. This, along with the treatment of generating functions as formal power series, allows for the direct calculation of the number of possible necklaces with a given number of AT base pairs, GC base pairs and alternations. The theoretically obtained probability distribution functions of the number of alternations are accurately reproduced by Monte Carlo simulations and ﬁtted by Gaussians. The eﬀect of the number of base pairs on the characteristics of these distributions is also discussed, as well as the eﬀect of the ratios of the numbers of AT and GC base pairs. MSC2010 numbers: 05A15, 92D20 DOI: 10.0000/S1560354718000013 Keywords: DNA models, Po´lya’s Counting Theorem, Heterogeneity, Necklace Combinatorics 1. Introduction Single circular DNA molecules are abundant in nature. 
The whole genome in a typical bacterium is usually contained in a closed DNA molecule, while in eucaryotes the organelle DNA, inside the mitochondria and chloroplasts, is also found in the same form [1, 23]. Also plasmids, either naturally found in bacteria, or used as vectors in gene cloning, are smaller circular DNA segments. Apart from these cases, in considering the dynamics and other properties of DNA chains, it is often useful to model the chain using periodic boundary conditions in order to avoid ﬁnite size or edge eﬀects. For example, periodic boundary conditions have been used to study denaturation bubbles and the melting behavior of DNA [2, 6, 13, 37, 39, 43], probability distributions of thermal openings in the double strand [7, 18], bubble opening proﬁles in promoter regions which regulate gene transcription [3–5, 11, 12, 16, 20], binding sites of DNA-associated proteins [26, 38], various dynamical and nonlinear properties of DNA [21, 27, 28, 40, 41, 44], as well as charge transport in DNA [10, 14, 17, 19, 33]. A DNA chain consists of a series of base pairs, where each base pair is either Adenine-Thymine (AT) or Guanine-Cytosine (GC). Currently, we are investigating the inﬂuence of diﬀerent factors on the chaoticity of periodic DNA chains [36]. One of the examined quantities is the number of base pair alternations, which can be considered as a quantiﬁer of the system’s heterogeneity. In this work we focus on the rigorous mathematical treatment of alternation counting in periodic DNA sequences. To study periodic DNA, we will consider the DNA necklace associated to a DNA chain, E-mail: malcolm.hillebrand@gmail.com arXiv:1805.06245v1 [math.CO] 16 May 2018 2 Hillebrand et al. where the ﬁrst and the last base pairs in the chain will become neighbors. 
This periodicity presents some modeling challenges - if one considers two distinct chains of DNA, it may still be the case that their corresponding necklaces are the same, as one may be merely a rotation or reﬂection of the other. Such symmetries need to be addressed if any conclusions are to be made about the structure and the dynamics of DNA necklaces. In particular, we are concerned with the number α of base pair alternations in the necklace, where an alternation is deﬁned to be a point at which an AT base pair neighbors a GC base pair or vice versa. Consider, for instance, the DNA chain shown in Fig. 1. Representing a GC base pair (black bead) with a 0 and an AT base pair (white bead) with a 1, the 0 0 0 0 1 0 1 1 0 0 1 Fig. 1. An example of a DNA chain. GC base pairs are represented by black beads and the number 0, while AT base pairs are represented by white beads and the number 1. In the DNA necklace corresponding to this chain, the AT base pair at the far right neighbors the GC base pair at the far left. ¯¯¯ ¯ ¯¯ chain can be written in the form (1)00001011001(0). Here, we have given the leftmost base pair at each alternation point an overbar, and used brackets to denote the fact that in the corresponding DNA necklace the ﬁrst and last base pairs are neighbors. This necklace is illustrated in Fig. 2, and counting the number of overbars we see that there are α = 6 alternations. Fig. 2. The DNA necklace corresponding to the chain of Fig. 1. This necklace has α = 6 alternations. It is worth noting that a base pair alternation corresponds to the appearance of the particular sequences (often referred to as “words”) 01 or 10 in a DNA chain. Word occurrence probabilities have already been studied in the literature (see e.g. [22, 24, 30–32, 34, 35] and references therein), with emphasis on the appearance of patterns with unexpectedly high or low frequencies, as well as on repeating sequences. 
However these studies concern the case of linear DNA segments, or in other words DNA chains with ﬁxed boundary conditions. The periodic boundary conditions we consider in our study make the problem of counting alternations (or more generally the appearance of speciﬁc words) in circular DNA segments much more complicated than in the case of linear DNA segments due to the appearance of additional symmetries in the DNA structures imposed by rotations and/or reﬂections. Each base pair in a DNA necklace can contribute at most 2 alternations, depending on which neighbors it diﬀers from. Supposing that the number of AT and GC base pairs in the necklace is given by N and N respectively, this yields the restriction 0 ≤ α ≤ min{2N , 2N }. We AT GC AT GC note that in the extreme case of a homogeneous chain composed of base pairs of the same kind α = 0, while if both types of base pairs are present in the DNA chain the smallest possible value of alternations is α = 2. The later corresponds to a chain having all AT (and consequently GC) base pairs grouped together. Furthermore, if we traverse the necklace pair by pair until we end up where we started, we must necessarily switch between AT and GC base pairs an even number of times. Thus α = 2M for some M ∈ N. Now the natural question is: what is the probability that a random DNA necklace with a speciﬁed number of AT and GC base pairs, N and N respectively, has a speciﬁed number of AT GC alternations α? Or in other words, how many possible combinations of such base pairs are there REGULAR AND CHAOTIC DYNAMICS Vol. 23 No. 2 2018 Po´lya Counting in Periodic DNA Chains 3 that yield α alternations once the cyclic and reﬂective symmetries are taken into account? In what follows we answer these questions and provide an algorithm for computing the number of distinct DNA necklaces satisfying these constraints. The paper is organized in the following way: In Sect. 
2, the mathematical background is laid out, leading into a Pólya Enumeration Theorem for bipartite sets. In Sect. 3 an explicit algorithm for calculating the number of distinct DNA necklaces with given values of $\alpha$, $N_{AT}$ and $N_{GC}$ is described, while in Sect. 4 we compare the theoretical results to those obtained from Monte Carlo simulations and investigate the effect of the $N_{AT}$ and $N_{GC}$ values on the characteristics of the probability distribution function (pdf) of $\alpha$. Finally, in Sect. 5 we summarize our results, while in the Appendix we provide a Python computer code implementing the algorithm of Sect. 3.

2. Theoretical Treatment

Our problem can be neatly related to the combinatorics of necklaces. Effectively, we are interested in the number of distinct necklaces with $N = N_{AT} + N_{GC}$ beads, where $N_{AT}$ of the beads are white, $N_{GC}$ of the beads are black, and there are $\alpha$ alternations between the colors. We consider necklaces to be the same if they can be reflected or rotated into one another, and beads of the same color are treated as indistinguishable. Because of this, we can equivalently think of a necklace with $\alpha$ alternations as a necklace of $\alpha$ containers, where each container carries some positive number of beads of a single color, and adjacent containers have different colors. This idea is illustrated in Fig. 3.

Fig. 3. The necklace of containers corresponding to the DNA necklace of Fig. 2. The numbers in each container represent the number of consecutive black or white beads in that segment of the necklace.

We will refer to containers carrying black beads as black containers, and similarly for white containers. Counting the number of distinct necklaces with the given constraints can thus be reformulated as the problem of assigning numbers of beads to $\alpha$ containers, such that the totals of the numbers in the black and white containers are equal to $N_{GC}$ and $N_{AT}$ respectively. Two such assignments will be considered equivalent if the containers can be rotated or reflected into one another in such a way as to preserve both the colors and the numbers of beads they contain. Enumerating such assignments is simpler than enumerating necklaces, as we have one fewer constraint: the number of alternations is now implicit in the formulation of the problem. To perform this enumeration we will require some tools from Pólya counting theory. In particular, we will need a version of the Pólya Enumeration Theorem for sets partitioned into two parts, which we will refer to as bipartite sets. For completeness' sake, we present this material below.

2.1. Group Actions

Let $A$ be a set. Then we define the symmetric group on $A$ to be the set of permutations of $A$:

$$S_A = \{\phi : A \to A \mid \phi \text{ is a bijection}\}. \tag{2.1}$$

A cycle is a permutation $\phi \in S_A$ such that there exist distinct elements $x_1, x_2, \ldots, x_k \in A$ and:

$$\phi(x) = \begin{cases} x_{i+1} & \text{if } x = x_i \text{ for some } 1 \le i < k, \\ x_1 & \text{if } x = x_k, \\ x & \text{otherwise.} \end{cases} \tag{2.2}$$

We denote such a cycle suggestively as $(x_1\, x_2\, \ldots\, x_k)$, and say that $\phi \in S_A$ is a $k$-cycle if $\phi = (x_1\, x_2\, \ldots\, x_k)$ for some $x_i \in A$. Two cycles $(x_1\, x_2\, \ldots\, x_k)$ and $(y_1\, y_2\, \ldots\, y_l)$ are said to be disjoint if the sets $\{x_1, x_2, \ldots, x_k\}$ and $\{y_1, y_2, \ldots, y_l\}$ are disjoint.

If $A$ is a finite set, every element of $S_A$ can be written as a composition of cycles; in general, however, this cannot be done uniquely. On the other hand, we have the following fundamental structure theorem for elements of finite symmetric groups (see for example [15]):

Theorem (Cycle Decomposition Theorem). If $A$ is a finite set, then every element $\phi \in S_A$ can be written as a product of pairwise disjoint cycles, unique up to the order of the cycles:

$$\phi = (x_{11}\, x_{12}\, \ldots\, x_{1k_1}) \cdots (x_{n1}\, x_{n2}\, \ldots\, x_{nk_n}).$$

Given a group $G$ and a set $A$, a group action of $G$ on $A$ is a homomorphism $\Gamma_G : G \to S_A$. In other words, elements of $G$ are identified with permutations of $A$ in a manner that preserves the group structure. To simplify the notation, we will write $gx$ instead of $\Gamma_G(g)(x)$ for the action of $g \in G$ on some $x \in A$.

The orbit of an element $x \in A$ under the group action $\Gamma_G$ is defined to be the set $\mathrm{Orb}_x = \{gx \mid g \in G\}$, and its stabilizer is given by the subgroup $\mathrm{Stab}_x = \{g \in G \mid gx = x\}$. Given some $g \in G$, we denote its set of fixed points by $\mathrm{Fix}_g = \{x \in A \mid gx = x\}$.

2.2. Pólya's Counting Theory

One can often rephrase counting problems in terms of computing the number of distinct orbits of some group action. Pólya's counting theory can be thought of as a tool for making these computations systematic and expedient. A fundamental lemma on which this theory is built is the following [9]:

Lemma 1 (Burnside's Lemma). The number of distinct orbits in a group action of a finite group $G$ on $A$ is given by the average number of fixed points of elements of $G$:

$$\#\mathrm{Orbits} = \frac{1}{|G|} \sum_{g \in G} |\mathrm{Fix}_g|. \tag{2.3}$$

A basic problem in combinatorics is the following. Suppose one has a finite set of objects $A$, and one wishes to color them with colors from another set $\Omega$. How many distinct ways are there of coloring the objects up to some kind of symmetry? This can be recast in the language of group actions. The set of possible colorings is given by $\Omega^A = \{\phi : A \to \Omega \mid \phi \text{ a function}\}$, and the symmetry is given by a group action $\Gamma_G$ on $A$. This group action passes naturally to a group action $\tilde{\Gamma}_G$ on $\Omega^A$, defined by $g\phi : x \mapsto \phi(gx)$. The question now reduces to counting the number of distinct orbits of this latter action. In this simplified case, Burnside's lemma is often sufficient to answer the question.

We can generalize this problem slightly, however. Suppose that each color has an associated weight, given by a function $\omega : \Omega \to \mathbb{N}$. Given a coloring $\phi : A \to \Omega$ of the objects, we define its total weight to be the sum:

$$|\phi| = \sum_{x \in A} \omega \circ \phi(x). \tag{2.4}$$

How many distinct colorings of $A$ with a given total weight are there, up to symmetries given by some group action $\Gamma_G$? Note that the total weight of any coloring in a given orbit is the same, as elements of $G$ merely permute the set $A$. Thus, the problem boils down to calculating the number of distinct orbits with a given total weight. Pólya identified two necessary ingredients for a systematic answer to this question: generating functions, and an understanding of the cycle structure of elements of $G$ [29].

Definition (Generating Function). Let $\omega : \Omega \to \mathbb{N}$ be an assignment of weights to some set $\Omega$. Suppose further that there are at most a finite number of elements of any given weight, that is, $|\omega^{-1}(n)|$ is finite for every $n \in \mathbb{N}$. Then the generating function of $\omega$ is given by the power series:

$$f_\omega(x) = \sum_{i=0}^{\infty} |\omega^{-1}(i)|\, x^i. \tag{2.5}$$

Generating functions are useful as they encode combinatorial data, in this case the number of colors of a given weight, as algebraic objects. In particular, we will need the following lemma:

Lemma 2. Let $\omega_1 : \Omega_1 \to \mathbb{N}$ and $\omega_2 : \Omega_2 \to \mathbb{N}$ be assignments of weights to the sets $\Omega_1$ and $\Omega_2$ respectively. Define an assignment of weights to the set $\Omega_1 \times \Omega_2$ by $\omega : (x_1, x_2) \mapsto \omega_1(x_1) + \omega_2(x_2)$. Then $f_\omega(x) = f_{\omega_1}(x) \cdot f_{\omega_2}(x)$.

Given a group action $\Gamma_G$ and an element $g \in G$, we denote by $C_k(g)$ the number of $k$-cycles in the unique disjoint cycle decomposition of $\Gamma_G(g)$. We can now encode information about the cycle structure of elements of $G$ in the following multivariate polynomial:

Definition (Cycle Index). Let $G$ be a finite group. Then the cycle index of a group action $\Gamma_G$ on a finite set $A$ of cardinality $n$ is given by the polynomial [8]:

$$Z_G(x_1, x_2, \ldots, x_n) = \frac{1}{|G|} \sum_{g \in G} x_1^{C_1(g)} x_2^{C_2(g)} \cdots x_n^{C_n(g)}. \tag{2.6}$$

This cycle index will allow us to efficiently compute the number of distinct orbits of the group action.
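As an illustration of Burnside's lemma and of the role played by cycle counts, the sketch below counts unweighted colorings of an $n$-bead necklace under the dihedral group: a coloring is fixed by $g$ exactly when it is constant on each cycle of $g$, so $g$ fixes $c^{\#\text{cycles}(g)}$ colorings with $c$ colors. This amounts to evaluating the cycle index (2.6) at $x_1 = \cdots = x_n = c$. All function names here are hypothetical and chosen for illustration only; the brute-force routine is included purely as a cross-check.

```python
from itertools import product


def dihedral_permutations(n):
    """The 2n rotations and reflections of an n-gon, as tuples p with p[i] = image of i."""
    rotations = [tuple((i + s) % n for i in range(n)) for s in range(n)]
    reflections = [tuple((s - i) % n for i in range(n)) for s in range(n)]
    return rotations + reflections


def cycle_count(perm):
    """Number of cycles in the disjoint cycle decomposition of perm."""
    seen, cycles = set(), 0
    for start in range(len(perm)):
        if start not in seen:
            cycles += 1
            x = start
            while x not in seen:
                seen.add(x)
                x = perm[x]
    return cycles


def orbits_burnside(n, colors):
    """Average number of fixed colorings over the group, as in Eq. (2.3)."""
    group = dihedral_permutations(n)
    return sum(colors ** cycle_count(g) for g in group) // len(group)


def orbits_brute_force(n, colors):
    """Count orbits directly by reducing every coloring to a canonical representative."""
    group = dihedral_permutations(n)
    canonical = set()
    for coloring in product(range(colors), repeat=n):
        canonical.add(min(tuple(coloring[g[i]] for i in range(n)) for g in group))
    return len(canonical)


print(orbits_burnside(6, 2), orbits_brute_force(6, 2))  # 13 13
```

The brute-force count grows like $c^n$ and is only usable for small $n$, whereas the Burnside count needs just the cycle structure of the $2n$ group elements.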
With this in mind, we are now in a position to state a version of the Pólya counting theorem, answering the generalized problem given earlier:

Theorem (Pólya Enumeration Theorem). Let $A$ be a finite set of $n$ objects, $\Omega$ a set of colors, $\omega : \Omega \to \mathbb{N}$ an assignment of weights to the colors with generating function $f_\omega$, and $\Gamma_G$ a group action of a finite group $G$ on $A$. Then $\Gamma_G$ passes naturally to a group action $\tilde{\Gamma}_G$ on $\Omega^A$, and a generating function by total weight for the number of distinct orbits of $\tilde{\Gamma}_G$ is given by:

$$\mathrm{Orbits}_{\tilde{\Gamma}_G}(x) = Z_G\big(f_\omega(x), f_\omega(x^2), \ldots, f_\omega(x^n)\big). \tag{2.7}$$

2.3. Pólya Enumeration Theorem for Bipartite Sets

By considering multivariate generating functions, the Pólya enumeration theorem can be generalized to the case where the colors take weights in $\mathbb{N}^k$. We will generalize the theorem in a different direction, however. Suppose we have a partition of $A$ into two parts, $A = X \sqcup Y$, and a group action $\Gamma_G$ on $A$. We would like to consider the problem of counting distinct colorings of $A$ under this symmetry, with the additional constraint that we color elements of $X$ from a set $\Omega_X$, and elements of $Y$ from a set $\Omega_Y$. To this end, we will say that a coloring $\phi : A \to \Omega_X \sqcup \Omega_Y$ is valid if $\phi(x) \in \Omega_X \iff x \in X$ and $\phi(x) \in \Omega_Y \iff x \in Y$.

There is an obstruction to this, however: the group action may map elements in $X$ to elements in $Y$ or vice versa. In this case, the extension of $\Gamma_G$ to the set of possible colorings is no longer well-defined, as there is no natural way to compare the sets of colors $\Omega_X$ and $\Omega_Y$. Fortunately, this is the only obstruction to proving a Pólya-type theorem for this problem. This motivates the following definition:

Definition (Partition-Preserving Group Action). Let $A = X \sqcup Y$, and let $\Gamma_G$ be a group action on $A$. Then we say that $\Gamma_G$ is partition-preserving if for every $g \in G$, $gx \in X \iff x \in X$ and $gx \in Y \iff x \in Y$.

The importance of this property is as follows. Suppose we have a group action $\Gamma_G$ on $A = X \sqcup Y$, and some element $g \in G$.
Then $\Gamma_G(g)$ has a unique disjoint cycle decomposition given by $\Gamma_G(g) = C_1 \cdot C_2 \cdots C_m$. If $\Gamma_G$ is partition-preserving then each cycle $C_i$ is contained entirely in either $X$ or $Y$, and $\Gamma_G$ is in fact partition-preserving if and only if this is the case for every $g \in G$. If $\Gamma_G$ is partition-preserving, then we define $C_k^X(g)$ to be the number of $k$-cycles in the disjoint cycle decomposition of $\Gamma_G(g)$ that are contained in $X$, and we define $C_k^Y(g)$ analogously.

We will now define an analogue of the cycle index polynomial for the case of partition-preserving group actions. This will allow us to keep track of the cycle structure of elements of the group as well as which part of the partition each cycle acts on:

Definition (Bipartite Cycle Index). Let $G$ be a finite group and $A = X \sqcup Y$ a finite set of cardinality $n$. Then the bipartite cycle index of a partition-preserving group action $\Gamma_G$ on $A$ is defined to be the polynomial:

$$Z_G(x_1, \ldots, x_n, y_1, \ldots, y_n) = \frac{1}{|G|} \sum_{g \in G} x_1^{C_1^X(g)} \cdots x_n^{C_n^X(g)}\, y_1^{C_1^Y(g)} \cdots y_n^{C_n^Y(g)}. \tag{2.8}$$

We can now generalize Pólya's theorem to the case of partition-preserving group actions. We note that this theorem is used implicitly in [29] without proof.

Theorem 1 (Bipartite Pólya Enumeration Theorem). Let $\Gamma_G$ be a partition-preserving group action of a finite group $G$ on a finite set $A = X \sqcup Y$ of cardinality $n$. Let $\Omega = \Omega_X \sqcup \Omega_Y$ be a set of colors, and let $\omega_X : \Omega_X \to \mathbb{N}^+$ and $\omega_Y : \Omega_Y \to \mathbb{N}^+$ be their assigned weights with respective generating functions $f_X$ and $f_Y$. If $\Phi$ is the set of valid colorings of $A$, then $\Gamma_G$ passes naturally to a group action $\tilde{\Gamma}_G$ on $\Phi$, and a generating function by total weight for the number of orbits of $\tilde{\Gamma}_G$ is given by:

$$\mathrm{Orbits}_{\tilde{\Gamma}_G}(x) = Z_G\big(f_X(x), \ldots, f_X(x^n), f_Y(x), \ldots, f_Y(x^n)\big). \tag{2.9}$$

Proof. We pass to a group action $\tilde{\Gamma}_G$ on $\Phi$ as follows. Given a valid coloring $\phi \in \Phi$ and an element $g \in G$, we define the action of $g$ on $\phi$ by $g\phi : x \mapsto \phi(gx)$.
To compute a generating function for the number of orbits of $\tilde{\Gamma}_G$ by total weight, we will determine the generating functions for the number of fixed points of each $g \in G$ by total weight.

Consider some $g \in G$. As $A$ is finite, there exists a unique disjoint cycle decomposition $\Gamma_G(g) = C_1 \cdot C_2 \cdots C_m$, where each $C_i$ is a cycle in the symmetric group $S_A$. Now suppose that $g$ fixes some valid coloring $\phi \in \Phi$; that is, $g\phi = \phi$. Then, writing the cycle as $C_i = (x_1\, x_2\, \ldots\, x_{k_i})$ for some $x_j \in A$, we have by definition that $\phi(x_j) = (g\phi)(x_j) = \phi(gx_j) = \phi(x_{j+1})$, and hence every element in the cycle must have the same color under $\phi$. The number of colorings of $C_i$ that are fixed by $g$ is thus given by the generating function $f_X(x^{k_i})$ if $C_i$ lies in $X$, and $f_Y(x^{k_i})$ if $C_i$ lies in $Y$. We note that one of these two cases must occur for every cycle as $\Gamma_G$ is partition-preserving. By Lemma 2, then, the number of valid colorings of $A$ that are fixed by $g$ is given by the generating function:

$$\mathrm{Fix}_g(x) = f_X(x)^{C_1^X(g)} \cdots f_X(x^n)^{C_n^X(g)}\, f_Y(x)^{C_1^Y(g)} \cdots f_Y(x^n)^{C_n^Y(g)}. \tag{2.10}$$

By Burnside's lemma, the number of orbits of $\tilde{\Gamma}_G$ of a particular weight is given by the average number of colorings of that weight fixed by elements $g \in G$. Applying Burnside's lemma for each possible weight, the number of orbits of $\tilde{\Gamma}_G$ is thus given by the generating function:

$$\begin{aligned} \mathrm{Orbits}_{\tilde{\Gamma}_G}(x) &= \frac{1}{|G|} \sum_{g \in G} \mathrm{Fix}_g(x) \\ &= \frac{1}{|G|} \sum_{g \in G} f_X(x)^{C_1^X(g)} \cdots f_X(x^n)^{C_n^X(g)}\, f_Y(x)^{C_1^Y(g)} \cdots f_Y(x^n)^{C_n^Y(g)} \\ &= Z_G\big(f_X(x), \ldots, f_X(x^n), f_Y(x), \ldots, f_Y(x^n)\big). \end{aligned} \tag{2.11}$$

We note that as a corollary of this proof, we can recover a bivariate generating function from this expression, where the coefficient of $x^a y^b$ represents the number of distinct colorings with total weight $a$ in $\Omega_X$, and total weight $b$ in $\Omega_Y$:

Corollary. A bivariate generating function by total weight in $\Omega_X$ and $\Omega_Y$, for the number of distinct colorings of $A$, is given by:

$$\mathrm{Orbits}_{\tilde{\Gamma}_G}(x, y) = Z_G\big(f_X(x), \ldots, f_X(x^n), f_Y(y), \ldots, f_Y(y^n)\big). \tag{2.12}$$

2.4. The Dihedral Group, its Cycle Index and its Extension

To apply these results to the problem of counting distinct DNA necklaces, we will need to describe the relevant group action and compute its (bipartite) cycle index. The set of elements acted on by the group is given by the $\alpha$ containers in the DNA necklace, and this set can be partitioned into two parts: containers of black beads and containers of white beads. We consider two DNA necklaces to be the same if one can be rotated or reflected into the other. These symmetries can be described by an action of the dihedral group, which we will denote by $D_{2M}$, where we have $\alpha = 2M$. The rotational and reflective symmetries are what distinguish the case of periodic DNA chains from the linear, fixed boundary condition chains studied in [31] and elsewhere.

A fundamental fact about $D_{2M}$ is that it is generated by two elements $r$ and $s$, where $r$ is a reflection satisfying $r^2 = 1$, and $s$ is a rotation of order $M$. Therefore, to describe a group action of $D_{2M}$ on a DNA necklace it suffices to give the action of $r$ and $s$. In Fig. 4 the action of such a rotation on the necklace is illustrated, while in Figs. 5 and 6 the action of a reflection is illustrated for the cases where $M$ is odd and even respectively. It is clear that the resulting group action is partition-preserving.

Fig. 4. The action of a rotation $s \in D_{2M}$ on the DNA necklace.

To compute the bipartite cycle index of this group action, we will treat reflections and rotations separately. To begin with, we can see from Fig. 4 that rotations act symmetrically on the black and white containers in the DNA necklace. Thus, the terms of the cycle index polynomial corresponding to rotations will be symmetric in the $x_i$ and $y_i$. The cycle index of the natural action of the cyclic group $C_M$ on the $M$ containers in one part of the partition is given by [25]:

$$Z_{C_M}(x_1, \ldots, x_M) = \frac{1}{M} \sum_{d \mid M} \varphi(d)\, x_d^{M/d}, \tag{2.13}$$

where $\varphi(d)$ is defined to be the number of natural numbers not exceeding $d$ that are coprime to it (the Euler totient function). Note that 1 is considered to be coprime to all natural numbers, and so in particular $\varphi(d) > 0$. Exactly half of the elements of $D_{2M}$ are rotations, and thus the rotational part of the bipartite cycle index $Z_{D_{2M}}$ is given by $\frac{1}{2M} \sum_{d \mid M} \varphi(d)\, x_d^{M/d} y_d^{M/d}$.

The reflective part of the group $D_{2M}$, on the other hand, acts differently depending on the parity of $M$. Suppose first that $M$ is odd, in which case a typical reflection is illustrated in Fig. 5. Each of the $M$ possible reflections occurs across an axis passing through one black container and one white container, both of which are fixed by the reflection. The rest of the containers are split into 2-cycles, and thus the bipartite cycle index $Z_{D_{2M}}$ for odd $M$ is given by:

$$Z_{D_{2M}}(x_1, \ldots, x_M, y_1, \ldots, y_M) = \frac{1}{2M} \sum_{d \mid M} \varphi(d)\, x_d^{M/d} y_d^{M/d} + \frac{1}{2}\, x_1 y_1\, x_2^{(M-1)/2} y_2^{(M-1)/2}. \tag{2.14}$$

If $M$ is even, a typical reflection is illustrated in Fig. 6. In this case, each possible reflection occurs across an axis passing through either two white containers or two black containers. The rest of the containers again split into 2-cycles. Thus the bipartite cycle index $Z_{D_{2M}}$ for even $M$ is given by:

$$Z_{D_{2M}}(x_1, \ldots, x_M, y_1, \ldots, y_M) = \frac{1}{2M} \sum_{d \mid M} \varphi(d)\, x_d^{M/d} y_d^{M/d} + \frac{1}{4}\, x_1^2\, x_2^{(M-2)/2} y_2^{M/2} + \frac{1}{4}\, y_1^2\, y_2^{(M-2)/2} x_2^{M/2}. \tag{2.15}$$

Fig. 5. The action of a reflection $r \in D_{2M}$ on the DNA necklace, for the case where $M$ is odd.

Fig. 6. The action of a reflection $r \in D_{2M}$ on the DNA necklace, for the case where $M$ is even.

2.5. Generating Functions as Formal Power Series

In our particular application of Pólya theory, the elements we are coloring are the $\alpha$ containers in the DNA necklace and the color of a particular container is defined to be the number of black or white beads it contains.
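The passage from beads to containers can be made concrete with a short run-length encoding of a circular chain. This is only an illustrative sketch: the function name `containers` and the string encoding (`0` for a black GC bead, `1` for a white AT bead) are assumptions, not part of the original algorithm.

```python
def containers(chain):
    """Run-length encode a circular 0/1 chain into (color, size) containers.

    The size of a container is the "color" assigned to it in the Polya
    counting argument, i.e. the number of consecutive equal beads it holds.
    """
    n = len(chain)
    if n == 0:
        return []
    # Start at a position where a new run begins, so no run is split by the wrap-around.
    start = next((i for i in range(n) if chain[i] != chain[i - 1]), None)
    if start is None:
        # Homogeneous chain: a single run and no alternations.
        return [(chain[0], n)]
    runs = []
    for k in range(n):
        bead = chain[(start + k) % n]
        if runs and runs[-1][0] == bead:
            runs[-1] = (bead, runs[-1][1] + 1)
        else:
            runs.append((bead, 1))
    return runs


runs = containers("00001011001")  # the chain of Fig. 1
print(runs)
# [('0', 4), ('1', 1), ('0', 1), ('1', 2), ('0', 2), ('1', 1)]
```

For the chain of Fig. 1 this yields six containers, matching $\alpha = 6$, with black sizes summing to 7 and white sizes summing to 4.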
As each container must contain at least one bead, the set of colors is given by $\mathbb{N}^+$. We are interested in the total number of black and white beads, so the weight of each color will be given quite simply by $\omega(n) = n$ for each $n \in \mathbb{N}^+$. This weighting corresponds to the generating function (2.5) $f_\omega(x) = x + x^2 + x^3 + \cdots$.

To compute the number of distinct DNA necklaces with $N_{AT}$ white beads and $N_{GC}$ black beads, we need to calculate the coefficient of $x^{N_{AT}} y^{N_{GC}}$ in (2.12), where the bivariate cycle index is given by the appropriate $Z_{D_{2M}}$ from Sect. 2.4 and the weight generating function is given by $f_\omega(x)$. This requires us to calculate the coefficients of specific terms in $f_\omega^n(x) = (x + x^2 + x^3 + \cdots)^n$ for potentially large $n$. However, doing this expansion naively requires a number of computing steps that grows exponentially fast as $n$ increases. Thus, this approach is impractical. Fortunately, there exists a way to bypass this problem: treating $f_\omega(x)$ as a formal power series, we can manipulate it into a form that makes such computations significantly faster.

An introduction to the theory of formal power series can be found, for instance, in [42]. For our purposes, we will only need the fact that a form of the binomial theorem holds in this setting:

Lemma 3. Letting $(1 - x)^{-n}$ denote the formal inverse of $(1 - x)^n$, we have:

$$(1 - x)^{-n} = \sum_{k=0}^{\infty} \binom{n + k - 1}{n - 1} x^k. \tag{2.16}$$

This implies the following useful lemma regarding powers of $f_\omega(x)$:

Lemma 4. As a formal power series $f_\omega^n(x)$ can be written as $f_\omega^n(x) = \sum_{k=0}^{\infty} \binom{n + k - 1}{n - 1} x^{n+k}$.

Proof. Note that $x f_\omega(x) = x^2 + x^3 + \cdots = f_\omega(x) - x$. Rearranging this for $f_\omega(x)$, we see that $f_\omega(x) = x(1 - x)^{-1}$, and hence $f_\omega^n(x) = x^n (1 - x)^{-n}$. The result now follows from Lemma 3.
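Lemma 4 can be checked numerically for small cases by comparing the closed form with a direct, truncated expansion of $(x + x^2 + \cdots)^n$. The sketch below is an illustrative check under assumed names (`power_of_f_naive`, `power_of_f_closed`); it is not the implementation given in the Appendix.

```python
from math import comb


def power_of_f_naive(n, max_degree):
    """Coefficients of (x + x^2 + ...)^n up to x^max_degree by repeated multiplication."""
    f = [0] + [1] * max_degree          # truncated f(x) = x + x^2 + ...
    result = [1] + [0] * max_degree     # the constant series 1
    for _ in range(n):
        product = [0] * (max_degree + 1)
        for i, a in enumerate(result):
            if a == 0:
                continue
            for j in range(1, max_degree + 1 - i):
                product[i + j] += a * f[j]
        result = product
    return result


def power_of_f_closed(n, r):
    """Coefficient of x^r in f(x)^n from Lemma 4, i.e. C(r - 1, n - 1) for r >= n >= 1."""
    if n == 0:
        return 1 if r == 0 else 0
    if r < n:
        return 0
    return comb(r - 1, n - 1)


naive = power_of_f_naive(5, 12)
print(all(naive[r] == power_of_f_closed(5, r) for r in range(13)))  # True
print(power_of_f_closed(5, 8))  # 35
```

The closed form counts the ways of writing $r$ as an ordered sum of $n$ positive integers, which is exactly the number of ways to distribute $r$ beads over $n$ non-empty containers.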
In contrast to naively expanding powers of $f_\omega(x)$, computing binomial coefficients is computationally inexpensive, taking at most a linear number of steps in $n$. We now list a few results that will come in handy later, when we describe an explicit algorithm for computing the number of distinct DNA necklaces with the given constraints.

Lemma 5. The coefficient of $x^r$ in $f_\omega^b(x^a)$ is given by:

$$\big[f_\omega^b(x^a)\big]_r = \begin{cases} 1 & \text{if } b = 0 \text{ and } r = 0, \\ 0 & \text{if } b = 0 \text{ and } r > 0, \\ 0 & \text{if } b > 0 \text{ and } a \nmid r \text{ or } r < ab, \\ \dbinom{r/a - 1}{b - 1} & \text{otherwise.} \end{cases} \tag{2.17}$$

Lemma 6. The coefficient of $x^r$ in $f_\omega^{b_1}(x^{a_1}) \cdot f_\omega^{b_2}(x^{a_2})$ is given by:

$$\big[f_\omega^{b_1}(x^{a_1}) \cdot f_\omega^{b_2}(x^{a_2})\big]_r = \sum_{k=0}^{r} \big[f_\omega^{b_1}(x^{a_1})\big]_k \big[f_\omega^{b_2}(x^{a_2})\big]_{r-k}. \tag{2.18}$$

3. The Algorithm for Computing the Number of Distinct Valid Necklaces

We are now able to evaluate the number of distinct necklaces that correspond to a particular number of alternations $\alpha$. The algorithm is fairly straightforward and efficient. Its implementation requires the following steps:

a) Set the constraint parameters $N_{AT}$, $N_{GC}$, and $\alpha = 2M$.

b) Choose the bipartite cycle index polynomial of the dihedral group based on the parity of $M$. If $M$ is odd, use (2.14), while for $M$ even use (2.15).

c) By the corollary to the bipartite Pólya Enumeration Theorem (2.12), we know that the number of necklaces, up to symmetry, is given by

$$\mathrm{Orbits}_{\tilde{\Gamma}_G}(x, y) = Z_G\big(f_X(x), \ldots, f_X(x^n), f_Y(y), \ldots, f_Y(y^n)\big). \tag{3.1}$$

If $M$ is odd, using the outcome of the previous step we get

$$\mathrm{Orbits}(x, y) = \frac{1}{2M} \sum_{d \mid M} \varphi(d)\, f^{M/d}(x^d)\, f^{M/d}(y^d) + \frac{1}{2}\, f(x)\, f(y)\, f^{(M-1)/2}(x^2)\, f^{(M-1)/2}(y^2). \tag{3.2}$$

If $M$ is even, then we have

$$\mathrm{Orbits}(x, y) = \frac{1}{2M} \sum_{d \mid M} \varphi(d)\, f^{M/d}(x^d)\, f^{M/d}(y^d) + \frac{1}{4}\, f^2(x)\, f^{(M-2)/2}(x^2)\, f^{M/2}(y^2) + \frac{1}{4}\, f^2(y)\, f^{(M-2)/2}(y^2)\, f^{M/2}(x^2). \tag{3.3}$$

d) Every term in the series produced by (3.1) will be of the form in (2.17) or (2.18). The number of necklaces with $N_{AT}$ white beads and $N_{GC}$ black beads is given by the coefficient of the term $x^{N_{AT}} y^{N_{GC}}$.
To calculate the total number of necklaces, simply sum over each of these terms appearing in the series.

A Python computer code implementing this algorithm is presented in the Appendix.

In order to illustrate the application of this algorithm let us consider a simple, but not trivial case: we set $\alpha = 2M = 10$, $N_{AT} = 8$, $N_{GC} = 6$. Clearly $M = 5$ is odd, so identifying white beads with AT base pairs and black beads with GC base pairs, we have the cycle index

$$Z(D_{10}) = \frac{1}{2} Z(C_5) + \frac{1}{2}\, x_1 y_1 (x_2)^2 (y_2)^2 = \frac{1}{2 \cdot 5} \sum_{d \mid 5} \varphi(d)\, (x_d)^{5/d} (y_d)^{5/d} + \frac{1}{2}\, x_1 y_1 (x_2)^2 (y_2)^2. \tag{3.4}$$

Now the bipartite Pólya Enumeration Theorem tells us that we can put the generating functions $f_W(x^d)$ and $f_B(y^d)$ in place of $x_d$ and $y_d$ respectively to find the generating function of orbits. So we have

$$\begin{aligned} \mathrm{Orbits}(x, y) = {} & \frac{1}{2 \cdot 5} \Big[ 1 \cdot (x + x^2 + x^3 + \ldots)^5 (y + y^2 + y^3 + \ldots)^5 + 4 (x^5 + x^{10} + x^{15} + \ldots)(y^5 + y^{10} + y^{15} + \ldots) \Big] \\ & + \frac{1}{2} (x + x^2 + \ldots)(x^2 + x^4 + \ldots)^2 (y + y^2 + \ldots)(y^2 + y^4 + \ldots)^2. \end{aligned} \tag{3.5}$$

Let us first look at the cyclic part. Since 5 is prime, the only two integers that divide it are 1 and 5, so this part is

$$\frac{1}{2 \cdot 5} \Big[ 1 \cdot (x + x^2 + x^3 + \ldots)^5 (y + y^2 + y^3 + \ldots)^5 + 4 (x^5 + x^{10} + x^{15} + \ldots)(y^5 + y^{10} + y^{15} + \ldots) \Big].$$

We now extract the coefficient of the admissible term, $x^{N_{AT}} y^{N_{GC}} = x^8 y^6$, using (2.17) to calculate it directly. In this case, there is no contribution from the second term, as it contains no terms in $x^8$ and $y^6$. So the total cyclic contribution is (with $r = 8$ and $r = 6$ for the respective cases and $a = 1$, $b = 5$ for both)

$$\frac{1}{10} \binom{N_{GC} - 1}{5 - 1} \binom{N_{AT} - 1}{5 - 1} = \frac{1}{10} \binom{5}{4} \binom{7}{4} = \frac{175}{10}.$$

Then the same coefficient identifying process can be followed for the reflective part. Now the series is given by

$$\frac{1}{2} (x + x^2 + \ldots)(x^2 + x^4 + \ldots)^2 (y + y^2 + \ldots)(y^2 + y^4 + \ldots)^2.$$

So for both $x$ and $y$ the coefficients come from the product of two series, one of them squared. Thus, the relevant terms come as a sum of products of the form given in (2.18). In $y$ the sum of coefficients contracts to a single element, arising from $y^2 \cdot y^4$. That contribution is simply $\binom{1}{0} \binom{1}{1} = 1$. In $x$, however, there are terms from $x^2 \cdot x^6$ as well as from $x^4 \cdot x^4$. By (2.17) the coefficient of $x^6$ in $f^2(x^2)$ is $\binom{6/2 - 1}{2 - 1} = \binom{2}{1}$, so the sum is

$$\binom{1}{0} \binom{2}{1} + \binom{3}{0} \binom{1}{1} = 3.$$

Since the generating function factorizes, the $x$ and $y$ coefficients multiply, giving a total of $\frac{1}{2}(3 \cdot 1) + \frac{175}{10} = 19$. Thus there are 19 distinct DNA necklaces with 8 AT base pairs, 6 GC base pairs and 10 alternations.

4. Numerical Results

The developed algorithm for calculating the number of distinct DNA chains having $\alpha$ alternations can be used to produce the pdf of $\alpha$, $P(\alpha)$, which afterwards can be compared to pdfs numerically obtained from Monte Carlo (MC) simulations. In Figs. 7(a) and (b) we present such pdfs for a DNA chain containing $N = 100$ base pairs. In particular, we consider the case of $N_{AT} = 40$, $N_{GC} = 60$ in Fig. 7(a) and the case of $N_{AT} = 50$, $N_{GC} = 50$ in Fig. 7(b). From Figs. 7(a) and (b) we clearly see that the results obtained by the algorithm presented in Sect. 3 (empty circles) and by MC simulations of DNA chains with $N = 100$ base pairs (filled stars) agree very well. The slight differences between them are to be expected, as the number of possible chains is generally very large. For instance, in the case of $N_{AT} = 50$, $N_{GC} = 50$ and $\alpha = 50$, the number of possible DNA necklaces is of the order of 10 [exponent not legible]. Thus, in general, the number of performed MC simulations cannot get close to the actual total number of possible chains. Nevertheless, although the results of Figs. 7(a) and (b) were obtained by only $N_{MC} = 20000$ MC simulations, they manage to capture the theoretically obtained pdf quite accurately.

Fig. 7. Comparison of the pdf $P(\alpha)$ of the number of alternations $\alpha$, obtained by the algorithm presented in Sect. 3 [empty circles in panels (a) and (b)] and by randomly created DNA chains of $N = 100$ base pairs through MC simulations [filled stars in panels (a) and (b)]. The pdfs for $N_{AT} = 40$, $N_{GC} = 60$ and $N_{AT} = 50$, $N_{GC} = 50$ are presented in panels (a) and (b) respectively. The number of MC simulations used in (a) and (b) is $N_{MC} = 20000$. (c) The evolution of the average total absolute difference $\langle d \rangle$ between the theoretically and the numerically obtained pdfs as a function of $N_{MC}$ for the case of $N_{AT} = 50$, $N_{GC} = 50$. The values of $\langle d \rangle$ are obtained as the average of the quantity (4.1) evaluated for 5 different sets of $N_{MC}$ runs. The error bars denote the corresponding standard deviations.

Of course it is expected that increasing the number of MC simulations will improve the accuracy of the numerical results. As a measure of this accuracy we can consider the total absolute difference

$$d(N_{MC}) = \sum_{\alpha} \big| P_{MC}(N_{MC}, \alpha) - P(\alpha) \big|, \tag{4.1}$$

between the two distributions. In (4.1) $P_{MC}(N_{MC}, \alpha)$ is the probability of $\alpha$ alternations obtained by $N_{MC}$ MC simulations, $P(\alpha)$ is the one obtained theoretically, and the sum is performed over all possible values of $\alpha$. From the results of Fig. 7(c), where we plot the value of $d(N_{MC})$ averaged over 5 sets of $N_{MC}$ MC simulations as a function of $N_{MC}$, we see that as the number of simulations increases, the numerical results get closer to the theoretical ones.

The results of Fig.
7 clearly show that in order to study the dynamical properties of DNA chains, statistical analysis performed over a few thousand MC generated random chains (even of the order of 5000) would suffice, as such numbers of MC simulations are enough to capture quite accurately the influence of alternations on the system's dynamics.

The shape of the pdfs in Figs. 7(a) and (b) suggests that they could possibly be fitted by Gaussian distributions. This is indeed the case, as we can see from the results of Fig. 8, where we performed such a fit for the theoretically obtained pdf of Fig. 7(b). The Gaussian approximation of the pdfs has several advantages, as it allows us to easily quantify the influence of different variables on the number of alternations.

Fig. 8. Fitting by a Gaussian of the theoretical pdf of Fig. 7(b) (empty circles) with $N_{AT} = 50$, $N_{GC} = 50$. The mean of the Gaussian is $\alpha_0 = 50.5$ and its standard deviation is $\sigma_\alpha = 5.1$.

Let us first look at the effect of increasing the number of only one type of base pair, keeping the number of the other type constant. In Fig. 9 we present some pdfs of $\alpha$ for $N_{AT} = 100$ and increasing values of $N_{GC}$ from 25 up to 2500. Starting from small values of $N_{GC}$, we find a very "lopsided" and narrow distribution which, as $N_{GC}$ increases, becomes gradually more symmetric and spreads out, up to a value of $N_{GC} = 200$. Then, increasing $N_{GC}$ further, as the numbers of the two types of base pairs become more dissimilar we again find gradually more unbalanced pdfs with sharp peaks. The very "lopsided" distributions are obtained when the minority base pairs are significantly fewer than the majority ones and are therefore spread out and isolated among the others. In this case the distribution is sharply peaked around the corresponding maximum possible number of alternations. For the $N_{AT} = 100$, $N_{GC} = 25$ case this number is $\alpha = 50$, while for the $N_{AT} = 100$, $N_{GC} = 2500$ case it is $\alpha = 200$.

Fig. 9. Pdfs of $\alpha$ for a fixed number of AT base pairs ($N_{AT} = 100$) and increasing values of $N_{GC}$ ($N_{GC} = 25, 50, 75, 100, 500, 2500$). Points correspond to the theoretically obtained values of the pdfs, while curves correspond to the Gaussian fits of these points. Note that even for long DNA chains the value of $\alpha$ cannot exceed $\alpha = 200$.

Fig. 10. The effect of increasing the number $N_{GC}$ of the GC base pairs for a fixed number of AT base pairs ($N_{AT} = 100$) on the Gaussian fit $P_G(\alpha)$ of the pdf values of $\alpha$, and in particular on (a) the mean value $\alpha_0$, (b) the standard deviation $\sigma_\alpha$ and (c) the maximum probability $\max[P_G(\alpha)]$. Some of these pdfs are shown in Fig. 9.

These changes of the distributions are quantitatively presented in Fig. 10 through the variations of the fitted Gaussian characteristics. The increase of the mean value $\alpha_0$ of the Gaussian fits as the number $N_{GC}$ increases is shown in Fig. 10(a). The upper limit of $\alpha_0$ is 200, approached when $N_{GC}$ becomes much larger than $N_{AT}$. The dependence of the width (standard deviation) $\sigma_\alpha$ of the Gaussian fits on $N_{GC}$ is depicted in Fig. 10(b). The initial increase with $N_{GC}$ corresponds to the spreading out of the distributions when the numbers of base pairs become more similar. Further increase of the $N_{GC}$
Further increase of the $N_{GC}$ values pushes the pdfs to the other extreme and the lopsidedness comes through again, resulting in narrower distributions (see Fig. 9). This results in the decrease of $\sigma_\alpha$ for large values of $N_{GC}$. Finally, in Fig. 10(c) we observe that as $N_{GC}$ increases the maximum probability of the pdfs initially decreases rapidly and then increases slowly, in accordance with the results of Fig. 9 and of course with the fact that it is inversely proportional to the standard deviation of the Gaussian fit ($\max[P_G(\alpha)] = 1/(\sigma_\alpha \sqrt{2\pi})$ for a normalized Gaussian).

Let us now focus our attention on the effect of increasing the total number of base pairs $N = N_{AT} + N_{GC}$, i.e. the total 'length' of the DNA chain, when the ratio $N_{GC} : N_{AT}$ is kept constant. Such cases are presented in Fig. 11, where we plot several pdfs for different values of $N$ but for fixed ratios $N_{GC} : N_{AT}$. In particular, the values of the ratios $N_{GC} : N_{AT}$ are 1 : 1 in panel (a), 2 : 1 in (b) and 6 : 1 in (c). In all cases the pdfs are fitted by appropriate Gaussian distributions whose characteristics are plotted in Fig. 12 as a function of $N$.

[Fig. 11: three panels of $P(\alpha)$ versus $\alpha$ (from 100 to 600): (a) $N_{GC} : N_{AT} = 1 : 1$ with $N = 200, 400, 1000$; (b) $N_{GC} : N_{AT} = 2 : 1$ with $N = 150, 450, 900$; (c) $N_{GC} : N_{AT} = 6 : 1$ with $N = 350, 700, 1050$.]

Fig. 11. Pdfs of $\alpha$ for fixed ratios $N_{GC} : N_{AT} = 1 : 1$ (a), 2 : 1 (b) and 6 : 1 (c). Points correspond to the theoretically obtained values of the pdfs, while curves correspond to the Gaussian fits of these points.

[Fig. 12: three panels showing (a) $\alpha_0$, (b) $\sigma_\alpha$ and (c) $\max[P_G(\alpha)]$, each plotted against $N$ from 0 to 1000, for the ratios 6 : 1, 2 : 1 and 1 : 1.]

Fig. 12. The effect of increasing the total number of base pairs $N$ for fixed ratios $N_{GC} : N_{AT}$ on the parameters of the Gaussian fit $P_G(\alpha)$ of the pdf for $\alpha$: (a) the mean value $\alpha_0$, (b) the standard deviation $\sigma_\alpha$ and (c) the maximum probability $\max[P_G(\alpha)]$. Some of these pdfs are shown in Fig. 11.

From the results of Figs. 11 and 12 we see that as the total number $N$ of base pairs increases the pdfs become broader, and consequently their maximum value decreases. This means that for large $N$ more $\alpha$ values have a relatively high probability of appearing in a randomly created DNA chain. In addition, increasing the ratio $N_{GC} : N_{AT}$ results in a decrease of the spreading, as evidenced by the lower standard deviation in Fig. 12(b) and the higher maximum probability in Fig. 12(c). A linear relationship between $N$ and the mean $\alpha_0$ is observed for all ratios, with the slope of the line influenced by the ratio. The slope $m$ for each case is: $m = 0.25$ for ratio 6 : 1, $m = 0.45$ for 2 : 1 and $m = 0.5$ for 1 : 1.

5. Conclusions

Motivated by the possibility that the number $\alpha$ of base pair alternations in a circular or periodic DNA chain might affect the dynamics of the system, we have found a probability distribution for this number. Algorithms for such distributions are known for linear DNA sequences with fixed boundary conditions [31]. The introduction of the periodic boundary conditions we consider in our study makes the counting of alternations a much more complicated problem due to the appearance of additional rotational and reflectional symmetries. To account for the additional complexity arising from these symmetries we have implemented Pólya counting theory.
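The role of the rotational and reflectional symmetries can be illustrated by direct enumeration for very short chains. The sketch below is a brute-force check, not the Pólya-based algorithm of the paper, and its function names (`alternations`, `canonical`, `brute_force_counts`) are hypothetical. It lists every arrangement of the two bead colours, identifies arrangements related by a rotation or a reflection, and counts the resulting classes by their number of alternations. Its cost grows combinatorially with the chain length, so it is only usable for small chains.

```python
from collections import Counter
from itertools import combinations


def alternations(chain):
    """Number of colour changes between neighbours in a periodic chain."""
    n = len(chain)
    return sum(chain[i] != chain[(i + 1) % n] for i in range(n))


def canonical(chain):
    """Lexicographically smallest image of `chain` under rotations and reflections."""
    n = len(chain)
    images = []
    for seq in (chain, chain[::-1]):
        for shift in range(n):
            images.append(seq[shift:] + seq[:shift])
    return min(images)


def brute_force_counts(n_black, n_white):
    """Map each alternation count to the number of distinct chains up to dihedral symmetry."""
    total = n_black + n_white
    classes = set()
    for black_sites in combinations(range(total), n_black):
        sites = set(black_sites)
        chain = tuple(0 if i in sites else 1 for i in range(total))
        classes.add(canonical(chain))
    return Counter(alternations(c) for c in classes)


# Three black and three white beads: one class each with 2, 4 and 6 alternations.
print(sorted(brute_force_counts(3, 3).items()))  # prints [(2, 1), (4, 1), (6, 1)]
```

For three black and three white beads the 20 arrangements collapse into three classes, represented by 000111, 001011 and 010101. Counts of this kind, for a given number of alternations and given numbers of beads of each colour, are what the Pólya-based algorithm obtains without enumeration.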
In particular, extending Pólya's Enumeration Theorem for a partition-preserving group action on a partitioned set, we have constructed a well-defined algorithm for calculating the number of DNA chains having a given number of alternations for particular values of the numbers of AT ($N_{AT}$) and GC ($N_{GC}$) base pairs.

The obtained theoretical results were compared with numerically constructed pdfs through MC simulations. We found that, in general, a few thousand random DNA chains (around 5000) created by MC simulations approximate the theoretical pdf of $\alpha$ quite accurately. This means that a statistical analysis of these DNA chains will suffice to uncover the potential influence of heterogeneity on the dynamic behavior of the considered DNA model. In addition, approximating the obtained pdfs by Gaussians, we investigated the effect of the numbers of the two base pairs, as well as of their ratio, on various characteristics of the pdfs, like their mean value, their standard deviation and their maximum.

APPENDIX

Here we present a Python computer code implementing the algorithm of Sect. 3. The function necklace_count(n, B, W) returns the total number of possible necklaces under the symmetry constraints with 2n alternations, B black beads and W white beads.

```python
from math import gcd


# Compute binomial coefficients in linear time.
def binomial(n, k):
    if k > n or k < 0:
        return 0
    if k == 0:
        return 1
    if k > n // 2:
        return binomial(n, n - k)
    return (n * binomial(n - 1, k - 1)) // k


# Compute the Euler totient function \phi(n), which
# gives the number of integers 0 < d <= n that are
# relatively prime to n.
def totient(n):
    count = 0
    for d in range(1, n + 1):
        if gcd(d, n) == 1:
            count += 1
    return count


# Get the x^r coefficient of our weight generating functions f(x^m)^n,
# where:
#     f(x) = x + x^2 + x^3 + ...
def weight_gf(r, m, n):
    if n == 0:
        if r == 0:
            return 1
        return 0
    if r % m != 0:
        return 0
    if (r // m) < n:
        return 0
    return binomial((r // m) - 1, n - 1)


# Get the x^r coefficient of a binary product of weight generating
# functions f(x^m1)^n1 * f(x^m2)^n2, where:
#     f(x) = x + x^2 + x^3 + ...
def binary_weight_gf(r, m1, n1, m2, n2):
    total = 0
    # The convolution runs over i = 0, ..., r inclusive, so that the
    # constant term of a factor with exponent 0 (n1 == 0 or n2 == 0,
    # which occurs for n = 1 and n = 2) is taken into account.
    for i in range(0, r + 1):
        total += weight_gf(i, m1, n1) * weight_gf(r - i, m2, n2)
    return total


# Compute the number of necklaces up to dihedral symmetry with
# 2n alternations, B black beads and W white beads (n >= 1).
def necklace_count(n, B, W):
    # First we count the contributions from the cyclic part
    # of the cycle index.
    count = 0
    for d in range(1, n + 1):
        if n % d != 0:
            continue
        count += totient(d) * weight_gf(B, d, n // d) * weight_gf(W, d, n // d)

    # Next we count the contributions from the dihedral part
    # of the cycle index.
    if n % 2 == 0:
        count += (weight_gf(B, 2, n // 2)
                  * binary_weight_gf(W, 1, 2, 2, (n - 2) // 2)
                  * (n // 2))
        count += (weight_gf(W, 2, n // 2)
                  * binary_weight_gf(B, 1, 2, 2, (n - 2) // 2)
                  * (n // 2))
    else:
        count += (binary_weight_gf(B, 1, 1, 2, (n - 1) // 2)
                  * binary_weight_gf(W, 1, 1, 2, (n - 1) // 2)
                  * n)
    return count // (2 * n)
```

Acknowledgements

M.H. and G.P-J. acknowledge financial assistance from the National Research Foundation (NRF) of South Africa towards this research. G.K. and Ch.S. were supported by the Erasmus+/International Credit Mobility KA107 program. Ch.S. acknowledges support by the NRF of South Africa (IFRR and CPRR Programmes), the UCT (URC Conference Travel Grant) and thanks Hans-Peter Kunzi for useful discussions.

REFERENCES

1. Alberts, B., Bray, D., Hopkin, K., Johnson, A., Lewis, J., Raff, M., Roberts, K., Walter, P., Essential Cell Biology, 2nd ed., Garland Science, 2004.
2. Alexandrov, B.S., Gelev, V., Monisova, Y., Alexandrov, L.B., Bishop, A.R., Rasmussen, K.Ø., Usheva, A., Nucleic Acids Res. 37, 2405 (2009).
3. Alexandrov, B.S., Gelev, V., Yoo, S.W., Bishop, A.R., Rasmussen, K.Ø., Usheva, A., PLoS Comput. Biol. 5, e1000313 (2009).
4. Alexandrov, A.S., Gelev, V., Yoo, S.W., Alexandrov, L.B., Fukuyo, Y., Bishop, A.R., Rasmussen, K.Ø., Usheva, A., Nucleic Acids Res. 38, 1790 (2010).
5. Apostolaki, A., Kalosakas, G., Phys. Biol. 8, 026006 (2011).
6. Ares, S., Voulgarakis, N.K., Rasmussen, K.Ø., Bishop, A.R., Phys. Rev. Lett. 94, 035504 (2005).
7. Ares, S., Kalosakas, G., Nano Lett. 7, 307 (2007).
8. Brualdi, R.A., Pólya Counting. In: Introductory Combinatorics, 5th ed., Upper Saddle River, NJ: Prentice Hall, 2010.
9. Burnside, W., Theory of Groups of Finite Order, Cambridge: Cambridge University Press, 1897.
10. Chetverikov, A.P., Ebeling, W., Lakhno, V.D., Shigaev, A.S., Velarde, M.G., Eur. Phys. J. B 89, 101 (2016).
11. Choi, C.H., Kalosakas, G., Rasmussen, K.Ø., Hiromura, M., Bishop, A.R., Usheva, A., Nucleic Acids Res. 32, 1584 (2004).
12. Choi, C.H., Rapti, Z., Gelev, V., Hacker, M.R., Alexandrov, B.S., Park, E.J., Park, J.S., Horikoshi, N., Smerzi, A., Rasmussen, K.Ø., Bishop, A.R., Usheva, A., Biophys. J. 95, 597 (2008).
13. Dauxois, T., Peyrard, M., Bishop, A.M., Phys. Rev. E 47, 684 (1993).
14. Hennig, D., Eur. Phys. J. B 30, 211 (2002).
15. Herstein, I.N., Abstract Algebra, 3rd ed., Wiley, 1999.
16. Huang, H.-H., Lindblad, P., J. Biol. Eng. 7, 10 (2013).
17. Kalosakas, G., Phys. Rev. E 84, 051905 (2011).
18. Kalosakas, G., Ares, S., J. Chem. Phys. 130, 235104 (2009).
19. Kalosakas, G., Ngai, K.L., Flach, S., Phys. Rev. E 71, 061901 (2005).
20. Kalosakas, G., Rasmussen, K.Ø., Bishop, A.R., Choi, C.H., Usheva, A., Europhys. Lett. 68, 127 (2004).
21. Kalosakas, G., Rasmussen, K.Ø., Bishop, A.R., Chem. Phys. Lett. 432, 291 (2006).
22. Kolpakov, R., Bana, G., Kucherov, G., Nuc. Ac. Res. 31, 3672 (2003).
23. Lewin, B., Genes VIII, Pearson Prentice Hall, 2004.
24. Li, W., Computers Chem. 21, 257 (1997).
25. van Lint, J.H., Wilson, R.M., Pólya theory of counting. In: A Course in Combinatorics, Cambridge: Cambridge University Press, 1992.
26. Nowak-Lovato, K., Alexandrov, L.B., Banisadr, A., Bauer, A.L., Bishop, A.R., Usheva, A., Mu, F., Hong-Geller, E., Rasmussen, K.Ø., Hlavacek, W.S., Alexandrov, B.S., PLoS Comput. Biol. 9, e1002881 (2013).
27. Peyrard, M., Nonlinearity 17, R1 (2004).
28. Peyrard, M., Fargo, J., Physica A 288, 199 (2000).
29. Pólya, G., Read, R.C., Chemical Compounds. In: Combinatorial Enumeration of Groups, Graphs, and Chemical Compounds, New York: Springer, 1987.
30. Régnier, M., Disc. App. Math. 104, 259 (2000).
31. Robin, S., Daudin, J.J., Journ. Appl. Prob. 36, 179 (1999).
32. Robin, S., Schbath, S., Journ. Comp. Biol. 8, 349 (2001).
33. Tabi, C.B., Dang Koko, A., Oumarou Doko, R., Ekobena Fouda, H.P., Kofane, T.C., Physica A 442, 498 (2016).
34. Schbath, S., ESAIM: Probability and Statistics 1, 1 (1995).
35. Schbath, S., Prum, B., de Turckheim, E., Journ. Comp. Biol. 2, 417 (1995).
36. Skokos, Ch., Hillebrand, M., Schwellnus, A., Kalosakas, G., in preparation (2018).
37. Tapia-Rojo, R., Mazo, J.J., Falo, F., Phys. Rev. E 82, 031916 (2010).
38. Tapia-Rojo, R., Mazo, J.J., Hernandez, J.A., Peleato, M.L., Fillat, M.F., Falo, F., PLoS Comput. Biol. 10, e1003835 (2014).
39. Theodorakopoulos, N., Phys. Rev. E 77, 031919 (2008).
40. Voulgarakis, N.K., Kalosakas, G., Rasmussen, K.Ø., Bishop, A.R., Nano Lett. 4, 629 (2004).
41. Yakushevich, L.V., Nonlinear Physics of DNA, 2nd ed., Wiley-VCH, 2004.
42. Zariski, O., Samuel, P., Polynomial and Power Series Rings. In: Commutative Algebra, Graduate Texts in Mathematics, vol. 29, Berlin: Springer, 1960.
43. Zoli, M., J. Phys.: Condens. Matter 24, 195103 (2012).
44. Zoli, M., J. Theor. Biol. 354, 95 (2014).

Journal

Statistics – arXiv (Cornell University)

Published: May 16, 2018
