GAN-Based Training of Semi-Interpretable Generators for Biological Data Interpolation and Augmentation

Anastasios Tsourtis *, Georgios Papoutsoglou and Yannis Pantazis

Institute of Applied and Computational Mathematics, Foundation of Research and Technology Hellas, GR 700 13 Heraklion, Greece; papoutsoglou@iacm.forth.gr (G.P.); pantazis@iacm.forth.gr (Y.P.)
* Correspondence: tsourtis@iacm.forth.gr

Appl. Sci. 2022, 12, 5434. https://doi.org/10.3390/app12115434

Abstract: Single-cell measurements incorporate invaluable information regarding the state of each cell and its underlying regulatory mechanisms. The popularity and use of single-cell measurements are constantly growing. Despite the typically large number of collected data, the under-representation of important cell (sub-)populations negatively affects downstream analysis and its robustness. Therefore, the enrichment of biological datasets with samples that belong to a rare state or manifold is overall advantageous. In this work, we train families of generative models via the minimization of Rényi divergence, resulting in an adversarial training framework. Apart from the standard neural network-based models, we propose families of semi-interpretable generative models. The proposed models are further tailored to generate realistic gene expression measurements, whose characteristics include zero-inflation and sparsity, without the need for any data pre-processing. Explicit factors of the data such as measurement time, state or cluster are taken into account by our generative models as conditional variables. We train the proposed conditional models, compare them against the state-of-the-art on a range of synthetic and real datasets, and demonstrate their ability to accurately perform data interpolation and augmentation.

Keywords: single-cell RNA-seq data generation; data interpolation and augmentation; Gaussian mixture model; zero-inflated random variables; generative adversarial networks; Rényi divergence minimization

1. Introduction

A breakthrough in creating powerful generative models has been the invention of Generative Adversarial Networks (GANs) [1]. In principle, GANs are capable of modelling any high-dimensional data distribution through the training of two adversarial neural networks. The networks are trained together in a competing manner: one network generates data whose distribution approximates the distribution of the real data; hence, it is called the generator, while the other neural network evaluates whether the generated data are real or not; hence, it is called the discriminator. GANs have proven very successful in generating photo-realistic images and audio/speech samples. Applications of GANs include (conditional) image generation [2–8], image-to-image translation [9] and image super-resolution [10]. In time series, GANs have been used for speech enhancement [11] and speech synthesis [12,13], as well as for natural language processing [14,15], among other types of raw data.
More recently, they have also started to show their usefulness in life science applications [16].

GANs often face training difficulties when applied to single-cell data, whether gene expression or proteomic data. The main reason is that cells constantly shift from one state or type to another, either naturally, as in an immune response, or artificially, as after a drug treatment, while the current sequencing technology allows only sparse observations scattered along the evolution of the samples [17]. For example, single-cell RNA sequencing (scRNAseq) is able to capture the diversity of cells during cell differentiation or activation by profiling transcript abundances at high resolution. However, it does so by creating snapshot observations where the temporal resolution is low or totally unknown. Consequently, some cell states may be missing entirely or only a low number of samples are available from those states [18]. On top of that, the collected data include a huge amount of zero values (i.e., element-wise zero-inflated samples) that cause established analysis methods to break down [19].

In scRNAseq modeling, representation and generation, GANs and, more generally, generative models have been utilized for imputing missing values in the data, for dimensionality reduction, for mapping data across different measurement types (e.g., domain adaptation), for data augmentation through the approximation of the original data distribution and for modeling the response of differentiating cells upon external perturbation. In particular, an early work on GANs with scRNAseq data addressed the identification of state-determining genes and clustering tasks [20]. Marouf et al. [21] developed a conditional GAN for data augmentation of single-cell data sets, demonstrating realistic in-silico generation of low-size cell types. In [22], the authors took into account the zero-inflation of the data and proposed a variational autoencoder for clustering purposes. In a different approach to mitigating the excess amount of zeros in the expression matrix, the authors in [23,24] performed data imputation. In [23], particularly, the authors converted scRNAseq data to images and then trained GANs to perform dropout imputation and related downstream analysis.

Here, we build on the existing literature and further present:

1. Semi-interpretable generators that produce synthetic data which follow the statistical properties of biological data and, particularly, the mixed nature of the distributions of single-cell RNA-seq expression data.
2. A novel, GAN-based training approach for a mixture model of zero-inflated multi-dimensional variables. We utilize Rényi divergence minimization for the first time on biological data to estimate the parameters. Overall, the proposed approach is more flexible since it does not require optimization over an (assumed) likelihood function.
3. Comparisons between neural network generators (not interpretable) and the proposed semi-interpretable generators for data interpolation and data augmentation tasks.
Although both variants produce accurate results, we show that there is a trade-off between the sample quality of the generated data and interpretability. We also train the generative model proposed in [21] and compare its results with ours in terms of quality and complexity.

The rest of the paper is structured as follows. In Section 2, a brief introduction to GAN training and Rényi divergence is provided, while Section 3 presents the generators used to generate (potentially) zero-inflated variables. Section 4 describes how the experiments were set up and how the results were evaluated. In Section 5, we demonstrate the performance of the proposed models on data interpolation and data augmentation using simulated as well as real biological datasets. Section 6 concludes the paper.

2. Generative Adversarial Networks

Originally, Generative Adversarial Networks (GANs) were introduced as a zero-sum game between two competing neural networks with high learning capacity [1]. The first neural network is the generator, which takes random noise as input and transforms it into a realistic sample, while the second neural network, called the discriminator, aims to detect the differences between the real and the synthetic samples. A different but more instructive presentation of the GAN framework is to consider GANs as a minimization problem between two probability distributions. Mathematically, letting $p_r$ denote the distribution of the real samples and $p_g$ the distribution of the synthetic samples, the goal is to minimize

$$\min_G D(p_r \,\|\, p_g), \tag{1}$$

where $G$ is the generator and $D(\cdot\|\cdot)$ is a divergence (i.e., a pseudo-distance) or a distance between two distributions. Vanilla GAN [1] minimizes the Jensen–Shannon divergence while f-GAN [25] minimizes the f-divergence. A significantly more stable GAN is obtained via Wasserstein distance minimization, giving rise to WGAN [26,27]. From the divergence minimization perspective, the utilization of the discriminator seems redundant. Unfortunately, the distance or divergence $D(\cdot\|\cdot)$ in (1) is not computable because explicit expressions for $p_r$ and $p_g$ are not available in the majority of applications. However, it can become tractable using so-called variational representation (or duality) formulas. Variational formulas essentially transform the problem of estimating the value of a divergence into a maximization problem over a function space. The function space is further approximated by parametric families of models. Neural networks are the typical choice of approximation models. Please note that these parametric models constitute the analog of the discriminator. In the following, we propose to use the Rényi divergence for training the generator and demonstrate how it can become tractable via the Rényi–Donsker–Varadhan variational formula.

2.1. Rényi Divergence Minimization and Rényi GAN

Assuming mutual absolute continuity of the distributions, the Rényi divergence of $p_r$ with respect to $p_g$ is defined by

$$R_\alpha(p_r \,\|\, p_g) := \frac{1}{\alpha(\alpha-1)} \log \mathbb{E}_{p_g}\!\left[\left(\frac{dp_r}{dp_g}\right)^{\!\alpha}\right], \tag{2}$$

where $\alpha \in \mathbb{R}\setminus\{0, 1\}$ is the order parameter. The definition can be extended to 0 and 1 by continuity, leading to the reverse KL and the KL divergence, respectively. The order $\alpha$ effectively controls the tail characteristics of the two distributions.
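For intuition, when both densities are available in closed form the expectation in (2) can be estimated directly by Monte Carlo. The snippet below is a minimal, illustrative numpy sketch (not part of our training pipeline; all function names are ours) that estimates $R_\alpha$ between two one-dimensional Gaussians under the $\tfrac{1}{\alpha(\alpha-1)}$ convention used above.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2); used to form the likelihood ratio dp_r / dp_g.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def renyi_divergence_mc(alpha, mu_r, s_r, mu_g, s_g, n=200_000, seed=0):
    """Monte Carlo estimate of R_alpha(p_r || p_g) as in Eq. (2):
    (1 / (alpha * (alpha - 1))) * log E_{p_g}[(dp_r / dp_g)^alpha]."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu_g, s_g, size=n)                       # samples from p_g
    ratio = gauss_pdf(x, mu_r, s_r) / gauss_pdf(x, mu_g, s_g)
    return np.log(np.mean(ratio ** alpha)) / (alpha * (alpha - 1.0))

# The order alpha re-weights the tails of the likelihood ratio.
for a in (0.5, 0.9, 1.5):
    print(a, renyi_divergence_mc(a, mu_r=0.0, s_r=1.0, mu_g=1.0, s_g=1.5))
```

In the GAN setting, of course, neither density is available in closed form, which is precisely why the variational representation discussed next is needed.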
A variational representation formula is essentially an expression between expected values that forms a lower bound for the divergence. As already mentioned, the use of variational formulas transforms the direct divergence estimation problem into an optimization problem over functions. The variational representation of the Rényi divergence is given by ([28], Theorem 1)

$$R_\alpha(p_r \,\|\, p_g) = \sup_{D \in \mathcal{M}_b(\Omega)} \left\{ \frac{1}{\alpha-1}\log \mathbb{E}_{p_r}\!\left[e^{(\alpha-1)D}\right] - \frac{1}{\alpha}\log \mathbb{E}_{p_g}\!\left[e^{\alpha D}\right] \right\}, \tag{3}$$

where $\mathcal{M}_b(\Omega)$ is the space of all measurable and bounded functions from $\Omega$ to $\mathbb{R}$. Combining (1) with (3) and introducing an additional generalization, the minimax optimization problem is formulated as

$$\min_G \max_{D \in \Gamma} \left\{ \frac{1}{\alpha-1}\log \mathbb{E}_{x\sim p_r}\!\left[e^{(\alpha-1)D(x)}\right] - \frac{1}{\alpha}\log \mathbb{E}_{z\sim p_z}\!\left[e^{\alpha D(G(z))}\right] \right\}, \tag{4}$$

where $\Gamma$ is the function space for the discriminator. When $\Gamma$ is chosen to be the function space of all measurable and bounded functions, the Rényi divergence of $p_r$ with respect to $p_g$ is minimized. However, there are several other options that can be selected and are often preferred. Indeed, a widely used function space is the space of Lipschitz continuous functions, which has been shown to improve the stability of training in GANs [27,29]. Another choice is a reproducing kernel Hilbert space (RKHS), which has also been utilized in GANs [30]. It has recently been shown in [31] that rich-enough function spaces for the discriminator can be used and still preserve the divergence minimization perspective.

2.2. Conditional Rényi GAN

Often, we are interested in generating a conditional distribution. Letting $y$ denote the condition, the optimization problem in (4) is extended to

$$\min_G \max_{D \in \Gamma} \left\{ \frac{1}{\alpha-1}\log \mathbb{E}_{x\sim p_r}\!\left[e^{(\alpha-1)D(x,y)}\right] - \frac{1}{\alpha}\log \mathbb{E}_{z\sim p_z}\!\left[e^{\alpha D(G(z,y),\,y)}\right] \right\}, \tag{5}$$

where both the generator and the discriminator take the conditional variable $y$ as an additional input.

The condition variable $y$ can be either discrete (e.g., categorical) or continuous, and in both cases its vector embedding, also denoted by $y$, lies in a higher $d_y$-dimensional space. In the discrete case, one-hot encoding can be used and $d_y$ equals the number of categories in $y$, e.g., $y = (0, 1, 0, \ldots, 0)$ for the 2nd category. In the one-dimensional continuous case, the condition is represented as a linear combination of two vectors in the $d_y$-dimensional Euclidean space. Mathematically, $y = \bar{y}_n e_1 + (1 - \bar{y}_n) e_2$, where $\bar{y}_n$ is the unity-based normalization (a.k.a. min–max feature scaling) of $y$.

3. Semi-Interpretable Generators

The typical parametric models used for the generator are various types of neural networks. Despite their expressivity, neural networks are difficult to interpret; therefore, we aim to model the data distribution with the simpler but more interpretable Gaussian Mixture Model (GMM) [32]. Instead of following the standard Expectation–Maximization algorithm [33,34] for the estimation of the GMM's parameters, we propose to use the likelihood-free divergence minimization framework and apply a reparameterization trick for the GMM in order to differentiate and be able to back-propagate the gradients of the parameters. We first revisit the reparameterization trick for the Gaussian distribution and then extend it to the GMM as well as to the conditional GMM.

3.1. Gaussian Case

Assuming a Gaussian generator with parameters $\theta = \{\mu, \Sigma\}$, the reparameterization trick simply states that a Gaussian sample can be obtained as an affine transformation of a standard $d$-dimensional Gaussian distribution $\mathcal{N}(0, I_d)$. As in Variational Auto-Encoders [35], a Gaussian sample is generated from the affine transformation $x = \mu + Lz$ with $z \sim \mathcal{N}(0, I_d)$ and $L$ satisfying $LL^T = \Sigma$. In other words, the generator becomes a linear transformation of the input noise vector $z$.
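A minimal PyTorch sketch of this reparameterized Gaussian generator is given below (class and variable names are ours and the actual implementation may differ); because the randomness enters only through the external noise $z$, gradients with respect to $\mu$ and $L$ flow through the generated sample.

```python
import torch

class GaussianGenerator(torch.nn.Module):
    """Reparameterized Gaussian generator: x = mu + L z with z ~ N(0, I_d).
    Keeping L lower-triangular guarantees Sigma = L L^T is positive semi-definite."""
    def __init__(self, d):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(d))
        self.L_raw = torch.nn.Parameter(torch.eye(d))

    def forward(self, z):
        L = torch.tril(self.L_raw)       # enforce a lower-triangular factor
        return self.mu + z @ L.T         # affine push-forward of the noise

gen = GaussianGenerator(d=2)
z = torch.randn(128, 2)                  # batch of standard Gaussian noise
x_fake = gen(z)                          # differentiable w.r.t. mu and L
```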
3.2. GMM Case

Moving forward, we present the more expressive case of GMMs as generators. The probability density of a GMM is a weighted sum of $K$ $d$-dimensional Gaussians, given by $\sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mu_k, \Sigma_k)$, where $\pi_k > 0$ is the normalized frequency of the $k$-th Gaussian while $(\mu_k, \Sigma_k)$ are the respective mean vector and covariance matrix. A sample via the reparameterization trick for the GMM is obtained as

$$x = \sum_{k=1}^{K} \mathbf{1}_{[w_{k-1},\, w_k]}(u)\,\big(\mu_k + L_k z_k\big), \qquad u \sim \mathcal{U}(0,1),\; z_k \sim \mathcal{N}(0, I_d)\ \text{i.i.d.}, \tag{6}$$

where $\mathbf{1}_{[a,b]}(\cdot)$ denotes the indicator function of the interval $[a, b]$, while $w_k$ corresponds to the cumulative mass function given by $w_k = \sum_{j=1}^{k} \pi_j$ with $w_0 = 0$ and $w_K = 1$. Notice that the indicator function is non-differentiable because it is discontinuous at the extreme points $a$ and $b$.

The generator parameters to be learned are the probability and the parameters of the affine transformation for each Gaussian, whereas $K$ is fixed. The choice of $K$ should respect the characteristics of the data set; in particular, as a rule of thumb, $K$ could take a value close to the number of clusters in the data set. A clustering algorithm such as k-means (or Louvain clustering or a similar algorithm) could be run on the training data set prior to training in order to determine the value of $K$. Moreover, we experimentally observed that utilizing a higher value of $K$ leads to many weights $\pi_k$ being close to zero. Therefore, we typically choose a large value for $K$ without instilling instabilities in the training process. In other words, we could select $K$ iteratively within our framework by keeping only the modes that are above a certain threshold.

While the training of $\mu_k$ and $L_k$ is straightforward, the training of the probabilities is more elaborate. First, the softmax function is used at every training step to normalize the values, $\pi_k = e^{q_k} / \sum_{j=1}^{K} e^{q_j}$, where the $q_k$'s are the unconstrained trainable variables, which can be trained with standard stochastic back-propagation algorithms. Indeed, the direct training of the probabilities $\pi_k$ cannot be carried out with standard unconstrained algorithms because of the constraint requirements on $\pi_k$. Second, in order to avoid the non-differentiability of the indicator function, we approximate it via the difference of two scaled sigmoid functions with a scaling factor $c$. Mathematically, the approximation is given by

$$\mathbf{1}_{[w_{k-1},\, w_k]}(u) \approx \sigma_c(u - w_{k-1}) - \sigma_c(u - w_k) = \frac{1}{1 + e^{-c(u - w_{k-1})}} - \frac{1}{1 + e^{-c(u - w_k)}}. \tag{7}$$

Figure 1 shows both the indicator function and its approximation for two values of the scaling factor. We set the value of the scaling factor to $c = 300$, which is a reasonable compromise between the accuracy of the indicator approximation and the efficient propagation of (non-zero) gradients. We also remark that the authors in [36] followed another approach with a similar rationale to avoid the differentiation issues associated with sampling from discrete distributions.

Figure 1. Approximation of the indicator function $\mathbf{1}_{[w_{k-1}, w_k]}(u)$ by the difference of two sigmoids. Factor $c$ controls the steepness; plotted are $c = 100$ (blue), $c = 300$ (red) and $c = 500$ (yellow).
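To make Equations (6) and (7) concrete, the following numpy sketch draws samples from a reparameterized GMM using the smoothed indicator (parameter values and names are illustrative; the actual training code would use differentiable tensors so that gradients reach $\pi_k$, $\mu_k$ and $L_k$).

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def soft_indicator(u, lo, hi, c=300.0):
    # Differentiable surrogate of 1_[lo, hi](u): difference of two sigmoids, Eq. (7).
    return expit(c * (u - lo)) - expit(c * (u - hi))

def sample_gmm(pi, mus, Ls, n, c=300.0, rng=np.random.default_rng(0)):
    """Reparameterized GMM sampling, Eq. (6): a uniform u selects the mode through
    the cumulative weights w_k and the sample is mu_k + L_k z_k for that mode."""
    K, d = mus.shape
    w = np.concatenate(([0.0], np.cumsum(pi)))                   # w_0 = 0, ..., w_K = 1
    u = rng.uniform(size=n)
    z = rng.standard_normal((n, K, d))
    gauss = mus[None, :, :] + np.einsum('kij,nkj->nki', Ls, z)   # mu_k + L_k z_k
    sel = np.stack([soft_indicator(u, w[k], w[k + 1], c) for k in range(K)], axis=1)
    return (sel[:, :, None] * gauss).sum(axis=1)                 # weighted sum over modes

pi  = np.array([0.3, 0.7])                                       # mixture weights
mus = np.array([[0.0, 0.0], [5.0, 5.0]])                         # mode means
Ls  = np.stack([np.eye(2), 0.5 * np.eye(2)])                     # Cholesky-like factors
x = sample_gmm(pi, mus, Ls, n=1000)                              # ~30% / ~70% of samples per mode
```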
3.2.1. Additional Penalty Term on the Diagonal Elements of the Covariance Matrix

The training of a GMM generator often results in degenerate covariance matrices, where the diagonal elements of the covariance matrices take very small, near-zero values. In order to overcome this stability issue during training, we penalize the diagonal elements of the $\Sigma_k = L_k L_k^T$ matrices to remain positive, as in [37]. Thus, we add to the loss function (4) the penalty term

$$\lambda_\Sigma P(\Sigma) = \lambda_\Sigma \sum_{k=1}^{K} \sum_{j=1}^{d} \frac{1}{(\Sigma_k)_{jj}}, \tag{8}$$

where the coefficient $\lambda_\Sigma > 0$ controls the effect of the penalty term and $\Sigma_k$ is the covariance matrix of the $k$-th Gaussian. Upon experimentation, we find that $\lambda_\Sigma = 0.001$ produces non-degenerate covariance matrices in several synthetic examples. Supplementary Figure S3 presents additional results for a series of values of the multiplicative factor $\lambda_\Sigma$.

3.3. Conditional GMM Case

When we want to construct the conditional variant of Equation (6) over $y$, we expand the parameter space as follows. Let $q = [q_1, \ldots, q_K]$ be the unconstrained probability vector; then we extend it as a linear function of the conditional vector, $q = y W_q + b_q$, where $W_q \in \mathbb{R}^{d_y \times K}$ and $b_q \in \mathbb{R}^{K}$ are the new parameters to be learned. Additionally, we incorporate the condition $y$ in each mean vector following the same rationale: $\mu_k = y W_{\mu,k} + b_{\mu,k}$, where $W_{\mu,k} \in \mathbb{R}^{d_y \times d}$ and $b_{\mu,k} \in \mathbb{R}^{d}$. Thus, the weight as well as the mean of each Gaussian might change with the condition. This is quite common in biological datasets and especially in the differentiation process, where sub-populations of cells evolve in size and number with time, which acts as the condition variable. Finally, in our conditional generator, we choose the covariance matrices $\Sigma_k$ to be independent of the condition variable $y$.

3.4. Generative Models for Zero-Inflated Variables

There is an important class of biological data whose values are both discrete and continuous and can be represented with mixed random variables. A particular example of interest is the so-called zero-inflated variables, which have a non-zero mass at the value zero and a density distribution for the typically positive values. A zero-inflated variable can simultaneously model the answer to a Yes/No question and the value of the quantity when the answer is Yes. In gene expression studies and RNA-seq data, zero-inflated variables are unequivocal since a gene might or might not be activated and, when activated, the level of activity is measured. The generation of zero-inflated variables requires an extra modeling step which takes into account the mixed nature of these variables, as they constitute a mixture of a continuous distribution and a point (Dirac) mass at zero. Next, we present how to incorporate zero-inflated variables both in the neural network setting and by extending the GMM-based generator.

3.4.1. Neural Networks: Gated Activation

In the case of feedforward neural networks (FNN), we propose to use gated linear units (GLU) [38] as activation functions in the output layer of the generator. A GLU is defined as an element-by-element multiplication between the output of a linear layer and another "parallel" layer with sigmoid activation which acts as a mask (or gate). The information of the linear layer passes when the gate is open (i.e., when the sigmoid activation is one), while the output is zero when the gate is closed (i.e., when the sigmoid activation is zero). The architecture of the generator with a GLU output layer is shown in Figure 2. The conditional variant of the FNN generator is similar to [2], where we augment each sample with its respective embedded condition $y$. In the following we refer to this variant as cFNN.

Figure 2. Schematic of how gating is implemented in the generator architecture of the feed-forward neural network (FNN) conditioned on $y$. Gating is used to model zero-inflated samples. The output layer is calculated as $G(z, y) = (W h + b) \odot \sigma(W' h + b')$, where $\odot$ stands for element-wise multiplication.
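The gated output of Figure 2 amounts to a few lines of PyTorch. In the sketch below, the layer widths follow the values mentioned later in the paper (hidden size 32, noise dimension 18, a 10-dimensional condition embedding and 50 output genes), but the class and variable names are ours and the exact cFNN architecture may differ.

```python
import torch
import torch.nn as nn

class GatedOutput(nn.Module):
    """Gated linear unit output: G(h) = (W h + b) * sigmoid(W' h + b').
    When the sigmoid gate saturates at 0 the corresponding gene is zeroed out,
    which lets the generator emit zero-inflated values."""
    def __init__(self, hidden_dim, data_dim):
        super().__init__()
        self.value = nn.Linear(hidden_dim, data_dim)   # W h + b
        self.gate = nn.Linear(hidden_dim, data_dim)    # W' h + b'

    def forward(self, h):
        return self.value(h) * torch.sigmoid(self.gate(h))

# Illustrative cFNN-style generator: noise z concatenated with the embedded condition y.
generator = nn.Sequential(
    nn.Linear(18 + 10, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    GatedOutput(32, 50),
)
z, y_emb = torch.randn(64, 18), torch.randn(64, 10)
x_fake = generator(torch.cat([z, y_emb], dim=1))
```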
3.4.2. The Reparameterization Trick for ZIMMs

As already mentioned, the probability density function of a zero-inflated variable can be represented as a weighted sum of a Dirac mass at zero and a continuous component. Thus, a Bernoulli random variable can determine which of the two components will be selected for the generation of one sample. Using the same trick as for the mode selection in GMMs, the extension of Equation (6) can be written as an extra combination of a Gaussian sample $z_k$ and a Bernoulli distribution with probability $a_k$. We term the model in Equation (9) a Zero-Inflated Mixture Model (ZIMM):

$$x = \sum_{k=1}^{K} \mathbf{1}_{[w_{k-1},\, w_k]}(u)\,\Big( \mathbf{1}_{[0,\, a_k]}(u')\cdot 0 + \mathbf{1}_{[a_k,\, 1]}(u')\,\big(\mu_k + L_k z_k\big) \Big), \tag{9}$$
$$z_k \sim \mathcal{N}(0, I_d), \quad u \sim \mathcal{U}(0,1), \quad u' \sim \mathcal{U}(0,1)^d,$$

where each $a_k$ is a $d$-dimensional vector denoting the Bernoulli probabilities for each dimension of the data and the inner indicators act element-wise. As in the GMM case, we approximate both indicator functions with the difference of two sigmoid functions as in (7). In total, there are $K \cdot d$ more parameters to be learned in the ZIMM relative to the GMM, corresponding to the Bernoulli probabilities for each dimension and each mode. We also implement a conditional version of the ZIMM where the trainable variables for the weight probabilities and the mean vector of each Gaussian are conditioned while the remaining parameters are not.
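For illustration, the ZIMM of Equation (9) can be sampled with exact (non-smoothed) indicators as in the numpy sketch below; during training the indicators are replaced by the sigmoid relaxation of Equation (7), and all names and parameter values here are ours.

```python
import numpy as np

def sample_zimm(pi, alphas, mus, Ls, n, rng=np.random.default_rng(0)):
    """Eq. (9): a uniform u picks mode k via the cumulative weights, then a
    per-dimension Bernoulli gate (zero with probability alpha_k) masks the
    Gaussian component mu_k + L_k z_k."""
    K, d = mus.shape
    w = np.concatenate(([0.0], np.cumsum(pi)))
    u = rng.uniform(size=n)
    u2 = rng.uniform(size=(n, d))                      # drives the zero-inflation gate
    z = rng.standard_normal((n, d))
    x = np.zeros((n, d))
    for k in range(K):
        mask = (w[k] <= u) & (u < w[k + 1])            # indicator 1_[w_{k-1}, w_k](u)
        gauss = mus[k] + z[mask] @ Ls[k].T
        gate = (u2[mask] >= alphas[k]).astype(float)   # 0 with prob alpha_k, else 1
        x[mask] = gate * gauss
    return x

pi     = np.array([0.5, 0.5])
alphas = np.array([[0.8, 0.1], [0.2, 0.6]])            # per-mode, per-dimension zero probabilities
mus    = np.array([[2.0, 3.0], [6.0, 1.0]])
Ls     = np.stack([0.5 * np.eye(2), np.eye(2)])
x = sample_zimm(pi, alphas, mus, Ls, n=1000)
```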
4. Experimental Setup

All experiments were performed using the (conditional) Rényi GAN framework with a discriminator of two hidden layers. The FNN-based generator has two hidden layers as well, whereas the ZIMM-based generator is designed as described in Section 3.4.2. We performed a small grid search over the hyper-parameters (number of nodes, learning rate and batch size), selecting the values that minimized the adversarial loss, complemented by qualitative data visualization, a common practice for training deep generative models. Based on this grid search, we chose the number of nodes to be 32 for both hidden layers in all of the experiments described in Section 5. The ReLU activation function is used in all hidden layers of the discriminator and the generator. For the output layer, tanh is used for the discriminator and a linear activation for the generator. The constraining of the optimization to the space of Lipschitz-1 functions was carried out by a one-sided gradient penalty [27] with constant $\lambda_{gp} = 1$; when Lipschitz-1 penalization is used, the activation of the discriminator's output layer is linear. We used batches of size in the range 512–1024, depending on the dimension of the data, and the ADAM optimizer [39] with a learning rate in the range 0.0005–0.001. The number of training steps varied between the experiments, as we observed that convergence was slower for the semi-interpretable generator. Depending on the initialization of the network parameters, convergence of the adversarial loss is attained within 100,000–200,000 minibatch steps for the FNN variant and roughly twice that for the ZIMM-based generator, provided that Lipschitz-1 constraining is used; otherwise, more training steps are required. Unless otherwise stated, the order of the Rényi divergence is set to $\alpha = 0.5$, which corresponds to a one-to-one relation with the Hellinger distance. In all cases, the discriminator is updated 5 times, followed by one update of the generator.

Evaluation Criteria for Assessing Generated Data Quality

We employ three criteria to assess the quality of the generated data $G(z) \sim p_g$ with respect to test data from the real distribution $p_r$: maximum mean discrepancy (MMD) [40], principal component analysis (PCA) and marginal distribution histograms. We use the Information Theoretical Estimators Toolbox [41] for computing the MMD between the test and generated distributions. The reference MMD is computed from test data as the average over fifty randomly-sampled, equally-sized datasets, all conditioned on the same subset of $y$ values. Every generated or real data set consists of 1000 data points for a fair comparison. The radial basis function with $\sigma = 5.0$ is chosen as the kernel; this value is determined by the average variance of the training data and accounts for better sensitivity to small differences between the distributions under comparison. Both the U-statistic and V-statistic implementations of MMD yield very similar values, and we chose the first estimate. Scikit-learn 1.0.1 is used for PCA.

5. Results and Discussion

5.1. Nonlinear Trajectory Interpolation

We begin by demonstrating the efficacy of the proposed Rényi GAN training methodology to learn the mapping and then interpolate between points from a population whose distribution follows a spiral pattern, a two-dimensional variant of the Swiss roll dataset:

$$(x_1, x_2) = \big(y \cos(y),\; y \sin(y)\big) + \varepsilon, \qquad \text{with } y = \frac{3\pi}{2}(1 + 2u), \tag{10}$$
$$u \sim \mathcal{U}(0,1) \quad \text{and} \quad \varepsilon \sim \mathcal{N}(0,\, e\, I_2),$$

where $y$ acts as the condition variable and $e = 0.2$. As explained in Section 2, we construct the embedding vector of the conditional variable $y$ to span a 10-dimensional interval with boundaries $e_1 = (1, \ldots, 1)$ and $e_2 = -e_1$. Moreover, we set the generator's noise input $z$ to be a 2-dimensional Gaussian vector. We found that the number of training steps required by the FNN generator to converge was half of that required by the GMM generator (Supplementary Figure S4). To speed up convergence in both cases, we used the gradient penalty during optimization, thus restricting the function space of the discriminator to Lipschitz continuous functions. We further apply the covariance penalty of Section 3.2.1 for the GMM generator. We also initialize the parameters of the GMM generator using the k-means algorithm, which provides a reasonable starting point and improves both the speed and the stability of the training process.

Moving forward, it is well known that different divergences (here, different values of the Rényi order $\alpha$) result in fundamentally different behavior during training, thus affecting the obtained solution [42,43]. For instance, KLD minimization (i.e., $\alpha = 1$) tends to produce a distribution that covers all the modes, while reverse KLD (i.e., $\alpha = 0$) tends to produce a distribution that often contains only a subset of the modes. For the spiral-shaped distribution, we experimentally verified that reverse KLD is a more suitable choice for the loss function. Indeed, reverse KLD favors the concentration of the mass towards the mean instead of covering the tails (see Supplementary Figure S1).
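A small numpy sketch of the spiral data of Equation (10) and of the 10-dimensional condition embedding described above is given below (we read $\varepsilon \sim \mathcal{N}(0, e I_2)$ as noise with covariance $e I_2$; variable names are ours).

```python
import numpy as np

def spiral_data(n, eps=0.2, rng=np.random.default_rng(0)):
    """Two-dimensional spiral of Eq. (10); the angle-like variable acts as the condition."""
    u = rng.uniform(size=n)
    y_cond = 1.5 * np.pi * (1.0 + 2.0 * u)                       # y = (3*pi/2)(1 + 2u)
    x = np.stack([y_cond * np.cos(y_cond), y_cond * np.sin(y_cond)], axis=1)
    x += rng.normal(scale=np.sqrt(eps), size=(n, 2))             # covariance e * I_2
    return x, y_cond

def embed_condition(y_cond, d_y=10):
    """Continuous-condition embedding: y = ybar * e1 + (1 - ybar) * e2 with e2 = -e1."""
    ybar = (y_cond - y_cond.min()) / (y_cond.max() - y_cond.min())  # min-max scaling to [0, 1]
    e1 = np.ones(d_y)
    e2 = -e1
    return ybar[:, None] * e1 + (1.0 - ybar)[:, None] * e2

x_train, y_cond = spiral_data(10_000)
y_emb = embed_condition(y_cond)          # 10-dimensional embedding fed to G and D
```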
Figure 3 demonstrates the generative capabilities of the proposed methodology in two cases: (A) when training data exist for the full range of values of the conditioning variable and (B) when no training data exist for two distant intervals of the conditioning variable. In particular, the left panel of Figure 3A shows training samples with the condition $y$ encoded by the coloring of the samples and scaled to the interval [0, 1]. Accordingly, the left panel of Figure 3B shows the training samples for $y \in [0, 0.25] \cup [0.3, 0.6] \cup [0.65, 1]$. The middle and right panels of Figure 3 show the generated data after training using the conditional FNN and the conditional GMM generator, respectively.

Figure 3. Real and generated samples of a 2-dimensional dataset of size N = 10,000 following a spiral-shaped distribution. Color indicates the value of the condition variable, $y \in [0, 1]$. (A) Training data (left) where $y$ belongs to the whole interval [0, 1], along with generated data using the conditional FNN (middle) and the conditional GMM (right) with K = 30 modes. (B) Training data (left) where $y$ belongs to $[0, 0.25] \cup [0.3, 0.6] \cup [0.65, 1]$. Generated data using the conditional FNN (middle) and the conditional GMM (right).

The comparison between Figure 3A,B reveals that training on the complete data set is more accurate and also faster (see Supplementary Figure S4). Furthermore, there is a trade-off between accuracy and interpretability, as in both cases the reconstruction achieved by the conditional FNN variant is relatively smoother than that of the conditional GMM generator. This trade-off emerges not only as a result of the neural network's non-linear activation functions but also as a consequence of the capacity of each generator. A ZIMM (or GMM) generator is essentially a shallow, one-layer model, while FNNs can in principle be constructed arbitrarily deep. Regardless, both generator variants manage to learn the real distribution to a great extent and delineate the original population trajectory. Interestingly, when the GMM generator is used, the generated distribution resembles a piecewise linear interpolation curve between key data points on the original trajectory. This behavior can be interpreted since we have chosen the mean vectors $\mu_k$ to be a linear function of the condition $y$, as defined in Section 3.3. As a result, the spiral trajectory is approximated by a mixture of Gaussians with linearly-varying mean vectors. Let us note that this result is not an effect of the reparameterization methodology but of how we chose to model the conditioning. In the case of unconditional data, where $y$ is not taken into consideration, the generated data are distributed uniformly on the spiral, which is approximated by a mixture of "static" Gaussians (data not shown).

We should finally note that, in the case of the interpretable generator (right panels of Figure 3A,B), the number of Gaussian modes K plays an important role in the quality of the generated data. When a higher-than-required number of Gaussian modes is chosen to represent a given data set, the extra Gaussians become degenerate (i.e., $\pi_k \approx 0$). In effect, the extra Gaussians are discarded from the resulting model.
On the contrary, a smaller K will result in an under-representation of the test data, covering only the main distribution modes while missing smaller ones. In the spiral data set, this resembles a coarse, polygon-shaped approximation by Gaussians (Supplementary Figure S2).

5.2. Synthetic RNA-seq Data

Next, we demonstrate the capabilities of the proposed methodology in capturing distributional changes of high-dimensional populations. For this, we created a synthetic example of a population of cells that transit over three consecutive states of differentiation. We used the dyngen simulator [44] to generate N = 50,000 samples of cells, each associated with 50 gene expression values at different timepoints $y$ across the progression of differentiation. By design, the generated data are zero-inflated, which is a defining characteristic of the respective single-cell RNAseq technology. For this simulated example we used a batch size of 512 samples in order to avoid missing distribution regions of low mass. In addition, the dimension of the random noise input to the generator was set to 18; however, we found that this parameter does not affect the quality of the results. As in the previous example, we initialize the parameters related to the location of the modes of the conditional ZIMM with k-means (over the training data) in order to avoid a potentially high number of training steps and/or the optimization process getting trapped in local minima. We note here that K = 20 zero-inflated modes are required for an accurate description of the 50-dimensional distribution (see Supplementary Figure S10).

Figure 4 visualizes, after dimensionality reduction, the performance of the conditional FNN and the interpretable conditional ZIMM on the simulated data. Each row corresponds to a different sampling scheme. Specifically, the top row shows the generated populations when the training data are uniformly sampled across the trajectory delineated by the conditioning variable $y$. In contrast, the middle and bottom rows illustrate the performance of both variants when some interval of the conditioning variable is missing entirely. In the former case this entails the suspension of a continuous subset of the conditioning variable while, in the latter case, the suspension of a whole differentiation state, i.e., a discrete subpopulation.

Evidently, in the case where all training information is available (Figure 4, top) or when a continuous subset is missing (Figure 4, middle), the conditional FNN managed to generate data that are closer to the real distribution. This is quantitatively verified by the MMD calculations reported in Table 1. Both generator models were able to interpolate between the missing parts of the original distribution. We note here that similar performance is observed if data from the interval $y \in [0.6, 0.8]$ are missing instead (Supplementary Figure S11). Supplementary Figures S5–S8 display how well each variant captures the original data in the 50-dimensional space.

In contrast, when a whole differentiation state is missing from the training data, both methods failed to converge (data not shown). This is because, in this discrete case, the conditioning $y$ does not carry any information that relates states 1 and 3, rendering them independent. For this reason, we repeated the experiment after adding a small percentage of samples from cells at state 2. Figure 4 (bottom) shows the results when 5% of these cells (or 736 in absolute numbers) have been added.
As seen, although both variants manage to create data similar to those coming from the real distribution, they still cannot correctly represent the real distribution of state 2. For example, the variance of the fake distribution generated using the conditional FNN is high. This is probably because the number of samples added from the missing subpopulation is not large enough. In fact, we found that the variance of the generated data is positively correlated with the number of samples from the missing distribution (see Supplementary Figure S12). On the other hand, in the case of the conditional ZIMM, the generated data do not form a single subpopulation that interpolates well between states 1 and 3. This is because, for each individual gene, the algorithm may not fit well value regions of low probability density (the region between the zero-inflated and main modes of genes 32 and 34–36 in Supplementary Figure S8). As expected, the MMD distance values (Table 1) when part of the overall distribution is missing are larger than those obtained when training on the complete data set, though the conditional ZIMM is slightly preferable in this case.

Table 1. MMD computations using FNN and ZIMM for σ = 5 in the RBF kernel, for the experiments in Figure 4. The reference is the average MMD over the test set and 50 randomly chosen, equally-sized training sets and is equal to ≈ 0.015 ± 0.005. A lower MMD value is preferable.

Generator | y ∈ [0, 1] | y ∈ [0.4, 0.6] | y ∈ State 2
cFNN      | 0.0748     | 0.2308         | 0.2430
cZIMM     | 0.0763     | 0.5090         | 0.1719

Figure 4. Synthetic 50-dimensional data set with zero-inflated modes. Visualization of the first two PCA components of real over generated data, colored in turquoise and orange, respectively. (Upper row) Training data with uniformly chosen labels $y \in [0, 1]$. The conditional FNN, shown in the middle, is a better fit than the conditional ZIMM model on the right, at the cost of interpretability. Detailed histograms over all 50 genes (distribution marginals) can be found in Supplementary Figure S13. (Middle row) Training data colored according to the (continuous) pseudo-time label $y$ are shown on the left, where all five intervals have similar sample sizes. Training is performed on data with condition $y \in [0, 0.4] \cup [0.6, 1.0]$ (turquoise), whereas we generate data with $y \in [0.4, 0.6]$ (orange). Interpolation to unseen data is more accurate when the conditional FNN generator is used, though the conditional ZIMM is capable of identifying a significant part of the test set. (Lower row) Training data colored according to the (discrete) differentiation state label $y$ are shown on the left. Conditional FNN (middle) and conditional ZIMM (right), using a subpopulation of 5% of the samples from state 2. The latter point is important because this discrete condition does not provide dynamical information for interpolation.
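The MMD scores in Table 1 (and in the tables of Section 5.4) were computed with the Information Theoretical Estimators Toolbox [41]. Purely for illustration, a generic V-statistic estimator of the squared MMD with an RBF kernel can be sketched in numpy as follows; the bandwidth convention and all names are our own assumptions and not the toolbox's API.

```python
import numpy as np

def rbf_kernel(a, b, sigma=5.0):
    # Pairwise RBF kernel exp(-||a_i - b_j||^2 / (2 sigma^2)).
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-sq / (2.0 * sigma**2))

def mmd2_v(x, y, sigma=5.0):
    """Biased (V-statistic) estimate of MMD^2 between samples x ~ p_r and y ~ p_g."""
    kxx, kyy, kxy = rbf_kernel(x, x, sigma), rbf_kernel(y, y, sigma), rbf_kernel(x, y, sigma)
    return kxx.mean() + kyy.mean() - 2.0 * kxy.mean()

rng = np.random.default_rng(0)
real = rng.standard_normal((1000, 50))          # stand-in for 1000 test cells
fake = rng.standard_normal((1000, 50)) + 0.1    # stand-in for 1000 generated cells
print(np.sqrt(max(mmd2_v(real, fake), 0.0)))    # MMD, i.e., the square root of MMD^2
```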
5.3. Single-Cell Data Augmentation

Finally, we test the training efficacy of our methodology in learning distributions in real biological settings, and particularly when the number of available observations from a given cell population is low. This is a fundamental problem in life sciences for which artificial data augmentation has recently been demonstrated to be a realistic solution [21]. To this end, we used real measurements from [45] (data were accessed from https://community.cytobank.org/cytobank/experiments/46098/illustrations/121588, accessed on 15 April 2020). Specifically, we consider single-cell mass cytometry measurements of 16 bone marrow protein markers (d = 16) coming from healthy and disease-carrying individuals. In total, the data set consists of almost 200K healthy and disease-related cell samples. Before analysis, the data were transformed using the inverse hyperbolic sine (arcsinh) transformation with a cofactor of 5, which is typical in order to have comparable supports across dimensions.

To simulate the existence of a rare cell subpopulation in the data we proceed as follows. First, we collect 26K random cell samples from the healthy population. Then, we consider one case where the number of cells from the rare population is 2% of that of the healthy one and another where it is only 1%. Our goal is to train the proposed conditional generator variants on distributions containing a rare cell population and then use them to generate new realistic samples for data augmentation.

Figure 5 shows the effectiveness of each variant when the training data contain 2% or 1% of the disease-related subpopulation of samples. In general, we see that both the FNN (left) and ZIMM (right) conditional variants are able to generate realistic samples. Data augmentation is feasible for as low as 1% of disease-related data in the training set (lower plots), though some regions in the latent space are not represented sufficiently well. This is verified by a small increase in the MMD distance between the generated and the original data sets. A choice of K = 10 in the ZIMM is sufficiently descriptive for a two-class model of the healthy and diseased 16-dimensional distributions of these data sets.

Figure 5. Two-dimensional PCA plots on the real 16-dimensional data set. Healthy (turquoise) and diseased (red) cell distributions, each consisting of N = 26,000 samples, are shown in each subplot as reference. Cross symbols indicate the generated cell distribution (orange). (Top) Performance of cFNN (left) and cZIMM (right) when the training set comprises 2% (N = 520) diseased samples. (Bottom) Performance of cFNN (left) and cZIMM (right) when the training set comprises 1% (N = 260) diseased samples. Detailed histograms are included in Supplementary Figure S13.
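The data preparation described in this subsection can be sketched as follows; the input arrays and all names are illustrative stand-ins for the actual CyTOF measurements, and only the arcsinh cofactor (5), the healthy sample size (26K) and the 2%/1% rare fractions come from the text above.

```python
import numpy as np

def prepare_cytof(healthy_counts, disease_counts, rare_fraction=0.02,
                  cofactor=5.0, n_healthy=26_000, rng=np.random.default_rng(0)):
    """arcsinh(x / cofactor) transform plus subsampling so that the disease
    class is only `rare_fraction` of the healthy class."""
    healthy = np.arcsinh(healthy_counts / cofactor)
    disease = np.arcsinh(disease_counts / cofactor)
    h_idx = rng.choice(len(healthy), size=n_healthy, replace=False)
    d_idx = rng.choice(len(disease), size=int(rare_fraction * n_healthy), replace=False)
    X = np.concatenate([healthy[h_idx], disease[d_idx]])
    y = np.concatenate([np.zeros(len(h_idx)), np.ones(len(d_idx))])  # class label as condition
    return X, y

# Illustrative call with random stand-ins for the 16-marker measurements.
rng = np.random.default_rng(1)
X, y = prepare_cytof(rng.poisson(3.0, (100_000, 16)).astype(float),
                     rng.poisson(4.0, (100_000, 16)).astype(float), rare_fraction=0.01)
```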
5.4. Comparison against the State-of-the-Art

We compare our approach against another conditional GAN approach tailored for single-cell data, called conditional single-cell GAN (cscGAN) [21]. Briefly, cscGAN takes single-cell RNAseq data as input and is able to augment under-represented classes by generating realistic synthetic samples. Similarly to the cFNN and cZIMM, it is based on the assumption that the training samples lie on the same low-dimensional manifold [46]. However, the cscGAN approach cannot handle continuous conditioning variables, while both cFNN and cZIMM can, via label embedding.

We trained cscGAN on the datasets of Sections 5.2 and 5.3. To achieve a fair comparison, all experiments were conducted on the basis of count data because the cscGAN architecture follows a custom normalization procedure per cell (library-size normalization). Recall that we worked with count data in the scRNAseq dataset (Section 5.2), while we applied an arcsinh transformation to the protein data in Section 5.3. Furthermore, the capacity of both the cscGAN generator and discriminator was reduced because of overfitting concerns. Essentially, we reduced the number of nodes in each of the three layers by a factor between 4 and 16 without affecting the quality of the generated samples on the training set. Obviously, the downsizing of the neural networks also resulted in much faster training. Let us also remark that the respective cFNN presented in Section 5.2 is still considerably smaller than the chosen cscGAN.

On the scRNAseq data, we found that cscGAN fails to generate realistic samples when the under-represented distribution interpolates between consecutive cell states (see Supplementary Figure S14). This result might stem from the fact that the cscGAN approach cannot efficiently handle correlated conditioning variables in cases of low sample size. On the other hand, cscGAN produced marginally better results on the mass cytometry data. Table 2 summarizes the MMD scores when the training data consist of two separate cell populations: one abundant and another that is arbitrarily small, i.e., 2% and 1% relative to the abundant one. As expected, cZIMM's MMD had the highest value. In contrast, the MMD scores of cFNN and cscGAN are on par for the 2% case, while cscGAN has a lower MMD for the 1% case. Accordingly, Figure 6 compares the results of our cFNN model (orange crosses) to the cscGAN approach (green crosses). As is clearly seen, both cscGAN and cFNN are capable of producing realistic samples from the under-represented training class. However, cFNN requires significantly less training time, mainly because of the smaller size of its generator.

Table 2. cFNN, cZIMM and cscGAN MMD scores for the cases of 2% and 1% rare population (RBF kernel, σ = 500). Notice that the value of σ is now two orders of magnitude larger than in Table 3 because the data are gene counts (i.e., unnormalized). The baseline MMD score, calculated over the test set and 50 randomly chosen, equally-sized training sets, is equal to ≈ 0.0086 ± 0.002. A lower MMD value is preferable.

Generator | 2% Subpopulation | 1% Subpopulation
cFNN      | 0.0302           | 0.0359
cZIMM     | 0.0351           | 0.0374
cscGAN    | 0.0296           | 0.0328

Table 3. cFNN and cZIMM MMD scores for the cases of 2% and 1% rare population (RBF kernel, σ = 5). The baseline MMD score, calculated over the test set and 50 randomly chosen, equally-sized training sets, is equal to ≈ 0.016 ± 0.004. A lower MMD value is preferable.

Generator | 2% Subpopulation | 1% Subpopulation
cFNN      | 0.0459           | 0.0819
cZIMM     | 0.0693           | 0.0830
Figure 6. Two-dimensional PCA plots on the real 16-dimensional count data. Healthy (turquoise) and diseased (red) cell distributions, each consisting of N = 26,000 samples, are shown in each subplot as reference. Cross symbols indicate the generated cell distributions. (Top) Performance of cFNN when the training set comprises 2% (left) and 1% (right) diseased samples. (Bottom) cscGAN-generated data for the same training data sets. Both models are equally capable of data augmentation of under-represented classes.

6. Conclusions

In this paper, we proposed a new approach for multivariate data interpolation and augmentation through conditional generative adversarial network (GAN) modeling. This likelihood-free approach is able to accept high-dimensional, zero-inflated data, which is a frequent characteristic of single-cell RNA-seq data. We provide two variants, FNN and ZIMM, for which there is a trade-off between the model's accuracy and interpretability. Both variants are able to generate realistic single-cell RNA-seq representations of high-dimensional data belonging either to distinct states (i.e., discrete condition) or to continuous time (i.e., continuous condition). In addition, we performed data augmentation in cases where as few as 1% of the samples from a state are provided.

Interpolation and augmentation tasks are particularly important when data contain disrupted or incomplete measurements, respectively. For the conditional ZIMM model, careful initialization and hyper-parameter selection based on the real data are crucial, whereas the conditional FNN performs well for a wide range of hyper-parameters and data dimensionalities.

Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app12115434/s1.

Author Contributions: Conceptualization, Y.P.; methodology, Y.P. and A.T.; software, A.T.; validation, A.T.; formal analysis, Y.P.; investigation, A.T.; resources, Y.P.; data curation, G.P.; writing—original draft preparation, A.T.; writing—review and editing, G.P. and Y.P.; visualization, A.T.; supervision, Y.P. and G.P.; project administration, Y.P.; funding acquisition, Y.P., G.P. and A.T. All authors have read and agreed to the published version of the manuscript.

Funding: This research is co-financed by Greece and the European Union (European Social Fund - ESF) through the Operational Programme "Human Resources Development, Education and Lifelong Learning 2014-2020" in the context of the project "Characterizing Population Dynamics with Applications in Biological Data" (MIS 5050686). Yannis Pantazis acknowledges partial support by the Hellenic Foundation for Research and Innovation (H.F.R.I.) under the "Second Call for H.F.R.I. Research Projects to support Faculty members and Researchers" (Project Number: 4753).
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: Code will be made available at https://github.com/tasoskrhs/conditional_ZIMM upon the decision on this work.

Conflicts of Interest: The authors declare no conflict of interest.

References
1. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.C.; Bengio, Y. Generative adversarial nets. In Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680.
2. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:cs.LG/1411.1784.
3. Denton, E.; Chintala, S.; Szlam, A.; Fergus, R. Deep Generative Image Models Using a Laplacian Pyramid of Adversarial Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 1, pp. 1486–1494.
4. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434.
5. Odena, A.; Olah, C.; Shlens, J. Conditional Image Synthesis with Auxiliary Classifier GANs. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 2642–2651.
6. Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
7. Brock, A.; Donahue, J.; Simonyan, K. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
8. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410.
9. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2242–2251. [CrossRef]
10. Ledig, C.; Theis, L.; Huszar, F.; Caballero, J.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; Shi, W. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
11. Pascual, S.; Bonafonte, A.; Serrà, J. SEGAN: Speech Enhancement Generative Adversarial Network. In Proceedings of the INTERSPEECH, Stockholm, Sweden, 20–24 August 2017; pp. 3642–3646.
12. Saito, Y.; Takamichi, S.; Saruwatari, H. Statistical parametric speech synthesis incorporating generative adversarial networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 84–96. [CrossRef]
13. Kumar, K.; Kumar, R.; de Boissiere, T.; Gestin, L.; Teoh, W.Z.; Sotelo, J.; de Brébisson, A.; Bengio, Y.; Courville, A.C. MelGAN: Generative adversarial networks for conditional waveform synthesis. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 14881–14892.
14. Che, T.; Li, Y.; Zhang, R.; Hjelm, R.D.; Li, W.; Song, Y.; Bengio, Y. Maximum-likelihood augmented discrete generative adversarial networks. arXiv 2017, arXiv:1702.07983.
15. Fedus, W.; Goodfellow, I.; Dai, A.M. MaskGAN: Better Text Generation via Filling in the _. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
16. Lan, L.; You, L.; Zhang, Z.; Fan, Z.; Zhao, W.; Zeng, N.; Chen, Y.; Zhou, X. Generative Adversarial Networks and Its Applications in Biomedical Informatics. Front. Public Health 2020, 8, 164. [CrossRef]
17. Saelens, W.; Cannoodt, R.; Todorov, H.; Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 2019, 37, 547–554. [CrossRef]
18. Buettner, F.; Natarajan, K.N.; Casale, F.P.; Proserpio, V.; Scialdone, A.; Theis, F.J.; Teichmann, S.A.; Marioni, J.C.; Stegle, O. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat. Biotechnol. 2015, 33, 155–160. [CrossRef]
19. Stegle, O.; Teichmann, S.A.; Marioni, J.C. Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet. 2015, 16, 133–145. [CrossRef]
20. Ghahramani, A.; Watt, F.M.; Luscombe, N.M. Generative adversarial networks simulate gene expression and predict perturbations in single cells. bioRxiv 2018. [CrossRef]
21. Marouf, M.; Machart, P.; Bansal, V.; Kilian, C.; Magruder, D.S.; Krebs, C.F.; Bonn, S. Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks. Nat. Commun. 2020, 11, 166. [CrossRef]
22. Grønbech, C.H.; Vording, M.F.; Timshel, P.N.; Sønderby, C.K.; Pers, T.H.; Winther, O. scVAE: Variational auto-encoders for single-cell gene expression data. Bioinformatics 2020, 36, 4415–4422. [CrossRef] [PubMed]
23. Xu, Y.; Zhang, Z.; You, L.; Liu, J.; Fan, Z.; Zhou, X. scIGANs: Single-cell RNA-seq imputation using generative adversarial networks. Nucleic Acids Res. 2020, 48, e85. [CrossRef] [PubMed]
24. Arisdakessian, C.; Poirion, O.; Yunits, B.; Zhu, X.; Garmire, L.X. DeepImpute: An accurate, fast, and scalable deep neural network method to impute single-cell RNA-seq data. Genome Biol. 2019, 20, 211. [CrossRef] [PubMed]
25. Nowozin, S.; Cseke, B.; Tomioka, R. f-GAN: Training Generative Neural Samplers Using Variational Divergence Minimization. In Proceedings of the 30th International Conference on Neural Information Processing Systems 29 (NIPS 2016), Barcelona, Spain, 5–10 December 2016; pp. 271–279.
26. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 214–223.
27. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A.C. Improved training of Wasserstein GANs. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5767–5777.
28. Birrell, J.; Dupuis, P.; Katsoulakis, M.A.; Rey-Bellet, L.; Wang, J. Variational Representations and Neural Network Estimation for Rényi Divergences. arXiv 2020, arXiv:2007.03814.
29. Miyato, T.; Kataoka, T.; Koyama, M.; Yoshida, Y. Spectral Normalization for Generative Adversarial Networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
30. Li, C.L.; Chang, W.C.; Cheng, Y.; Yang, Y.; Poczos, B. MMD GAN: Towards Deeper Understanding of Moment Matching Network. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
31. Birrell, J.; Dupuis, P.; Katsoulakis, M.A.; Pantazis, Y.; Rey-Bellet, L. (f,Γ)-Divergences: Interpolating between f-Divergences and Integral Probability Metrics. J. Mach. Learn. Res. 2022, 23, 1–70.
32. Bishop, C.M. Pattern Recognition and Machine Learning (Information Science and Statistics); Springer: New York, NY, USA, 2006.
33. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–38.
34. Moon, T. The expectation-maximization algorithm. IEEE Signal Process. Mag. 1996, 13, 47–60. [CrossRef]
35. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 14–16 April 2014.
36. Maddison, C.J.; Mnih, A.; Teh, Y.W. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In Proceedings of the 5th International Conference on Learning Representations, ICLR, Toulon, France, 24–26 April 2017.
37. Zong, B.; Song, Q.; Min, M.R.; Cheng, W.; Lumezanu, C.; Cho, D.; Chen, H. Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
38. Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y.N. Convolutional Sequence to Sequence Learning. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 1243–1252.
39. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
40. Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A Kernel Two-sample Test. J. Mach. Learn. Res. 2012, 13, 723–773.
41. Szabó, Z. Information Theoretical Estimators Toolbox. J. Mach. Learn. Res. 2014, 15, 283–287.
42. Minka, T. Divergence Measures and Message Passing; Technical Report MSR-TR-2005-173; Microsoft Research: New York, NY, USA, 2005.
43. Pantazis, Y.; Paul, D.; Fasoulakis, M.; Stylianou, Y.; Katsoulakis, M.A. Cumulant GAN. arXiv 2020, arXiv:2006.06625v2.
44. Cannoodt, R.; Saelens, W.; Deconinck, L.; Saeys, Y. Spearheading future omics analyses using dyngen, a multi-modal simulator of single cells. Nat. Commun. 2021, 12, 3942. [CrossRef] [PubMed]
45. Levine, J.H.; Simonds, E.F.; Bendall, S.C.; Davis, K.L.; Amir, E.A.D.; Tadmor, M.D.; Litvin, O.; Fienberg, H.G.; Jager, A.; Zunder, E.R.; et al. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis. Cell 2015, 162, 184–197. [CrossRef] [PubMed]
46. Lindenbaum, O.; Stanley, J.; Wolf, G.; Krishnaswamy, S. Geometry Based Data Generation. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Volume 31.

More recently, they have also started to show their usefulness in life science applications [16].
GANs often face training difficulties when applied to single-cell data, whether gene expression or proteomic measurements. The main reason is that cells constantly shift from one state or type to another, either naturally, as in an immune response, or artificially, as after a drug treatment, and the current sequencing technology allows only sparse observations scattered along the evolution of the samples [17]. For example, single-cell RNA sequencing (scRNAseq) is able to capture the diversity of cells during cell differentiation or activation by profiling transcript abundances at high resolution. However, it does so by creating snapshot observations where the temporal resolution is low or totally unknown. Consequently, some cell states may be missing entirely, or only a low number of samples may be available from those states [18]. On top of that, the collected data include a huge amount of zero values (i.e., element-wise zero-inflated samples) that cause established analysis methods to break down [19].

In scRNAseq modeling, representation and generation, GANs and, more generally, generative models have been utilized for imputing missing values in the data, for dimensionality reduction, for mapping data across different measurement types (e.g., domain adaptation), for data augmentation through the approximation of the original data distribution, and for modeling the response of differentiating cells upon external perturbation. In particular, an early application of GANs to scRNAseq data was the identification of state-determining genes and clustering tasks [20]. Marouf et al. [21] developed a conditional GAN for data augmentation of single-cell data sets, demonstrating realistic in silico generation of under-represented cell types. In [22], the authors took into account the zero-inflation of the data and proposed a variational autoencoder for clustering purposes. Taking a different approach to mitigating the excess amount of zeros in the expression matrix, the authors in [23,24] performed data imputation. In [23], particularly, the authors converted scRNAseq data to images and then trained GANs to perform dropout imputation and related downstream analysis.

Here, we build on the existing literature and further present:
1. Semi-interpretable generators that generate synthetic data which follow the statistical properties of biological data and, particularly, the mixed nature of the distributions of single-cell RNA-seq expression data.
2. Based on the framework of GANs, a novel training approach for a mixture model of zero-inflated multi-dimensional variables. We utilize Rényi divergence minimization for the first time on biological data to estimate the parameters. Overall, the proposed approach is more flexible since it does not require optimization over an (assumed) likelihood function.
3. Comparisons between neural network generators (not interpretable) and the proposed semi-interpretable generators for data interpolation and data augmentation tasks. Although both variants produce accurate results, we show that there is a trade-off between the sample quality of the generated data and interpretability.
We also train the generative model proposed in [21] and compare its results with ours in terms of quality and complexity.

The rest of the paper is structured as follows. In Section 2, a brief introduction to GAN training and Rényi divergence is provided, while Section 3 presents the generators used to produce (potentially) zero-inflated variables. Section 4 describes how the experiments were set up and how the results were evaluated. In Section 5, we demonstrate the performance of the proposed models on data interpolation and data augmentation using simulated as well as real biological datasets. Section 6 concludes the paper.

2. Generative Adversarial Networks

Originally, Generative Adversarial Networks (GANs) were introduced as a zero-sum game between two competing neural networks with high learning capacity [1]. The first neural network is the generator, which takes random noise as input and transforms it into a realistic sample, while the second neural network, called the discriminator, aims to detect the differences between the real and the synthetic samples. A different but more instructive presentation of the GAN framework is to consider GANs as a minimization problem between two probability distributions. Mathematically, letting p_r denote the distribution of the real samples and p_g the distribution of the synthetic samples, the goal is to minimize

\min_G D(p_r \| p_g),    (1)

where G is the generator and D(·‖·) is a divergence (i.e., a pseudo-distance) or a distance between two distributions. Vanilla GAN [1] minimizes the Jensen–Shannon divergence, while f-GAN [25] minimizes the f-divergence. A significantly more stable GAN is obtained via Wasserstein distance minimization, giving rise to WGAN [26,27]. With the divergence minimization perspective, the utilization of the discriminator seems redundant. Unfortunately, the distance or divergence D(·‖·) in (1) is not computable, because explicit expressions for p_r and p_g are not available in the majority of applications. However, it can become tractable using the so-called variational representation (or duality) formulas. Variational formulas essentially transform the problem of estimating the value of a divergence into a maximization problem over a function space. The function space is further approximated by parametric families of models, with neural networks being the typical choice. Please note that these parametric models constitute the analog of the discriminator. In the following, we propose to use the Rényi divergence for training the generator and demonstrate how it can become tractable via the Rényi–Donsker–Varadhan variational formula.

2.1. Rényi Divergence Minimization and Rényi GAN

Assuming mutual absolute continuity of the distributions, the Rényi divergence of p_r with respect to p_g is defined by

R_\alpha(p_r \| p_g) := \frac{1}{\alpha(\alpha - 1)} \log \mathbb{E}_{p_g}\!\left[\left(\frac{d p_r}{d p_g}\right)^{\alpha}\right],    (2)

where α ∈ ℝ∖{0, 1} is the order parameter. The definition can be extended to 0 and 1 by continuity, leading to the reverse KL and the KL divergence, respectively. The order α effectively controls the tail characteristics of the two distributions.

A variational representation formula is essentially an expression between expected values that forms a lower bound for the divergence. As already mentioned, the use of variational formulas transforms the direct divergence estimation problem into an optimization problem over functions.
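As a numerical illustration of definition (2), not part of the authors' pipeline, the following sketch estimates the Rényi divergence by Monte Carlo for two univariate Gaussians with equal variance, where the density ratio dp_r/dp_g is available in closed form. Under this equal-variance assumption the value reduces to (μ_r − μ_g)²/(2σ²), i.e., the Gaussian KL divergence, for any order α; all names and the sample size are illustrative.

```python
# Monte Carlo sanity check of the Renyi divergence definition (2) for two 1-D Gaussians
# with equal variance; assumes the closed-form density ratio of the two Gaussians.
import numpy as np

rng = np.random.default_rng(0)
mu_r, mu_g, sigma = 1.0, 0.0, 1.0

def log_ratio(x):
    # log p_r(x) - log p_g(x) for equal-variance Gaussians
    return ((x - mu_g) ** 2 - (x - mu_r) ** 2) / (2.0 * sigma ** 2)

def renyi_mc(alpha, n=1_000_000):
    x = rng.normal(mu_g, sigma, size=n)                      # samples from p_g
    log_e = np.log(np.mean(np.exp(alpha * log_ratio(x))))    # log E_{p_g}[(dp_r/dp_g)^alpha]
    return log_e / (alpha * (alpha - 1.0))

# Both estimates should be close to (mu_r - mu_g)^2 / (2 sigma^2) = 0.5 in this example.
print(renyi_mc(0.5), renyi_mc(0.9), (mu_r - mu_g) ** 2 / (2 * sigma ** 2))
```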
The variational representation of the Rényi divergence is given by ([28], Theorem 1)

R_\alpha(p_r \| p_g) = \sup_{D \in \mathcal{M}_b(\Omega)} \left\{ \frac{1}{\alpha - 1} \log \mathbb{E}_{p_r}\!\left[e^{(\alpha - 1) D}\right] - \frac{1}{\alpha} \log \mathbb{E}_{p_g}\!\left[e^{\alpha D}\right] \right\},    (3)

where M_b(Ω) is the space of all measurable and bounded functions from Ω to ℝ. Combining (1) with (3) and introducing an additional generalization, the minimax optimization problem is formulated as

\min_G \max_{D \in \Gamma} \left\{ \frac{1}{\alpha - 1} \log \mathbb{E}_{x \sim p_r}\!\left[e^{(\alpha - 1) D(x)}\right] - \frac{1}{\alpha} \log \mathbb{E}_{z \sim p_z}\!\left[e^{\alpha D(G(z))}\right] \right\},    (4)

where Γ is the function space for the discriminator. When Γ is chosen to be the space of all measurable and bounded functions, the Rényi divergence of p_r with respect to p_g is minimized. However, several other options can be selected and are often preferred. Indeed, a widely used function space is the space of Lipschitz continuous functions, which has been shown to improve the stability of GAN training [27,29]. Another choice is the reproducing kernel Hilbert space (RKHS), which has also been utilized in GANs [30]. It has been recently shown in [31] that rich-enough function spaces for the discriminator can be used while still preserving the divergence minimization perspective.

2.2. Conditional Rényi GAN

Often, we are interested in generating a conditional distribution. Letting y denote the condition, the optimization problem in (4) is extended to

\min_G \max_{D \in \Gamma} \left\{ \frac{1}{\alpha - 1} \log \mathbb{E}_{x \sim p_r}\!\left[e^{(\alpha - 1) D(x, y)}\right] - \frac{1}{\alpha} \log \mathbb{E}_{z \sim p_z}\!\left[e^{\alpha D(G(z, y), y)}\right] \right\},    (5)

where both the generator and the discriminator take the conditional variable y as an additional input.

The condition variable y can be either discrete (e.g., categorical) or continuous, and in both cases its vector embedding, also denoted by y, lies in a higher d_y-dimensional space. In the discrete case, one-hot encoding can be used and d_y equals the number of categories of y, e.g., y = (0, 1, 0, . . . , 0) for the 2nd category. In the one-dimensional continuous case, the condition is represented as a linear combination of two vectors in the d_y-dimensional Euclidean space. Mathematically, y = y_n e_1 + (1 − y_n) e_2, where y_n is the unity-based normalization (a.k.a. min–max feature scaling) of y.

3. Semi-Interpretable Generators

The typical parametric models used for the generator are various types of neural networks. Despite their expressivity, neural networks are difficult to interpret; therefore, we target modelling the data distribution with the simpler but more interpretable Gaussian Mixture Model (GMM) [32]. Instead of following the standard Expectation–Maximization algorithm [33,34] for the estimation of the GMM's parameters, we propose to use the likelihood-free divergence minimization framework and apply a reparameterization trick for the GMM in order to differentiate and be able to back-propagate the gradients of the parameters. We first revisit the reparameterization trick for the Gaussian distribution and then extend it to the GMM as well as to the conditional GMM.

3.1. Gaussian Case

Assuming a Gaussian generator with parameters θ = {μ, Σ}, the reparameterization trick simply states that a Gaussian sample can be obtained as an affine transformation of a standard d-dimensional Gaussian distribution N(0, I_d). As in Variational Auto-Encoders [35], a Gaussian sample is generated from the affine transformation x = μ + Lz with z ∼ N(0, I_d), where L satisfies the equation LL^T = Σ. In other words, the generator becomes a linear transformation of the input noise vector z.
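To make the adversarial objective concrete, the following is a minimal PyTorch sketch of the Rényi–Donsker–Varadhan objective (4) with a linear (Gaussian-reparameterization) generator in the spirit of Section 3.1 and a small MLP discriminator. It is not the released implementation: the toy target, layer sizes, learning rates and names are assumptions, and the Lipschitz-1 gradient penalty used in Section 4 is omitted for brevity.

```python
# Sketch of Renyi GAN training via objective (4): D maximizes, G minimizes.
import math
import torch
import torch.nn as nn

alpha, d, dz = 0.5, 2, 2

class LinearGenerator(nn.Module):
    """Gaussian reparameterization x = mu + L z as a trainable module (Section 3.1)."""
    def __init__(self, d, dz):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(d))
        self.L = nn.Parameter(torch.eye(d, dz))
    def forward(self, z):
        return self.mu + z @ self.L.T

G = LinearGenerator(d, dz)
D = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 1))

def renyi_dv_objective(x_real, x_fake):
    # (1/(alpha-1)) log E_r[exp((alpha-1) D(x))] - (1/alpha) log E_g[exp(alpha D(G(z)))]
    d_real = D(x_real).squeeze(-1)
    d_fake = D(x_fake).squeeze(-1)
    lme_r = torch.logsumexp((alpha - 1.0) * d_real, dim=0) - math.log(len(d_real))
    lme_g = torch.logsumexp(alpha * d_fake, dim=0) - math.log(len(d_fake))
    return lme_r / (alpha - 1.0) - lme_g / alpha

opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
x_real = torch.randn(512, d) * 0.5 + torch.tensor([2.0, -1.0])   # toy target distribution

for step in range(2000):
    for _ in range(5):                                           # 5 D updates per G update
        x_fake = G(torch.randn(512, dz))
        loss_D = -renyi_dv_objective(x_real, x_fake.detach())
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    x_fake = G(torch.randn(512, dz))
    loss_G = renyi_dv_objective(x_real, x_fake)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```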
3.2. GMM Case

Moving forward, we present the more expressive case of GMMs as generators. The probability density of a GMM is a weighted sum of K d-dimensional Gaussians given by Σ_{k=1}^K π_k N(μ_k, Σ_k), where π_k > 0 is the normalized frequency of the k-th Gaussian, while (μ_k, Σ_k) are the respective mean vector and covariance matrix. A sample via the reparameterization trick for the GMM is obtained as

x = \sum_{k=1}^{K} \mathbf{1}_{[w_{k-1}, w_k]}(u) \, (\mu_k + L_k z_k), \quad \text{with } u \sim \mathcal{U}(0, 1), \; z_k \sim \mathcal{N}(0, I_d) \text{ i.i.d.},    (6)

where 1_{[a,b]}(·) denotes the indicator function of the interval [a, b], and w_k corresponds to the cumulative mass function given by w_k = Σ_{j=1}^k π_j with w_0 = 0 and w_K = 1. Notice that the indicator function is non-differentiable because it is discontinuous at the extreme points a and b.

The generator parameters to be learned are the probability and the parameters of the affine transformation for each Gaussian, whereas K is fixed. The choice of K should respect the characteristics of the data set; in particular, as a rule of thumb, K could take a value close to the number of clusters in the data set. A clustering algorithm such as k-means (or Louvain clustering or a similar algorithm) could be applied to the training data set prior to training in order to determine the value of K. Moreover, we experimentally observed that utilizing a higher value of K leads to many weights π_k being close to zero. Therefore, we typically choose a large value for K without instilling instabilities in the training process. In other words, we could select K using our framework iteratively by keeping only the modes that are above a certain threshold.

While the training of μ_k and L_k is straightforward, the training of the probabilities is more elaborate. First, the softmax function is used on every training step to normalize the values, π_k = e^{q_k} / Σ_{j=1}^K e^{q_j}, where the q_k's are the unconstrained trainable variables which can be trained with standard stochastic back-propagation algorithms. Indeed, the direct training of the probabilities π_k cannot be carried out with standard unconstrained algorithms because of the constraint requirements on π_k. Second, in order to avoid the non-differentiability issue of the indicator function, we approximate it via the difference of two scaled sigmoid functions with a scaling factor c. Mathematically, the approximation is given by

\mathbf{1}_{[w_{k-1}, w_k]}(u) \approx \sigma_c(u - w_{k-1}) - \sigma_c(u - w_k) = \frac{1}{1 + e^{-c(u - w_{k-1})}} - \frac{1}{1 + e^{-c(u - w_k)}}.    (7)

Figure 1 shows both the indicator function and its approximation for different values of the scaling factor. We set the scaling factor to c = 300, which is a reasonable compromise between the accuracy of the indicator approximation and the efficient propagation of (non-zero) gradients. We also remark that the authors in [36] followed another approach with a similar rationale to avoid differentiation issues associated with sampling from discrete distributions.

Figure 1. Approximation of the indicator function 1_{[w_{k-1}, w_k]}(u) by the difference of two sigmoids. Factor c controls the steepness and we plot c = 100 (blue), c = 300 (red) and c = 500 (yellow).
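The sketch below illustrates the differentiable GMM sampler of Equations (6)–(7): softmax-normalized weights, cumulative bins and a sigmoid-smoothed indicator with c = 300. It is an illustrative reconstruction rather than the authors' code; K, d and variable names are assumptions.

```python
# Differentiable GMM sampling via Equations (6)-(7).
import torch

K, d, c = 5, 2, 300.0
q  = torch.zeros(K, requires_grad=True)                         # unconstrained logits q_k
mu = torch.randn(K, d, requires_grad=True)                      # mean vectors mu_k
L  = torch.stack([torch.eye(d) for _ in range(K)]).requires_grad_(True)   # factors L_k

def sample_gmm(batch):
    pi = torch.softmax(q, dim=0)                                # pi_k = exp(q_k)/sum_j exp(q_j)
    w  = torch.cat([torch.zeros(1), torch.cumsum(pi, dim=0)])   # w_0 = 0, ..., w_K = 1
    u  = torch.rand(batch, 1)                                   # mode-selection noise
    z  = torch.randn(batch, K, d)                               # Gaussian noise per mode
    # smoothed 1_{[w_{k-1}, w_k]}(u) = sigmoid(c(u - w_{k-1})) - sigmoid(c(u - w_k))
    gate = torch.sigmoid(c * (u - w[:-1])) - torch.sigmoid(c * (u - w[1:]))   # (batch, K)
    comp = mu.unsqueeze(0) + torch.einsum('kij,bkj->bki', L, z)               # mu_k + L_k z_k
    return (gate.unsqueeze(-1) * comp).sum(dim=1)               # (batch, d), differentiable

x = sample_gmm(256)   # can be fed into the generator side of objective (4)
```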
3.2.1. Additional Penalty Term on Diagonal Elements of the Covariance Matrix

The training of a GMM generator often results in degenerate covariance matrices whose diagonal elements take values very close to zero. In order to overcome this stability issue during training, we penalize the diagonal elements of the Σ_k = L_k L_k^T matrices to be positive, as in [37]. Thus, we add to the loss function (4) the penalty term

P(\Sigma) = \lambda_{\Sigma} \sum_{k=1}^{K} \sum_{j=1}^{d} \frac{1}{(\Sigma_k)_{jj}},    (8)

where the coefficient λ_Σ > 0 controls the effect of the penalty term, while Σ_k is the covariance matrix of the k-th Gaussian. Upon experimentation, we find that λ_Σ = 0.001 produces non-degenerate covariance matrices in several synthetic examples. Supplementary Figure S3 presents additional results for a series of values of the multiplicative factor λ_Σ.

3.3. Conditional GMM Case

When we want to construct the conditional variant of Equation (6) over y, we expand the parameter space as follows. Let q = [q_1, . . . , q_K] be the unconstrained probability vector; we extend it as a linear function of the conditional vector, q = yW_q + b_q, where W_q ∈ ℝ^{d_y×K} and b_q ∈ ℝ^K are the new parameters to be learned. Additionally, we incorporate the condition y in each mean vector following the same rationale: μ_k = yW_{μ,k} + b_{μ,k}, where W_{μ,k} ∈ ℝ^{d_y×d} and b_{μ,k} ∈ ℝ^d. Thus, the weight as well as the mean of each Gaussian may change with the condition. This is quite common in biological datasets and especially in the differentiation process, where sub-populations of cells evolve in size and number with time, which acts as the condition variable. Finally, in our conditional generator, we choose the covariance matrices Σ_k to be independent of the condition variable y.

3.4. Generative Models for Zero-Inflated Variables

There is an important class of biological data whose values are both discrete and continuous and can be represented with mixed random variables. A particular example of interest is the so-called zero-inflated variables, which have a non-zero mass at value zero and a density distribution for the typically positive values. A zero-inflated variable can simultaneously model the answer to a Yes/No question and the value of the quantity in case the answer is Yes. In gene expression studies and RNA-seq data, zero-inflated variables are ubiquitous since a gene might or might not be activated, and when activated, the level of activity is measured. The generation of zero-inflated variables requires an extra modeling step which takes into account the mixed nature of these variables, as they constitute a mixture of a continuous distribution and a point (Dirac) mass at zero. Next, we present how to incorporate zero-inflated variables both in the neural network setting and in an extension of the GMM-based generator.

3.4.1. Neural Networks: Gated Activation

In the case of feedforward neural networks (FNN), we propose to use gated linear units (GLU) [38] as activation functions in the output layer of the generator. A GLU is defined as an element-by-element multiplication between the output of a linear layer and another "parallel" layer with sigmoid activation, which acts as a mask (or a gate). The information of the linear layer passes when the gate is open (i.e., when the sigmoid activation is one), while the output is zero when the gate is closed (i.e., when the sigmoid activation is zero). The architecture of the generator with a GLU output layer is shown in Figure 2. The conditional variant of the FNN generator is similar to [2], where we augment each sample with its respective embedded condition y. In the following, we refer to this variant as cFNN.

Figure 2. Schematic of how gating is implemented in the generator architecture of the feed-forward neural network (FNN) conditioned on y. Gating is used to model zero-inflated samples. The output layer is calculated as G(z, y) = (W h + b) ⊙ σ(W' h + b'), where ⊙ stands for element-wise multiplication.
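A minimal sketch of the gated output layer of Figure 2 and of a cFNN-style generator that concatenates noise with the embedded condition is given below. Layer sizes and names are illustrative assumptions, not the authors' architecture.

```python
# GLU-style gated output layer for zero-inflated samples (Figure 2), plus a cFNN-like generator.
import torch
import torch.nn as nn

class GatedOutput(nn.Module):
    def __init__(self, hidden_dim, data_dim):
        super().__init__()
        self.value = nn.Linear(hidden_dim, data_dim)   # W h + b
        self.gate  = nn.Linear(hidden_dim, data_dim)   # W' h + b'
    def forward(self, h):
        # output is driven to zero wherever the sigmoid gate saturates at 0
        return self.value(h) * torch.sigmoid(self.gate(h))

class ConditionalFNNGenerator(nn.Module):
    def __init__(self, noise_dim, cond_dim, hidden_dim, data_dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(noise_dim + cond_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.out = GatedOutput(hidden_dim, data_dim)
    def forward(self, z, y):
        return self.out(self.body(torch.cat([z, y], dim=-1)))

G = ConditionalFNNGenerator(noise_dim=18, cond_dim=10, hidden_dim=32, data_dim=50)
x_fake = G(torch.randn(8, 18), torch.rand(8, 10))
```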
3.4.2. The Reparameterization Trick for ZIMMs

As already mentioned, the probability density function of a zero-inflated variable can be represented as a weighted sum of a Dirac mass at zero and a continuous component. Thus, a Bernoulli random variable can determine which of the two components is selected for the generation of one sample. Using the same trick as for the mode selection in GMMs, the extension of Equation (6) can be written as an extra combination of a Gaussian sample z_k and a Bernoulli distribution with probability a_k. We term Equation (9) the Zero-Inflated Mixture Model (ZIMM):

x = \sum_{k=1}^{K} \mathbf{1}_{[w_{k-1}, w_k]}(u) \left[ \mathbf{1}_{[0, a_k]}(u') \cdot 0 + \mathbf{1}_{[a_k, 1]}(u') \cdot (\mu_k + L_k z_k) \right],    (9)
z_k \sim \mathcal{N}(0, I_d), \quad u \sim \mathcal{U}(0, 1), \quad u' \sim \mathcal{U}(0, 1)^d,

where each a_k is a d-dimensional vector denoting the Bernoulli probabilities for each dimension of the data, and the indicator functions of u' are applied element-wise. As in the GMM case, we approximate both indicator functions with the difference of two sigmoid functions as in (7). In total, there are K·d more parameters to be learned in the ZIMM relative to the GMM, corresponding to the Bernoulli probabilities for each dimension and each mode. We also implement a conditional version of the ZIMM, where the trainable variables for the weight probabilities and the mean vector of each Gaussian are conditioned while the remaining parameters are not.
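The per-mode zero-inflation gate of Equation (9) can be sketched as a small extension of the GMM sampler above; for the indicator of [a_k, 1] the upper-boundary sigmoid σ_c(u' − 1) is dropped since it is approximately zero on [0, 1] for large c. Names (a_logit, zimm_component) and sizes are illustrative assumptions.

```python
# One ZIMM mixture component of Equation (9): element-wise dropout gate times a Gaussian sample.
import torch

K, d, c = 5, 50, 300.0
a_logit = torch.zeros(K, d, requires_grad=True)      # unconstrained; a_k = sigmoid(a_logit_k)

def zimm_component(mu_k, L_k, a_k, batch):
    z  = torch.randn(batch, d)
    u2 = torch.rand(batch, d)                        # u' ~ U(0,1)^d
    # smoothed 1_{[a_k, 1]}(u'): ~1 when u' > a_k (keep the Gaussian part), ~0 otherwise
    keep = torch.sigmoid(c * (u2 - a_k))
    return keep * (mu_k + z @ L_k.T)                 # element-wise zero-inflated output

x_k = zimm_component(torch.zeros(d), torch.eye(d), torch.sigmoid(a_logit[0]), batch=256)
# Combining with the smoothed mode-selection gate of the GMM sketch yields the full ZIMM sampler.
```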
4. Experimental Setup

All experiments were performed using the (conditional) Rényi GAN framework with a discriminator of two hidden layers. The FNN-based generator has two hidden layers as well, whereas the ZIMM-based generator is designed as described in Section 3.4.2. Hyper-parameters (number of nodes, learning rate and batch size) were determined via a small grid search that minimized the adversarial loss, complemented by qualitative data visualization, a common practice for training deep generative models. Based on this grid search, we chose the number of nodes to be 32 for both hidden layers in all of the experiments described in Section 5. The ReLU activation function is used in all hidden layers of the discriminator and the generator. For the output layer, tanh(·) is used for the discriminator and a linear activation for the generator. The constraining of the optimization to the space of Lipschitz-1 functions was carried out by a one-sided gradient penalty [27] with constant λ_gp = 1; when Lipschitz-1 penalization is used, the activation of the discriminator's output layer is linear. We used batch sizes in the range 512–1024, depending on the dimension of the data, and the Adam optimizer [39] with a learning rate in the range 0.0005–0.001. The number of training steps varied between the experiments, as we observed that convergence was slower for the semi-interpretable generators. Depending on the initialization of the network parameters, convergence of the adversarial loss is attained within 100,000–200,000 minibatch steps for the FNN variant and roughly twice that for the ZIMM-based generator, provided that Lipschitz-1 constraining is used; otherwise, more training steps are required. Unless otherwise stated, the order of the Rényi divergence is set to α = 0.5, which corresponds to a one-to-one relation with the Hellinger distance. In all cases, the discriminator is updated 5 times followed by one update of the generator.

Evaluation Criteria for Assessing Generated Data Quality

We employ three criteria to assess the quality of the generated data G(z) ∼ p_g with respect to test data from the real distribution p_r: maximum mean discrepancy (MMD) [40], principal component analysis (PCA) and marginal distribution histograms. We use the Information Theoretical Estimators Toolbox [41] for computing the MMD between the test and generated distributions. The reference MMD is computed from test data as the average over fifty randomly sampled, equally-sized datasets, all conditioned on the same subset of y values. Every generated or real data set consists of 1000 data points for a fair comparison. The radial basis function with σ = 5.0 is chosen for the kernel; this value is determined by the average variance of the training data and accounts for better sensitivity to small differences between the distributions under comparison. Both the U-statistic and V-statistic implementations of MMD yield very similar values, and we chose the former. Scikit-learn 1.0.1 is used for PCA.

5. Results and Discussion

5.1. Nonlinear Trajectory Interpolation

We begin by demonstrating the efficacy of the proposed Rényi GAN training methodology in learning the mapping, and then interpolating between points, of a population whose distribution follows a spiral pattern, a two-dimensional variant of the Swiss roll dataset:

(x_1, x_2) = \big(y \cos(y),\, y \sin(y)\big) + \epsilon, \quad \text{with } y = \frac{3\pi}{2}(1 + 2u), \; u \sim \mathcal{U}(0, 1) \text{ and } \epsilon \sim \mathcal{N}(0, e I_2),    (10)

where y acts as the condition variable and e = 0.2. As explained in Section 2, we construct the embedding vector of the conditional variable y to span a 10-dimensional interval with boundaries e_1 = (1, . . . , 1) and e_2 = −e_1. Moreover, we set the generator's noise input z to be a 2-dimensional Gaussian vector. We found that the training steps required by the FNN generator to converge were half of those required by the GMM generator (Supplementary Figure S4). To speed up convergence in both cases, we used the gradient penalty during optimization, thus restricting the function space of the discriminator to Lipschitz continuous functions. We further apply the covariance penalty, as explained in Section 3.2.1, for the GMM generator. We also initialize the parameters of the GMM generator using the k-means algorithm, which provides a reasonable starting point and improves both the speed and the convergence stability of the training process.

Moving forward, it is well known that different divergences (here, different values of the order of the Rényi divergence, α) result in fundamentally different behavior during training, thus affecting the obtained solution [42,43]. For instance, KLD minimization (i.e., α = 1) tends to produce a distribution that covers all the modes, while the reverse KLD (i.e., α = 0) tends to produce a distribution that often contains only a subset of the modes. For the spiral-shaped distribution, we experimentally verified that the reverse KLD is a more suitable choice for the loss function. Indeed, the reverse KLD favors the concentration of the mass towards the mean instead of covering the tails (see Supplementary Figure S1).
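For reference, a short sketch of how the spiral data set of Equation (10) can be generated is shown below; the sample size, seed and the interpretation of e as the noise variance are assumptions of this illustration.

```python
# Spiral (2-D Swiss-roll-like) data set of Equation (10).
import numpy as np

rng = np.random.default_rng(0)
N, e = 10_000, 0.2
u = rng.uniform(0.0, 1.0, size=N)
y = 1.5 * np.pi * (1.0 + 2.0 * u)                      # condition variable
x = np.column_stack([y * np.cos(y), y * np.sin(y)])
x += rng.normal(scale=np.sqrt(e), size=(N, 2))         # noise; e read as the variance of each coordinate
y_n = (y - y.min()) / (y.max() - y.min())              # min-max scaling used for the condition embedding
```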
Figure 3 demonstrates the generative capabilities of the proposed methodology in two cases: (A) when training data exist for the full range of values of the conditioning variable, and (B) when no training data exist for two distant intervals of the conditioning variable. In particular, the left panel of Figure 3A shows the training samples, with the condition y encoded by the coloring of the samples and scaled to the interval [0, 1]. Accordingly, the left panel of Figure 3B shows the training samples for y ∈ [0, 0.25] ∪ [0.3, 0.6] ∪ [0.65, 1]. The middle and right panels of Figure 3 show the generated data after training using the conditional FNN and the conditional GMM generator, respectively.

Figure 3. Real and generated samples of a 2-dimensional dataset of size N = 10,000 following a spiral-shaped distribution. Color indicates the value of the condition variable, y ∈ [0, 1]. (A) Training data (left) where y belongs to the whole interval [0, 1], along with generated data using conditional FNN (middle) and conditional GMM (right) with K = 30 modes. (B) Training data (left) where y belongs to [0, 0.25] ∪ [0.3, 0.6] ∪ [0.65, 1]. Generated data using conditional FNN (middle) and conditional GMM (right).

The comparison between Figure 3A,B reveals that training on the complete data set is more accurate and also faster (see Supplementary Figure S4). Furthermore, there is a trade-off between accuracy and interpretability, as in both cases the reconstruction achieved by the conditional FNN variant is relatively smoother than that of the conditional GMM generator. This trade-off emerges not only as a result of the neural network's non-linear activation functions but also as a consequence of the capacity of each generator. A ZIMM (or GMM) generator is essentially a shallow, one-layer model, while FNNs can, in principle, be constructed arbitrarily deep. Regardless, both generator variants manage to learn the real distribution to a great extent and delineate the original population trajectory.

Interestingly, when the GMM generator is used, the generated distribution resembles a piecewise linear interpolation curve between key data points on the original trajectory. This behavior can be interpreted since we have chosen the mean vectors μ_k to be a linear function of the condition y, as defined in Section 3.3. As a result, the spiral trajectory is approximated by a mixture of Gaussians with linearly varying mean vectors. Let us note that this result is not an effect of the reparameterization methodology but of how we chose to model the conditioning. In the case of unconditional data, where y is not taken into consideration, the generated data are distributed uniformly on the spiral, which is approximated by a mixture of "static" Gaussians (data not shown).

We should finally note that, in the case of the interpretable generator (right panels of Figure 3A,B), the number of Gaussian modes K plays an important role in the quality of the generated data. When choosing a higher-than-required number of Gaussian modes to represent a given data set, the extra Gaussians become degenerate (i.e., π_k ≈ 0). In effect, the extra Gaussians are discarded from the resulting model.
On the contrary, a smaller K will result in an under-representation of the test data, covering only the main distribution modes while missing smaller ones. In the spiral data set, this resembles a coarse, polygon-shaped approximation by Gaussians (Supplementary Figure S2).

5.2. Synthetic RNA-seq Data

Next, we demonstrate the capabilities of the proposed methodology in capturing distributional changes of high-dimensional populations. For this, we created a synthetic example of a population of cells that transition through three consecutive states of differentiation. We used the dyngen simulator [44] to generate N = 50,000 samples of cells, each associated with 50 gene expression values at different timepoints y across the progression of differentiation. By design, the generated data are zero-inflated, which is a defining characteristic of the respective single-cell RNAseq technology. For this simulated example, we used a batch size of 512 samples in order to avoid missing distribution regions of low mass. In addition, the dimension of the random noise input to the generator was set to 18; however, we found that this parameter does not affect the quality of the results. As in the previous example, we initialize the parameters related to the location of the modes of the conditional ZIMM with k-means (over the training data) in order to avoid a potentially high number of training steps and/or the optimization process getting trapped in local minima. We note here that K = 20 zero-inflated modes are required for an accurate description of the 50-dimensional distribution (see Supplementary Figure S10).
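The k-means initialization of the mode locations mentioned above can be sketched as follows; the helper name, the use of cluster frequencies for the initial weights, and K = 20 are illustrative assumptions.

```python
# Initialize ZIMM/GMM mode locations (and, optionally, weights) with k-means over the training data.
import numpy as np
from sklearn.cluster import KMeans

def init_modes(x_train: np.ndarray, K: int = 20):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(x_train)
    mu_init = km.cluster_centers_                        # initial mean vectors mu_k
    counts = np.bincount(km.labels_, minlength=K)
    pi_init = counts / counts.sum()                      # initial mixture weights pi_k
    return mu_init, pi_init
```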
Figure 4 visualizes, after dimensionality reduction, the performance of the conditional FNN and the interpretable conditional ZIMM on the simulated data. Each row corresponds to a different sampling scheme. Specifically, the top row shows the generated populations when the training data are uniformly sampled across the trajectory delineated by the conditioning variable y. In contrast, the middle and bottom rows illustrate the performance of both variants when some interval of the conditioning variable is missing entirely. In the former case, this entails the suspension of a continuous subset of the conditioning variable, while, in the latter case, the suspension of a whole differentiation state, i.e., a discrete subpopulation.

Evidently, when all training information is available (Figure 4, top) or when a continuous subset is missing (Figure 4, middle), the conditional FNN manages to generate data that are closer to the real distribution. This is quantitatively verified by the MMD calculations reported in Table 1. Both generator models were able to interpolate between the missing parts of the original distribution. We note here that similar performance is observed if data from the interval y ∈ [0.6, 0.8] are missing instead (Supplementary Figure S11). Supplementary Figures S5–S8 display how well each variant captures the original data in the 50-dimensional space.

In contrast, when a whole differentiation state is missing from the training data, both methods failed to converge (data not shown). This is because, in this discrete case, the conditioning y does not carry any information that relates states 1 and 3, rendering them independent. For this reason, we repeated the experiment after adding a small percentage of samples from cells at state 2. Figure 4 (bottom) shows the results when 5% of these cells (or 736 in absolute numbers) have been added. As seen, although both variants manage to create data similar to those coming from the real distribution, they still cannot correctly represent the real distribution of state 2. For example, the variance of the fake distribution generated using the conditional FNN is high. This is probably because the number of samples added from the missing subpopulation is not sufficient. In fact, we found that the variance of the generated data is positively correlated with the number of samples from the missing distribution (see Supplementary Figure S12). On the other hand, in the case of the conditional ZIMM, the generated data do not form a single subpopulation that interpolates well between states 1 and 3. This is because, for each individual gene, the algorithm may not fit well the value regions of low probability density (the region between the zero-inflated and main modes of genes 32 and 34–36 in Supplementary Figure S8). As expected, the MMD distance values (Table 1) when part of the overall distribution is missing are larger than those obtained when training on the complete data set, though the conditional ZIMM is slightly preferable in this case.

Table 1. MMD computations using FNN and ZIMM for σ = 5 in the RBF kernel, for the experiments in Figure 4. The reference MMD, averaged over the test set and 50 randomly chosen equally-sized training sets, is ≈ 0.015 ± 0.005. A lower MMD value is preferable.

Generator | y ∈ [0, 1] | y ∈ [0.4, 0.6] | y ∈ State 2
cFNN      | 0.0748     | 0.2308         | 0.2430
cZIMM     | 0.0763     | 0.5090         | 0.1719

Figure 4. Synthetic 50-dimensional data set with zero-inflated modes. Visualization of the first two PCA components of real over generated data, colored in turquoise and orange, respectively. (upper row) Training data with uniformly chosen labels y ∈ [0, 1]. The conditional FNN, shown in the middle, is a better fit than the conditional ZIMM model on the right, at the cost of interpretability. Detailed histograms over all 50 genes (distribution marginals) can be found in Supplementary Figure S13. (middle row) Training data colored according to the pseudo-time label y (continuous) are shown on the left, where all five intervals have similar sample sizes. Training is performed on data with condition y ∈ [0, 0.4] ∪ [0.6, 1.0] (turquoise), whereas we generate data with y ∈ [0.4, 0.6] (orange). Interpolation to unseen data is more accurate when the conditional FNN generator is used, though the conditional ZIMM is capable of identifying a significant part of the test set. (lower row) Training data colored according to the differentiation state label y (discrete) are shown on the left. Conditional FNN (middle) and conditional ZIMM (right), using a subpopulation of 5% of samples from state 2. The latter is important because this discrete condition does not provide dynamical information for interpolation.
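For completeness, a plain NumPy stand-in for the MMD criterion used in Table 1 is sketched below. It computes the biased (V-statistic) squared MMD with an RBF kernel of bandwidth σ = 5; the exact kernel convention of the toolbox cited in the text may differ, so treat this only as an illustration.

```python
# Biased (V-statistic) squared MMD with an RBF kernel, used here as an evaluation sketch.
import numpy as np
from scipy.spatial.distance import cdist

def mmd2_rbf(x: np.ndarray, y: np.ndarray, sigma: float = 5.0) -> float:
    def k(a, b):
        # RBF kernel exp(-||a - b||^2 / (2 sigma^2)); one common bandwidth convention
        return np.exp(-cdist(a, b, 'sqeuclidean') / (2.0 * sigma ** 2))
    return float(k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean())
```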
5.3. Single Cell Data Augmentation

Finally, we test the efficacy of our training methodology in learning distributions in real biological settings, particularly when the number of available observations from a given cell population is low. This is a fundamental problem in the life sciences for which artificial data augmentation has recently been demonstrated to be a realistic solution [21]. In this direction, we used real measurements from [45] (data were accessed from https://community.cytobank.org/cytobank/experiments/46098/illustrations/121588, accessed on 15 April 2020). Specifically, we consider single-cell mass cytometry measurements on 16 bone marrow protein markers (d = 16) coming from healthy and disease-carrying individuals. In total, the data set consists of almost 200K healthy and disease-related cell samples. Before analysis, the data were transformed using the inverse hyperbolic sine (arcsinh) transformation with a cofactor of 5, which is typical in order to have comparable supports across dimensions.

To simulate the existence of a rare cell subpopulation in the data, we proceed as follows. First, we collect 26K random cell samples from the healthy population. Then, we consider one case where the number of cells from the rare population is 2% of that of the healthy one and another where it is only 1%. Our goal is to train the proposed conditional generator variants on distributions containing a rare cell population and then use them to generate new, realistic samples for data augmentation.

Figure 5 shows the effectiveness of each variant when the training data contain 2% or 1% of the disease-related subpopulation of samples. In general, we see that both the FNN (left) and the ZIMM (right) conditional variants are able to generate realistic samples. Data augmentation is feasible for as little as 1% of disease-related data in the training set (lower plots), though some regions in the latent space are not represented sufficiently well. This is verified by a small increase in the MMD distance between the generated and the original data sets. A choice of K = 10 in the ZIMM is sufficiently descriptive for a two-class model of the healthy and diseased 16-dimensional distributions of these data sets.

Figure 5. Two-dimensional PCA plots of the real 16-dimensional data set. Healthy (turquoise) and diseased (red) cell distributions, each consisting of N = 26,000 samples, are shown in each subplot as reference. Cross symbols indicate the generated cell distribution (orange). (top) Performance of cFNN (left) and cZIMM (right) when the training set comprises 2% (N = 520) diseased samples. (bottom) Performance of cFNN (left) and cZIMM (right) when the training set comprises 1% (N = 260) diseased samples. Detailed histograms are included in Supplementary Figure S13.
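The preprocessing and rare-subpopulation setup described above can be sketched as follows; the function and array names are placeholders for the CyTOF matrices (cells × 16 markers), and the seed is arbitrary.

```python
# Arcsinh transform (cofactor 5) and construction of a 2% (or 1%) rare-subpopulation training set.
import numpy as np

def prepare(healthy_raw: np.ndarray, diseased_raw: np.ndarray, frac: float = 0.02, seed: int = 0):
    rng = np.random.default_rng(seed)
    healthy = np.arcsinh(healthy_raw / 5.0)                            # cofactor-5 arcsinh transform
    diseased = np.arcsinh(diseased_raw / 5.0)
    h_idx = rng.choice(len(healthy), size=26_000, replace=False)
    d_idx = rng.choice(len(diseased), size=int(frac * 26_000), replace=False)
    x = np.concatenate([healthy[h_idx], diseased[d_idx]])
    y = np.concatenate([np.zeros(len(h_idx)), np.ones(len(d_idx))])    # class condition
    return x, y
```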
5.4. Comparison against the State-of-the-Art

We compare our approach against another conditional GAN approach tailored to single-cell data, called conditional single-cell GAN (cscGAN) [21]. Briefly, cscGAN takes single-cell RNAseq data as input and is able to augment under-represented classes by generating realistic synthetic samples. Similarly to the cFNN and cZIMM, it is based on the assumption that the training samples lie on the same low-dimensional manifold [46]. However, the cscGAN approach cannot handle continuous conditioning variables, while both cFNN and cZIMM can, via label embedding.

We trained cscGAN on the datasets of Sections 5.2 and 5.3. To achieve a fair comparison, all experiments were conducted on the basis of count data, because the cscGAN architecture follows a custom normalization procedure per cell (library size normalization). Recall that we worked with count data in the scRNAseq dataset (Section 5.2), while we applied an arcsinh transformation to the protein data in Section 5.3. Furthermore, the capacity of both the cscGAN generator and discriminator was reduced because of overfitting concerns. Essentially, we reduced the number of nodes in each of the three layers by a factor between 4 and 16 without affecting the quality of the generated samples in the training set. Obviously, the downsizing of the neural networks also resulted in much faster training. Let us also remark that the respective cFNN presented in Section 5.2 is still considerably smaller than the chosen cscGAN.

On the scRNAseq data, we found that cscGAN fails to generate realistic samples when the under-represented distribution interpolates between consecutive cell states (see Supplementary Figure S14). This result might stem from the fact that the cscGAN approach cannot efficiently handle correlated conditioning variables in cases of low sample size. On the other hand, cscGAN produced marginally better results on the mass cytometry data. Table 2 summarizes the MMD scores when the training data consist of two separate cell populations: one abundant and one that is arbitrarily small, i.e., 2% or 1% relative to the abundant one. As expected, cZIMM's MMD had the highest value. In contrast, the MMD scores of cFNN and cscGAN are on par for the 2% case, while cscGAN has a lower MMD for the 1% case. Accordingly, Figure 6 compares the results of our cFNN model (orange crosses) to the cscGAN approach (green crosses). As is clearly seen, both cscGAN and cFNN are capable of producing realistic samples from the under-represented training class. However, cFNN requires significantly less training time, mainly because of the smaller size of its generator.

Table 2. cFNN, cZIMM and cscGAN MMD scores for the cases of 2% and 1% rare population (RBF kernel, σ = 500). Notice that the value of σ is now two orders of magnitude larger than in Table 3 because the data are gene counts (i.e., unnormalized). The baseline MMD score, calculated over the test set and 50 randomly chosen equally-sized training sets, is ≈ 0.0086 ± 0.002. A lower MMD value is preferable.

Generator | 2% Subpopulation | 1% Subpopulation
cFNN      | 0.0302           | 0.0359
cZIMM     | 0.0351           | 0.0374
cscGAN    | 0.0296           | 0.0328

Table 3. cFNN and cZIMM MMD scores for the cases of 2% and 1% rare population (RBF kernel, σ = 5). The baseline MMD score, calculated over the test set and 50 randomly chosen equally-sized training sets, is ≈ 0.016 ± 0.004. A lower MMD value is preferable.

Generator | 2% Subpopulation | 1% Subpopulation
cFNN      | 0.0459           | 0.0819
cZIMM     | 0.0693           | 0.0830
Figure 6. Two-dimensional PCA plots of the real 16-dimensional count data. Healthy (turquoise) and diseased (red) cell distributions, each consisting of N = 26,000 samples, are shown in each subplot as reference. Cross symbols indicate the generated cell distributions. (top) Performance of cFNN when the training set comprises 2% (left) and 1% (right) diseased samples. (bottom) cscGAN-generated data for the same training data sets. Both models are equally capable of data augmentation of under-represented classes.

6. Conclusions

In this paper, we proposed a new approach for multivariate data interpolation and augmentation through conditional generative adversarial network (GAN) modeling. This likelihood-free approach is able to accept high-dimensional, zero-inflated data, which is a frequent characteristic of single-cell RNA-seq data. We provide two variants, FNN and ZIMM, for which there is a trade-off between model accuracy and interpretability. Both variants are able to generate realistic single-cell RNA-seq representations of high-dimensional data belonging either to distinct states (i.e., a discrete condition) or to continuous time (i.e., a continuous condition). In addition, we performed data augmentation in cases where as few as 1% of the samples from a state are provided.

Interpolation and augmentation tasks are particularly important when data contain disrupted or incomplete measurements, respectively. For the conditional ZIMM model, careful initialization and hyper-parameter selection based on the real data are crucial, whereas the conditional FNN performs well for a wide range of hyper-parameters and data dimensionalities.

Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app12115434/s1.

Author Contributions: Conceptualization, Y.P.; methodology, Y.P. and A.T.; software, A.T.; validation, A.T.; formal analysis, Y.P.; investigation, A.T.; resources, Y.P.; data curation, G.P.; writing—original draft preparation, A.T.; writing—review and editing, G.P. and Y.P.; visualization, A.T.; supervision, Y.P. and G.P.; project administration, Y.P.; funding acquisition, Y.P., G.P. and A.T. All authors have read and agreed to the published version of the manuscript.

Funding: This research is co-financed by Greece and the European Union (European Social Fund—ESF) through the Operational Programme "Human Resources Development, Education and Lifelong Learning 2014–2020" in the context of the project "Characterizing Population Dynamics with Applications in Biological Data" (MIS 5050686). Yannis Pantazis acknowledges partial support by the Hellenic Foundation for Research and Innovation (H.F.R.I.) under the "Second Call for H.F.R.I. Research Projects to support Faculty members and Researchers" (Project Number: 4753).
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: Code will be made available at https://github.com/tasoskrhs/conditional_ZIMM, upon decision of this work.

Conflicts of Interest: The authors declare no conflict of interest.

References
1. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.C.; Bengio, Y. Generative adversarial nets. In Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680.
2. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:cs.LG/1411.1784.
3. Denton, E.; Chintala, S.; Szlam, A.; Fergus, R. Deep Generative Image Models Using a Laplacian Pyramid of Adversarial Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 1, pp. 1486–1494.
4. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434.
5. Odena, A.; Olah, C.; Shlens, J. Conditional Image Synthesis with Auxiliary Classifier GANs. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 2642–2651.
6. Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
7. Brock, A.; Donahue, J.; Simonyan, K. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
8. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410.
9. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2242–2251. [CrossRef]
10. Ledig, C.; Theis, L.; Huszar, F.; Caballero, J.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; Shi, W. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
11. Pascual, S.; Bonafonte, A.; Serrà, J. SEGAN: Speech Enhancement Generative Adversarial Network. In Proceedings of the INTERSPEECH, Stockholm, Sweden, 20–24 August 2017; pp. 3642–3646.
12. Saito, Y.; Takamichi, S.; Saruwatari, H. Statistical parametric speech synthesis incorporating generative adversarial networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 84–96. [CrossRef]
13. Kumar, K.; Kumar, R.; de Boissiere, T.; Gestin, L.; Teoh, W.Z.; Sotelo, J.; de Brébisson, A.; Bengio, Y.; Courville, A.C. MELGAN: Generative adversarial networks for conditional waveform synthesis. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 14881–14892.
14. Che, T.; Li, Y.; Zhang, R.; Hjelm, R.D.; Li, W.; Song, Y.; Bengio, Y. Maximum-likelihood augmented discrete generative adversarial networks. arXiv 2017, arXiv:1702.07983.
15. Fedus, W.; Goodfellow, I.; Dai, A.M. MaskGAN: Better Text Generation via Filling in the _. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
16. Lan, L.; You, L.; Zhang, Z.; Fan, Z.; Zhao, W.; Zeng, N.; Chen, Y.; Zhou, X. Generative Adversarial Networks and Its Applications in Biomedical Informatics. Front. Public Health 2020, 8, 164. [CrossRef]
17. Saelens, W.; Cannoodt, R.; Todorov, H.; Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 2019, 37, 547–554. [CrossRef]
18. Buettner, F.; Natarajan, K.N.; Casale, F.P.; Proserpio, V.; Scialdone, A.; Theis, F.J.; Teichmann, S.A.; Marioni, J.C.; Stegle, O. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat. Biotechnol. 2015, 33, 155–160. [CrossRef]
19. Stegle, O.; Teichmann, S.A.; Marioni, J.C. Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet. 2015, 16, 133–145. [CrossRef]
20. Ghahramani, A.; Watt, F.M.; Luscombe, N.M. Generative adversarial networks simulate gene expression and predict perturbations in single cells. bioRxiv 2018. [CrossRef]
21. Marouf, M.; Machart, P.; Bansal, V.; Kilian, C.; Magruder, D.S.; Krebs, C.F.; Bonn, S. Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks. Nat. Commun. 2020, 11, 166. [CrossRef]
22. Grønbech, C.H.; Vording, M.F.; Timshel, P.N.; Sønderby, C.K.; Pers, T.H.; Winther, O. scVAE: Variational auto-encoders for single-cell gene expression data. Bioinformatics 2020, 36, 4415–4422. [CrossRef] [PubMed]
23. Xu, Y.; Zhang, Z.; You, L.; Liu, J.; Fan, Z.; Zhou, X. scIGANs: Single-cell RNA-seq imputation using generative adversarial networks. Nucleic Acids Res. 2020, 48, e85. [CrossRef] [PubMed]
24. Arisdakessian, C.; Poirion, O.; Yunits, B.; Zhu, X.; Garmire, L.X. DeepImpute: An accurate, fast, and scalable deep neural network method to impute single-cell RNA-seq data. Genome Biol. 2019, 20, 211. [CrossRef] [PubMed]
25. Nowozin, S.; Cseke, B.; Tomioka, R. f-GAN: Training Generative Neural Samplers Using Variational Divergence Minimization. In Proceedings of the 30th International Conference on Neural Information Processing Systems 29 (NIPS 2016), Barcelona, Spain, 5–10 December 2016; pp. 271–279.
26. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 214–223.
27. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A.C. Improved training of Wasserstein GANs. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5767–5777.
28. Birrell, J.; Dupuis, P.; Katsoulakis, M.A.; Rey-Bellet, L.; Wang, J. Variational Representations and Neural Network Estimation for Rényi Divergences. arXiv 2020, arXiv:2007.03814.
29. Miyato, T.; Kataoka, T.; Koyama, M.; Yoshida, Y. Spectral Normalization for Generative Adversarial Networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
30. Li, C.L.; Chang, W.C.; Cheng, Y.; Yang, Y.; Poczos, B. MMD GAN: Towards Deeper Understanding of Moment Matching Network. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
31. Birrell, J.; Dupuis, P.; Katsoulakis, M.A.; Pantazis, Y.; Rey-Bellet, L. (f,Γ)-Divergences: Interpolating between f-Divergences and Integral Probability Metrics. J. Mach. Learn. Res. 2022, 23, 1–70.
32. Bishop, C.M. Pattern Recognition and Machine Learning (Information Science and Statistics); Springer: New York, NY, USA, 2006.
33. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–38.
34. Moon, T. The expectation-maximization algorithm. IEEE Signal Process. Mag. 1996, 13, 47–60. [CrossRef]
35. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 14–16 April 2014.
36. Maddison, C.J.; Mnih, A.; Teh, Y.W. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In Proceedings of the 5th International Conference on Learning Representations, ICLR, Toulon, France, 24–26 April 2017.
37. Zong, B.; Song, Q.; Min, M.R.; Cheng, W.; Lumezanu, C.; Cho, D.; Chen, H. Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
38. Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y.N. Convolutional Sequence to Sequence Learning. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 1243–1252.
39. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
40. Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A Kernel Two-sample Test. J. Mach. Learn. Res. 2012, 13, 723–773.
41. Szabó, Z. Information Theoretical Estimators Toolbox. J. Mach. Learn. Res. 2014, 15, 283–287.
42. Minka, T. Divergence Measures and Message Passing; Technical Report MSR-TR-2005-173; Microsoft Research: New York, NY, USA, 2005.
43. Pantazis, Y.; Paul, D.; Fasoulakis, M.; Stylianou, Y.; Katsoulakis, M.A. Cumulant GAN. arXiv 2020, arXiv:2006.06625v2.
44. Cannoodt, R.; Saelens, W.; Deconinck, L.; Saeys, Y. Spearheading future omics analyses using dyngen, a multi-modal simulator of single cells. Nat. Commun. 2021, 12, 3942. [CrossRef] [PubMed]
45. Levine, J.H.; Simonds, E.F.; Bendall, S.C.; Davis, K.L.; Amir, E.A.D.; Tadmor, M.D.; Litvin, O.; Fienberg, H.G.; Jager, A.; Zunder, E.R.; et al. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis. Cell 2015, 162, 184–197. [CrossRef] [PubMed]
46. Lindenbaum, O.; Stanley, J.; Wolf, G.; Krishnaswamy, S. Geometry Based Data Generation. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Volume 31.
