Two Applications of Deep Learning in the Physical Layer of Communication Systems

Emil Björnson and Pontus Giselsson

Deep learning has proved itself to be a powerful tool to develop data-driven signal processing algorithms for challenging engineering problems. By learning the key features and characteristics of the input signals, instead of requiring a human to first identify and model them, learned algorithms can beat many man-made algorithms. In particular, deep neural networks are capable of learning the complicated features in nature-made signals, such as photos and audio recordings, and use them for classification and decision making. The situation is rather different in communication systems, where the information signals are man-made, the propagation channels are relatively easy to model, and we know how to operate close to the Shannon capacity limits. Does this mean that there is no role for deep learning in the development of future communication systems?

I. RELEVANCE

The answer to the question above is “no” but, for the aforementioned reasons, we need to be careful not to reinvent the wheel. We must identify the right problems to tackle with deep learning and, even then, not start from a blank sheet of paper. There are many signal processing problems in the physical layer of communication systems that we already know how to solve optimally, for example, using well-established estimation, detection, and optimization theory. Nonetheless, there are also important practical problems where we lack acceptable solutions, for example, due to a lack of appropriate models or algorithms.

In this lecture note, we first introduce the key properties of artificial neural networks and deep learning. The focus is not on technicalities around the training process or choice of network structure, but on what we can practically achieve, assuming the training is carried out successfully.
We will then describe three application categories in communication engineering, whereof one exposes some fundamental weaknesses of deep learning and two illustrate important advances that can be made by utilizing deep learning.

II. PREREQUISITES

This lecture note requires basic knowledge of linear algebra, digital communications, and probability.

E. Björnson is with Linköping University, Sweden. P. Giselsson is with Lund University, Sweden. This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. arXiv:2001.03350v2 [cs.IT] 2 Jan 2021.

[Fig. 1. (a) An arbitrary gray box taking $\mathbf{x}_0$ as input and giving $\mathbf{y}$ as output. (b) A fully-connected feed-forward network with four layers ($L = 3$; input layer, hidden layer 1, hidden layer 2, output layer) that fits into the box in (a). The gray-box input-output model in (a) is characterized by $\hat{f}$ and a parameter vector $\boldsymbol{\theta}$. It is called an artificial neural network if $\hat{f}$ has a particular structure, such as the one illustrated in (b).]

III. PROBLEM STATEMENT AND SOLUTION

We begin by briefly describing what artificial neural networks are and formulating the problem of using them as function approximators.

Consider a system that takes an $n_0$-length input vector $\mathbf{x}_0 \in \mathbb{R}^{n_0}$ and produces a $k$-length output vector $\mathbf{y} \in \mathbb{R}^{k}$, as illustrated in Fig. 1(a). The output is determined by the input via a deterministic function $\hat{f}$:

$$\mathbf{y} = \hat{f}(\mathbf{x}_0; \boldsymbol{\theta}). \tag{1}$$

The function is fixed but is characterized by an $m$-dimensional parameter vector $\boldsymbol{\theta} \in \mathbb{R}^{m}$. Many different input-output relations can be modeled in this way by changing the parameter vector $\boldsymbol{\theta}$, but they all share an underlying structure determined by the initial choice of $\hat{f}$. This is called a gray-box model. When the function $\hat{f}$ is selected to resemble the biological neural networks in human brains, the gray box is called an artificial neural network.
The input vector $\mathbf{x}_0$ is then viewed as the values in $n_0$ neurons from which the function $\hat{f}$ produces the values of $\mathbf{y}$ in $k$ other neurons.

There are many different examples of this. The classical one is a fully-connected feed-forward network, which is illustrated in Fig. 1(b). In this case, $\hat{f}$ is a composition of $L$ functions, $\hat{f}_1, \ldots, \hat{f}_L$, which describe the transitions from neurons in an input layer to neurons in an output layer via $L-1$ intermediate “hidden” layers. $L$ characterizes how deep the network is. The function $\hat{f}_l$ is determined by the parameters $\boldsymbol{\theta}_l = \{\mathbf{W}_l, \mathbf{b}_l\}$ and modeled as

$$\hat{f}_l(\mathbf{x}_{l-1}; \boldsymbol{\theta}_l) = \sigma_l(\mathbf{W}_l \mathbf{x}_{l-1} + \mathbf{b}_l), \tag{2}$$

where $\mathbf{W}_l \in \mathbb{R}^{n_l \times n_{l-1}}$ is called a weight matrix, $\mathbf{b}_l \in \mathbb{R}^{n_l}$ is called a bias vector, and $\sigma_l : \mathbb{R}^{n_l} \to \mathbb{R}^{n_l}$ is an element-wise non-linear function that is called an activation function. With inspiration from the structure of the human brain, the function $\hat{f}_l$ can be interpreted as taking the values $\mathbf{x}_{l-1}$ in the $n_{l-1}$ neurons of layer $l-1$, mixing the values together according to the affine transition relation $\mathbf{W}_l \mathbf{x}_{l-1} + \mathbf{b}_l$, and finally applying the activation function $\sigma_l$ to determine the values of the $n_l$ neurons of layer $l$.

If there are four layers as in Fig. 1(b), then $L = 3$ and the complete input-output relation is

$$\mathbf{y} = \hat{f}_3\big(\hat{f}_2\big(\hat{f}_1(\mathbf{x}_0; \boldsymbol{\theta}_1); \boldsymbol{\theta}_2\big); \boldsymbol{\theta}_3\big). \tag{3}$$

Hence, the composite function $\hat{f}$ is determined by the parameter vector $\boldsymbol{\theta}$ containing the $\sum_{l=1}^{3} n_l (n_{l-1} + 1)$ parameter values from $\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \boldsymbol{\theta}_3$ (i.e., the weights and biases from all layers).

A. Problem Statement

Artificial neural networks are generally used to approximate other functions, by selecting the parameter vector $\boldsymbol{\theta}$ to minimize some measure of the approximation error. In particular, the category of fully-connected feed-forward networks is capable of approximating any continuous function arbitrarily well by utilizing a (possibly) large but finite number of parameters (and neurons) [1].
This important result can be viewed as a generalization of Taylor polynomial approximations to functions with vector inputs and vector outputs. Two other categories are convolutional neural networks and recurrent neural networks [2]. Each category is believed to be better at approximating certain types of functions, in the sense of requiring fewer parameters to achieve a certain approximation error and/or it being easier to find appropriate parameter values in practice.

Selecting the right category of neural network is important but beyond the scope of this lecture note. Instead, our problem statement is: what are the important use cases where the function approximation capability can be utilized in the physical layer of communication systems, to achieve large improvements compared to conventional techniques?

B. Solution

To identify practically important use cases, we first need to understand how the function approximation is carried out. The parameter vector of an artificial neural network can be tuned/trained to approximate a (possibly unknown) function that we call $f$; that is, $\hat{f}$ should be trained to become a good estimate of $f$. This is preferably done by supervised learning using a set of $T$ training examples consisting of input vectors $\mathbf{x}_t^{\mathrm{train}}$ and the corresponding output vectors $\mathbf{y}_t^{\mathrm{train}} = f(\mathbf{x}_t^{\mathrm{train}})$ that we want the neural network to reproduce, for $t = 1, \ldots, T$. Let us represent these training examples as the columns of two matrices:

$$\mathbf{X}^{\mathrm{train}} = \begin{bmatrix} \mathbf{x}_1^{\mathrm{train}} & \cdots & \mathbf{x}_T^{\mathrm{train}} \end{bmatrix}, \qquad \mathbf{Y}^{\mathrm{train}} = \begin{bmatrix} \mathbf{y}_1^{\mathrm{train}} & \cdots & \mathbf{y}_T^{\mathrm{train}} \end{bmatrix}. \tag{4}$$

The inputs should ideally be selected independently at random from the distribution of inputs that appears when using $f$ in reality.
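As a minimal sketch of the layer recursion in (2), the composition in (3), and the column-wise data layout in (4), the following NumPy code evaluates a fully-connected feed-forward network. The layer sizes, the tanh activation, the random parameter values, and the function names are illustrative assumptions; the lecture note does not prescribe any of them.

```python
import numpy as np

def init_params(layer_sizes, rng):
    # One pair theta_l = (W_l, b_l) per layer, with W_l of shape (n_l, n_{l-1})
    # and b_l of shape (n_l,). Random values stand in for trained parameters.
    return [(rng.standard_normal((n_out, n_in)), rng.standard_normal(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(X, params, activation=np.tanh):
    # X holds one input vector per column, as in (4).
    A = X
    for W, b in params:
        # Affine mixing followed by the element-wise activation, as in (2).
        A = activation(W @ A + b[:, None])
    return A

layer_sizes = [2, 5, 5, 4]            # n_0, n_1, n_2, n_3 (assumed), so L = 3
rng = np.random.default_rng(0)
params = init_params(layer_sizes, rng)

X_train = rng.standard_normal((2, 10))   # ten example inputs as columns
Y_hat = forward(X_train, params)         # network outputs, one per column

# Number of parameters: sum over layers of n_l * (n_{l-1} + 1)
num_params = sum(W.size + b.size for W, b in params)
```

With these assumed layer sizes, the parameter count is 5·3 + 5·6 + 4·6 = 69, matching the sum stated after (3).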
The training basically consists of finding the parameter vector $\boldsymbol{\theta}^*$ that minimizes a loss function $\ell(\cdot)$ that measures the approximation mismatch:

$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \; \ell\big(\boldsymbol{\theta}; \mathbf{X}^{\mathrm{train}}, \mathbf{Y}^{\mathrm{train}}\big). \tag{5}$$

For example, the loss can be measured in the mean-squared sense as

$$\ell\big(\boldsymbol{\theta}; \mathbf{X}^{\mathrm{train}}, \mathbf{Y}^{\mathrm{train}}\big) = \sum_{t=1}^{T} \big\| \mathbf{y}_t^{\mathrm{train}} - \hat{f}(\mathbf{x}_t^{\mathrm{train}}; \boldsymbol{\theta}) \big\|^2. \tag{6}$$

The goal is that the trained neural network $\hat{f}(\mathbf{x}; \boldsymbol{\theta}^*)$ will provide approximately the right outputs not only for the training examples, but for any input signal $\mathbf{x}$ generated in the same way. This desired property is called generalization. Intuitively, if the unknown function $f$ is continuous and has limited variability, we should be able to approximate it well from a large training set. We can once again make a parallel to polynomial approximations; any scalar polynomial of order $T-1$ is uniquely determined by $T$ samples (training examples) of the inputs and outputs. If the polynomial order is unknown, or if the function is only approximately polynomial, we need a larger number of samples to ensure a good approximation.

Since the training in (5) is a complicated non-convex optimization problem, huge efforts have been dedicated to finding computationally and performance-wise acceptable suboptimal solutions. Moreover, the generalization to unseen inputs can be improved by various regularizations, hyper-parameter choices, and network designs [2]. These choices affect the model complexity. A simple model cannot capture complex dependencies. A too complex model explains the training data only (this is called overfitting). A correct complexity trade-off gives good generalization and is typically found using cross-validation. However, such empirical craftsmanship is not the focus of this lecture note, but we conclude:

1) Artificial neural networks can approximate any continuous function.

2) The supervised training requires a large training set with inputs/outputs to achieve a low approximation error.
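The polynomial parallel and the loss in (6) can be illustrated with a short sketch. The cubic coefficients and sample points below are arbitrary assumptions chosen for illustration: with $T = 4$ samples, a polynomial of order $T - 1 = 3$ is recovered exactly, so the squared-error loss on previously unseen inputs is numerically zero.

```python
import numpy as np

def squared_error_loss(Y_true, Y_pred):
    # Sum of squared errors over all examples, in the spirit of (6)
    return float(np.sum((Y_true - Y_pred) ** 2))

# Hypothetical "unknown" function: c0 + c1*x + c2*x^2 + c3*x^3
true_coeffs = np.array([0.5, -1.0, 2.0, 0.25])

T = 4
x_train = np.array([-1.0, 0.0, 1.0, 2.0])          # T distinct sample points
V_train = np.vander(x_train, T, increasing=True)    # columns: 1, x, x^2, x^3
y_train = V_train @ true_coeffs

# T samples determine the T coefficients uniquely (square Vandermonde system)
est_coeffs = np.linalg.solve(V_train, y_train)

# Check generalization on inputs that were not part of the training set
x_new = np.linspace(-2.0, 3.0, 7)
V_new = np.vander(x_new, T, increasing=True)
loss_new = squared_error_loss(V_new @ true_coeffs, V_new @ est_coeffs)
```

Neural-network training differs in that (5) is non-convex and has no closed-form solution, which is why far more than the minimum number of samples is needed in practice.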
There are many functionalities in communication systems that can be described by a mathematical function $f$ and, thus, can be approximated by a neural network. To identify the promising use cases, we will first explain the basic methodology and its weaknesses by giving a concrete example.

1) A Deep-Learning Solution to Signal Detection: The physical layer of a communication system determines how an information-bearing signal is sent from the transmitter to the receiver over a physical channel. A critical task is the signal detection, where the receiver tries to identify what information was sent. To describe some key properties of deep learning, we will exemplify how it can be used for signal detection.

We consider a classical additive white Gaussian noise (AWGN) channel, where a two-dimensional signal vector $\mathbf{s} \in \mathbb{R}^2$ is sent. The received signal $\mathbf{r} \in \mathbb{R}^2$ is given by

$$\mathbf{r} = \mathbf{s} + \mathbf{n}, \tag{7}$$

where $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$ is an independent Gaussian noise vector whose entries have variance $\sigma^2$. We assume two bits of information are encoded into $\mathbf{s}$ using a quadrature phase-shift keying (QPSK) constellation. Hence, there are four possible signal points that are equally spaced on the unit circle:

$$\mathbf{s} \in \left\{ \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}, \begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix}, \begin{bmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}, \begin{bmatrix} -1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix} \right\}. \tag{8}$$

The mapping between information bits and signals is illustrated in Fig. 2(a). Due to the additive noise, the received signal $\mathbf{r}$ can take any value, but the Gaussian distribution makes values close to one of the signal points in (8) more likely than values far away. This can be seen from the red dots in Fig. 2(a), which represent $\mathbf{r}$ for 10,000 noise realizations with $\sigma = 0.2$ that are added to each signal point. Based on the received signal $\mathbf{r}$, the receiver needs to guess (detect) what signal $\mathbf{s}$ was sent.
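The channel model in (7) and the constellation in (8) can be simulated in a few lines. This is a sketch under stated assumptions: the column ordering of the constellation points, the random seed, and the function name `simulate_awgn` are my own choices, and the integer labels 0 to 3 are not tied to the bit mapping of Fig. 2(a).

```python
import numpy as np

# The four QPSK points in (8), one per column (column order is an assumption)
CONSTELLATION = np.array([[1.0,  1.0, -1.0, -1.0],
                          [1.0, -1.0,  1.0, -1.0]]) / np.sqrt(2)

def simulate_awgn(num_per_point, sigma, rng):
    # Index of the transmitted point for every channel use
    idx = np.repeat(np.arange(4), num_per_point)
    s = CONSTELLATION[:, idx]                      # transmitted signals, shape (2, N)
    n = sigma * rng.standard_normal(s.shape)       # noise with variance sigma^2 per entry
    return s + n, idx                              # received signals r = s + n, as in (7)

rng = np.random.default_rng(1)
r, idx = simulate_awgn(num_per_point=10_000, sigma=0.2, rng=rng)

# One-hot desired outputs: a four-dimensional vector that is one for the sent signal
Y_onehot = np.eye(4)[:, idx]
```

The arrays `r` and `Y_onehot` have 40,000 columns, which corresponds to the number of training examples used in the detection example.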
We have trained a neural network for this task, by taking the received signal $\mathbf{x} = \mathbf{r}$ as input and letting the output $\mathbf{y}$ be a four-dimensional vector that is one for the detected signal and has zeroes elsewhere. We used the 40,000 red dots in Fig. 2(a), and the signals $\mathbf{s}$ that generated these $\mathbf{r}$, to train a fully-connected neural network using standard training methods. We then applied the neural network to a wide range of possible received signals to illustrate how it is making its detection. The colored areas in Fig. 2(b) show in which regions the received signals are mapped to the respective information signals. The regions are separated by lines, which is expected since each layer performs linear algebra operations; in particular, each activation function determines if the input is below/above a line that has been selected by training. Note that we have “zoomed out” and the range of values that was shown in Fig. 2(a) is indicated by the black square.

The colored detection regions produced by the neural network have peculiar asymmetric shapes, which are not optimal. In fact, the optimal detection regions for AWGN channels are well known [3, Ch. 6]: the received signal should be mapped to the closest signal point in terms of Euclidean distance. The optimal detection regions are shown in Fig. 2(c). The regions are quite similar within the black square, but deviate greatly further away.

[Fig. 2. (a) Quadrature phase-shift keying for information encoding and the corresponding received signals; both axes span -1.5 to 1.5, with the bit labels 01 and 11 in the upper half and 00 and 10 in the lower half. (b) Detection regions with a trained neural network. (c) Optimal detection regions using detection theory. We send QPSK signals over an AWGN channel, as shown in (a), and try to detect the signals at the receiver. The detection regions produced by a trained neural network are shown in (b) and the optimal regions obtained from detection theory are shown in (c).]

Several important observations can be made from this example:

1) If there is a known optimal algorithm, a trained neural network cannot outperform it. The detection error probability is, however, almost the same in this particular example since most received signals appear within the black square where the neural network has a decent behavior.

2) There are two reasons why the detection regions in Fig. 2(b) are wrongly shaped. Firstly, the shape inside the black square (around the signal points) is wrong due to overfitting; the training examples in Fig. 2(a) can be approximately separated by many piecewise linear boundaries, including the ones shown in Fig. 2(b). Secondly, since all training examples are inside the square, the behavior outside the square is somewhat random; the neural network has learned to interpolate between training examples but not to extrapolate outside the square. This is a practical issue since received signals far outside the square occasionally appear due to the long-tailed Gaussian distribution. This is a general phenomenon; neural networks are good at handling typical inputs but may generalize poorly to atypical inputs.

3) We could have used prior domain knowledge (from digital communications) to preprocess the input signals. In this example, the neural network had to rediscover where the constellation points are, how the noise is distributed, and how to make the right detection. If we instead computed the Euclidean distances between the received signal and each of the four signal constellation points, we could use those distances as input to a neural network. This will give more accurate and reliable results since we have utilized our domain knowledge to ensure that the neural network has fewer characteristics to learn. However, it still cannot beat the optimal detection.
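For comparison with the learned detector, the optimal minimum-distance rule and the distance-based preprocessing mentioned in observation 3 can be sketched as follows. The function names, the constellation column order, and the two far-away test inputs are illustrative assumptions.

```python
import numpy as np

# QPSK points from (8), one per column (column order is an assumption)
CONSTELLATION = np.array([[1.0,  1.0, -1.0, -1.0],
                          [1.0, -1.0,  1.0, -1.0]]) / np.sqrt(2)

def distance_features(r):
    # r has shape (2, N); the result has shape (4, N) and holds the Euclidean
    # distance from every received signal to every constellation point.
    diff = r[:, None, :] - CONSTELLATION[:, :, None]
    return np.linalg.norm(diff, axis=0)

def min_distance_detect(r):
    # Optimal AWGN detection: pick the closest constellation point.
    return np.argmin(distance_features(r), axis=0)

# Error rate on simulated data with sigma = 0.2
rng = np.random.default_rng(1)
idx = np.repeat(np.arange(4), 10_000)
r = CONSTELLATION[:, idx] + 0.2 * rng.standard_normal((2, idx.size))
error_rate = np.mean(min_distance_detect(r) != idx)

# Atypical inputs far outside the training region are still handled sensibly
far_away = np.array([[10.0, -5.0],
                     [ 0.1, -7.0]])
far_decisions = min_distance_detect(far_away)
```

The output of `distance_features` is the kind of preprocessed input that observation 3 suggests feeding to a neural network; taking the `argmin` directly already gives the optimal detector, which is why learning adds nothing in this particular setting.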
2) Is There a Role of Deep Learning in Communications?: Since signal detection in AWGN channels is easy to perform optimally, it makes little sense to utilize artificial neural networks for that purpose. There are many similar tasks in communications where deep learning cannot make any meaningful improvements. For example, the fundamental performance limits were derived by Shannon [4] and we can operate close to those limits using modern channel codes. Moreover, it is known how to perform optimal channel estimation, multi-user multiple-input multiple-output (MIMO) processing, and transmit power allocation in many wireless communication scenarios [5]. The fact that the information signals are man-made gives us strong prior information that makes it easier to devise effective man-made algorithms than in many other fields, where the signals are created by nature.

There are nevertheless some important roles that deep learning can play in communications. Firstly, there are many problems where a known algorithm finds the optimum but has prohibitively high complexity for real-time implementation. Secondly, there are cases where the standard system models are inadequate or incomplete. It is sufficient to replace the noise distribution in the previous example with an unknown one to find a case where learning can help. We will elaborate on these two applications in the remainder of this lecture note.

But before that, we stress that errors are unavoidable in the physical layer of communication systems and are conventionally dealt with using retransmissions. This built-in fault tolerance is positive when it comes to the utilization of deep learning. It gives robustness to the strange behaviors that occasionally occur when an atypical signal is fed into a neural network that has been trained to work well for typical input signals. However, adversaries can also exploit atypical signals to perform jamming more efficiently [6].

IV. APPLICATION 1: ALGORITHMIC APPROXIMATION

The first important application of deep learning in communications is to approximate a known but computationally complicated algorithm. There are many examples of iterative algorithms that asymptotically find a global (or local) optimum to an optimization problem, but require very many iterations for convergence and/or complicated operations in each iteration [7]. Such algorithms might not be practically useful in communication systems where latency constraints require execution times below a millisecond.

[Fig. 3. (a) Offline training phase: the known algorithm $\mathbf{y} = f(\mathbf{x})$ provides the desired outputs that are used to train the neural network $\hat{\mathbf{y}} = \hat{f}(\mathbf{x}; \boldsymbol{\theta})$. (b) Real-time usage: the trained neural network computes $\hat{\mathbf{y}} = \hat{f}(\mathbf{x}; \boldsymbol{\theta}^*)$. A known algorithm $f$ can be approximated by training a neural network $\hat{f}$ to make $f(\mathbf{x}) \approx \hat{f}(\mathbf{x}; \boldsymbol{\theta})$ for all possible inputs, as shown in (a). The training procedure will iteratively update $\boldsymbol{\theta}$ to gradually reduce the approximation errors until it converges to some $\boldsymbol{\theta}^*$. If the neural network is designed to have sufficiently low complexity, then the trained neural network in (b) can be used in real-time applications.]

The general procedure for training a neural network for algorithmic approximation is illustrated in Fig. 3. Suppose we have a known algorithm, represented by the function $\mathbf{y} = f(\mathbf{x})$, which cannot be implemented in real time. To address this problem using deep learning, we can first create a training set containing a large number $T$ of input signals $\mathbf{x}_t^{\mathrm{train}}$, for $t = 1, \ldots, T$. We then run the algorithm $T$ times to compute the outputs

$$\mathbf{y}_t^{\mathrm{train}} = f(\mathbf{x}_t^{\mathrm{train}}). \tag{9}$$

After having generated the training set, we can train an artificial neural network to provide approximately the same outputs for these inputs. More precisely, we should find an optimized parameter vector $\boldsymbol{\theta}^*$ in accordance with (5).
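The training-set generation in (9) can be sketched as follows. Everything specific here is a hypothetical stand-in: the matrix `H`, the toy iterative solver `slow_algorithm` (gradient descent on a least-squares problem, playing the role of the known algorithm $f$), and the Gaussian input distribution are assumptions made only to obtain a runnable example.

```python
import numpy as np

# Hypothetical fixed problem data; a real system would use its own model
H = np.array([[2.0, 0.5],
              [0.5, 1.0],
              [0.0, 1.5]])

def slow_algorithm(x, num_iters=200):
    # Stand-in for the known algorithm f: gradient descent on 0.5 * ||H y - x||^2,
    # which converges to the least-squares solution after many iterations.
    step = 1.0 / np.linalg.norm(H, 2) ** 2     # 1 / (largest singular value)^2
    y = np.zeros(H.shape[1])
    for _ in range(num_iters):
        y = y - step * H.T @ (H @ y - x)
    return y

def build_training_set(f, sample_input, T, rng):
    # Draw T inputs and run the algorithm T times, as in (9).
    X = np.column_stack([sample_input(rng) for _ in range(T)])
    Y = np.column_stack([f(X[:, t]) for t in range(T)])
    return X, Y

rng = np.random.default_rng(2)
X_train, Y_train = build_training_set(
    slow_algorithm, lambda g: g.standard_normal(3), T=50, rng=rng)
```

The pair `(X_train, Y_train)` has the column layout of (4) and could be passed to any supervised training routine that solves (5); the training step itself is omitted here.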
If the training is performed well, the neural network will generalize well (i.e., provide good outputs) to previously unseen input signals that were generated in the same way as the inputs used for training. Simply speaking, this means that $f(\mathbf{x}) \approx \hat{f}(\mathbf{x}; \boldsymbol{\theta}^*)$ for all inputs $\mathbf{x}$ of practical interest.

There are many optimization problems to be solved in communication systems. For example, at the transmitter, power allocation between concurrent transmissions is important to limit interference [5], [7]. At the receiver, non-linear signal detection problems must be solved to deal with interference in MIMO systems [8]. Some of these problems are convex and can be solved by off-the-shelf optimization software. Other problems are non-convex but there exist iterative algorithms that converge to local or global optima. In both cases, the computational complexity is often prohibitive for real-time applications, where similar optimization problems with different input data are solved repeatedly. A neural network can then be trained to learn approximately how the solution depends on the input data. This approximate input-output map can be evaluated with substantially lower computational cost, as exemplified in [7], [8]. Domain knowledge can be utilized to pre-process the input data, to focus the learning on the problem that the algorithm is solving and not on rediscovering known properties (e.g., that the desired signal lies in a certain subspace).

There are two main approaches. One can learn the input-output mapping based on training data, as described above, while ignoring how it was produced. Alternatively, the shape of the neural network can be selected so that each layer mimics one iteration of a known algorithm that converges asymptotically to an optimum. This is called deep unfolding and exploits that many first-order iterative optimization methods have the same structure as a (recurrent) neural network [9].
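The structure behind deep unfolding can be illustrated with a deliberately simple sketch, which is not the network of [8]: each "layer" performs one gradient step on a least-squares cost, and the per-layer step sizes play the role of trainable parameters. The matrix `H`, the input `x`, and the function name are assumptions. To keep the example short, the step sizes are hand-picked from the eigenvalues of $\mathbf{H}^T \mathbf{H}$ instead of being trained; this already shows how layer-specific parameters can reach the optimum in far fewer iterations than a fixed step size.

```python
import numpy as np

def unfolded_gradient_network(x, H, step_sizes):
    # One layer per entry of step_sizes; layer k computes
    #   y <- y - mu_k * H^T (H y - x),
    # i.e., a gradient step on 0.5 * ||H y - x||^2 with its own step size mu_k.
    y = np.zeros(H.shape[1])
    for mu in step_sizes:
        y = y - mu * H.T @ (H @ y - x)
    return y

H = np.array([[2.0, 0.5],
              [0.5, 1.0],
              [0.0, 1.5]])                 # hypothetical problem data
x = np.array([1.0, -2.0, 0.5])             # hypothetical input
y_opt = np.linalg.lstsq(H, x, rcond=None)[0]

eigs = np.linalg.eigvalsh(H.T @ H)         # eigenvalues in ascending order

# Two layers with a conventional fixed step size 1 / lambda_max
y_fixed = unfolded_gradient_network(x, H, [1.0 / eigs[-1]] * 2)

# Two layers with layer-specific step sizes 1 / lambda_i (hand-picked, not trained)
y_tuned = unfolded_gradient_network(x, H, 1.0 / eigs)

err_fixed = np.linalg.norm(y_fixed - y_opt)
err_tuned = np.linalg.norm(y_tuned - y_opt)
```

For this quadratic toy problem, the two tuned layers reach the least-squares solution up to rounding errors, while two fixed-step layers do not. In a practical unfolded network, such parameters would be found by training as in (5) rather than computed in closed form.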
In deep unfolding, the parameters of the neural network are trained to give a nearly optimum solution after a predefined number of iterations, thereby speeding up the convergence. In [8], the authors “unfold” a gradient-descent-like algorithm for MIMO detection to create a neural network where each layer performs similar operations but with optimized parameters. When using this approach, the outputs in (9) need not be computed in advance, which simplifies the training.

The practical benefit of this application is the complexity reduction it can provide; the neural network will essentially learn how to make algorithmic shortcuts to strike a good balance between accuracy and computational complexity. Another important benefit is related to hardware implementation. To solve a practical problem with real-time constraints, we conventionally would first need to design an algorithm and then develop a dedicated circuit based on it, which can be very time-consuming. With the help of deep learning, we can instead predesign a general-purpose circuit that implements a neural network of a given maximum size (i.e., number of layers and neurons) with a predetermined run time. We can then train a neural network to perform the algorithmic task we need and, finally, load the corresponding trained parameters (i.e., weights and biases) onto the circuit. This new approach to hardware implementation can greatly reduce the time from when the algorithmic design begins until a product can hit the market.

A main issue with this application is the highly computationally demanding generation of desired outputs: the more complex the algorithm $f$ is, the longer it takes to compute $f(\mathbf{x}_t^{\mathrm{train}})$ for $t = 1, \ldots, T$. We are basically moving the complexity issue from the algorithmic run time to the design process. There is a practical limit to which algorithms we can approximate in this way.
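The design-time cost grows linearly with the number of training examples and with the run time of the algorithm. A back-of-the-envelope helper makes the scaling explicit; the function name, the 365-day year, and the assumption of perfectly parallel workers are my own simplifications.

```python
def training_set_generation_years(num_examples, hours_per_example, parallel_workers=1):
    # Total wall-clock time in years, assuming the examples are generated
    # independently and split evenly over the workers.
    total_hours = num_examples * hours_per_example / parallel_workers
    return total_hours / (24 * 365)

# Sequential generation of 100,000 examples at 1 hour each
years_sequential = training_set_generation_years(100_000, 1.0)

# The same workload spread over 1,000 hypothetical parallel workers, in days
days_parallel = training_set_generation_years(100_000, 1.0, parallel_workers=1000) * 365
```

Here `years_sequential` is 100,000 / 8,760 ≈ 11.4 years, while 1,000 parallel workers would bring the wall-clock time down to roughly four days.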
If it takes 1 hour to generate one training example, generating 100,000 examples takes 11.4 years of sequential computation or requires extreme parallelism.

V. APPLICATION 2: INVERSION OF AN UNKNOWN FUNCTION

The second important application is to invert an unknown function. In particular, non-linear distortion can occur between the transmitter and receiver. Three prominent examples are finite-resolution quantization in the receiver hardware, non-linear amplifiers in the transmitter hardware [10], and non-linear fiber-optical channels [11]. While quantizers typically are designed with known properties, the latter two examples can be represented by an unknown function $g$ that takes a signal $\mathbf{y}$ as input and produces a distorted output $\mathbf{x} = g(\mathbf{y})$. The conventional way to undo the distortion is to identify an appropriate parameterized model of the function, then estimate the parameters from measurements, and finally create an inverse function based on the estimates. This three-step approach is suboptimal and prone to error propagation.

[Fig. 4. (a) Training phase: the signal $\mathbf{y}$ passes through the unknown function $g$ to give $\mathbf{x} = g(\mathbf{y})$, which is fed to the neural network; its output $\hat{\mathbf{y}} = \hat{f}(\mathbf{x}; \boldsymbol{\theta})$ is compared with $\mathbf{y}$ during training. (b) Usage: the unknown function $g$ is followed by the trained neural network, which outputs $\hat{\mathbf{y}} = \hat{f}(\mathbf{x}; \boldsymbol{\theta}^*)$. An unknown function $g$ with input $\mathbf{y}$ is inverted using a neural network $\hat{f}$ by training it to achieve $\hat{f}(g(\mathbf{y}); \boldsymbol{\theta}) \approx \mathbf{y}$, as shown in (a). The training procedure will iteratively update $\boldsymbol{\theta}$ to gradually reduce the approximation errors until it converges to some $\boldsymbol{\theta}^*$. The trained neural network in (b) can be used to counteract the unknown function, without having to explicitly model it and estimate model parameters.]

An alternative is to train a neural network to directly invert the function, without explicit modeling or parameter estimation. Whenever only suboptimal conventional algorithms exist, a learned algorithm can theoretically provide better performance and robustness, but only if the training is carried out successfully.
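The idea in Fig. 4 can be sketched on a scalar toy problem. Several simplifying assumptions are made here and are not taken from the lecture note: the unknown distortion is replaced by a hypothetical `tanh` saturation, and, instead of full gradient-based training, the network has one hidden layer with fixed random weights whose output layer is fitted by least squares. The learned inverse only observes input-output pairs and never uses the form of `g`.

```python
import numpy as np

def g(y):
    # Hypothetical stand-in for the unknown distortion (e.g., amplifier saturation)
    return np.tanh(y)

rng = np.random.default_rng(3)
num_hidden = 50
W1 = 2.0 * rng.standard_normal(num_hidden)   # fixed random hidden-layer weights
b1 = rng.uniform(-1.0, 1.0, num_hidden)      # fixed random hidden-layer biases

def hidden(x):
    # Hidden-layer activations for scalar inputs x of shape (N,),
    # plus a constant column that acts as the output-layer bias.
    A = np.tanh(np.outer(x, W1) + b1)
    return np.column_stack([A, np.ones_like(x)])

# Training phase: send known signals y through g and record the distorted x
y_train = rng.uniform(-1.0, 1.0, 2000)
x_train = g(y_train)
w_out = np.linalg.lstsq(hidden(x_train), y_train, rcond=None)[0]

def f_hat(x):
    # Learned approximate inverse: f_hat(g(y)) should be close to y
    return hidden(x) @ w_out

# Usage phase: new signals generated in the same way as the training signals
y_test = rng.uniform(-1.0, 1.0, 500)
x_test = g(y_test)
rmse_inverted = np.sqrt(np.mean((f_hat(x_test) - y_test) ** 2))
rmse_raw = np.sqrt(np.mean((x_test - y_test) ** 2))   # error without any inversion
```

Comparing `rmse_inverted` with `rmse_raw` indicates how much of the distortion the learned inverse removes on signals drawn from the training distribution; nothing is guaranteed for inputs outside that range.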
The general procedure for training a neural network for function inversion is illustrated in Fig. 4. We need to generate a large number $T$ of possible communication signals $\mathbf{y}_t^{\mathrm{train}}$ and send them through the unknown function to measure

$$\mathbf{x}_t^{\mathrm{train}} = g(\mathbf{y}_t^{\mathrm{train}}) \quad \text{for } t = 1, \ldots, T. \tag{10}$$

It is then $\mathbf{x}_t^{\mathrm{train}}$ that is used as input to the neural network, while $\mathbf{y}_t^{\mathrm{train}}$ is the desired output.

Different from Application 1, the creation of a training set can be very computationally efficient in Application 2 because the outputs are man-made. It is typically created to be statistically equivalent to the signals observed at run time, but one can also create a biased training set to emphasize typical or atypical examples. Online learning when operating the communication system is possible by occasionally sending predefined reference signals to generate new training data. This is useful when the function $g$ is time-varying (e.g., due to temperature variations in the hardware).

The key to successful utilization of deep learning is to identify tasks in communication systems that currently lack an optimal solution; there is then an opportunity to beat the state-of-the-art. For example, a common way to deal with non-linear communication hardware is to apply the Bussgang decomposition [12] to write the output of the non-linear function $g$ as $g(\mathbf{y}) = \mathbf{D}\mathbf{y} + \mathbf{n}$, where $\mathbf{D}$ is a deterministic matrix and $\mathbf{n}$ is distortion noise that is uncorrelated with $\mathbf{y}$ but statistically dependent on it. By pretending that $\mathbf{n}$ is independent noise, one can often develop communication algorithms (e.g., for channel estimation or data detection) that partially mitigate distortion, but such algorithms are suboptimal since the distortion is in fact dependent on the input. As shown in [10], one can achieve substantially better performance by training neural networks instead.

VI. WHAT WE HAVE LEARNED

Although many problems in communication systems can be solved optimally, there are important cases where deep learning can give large improvements. In particular, it can be used to reduce computational complexity of known algorithms or to deal with non-linear hardware or channels in an efficient way.

VII. ACKNOWLEDGMENT

This work was supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

VIII. AUTHORS

Emil Björnson (emil.bjornson@liu.se) received the MSc degree in engineering mathematics from Lund University, Sweden, in 2007, and the PhD degree in telecommunications from the KTH Royal Institute of Technology, Sweden, in 2011. He is now an associate professor at Linköping University, Sweden. He has authored the textbooks Optimal Resource Allocation in Coordinated Multi-Cell Systems (2013) and Massive MIMO Networks: Spectral, Energy, and Hardware Efficiency (2017). He received the 2018 IEEE Marconi Prize Paper Award in Wireless Communications, the 2019 EURASIP Early Career Award, the 2019 IEEE Communications Society Fred W. Ellersick Prize, and the 2019 IEEE Signal Processing Magazine Best Column Award.

Pontus Giselsson (pontus.giselsson@control.lth.se) is an Associate Professor at the Department of Automatic Control at Lund University, Sweden. His current research interests include mathematical optimization and its wide range of applications, e.g., in machine learning, control, signal processing, and wireless communication. He received an MSc degree from Lund University in 2006 and a PhD degree from Lund University in 2012. During 2013 and 2014, he held a postdoc position at Stanford University. In 2012, he received the Young Author Prize at the ADCHEM IFAC Symposium; in 2014, he received the Young Author Prize at the IFAC World Congress; and in 2015, he received the Ingvar Carlsson Award from the Swedish Foundation for Strategic Research.

REFERENCES

[1] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, Dec. 1989.
[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[3] U. Madhow, Introduction to Communication Systems. Cambridge University Press, 2014.
[4] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, 1948.
[5] E. Björnson, J. Hoydis, and L. Sanguinetti, “Massive MIMO networks: Spectral, energy, and hardware efficiency,” Foundations and Trends® in Signal Processing, vol. 11, no. 3-4, pp. 154–655, 2017.
[6] M. Sadeghi and E. G. Larsson, “Adversarial attacks on deep-learning based radio signal classification,” IEEE Wireless Communications Letters, vol. 8, no. 1, pp. 213–216, Feb. 2019.
[7] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for interference management,” IEEE Transactions on Signal Processing, vol. 66, no. 20, pp. 5438–5453, Oct. 2018.
[8] N. Samuel, T. Diskin, and A. Wiesel, “Deep MIMO detection,” in IEEE SPAWC, Jul. 2017.
[9] J. R. Hershey, J. Le Roux, and F. Weninger, “Deep unfolding: Model-based inspiration of novel deep architectures,” arXiv preprint, vol. abs/1409.2574, 2014. [Online]. Available: http://arxiv.org/abs/1409.2574
[10] Ö. T. Demir and E. Björnson, “Channel estimation in massive MIMO under hardware non-linearities: Bayesian methods versus deep learning,” IEEE Open Journal of the Communications Society, vol. 1, no. 1, pp. 109–124, 2020.
[11] A. D. Ellis, J. Zhao, and D. Cotter, “Approaching the non-linear Shannon limit,” Journal of Lightwave Technology, vol. 28, no. 4, pp. 423–433, Feb. 2010.
[12] J. J. Bussgang, “Crosscorrelation functions of amplitude-distorted Gaussian signals,” RLE, MIT, Tech. Rep. 216, 1952.

Computing Research Repository, arXiv (Cornell University)


Computing Research Repository , Volume 2021 (2001) – Jan 10, 2020

ISSN
1053-5888
eISSN
ARCH-3344
DOI
10.1109/MSP.2020.2996545

Abstract

Two Applications of Deep Learning in the Physical Layer of Communication Systems Emil Björnson and Pontus Giselsson Deep learning has proved itself to be a powerful tool to develop data-driven signal processing algorithms for challenging engineering problems. By learning the key features and characteristics of the input signals, instead of requiring a human to first identify and model them, learned algorithms can beat many man- made algorithms. In particular, deep neural networks are capable of learning the complicated features in nature-made signals, such as photos and audio recordings, and use them for classification and decision making. The situation is rather different in communication systems, where the information signals are man- made, the propagation channels are relatively easy to model, and we know how to operate close to the Shannon capacity limits. Does this mean that there is no role for deep learning in the development of future communication systems? I. R ELEVANCE The answer to the question above is “no” but for the aforementioned reasons, we need to be careful not to reinvent the wheel. We must identify the right problems to tackle with deep learning and, even then, not start from a blank sheet of paper. There are many signal processing problems in the physical layer of communication systems that we already know how to solve optimally, for example, using well-established estimation, detection, and optimization theory. Nonetheless, there are also important practical problems where we lack acceptable solutions, for example, due to a lack of appropriate models or algorithms. In this lecture note, we first introduce the key properties of artificial neural networks and deep learning. The focus is not on technicalities around the training process or choice of network structure, but on what we can practically achieve, assuming the training is carried out successfully. 
We will then describe three application categories in communication engineering, whereof one exposes some fundamental weaknesses of deep learning and two illustrate important advances that can be made by utilizing deep learning.

II. PREREQUISITES

This lecture note requires basic knowledge of linear algebra, digital communications, and probability.

E. Björnson is with Linköping University, Sweden. P. Giselsson is with Lund University, Sweden. This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

arXiv:2001.03350v2 [cs.IT] 2 Jan 2021

Fig. 1. (a) An arbitrary gray box taking $x_0$ as input and giving $y$ as output. (b) A fully-connected feed-forward network with four layers ($L = 3$) that fits into the box in (a): input layer, hidden layer 1, hidden layer 2, and output layer. The gray-box input-output model in (a) is characterized by $f$ and a parameter vector $\theta$. It is called an artificial neural network if $f$ has a particular structure, such as the one illustrated in (b).

III. PROBLEM STATEMENT AND SOLUTION

We begin by briefly describing what artificial neural networks are and formulating the problem of using them as function approximators. Consider a system that takes an $n_0$-length input vector $x_0 \in \mathbb{R}^{n_0}$ and produces a $k$-length output vector $y \in \mathbb{R}^{k}$, as illustrated in Fig. 1(a). The output is determined by the input via a deterministic function $f$:

$$ y = f(x_0, \theta). \tag{1} $$

The function is fixed but is characterized by an $m$-dimensional parameter vector $\theta \in \mathbb{R}^{m}$. Many different input-output relations can be modeled in this way by changing the parameter vector $\theta$, but they all share an underlying structure determined by the initial choice of $f$. This is called a gray-box model. When the function $f$ is selected to resemble the biological neural networks in human brains, the gray box is called an artificial neural network.
The input vector $x_0$ is then viewed as the values in $n_0$ neurons from which the function $f$ produces the values of $y$ in $k$ other neurons. There are many different examples of this. The classical one is a fully-connected feed-forward network, which is illustrated in Fig. 1(b). In this case, $f$ is a composition of $L$ functions, $\hat{f}_1, \ldots, \hat{f}_L$, which describes transitions between neurons in an input layer to neurons in an output layer via $L-1$ intermediate “hidden” layers. $L$ characterizes how deep the network is. The function $\hat{f}_l$ is determined by the parameters $\theta_l = \{W_l, b_l\}$ and modeled as

$$ \hat{f}_l(x_{l-1}, \theta_l) = \sigma_l(W_l x_{l-1} + b_l), \tag{2} $$

where $W_l \in \mathbb{R}^{n_l \times n_{l-1}}$ is called a weight matrix, $b_l \in \mathbb{R}^{n_l}$ is called a bias vector, and $\sigma_l : \mathbb{R}^{n_l} \to \mathbb{R}^{n_l}$ is an element-wise non-linear function that is called an activation function. With inspiration from the structure of the human brain, the function $\hat{f}_l$ can be interpreted as taking the values $x_{l-1}$ in the $n_{l-1}$ neurons of layer $l-1$, mixing the values together according to the affine transition relation $W_l x_{l-1} + b_l$, and finally applying the activation function $\sigma_l$ to determine the values of the $n_l$ neurons of layer $l$.

If there are four layers as in Fig. 1(b), then $L = 3$ and the complete input-output relation is

$$ y = \hat{f}_3\big(\hat{f}_2\big(\hat{f}_1(x_0, \theta_1), \theta_2\big), \theta_3\big). \tag{3} $$

Hence, the composite function $f$ is determined by the parameter vector $\theta$ containing the $\sum_{l=1}^{3} n_l (n_{l-1} + 1)$ parameter values from $\theta_1, \theta_2, \theta_3$ (i.e., the weights and biases from all layers).

A. Problem Statement

Artificial neural networks are generally used to approximate other functions, by selecting the parameter vector $\theta$ to somehow minimize the approximation error. In particular, the category of fully-connected feed-forward networks is capable of approximating any continuous function arbitrarily well by utilizing a (possibly) large but finite number of parameters (and neurons) [1].
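As a rough illustration of the layer model in (2) and the composition in (3), the following NumPy sketch evaluates a fully-connected feed-forward network and counts its parameters. The layer sizes, the random initialization, the ReLU activation in the hidden layers, and the identity activation at the output are assumptions made only for this example; the helper names `init_params`, `forward`, and `num_params` are hypothetical.

```python
import numpy as np

def relu(z):
    """Element-wise activation (an assumed choice for sigma_l)."""
    return np.maximum(z, 0.0)

def identity(z):
    """Assumed output activation; it leaves the affine output unchanged."""
    return z

def init_params(layer_sizes, rng):
    """Random (W_l, b_l) for l = 1..L, with layer_sizes = [n_0, ..., n_L]."""
    return [(rng.standard_normal((n_out, n_in)), rng.standard_normal(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x0, params, activations):
    """Apply x_l = sigma_l(W_l x_{l-1} + b_l) layer by layer, as in (2)-(3)."""
    x = x0
    for (W, b), sigma in zip(params, activations):
        x = sigma(W @ x + b)
    return x

def num_params(layer_sizes):
    """Total number of weights and biases: sum over l of n_l * (n_{l-1} + 1)."""
    return sum(n_l * (n_prev + 1)
               for n_prev, n_l in zip(layer_sizes[:-1], layer_sizes[1:]))

rng = np.random.default_rng(0)
layer_sizes = [2, 5, 5, 4]            # four layers, so L = 3
params = init_params(layer_sizes, rng)
y = forward(np.array([0.3, -0.7]), params, [relu, relu, identity])
print(y.shape, num_params(layer_sizes))   # (4,) 69
```

With these sizes the count is 5(2+1) + 5(5+1) + 4(5+1) = 69, which matches the number of entries stored in `params`.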
This universal approximation result can be viewed as a generalization of Taylor polynomial approximations to functions with vector inputs and vector outputs. Two other categories are convolutional neural networks and recurrent neural networks [2]. Each category is believed to be better at approximating certain types of functions, in the sense of requiring fewer parameters to achieve a certain approximation error and/or it being easier to find appropriate parameter values in practice. Selecting the right category of neural network is important but beyond the scope of this lecture note. Instead, our problem statement is: what are the important use cases where the function approximation capability can be utilized in the physical layer of communication systems, to achieve large improvements compared to conventional techniques?

B. Solution

To identify practically important use cases, we first need to understand how the function approximation is carried out. The parameter vector $\theta$ of an artificial neural network can be tuned/trained to approximate a (possibly unknown) function that we call $\tilde{f}$; that is, $f$ should be trained to become a good estimate of $\tilde{f}$. This is preferably done by supervised learning using a set of $T$ training examples consisting of input vectors $x_t^{\mathrm{train}}$ and the corresponding output vectors $y_t^{\mathrm{train}} = \tilde{f}(x_t^{\mathrm{train}})$ that we want the neural network to reproduce, for $t = 1, \ldots, T$. Let us represent these training examples as the columns of two matrices:

$$ X^{\mathrm{train}} = \big[\, x_1^{\mathrm{train}} \ \ldots \ x_T^{\mathrm{train}} \,\big], \quad Y^{\mathrm{train}} = \big[\, y_1^{\mathrm{train}} \ \ldots \ y_T^{\mathrm{train}} \,\big]. \tag{4} $$

The inputs should ideally be selected independently at random from the distribution of inputs that appears when using $f$ in reality.
The training basically consists of finding the parameter vector $\theta^\star$ that minimizes a loss function $\ell$ that measures the approximation mismatch:

$$ \theta^\star = \arg\min_{\theta} \ \ell\big(\theta, X^{\mathrm{train}}, Y^{\mathrm{train}}\big). \tag{5} $$

For example, the loss can be measured in the mean-squared sense as

$$ \ell\big(\theta, X^{\mathrm{train}}, Y^{\mathrm{train}}\big) = \sum_{t=1}^{T} \big\| y_t^{\mathrm{train}} - f(x_t^{\mathrm{train}}, \theta) \big\|^2. \tag{6} $$

The goal is that the trained neural network $f(x_0, \theta^\star)$ will provide approximately the right outputs not only for the training examples, but for any input signal $x_0$ generated in the same way. This desired property is called generalization. Intuitively, if the unknown function is continuous and has limited variability, we should be able to approximate it well from a large training set. We can once again make a parallel to polynomial approximations; any scalar polynomial of order $T-1$ is uniquely determined by $T$ samples (training examples) of the inputs and outputs. If the polynomial order is unknown, or if the function is only approximately polynomial, we need a larger number of samples to ensure a good approximation.

Since the training in (5) is a complicated non-convex optimization problem, huge efforts have been dedicated to finding computationally and performance-wise acceptable suboptimal solutions. Moreover, the generalization to unseen inputs can be improved by various regularizations, hyper-parameter choices, and network designs [2]. These choices affect the model complexity. A model that is too simple cannot capture complex dependencies. A model that is too complex explains only the training data (this is called overfitting). A correct complexity trade-off gives good generalization and is typically found using cross-validation. Such empirical craftsmanship is not the focus of this lecture note, but we conclude:

1) Artificial neural networks can approximate any continuous function.
2) The supervised training requires a large training set with inputs/outputs to achieve a low approximation error.
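To make the training problem in (5) with the squared-error loss in (6) more concrete, here is a minimal sketch that fits a one-hidden-layer network to samples of a scalar function by plain gradient descent. Everything specific in it is an assumption chosen for illustration: the target function `target`, the network size, the tanh activation, the step size, and the number of iterations. Practical training typically relies on stochastic gradient methods and the regularization techniques discussed in [2].

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    """Hypothetical function to approximate (stands in for the unknown one)."""
    return np.sin(2.0 * x)

# Training set stored column-wise, as in (4).
T = 200
X_train = rng.uniform(-2.0, 2.0, size=(1, T))
Y_train = target(X_train)

# One hidden layer with tanh activation and a linear output layer.
n_hidden = 20
theta = {
    "W1": rng.standard_normal((n_hidden, 1)),
    "b1": np.zeros((n_hidden, 1)),
    "W2": 0.1 * rng.standard_normal((1, n_hidden)),
    "b2": np.zeros((1, 1)),
}

def network(X, theta):
    """Return the network outputs and the hidden-layer values."""
    H = np.tanh(theta["W1"] @ X + theta["b1"])
    return theta["W2"] @ H + theta["b2"], H

def loss(theta, X, Y):
    """Sum of squared errors over the training examples, as in (6)."""
    Y_hat, _ = network(X, theta)
    return float(np.sum((Y - Y_hat) ** 2))

def gradients(theta, X, Y):
    """Gradient of the loss with respect to every parameter (chain rule)."""
    Y_hat, H = network(X, theta)
    dY = 2.0 * (Y_hat - Y)                        # d loss / d outputs
    dH = (theta["W2"].T @ dY) * (1.0 - H ** 2)    # back through tanh
    return {
        "W2": dY @ H.T,
        "b2": dY.sum(axis=1, keepdims=True),
        "W1": dH @ X.T,
        "b1": dH.sum(axis=1, keepdims=True),
    }

initial_loss = loss(theta, X_train, Y_train)
step = 0.01 / T                                   # small fixed step size
for _ in range(3000):
    grads = gradients(theta, X_train, Y_train)
    for name in theta:
        theta[name] -= step * grads[name]
final_loss = loss(theta, X_train, Y_train)
print(f"loss before training: {initial_loss:.2f}, after: {final_loss:.2f}")
```

The loop only finds a local solution of the non-convex problem in (5); how close it gets to the target depends on the assumed step size, initialization, and network size.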
There are many functionalities in communication systems that can be described by a mathematical function and, thus, can be approximated by a neural network. To identify the promising use cases, we will first explain the basic methodology and its weaknesses by giving a concrete example.

1) A Deep-Learning Solution to Signal Detection: The physical layer of a communication system determines how an information-bearing signal is sent from the transmitter to the receiver over a physical channel. A critical task is the signal detection, where the receiver tries to identify what information was sent. To describe some key properties of deep learning, we will exemplify how it can be used for signal detection.

We consider a classical additive white Gaussian noise (AWGN) channel, where a two-dimensional signal vector $s \in \mathbb{R}^2$ is sent. The received signal $r \in \mathbb{R}^2$ is given by

$$ r = s + n, \tag{7} $$

where $n \sim \mathcal{N}(0, \sigma^2 I)$ is an independent Gaussian noise vector where the entries have variance $\sigma^2$. We assume two bits of information are encoded into $s$ using a quadrature phase-shift keying (QPSK) constellation. Hence, there are four possible signal points that are equally spaced on the unit circle:

$$ s \in \left\{ \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}, \begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix}, \begin{bmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}, \begin{bmatrix} -1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix} \right\}. \tag{8} $$

The mapping between information bits and signals is illustrated in Fig. 2(a). Due to the additive noise, the received signal $r$ can take any value, but the Gaussian distribution makes values close to one of the signal points in (8) more likely than values far away. This can be seen from the red dots in Fig. 2(a), which represent $r$ for 10,000 noise realizations with $\sigma = 0.2$ that are added to each signal point. Based on the received signal $r$, the receiver needs to guess (detect) what signal $s$ was sent.
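The channel model in (7) with the constellation in (8) is straightforward to simulate. The sketch below draws 10,000 noise realizations per signal point with $\sigma = 0.2$, which should give a data set similar in kind to the one described above. The random seed, the variable names, and the use of integer labels instead of bit pairs are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(1)

sigma = 0.2                      # noise standard deviation per entry
a = 1.0 / np.sqrt(2.0)

# Rows are the four QPSK signal points in (8), all on the unit circle.
constellation = np.array([[a, a], [a, -a], [-a, a], [-a, -a]])

# 10,000 transmissions of each signal point.
n_per_point = 10_000
labels = np.repeat(np.arange(4), n_per_point)   # index of the sent point
s = constellation[labels]

# Received signals according to (7): r = s + n with n ~ N(0, sigma^2 I).
r = s + sigma * rng.standard_normal(s.shape)

print(r.shape)                   # (40000, 2)
```

The pairs (`r`, `labels`) are the kind of input/output examples that a detector can be trained on, with `r` as the input and a one-hot encoding of `labels` as the desired output.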
We have trained a neural network for this task, by taking the received signal x = r as input and letting the output y be a four-dimensional vector that is one for the detected signal and has zeroes elsewhere. We used the 40,000 red dots in Fig. 2(a), and the signals s that generated these r, to train a fully-connected neural network using standard training methods. We then applied the neural network to a wide range of possible received signals to illustrate how it is making its detection. The colored areas in Fig. 2(b) show in which regions the received signals are mapped to the respective information signals. The regions are separated by lines, which is expected since each layer performs linear algebra operations; in particular, each activation function determines if the input is below/above a line that has been selected by training. Note that we have “zoomed out” and the range of values that was shown in Fig. 2(a) is indicated by the black square.

The colored detection regions produced by the neural network have peculiar asymmetric shapes, which are not optimal. In fact, the optimal detection regions for AWGN channels are well known [3, Ch. 6]: the received signal should be mapped to the closest signal point in terms of Euclidean distance. The optimal detection regions are shown in Fig. 2(c). The regions are quite similar within the black square, but deviate greatly further away.

Fig. 2. (a) Quadrature phase-shift keying for information encoding and the corresponding received signals. (b) Detection regions with a trained neural network. (c) Optimal detection regions using detection theory. We send QPSK signals over an AWGN channel, as shown in (a), and try to detect the signals at the receiver. The detection regions produced by a trained neural network are shown in (b) and the optimal regions obtained from detection theory are shown in (c).

Several important observations can be made from this example:

1) If there is a known optimal algorithm, a trained neural network cannot outperform it. The detection error probability is, however, almost the same in this particular example since most received signals appear within the black square where the neural network has a decent behavior.

2) There are two reasons why the detection regions in Fig. 2(b) are wrongly shaped. Firstly, the shape inside the black square (around the signal points) is wrong due to overfitting; the training examples in Fig. 2(a) can be approximately separated by many piecewise linear boundaries, including the ones shown in Fig. 2(b). Secondly, since all training examples are inside the square, the behavior outside the square is somewhat random; the neural network has learned to interpolate between training examples but not to extrapolate outside the square. This is a practical issue since received signals far outside the square occasionally appear due to the long-tailed Gaussian distribution. This is a general phenomenon; neural networks are good at handling typical inputs but may generalize poorly to atypical inputs.

3) We could have used prior domain knowledge (from digital communications) to preprocess the input signals. In this example, the neural network had to rediscover where the constellation points are, how the noise is distributed, and how to make the right detection. If we instead computed the Euclidean distance between the received signal and each of the four signal constellation points, we could use those distances as input to a neural network. This would give more accurate and reliable results since we have utilized our domain knowledge to ensure that the neural network has fewer characteristics to learn. However, it still cannot beat the optimal detection.
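For comparison with a learned detector, the optimal rule mentioned above (map the received signal to the closest signal point in Euclidean distance) takes only a few lines. In this sketch, `distance_features` computes the four distances that the third observation suggests as preprocessed inputs, and `detect_min_distance` picks the smallest one. The function names, the random seed, and the Monte Carlo setup are assumptions of this example.

```python
import numpy as np

a = 1.0 / np.sqrt(2.0)
constellation = np.array([[a, a], [a, -a], [-a, a], [-a, -a]])

def distance_features(r, points=constellation):
    """Euclidean distance from each received signal (row of r) to each point."""
    diff = r[:, None, :] - points[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))      # shape (N, 4)

def detect_min_distance(r, points=constellation):
    """Minimum-distance detection: index of the closest signal point."""
    return np.argmin(distance_features(r, points), axis=1)

# Monte Carlo estimate of the detection error probability for sigma = 0.2.
rng = np.random.default_rng(2)
sigma = 0.2
labels = rng.integers(0, 4, size=100_000)
r = constellation[labels] + sigma * rng.standard_normal((labels.size, 2))
error_rate = np.mean(detect_min_distance(r) != labels)
print(f"estimated error rate: {error_rate:.5f}")

# The rule also behaves sensibly far away from the training region.
print(detect_min_distance(np.array([[10.0, 10.0]])))   # [0]
```

Because the decision depends only on which distance is smallest, the resulting regions are the four quadrants, for received signals both inside and far outside the range covered by typical noise realizations.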
2) Is There a Role of Deep Learning in Communications?: Since signal detection in AWGN channels is easy to perform optimally, it makes little sense to utilize artificial neural networks for that purpose. There are many similar tasks in communications where deep learning cannot make any meaningful improvements. For example, the fundamental performance limits were derived by Shannon [4] and we can operate close to those limits using modern channel codes. Moreover, it is known how to perform optimal channel estimation, multi-user multiple-input multiple-output (MIMO) processing, and transmit power allocation in many wireless communication scenarios [5]. The fact that the information signals are man-made gives us strong prior information that makes it easier to devise effective man-made algorithms than in many other fields, where the signals are created by nature.

There are nevertheless some important roles that deep learning can play in communications. Firstly, there are many problems where a known algorithm finds the optimum but has prohibitively high complexity for real-time implementation. Secondly, there are cases where the standard system models are inadequate or incomplete. It is sufficient to replace the noise distribution in the previous example with an unknown one to find a case where learning can help. We will elaborate on these two applications in the remainder of this lecture note. But before that, we stress that errors are unavoidable in the physical layer of communication systems and are conventionally dealt with using retransmissions. This built-in fault tolerance is positive when it comes to the utilization of deep learning. It gives robustness to the strange behaviors that occasionally occur when an atypical signal is fed into a neural network that has been trained to work well for typical input signals. However, adversaries can also exploit atypical signals to perform jamming more efficiently [6].

IV. APPLICATION 1: ALGORITHMIC APPROXIMATION

The first important application of deep learning in communications is to approximate a known but computationally complicated algorithm. There are many examples of iterative algorithms that asymptotically find a global (or local) optimum to an optimization problem, but require very many iterations for convergence and/or complicated operations in each iteration [7]. Such algorithms might not be practically useful in communication systems where latency constraints require execution times below a millisecond.

Fig. 3. (a) Offline training phase. (b) Real-time usage. A known algorithm $f$ can be approximated by training a neural network $\hat{f}$ to make $f(x) \approx \hat{f}(x, \theta)$ for all possible inputs, as shown in (a). The training procedure will iteratively update $\theta$ to gradually reduce the approximation errors until it converges to some $\theta^\star$. If the neural network is designed to have sufficiently low complexity, then the trained neural network in (b) can be used in real-time applications.

The general procedure for training a neural network for algorithmic approximation is illustrated in Fig. 3. Suppose we have a known algorithm, represented by the function $y = f(x)$, which cannot be implemented in real time. To address this problem using deep learning, we can first create a training set containing a large number $T$ of input signals $x_t^{\mathrm{train}}$, for $t = 1, \ldots, T$. We then run the algorithm $T$ times to compute the outputs

$$ y_t^{\mathrm{train}} = f(x_t^{\mathrm{train}}). \tag{9} $$

After having generated the training set, we can train an artificial neural network to provide approximately the same outputs for these inputs. More precisely, we should find an optimized parameter vector $\theta^\star$ in accordance with (5).
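The training-set generation in (9) can be sketched as follows. As a stand-in for the known algorithm $f$, the example uses classical water-filling power allocation computed by bisection; this particular algorithm, the exponentially distributed channel gains, the problem size, and all names are assumptions made for illustration and are not the algorithms studied in [7], [8].

```python
import numpy as np

def water_filling(gains, total_power=1.0, iters=60):
    """Stand-in for the known algorithm f: allocate p_i = max(0, mu - 1/g_i),
    with the water level mu found by bisection so that sum(p) = total_power."""
    inv = 1.0 / gains
    lo, hi = 0.0, total_power + inv.max()
    for _ in range(iters):
        mu = 0.5 * (lo + hi)
        if np.maximum(0.0, mu - inv).sum() > total_power:
            hi = mu
        else:
            lo = mu
    return np.maximum(0.0, 0.5 * (lo + hi) - inv)

rng = np.random.default_rng(3)
n_channels, T = 4, 1000

# Inputs drawn from the distribution assumed to occur at run time,
# stored column-wise as in (4).
X_train = 0.1 + rng.exponential(size=(n_channels, T))

# Run the algorithm T times to obtain the desired outputs, as in (9).
Y_train = np.column_stack([water_filling(X_train[:, t]) for t in range(T)])

print(X_train.shape, Y_train.shape)   # (4, 1000) (4, 1000)
```

The matrices `X_train` and `Y_train` would then be passed to a training routine that solves (5). The cost of this offline step grows with both $T$ and the run time of the algorithm being approximated.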
If the training is performed well, the neural network will generalize well (i.e., provide good outputs) to previously unseen input signals that were generated in the same way as the inputs used for training. Simply speaking, this means that $f(x) \approx \hat{f}(x, \theta^\star)$ for all inputs $x$ of practical interest.

There are many optimization problems to be solved in communication systems. For example, at the transmitter, power allocation between concurrent transmissions is important to limit interference [5], [7]. At the receiver, non-linear signal detection problems must be solved to deal with interference in MIMO systems [8]. Some of these problems are convex and can be solved by off-the-shelf optimization software. Other problems are non-convex but there exist iterative algorithms that converge to local or global optima. In both cases, the computational complexity is often prohibitive for real-time applications, where similar optimization problems with different input data are solved repeatedly. A neural network can then be trained to learn approximately how the solution depends on the input data. This approximate input-output map can be evaluated with substantially lower computational cost, as exemplified in [7], [8]. Domain knowledge can be utilized to pre-process the input data, to focus the learning on the problem that the algorithm is solving and not on rediscovering known properties (e.g., that the desired signal lies in a certain subspace).

There are two main approaches. One can learn the input-output mapping based on training data, as described above, while ignoring how it was produced. Alternatively, the shape of the neural network can be selected so that each layer mimics one iteration of a known algorithm that converges asymptotically to an optimum. This is called deep unfolding and exploits that many first-order iterative optimization methods have the same structure as a (recurrent) neural network [9].
The parameters of the neural network are then trained to give a nearly optimum solution after a predefined number of iterations, thereby speeding up the convergence. In [8], the authors “unfold” a gradient-descent-like algorithm for MIMO detection to create a neural network where each layer performs similar operations but with optimized parameters. When using this approach, the outputs in (9) need not be determined in advance, which simplifies the training.

The practical benefit of this application is the complexity reduction it can provide; the neural network will essentially learn how to make algorithmic shortcuts to strike a good balance between accuracy and computational complexity. Another important benefit is related to hardware implementation. To solve a practical problem with real-time constraints, we conventionally would first need to design an algorithm and then develop a dedicated circuit based on it, which can be very time-consuming. With the help of deep learning, we can instead predesign a general-purpose circuit that implements a neural network of a given maximum size (i.e., number of layers and neurons) with a predetermined run time. We can then train a neural network to perform the algorithmic task we need and, finally, load the corresponding trained parameters (i.e., weights and biases) onto the circuit. This new approach to hardware implementation can greatly reduce the time from when the algorithmic design begins until a product can hit the market.

A main issue with this application is the highly computationally demanding generation of desired outputs: the more complex the algorithm $f$ is, the longer time it takes to compute $f(x_t^{\mathrm{train}})$ for $t = 1, \ldots, T$. We are basically moving the complexity issue from the algorithmic run time to the design process. There is a practical limit to which algorithms we can approximate in this way.
If it takes 1 hour to generate one training example, it will take 11.4 years or extreme parallelism to generate 100,000 examples.

V. APPLICATION 2: INVERSION OF AN UNKNOWN FUNCTION

The second important application is to invert an unknown function. In particular, non-linear distortion can occur between the transmitter and receiver. Three prominent examples are finite-resolution quantization in the receiver hardware, non-linear amplifiers in the transmitter hardware [10], and non-linear fiber-optical channels [11]. While quantizers typically are designed with known properties, the latter two examples can be represented by an unknown function $g$ that takes a signal $y$ as input and produces a distorted output $x = g(y)$. The conventional way to undo the distortion is to identify an appropriate parameterized model of the function, then estimate the parameters from measurements, and finally create an inverse function based on the estimates. This three-step approach is suboptimal and prone to error-propagation.

Fig. 4. (a) Training phase. (b) Usage. An unknown function $g$ with input $y$ is inverted using a neural network $f$ by training it to achieve $f(g(y), \theta) \approx y$, as shown in (a). The training procedure will iteratively update $\theta$ to gradually reduce the approximation errors until it converges to some $\theta^\star$. The trained neural network in (b) can be used to counteract the unknown function, without having to explicitly model it and estimate model parameters.

An alternative is to train a neural network to directly invert the function, without explicit modeling or parameter estimation. Whenever only suboptimal conventional algorithms exist, a learned algorithm can theoretically provide better performance and robustness, but only if the training is carried out successfully.
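The idea of learning the inverse directly, so that $f(g(y), \theta) \approx y$, can be sketched on a toy scalar example. The distortion `g` below is a hypothetical saturating non-linearity that the learner only accesses through input-output measurements. To keep the example short and deterministic, a simplification is used in place of full network training: the hidden layer is fixed at random and only the output layer is fitted by least squares. All sizes, scalings, and names are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(4)

def g(y):
    """Hypothetical distortion, treated as a black box by the learner."""
    return np.tanh(1.5 * y)

def hidden(x, W1, b1):
    """Fixed random hidden layer with tanh activation, shape (len(x), n_hidden)."""
    return np.tanh(np.outer(x, W1) + b1)

# Training data: known signals y sent through the distortion, x = g(y).
T = 2000
y_train = rng.uniform(-1.0, 1.0, size=T)
x_train = g(y_train)

# Fit only the output layer (weights plus bias) by least squares.
n_hidden = 50
W1 = 3.0 * rng.standard_normal(n_hidden)
b1 = rng.standard_normal(n_hidden)
H_train = np.column_stack([hidden(x_train, W1, b1), np.ones(T)])
w_out, *_ = np.linalg.lstsq(H_train, y_train, rcond=None)

def f_net(x):
    """Learned approximate inverse: takes distorted x, returns an estimate of y."""
    H = np.column_stack([hidden(x, W1, b1), np.ones(len(x))])
    return H @ w_out

# Evaluate on fresh signals generated in the same way.
y_test = rng.uniform(-1.0, 1.0, size=1000)
x_test = g(y_test)
err_learned = np.mean(np.abs(f_net(x_test) - y_test))
err_uncorrected = np.mean(np.abs(x_test - y_test))
print(f"mean error without correction: {err_uncorrected:.4f}")
print(f"mean error with learned inverse: {err_learned:.4f}")
```

Note that the network takes the distorted signal as input and the original signal as the desired output, so no explicit model of `g` is ever estimated.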
The general procedure for training a neural network for function inversion is illustrated in Fig. 4. We need to generate a large number $T$ of possible communication signals $y_t^{\mathrm{train}}$ and send them through the unknown function to measure

$$ x_t^{\mathrm{train}} = g(y_t^{\mathrm{train}}) \quad \text{for } t = 1, \ldots, T. \tag{10} $$

It is then $x_t^{\mathrm{train}}$ that is used as input to the neural network, while $y_t^{\mathrm{train}}$ is the desired output.

Different from Application 1, the creation of a training set can be very computationally efficient in Application 2 because the outputs are man-made. It is typically created to be statistically equivalent to the signals observed at run time, but one can also create a biased training set to emphasize typical or atypical examples. Online learning when operating the communication system is possible by occasionally sending predefined reference signals to generate new training data. This is useful when the function $g$ is time-varying (e.g., due to temperature variations in the hardware).

The key to successful utilization of deep learning is to identify tasks in communication systems that currently lack an optimal solution; there is then an opportunity to beat the state-of-the-art. For example, a common way to deal with non-linear communication hardware is to apply the Bussgang decomposition [12] to write the output of the non-linear function $g$ as $g(y) = Dy + n$, where $D$ is a deterministic matrix and $n$ is distortion noise that is uncorrelated with $y$ but statistically dependent on it. By pretending that $n$ is independent noise, one can often develop communication algorithms (e.g., for channel estimation or data detection) that partially mitigate distortion, but such algorithms are suboptimal since the distortion is in fact dependent on the input. As shown in [10], one can achieve substantially better performance by training neural networks instead.

VI. WHAT WE HAVE LEARNED

Although many problems in communication systems can be solved optimally, there are important cases where deep learning can give large improvements. In particular, it can be used to reduce computational complexity of known algorithms or to deal with non-linear hardware or channels in an efficient way.

VII. ACKNOWLEDGMENT

This work was supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

VIII. AUTHORS

Emil Björnson (emil.bjornson@liu.se) received the MSc degree in engineering mathematics from Lund University, Sweden, in 2007, and the PhD degree in telecommunications from the KTH Royal Institute of Technology, Sweden, in 2011. He is now an associate professor at Linköping University, Sweden. He has authored the textbooks Optimal Resource Allocation in Coordinated Multi-Cell Systems (2013) and Massive MIMO Networks: Spectral, Energy, and Hardware Efficiency (2017). He received the 2018 IEEE Marconi Prize Paper Award in Wireless Communications, the 2019 EURASIP Early Career Award, the 2019 IEEE Communications Society Fred W. Ellersick Prize, and the 2019 IEEE Signal Processing Magazine Best Column Award.

Pontus Giselsson (pontus.giselsson@control.lth.se) is an Associate Professor at the Department of Automatic Control at Lund University, Sweden. His current research interests include mathematical optimization and its wide range of applications, e.g., in machine learning, control, signal processing, and wireless communication. He received an MSc degree from Lund University in 2006 and a PhD degree from Lund University in 2012. During 2013 and 2014, he held a postdoc position at Stanford University. In 2012, he received the Young Author Prize at the ADCHEM IFAC Symposium; in 2014, he received the Young Author Prize at the IFAC World Congress; and in 2015, he received the Ingvar Carlsson Award from the Swedish Foundation for Strategic Research.

REFERENCES

[1] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, Dec. 1989.
[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[3] U. Madhow, Introduction to Communication Systems. Cambridge University Press, 2014.
[4] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, 1948.
[5] E. Björnson, J. Hoydis, and L. Sanguinetti, “Massive MIMO networks: Spectral, energy, and hardware efficiency,” Foundations and Trends® in Signal Processing, vol. 11, no. 3-4, pp. 154–655, 2017.
[6] M. Sadeghi and E. G. Larsson, “Adversarial attacks on deep-learning based radio signal classification,” IEEE Wireless Communications Letters, vol. 8, no. 1, pp. 213–216, Feb. 2019.
[7] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for interference management,” IEEE Transactions on Signal Processing, vol. 66, no. 20, pp. 5438–5453, Oct. 2018.
[8] N. Samuel, T. Diskin, and A. Wiesel, “Deep MIMO detection,” in IEEE SPAWC, Jul. 2017.
[9] J. R. Hershey, J. L. Roux, and F. Weninger, “Deep unfolding: Model-based inspiration of novel deep architectures,” arXiv preprint, vol. abs/1409.2574, 2014. [Online]. Available: http://arxiv.org/abs/1409.2574
[10] Ö. T. Demir and E. Björnson, “Channel estimation in massive MIMO under hardware non-linearities: Bayesian methods versus deep learning,” IEEE Open Journal of the Communications Society, vol. 1, no. 1, pp. 109–124, 2020.
[11] A. D. Ellis, J. Zhao, and D. Cotter, “Approaching the non-linear Shannon limit,” Journal of Lightwave Technology, vol. 28, no. 4, pp. 423–433, Feb. 2010.
[12] J. J. Bussgang, “Crosscorrelation functions of amplitude-distorted Gaussian signals,” RLE, MIT, Tech. Rep. 216, 1952.

Journal

Computing Research Repository, arXiv (Cornell University)

Published: Jan 10, 2020
