Artificial neural networks for neuroscientists: A primer

Artificial neural networks (ANNs) are essential tools in machine learning that have drawn increasing attention in neuroscience. Besides offering powerful techniques for data analysis, ANNs provide a new approach for neuroscientists to build models for complex behaviors, heterogeneous neural activity and circuit connectivity, as well as to explore optimization in neural systems, in ways that traditional models are not designed for. In this pedagogical Primer, we introduce ANNs and demonstrate how they have been fruitfully deployed to study neuroscientific questions. We first discuss basic concepts and methods of ANNs. Then, with a focus on bringing this mathematical framework closer to neurobiology, we detail how to customize the analysis, structure, and learning of ANNs to better address a wide range of challenges in brain research. To help readers gain hands-on experience, this Primer is accompanied by tutorial-style code in PyTorch and Jupyter Notebook, covering major topics.

1 Artificial neural networks in neuroscience

Learning with artificial neural networks (ANNs), or deep learning, has emerged as a dominant framework in machine learning (ML) [LeCun et al., 2015], leading to breakthroughs across a wide range of applications, including computer vision [Krizhevsky et al., 2012], natural language processing [Devlin et al., 2018], and strategic games [Silver et al., 2017]. Some key ideas in this field can be traced to brain research: supervised learning rules have their roots in the theory of training perceptrons, which in turn was inspired by the brain [Rosenblatt, 1962]; the hierarchical architecture [Fukushima and Miyake, 1982] and convolutional principle [LeCun and Bengio, 1995] were closely linked to our knowledge about the primate visual system [Hubel and Wiesel, 1962, Felleman and Van Essen, 1991].
Today, there is a continued exchange of ideas from neuroscience to the field of artificial intelligence [Hassabis et al., 2017]. At the same time, machine learning offers new and powerful tools for systems neuroscience. One utility of the deep learning framework is to analyze neuroscientific data (Figure 1). Indeed, the advances in computer vision, especially convolutional neural networks, have revolutionized image and video data processing. For instance, uncontrolled behaviors over time, such as micro-movements of animals in a laboratory experiment, can now be tracked and quantified efficiently with the help of deep neural networks [Mathis et al., 2018]. Innovative neurotechnologies are producing a deluge of big data from brain connectomics, transcriptomics, and neurophysiology, the analyses of which can benefit from machine learning. Examples include image segmentation to achieve detailed, µm-scale reconstruction of connectivity in a neural microcircuit [Januszewski et al., 2018, Helmstaedter et al., 2013], and estimation of neural firing rate from spiking data [Pandarinath et al., 2018].

Preprint. arXiv:2006.01001v2 [q-bio.NC] 24 Sep 2020

This primer will not focus on data analysis; instead, our primary aim is to present basic concepts and methods for the development of ANN models of biological neural circuits in the field of computational neuroscience. It is noteworthy that ANNs should not be confused with neural network models in general. Mathematical models are all "artificial" inasmuch as they are not biological. By ANNs we specifically denote models that are in part inspired by neuroscience, yet for which biological justification is not the primary concern, in contrast to other types of models that strive to be built on quantitative data from the two pillars of neuroscience: neuroanatomy and neurophysiology.
The use of ANNs in neuroscience [Zipser and Andersen, 1988] and cognitive science [Cohen et al., 1990] dates back to the early days of ANNs [Rumelhart et al., 1986]. In recent years, ANNs have become increasingly common model systems in neuroscience [Yamins and DiCarlo, 2016, Kriegeskorte, 2015, Sussillo, 2014, Barak, 2017]. There are three reasons why ANNs or deep learning models have already been, and will likely continue to be, particularly useful for neuroscientists.

First, fresh modeling approaches are needed to meet new challenges in brain research. Over the past decades, computational neuroscience has made great strides and become an integral part of systems neuroscience [Abbott, 2008]. Many insights have been gained through the integration of experiments and theory, including the idea of excitation and inhibition balance [Van Vreeswijk and Sompolinsky, 1996, Shu et al., 2003] and normalization [Carandini and Heeger, 2012]. Progress was also made in developing models of basic cognitive functions such as simple decision-making [Gold and Shadlen, 2007, Wang, 2008]. However, real-life problems can be incredibly complex, and the underlying brain systems are often difficult to capture with "hand-constructed" computational models. For example, object classification in the brain is carried out through many layers of complex linear-nonlinear processing. Building functional models of the visual system that achieve behavioral performance close to that of humans remained a formidable challenge not only for neuroscientists, but also for computer vision researchers. By directly training neural network models on complex tasks and behaviors, deep learning provides a way to efficiently generate candidate models for brain functions that otherwise could be nearly impossible to model (Figure 1).
By learning to perform a variety of complex behaviors of animals, ANNs could serve as potential model systems for biological neural networks, complementing nonhuman animal models for understanding the human brain.

Figure 1: Reasons for using ANNs for neuroscience research. (Top left) Neural/behavioral data analysis. ANNs can serve as image processing tools for efficient pose estimation (color dots). Figure inspired from Nath et al. [2019]. (Top right) Modeling complex behaviors. ANNs can perform object discrimination tasks involving challenging naturalistic visual objects. Figure adapted from Kar et al. [2019]. (Bottom left) Illustrating that ANNs can be used to model complex neural activity/connectivity patterns (blue lines). (Bottom right) Understanding neural circuits from an optimization perspective. In this view, functional neural networks (star symbol) are results of the optimization (arrows) of an objective function in an abstract space of a model constrained by the neural network architecture (colored space).

A second reason for advocating deep networks in systems neuroscience is the acknowledgment that relatively simple models often do not account for the wide diversity of activity patterns in heterogeneous neural populations (Figure 1). One can rightly argue that this is a virtue rather than a defect, because simplicity and generality are hallmarks of good theories. However, complex neural signals also tell us that existing models may be insufficient to elucidate the mysteries of the brain. This is perhaps especially true in the case of the prefrontal cortex. Neurons in prefrontal cortex often show complex mixed selectivity to various task variables [Rigotti et al., 2010, 2013]. Such complex patterns are often not straightforward to interpret and understand using hand-built models that by design strive for simplicity. ANNs show promise for capturing the complex nature of neural activity.
Third, besides providing mechanistic models of biological systems, machine learning can be used to probe the "why" question in neuroscience [Barlow et al., 1961]. Brains are biological machines evolved under pressure to compute robustly and efficiently. Even when we understand how a system works, we may still ask why it works that way. Similar to biological systems evolving to survive, ANNs are trained to optimize objective functions given various architectural constraints (the number of neurons, economy of circuit wiring, etc.) (Figure 1). By identifying the particular objective and set of constraints that lead to brain-resembling ANNs, we could potentially gain insights into the evolutionary pressure faced by biological systems [Richards et al., 2019].

In this pedagogical primer, we will discuss how ANNs can benefit neuroscientists in the three ways described above. In section 2, we will first introduce the key ingredients common to any study of ANNs. In section 3, we will describe two major applications of ANNs as neuroscientific models: convolutional networks as models for sensory, especially visual, systems, and recurrent neural networks as models for cognitive and motor systems. In the following sections 4 and 5, we will overview how to customize the analysis and architectural design of ANNs to better address a wide range of neuroscience questions. To help readers gain hands-on experience, we accompany this primer with tutorial-style code in PyTorch and Jupyter Notebook (https://github.com/gyyang/nn-brain), covering all major topics.

2 Basic ingredients and variations in artificial neural networks

In this section, we will introduce basic concepts in ANNs and their common variations. Readers can skip this section if they are familiar with ANNs and deep learning. For a more thorough introduction, readers can refer to Goodfellow et al. [2016].
2.1 Basic ingredient: learning problem, architecture, and algorithm

A typical study using deep networks consists of three basic ingredients: learning problem, network architecture, and training algorithm. Weights of connections between units or neurons in a neural network are constrained by the network architecture, but their specific values are randomly assigned at initialization. These weights constitute a large number of parameters, collectively denoted by $\theta$, which also includes other model parameters (see below), to be trained using an algorithm. The training algorithm specifies how connection weights change to better solve a learning problem, such as to fit a dataset or perform a task. We will go over a simple example, where a multi-layer perceptron (MLP) is trained to perform a simple digit-classification task using supervised learning.

Learning problem. In supervised learning, a system learns to fit a dataset containing a set of inputs $\{\mathbf{x}^{(i)}\}$, $i = 1, \cdots, N$. Each input $\mathbf{x}^{(i)}$ is paired with a target output $\mathbf{y}^{(i)}_{\text{target}}$. Symbols in bold represent vectors (column vectors by default). The goal is to learn parameters $\theta$ of a neural network function $F(\cdot, \theta)$ that predicts the target outputs given inputs, $\mathbf{y}^{(i)} = F(\mathbf{x}^{(i)}, \theta) \approx \mathbf{y}^{(i)}_{\text{target}}$. In the simple digit-classification task MNIST [LeCun et al., 1998], each input is an image containing a single digit, while the target output is either a probability distribution over all classes (0, 1, ..., 9), given by a 10-dimensional vector, or simply an integer corresponding to the class of the digit.

More precisely, the system is trained to optimize the value of an objective function, or commonly, to minimize the value of a loss function $L = \frac{1}{N}\sum_i L(\mathbf{y}^{(i)}, \mathbf{y}^{(i)}_{\text{target}})$, where $L(\mathbf{y}^{(i)}, \mathbf{y}^{(i)}_{\text{target}})$ quantifies the difference between the target output $\mathbf{y}^{(i)}_{\text{target}}$ and the actual output $\mathbf{y}^{(i)}$.

Network architecture. ANNs are incredibly versatile, including a wide range of architectures.
Of all architectures, the most fundamental one is the Multi-Layer Perceptron (MLP) [Rosenblatt, 1958, 1962] (Figure 2A). An MLP consists of multiple layers of neurons, where neurons in the $l$-th layer only receive inputs from the $(l-1)$-th layer, and only project to the $(l+1)$-th layer.

$$ \mathbf{r}^{(1)} = \mathbf{x}, \tag{1} $$
$$ \mathbf{r}^{(l)} = f\left(W^{(l)} \mathbf{r}^{(l-1)} + \mathbf{b}^{(l)}\right), \quad 1 < l < N, \tag{2} $$
$$ \mathbf{y} = W^{(N)} \mathbf{r}^{(N-1)} + \mathbf{b}^{(N)}. \tag{3} $$

Here $\mathbf{x}$ is an external input, $\mathbf{r}^{(l)}$ denotes the neural activity of neurons in the $l$-th layer, and $W^{(l)}$ is the connection matrix from the $(l-1)$-th to the $l$-th layer. $f(\cdot)$ is a (usually nonlinear) activation function of the model neurons. The output of the network is read out through connections $W^{(N)}$. Parameters $\mathbf{b}^{(l)}$ and $\mathbf{b}^{(N)}$ are biases for model neurons and output units respectively. If the network is trained to classify, then the output is often normalized such that $\sum_j y_j = 1$, where $y_j$ represents the predicted probability of class $j$.

When there are enough neurons per layer, MLPs can in theory approximate arbitrary functions [Hornik et al., 1989]. However, in practice, the network size is limited, and good solutions may not be found through training even when they exist. MLPs are often used in combination with, or as parts of, more modern neural network architectures.

Training algorithm. The signature method of training in deep learning is stochastic gradient descent (SGD) [Robbins and Monro, 1951, Rumelhart et al., 1986]. Trainable parameters, collectively denoted as $\theta$, are updated in the opposite direction of the gradient of the loss, $\partial L/\partial \theta$. Intuitively, the $j$-th parameter $\theta_j$ should be reduced by training if the cost function $L$ increases with it, and increased otherwise. For each step of training, since it is usually too expensive to evaluate the loss using the entire training set, the loss is computed using a small number $M$ of randomly selected training examples (a minibatch), indexed by $B = \{k_1, \cdots, k_M\}$,

$$ L_{\text{batch}} = \frac{1}{M} \sum_{k \in B} L\left(\mathbf{y}^{(k)}, \mathbf{y}^{(k)}_{\text{target}}\right), \tag{4} $$

hence the name "stochastic".
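To make Eqs. 1-4 concrete, the following is a minimal NumPy sketch of an MLP forward pass and a minibatch loss. It is an illustration under stated assumptions rather than the accompanying tutorial code: the layer sizes, the ReLU hidden nonlinearity, the softmax normalization, the cross-entropy loss, and the random stand-in data are all choices made here, and the function names (`init_mlp`, `mlp_forward`, `minibatch_loss`) are hypothetical.

```python
import numpy as np

def init_mlp(layer_sizes, rng):
    """Random weights and zero biases for each connection matrix W^(l)."""
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
        params.append((W, np.zeros(n_out)))
    return params

def relu(x):
    return np.maximum(x, 0.0)

def mlp_forward(x, params):
    """Forward pass: r^(1) = x, ReLU hidden layers (Eq. 2), linear readout (Eq. 3)."""
    r = x
    for W, b in params[:-1]:
        r = relu(W @ r + b)
    W_out, b_out = params[-1]
    return W_out @ r + b_out

def softmax(y):
    """Normalize outputs so that they sum to 1 (predicted class probabilities)."""
    z = np.exp(y - np.max(y))
    return z / z.sum()

def minibatch_loss(X, targets, params, batch_size, rng):
    """Loss averaged over M randomly selected examples (Eq. 4).

    Cross-entropy is an assumed choice of the per-example loss L.
    """
    idx = rng.choice(len(X), size=batch_size, replace=False)
    losses = [-np.log(softmax(mlp_forward(X[k], params))[targets[k]]) for k in idx]
    return float(np.mean(losses))

rng = np.random.default_rng(0)
params = init_mlp([784, 64, 10], rng)       # MNIST-sized layers (assumed)
X = rng.random((100, 784))                  # random stand-in "images"
targets = rng.integers(0, 10, size=100)     # random stand-in class labels
probs = softmax(mlp_forward(X[0], params))
loss = minibatch_loss(X, targets, params, batch_size=16, rng=rng)
```

Because a different minibatch is drawn on every call, repeated evaluations of `minibatch_loss` give slightly different values, which is the source of the stochasticity in SGD.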
For simplicity, we assume a minibatch size of 1 and omit "batch" in the following equations ($L_{\text{batch}}$ will be referred to as $L$, etc.). The gradient, $\partial L/\partial \theta$, is the direction of parameter change that would lead to the maximum increase in the loss function when the change is small enough. To decrease the loss, trainable parameters are updated in the opposite direction of the gradient, with a magnitude proportional to the learning rate $\eta$,

$$ \Delta\theta = -\eta \frac{\partial L}{\partial \theta}. \tag{5} $$

Figure 2: Schematics of common neural network architectures. (A) A multi-layer perceptron (MLP). (B) A recurrent neural network (middle) receives a stream of inputs (left). After training, an output unit (right) should produce a desired output. Figure inspired from Mante et al. [2013]. (C) A recurrent neural network is unrolled in time as a feedforward system, with each layer corresponding to the network state at one time step. $\mathbf{c}_t$ and $\mathbf{r}_t$ describe the network state and output activity at time $t$ respectively. $\mathbf{c}_t$ is a function of $\mathbf{r}_{t-1}$ and the input $\mathbf{x}_t$. (D) A convolutional neural network for processing images. Each layer contains a number of channels (4 in layer 1, 6 in layer 2). A channel (represented by a square) consists of spatially organized neurons, each receiving connections from neurons with similar spatial preferences. The spatial extent of these connections is described by the kernel size. Figure inspired from LeCun et al. [1998].

Parameters such as $W$ and $\mathbf{b}$ are usually trainable. Other parameters are set by the modelers and called hyperparameters, for example, the learning rate $\eta$. A crucial requirement for computing gradients is differentiability, namely that derivatives of functions in the model are well defined.
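As a concrete illustration of the update rule in Eq. 5, the sketch below runs minibatch SGD on a toy problem. Everything about the setup is an assumption made for illustration: a linear model $\mathbf{y} = W\mathbf{x} + \mathbf{b}$, a squared-error loss (for which the gradient has a simple closed form), noiseless targets generated by a random "teacher" matrix, and the particular learning rate and batch size. The helper name `sgd_step` is hypothetical.

```python
import numpy as np

def sgd_step(W, b, X_batch, Y_batch, lr):
    """One SGD step (Eq. 5) for y = W x + b with L = 0.5 * ||y - y_target||^2."""
    err = X_batch @ W.T + b - Y_batch          # (M, n_out): y - y_target per example
    grad_W = err.T @ X_batch / len(X_batch)    # minibatch-averaged dL/dW
    grad_b = err.mean(axis=0)                  # minibatch-averaged dL/db
    return W - lr * grad_W, b - lr * grad_b    # move against the gradient

rng = np.random.default_rng(0)
W_true = rng.normal(size=(2, 5))               # assumed "teacher" weights
X = rng.normal(size=(1000, 5))
Y = X @ W_true.T                               # noiseless targets

W, b = np.zeros((2, 5)), np.zeros(2)           # trainable parameters
lr, M = 0.1, 32                                # hyperparameters set by the modeler
for step in range(500):
    idx = rng.choice(len(X), size=M, replace=False)   # random minibatch
    W, b = sgd_step(W, b, X[idx], Y[idx], lr)

final_loss = 0.5 * np.mean(np.sum((X @ W.T + b - Y) ** 2, axis=1))
```

In this noiseless setting the trained `W` approaches `W_true`; with noisy targets or a nonlinear network the loss would generally fluctuate from step to step instead of converging exactly.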
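For a feedforward network without any intermediate (hidden) layer [Rosenblatt, 1962] processing a single example $\mathbf{x}$ (minibatch size 1),

$$ \mathbf{y} = W\mathbf{x} + \mathbf{b}, \quad \text{or equivalently,} \quad y_i = \sum_j W_{ij} x_j + b_i, \tag{6} $$

computing the gradient is straightforward,

$$ \frac{\partial L}{\partial W_{ij}} = \sum_k \frac{\partial L}{\partial y_k} \frac{\partial y_k}{\partial W_{ij}} = \frac{\partial L}{\partial y_i} x_j, \tag{7} $$

with $\partial y_k/\partial W_{ij}$ equal to $x_j$ when $k = i$, otherwise 0. In vector notation,

$$ \frac{\partial L}{\partial W} = \frac{\partial L}{\partial \mathbf{y}} \mathbf{x}^\top. \tag{8} $$

Here we follow the convention that $\partial L/\partial W$ and $\partial L/\partial \mathbf{y}$ have the same form as $W$ and $\mathbf{y}$, respectively. Assuming that

$$ L = \frac{1}{2} \|\mathbf{y} - \mathbf{y}_{\text{target}}\|^2 = \frac{1}{2} \sum_j (y_j - y_{\text{target},j})^2, \tag{9} $$

we have,

$$ \frac{\partial L}{\partial W} = (\mathbf{y} - \mathbf{y}_{\text{target}}) \mathbf{x}^\top, \tag{10} $$
$$ \Delta W_{ij} \propto -\frac{\partial L}{\partial W_{ij}} = (y_{\text{target},i} - y_i) x_j. \tag{11} $$

This modification only depends on local information about the input and output units of each connection. Hence, if $y_{\text{target},i} > y_i$, $W_{ij}$ should change to increase the net input, and $\Delta W_{ij}$ has the same sign as $x_j$. The opposite is true if $y_{\text{target},i} < y_i$.

For a multi-layer network, the differentiation is done using the back-propagation algorithm [Rumelhart et al., 1986, LeCun, 1988]. To compute the loss $L$, the network is run in a forward pass (Eq. 1-3). Next, to efficiently compute the exact gradient $\partial L/\partial \theta$, information about the loss needs to be passed backward, in the opposite direction of the forward pass, hence the name back-propagation.

To illustrate the concept, consider an $N$-layer linear feedforward network (Eq. 1-3, but with $f(x) = x$). To compute $\partial L/\partial W^{(l)}$, we need to compute $\partial L/\partial \mathbf{r}^{(l)}$. From $\mathbf{r}^{(l+1)} = W^{(l+1)} \mathbf{r}^{(l)} + \mathbf{b}^{(l+1)}$, we have

$$ \frac{\partial L}{\partial r^{(l)}_i} = \sum_j \frac{\partial L}{\partial r^{(l+1)}_j} \frac{\partial r^{(l+1)}_j}{\partial r^{(l)}_i} = \sum_j \frac{\partial L}{\partial r^{(l+1)}_j} W^{(l+1)}_{ji} = \sum_j \left[W^{(l+1)\top}\right]_{ij} \frac{\partial L}{\partial r^{(l+1)}_j}. \tag{12} $$

In vector notation,

$$ \frac{\partial L}{\partial \mathbf{r}^{(l)}} = \left[W^{(l+1)}\right]^\top \frac{\partial L}{\partial \mathbf{r}^{(l+1)}} = \left[W^{(l+1)}\right]^\top \left[W^{(l+2)}\right]^\top \frac{\partial L}{\partial \mathbf{r}^{(l+2)}} = \cdots. \tag{13} $$

Therefore, starting with $\partial L/\partial \mathbf{y}$, $\partial L/\partial \mathbf{r}^{(l)}$ can be recursively computed from $\partial L/\partial \mathbf{r}^{(l+1)}$, for $l = N-1, \cdots, 1$. This computation flows in the opposite direction of the forward pass, and is called the backward pass.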
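The backward recursion of Eq. 13 can be checked numerically. The sketch below implements the forward and backward passes for a small linear network with the squared-error loss of Eq. 9, and compares one back-propagated gradient entry against a finite-difference estimate. The layer sizes, the random parameters, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def forward(x, Ws, bs):
    """Linear network (f(x) = x): returns the activity of every layer, ending with y."""
    rs = [x]
    for W, b in zip(Ws, bs):
        rs.append(W @ rs[-1] + b)
    return rs

def loss_fn(y, y_target):
    return 0.5 * np.sum((y - y_target) ** 2)        # Eq. 9

def backward(rs, Ws, y_target):
    """Backward pass: propagate dL/dr from the output toward the input (Eq. 13)."""
    delta = rs[-1] - y_target                       # dL/dy for the loss in Eq. 9
    grads = [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        grads[l] = np.outer(delta, rs[l])           # dL/dW = (dL/dr_out) r_in^T, as in Eq. 8
        delta = Ws[l].T @ delta                     # dL/dr_in = W^T dL/dr_out, as in Eq. 13
    return grads

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]                                # assumed layer sizes
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.normal(size=m) for m in sizes[1:]]
x, y_target = rng.normal(size=4), rng.normal(size=2)

grads = backward(forward(x, Ws, bs), Ws, y_target)

# Finite-difference estimate for one entry of the first weight matrix.
eps = 1e-6
Ws_plus = [W.copy() for W in Ws]
Ws_minus = [W.copy() for W in Ws]
Ws_plus[0][1, 2] += eps
Ws_minus[0][1, 2] -= eps
numeric = (loss_fn(forward(x, Ws_plus, bs)[-1], y_target)
           - loss_fn(forward(x, Ws_minus, bs)[-1], y_target)) / (2 * eps)
analytic = grads[0][1, 2]
```

Note that the update of `delta` uses the transposed weights of the downstream layer, which is the non-local information referred to in discussions of the biological plausibility of back-propagation.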
In general, back-propagation applies to neural networks with arbitrary differentiable components. Computing the exact gradient through back-propagation is considered biologically unrealistic because updating connections at each layer requires precise, non-local information about connection weights at downstream layers (in the form of the transposed connection matrix, Eq. 13).

2.2 Variations of learning problems/objective functions

In this and the following sections (2.3, 2.4), we introduce common variations of learning problems, network architectures, and training algorithms.

Traditionally, learning problems are divided into three kinds: supervised, reinforcement, and unsupervised learning problems. The difference across these three kinds of learning problems lies in the goal or objective. In supervised learning, each input is associated with a target. The system learns to produce outputs that match the targets. In reinforcement learning, instead of explicit (high-dimensional) targets, the system receives a series of scalar rewards. It learns to produce outputs (actions) that maximize total rewards. Unsupervised learning refers to a diverse set of problems where the system is not provided with explicit targets or rewards. Due to space limitations, we will mainly focus on networks trained with supervised learning in this Primer.

Supervised learning. As mentioned before, for supervised learning tasks, input and target output pairs $\{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}_{\text{target}})\}$ are provided. The goal is to minimize the difference between target outputs and actual outputs predicted by the network. In many common supervised learning problems, the target outputs are behavioral outputs. For example, in a typical object classification task, each input is an image containing a single object, while the target output is an integer corresponding to the class of that object (e.g., dog, cat, etc.).
In other cases, the target output can directly be neural recording data [McIntosh et al., 2016, Rajan et al., 2016, Andalman et al., 2019].

The classical perceptual decision-making task with random-dot motion [Britten et al., 1992, Roitman and Shadlen, 2002] can be formulated as a supervised learning problem, because there is a correct answer. In this task, animals watch randomly moving dots and report the dots' overall motion direction by choosing one of two alternatives, A or B. This task can be simplified as a network receiving a stream of noisy inputs $x^{(i)}_t$ at every time point $t$ of the $i$-th trial, which can represent the net evidence in support of A and against B. At the end of each trial, $t = T$, the system should learn to report the sign of the average input, $y^{(i)}_{\text{target}} = \text{sign}(\langle x^{(i)}_t \rangle_t)$, $+1$ for choice A and $-1$ for choice B.

Reinforcement learning. For reinforcement learning [Sutton and Barto, 2018], a model (an agent) interacts with an environment, such as a (virtual) maze. At time step $t$, the agent receives an observation $o_t$ from the environment, produces an action $a_t$ that updates the environment state to $s_{t+1}$, and receives a scalar reward $r_t$ (negative value for punishment). For example, a model navigating a virtual maze can receive pixel-based visual inputs as observations $o_t$, produce actions $a_t$ that move itself in the maze, and receive rewards when it exits the maze. The objective is to produce appropriate actions $a_t$, given past and present observations, that maximize cumulative rewards $\sum_t r_t$. In many classical reinforcement learning problems, the observation $o_t$ equals the environment state $s_t$, which contains complete information about the environment.

Reinforcement learning (without neural networks) has been widely used by neuroscientists and cognitive scientists to study value-based learning and decision-making tasks [Schultz et al., 1997, Daw et al., 2011, Niv, 2009].
For example, in the multi-armed bandit task, the agent chooses between multiple options repeatedly, where each option produces rewards with a certain probability. Reinforcement learning theory can model how the agent's behavior adapts over time, and help neuroscientists study the neural mechanism of value-based behavior.

Deep reinforcement learning trains deep neural networks using reinforcement learning [Mnih et al., 2015], enabling applications to many more complex problems. Deep reinforcement learning can in principle be used to study most tasks performed by lab animals [Botvinick et al., 2020], since animals are usually motivated to perform the task via rewards. Although many such tasks can also be formulated as supervised learning problems when there exists a correct choice (e.g., perceptual decision making), many other tasks can only be described as reinforcement learning tasks because answers are subjective [Haroush and Williams, 2015, Kiani and Shadlen, 2009]. For example, a perceptual decision-making task where there is a correct answer (A, not B) can be extended to assess animals' confidence about their choice [Kiani and Shadlen, 2009, Song et al., 2017]. In addition to the two alternatives that result in a large reward for the correct choice and no reward otherwise, monkeys are presented with a sure-bet option that guarantees a small reward. Since a small reward is better than no reward, subjects are more likely to choose the sure-bet option when they are less confident about making a perceptual judgement. Reinforcement learning is necessary here because there is no ground-truth choice output: the optimal choice depends on the animals' own confidence level in their perceptual decision.

Unsupervised learning. For unsupervised learning, only inputs $\{\mathbf{x}^{(i)}\}$ are provided; the objective function is defined solely with the inputs and the network parameters, $L(\mathbf{x}, \theta)$ (no targets or rewards).
For example, finding the first component in Principal Component Analysis (PCA) can be formulated as unsupervised learning in a simple neural network. A single neuron $y$ reading out from a group of input neurons $\mathbf{x}$ ($y = \mathbf{w}^\top \mathbf{x}$) can learn to extract the first principal component by maximizing its variance $\text{Var}(y)$ while keeping its connection weights normalized ($\|\mathbf{w}\| = 1$) [Oja, 1982].

Unsupervised learning is particularly relevant for modeling the development of sensory cortices. Although widely used in machine learning, the kind of labeled data needed for supervised learning, such as image-object class pairs, is rare for most animals. Unsupervised learning has been used to explain neural responses of early visual areas [Barlow et al., 1961, Olshausen and Field, 1996], and more recently, of higher visual areas [Zhuang et al., 2019].

Compared to reinforcement and unsupervised learning, supervised learning can be particularly effective because the network receives more informative feedback in the form of high-dimensional target outputs. Therefore, it is common to formulate a reinforcement/unsupervised learning problem (or parts of it) as a supervised one. For example, consider an unsupervised learning problem of compressing high-dimensional inputs $\mathbf{x}$ into a lower-dimensional representation $\mathbf{z}$ while retaining as much information as possible about the inputs (not necessarily in the information-theoretic sense). One approach to this problem is to train autoencoder networks [Rumelhart et al., 1986, Kingma and Welling, 2013] using supervised learning. An autoencoder consists of an encoder that maps input $\mathbf{x}$ into a low-dimensional latent representation $\mathbf{z} = f_{\text{encode}}(\mathbf{x})$, and a decoder that maps the latent representation back to a high-dimensional representation $\mathbf{y} = f_{\text{decode}}(\mathbf{z})$. To make sure $\mathbf{z}$ contains information about $\mathbf{x}$, autoencoders use the original input as the supervised learning target, $\mathbf{y}_{\text{target}} = \mathbf{x}$.

2.3 Variations of network architectures

Recurrent neural network. Besides the MLP, another fundamental ANN architecture is the recurrent neural network (RNN), which processes information in time (Figure 2B). In a "vanilla" or Elman RNN [Elman, 1990], the activity of model neurons at time $t$, $\mathbf{r}_t$, is driven by recurrent connectivity $W_r$, and by inputs $\mathbf{x}_t$ through connectivity $W_x$. The output of the network is read out through connections $W_y$.

$$ \mathbf{c}_t = W_r \mathbf{r}_{t-1} + W_x \mathbf{x}_t + \mathbf{b}_r, \tag{14} $$
$$ \mathbf{r}_t = f(\mathbf{c}_t), \tag{15} $$
$$ \mathbf{y}_t = W_y \mathbf{r}_t + \mathbf{b}_y. \tag{16} $$

Here $\mathbf{c}_t$ represents the cell state, analogous to membrane potential or input current, while $\mathbf{r}_t$ represents the neuronal activity. An RNN can be unrolled in time (Figure 2C) and viewed as a particular form of an MLP,

$$ \mathbf{r}_t = f(W_r \mathbf{r}_{t-1} + W_x \mathbf{x}_t + \mathbf{b}_r), \quad \text{for } t = 1, \cdots, T. \tag{17} $$

Here, neurons in the $t$-th layer, $\mathbf{r}_t$, receive inputs from the $(t-1)$-th layer, $\mathbf{r}_{t-1}$, and additional inputs from outside of the recurrent network, $\mathbf{x}_t$. Unlike regular MLPs, the connections from each layer to the next are shared across time.

Backpropagation also applies to an RNN. While backpropagation in an MLP propagates gradient information from the final layer back (Eq. 13), computing the gradient for an RNN involves propagating information backward in time (backpropagation-through-time, or BPTT) [Werbos, 1990]. Assuming a linear activation function and a loss computed from outputs at the last time point $T$, the key step of backpropagation-through-time is computed similarly to Eq. 13 as

$$ \frac{\partial L}{\partial \mathbf{r}_t} = W_r^\top \frac{\partial L}{\partial \mathbf{r}_{t+1}} = \left[W_r^\top\right]^2 \frac{\partial L}{\partial \mathbf{r}_{t+2}} = \cdots. \tag{18} $$

With an increasing number of time steps in an RNN, weight modifications involve products of many matrices (Eq. 18). An analogous problem is present for very deep feedforward networks (for example, networks with more than 10 layers).
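The recurrent update of Eqs. 14-16 and the repeated matrix product in Eq. 18 can both be sketched in a few lines of NumPy. The network sizes, the tanh nonlinearity, the zero initial activity, and the use of a scaled orthogonal matrix for $W_r$ are assumptions made here so that the norm of the product is easy to predict; the function names are hypothetical.

```python
import numpy as np

def rnn_forward(xs, W_r, W_x, W_y, b_r, b_y, f=np.tanh):
    """Run a vanilla RNN over inputs xs of shape (T, n_in); returns outputs (T, n_out)."""
    r = np.zeros(W_r.shape[0])            # assumed initial activity r_0 = 0
    ys = []
    for x_t in xs:
        c = W_r @ r + W_x @ x_t + b_r     # Eq. 14: the same weights are used at every step
        r = f(c)                          # Eq. 15
        ys.append(W_y @ r + b_y)          # Eq. 16
    return np.array(ys)

def backprop_factor_norm(W_r, n_steps):
    """Spectral norm of [W_r^T]^n, the factor multiplying dL/dr in Eq. 18 (linear RNN)."""
    return np.linalg.norm(np.linalg.matrix_power(W_r.T, n_steps), ord=2)

rng = np.random.default_rng(0)
n, n_in, n_out, T = 8, 3, 2, 50
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # orthogonal matrix: all singular values equal 1
xs = rng.normal(size=(T, n_in))
ys = rnn_forward(xs, 0.9 * Q, rng.normal(size=(n, n_in)),
                 rng.normal(size=(n_out, n)), np.zeros(n), np.zeros(n_out))

# For W_r = scale * Q, the norm of the T-step product is exactly scale**T.
norms = {scale: backprop_factor_norm(scale * Q, T) for scale in (0.9, 1.0, 1.1)}
```

With 50 time steps, a modest change in the scale of the recurrent weights changes the size of the back-propagated signal by several orders of magnitude.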
The norm of this matrix product, $\|[W_r^\top]^T\|$, can grow exponentially with $T$ if $W_r$ is large (more precisely, if the largest eigenvalue of $W_r$ is greater than 1), or vanish to zero if $W_r$ is small, making it historically difficult to train recurrent networks [Bengio et al., 1994, Pascanu et al., 2013]. Such exploding and vanishing gradient problems can be substantially alleviated with a combination of modern techniques, including network architectures [Hochreiter and Schmidhuber, 1997, He et al., 2016] and initial network connectivity [Le et al., 2015, He et al., 2015] that tend to preserve the norm of the backpropagated gradient.

Convolutional neural networks. A particularly important type of network architecture is the convolutional neural network (Figure 2D). The use of convolution means that a group of neurons will each process its respective inputs using the same function, in other words, the same set of connection weights. In a typical convolutional neural network processing visual inputs [Fukushima et al., 1983, LeCun et al., 1990, Krizhevsky et al., 2012, He et al., 2016], neurons are organized into $N_{\text{channel}}$ "channels" or "feature maps". Each channel contains $N_{\text{height}} \times N_{\text{width}}$ neurons with different spatial selectivity. Each neuron in a convolutional layer is indexed by a tuple $i = (i_C, i_H, i_W)$, representing the channel index ($i_C$) and the spatial preference indices ($i_H, i_W$). The $i$-th neuron in layer $l$ is typically driven by neurons in the previous layer (bias term and activation function omitted),

$$ r^{(l)}_{i_C i_H i_W} = \sum_{j_C j_H j_W} W^{(l)}_{i_C i_H i_W,\, j_C j_H j_W}\, r^{(l-1)}_{j_C j_H j_W}. \tag{19} $$

Importantly, in convolutional networks, the connection weights do not depend on the absolute spatial location of the $i$-th neuron; instead, they depend solely on the spatial displacement $(i_H - j_H, i_W - j_W)$ between the pre- and post-synaptic neurons.

$$ W^{(l)}_{i_C i_H i_W,\, j_C j_H j_W} = W^{(l)}_{i_C, j_C}(i_H - j_H, i_W - j_W). \tag{20} $$

Therefore, all neurons within a single channel process different parts of the input space using the same shared set of connection weights, allowing these neurons to have the same stimulus selectivity with receptive fields at different spatial locations. Moreover, neurons only receive inputs from other neurons with similar spatial preferences, i.e., when $|i_H - j_H|$ and $|i_W - j_W|$ values are small (Figure 2D).

This reuse of weights not only dramatically reduces the number of trainable parameters, but also imposes invariance on processing. For visual processing, convolutional networks typically impose spatial invariance such that objects are processed with the same set of weights regardless of their spatial positions.

In a typical convolutional network, across layers the number of neurons per channel ($N_{\text{height}} \times N_{\text{width}}$) decreases (with coarser spatial resolution) while more features are extracted (with an increasing number of channels). A classifier is commonly placed at the end of the system to learn a particular task, such as categorization of visual objects.

Activation function. Most neurons in ANNs, like their biological counterparts, perform nonlinear computations based on their inputs. These neurons are usually point neurons with a single nonlinear activation function $f(\cdot)$ that links the sum of inputs to the output activity. The nonlinearity is essential for the power of ANNs [Hornik et al., 1989]. A common choice of activation function is the Rectified Linear Unit (ReLU) function, $f(x) = \max(x, 0)$ [Glorot et al., 2011]. The derivative of ReLU at $x = 0$ is mathematically undefined, but conventionally set to 0 in practice. ReLU and its variants [Clevert et al., 2015] are routinely used in feedforward networks, while the hyperbolic tangent (tanh) function is often used in recurrent networks [Hochreiter and Schmidhuber, 1997].
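The two activation functions just mentioned, together with their derivatives as used during back-propagation, can be written down directly. This is a minimal NumPy sketch; the function names are chosen here for illustration, and the derivative of ReLU at zero follows the convention stated above.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    """Derivative of ReLU; the undefined value at x = 0 is set to 0 by convention."""
    return (x > 0).astype(float)

def tanh_grad(x):
    """Derivative of tanh, which shrinks toward 0 for inputs of large magnitude."""
    return 1.0 - np.tanh(x) ** 2

x = np.array([-2.0, 0.0, 3.0, 10.0])
relu_out, relu_slope, tanh_slope = relu(x), relu_grad(x), tanh_grad(x)
```

For the large input `10.0`, the slope of ReLU stays at 1 whereas the slope of tanh is close to 0.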
ReLU and similar activation functions are asymmetric and non-saturating at high values. Although biological neurons eventually saturate at high rates, they often operate in non-saturating regimes. Therefore, traditional neural circuit models with rate units have also frequently used non-saturating activation functions [Abbott and Chance, 2005, Rubin et al., 2015].

Normalization. Normalization methods are important components of many ANNs, in particular very deep neural networks [Ioffe and Szegedy, 2015, Ba et al., 2016b, Wu and He, 2018]. Similar to normalization in biological neural circuits [Carandini and Heeger, 2012], normalization methods in ANNs keep inputs and/or outputs of neurons in desirable ranges. For example, for inputs $\mathbf{x}$ (e.g., stimulus) to a layer, Layer Normalization [Ba et al., 2016b] amounts to a form of "z-scoring" across units, so that the actual input $\hat{x}_i$ to the $i$-th neuron is

$$ \hat{x}_i = \gamma \frac{x_i - \mu}{\sigma} + \beta, \tag{21} $$
$$ \mu = \langle x_j \rangle, \tag{22} $$
$$ \sigma^2 = \langle (x_j - \mu)^2 \rangle + \epsilon, \tag{23} $$

where $\langle x_j \rangle$ refers to the average over all units in the same layer; $\mu$ and $\sigma^2$ are the mean and variance of $\mathbf{x}$. After normalization, different external inputs lead to the same mean and variance for $\hat{\mathbf{x}}$, set by the trainable parameters $\gamma$ and $\beta$. The values of $\gamma$ and $\beta$ do not depend on the external inputs. The small constant $\epsilon$ ensures that $\sigma$ is not vanishingly small.

2.4 Variations of training algorithms

Variants of SGD-based methods. Supervised, reinforcement, and unsupervised learning tasks can all be trained with SGD-based methods. Partly due to the stochastic nature of the estimated gradient, directly applying SGD (Eq. 5) often leads to poor training performance. Gradually decaying the learning rate $\eta$ during training can often improve performance, since a smaller learning rate during late training encourages finer tuning of parameters [Bottou et al., 2018]. Various optimization methods based on SGD are used to improve learning [Kingma and Ba, 2014, Sutskever et al., 2013].
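Returning briefly to the layer normalization of Eqs. 21-23, the following NumPy sketch shows that two inputs with very different statistics end up with the same mean and spread after normalization. It is a simplified illustration: the gain and shift are taken as scalars rather than trained per-unit parameters, and the value of the small constant `eps` and the example inputs are assumptions.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Z-score x across the units of a layer, then rescale and shift (Eqs. 21-23)."""
    mu = x.mean()                                    # Eq. 22: mean over units
    sigma = np.sqrt(((x - mu) ** 2).mean() + eps)    # Eq. 23: variance plus a small constant
    return gamma * (x - mu) / sigma + beta           # Eq. 21

rng = np.random.default_rng(0)
weak = rng.normal(loc=0.5, scale=0.1, size=200)      # weak, low-variance input
strong = rng.normal(loc=20.0, scale=5.0, size=200)   # strong, high-variance input

out_weak = layer_norm(weak, gamma=2.0, beta=1.0)
out_strong = layer_norm(strong, gamma=2.0, beta=1.0)
```

In this sketch the mean of each normalized vector equals `beta` and its standard deviation is close to `gamma`, regardless of the scale of the raw input.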
One simple and effective technique is momentum [Sutskever et al., 2013, Polyak, 1964], which on step $j$ updates parameters with $\Delta\theta^{(j)}$ based on temporally smoothed gradients $v^{(j)}$,

$$v^{(j)} = \mu v^{(j-1)} + \frac{\partial L^{(j)}}{\partial \theta}, \quad 0 < \mu < 1, \tag{24}$$

$$\Delta\theta^{(j)} = -\eta v^{(j)}. \tag{25}$$

Alternatively, in adaptive learning rate methods [Duchi et al., 2011, Kingma and Ba, 2014], the learning rate of each individual parameter is adjusted based on the statistics (e.g., mean and variance) of its gradient over training steps. For example, in the Adam method [Kingma and Ba, 2014], the value of a parameter update is magnified if its gradient has been consistent across steps (low variance). Adaptive learning rate methods can be viewed as approximately taking into account the curvature of the loss function [Duchi et al., 2011].

Regularization. Regularization techniques are important during training in order to improve the generalization performance of deep networks. Adding an L2 regularization term, $L_{\text{reg}} = \lambda \sum_{ij} W_{ij}^2$, to the loss function [Tikhonov, 1943] (equivalent to weight decay [Krogh and Hertz, 1992]) discourages the network from using large connection weights, which can improve generalization by implicitly limiting model complexity. Dropout [Srivastava et al., 2014] silences a randomly selected portion of neurons at each step of training. It reduces the network’s reliance on particular neurons or a precise combination of neurons. Dropout can be thought of as loosely approximating spiking noise.

The choice of hyperparameters (learning rate, batch size, network initialization, etc.) is often guided by a combination of theory, empirical evidence, and hardware constraints. For neuroscientific applications, it is important that the scientific conclusions do not rely heavily on the hyperparameter choices. If they do, the dependency should be clearly documented.

3 Examples of building ANNs to address neuroscience questions

In this section, we overview two common usages of ANNs in addressing neuroscience questions.
3.1 Convolutional networks for visual systems

Deep convolutional neural networks are currently the standard tools in computer vision research and applications [Krizhevsky et al., 2012, Simonyan and Zisserman, 2014, He et al., 2016, 2017]. These networks routinely consist of tens, sometimes hundreds, of layers of convolutional processing. Effective training of deep feedforward neural networks used to be difficult. This trainability problem has been drastically improved by a combination of innovations in various areas. Modern deep networks would be too large and therefore too slow to run, not to mention train, if not for the rapid development of hardware such as general purpose GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units) [Jouppi et al., 2017]. Deep convolutional networks are usually trained with large naturalistic datasets containing millions of high resolution labeled images (e.g., ImageNet [Deng et al., 2009]), using training methods with adaptive learning rates [Kingma and Ba, 2014, Tieleman and Hinton, 2012]. Besides the default use of convolution, a wide range of network architecture innovations improve performance, including the adoption of the ReLU activation function [Glorot et al., 2011], normalization methods [Ioffe and Szegedy, 2015], and the use of residual connections that can provide an architectural shortcut from a network layer’s inputs directly to its outputs [He et al., 2016].

Deep convolutional networks have been proposed as computational models of the visual systems, particularly of the ventral visual stream or the “what pathway” for visual object information processing (Figure 3) [Yamins and DiCarlo, 2016]. These models are typically trained using supervised learning on the same image classification tasks as the ones used in computer vision research, and in many cases, are the exact same convolutional networks developed in computer vision.
In comparison, classical models of the visual systems typically rely on hand-designed features (synaptic weights) [Jones and Palmer, 1987, Freeman and Simoncelli, 2011, Riesenhuber and Poggio, 1999], such as Gabor filters, or are trained with unsupervised learning based on the efficient coding principles [Barlow et al., 1961, Olshausen and Field, 1996]. Although classical models have had success at explaining various features of lower-level visual areas, deep convolutional networks surpass them substantially in explaining neural activity in higher-level visual areas in both monkeys [Yamins et al., 2014, Cadieu et al., 2014, Yamins and DiCarlo, 2016] and humans [Khaligh-Razavi and Kriegeskorte, 2014]. Besides being trained to classify objects, convolutional networks can also be trained to directly reproduce patterns of neural activity recorded in various visual areas [McIntosh et al., 2016, Prenger et al., 2004].

Figure 3: Comparing the visual system and deep convolutional neural networks. The same image is passed through a monkey’s visual cortex (top) and a deep convolutional neural network (bottom), allowing for side-by-side comparisons between biological and artificial neural networks. Neural responses from IT are best predicted by responses from the final layer of the convolutional network, while neural responses from V4 are better predicted by an intermediate network layer (green dashed arrows). Figure adapted from Yamins and DiCarlo [2016].

In a classic study comparing convolutional networks with higher visual areas [Yamins et al., 2014], Yamins and colleagues trained thousands of convolutional networks with different architectures on a visual categorization task. To study how similar the artificial and biological visual systems are, they quantified how well the network’s responses to naturalistic images can be used to linearly predict responses from the inferior temporal (IT) cortex of monkeys viewing the same images.
They found that this neural predictivity is highly correlated with accuracy on the categorization task, suggesting that better IT-predicting models can be built by developing better performing models on challenging natural image classification tasks. They further found that unlike IT, neural responses from the relatively lower visual area, V4, are best predicted by intermediate layers of the networks (Figure 3).

As computational models of visual systems, convolutional networks can model complex, high-dimensional inputs to downstream areas, which is useful for large-scale models using pixel-based visual inputs [Eliasmith et al., 2012]. This process has been made particularly straightforward by the easy access to many pre-trained networks in standard deep learning frameworks like PyTorch [Paszke et al., 2019] and TensorFlow [Abadi et al., 2016].

3.2 Recurrent neural networks for cognitive and motor systems

Recurrent neural networks are common machine learning tools to process sequences, such as speech and text. In neuroscience, they have been used to model various aspects of the cognitive, motor, and navigation systems [Mante et al., 2013, Barak et al., 2013, Sussillo et al., 2015, Yang et al., 2019, Wang et al., 2018, Cueva and Wei, 2018]. Unlike convolutional networks used to model visual systems that are trained on large-scale image classification tasks, recurrent networks are usually trained on specific cognitive or motor tasks that neuroscientists are studying. By training RNNs on the same tasks that animals or humans performed, side-by-side comparisons can be made between RNNs and brains. The comparisons can be made at many levels, including single-neuron activity and selectivity, population decoding, state-space dynamics, and network responses to perturbations. We will expand more on how to analyze RNNs in the next section.
An influential work that uses RNNs to model cognition involves a monkey experiment for context-dependent perceptual decision-making [Mante et al., 2013]. In this task, a fraction (called motion coherence) of random moving dots moves in the same direction (left or right); independently, a fraction (color coherence) of dots are red, and the rest are green. In a single trial, subjects were cued by a context signal to perform either a motion task (judging whether the net motion direction is right or left) or a color task (deciding whether there are more red dots than green ones). Monkeys performed the task by temporally integrating evidence for behaviorally relevant information (e.g., color) while ignoring the irrelevant feature (motion direction in the color task). Neurons in the prefrontal cortex recorded from behaving animals displayed complex activity patterns, where the irrelevant features are still strongly represented, even though they weakly influence behavioral choices. These counter-intuitive activity patterns were nevertheless captured by an RNN [Mante et al., 2013]. Examining the RNN dynamics revealed a novel mechanism by which the irrelevant features are represented, but selectively filtered out and not integrated over time during evidence accumulation.

To better compare neural dynamics between RNNs and biological systems, RNNs used in neuroscience often treat time differently from their counterparts in machine learning. RNNs in machine learning are nearly always discrete time systems (but see Chen et al. [2018]), where the state at time step $t$ is obtained through a mapping from the state at time step $t - 1$ (Eq. 15). The use of a discrete time system means that stimuli that are separated by several seconds in real life can be provided to the network in consecutive time points.
To allow for more biologically realistic neural dynamics, RNNs used in neuroscience are often based on continuous time dynamical systems [Wilson and Cowan, 1972, Sompolinsky et al., 1988], such as

$$\tau \frac{d\mathbf{r}}{dt} = -\mathbf{r}(t) + f(W_r \mathbf{r}(t) + W_x \mathbf{x}(t) + \mathbf{b}_r). \tag{26}$$

Here $\tau$ is the single-unit time scale. This continuous-time system can then be discretized using the Euler method with a time step of $\Delta t$ ($< \tau$),

$$\mathbf{r}(t + \Delta t) \approx \mathbf{r}(t) + \frac{\Delta t}{\tau}\left[-\mathbf{r}(t) + f(W_r \mathbf{r}(t) + W_x \mathbf{x}(t) + \mathbf{b}_r)\right]. \tag{27}$$

Besides gradient descent through back-propagation, a different line of algorithms has been used to train RNN models in neuroscience [Sussillo and Abbott, 2009, Laje and Buonomano, 2013, Andalman et al., 2019]. These algorithms are based on the idea of harnessing chaotic systems with weak perturbations [Jaeger and Haas, 2004]. In particular, the FORCE algorithm [Sussillo and Abbott, 2009] allows for rapid learning by modifying the output connections of an RNN to match the target using a recursive least squares algorithm. The network output $y(t)$ (assumed to be one-dimensional here) is fed back to the RNN through $\mathbf{w}_{fb}$,

$$\tau \frac{d\mathbf{r}}{dt} = -\mathbf{r}(t) + f(W_r \mathbf{r}(t) + W_x \mathbf{x}(t) + \mathbf{w}_{fb}\, y(t) + \mathbf{b}_r), \tag{28}$$

$$y(t) = \mathbf{w}^\top \mathbf{r}(t). \tag{29}$$

Therefore modifying the output connections amounts to a low-rank modification ($\mathbf{w}_{fb}\mathbf{w}^\top$) of the recurrent connection matrix,

$$\tau \frac{d\mathbf{r}}{dt} = -\mathbf{r}(t) + f([W_r + \mathbf{w}_{fb}\mathbf{w}^\top]\mathbf{r}(t) + W_x \mathbf{x}(t) + \mathbf{b}_r). \tag{30}$$

4 Analyzing and understanding ANNs

Common ANNs used in ML or neuroscience are not easily interpretable. For many neuroscience problems, they may serve better as model systems that await further analyses. Successful training of an ANN on a task does not mean knowing how the system works. Therefore, unlike in most ML applications, a trained ANN is not the end goal but merely the prerequisite for analyzing that network to gain understanding.

Most systems neuroscience techniques to investigate biological neural circuits can be directly applied to understand artificial networks.
To facilitate side-by-side comparison between artificial and biological neural networks, the activity of an ANN can be visualized and analyzed with the same dimensionality reduction tools (e.g., PCA) used for biological recordings [Mante et al., 2013, Kobak et al., 2016, Williams et al., 2018]. To understand the causal relationship from neurons to behavior, an arbitrary set of neurons can be lesioned [Yang et al., 2019], or inactivated for a short duration akin to optogenetic manipulation in physiological experiments. Similarly, connections between two selected groups of neurons can be lesioned to understand the causal contribution of cross-population interactions [Andalman et al., 2019].

In this section, we focus on methods that are particularly useful for analyzing ANNs. These methods include optimization-based tuning analysis [Erhan et al., 2009], fixed-point-based dynamical system analysis [Sussillo and Barak, 2013], quantitative comparisons between a model and experimental data [Yamins et al., 2014], and insights from the perspective of biological evolution [Lindsey et al., 2019, Richards et al., 2019].

Similarity comparison. Analysis methods such as visualization, lesioning, tuning, and fixed-point analysis can offer detailed intuition into neural mechanisms of individual networks. However, with the relative ease of training ANNs, it is possible to train a large number of neural networks for the same task or dataset [Maheswaranathan et al., 2019, Yamins et al., 2014]. With such a volume of data, it is necessary to take advantage of high-throughput quantitative methods that compare different models at scale. Similarity comparison methods compute a scalar similarity score between the neural activity of two networks performing the same task [Kriegeskorte et al., 2008, Kornblith et al., 2019]. These methods are agnostic about the network form and size, and can be applied to artificial and biological networks alike.
Consider two networks (or two populations of neurons), sized $N_1$ and $N_2$ respectively. Their neural activity in response to the same $D$ task conditions can be summarized by a $D$-by-$N_1$ matrix $R_1$ and a $D$-by-$N_2$ matrix $R_2$ (Figure 4A). Representational similarity analysis (RSA) [Kriegeskorte et al., 2008] first computes the dissimilarity or distances of neural responses between different task conditions within each network, yielding a $D$-by-$D$ dissimilarity matrix for each network (Figure 4B). Next, the correlation between dissimilarity matrices of the two networks is computed. A higher correlation corresponds to more similar representations.

Another related line of methods uses linear regression (as used in [Yamins et al., 2014]) to predict $R_2$ through a linear transformation of $R_1$, $R_2 \approx R_1 W$, where $W$ is an $N_1$-by-$N_2$ matrix. The similarity corresponds to the correlation between $R_2$ and its predicted value $R_1 W$.

Complex tuning analysis. Studying tuning properties of single neurons has been one of the most important analysis techniques in neuroscience [Kuffler, 1953]. Classically, tuning properties are studied in sensory areas by showing stimuli parameterized in a low dimensional space (e.g., oriented bars or gratings in vision [Hubel and Wiesel, 1959]). This method is most effective when the neurons studied have relatively simple response properties. A new class of methods treats the mapping of tuning as a high-dimensional optimization problem and directly searches for the stimulus that most strongly activates a neuron. Gradient-free methods such as genetic algorithms have been used to study complex tuning of biological neurons [Yamane et al., 2008]. In deep neural networks, gradient-based methods can be used [Erhan et al., 2009, Zeiler and Fergus, 2014]. For a neuron with activity $r(\mathbf{x})$ given input $\mathbf{x}$, a gradient-ascent optimization starts with a random $\mathbf{x}_0$, and proceeds by updating the input $\mathbf{x}$ as

$$\mathbf{x} \rightarrow \mathbf{x} + \Delta\mathbf{x}, \quad \Delta\mathbf{x} = \eta \frac{\partial r}{\partial \mathbf{x}}. \tag{31}$$

This method can be used for searching the preferred input to any neuron or any population of neurons in a deep network [Erhan et al., 2009, Bashivan et al., 2019]; see Figure 4C for an example. It is particularly useful for studying neurons in higher layers that have more complex tuning properties.

The space of $\mathbf{x}$ may be too high dimensional (e.g., pixel space) for conducting an effective search, especially for gradient-free methods. In that case, we may utilize a lower dimensional space that is still highly expressive. A generative model learns a function that maps a lower-dimensional latent space to a high dimensional space such as pixel space [Kingma and Welling, 2013, Goodfellow et al., 2014]. Then the search can be conducted instead in the lower-dimensional latent space [Ponce et al., 2019].

Figure 4: Convolutional neural network responses and tuning. (A) The neural response to an image in a convolutional neural network trained to classify handwritten digits. The network consists of two layers of convolutional processing, followed by two fully-connected layers. (B) Dissimilarity matrices (each $D$-by-$D$) assessing how similar or dissimilar the neural responses to different input images are. Dissimilarity matrices are computed for neurons in layers 1 and 4 of the network. $D = 50$. Images are organized by class (0, 1, etc.), 5 images per class. Neural responses to images in the same class are more similar, i.e., the neural representation is more category-based, in layer 4 (right) than in layer 1 (left). (C) Preferred image stimuli found through gradient-based optimization for sample neurons from each layer. Layers 1 and 2 are convolutional, therefore their neurons have localized preferred stimuli. In contrast, neurons from layers 3 and 4 have non-local preferred stimuli.

ANNs can be used to build models for complex behavior that would not be easily done otherwise, opening up new possibilities such as studying the encoding of more abstract forms of information. For example, Yang et al. [2019] studied neural tuning of task structure, rather than stimuli, in rule-guided problem solving. An ANN was trained to perform many different cognitive tasks commonly used in animal experiments, including perceptual decision making, working memory, inhibitory control, and categorization. Complex network organization is formed by training, in which recurrent neurons display selectivity for a subset of tasks (Figure 5).

Figure 5: Analyzing tuning properties of a neural network trained to perform 20 cognitive tasks. In a network trained on multiple cognitive tasks, the tuning of model units to individual tasks can be quantified. x-axis: recurrent units; y-axis: different tasks. Color measures the degree (between 0 and 1) to which each unit is engaged in a task. Twelve clusters are identified using a hierarchical clustering method (bottom, colored bars). For instance, cluster 3 is highly selective for pro- versus anti-response tasks (Anti) involving inhibitory control; clusters 10 and 11 are involved in delayed match-to-sample (DMS) and delayed non-match-to-sample (DNMS), respectively; cluster 12 is tuned to DMC. Figure adapted from Yang et al. [2019].

Dynamical systems analysis. Tuning properties provide a mostly static view of neural representation and computation. To understand how neural networks compute and process information in time, it is useful to study the dynamics of RNNs [Mante et al., 2013, Sussillo and Barak, 2013, Goudar and Buonomano, 2018, Chaisangmongkon et al., 2017]. One useful method to understand dynamics is to study fixed points and network dynamics around them [Strogatz, 2001]. In a generic dynamical system,

$$\frac{d\mathbf{r}}{dt} = F(\mathbf{r}), \tag{32}$$

a fixed point $\mathbf{r}_{ss}$ is a steady state where the state does not change in time, $F(\mathbf{r}_{ss}) = 0$.
The network dynamics at a state $\mathbf{r} = \mathbf{r}_{ss} + \Delta\mathbf{r}$ around a fixed point $\mathbf{r}_{ss}$ is approximately linear,

$$\frac{d\mathbf{r}}{dt} = F(\mathbf{r}) = F(\mathbf{r}_{ss} + \Delta\mathbf{r}) \approx F(\mathbf{r}_{ss}) + J(\mathbf{r}_{ss})\Delta\mathbf{r}, \qquad \frac{d\Delta\mathbf{r}}{dt} = J(\mathbf{r}_{ss})\Delta\mathbf{r}, \tag{33}$$

where $J$ is the Jacobian of $F$, $J_{ij} = \partial F_i / \partial r_j$, evaluated at $\mathbf{r}_{ss}$. The second relation follows because $F(\mathbf{r}_{ss}) = 0$ and $\mathbf{r}_{ss}$ is constant in time. This is a linear system which can be understood more easily, for example, by studying the eigenvectors and eigenvalues of $J(\mathbf{r}_{ss})$. In ANNs, these fixed points can be found by gradient-based optimization [Sussillo and Barak, 2013],

$$\arg\min_{\mathbf{r}} \|F(\mathbf{r})\|. \tag{34}$$

Fixed points are particularly useful for understanding how networks store memories, accumulate information [Mante et al., 2013], and transition between discrete states [Chaisangmongkon et al., 2017]. This point can be illustrated in a network trained to perform a parametric working memory task [Romo et al., 1999]. In this task, a sample vibrotactile stimulus at frequency $f_1$ is shown, followed by a delay period of a few seconds; then a test stimulus at frequency $f_2$ is presented, and subjects must decide whether $f_2$ is higher or lower than $f_1$ (Figure 6A). During the delay, neurons in the prefrontal cortex of behaving monkeys showed persistent activity at a rate that monotonically varies with $f_1$. This parametric working memory encoding emerges from training in an RNN (Figure 6B): in the state-space of this network, neural trajectories during the delay period converge to different fixed points depending on the stored value. These fixed points form an approximate line attractor [Seung, 1996] during the delay period (Figure 6C).

There is a dearth of examples in computational neuroscience that account for not just a single aspect of neural representation or dynamics, but a sequence of computations to achieve a complex task. ANNs offer a new tool to confront this difficulty. Chaisangmongkon et al. [2017] used this approach to build a model for delayed match-to-category (DMC) tasks.
A DMC task (Figure 6D,E) starts with a stimulus sample, say a visual moving pattern, of which a feature (motion direction as an analog quantity from 0 to 360 degrees) is classified into two categories (A in red, B in blue). After a mnemonic delay period, a test stimulus is shown and the task is to decide whether the test has the same category membership as the sample [Freedman and Assad, 2006]. After training to perform this task, a recurrent neural network shows diverse neural activity patterns similar to parietal neurons in monkeys doing the same task (Figure 6F). The trajectory of the recurrent neural population in the state space reveals how computation is carried out through epochs of the task (Figure 6G).

Understanding neural circuits from objectives, architecture, and training. All of the above methods seek a mechanistic understanding of ANNs after training. A more integrative view links the three basic ingredients in deep learning (learning problem (tasks/objectives), network architecture, and training algorithm) to the solution after training [Richards et al., 2019]. This approach is similar to an evolutionary or developmental perspective in biology, which links environments to functions in biological organisms. It can help explain the computational benefit or necessity of observed structures or functions. For example, compared to purely feedforward networks, recurrently-connected deep networks are better at predicting responses of higher visual area neurons to behaviorally challenging images of cluttered scenes [Kar et al., 2019]. This suggests a contribution of recurrent connections to classifying difficult images in the brain.

Figure 6: Understanding network computation through state-space and dynamical system analysis. (A-C) In a simple parametric working memory task [Romo et al., 1999], the network needs to memorize the (frequency) value of a stimulus through a delay period (A).
The network can achieve such parametric working memory by developing a line attractor (B,C). (B) Trial-averaged neural activity during the delay period in the PCA space for different stimulus values. Triangles indicate the start of the delay period. (C) Fixed points found through optimization (orange crosses). The direction of a line attractor can be estimated by finding the eigenvector with a corresponding eigenvalue close to 0. The orange line shows the line attractor estimated around one of the fixed points. (D-G) Training both recurrent neural networks and monkeys on a delayed-match-to-category task [Freedman and Assad, 2006]. (D) The task is to decide whether the test and sample stimuli (visual moving pattern) belong to the same category. (E) The two categories are defined based on the motion direction of the stimulus (red: category 1; blue: category 2). (F) In an ANN trained to perform this categorization task, the recurrent units of the model display a wide heterogeneity of onset time for category selectivity, similar to single neurons recorded from monkey posterior parietal cortex (lateral intraparietal area, LIP) during the task. (G) Neural dynamics of a recurrent neural network underlying the performance of the DMC task. The final decision, match (AA or BB) or non-match (AB or BA), corresponds to distinct attractor states located at separate positions in the state space. Similar trajectories of population activity have been found in experimental data. Figure adapted from Chaisangmongkon et al. [2017].

While re-running the biological processes of development and evolution may be difficult, re-training networks with different objectives, architectures, and algorithms is fairly straightforward thanks to recent advances in ML.
Whenever training of an ANN leads to a conclusion, it is good practice to vary hyperparameters describing the basic ingredients (to a reasonable degree) to explore the necessary and sufficient conditions for the conclusion [Orhan and Ma, 2019, Yang et al., 2019, Lindsey et al., 2019].

The link from the three ingredients to the network solution is typically not rigorous. However, in certain simplified cases, the link can be firmly established by solving the training process analytically [Saxe et al., 2013, 2019b].

5 Biologically realistic network architectures and learning

Although neuroscientists and cognitive scientists have had much success with standard neural network architectures (vanilla RNNs) and training algorithms (e.g., SGD) used in machine learning, for many neuroscience questions, it is critical to build network architectures and utilize learning algorithms that are biologically plausible. In this section, we outline methods to build networks with more biologically realistic structures, canonical computations, and plasticity rules.

5.1 Structured connections

Modern neurophysiological experiments routinely record from multiple brain areas and/or multiple cell types during the same animal behavior. Computational efforts modeling these findings can be greatly facilitated by incorporating into neural networks fundamental biological structures, such as currently-known cell-type-specific connectivity and long-range connections across model areas/layers.

In common recurrent networks, the default connectivity is all-to-all. In contrast, both local and long-range connectivity in biological neural systems are usually sparse. One way to have a sparse connectivity matrix $W$ is by element-wise multiplying a trainable matrix $\widetilde{W}$ with a non-trainable sparse mask $M$, namely $W = \widetilde{W} \odot M$. To encourage sparsity without strictly imposing it, an L1 regularization term $\lambda \sum_{ij} |W_{ij}|$ can be added to the loss function.
The scalar coefficient $\lambda$ controls the strength of the sparsity constraint.

To model cell-type-specific findings, it is important to build neural networks with multiple cell types. A vanilla recurrent network (Eq. 15) (or any other network) can be easily modified to obey Dale’s law by separating excitatory and inhibitory neurons [Song et al., 2016],

$$\tau \frac{d\mathbf{r}^E}{dt} = -\mathbf{r}^E + f_E(W_{EE}\mathbf{r}^E - W_{EI}\mathbf{r}^I + W_{Ex}\mathbf{x} + \mathbf{b}^E), \tag{35}$$

$$\tau \frac{d\mathbf{r}^I}{dt} = -\mathbf{r}^I + f_I(W_{IE}\mathbf{r}^E - W_{II}\mathbf{r}^I + W_{Ix}\mathbf{x} + \mathbf{b}^I), \tag{36}$$

where an absolute function $|\cdot|$ constrains the signs of the connection weights, e.g., $W_{EE} = |\widetilde{W}_{EE}|$. After training an ANN to perform the classical “random dot” task of motion direction discrimination [Roitman and Shadlen, 2002], one can “open the black box” [Sussillo and Barak, 2013] and examine the resulting “wiring diagram” of the recurrent network connectivity pattern (Figure 7). With the incorporation of Dale’s law, the connectivity emerging from training is a heterogeneous version of a biologically-based structured network model of decision-making [Wang, 2002], demonstrating that machine learning brought closer to the brain’s hardware can indeed be used to gain insights into biological neural networks.

Figure 7: Training a network with Dale’s law. Connectivity matrix for a recurrent network trained on a perceptual decision making task. The network respects Dale’s law with separate groups of excitatory (blue) and inhibitory (red) neurons. Only connections between neurons with high stimulus selectivity are shown. Neurons are sorted based on their stimulus selectivity to choice 1 and 2. Recurrent excitatory connections between neurons selective to the same choice are indicated by two black squares. Figure inspired by Song et al. [2016].

The extensive long-range connectivity across brain areas [Felleman and Van Essen, 1991, Markov et al., 2014, Oh et al., 2014] can be included in ANNs.
In classical convolutional neural networks [LeCun et al., 1990, Krizhevsky et al., 2012], each layer only receives feedforward inputs from the immediately preceding layer. However, in some recent networks, each layer also receives feedforward inputs from much earlier layers [Huang et al., 2017, He et al., 2016]. In convolutional recurrent networks, neurons in each layer further receive feedback inputs from later layers and local recurrent connections [Nayebi et al., 2018, Kietzmann et al., 2019].

5.2 Canonical computation

Neuroscientists have identified several canonical computations that are carried out across a wide range of brain areas, including attention, normalization, and gating. Here we discuss how such canonical computations can be introduced into neural networks. They function as modular architectural components that can be plugged into many networks. Interestingly, the canonical computations mentioned above all have their parallels in ML-based neural networks. We will highlight the differences and similarities between purely ML implementations and more biological ones.

Normalization. Divisive normalization is widely observed in biological neural systems [Carandini and Heeger, 2012]. In divisive normalization, the activation of a neuron $r_i$ is no longer determined only by its immediate input $I_i$, $r_i = f(I_i)$. Instead, it is normalized by the sum of inputs $\sum_j I_j$ to a broader pool of neurons called the normalization pool,

$$r_i = f\left(\gamma \frac{I_i}{\sum_j I_j + \sigma}\right), \tag{37}$$

where $\gamma$ is a gain factor and $\sigma$ is a constant that keeps the denominator from vanishing. The specific choice of a normalization pool depends on the system studied. Biologically, although synaptic inputs are additive in the drive to neurons, feedback inhibition can effectively produce normalization [Ardid et al., 2007]. This form of divisive normalization is differentiable, so it can be directly incorporated into ANNs.

Normalization is also a critical part of many neural networks in machine learning.
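The effect of divisive normalization can be seen in a short numerical sketch. The block below assumes the form $r_i = f(\gamma I_i / (\sum_j I_j + \sigma))$ with non-negative inputs, a rectified-linear $f$, and the whole population as the normalization pool; the function name, the default values of `gamma` and `sigma`, and the input values are all illustrative assumptions rather than choices taken from the text.

```python
def divisive_normalization(inputs, gamma=1.0, sigma=1.0, f=lambda u: max(u, 0.0)):
    """Normalize each input by the summed input to the pool (here: all units).

    Assumes non-negative inputs; gamma (gain) and sigma (constant in the
    denominator) are illustrative parameters.
    """
    pool = sum(inputs)
    return [f(gamma * I / (pool + sigma)) for I in inputs]

# The same input pattern at two overall intensities (hypothetical values).
weak = divisive_normalization([1.0, 2.0, 3.0])
strong = divisive_normalization([10.0, 20.0, 30.0])
```

Although the second input is ten times stronger, the normalized responses change only modestly (for example, the largest response goes from 3/7 to 30/61), while the ratios between units are preserved. This is the sense in which normalization keeps responses within a range that downstream neurons can process.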
Similar to divisive normalization, ML-based normalization methods [Ioffe and Szegedy, 2015, Ba et al., 2016b, Ulyanov et al., 2016, Wu and He, 2018] aim at putting neuronal responses into a range appropriate for downstream areas to process. Unlike divisive normalization, the mean input to a pool of neurons is usually subtracted from, instead of dividing, the immediate input (Eq. 22). These methods also compute the standard deviation of inputs to the normalization pool, a step that may not be biologically plausible. Different ML-based normalization methods are distinguished based on their choice of a normalization pool.

Attention. Attention has been extensively studied in neuroscience [Desimone and Duncan, 1995, Carrasco, 2011]. Computational models are able to capture various aspects of bottom-up [Koch and Ullman, 1987] and top-down attention [Reynolds and Heeger, 2009]. In computational models, top-down attention usually takes the form of a multiplicative gain field applied to the activity of a specific group of neurons. In the case of spatial attention, consider a group of neurons, each with a preferred spatial location $x_i$ and pre-attention activity $\tilde{r}(x_i)$ for a certain stimulus. The attended spatial location $x_q$ results in attentional weights $\alpha_i(x_q)$, which are higher if $x_q$ is similar to $x_i$. The attentional weights can then be used to modulate the neural response of neuron $i$, $r_i(x_q) = \alpha_i(x_q)\,\tilde{r}(x_i)$. Similarly, feature attention strengthens the activity of neurons that are selective to the attended features (e.g., a specific color). Such top-down spatial and feature attention can be included in convolutional neural networks [Lindsay and Miller, 2018, Yang et al., 2018].

Meanwhile, attention has become widely used in machine learning [Bahdanau et al., 2015, Xu et al., 2015, Lindsay, 2020], constituting a standard component in recent natural language processing models [Vaswani et al., 2017].
Although the ML attention mechanisms appear rather different from attention models in neuroscience, the two are closely related, as we show below.

In deep learning, attention can be viewed as a differentiable dictionary retrieval process. A regular dictionary stores a number of key-value pairs $\{(k^{(i)}, v^{(i)})\}$ (e.g., word-explanation pairs), and is used much like looking up the explanation ($v^{(i)}$) of a word ($k^{(i)}$). For a given query $q$, using a dictionary involves searching for the key $k^{(j)}$ that matches $q$, $k^{(j)} = q$, and retrieving the corresponding value $y = v^{(j)}$. This process can be thought of as modulating each value $v^{(i)}$ by an attentional weight $\alpha_i$ that measures the similarity between the key $k^{(i)}$ and the query $q$. In the simple binary case,

$$ \alpha_i = \begin{cases} 1, & \text{if } k^{(i)} = q \\ 0, & \text{otherwise} \end{cases} \tag{38} $$

which modulates the output as

$$ y = \sum_i \alpha_i v^{(i)}. \tag{39} $$

In the above case of spatial attention, the $i$-th key-value pair is $(x_i, \tilde{r}(x_i))$, while the query is the attended spatial location $x_q$. Each neuron's response is modulated based on how similar its preferred spatial location (its key) $x_i$ is to the attended location (the query) $x_q$.

The use of ML attention makes the query-key comparison and the value-retrieval process differentiable. A query is compared with every key vector $k^{(i)}$ to obtain an attentional weight (normalized similarity score) $\alpha_i$,

$$ c_i = \mathrm{score}(q, k^{(i)}), \tag{40} $$

$$ \alpha_1, \cdots, \alpha_N = \mathrm{normalize}(c_1, \cdots, c_N). \tag{41} $$

Here the similarity scoring function can be a simple inner product, $\mathrm{score}(q, k^{(i)}) = q^\top k^{(i)}$ [Bahdanau et al., 2015], and the normalization function can be the softmax function,

$$ \alpha_i = \frac{e^{c_i}}{\sum_j e^{c_j}}, \quad \text{such that } \sum_i \alpha_i = 1. \tag{42} $$

The use of a normalization function is critical, as it effectively forces the network to focus on a few key vectors (a few attended locations in the case of spatial attention).

Gating

An important computation for biological neural systems is gating [Abbott, 2006, Wang and Yang, 2018].
Gating refers to the idea of controlling information flow without necessarily distorting its content. Gating in biological systems can be implemented with various mechanisms. Attention modulation multiplies inputs to neurons by a gain factor, providing a graded mechanism of gating at the level of sensory systems [Salinas and Thier, 2000, Olsen et al., 2012]. Another form of gating may involve several types of inhibitory neurons [Wang et al., 2004, Yang et al., 2016]. At the behavioral level, gating often appears to be all or none, as exemplified by effects such as inattentional blindness.

In deep learning, multiplicative gating is essential for popular recurrent network architectures such as LSTM (Long Short-Term Memory) networks (Eq. 43) [Hochreiter and Schmidhuber, 1997, Gers et al., 2000] and GRU (Gated Recurrent Unit) networks [Cho et al., 2014, Chung et al., 2014]. Gated networks are generally easier to train and more powerful than vanilla RNNs. Gating variables dynamically control information flow within these networks through multiplicative interactions. In an LSTM network, there are three types of gating variables. The input and output gates, $g^i_t$ and $g^o_t$, control the inputs to and outputs of the cell state $c_t$, while the forget gate $g^f_t$ controls whether the cell state $c_t$ keeps its memory $c_{t-1}$:

$$
\begin{aligned}
g^f_t &= \sigma_g(W_f x_t + U_f r_{t-1} + b_f), \\
g^i_t &= \sigma_g(W_i x_t + U_i r_{t-1} + b_i), \\
g^o_t &= \sigma_g(W_o x_t + U_o r_{t-1} + b_o), \\
c_t &= g^f_t \odot c_{t-1} + g^i_t \odot \sigma_c(W_c x_t + U_c r_{t-1} + b_c), \\
r_t &= g^o_t \odot \sigma_r(c_t).
\end{aligned}
\tag{43}
$$

Here the symbol $\odot$ denotes the element-wise (Hadamard) product of two vectors of the same length ($z = x \odot y$ means $z_i = x_i y_i$). Gating variables are bounded between 0 and 1 by the sigmoid function $\sigma_g$, which can be viewed as a smooth, differentiable approximation of a binary step function. A gate is open or closed when its corresponding gate value is near 1 or 0, respectively. All the weights ($W$ and $U$ matrices) are trained.
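The gate equations in Eq. 43 can be illustrated with a single-unit LSTM written in plain Python. This is a minimal sketch under stated assumptions: all quantities are scalars, $\sigma_c$ and $\sigma_r$ are taken to be `tanh` (the conventional choice, not specified in the text), and the parameter values are set by hand rather than trained, purely to expose what each gate does.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, r_prev, c_prev, p):
    """One time step of a single-unit LSTM following Eq. 43 (scalar version)."""
    g_f = sigmoid(p["W_f"] * x + p["U_f"] * r_prev + p["b_f"])  # forget gate
    g_i = sigmoid(p["W_i"] * x + p["U_i"] * r_prev + p["b_i"])  # input gate
    g_o = sigmoid(p["W_o"] * x + p["U_o"] * r_prev + p["b_o"])  # output gate
    c = g_f * c_prev + g_i * math.tanh(p["W_c"] * x + p["U_c"] * r_prev + p["b_c"])
    r = g_o * math.tanh(c)
    return r, c


# Hand-set (not trained) parameters: large biases pin the gates near 0 or 1.
hold = dict(W_f=0.0, U_f=0.0, b_f=10.0,    # forget gate ~1: keep old memory
            W_i=0.0, U_i=0.0, b_i=-10.0,   # input gate ~0: ignore new input
            W_o=0.0, U_o=0.0, b_o=10.0,    # output gate ~1: read out memory
            W_c=1.0, U_c=0.0, b_c=0.0)
write = dict(hold, b_f=-10.0, b_i=10.0)    # forget old memory, admit new input

# "Hold" mode: the cell state survives a stream of distracting inputs.
r, c = 0.0, 0.5
for x in [0.9, -0.7, 0.3, -0.2]:
    r, c = lstm_step(x, r, c, hold)

# "Write" mode: the cell state is overwritten by the current input.
_, c_written = lstm_step(0.9, 0.0, 0.5, write)
```

In hold mode the cell state stays close to its initial value of 0.5 regardless of the inputs, whereas in write mode it is replaced by (approximately) `tanh(0.9)`. A trained network would learn input-dependent weights so that it switches between these regimes on its own.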
By introducing these gates, an LSTM can in principle keep a memory in its cell state $c_t$ indefinitely by having the forget gate $g^f_t = 1$ and the input gate $g^i_t = 0$ (Figure 8). In addition, the network can choose when to read out from the memory by setting its output gate $g^o_t$ to 0 or 1. Despite their great utility to machine learning, LSTMs (and GRUs) cannot be easily related to biological neural circuits. Modifications to LSTMs have been suggested so that the gating process could be better explained by neurobiology [Costa et al., 2017].

Although both attention and gating utilize multiplicative interactions, a critical difference is that in attention the neural modulation is normalized (Eq. 41), whereas in gating it is not. Therefore, neural attention often has one focus, while neural gating can open or close gates to all neurons uniformly. An important insight from ML is that gating should be plastic, which should inspire neuroscientists to investigate learning to gate in the brain.

Predictive coding

Another canonical computation proposed for the brain is to compute predictions [Rao and Ballard, 1999, Bastos et al., 2012, Heilbron and Chait, 2018]. In predictive coding, a neural system constantly tries to make inferences about the external world. Brain areas selectively propagate information that is unpredicted or surprising, while suppressing responses to expected stimuli. To implement predictive coding in ANNs, feedback connections from higher layers can be trained with a separate loss that compares the output of the feedback connections with the neural activity in lower layers [Lotter et al., 2016, Sacramento et al., 2018]. In this way, feedback connections learn to predict the activity of lower areas. The feedback inputs are then used to inhibit neural activity in lower layers.

5.3 Learning and plasticity

Biological neural systems are products of evolution, development, and learning. In contrast, traditional ANNs are trained with SGD-based rules mostly from scratch.
Figure 8: Visualizing LSTM activity in a simple memory task. (A-C) A simple memory task. (A) The network receives a stream of input stimuli, the value of which is randomly and independently sampled at each time point. (B) When the “memorize input” (red) is active, the network needs to remember the current value of the stimulus (A), and output that value when the “report input” (blue) is next active. (C) After training, a single-unit LSTM can perform the task almost perfectly for modest memory durations. (D) When the memorize input is active, this network opens the input gate (allowing inputs) and closes the forget gate (forgetting previous memory). It opens the output gate when the report input is active.

The backpropagation algorithm for computing gradients is well known to be biologically implausible [Zipser and Andersen, 1988]. Incorporating more realistic learning processes can help us build better models of brains.

Selective training and continual learning

In typical ANNs, all connections are trained. However, in biological neural systems, synapses are not equally modifiable. Many synapses can be stable for years [Grutzendler et al., 2002, Yang et al., 2009]. To implement selective training of connections, the effective connection matrix $W$ can be expressed as the sum of a sparse trainable synaptic weight matrix and a non-trainable one, $W = W_{\mathrm{train}} + W_{\mathrm{fix}}$ [Rajan et al., 2016, Masse et al., 2018]. More generally, selective training can be imposed softly by adding to the loss a regularization term $L_{\mathrm{reg}}$ that makes it more difficult to change the weights of certain connections,

$$ L_{\mathrm{reg}} = \beta \sum_{ij} M_{ij} \left(W_{ij} - W_{\mathrm{fix},ij}\right)^2. \tag{44} $$

Here, $M_{ij}$ determines how strongly the connection $W_{ij}$ should stick close to the value $W_{\mathrm{fix},ij}$.

Selective training of connections through this form of soft constraint has been used by continual learning techniques to combat catastrophic forgetting.
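The soft constraint of Eq. 44 is a weighted quadratic penalty, and a small sketch may help make its effect explicit. The function names, the overall strength `beta`, and the example matrices below are hypothetical; in practice such a term would be added to the task loss and differentiated automatically by the training framework.

```python
def selective_penalty(W, W_fix, M, beta=1.0):
    """L_reg = beta * sum_ij M_ij * (W_ij - W_fix_ij)**2 for nested-list matrices."""
    return beta * sum(
        m * (w - w0) ** 2
        for row_w, row_w0, row_m in zip(W, W_fix, M)
        for w, w0, m in zip(row_w, row_w0, row_m)
    )


def selective_penalty_grad(W, W_fix, M, beta=1.0):
    """Gradient of L_reg with respect to each W_ij: 2 * beta * M_ij * (W_ij - W_fix_ij)."""
    return [
        [2.0 * beta * m * (w - w0) for w, w0, m in zip(row_w, row_w0, row_m)]
        for row_w, row_w0, row_m in zip(W, W_fix, M)
    ]


# Hypothetical 2x2 example: the first row is protected, the second is free to change.
W_fix = [[0.5, -0.2], [0.1, 0.3]]
W = [[0.7, -0.2], [0.1, 1.3]]
M = [[10.0, 10.0], [0.0, 0.0]]

loss = selective_penalty(W, W_fix, M)
grad = selective_penalty_grad(W, W_fix, M)
```

Only the protected connection that moved away from its reference value contributes to the penalty and receives a restoring gradient; the large change in the unprotected connection `W[1][1]` is not penalized at all. Setting `M` to 0 or a very large value for each connection recovers the hard split into trainable and fixed weights.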
Catastrophic forgetting is commonly observed when ANNs learn new tasks: they tend to rapidly forget previously learned tasks that are not revisited [McCloskey and Cohen, 1989]. One major class of continual learning methods deals with this issue by selectively training synaptic connections that are deemed unimportant for previously learned tasks or knowledge, while protecting the important ones [Kirkpatrick et al., 2017, Zenke et al., 2017].

Hebbian plasticity

The predominant idea for biological learning is Hebbian plasticity [Hebb, 2005] and its variants [Song et al., 2000, Bi and Poo, 2001]. Hebbian plasticity is an unsupervised learning method that drives learning of connection weights without target outputs or rewards. It is essential for classical models of associative memory such as Hopfield networks [Hopfield, 1982], and has a deep link to modern neural network architectures with explicit long-term memory modules [Graves et al., 2014].

Supervised learning techniques, especially those based on SGD, can be combined with Hebbian plasticity to develop ANNs that are both more powerful for certain tasks and more biologically realistic. There are two methods to combine Hebbian plasticity with SGD. In the first kind, the effective connection matrix $W = \tilde{W} + A$ is the sum of two connection matrices, $\tilde{W}$ trained by SGD, and $A$ driven by Hebbian plasticity [Ba et al., 2016a, Miconi et al., 2018],

$$ A(t+1) = \lambda A(t) + \eta\, r r^\top. \tag{45} $$

Or in component form,

$$ A_{ij}(t+1) = \lambda A_{ij}(t) + \eta\, r_i r_j. \tag{46} $$

In addition to training a separate matrix, SGD can be used to learn the plasticity rule itself [Bengio et al., 1992, Metz et al., 2018]. Here, the plasticity rule is a trainable function of pre- and post-synaptic activity,

$$ A_{ij}(t+1) = \lambda A_{ij}(t) + f(r_i, r_j, \theta). \tag{47} $$

Since the system is differentiable, the parameters $\theta$, which collectively describe the plasticity rule, can be updated with SGD-based methods. In its simplest form, $f(r_i, r_j, \theta) = \eta\, r_i r_j$, where $\theta = \{\eta\}$.
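The outer-product update above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: `W_slow` stands in for the SGD-trained matrix (set to zero here so that only the Hebbian part is visible), `lam` is a decay factor on the previous fast weights (1 means no decay), and the activity pattern and learning rate are arbitrary example values.

```python
def hebbian_update(A, r, eta=0.1, lam=1.0):
    """A_ij <- lam * A_ij + eta * r_i * r_j (outer-product Hebbian rule)."""
    n = len(r)
    return [[lam * A[i][j] + eta * r[i] * r[j] for j in range(n)] for i in range(n)]


def recurrent_drive(W_slow, A, r):
    """Input to each neuron through the effective matrix W = W_slow + A."""
    n = len(r)
    return [sum((W_slow[i][j] + A[i][j]) * r[j] for j in range(n)) for i in range(n)]


# A three-neuron example in which neurons 0 and 2 are co-active.
r = [1.0, 0.0, 1.0]
A = [[0.0] * 3 for _ in range(3)]
W_slow = [[0.0] * 3 for _ in range(3)]  # stand-in for the SGD-trained weights

A_hebb = hebbian_update(A, r, eta=0.1)    # positive eta: Hebbian
A_anti = hebbian_update(A, r, eta=-0.1)   # negative eta: anti-Hebbian

# Cue only neuron 0 and look at the drive delivered through the fast weights.
cue = [1.0, 0.0, 0.0]
recalled = recurrent_drive(W_slow, A_hebb, cue)
```

After a single Hebbian update, cueing neuron 0 alone also drives neuron 2, a rudimentary form of the pattern completion seen in associative memory models. In a differentiable framework, `eta` (and `lam`) could be registered as trainable parameters, so that the sign and strength of the rule are themselves learned.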
Here, the system can learn to become Hebbian ($\eta > 0$) or anti-Hebbian ($\eta < 0$). Learning a plasticity rule is a form of meta-learning, in which one algorithm (here, SGD) optimizes an inner learning rule (here, Hebbian plasticity).

Such Hebbian plasticity networks can be extended to include more complex synapses with multiple hidden variables, as in a “cascade model” of synaptic plasticity [Fusi et al., 2005]. In theory, properly designed complex synapses can substantially boost a neural network’s memory capacity [Benna and Fusi, 2016]. Models of such complex synapses are differentiable, and therefore can be incorporated into ANNs [Kaplanis et al., 2018].

Short-term plasticity

In addition to Hebbian plasticity, which acts on time scales from hours to years, biological synapses are subject to short-term plasticity mechanisms that operate on time scales of hundreds of milliseconds to seconds [Zucker and Regehr, 2002] and can rapidly modify their effective weights. Classical short-term plasticity rules [Mongillo et al., 2008, Markram et al., 1998] are formulated with spiking neurons, but they can be adapted to rate forms. In these rules, each connection weight $w = \tilde{w} u x$ is a product of an original weight $\tilde{w}$, a facilitating factor $u$, and a depressing factor $x$. The facilitating and depressing factors are both influenced by the pre-synaptic activity $r(t)$,

$$ \frac{dx}{dt} = \frac{1 - x(t)}{\tau_x} - u(t)\, x(t)\, r(t), \tag{48} $$

$$ \frac{du}{dt} = \frac{U - u(t)}{\tau_u} + U \left(1 - u(t)\right) r(t). \tag{49} $$

High pre-synaptic activity $r(t)$ increases the facilitating factor $u(t)$ and decreases the depressing factor $x(t)$. Again, the equations governing short-term plasticity are fully differentiable, so they can be incorporated into ANNs in the same way as Hebbian plasticity rules [Masse et al., 2019].

Masse et al. [2019] offers an illustration of how ANNs can be used to test new hypotheses in neuroscience.
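As a concrete illustration of the rate-based short-term plasticity dynamics in Eqs. 48 and 49, the sketch below integrates them with a forward Euler scheme. The time step, the time constants `tau_x` and `tau_u`, the baseline `U`, and the firing rates are illustrative assumptions chosen for this example, not values taken from the cited studies.

```python
def simulate_stp(rates, dt=0.001, tau_x=0.2, tau_u=1.5, U=0.2):
    """Forward-Euler integration of the depression (x) and facilitation (u) factors.

    rates: pre-synaptic firing rate (Hz) at each time step of length dt (seconds).
    Returns a list of (x, u, u * x) tuples; u * x scales the original weight.
    All parameter values are illustrative assumptions.
    """
    x, u = 1.0, U  # resting values: full resources, baseline release
    trace = []
    for r in rates:
        dx = (1.0 - x) / tau_x - u * x * r
        du = (U - u) / tau_u + U * (1.0 - u) * r
        x += dt * dx
        u += dt * du
        trace.append((x, u, u * x))
    return trace


# One second without pre-synaptic activity, and one second at 20 Hz.
silent = simulate_stp([0.0] * 1000)
driven = simulate_stp([20.0] * 1000)
x_end, u_end, efficacy_end = driven[-1]
```

Without pre-synaptic activity both factors remain at rest, whereas sustained activity raises `u` above its baseline `U` and lowers `x` below 1, as the text describes. Because every operation is differentiable, the same update can be written with tensors and placed inside the recurrent step of a trained network.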
That study was designed to investigate the neural mechanisms of working memory, the brain’s ability to maintain and manipulate information internally in the absence of external stimulation. Working memory has been extensively studied in animal experiments using delayed response tasks, in which a stimulus and its corresponding motor response are separated by a temporal gap during which the stimulus must be retained internally. Stimulus-selective, self-sustained persistent activity during a mnemonic delay is amply documented and considered the neural substrate of working memory representation [Goldman-Rakic, 1995, Wang, 2001]. However, recent studies have suggested that certain short-term memory traces may be realized by hidden variables instead of spiking activity, such as synaptic efficacy that, by virtue of short-term plasticity, represents past events [Stokes, 2015, Mongillo et al., 2008]. When an ANN endowed with short-term synaptic plasticity is trained to perform a delayed response task, it makes no a priori assumption about whether working memory is represented by hidden synaptic efficacy or by neural activity. It was found that an activity-silent state can accomplish such a task only when the delay is sufficiently short, whereas persistent activity naturally emerges from training with delay periods longer than the biophysical time constants of short-term synaptic plasticity. More importantly, training always gives rise to persistent activity, even with a short mnemonic delay period, when information must be manipulated internally, such as when mentally rotating a directional stimulus by 90 degrees. This work illustrates how ANNs can contribute to resolving important debates in neuroscience.

Biologically-realistic gradient descent

Backpropagation is commonly viewed as biologically unrealistic because the plasticity rule is not local (see Eq. 13).
Efforts have been devoted to approximating gradient descent with algorithms more compatible with the brain’s hardware [Lillicrap et al., 2016, Guerguiev et al., 2017, Roelfsema and Holtmaat, 2018, Lillicrap et al., 2020].

In feedforward networks, the backpropagation algorithm can be implemented with synaptic connections feeding back from the final layer [Xie and Seung, 2003]. This implementation assumes that the feedback connections precisely mirror the feedforward connections. This requirement can be relaxed. If a network uses fixed and random feedback connections, the feedforward connections come to approximately mirror the feedback connections during training (a phenomenon called “feedback alignment”), allowing the training loss to decrease [Lillicrap et al., 2016]. Another challenge of approximating backpropagation with feedback connections is that the feedback inputs carrying loss information need to be processed differently from the feedforward inputs carrying stimulus information. This issue can be addressed by introducing multi-compartmental neurons into ANNs [Guerguiev et al., 2017]. In such networks, feedforward and feedback inputs are processed separately because they are received by the model neurons’ soma and dendrites, respectively.

These methods of implementing the backpropagation algorithm through synapses that propagate information backwards have so far been used only for feedforward networks. For recurrent networks, the backpropagation algorithm propagates information backwards in time. It is therefore not clear how to interpret backpropagation in terms of synaptic connections. Instead, approximations can be made such that the network computes approximate gradient information as it runs forward in time [Williams and Zipser, 1989, Murray, 2019].

For many neuroscientific applications, it is probably not necessary to justify backpropagation by neurobiology.
ANNs often start as a “blank slate”; training by backpropagation is thus tasked with accomplishing what, for the brain, amounts to a combination of genetic programming, development, and plasticity in adulthood.

6 Future directions and conclusion

Recent years have seen a growing impact of ANN models in neuroscience. We have reviewed many of these efforts in the section Biologically realistic network architectures and learning. In this final section, we outline other existing challenges and ongoing work to make ANNs better models of brains.

Spiking neural networks

Most biological neurons communicate with spikes. Harnessing the power of machine learning algorithms for spiking networks remains a daunting challenge. Gradient-descent-based training techniques typically require the system to be differentiable, which makes it challenging to train spiking networks because spike generation is non-differentiable. However, several recent methods have been proposed to train spiking networks with gradient-based techniques [Courbariaux et al., 2016, Bellec et al., 2018, Zenke and Ganguli, 2018, Nicola and Clopath, 2017, Huh and Sejnowski, 2018]. These methods generally involve approximating spike generation with a differentiable system during backpropagation [Tavanaei et al., 2019]. Techniques to effectively train spiking networks could prove increasingly important and practical, as neuromorphic hardware that operates naturally with spikes becomes more powerful [Merolla et al., 2014, Pei et al., 2019].

Standardized protocols for developing brain-like recurrent networks

In the study of mammalian visual systems, the use of large datasets such as ImageNet [Deng et al., 2009] was crucial for producing neural networks that resemble biological neural circuits in the brain. The same has not been shown for most other systems.
Although many studies have shown success using neural networks to model cognitive and motor systems, each work usually has its own set of network architectures, training protocols, and other hyperparameters. Simply applying the most common architectures and training algorithms does not consistently lead to brain-like recurrent networks [Sussillo et al., 2015]. Much work remains to be done to search for datasets/tasks, network architectures, and training regimes that can produce brain-resembling artificial networks across a wide range of experimental tasks.

Detailed behavioral and physiological predictions

Although many studies have reported similarities between brains and ANNs, more detailed comparisons have revealed striking differences [Szegedy et al., 2013, Hénaff et al., 2019, Sussillo et al., 2015]. Deep convolutional networks can achieve performance similar to or better than that of humans on large image classification tasks; however, the mistakes they make can be very different from the ones made by humans [Szegedy et al., 2013, Rajalingham et al., 2018]. It will be important for future ANN models of brains to aim at simultaneously explaining a wider range of physiological and behavioral phenomena.

Interpreting learned networks and learning processes

With the ease of training neural networks comes the difficulty of analyzing them. Granted, neuroscientists are no strangers to the analysis of complex networks, and ANNs are still technically easier to analyze than biological neural networks. However, compared to network models with built-in regularities and small numbers of free parameters, deep neural networks are notoriously complex to analyze and understand, and will likely become even more so as we build increasingly sophisticated neural networks. This difficulty is rooted in the use of optimization algorithms to search for parameter values.
Since the optimization process in deep learning has no unique optimum, the results of optimization necessarily lack the degree of regularity built into hand-designed models. Although we can attempt to understand ANNs from the perspective of their objectives, architectures, and training algorithms [Richards et al., 2019], which are described with a much smaller number of hyperparameters, the link from these hyperparameters to network representation, mechanism, and behavior is mostly informal and based on intuition.

Despite the difficulties mentioned above, several lines of research hold promise. To facilitate understanding of learned networks, one can construct variants of neural networks that are more interpretable. For example, low-rank recurrent neural networks utilize recurrent connectivity matrices with low-dimensional structures [Mastrogiuseppe and Ostojic, 2018], allowing for a more straightforward mapping from network connectivity to dynamics and computation.

The dynamics of learning in neural networks can be studied analytically in deep linear networks [Saxe et al., 2013] and in very wide nonlinear networks, i.e., networks with a sufficiently large number of neurons per layer [Jacot et al., 2018]. In another line of work, the Information Bottleneck theory proposes that learning processes in neural networks are characterized by two phases: the first extracts information for output tasks (prediction), and the second discards (excessive) information about inputs (compression) [Shwartz-Ziv and Tishby, 2017]; see also [Saxe et al., 2019a]. Progress in these directions could shed light on why neural networks can generalize to new data despite having many parameters, which would traditionally indicate over-fitting and poor generalization performance.

Conclusion

Artificial neural networks present a novel approach in computational neuroscience.
They have already been used, with a certain degree of success, to model various aspects of sensory, cognitive, and motor circuits. Efforts are underway to make ANNs more biologically relevant and applicable to a wider range of neuroscientific questions. In a sense, rather than being viewed only as computational models, ANNs can be studied as model systems like fruit flies, mice, and monkeys, with the advantage that experiments on them are easily carried out to explore new task paradigms and computational ideas. Of course, one can be skeptical about ANNs as model systems, on the grounds that they are not biological organisms. However, computational models span a wide range of biological realism; there should be no doubt that brain research will benefit from enhanced interactions with machine learning and artificial intelligence. In order for ANNs to have a broad impact in neuroscience, it will be important to devote our efforts to two areas. First, we should continue to bring ANNs closer to neurobiology. Second, we should endeavour to “open the black box” thoroughly after learning, to identify the neural representations, temporal dynamics, and network connectivity that emerge from learning, yielding insights and predictions that are testable by neurobiological experiments. Recurrent neural dynamics, emphasized in this Primer, represent a salient feature of the brain; further development of strongly recurrent ANNs will help accelerate progress in neuroscience.

Acknowledgments: We thank Vishwa Goudar and Jacob Portes for helpful comments on a draft of this paper. This work was supported by the Simons Foundation, NSF NeuroNex Award DBI-1707398 and the Gatsby Charitable Foundation to GRY; the ONR grant N00014 and Simons Collaboration in the Global Brain (SCGB) (grant 543057SPI) to XJW.

References

M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning.
In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
L. Abbott. Where are the switches on this thing. 23 problems in systems neuroscience, pages 423–31, 2006.
L. Abbott and F. S. Chance. Drivers and modulators from push-pull and balanced synaptic input. Progress in brain research, 149:147–155, 2005.
L. F. Abbott. Theoretical neuroscience rising. Neuron, 60:489–495, 2008.
A. S. Andalman, V. M. Burns, M. Lovett-Barron, M. Broxton, B. Poole, S. J. Yang, L. Grosenick, T. N. Lerner, R. Chen, T. Benster, et al. Neuronal dynamics regulating brain and behavioral state transitions. Cell, 177(4):970–985, 2019.
S. Ardid, X.-J. Wang, and A. Compte. An integrated microcircuit model of attentional processing in the neocortex. J. Neurosci., 27:8486–8495, 2007.
J. Ba, G. E. Hinton, V. Mnih, J. Z. Leibo, and C. Ionescu. Using fast weights to attend to the recent past. In Advances in Neural Information Processing Systems, pages 4331–4339, 2016a.
J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016b.
D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations, 2015.
O. Barak. Recurrent neural networks as versatile tools of neuroscience research. Current opinion in neurobiology, 46:1–6, 2017.
O. Barak, D. Sussillo, R. Romo, M. Tsodyks, and L. Abbott. From fixed points to chaos: three models of delayed discrimination. Progress in neurobiology, 103:214–222, 2013.
H. B. Barlow et al. Possible principles underlying the transformation of sensory messages. Sensory communication, 1:217–234, 1961.
P. Bashivan, K. Kar, and J. J. DiCarlo. Neural population control via deep image synthesis. Science, 364(6439):eaav9436, 2019.
A. M. Bastos, W. M. Usrey, R. A. Adams, G. R. Mangun, P. Fries, and K. J. Friston. Canonical microcircuits for predictive coding. Neuron, 76:695–711, 2012.
G. Bellec, D. Salaj, A.
Subramoney, R. Legenstein, and W. Maass. Long short-term memory and learning-to-learn in networks of spiking neurons. In Advances in Neural Information Processing Systems, pages 787–797, 2018.
S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, volume 2. Univ. of Texas, 1992.
Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.
M. K. Benna and S. Fusi. Computational principles of synaptic memory consolidation. Nature neuroscience, 19(12):1697, 2016.
G. Bi and M. Poo. Synaptic modification by correlated activity: Hebb’s postulate revisited. Annu Rev Neurosci, 24:139–166, 2001.
L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
M. Botvinick, J. X. Wang, W. Dabney, K. J. Miller, and Z. Kurth-Nelson. Deep reinforcement learning and its neuroscientific implications. Neuron, 107:603–616, 2020.
K. H. Britten, M. N. Shadlen, W. T. Newsome, and J. A. Movshon. The analysis of visual motion: a comparison of neuronal and psychophysical performance. Journal of Neuroscience, 12(12):4745–4765, 1992.
C. F. Cadieu, H. Hong, D. L. Yamins, N. Pinto, D. Ardila, E. A. Solomon, N. J. Majaj, and J. J. DiCarlo. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS computational biology, 10(12):e1003963, 2014.
M. Carandini and D. J. Heeger. Normalization as a canonical neural computation. Nature Reviews Neuroscience, 13(1):51, 2012.
M. Carrasco. Visual attention: The past 25 years. Vision research, 51(13):1484–1525, 2011.
W. Chaisangmongkon, S. K. Swaminathan, D. J. Freedman, and X.-J. Wang. Computing by robust transience: how the fronto-parietal network performs sequential, category-based decisions. Neuron, 93(6):1504–1517, 2017.
T. Q.
Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. In Advances in neural information processing systems, pages 6571–6583, 2018.
K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
J. D. Cohen, K. Dunbar, and J. L. McClelland. On the control of automatic processes: a parallel distributed processing account of the Stroop effect. Psychological review, 97(3):332, 1990.
R. Costa, I. A. Assael, B. Shillingford, N. de Freitas, and T. Vogels. Cortical microcircuits as gated-recurrent neural networks. In Advances in Neural Information Processing Systems, pages 272–283, 2017.
M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
C. J. Cueva and X.-X. Wei. Emergence of grid-like representations by training recurrent neural networks to perform spatial localization. arXiv preprint arXiv:1803.07770, 2018.
N. D. Daw, S. J. Gershman, B. Seymour, P. Dayan, and R. J. Dolan. Model-based influences on humans’ choices and striatal prediction errors. Neuron, 69(6):1204–1215, 2011.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
R. Desimone and J. Duncan. Neural mechanisms of selective visual attention.
Annual review of neuroscience, 18(1):193–222, 1995.
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(Jul):2121–2159, 2011.
C. Eliasmith, T. C. Stewart, X. Choo, T. Bekolay, T. DeWolf, Y. Tang, and D. Rasmussen. A large-scale model of the functioning brain. Science, 338(6111):1202–1205, 2012.
J. L. Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
D. Erhan, Y. Bengio, A. Courville, and P. Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
D. J. Felleman and D. C. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral cortex (New York, NY: 1991), 1(1):1–47, 1991.
D. J. Freedman and J. A. Assad. Experience-dependent representation of visual categories in parietal cortex. Nature, 443(7107):85, 2006.
J. Freeman and E. P. Simoncelli. Metamers of the ventral stream. Nature neuroscience, 14(9):1195, 2011.
K. Fukushima and S. Miyake. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern recognition, 15:455–469, 1982.
K. Fukushima, S. Miyake, and T. Ito. Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE transactions on systems, man, and cybernetics, (5):826–834, 1983.
S. Fusi, P. J. Drew, and L. F. Abbott. Cascade models of synaptically stored memories. Neuron, 45(4):599–611, 2005.
F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. Neural computation, 12(10):2451–2471, 2000.
X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks.
In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 315–323, 2011.
J. I. Gold and M. N. Shadlen. The neural basis of decision making. Annual review of neuroscience, 30, 2007.
P. S. Goldman-Rakic. Cellular basis of working memory. Neuron, 14:477–485, 1995.
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.
V. Goudar and D. V. Buonomano. Encoding sensory and motor patterns as time-invariant trajectories in recurrent neural networks. eLife, 7:e31134, 2018.
A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.
J. Grutzendler, N. Kasthuri, and W.-B. Gan. Long-term dendritic spine stability in the adult cortex. Nature, 420(6917):812–816, 2002.
J. Guerguiev, T. P. Lillicrap, and B. A. Richards. Towards deep learning with segregated dendrites. eLife, 6:e22901, 2017.
K. Haroush and Z. M. Williams. Neuronal prediction of opponent’s behavior during cooperative social interchange in primates. Cell, 160(6):1233–1245, 2015.
D. Hassabis, D. Kumaran, C. Summerfield, and M. Botvinick. Neuroscience-inspired artificial intelligence. Neuron, 95:245–258, 2017.
K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
D. O. Hebb.
The organization of behavior: A neuropsychological theory. Psychology Press, 2005. M. Heilbron and M. Chait. Great expectations: is there evidence for predictive coding in auditory cortex? Neuroscience, 389:54–73, 2018. M. Helmstaedter, K. L. Briggman, S. C. Turaga, V. Jain, H. S. Seung, and W. Denk. Connectomic reconstruction of the inner plexiform layer in the mouse retina. Nature, 500(7461):168, 2013. O. J. Hénaff, R. L. Goris, and E. P. Simoncelli. Perceptual straightening of natural videos. Nature neuroscience, 22(6):984–991, 2019. S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9 (8):1735–1780, 1997. J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8): 2554–2558, 1982. K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989. G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017. D. H. Hubel and T. N. Wiesel. Receptive fields of single neurones in the cat’s striate cortex. The Journal of physiology, 148(3):574–591, 1959. D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J. Physiol. (Lond.), 160:106–154, 1962. D. Huh and T. J. Sejnowski. Gradient descent for spiking neural networks. In Advances in Neural Information Processing Systems, pages 1433–1443, 2018. S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018. H. Jaeger and H. Haas. 
Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. science, 304(5667):78–80, 2004. 37 M. Januszewski, J. Kornfeld, P. H. Li, A. Pope, T. Blakely, L. Lindsey, J. Maitin- Shepard, M. Tyka, W. Denk, and V. Jain. High-precision automated reconstruction of neurons with flood-filling networks. Nature methods, 15(8):605, 2018. J. P. Jones and L. A. Palmer. An evaluation of the two-dimensional gabor filter model of simple receptive fields in cat striate cortex. Journal of neurophysiology, 58(6):1233–1258, 1987. N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 1–12, 2017. C. Kaplanis, M. Shanahan, and C. Clopath. Continual reinforcement learning with complex synapses. arXiv preprint arXiv:1802.07239, 2018. K. Kar, J. Kubilius, K. Schmidt, E. B. Issa, and J. J. DiCarlo. Evidence that recurrent circuits are critical to the ventral stream’s execution of core object recognition behavior. Nature neuroscience, page 1, 2019. S.-M. Khaligh-Razavi and N. Kriegeskorte. Deep supervised, but not unsupervised, models may explain it cortical representation. PLoS computational biology, 10 (11):e1003915, 2014. R. Kiani and M. N. Shadlen. Representation of confidence associated with a decision by neurons in the parietal cortex. science, 324(5928):759–764, 2009. T. C. Kietzmann, C. J. Spoerer, L. K. Sörensen, R. M. Cichy, O. Hauk, and N. Kriegeskorte. Recurrence is required to capture the representational dynamics of the human visual system. Proceedings of the National Academy of Sciences, 116(43):21854–21863, 2019. D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. J. 
Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017. D. Kobak, W. Brendel, C. Constantinidis, C. E. Feierstein, A. Kepecs, Z. F. Mainen, X.-L. Qi, R. Romo, N. Uchida, and C. K. Machens. Demixed principal component analysis of neural population data. Elife, 5:e10989, 2016. C. Koch and S. Ullman. Shifts in selective visual attention: towards the underlying neural circuitry. In Matters of intelligence, pages 115–141. Springer, 1987. S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representations revisited. arXiv preprint arXiv:1905.00414, 2019. 38 N. Kriegeskorte. Deep neural networks: a new framework for modeling biological vision and brain information processing. Annual review of vision science, 1: 417–446, 2015. N. Kriegeskorte, M. Mur, and P. A. Bandettini. Representational similarity analysis- connecting the branches of systems neuroscience. Frontiers in systems neuro- science, 2:4, 2008. A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In Advances in neural information processing systems, pages 950–957, 1992. S. W. Kuffler. Discharge patterns and functional organization of mammalian retina. Journal of neurophysiology, 16(1):37–68, 1953. R. Laje and D. V. Buonomano. Robust timing and motor patterns by taming chaos in recurrent neural networks. Nature neuroscience, 16(7):925–933, 2013. Q. V. Le, N. Jaitly, and G. E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015. Y. LeCun. A theoretical framework for back-propagation. In D. Touretzky, G. 
Hinton, and T. Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School, pages 21–28. Burlington, MA: Morgan Kaufmann, 1988. Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. In M. A. Arbib, editor, The handbook of brain theory and neural networks, pages 255–258. Cambridge, MA: MIT Press, 1995. Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pages 396–404, 1990. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. nature, 521(7553):436, 2015. T. P. Lillicrap, D. Cownden, D. B. Tweed, and C. J. Akerman. Random synap- tic feedback weights support error backpropagation for deep learning. Nature communications, 7:13276, 2016. T. P. Lillicrap, A. Santoro, L. Marris, C. J. Akerman, and G. Hinton. Backpropagation and the brain. Nature Reviews Neuroscience, pages 1–12, 2020. G. W. Lindsay. Attention in psychology, neuroscience, and machine learning. Frontiers in Computational Neuroscience, 14:29, 2020. G. W. Lindsay and K. D. Miller. How biological attention mechanisms improve task performance in a large-scale visual system model. eLife, 7:e38105, 2018. 39 J. Lindsey, S. A. Ocko, S. Ganguli, and S. Deny. A unified theory of early visual representations from retina to cortex through anatomically constrained deep cnns. arXiv preprint arXiv:1901.00945, 2019. W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016. N. Maheswaranathan, A. H. Williams, M. D. Golub, S. Ganguli, and D. Sussillo. Universality and individuality in neural dynamics across large populations of recurrent networks. 
arXiv preprint arXiv:1907.08549, 2019. V. Mante, D. Sussillo, K. V. Shenoy, and W. T. Newsome. Context-dependent computation by recurrent dynamics in prefrontal cortex. nature, 503(7474):78, N. T. Markov, M. M. Ercsey-Ravasz, A. R. Ribeiro Gomes, C. Lamy, L. Magrou, J. Vezoli, P. Misery, A. Falchier, R. Quilodran, M. A. Gariel, J. Sallet, R. Gamanut, C. Huissoud, S. Clavagnier, P. Giroud, D. Sappey-Marinier, P. Barone, C. Dehay, Z. Toroczkai, K. Knoblauch, D. C. Van Essen, and H. Kennedy. A weighted and directed interareal connectivity matrix for macaque cerebral cortex. Cereb. Cortex, 24:17–36, 2014. H. Markram, Y. Wang, and M. Tsodyks. Differential signaling via the same axon of neocortical pyramidal neurons. Proceedings of the National Academy of Sciences, 95(9):5323–5328, 1998. N. Y. Masse, G. D. Grant, and D. J. Freedman. Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. Proceedings of the National Academy of Sciences, 115(44):E10467–E10475, 2018. N. Y. Masse, G. R. Yang, H. F. Song, X.-J. Wang, and D. J. Freedman. Circuit mechanisms for the maintenance and manipulation of information in working memory. Nature neuroscience, page 1, 2019. F. Mastrogiuseppe and S. Ostojic. Linking connectivity, dynamics, and computations in low-rank recurrent neural networks. Neuron, 99(3):609–623, 2018. A. Mathis, P. Mamidanna, K. M. Cury, T. Abe, V. N. Murthy, M. W. Mathis, and M. Bethge. Deeplabcut: markerless pose estimation of user-defined body parts with deep learning. Technical report, Nature Publishing Group, 2018. M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989. L. McIntosh, N. Maheswaranathan, A. Nayebi, S. Ganguli, and S. Baccus. Deep learning models of the retinal response to natural scenes. In Advances in neural information processing systems, pages 1369–1377, 2016. P. A. 
Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014. 40 L. Metz, N. Maheswaranathan, B. Cheung, and J. Sohl-Dickstein. Meta- learning update rules for unsupervised representation learning. arXiv preprint arXiv:1804.00222, 2018. T. Miconi, J. Clune, and K. O. Stanley. Differentiable plasticity: training plastic neural networks with backpropagation. arXiv preprint arXiv:1804.02464, 2018. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015. G. Mongillo, O. Barak, and M. Tsodyks. Synaptic theory of working memory. Science, 319(5869):1543–1546, 2008. J. M. Murray. Local online learning in recurrent networks with random feedback. eLife, 8:e43299, 2019. T. Nath, A. Mathis, A. C. Chen, A. Patel, M. Bethge, and M. W. Mathis. Using deeplabcut for 3d markerless pose estimation across species and behaviors. Nature protocols, 14(7):2152–2176, 2019. A. Nayebi, D. Bear, J. Kubilius, K. Kar, S. Ganguli, D. Sussillo, J. J. DiCarlo, and D. L. Yamins. Task-driven convolutional recurrent models of the visual system. In Advances in Neural Information Processing Systems, pages 5290–5301, 2018. W. Nicola and C. Clopath. Supervised learning in spiking neural networks with force training. Nature communications, 8(1):2208, 2017. Y. Niv. Reinforcement learning in the brain. Journal of Mathematical Psychology, 53(3):139–154, 2009. S. W. Oh, J. A. Harris, L. Ng, B. Winslow, N. Cain, S. Mihalas, Q. Wang, C. Lau, L. Kuan, A. M. Henry, et al. A mesoscale connectome of the mouse brain. Nature, 508(7495):207, 2014. E. Oja. Simplified neuron model as a principal component analyzer. 
Journal of mathematical biology, 15(3):267–273, 1982. S. R. Olsen, D. S. Bortone, H. Adesnik, and M. Scanziani. Gain control by layer six in cortical circuits of vision. Nature, 483(7387):47–52, 2012. B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996. A. E. Orhan and W. J. Ma. A diverse range of factors affect the nature of neural representations underlying short-term memory. Nature neuroscience, page 1, 2019. C. Pandarinath, D. J. O’Shea, J. Collins, R. Jozefowicz, S. D. Stavisky, J. C. Kao, E. M. Trautmann, M. T. Kaufman, S. I. Ryu, L. R. Hochberg, et al. Inferring single-trial neural population dynamics using sequential auto-encoders. Nature methods, page 1, 2018. 41 R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318, A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019. J. Pei, L. Deng, S. Song, M. Zhao, Y. Zhang, S. Wu, G. Wang, Z. Zou, Z. Wu, W. He, et al. Towards artificial general intelligence with hybrid tianjic chip architecture. Nature, 572(7767):106–111, 2019. B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964. C. R. Ponce, W. Xiao, P. F. Schade, T. S. Hartmann, G. Kreiman, and M. S. Living- stone. Evolving images for visual neurons using a deep generative network reveals coding principles and neuronal preferences. Cell, 177(4):999–1009, 2019. R. Prenger, M. C.-K. Wu, S. V. David, and J. L. Gallant. Nonlinear v1 responses to natural scenes revealed by neural network analysis. Neural Networks, 17(5-6): 663–679, 2004. R. 
Rajalingham, E. B. Issa, P. Bashivan, K. Kar, K. Schmidt, and J. J. DiCarlo. Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. Journal of Neuroscience, 38(33):7255–7269, 2018. K. Rajan, C. D. Harvey, and D. W. Tank. Recurrent network models of sequence generation and memory. Neuron, 90(1):128–142, 2016. R. P. Rao and D. H. Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience, 2(1):79, 1999. J. H. Reynolds and D. J. Heeger. The normalization model of attention. Neuron, 61 (2):168–185, 2009. B. A. Richards, T. P. Lillicrap, P. Beaudoin, Y. Bengio, R. Bogacz, A. Christensen, C. Clopath, R. P. Costa, A. de Berker, S. Ganguli, et al. A deep learning framework for neuroscience. Nature neuroscience, 22:1761–1770, 2019. M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature neuroscience, 2(11):1019–1025, 1999. M. Rigotti, D. D. Ben Dayan Rubin, X.-J. Wang, and S. Fusi. Internal representation of task rules by recurrent dynamics: the importance of the diversity of neural responses. Frontiers in computational neuroscience, 4:24, 2010. M. Rigotti, O. Barak, M. R. Warden, X.-J. Wang, N. D. Daw, E. K. Miller, and S. Fusi. The importance of mixed selectivity in complex cognitive tasks. Nature, 497(7451):585, 2013. 42 H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951. P. R. Roelfsema and A. Holtmaat. Control of synaptic plasticity in deep cortical networks. Nature Reviews Neuroscience, 19:166, 2018. J. D. Roitman and M. N. Shadlen. Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. J. Neurosci., 22: 9475–9489, 2002. R. Romo, C. D. Brody, A. Hernández, and L. Lemus. 
Neuronal correlates of parametric working memory in the prefrontal cortex. Nature, 399(6735):470–473, F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958. F. Rosenblatt. Principles of neurodynamics: Perceptions and the theory of brain mechanisms. 1962. D. B. Rubin, S. D. Van Hooser, and K. D. Miller. The stabilized supralinear network: a unifying circuit motif underlying multi-input integration in sensory cortex. Neuron, 85(2):402–417, 2015. D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. nature, 323(6088):533–536, 1986. J. Sacramento, R. P. Costa, Y. Bengio, and W. Senn. Dendritic cortical microcircuits approximate the backpropagation algorithm. In Advances in neural information processing systems, pages 8721–8732, 2018. E. Salinas and P. Thier. Gain modulation: a major computational principle of the central nervous system. Neuron, 27(1):15–21, 2000. A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dy- namics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox. On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124020, 2019a. A. M. Saxe, J. L. McClelland, and S. Ganguli. A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, 116(23):11537–11546, 2019b. W. Schultz, P. Dayan, and P. R. Montague. A neural substrate of prediction and reward. Science, 275(5306):1593–1599, 1997. H. S. Seung. How the brain keeps the eyes still. Proc. Natl. Acad. Sci. (USA), 93: 13339–13344, 1996. Y. Shu, A. Hasenstaub, and D. A. McCormick. Turning on and off recurrent balanced cortical activity. Nature, 423(6937):288–293, 2003. 43 R. Shwartz-Ziv and N. Tishby. 
Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017. D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hu- bert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. H. Sompolinsky, A. Crisanti, and H.-J. Sommers. Chaos in random neural networks. Physical review letters, 61(3):259, 1988. H. F. Song, G. R. Yang, and X.-J. Wang. Training excitatory-inhibitory recurrent neural networks for cognitive tasks: a simple and flexible framework. PLoS computational biology, 12(2):e1004792, 2016. H. F. Song, G. R. Yang, and X.-J. Wang. Reward-based training of recurrent neural networks for cognitive and value-based tasks. Elife, 6:e21492, 2017. S. Song, K. D. Miller, and L. F. Abbott. Competitive hebbian learning through spike-timing-dependent synaptic plasticity. Nature neuroscience, 3(9):919–926, N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014. M. G. Stokes. ‘activity-silent’working memory in prefrontal cortex: a dynamic coding framework. Trends in cognitive sciences, 19(7):394–405, 2015. S. Strogatz. Nonlinear dynamics and chaos: with applications to physics, biology, chemistry, and engineering (studies in nonlinearity). 2001. D. Sussillo. Neural circuits as computational dynamical systems. Current opinion in neurobiology, 25:156–163, 2014. D. Sussillo and L. F. Abbott. Generating coherent patterns of activity from chaotic neural networks. Neuron, 63(4):544–557, 2009. D. Sussillo and O. Barak. Opening the black box: low-dimensional dynamics in high-dimensional recurrent neural networks. Neural computation, 25(3):626–649, D. Sussillo, M. M. 
Churchland, M. T. Kaufman, and K. V. Shenoy. A neural network that finds a naturalistic solution for the production of muscle activity. Nature neuroscience, 18(7):1025, 2015. I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013. R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 44 C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fer- gus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, A. Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier, and A. Maida. Deep learning in spiking neural networks. Neural Networks, 111:47–63, 2019. T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012. A. N. Tikhonov. On the stability of inverse problems. In Dokl. Akad. Nauk SSSR, volume 39, pages 195–198, 1943. D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016. C. Van Vreeswijk and H. Sompolinsky. Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274(5293):1724–1726, 1996. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017. J. Wang, D. Narain, E. A. Hosseini, and M. Jazayeri. Flexible timing by temporal scaling of cortical responses. Nature neuroscience, 21(1):102, 2018. X.-J. Wang. Synaptic reverberation underlying mnemonic persistent activity. Trends in Neurosci., 24:455–463, 2001. X.-J. Wang. Probabilistic decision making by slow reverberation in cortical circuits. Neuron, 36(5):955–968, 2002. X.-J. Wang. 
Decision making in recurrent neuronal circuits. Neuron, 60(2):215–234, X.-J. Wang and G. R. Yang. A disinhibitory circuit motif and flexible information routing in the brain. Curr. Opin. Neurobiol., 49:75–83, 2018. X.-J. Wang, J. Tegnér, C. Constantinidis, and P. S. Goldman-Rakic. Division of labor among distinct subtypes of inhibitory neurons in a cortical microcircuit of working memory. Proc Natl Acad Sci U S A, 101:1368–1373, 2004. P. J. Werbos. Backpropagation through time: what it does and how to do it. Pro- ceedings of the IEEE, 78(10):1550–1560, 1990. A. H. Williams, T. H. Kim, F. Wang, S. Vyas, S. I. Ryu, K. V. Shenoy, M. Schnitzer, T. G. Kolda, and S. Ganguli. Unsupervised discovery of demixed, low-dimensional neural dynamics across multiple timescales through tensor component analysis. Neuron, 98(6):1099–1115, 2018. R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989. 45 H. R. Wilson and J. D. Cowan. Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical journal, 12(1):1–24, 1972. Y. Wu and K. He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018. X. Xie and H. S. Seung. Equivalence of backpropagation and contrastive hebbian learning in a layered network. Neural computation, 15(2):441–454, 2003. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, Y. Yamane, E. T. Carlson, K. C. Bowman, Z. Wang, and C. E. Connor. A neural code for three-dimensional object shape in macaque inferotemporal cortex. Nature neuroscience, 11(11):1352, 2008. D. L. Yamins and J. J. DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature neuroscience, 19(3):356, 2016. D. L. Yamins, H. 
Hong, C. F. Cadieu, E. A. Solomon, D. Seibert, and J. J. DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619– 8624, 2014. G. Yang, F. Pan, and W.-B. Gan. Stably maintained dendritic spines are associated with lifelong memories. Nature, 462(7275):920–924, 2009. G. R. Yang, J. D. Murray, and X.-J. Wang. A dendritic disinhibitory circuit mecha- nism for pathway-specific gating. Nat Commun, 7:12815, 2016. G. R. Yang, I. Ganichev, X.-J. Wang, J. Shlens, and D. Sussillo. A dataset and architecture for visual reasoning with a working memory. In European Conference on Computer Vision, pages 729–745. Springer, 2018. G. R. Yang, M. R. Joglekar, H. F. Song, W. T. Newsome, and X.-J. Wang. Task representations in neural networks trained to perform many cognitive tasks. Nature neuroscience, page 1, 2019. M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014. F. Zenke and S. Ganguli. Superspike: Supervised learning in multilayer spiking neural networks. Neural computation, 30(6):1514–1541, 2018. F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3987–3995. JMLR. org, 2017. C. Zhuang, S. Yan, A. Nayebi, and D. Yamins. Self-supervised neural network models of higher visual cortex development. In 2019 Conference on Cognitive Computational Neuroscience, pages 566–569, 2019. 46 D. Zipser and R. A. Andersen. A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature, 331(6158):679, 1988. R. S. Zucker and W. G. Regehr. Short-term synaptic plasticity. Annual review of physiology, 64(1):355–405, 2002. 
Quantitative Biology arXiv (Cornell University)

Artificial neural networks for neuroscientists: A primer

Quantitative Biology, Volume 2020 (2006) – Jun 1, 2020

ISSN: 0896-6273
eISSN: ARCH-3345
DOI: 10.1016/j.neuron.2020.09.005

Abstract

Artificial neural networks (ANNs) are essential tools in machine learning that have drawn increasing attention in neuroscience. Besides offering powerful techniques for data analysis, ANNs provide a new approach for neuroscientists to build models for complex behaviors, heterogeneous neural activity and circuit connectivity, as well as to explore optimization in neural systems, in ways that traditional models are not designed for. In this pedagogical Primer, we introduce ANNs and demonstrate how they have been fruitfully deployed to study neuroscientific questions. We first discuss basic concepts and methods of ANNs. Then, with a focus on bringing this mathematical framework closer to neurobiology, we detail how to customize the analysis, structure, and learning of ANNs to better address a wide range of challenges in brain research. To help readers gain hands-on experience, this Primer is accompanied by tutorial-style code in PyTorch and Jupyter Notebook, covering major topics.

1 Artificial neural networks in neuroscience

Learning with artificial neural networks (ANNs), or deep learning, has emerged as a dominant framework in machine learning (ML) [LeCun et al., 2015], leading to breakthroughs across a wide range of applications, including computer vision [Krizhevsky et al., 2012], natural language processing [Devlin et al., 2018], and strategic games [Silver et al., 2017]. Some key ideas in this field can be traced to brain research: supervised learning rules have their roots in the theory of training perceptrons, which in turn was inspired by the brain [Rosenblatt, 1962]; the hierarchical architecture [Fukushima and Miyake, 1982] and convolutional principle [LeCun and Bengio, 1995] were closely linked to our knowledge about the primate visual system [Hubel and Wiesel, 1962, Felleman and Van Essen, 1991]. Today, there is a continued exchange of ideas from neuroscience to the field of artificial intelligence [Hassabis et al., 2017].

At the same time, machine learning offers new and powerful tools for systems neuroscience. One utility of the deep learning framework is to analyze neuroscientific data (Figure 1). Indeed, the advances in computer vision, especially convolutional neural networks, have revolutionized image and video data processing. For instance, uncontrolled behaviors over time, such as micro-movements of animals in a laboratory experiment, can now be tracked and quantified efficiently with the help of deep neural networks [Mathis et al., 2018]. Innovative neurotechnologies are producing a deluge of big data from brain connectomics, transcriptomics, and neurophysiology, the analyses of which can benefit from machine learning. Examples include image segmentation to achieve detailed, µm-scale reconstruction of connectivity in a neural microcircuit [Januszewski et al., 2018, Helmstaedter et al., 2013], and estimation of neural firing rate from spiking data [Pandarinath et al., 2018].

Preprint. arXiv:2006.01001v2 [q-bio.NC] 24 Sep 2020

This primer will not focus on data analysis; instead, our primary aim is to present basic concepts and methods for the development of ANN models of biological neural circuits in the field of computational neuroscience. It is noteworthy that ANNs should not be confused with neural network models in general. Mathematical models are all “artificial” inasmuch as they are not biological. We denote by ANNs specifically models that are in part inspired by neuroscience yet for which biological justification is not the primary concern, in contrast to other types of models that strive to be built on quantitative data from the two pillars of neuroscience: neuroanatomy and neurophysiology. The use of ANNs in neuroscience [Zipser and Andersen, 1988] and cognitive science [Cohen et al., 1990] dates back to the early days of ANNs [Rumelhart et al., 1986].
In recent years, ANNs are becoming increasingly common model systems in neuroscience [Yamins and DiCarlo, 2016, Kriegeskorte, 2015, Sussillo, 2014, Barak, 2017]. There are three reasons for which ANNs or deep learning models have already been, and will likely continue to be, particularly useful for neuroscientists. First, fresh modeling approaches are needed to meet new challenges in brain research. Over the past decades, computational neuroscience has made great strides and be- come an integrated part of systems neuroscience [Abbott, 2008]. Much insights have been gained through integration of experiments and theory, including the idea of excitation and inhibition balance [Van Vreeswijk and Sompolinsky, 1996, Shu et al., 2003] and normalization [Carandini and Heeger, 2012]. Progress was also made in developing models of basic cognitive functions such as simple decision- making [Gold and Shadlen, 2007, Wang, 2008]. However, real-life problems can be incredibly complex, the underlying brain systems are often difficult to capture with “hand-constructed” computational models. For example, object classification in the brain is carried out through many layers of complex linear-nonlinear pro- cessing. Building functional models of the visual systems that achieve behavioral performance close to humans’ remained a formidable challenge not only for neu- roscientists, but also for computer vision researchers. By directly training neural network models on complex tasks and behaviors, deep learning provides a way to efficiently generate candidate models for brain functions that otherwise could be near impossible to model (Figure 1). By learning to perform a variety of complex behaviors of animals, ANNs could serve as potential model systems for biological neural networks, complementing nonhuman animal models for understanding the human brain. 2 Figure 1: Reasons for using ANNs for neuroscience research. (Top left) Neu- ral/Behavioral data analysis. 
ANNs can serve as image processing tools for efficient pose estimation (color dots). Figure inspired from Nath et al. [2019]. (Top right) Modeling complex behaviors. ANNs can perform object discrimination tasks involving challenging naturalistic visual objects. Figure adapted from Kar et al. [2019]. (Bottom left) Modeling complex neural activity/connectivity patterns (blue lines). (Bottom right) Understanding neural circuits from an optimization perspective. In this view, functional neural networks (star symbol) are results of the optimization (arrows) of an objective function in an abstract space of a model constrained by the neural network architecture (colored space).

A second reason for advocating deep networks in systems neuroscience is the acknowledgment that relatively simple models often do not account for the wide diversity of activity patterns in heterogeneous neural populations (Figure 1). One can rightly argue that this is a virtue rather than a defect, because simplicity and generality are hallmarks of good theories. However, complex neural signals also tell us that existing models may be insufficient to elucidate mysteries of the brain. This is perhaps especially true in the case of the prefrontal cortex. Neurons in prefrontal cortex often show complex mixed selectivity to various task variables [Rigotti et al., 2010, 2013]. Such complex patterns are often not straightforward to interpret and understand using hand-built models that by design strive for simplicity. ANNs are promising tools for capturing the complex nature of neural activity.

Third, besides providing mechanistic models of biological systems, machine learning can be used to probe the “why” question in neuroscience [Barlow et al., 1961]. Brains are biological machines evolved under pressure to compute robustly and efficiently. Even when we understand how a system works, we may still ask why it works that way.
Similarly to biological systems evolving to survive, ANNs are trained to optimize objective functions given various architectural constraints (the number of neurons, economy of circuit wiring, etc.) (Figure 1). By identifying the particular objective and set of constraints that lead to brain-resembling ANNs, we could potentially gain insights into the evolutionary pressure faced by biological systems [Richards et al., 2019].

In this pedagogical primer, we will discuss how ANNs can benefit neuroscientists in the three ways described above. In section 2, we will first introduce the key ingredients common in any study of ANNs. In section 3, we will describe two major applications of ANNs as neuroscientific models: convolutional networks as models for sensory, especially visual, systems, and recurrent neural networks as models for cognitive and motor systems. In the following sections 4 and 5, we will overview how to customize the analysis and architectural design of ANNs to better address a wide range of neuroscience questions. To help the readers gain hands-on experience, we accompany this primer with tutorial-style code in PyTorch and Jupyter Notebook (https://github.com/gyyang/nn-brain), covering all major topics.

2 Basic ingredients and variations in artificial neural networks

In this section, we will introduce basic concepts in ANNs and their common variations. Readers can skip this section if they are familiar with ANNs and deep learning. For a more thorough introduction, readers can refer to Goodfellow et al. [2016].

2.1 Basic ingredient: learning problem, architecture, and algorithm

A typical study using deep networks consists of three basic ingredients: learning problem, network architecture, and training algorithm. Weights of connections between units or neurons in a neural network are constrained by the network architecture, but their specific values are randomly assigned at initialization.
These weights constitute a large number of parameters, collectively denoted by $\theta$ (which also includes other model parameters, see below), to be trained using an algorithm. The training algorithm specifies how connection weights change to better solve a learning problem, such as to fit a dataset or perform a task. We will go over a simple example, where a multi-layer perceptron (MLP) is trained to perform a simple digit-classification task using supervised learning.

Learning problem. In supervised learning, a system learns to fit a dataset containing a set of inputs $\{\mathbf{x}^{(i)}\}$, $i = 1, \cdots, N$. Each input $\mathbf{x}^{(i)}$ is paired with a target output $\mathbf{y}^{(i)}_{\mathrm{target}}$. Symbols in bold represent vectors (column vectors by default). The goal is to learn parameters $\theta$ of a neural network function $F(\cdot, \theta)$ that predicts the target outputs given inputs, $\mathbf{y}^{(i)} = F(\mathbf{x}^{(i)}, \theta) \approx \mathbf{y}^{(i)}_{\mathrm{target}}$. In the simple digit-classification task MNIST [LeCun et al., 1998], each input is an image containing a single digit, while the target output is a probability distribution over all classes (0, 1, ..., 9) given by a 10-dimensional vector, or simply an integer corresponding to the class of that digit.

More precisely, the system is trained to optimize the value of an objective function, or commonly, minimize the value of a loss function $L = \frac{1}{N} \sum_i L(\mathbf{y}^{(i)}, \mathbf{y}^{(i)}_{\mathrm{target}})$, where $L(\mathbf{y}^{(i)}, \mathbf{y}^{(i)}_{\mathrm{target}})$ quantifies the difference between the target output $\mathbf{y}^{(i)}_{\mathrm{target}}$ and the actual output $\mathbf{y}^{(i)}$.

Network architecture. ANNs are incredibly versatile, including a wide range of architectures. Of all architectures, the most fundamental one is a Multi-Layer Perceptron (MLP) [Rosenblatt, 1958, 1962] (Figure 2A). An MLP consists of multiple layers of neurons, where neurons in the $l$-th layer only receive inputs from the $(l-1)$-th layer, and only project to the $(l+1)$-th layer.
$$\mathbf{r}^{(1)} = \mathbf{x}, \tag{1}$$
$$\mathbf{r}^{(l)} = f(W^{(l)} \mathbf{r}^{(l-1)} + \mathbf{b}^{(l)}), \quad 1 < l < N, \tag{2}$$
$$\mathbf{y} = W^{(N)} \mathbf{r}^{(N-1)} + \mathbf{b}^{(N)}. \tag{3}$$

Here $\mathbf{x}$ is an external input, $\mathbf{r}^{(l)}$ denotes the neural activity of neurons in the $l$-th layer, and $W^{(l)}$ is the connection matrix from the $(l-1)$-th to the $l$-th layer. $f(\cdot)$ is a (usually nonlinear) activation function of the model neurons. The output of the network is read out through connections $W^{(N)}$. Parameters $\mathbf{b}^{(l)}$ and $\mathbf{b}^{(N)}$ are biases for model neurons and output units respectively. If the network is trained to classify, then the output is often normalized such that $\sum_j y_j = 1$, where $y_j$ represents the predicted probability of class $j$.

When there are enough neurons per layer, MLPs can in theory approximate arbitrary functions [Hornik et al., 1989]. However, in practice, the network size is limited, and good solutions may not be found through training even when they exist. MLPs are often used in combination with, or as parts of, more modern neural network architectures.

Training algorithm. The signature method of training in deep learning is stochastic gradient descent (SGD) [Robbins and Monro, 1951, Rumelhart et al., 1986]. Trainable parameters, collectively denoted as $\theta$, are updated in the opposite direction of the gradient of the loss, $\partial L / \partial \theta$. Intuitively, the $j$-th parameter $\theta_j$ should be reduced by training if the cost function $L$ increases with it, and increased otherwise. For each step of training, since it is usually too expensive to evaluate the loss using the entire training set, the loss is computed using a small number $M$ of randomly selected training examples (a minibatch), indexed by $B = \{k_1, \cdots, k_M\}$,

$$L_{\mathrm{batch}} = \frac{1}{M} \sum_{k \in B} L(\mathbf{y}^{(k)}, \mathbf{y}^{(k)}_{\mathrm{target}}), \tag{4}$$

hence the name “stochastic”. For simplicity, we assume a minibatch size of 1 and omit the subscript “batch” in the following equations ($L_{\mathrm{batch}}$ will be referred to as $L$, etc.). The gradient, $\partial L / \partial \theta$, is the direction of parameter change that would lead to maximum increase in the loss function when the change is small enough.
To decrease the loss, trainable parameters are updated in the opposite direction of the gradient, with a magnitude proportional to the learning rate $\eta$,

$$\Delta \theta = -\eta \frac{\partial L}{\partial \theta}. \tag{5}$$

Figure 2: Schematics of common neural network architectures. (A) A multi-layer perceptron (MLP). (B) A recurrent neural network (middle) receives a stream of inputs (left). After training, an output unit (right) should produce a desired output. Figure inspired from Mante et al. [2013]. (C) A recurrent neural network is unrolled in time as a feedforward system with each layer corresponding to the network state at one time step. $\mathbf{c}_t$ and $\mathbf{r}_t$ describe the network state and output activity at time $t$ respectively. $\mathbf{c}_t$ is a function of $\mathbf{r}_{t-1}$ and the input $\mathbf{x}_t$. (D) A convolutional neural network for processing images. Each layer contains a number of channels (4 in layer 1, 6 in layer 2). A channel (represented by a square) consists of spatially organized neurons, each receiving connections from neurons with similar spatial preferences. The spatial extent of these connections is described by the kernel size. Figure inspired from LeCun et al. [1998].

Parameters such as $W$ and $\mathbf{b}$ are usually trainable. Other parameters are set by the modelers and called hyperparameters, for example, the learning rate $\eta$. A crucial requirement for computing gradients is differentiability, namely that derivatives of functions in the model are well defined.

For a feedforward network without any intermediate (hidden) layer [Rosenblatt, 1962] processing a single example $\mathbf{x}$ (minibatch size 1),

$$\mathbf{y} = W \mathbf{x} + \mathbf{b}, \quad \text{or equivalently,} \quad y_i = \sum_j W_{ij} x_j + b_i, \tag{6}$$

computing the gradient is straightforward,

$$\frac{\partial L}{\partial W_{ij}} = \sum_k \frac{\partial L}{\partial y_k} \frac{\partial y_k}{\partial W_{ij}} = \frac{\partial L}{\partial y_i} x_j, \tag{7}$$

with $\partial y_k / \partial W_{ij}$ equal to $x_j$ when $k = i$, otherwise 0. In vector notation,

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \mathbf{y}} \mathbf{x}^\top. \tag{8}$$

Here we follow the convention that $\partial L / \partial W$ and $\partial L / \partial \mathbf{y}$ have the same form as $W$ and $\mathbf{y}$, respectively.
Assuming that

$$L = \frac{1}{2} \|\mathbf{y} - \mathbf{y}_{\mathrm{target}}\|^2 = \frac{1}{2} \sum_j (y_j - y_{\mathrm{target},j})^2, \tag{9}$$

we have,

$$\frac{\partial L}{\partial W} = (\mathbf{y} - \mathbf{y}_{\mathrm{target}}) \mathbf{x}^\top, \tag{10}$$
$$\Delta W_{ij} \propto -\frac{\partial L}{\partial W_{ij}} = (y_{\mathrm{target},i} - y_i) x_j. \tag{11}$$

This modification only depends on local information about the input and output units of each connection. Hence, if $y_{\mathrm{target},i} > y_i$, $W_{ij}$ should change to increase the net input, and $\Delta W_{ij}$ has the same sign as $x_j$. The opposite is true if $y_{\mathrm{target},i} < y_i$.

For a multi-layer network, the differentiation is done using the back-propagation algorithm [Rumelhart et al., 1986, LeCun, 1988]. To compute the loss $L$, the network is run in a forward pass (Eq. 1-3). Next, to efficiently compute the exact gradient $\partial L / \partial \theta$, information about the loss needs to be passed backward, in the opposite direction of the forward pass, hence the name back-propagation.

To illustrate the concept, consider an $N$-layer linear feedforward network (Eq. 1-3, but with $f(x) = x$). To compute $\partial L / \partial W^{(l)}$, we need to compute $\partial L / \partial \mathbf{r}^{(l)}$. From $\mathbf{r}^{(l+1)} = W^{(l+1)} \mathbf{r}^{(l)} + \mathbf{b}^{(l+1)}$, we have

$$\frac{\partial L}{\partial r^{(l)}_i} = \sum_j \frac{\partial L}{\partial r^{(l+1)}_j} \frac{\partial r^{(l+1)}_j}{\partial r^{(l)}_i} = \sum_j \frac{\partial L}{\partial r^{(l+1)}_j} W^{(l+1)}_{ji} = \sum_j \left[ W^{(l+1)\top} \right]_{ij} \frac{\partial L}{\partial r^{(l+1)}_j}. \tag{12}$$

In vector notation,

$$\frac{\partial L}{\partial \mathbf{r}^{(l)}} = [W^{(l+1)}]^\top \frac{\partial L}{\partial \mathbf{r}^{(l+1)}} = [W^{(l+1)}]^\top [W^{(l+2)}]^\top \frac{\partial L}{\partial \mathbf{r}^{(l+2)}} = \cdots. \tag{13}$$

Therefore, starting with $\partial L / \partial \mathbf{y}$, $\partial L / \partial \mathbf{r}^{(l)}$ can be recursively computed from $\partial L / \partial \mathbf{r}^{(l+1)}$, for $l = N-1, \cdots, 1$. This computation flows in the opposite direction of the forward pass, and is called the backward pass. In general, back-propagation applies to neural networks with arbitrary differentiable components.

Computing the exact gradient through back-propagation is considered unrealistic biologically because updating connections at each layer requires precise, non-local information about connection weights at downstream layers (in the form of the transposed connection matrix, Eq. 13).
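The backward recursion of Eq. 13, combined with the outer-product gradient of Eq. 8, can be sketched in a few lines of pure Python. This is a minimal illustration rather than the accompanying tutorial code: the layer sizes, random weights, and helper names (`matvec`, `forward`, `backward`, `loss_fn`) are assumptions made for this example, the network is linear ($f(x) = x$), and the loss is the squared error of Eq. 9. One back-propagated gradient entry is compared against a central finite-difference estimate.

```python
import random

def matvec(W, v):
    """Return the matrix-vector product W v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def forward(weights, biases, x):
    """Forward pass with f(x) = x; returns the activity of every layer."""
    acts = [list(x)]
    for W, b in zip(weights, biases):
        acts.append([s + bi for s, bi in zip(matvec(W, acts[-1]), b)])
    return acts

def loss_fn(y, y_target):
    """Squared-error loss of Eq. 9."""
    return 0.5 * sum((yj - tj) ** 2 for yj, tj in zip(y, y_target))

def backward(weights, acts, y_target):
    """Backward pass; returns dL/dW for every layer, first layer first."""
    delta = [yj - tj for yj, tj in zip(acts[-1], y_target)]  # dL/dy
    grads = []
    for W, r_prev in zip(reversed(weights), reversed(acts[:-1])):
        # Gradient for this layer: outer product of error and input (Eq. 8).
        grads.append([[d * r for r in r_prev] for d in delta])
        # Pass the error one layer back through W transposed (Eq. 13).
        delta = [sum(W[j][i] * delta[j] for j in range(len(W)))
                 for i in range(len(W[0]))]
    return grads[::-1]

# Hypothetical toy network: 3 inputs -> 4 hidden units -> 2 outputs.
random.seed(0)
sizes = [3, 4, 2]
weights = [[[random.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [[0.0] * n_out for n_out in sizes[1:]]
x, y_target = [1.0, -2.0, 0.5], [0.3, -0.7]

acts = forward(weights, biases, x)
analytic = backward(weights, acts, y_target)[0][1][2]

# Central finite-difference estimate of the same gradient entry.
eps = 1e-6
weights[0][1][2] += eps
loss_plus = loss_fn(forward(weights, biases, x)[-1], y_target)
weights[0][1][2] -= 2 * eps
loss_minus = loss_fn(forward(weights, biases, x)[-1], y_target)
weights[0][1][2] += eps
numeric = (loss_plus - loss_minus) / (2 * eps)
```

The two estimates should agree up to rounding error. Note that the backward pass reuses the downstream weight matrix `W` itself, which is the non-local information discussed above.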
2.2 Variations of learning problems/objective functions

In this and the following sections (2.3, 2.4), we introduce common variations of learning problems, network architectures, and training algorithms.

Traditionally, learning problems are divided into three kinds: supervised, reinforcement, and unsupervised learning problems. The difference across these three kinds of learning problems lies in the goal or objective. In supervised learning, each input is associated with a target. The system learns to produce outputs that match the targets. In reinforcement learning, instead of explicit (high-dimensional) targets, the system receives a series of scalar rewards. It learns to produce outputs (actions) that maximize total rewards. Unsupervised learning refers to a diverse set of problems where the system is not provided with explicit targets or rewards. Due to space limitations, we will mainly focus on networks trained with supervised learning in this Primer.

Supervised learning. As mentioned before, for supervised learning tasks, input and target output pairs $\{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}_{\mathrm{target}})\}$ are provided. The goal is to minimize the difference between target outputs and actual outputs predicted by the network. In many common supervised learning problems, the target outputs are behavioral outputs. For example, in a typical object classification task, each input is an image containing a single object, while the target output is an integer corresponding to the class of that object (e.g., dog, cat, etc.). In other cases, the target output can directly be neural recording data [McIntosh et al., 2016, Rajan et al., 2016, Andalman et al., 2019].

The classical perceptual decision-making task with random-dot motion [Britten et al., 1992, Roitman and Shadlen, 2002] can be formulated as a supervised learning problem, because there is a correct answer.
In this task, animals watch randomly moving dots and report the dots’ overall motion direction by choosing one of two alternatives, A or B. This task can be simplified as a network receiving a stream of noisy inputs $x^{(i)}_t$ at every time point $t$ of the $i$-th trial, which can represent the net evidence in support of A and against B. At the end of each trial, $t = T$, the system should learn to report the sign of the average input, $y^{(i)}_{\mathrm{target}} = \mathrm{sign}(\langle x^{(i)}_t \rangle_t)$, $+1$ for choice A and $-1$ for choice B.

Reinforcement learning. For reinforcement learning [Sutton and Barto, 2018], a model (an agent) interacts with an environment, such as a (virtual) maze. At time step $t$, the agent receives an observation $o_t$ from the environment, produces an action $a_t$ that updates the environment state to $s_{t+1}$, and receives a scalar reward $r_t$ (negative value for punishment). For example, a model navigating a virtual maze can receive pixel-based visual inputs as observations $o_t$, produce actions $a_t$ that move itself in the maze, and receive rewards when it exits the maze. The objective is to produce appropriate actions $a_t$ given past and present observations that maximize cumulative rewards $\sum_t r_t$. In many classical reinforcement learning problems, the observation $o_t$ equals the environment state $s_t$, which contains complete information about the environment.

Reinforcement learning (without neural networks) has been widely used by neuroscientists and cognitive scientists to study value-based learning and decision-making tasks [Schultz et al., 1997, Daw et al., 2011, Niv, 2009]. For example, in the multi-armed bandit task, the agent chooses between multiple options repeatedly, where each option produces rewards with a certain probability. Reinforcement learning theory can model how the agent’s behavior adapts over time, and help neuroscientists study the neural mechanism of value-based behavior.
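To make the multi-armed bandit setting concrete, here is a minimal sketch of one common modeling choice: an agent that tracks a value estimate for each option with a delta rule and picks options epsilon-greedily. The function name `run_bandit`, the reward probabilities, the learning rate, and the exploration rate are all assumptions made for illustration; other learning and choice rules (for example, softmax action selection) are equally plausible models of behavior.

```python
import random

def run_bandit(reward_probs, n_trials=2000, lr=0.1, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy agent with delta-rule value updates.

    reward_probs[a] is the (assumed) probability that option a pays a
    reward of 1 on any given trial.
    """
    rng = random.Random(seed)
    values = [0.0] * len(reward_probs)  # value estimate for each option
    choices = []
    for _ in range(n_trials):
        if rng.random() < epsilon:
            # Explore: pick an option uniformly at random.
            action = rng.randrange(len(values))
        else:
            # Exploit: pick the option with the highest value estimate.
            action = max(range(len(values)), key=lambda a: values[a])
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        # Delta rule: move the estimate toward the obtained reward.
        values[action] += lr * (reward - values[action])
        choices.append(action)
    return values, choices

# Hypothetical two-armed bandit in which option 1 is the better choice.
values, choices = run_bandit([0.2, 0.8])
late_fraction = sum(c == 1 for c in choices[-500:]) / 500.0
```

In this sketch, `late_fraction` measures how often the agent picks the richer option late in the session. With these settings it is expected to be well above chance, showing how behavior adapts to reward statistics over time.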
Deep reinforcement learning trains deep neural networks using reinforcement learning [Mnih et al., 2015], enabling applications to many more complex problems. Deep reinforcement learning can in principle be used to study most tasks performed by lab animals [Botvinick et al., 2020], since animals are usually motivated to perform the task via rewards. Although many such tasks can also be formulated as supervised learning problems when there exists a correct choice (e.g., perceptual decision making), many other tasks can only be described as reinforcement learning tasks because answers are subjective [Haroush and Williams, 2015, Kiani and Shadlen, 2009]. For example, a perceptual decision-making task where there is a correct answer (A, not B) can be extended to assess animals’ confidence about their choice [Kiani and Shadlen, 2009, Song et al., 2017]. In addition to the two alternatives that result in a large reward for the correct choice and no reward otherwise, monkeys are presented with a sure-bet option that guarantees a small reward. Since a small reward is better than no reward, subjects are more likely to choose the sure-bet option when they are less confident about making a perceptual judgement. Reinforcement learning is necessary here because there is no ground-truth choice output: the optimal choice depends on the animals’ own confidence level in their perceptual decision.

Unsupervised learning. For unsupervised learning, only inputs $\{\mathbf{x}^{(i)}\}$ are provided, and the objective function is defined solely with the inputs and the network parameters, $L(\mathbf{x}, \theta)$ (no targets or rewards). For example, finding the first component in Principal Component Analysis (PCA) can be formulated as unsupervised learning in a simple neural network. A single neuron $y$ reading out from a group of input neurons $\mathbf{x}$ ($y = \mathbf{w}^\top \mathbf{x}$) can learn to extract the first principal component by maximizing its variance $\mathrm{Var}(y)$ while keeping its connection weights normalized ($\|\mathbf{w}\| = 1$) [Oja, 1982].
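The single-neuron PCA example can be sketched with Oja's learning rule, $\Delta \mathbf{w} \propto y(\mathbf{x} - y\mathbf{w})$, which increases the output variance while implicitly keeping $\|\mathbf{w}\|$ close to 1. The synthetic two-dimensional data, the learning rate, and the function name `oja_first_component` below are assumptions made for this illustration. The data are constructed so that the direction `u` has the largest variance and is therefore the first principal component.

```python
import math
import random

def oja_first_component(samples, lr=0.01, w_init=(1.0, 0.0)):
    """Online estimate of the first principal component with Oja's rule."""
    w = list(w_init)
    for x in samples:
        y = sum(wi * xi for wi, xi in zip(w, x))  # neuron output y = w^T x
        # Hebbian term y * x plus a decay term -y^2 * w that bounds the norm.
        w = [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w

# Synthetic zero-mean inputs: standard deviation 2 along u, 0.5 along u_perp.
rng = random.Random(0)
u = (0.6, 0.8)        # assumed high-variance direction (unit length)
u_perp = (-0.8, 0.6)  # orthogonal low-variance direction
samples = []
for _ in range(5000):
    s, n = rng.gauss(0.0, 2.0), rng.gauss(0.0, 0.5)
    samples.append((s * u[0] + n * u_perp[0], s * u[1] + n * u_perp[1]))

w = oja_first_component(samples)
norm_w = math.sqrt(sum(wi * wi for wi in w))
# Absolute cosine between the learned weights and the true direction.
alignment = abs(w[0] * u[0] + w[1] * u[1]) / norm_w
```

After training, `alignment` is expected to be close to 1 and `norm_w` close to 1. The sign of `w` is arbitrary, which is why the absolute value is taken.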
Unsupervised learning is particularly relevant for modeling the development of sensory cortices. Although widely used in machine learning, the kind of labeled data needed for supervised learning, such as image-object class pairs, is rare for most animals. Unsupervised learning has been used to explain neural responses of early visual areas [Barlow et al., 1961, Olshausen and Field, 1996], and more recently, of higher visual areas [Zhuang et al., 2019].

Compared to reinforcement and unsupervised learning, supervised learning can be particularly effective because the network receives more informative feedback in the form of high-dimensional target outputs. Therefore, it is common to formulate a reinforcement/unsupervised learning problem (or parts of it) as a supervised one. For example, consider an unsupervised learning problem of compressing high-dimensional inputs $\mathbf{x}$ into a lower-dimensional representation $\mathbf{z}$ while retaining as much information as possible about the inputs (not necessarily in the information-theoretic sense). One approach to this problem is to train autoencoder networks [Rumelhart et al., 1986, Kingma and Welling, 2013] using supervised learning. An autoencoder consists of an encoder that maps input $\mathbf{x}$ into a low-dimensional latent representation $\mathbf{z} = f_{\mathrm{encode}}(\mathbf{x})$, and a decoder that maps the latent back to a high-dimensional representation $\mathbf{y} = f_{\mathrm{decode}}(\mathbf{z})$. To make sure $\mathbf{z}$ contains information about $\mathbf{x}$, autoencoders use the original input as the supervised learning target, $\mathbf{y}_{\mathrm{target}} = \mathbf{x}$.

2.3 Variations of network architectures

Recurrent neural network. Besides the MLP, another fundamental ANN architecture is the recurrent neural network (RNN), which processes information in time (Figure 2B). In a “vanilla” or Elman RNN [Elman, 1990], activity of model neurons at time $t$, $\mathbf{r}_t$, is driven by recurrent connectivity $W_r$, and by inputs $\mathbf{x}_t$ through connectivity $W_x$. The output of the network is read out through connections $W_y$.
$$\mathbf{c}_t = W_r \mathbf{r}_{t-1} + W_x \mathbf{x}_t + \mathbf{b}_r, \tag{14}$$
$$\mathbf{r}_t = f(\mathbf{c}_t), \tag{15}$$
$$\mathbf{y}_t = W_y \mathbf{r}_t + \mathbf{b}_y. \tag{16}$$

Here $\mathbf{c}_t$ represents the cell state, analogous to membrane potential or input current, while $\mathbf{r}_t$ represents the neuronal activity. An RNN can be unrolled in time (Figure 2C) and viewed as a particular form of an MLP,

$$\mathbf{r}_t = f(W_r \mathbf{r}_{t-1} + W_x \mathbf{x}_t + \mathbf{b}_r), \quad \text{for } t = 1, \cdots, T. \tag{17}$$

Here, neurons in the $t$-th layer, $\mathbf{r}_t$, receive inputs from the $(t-1)$-th layer, $\mathbf{r}_{t-1}$, and additional inputs from outside of the recurrent network, $\mathbf{x}_t$. Unlike in regular MLPs, the connections from each layer to the next are shared across time.

Backpropagation also applies to an RNN. While backpropagation in an MLP propagates gradient information from the final layer back (Eq. 13), computing the gradient for an RNN involves propagating information backward in time (backpropagation-through-time, or BPTT) [Werbos, 1990]. Assuming that the loss is computed from outputs at the last time point $T$ and a linear activation function, the key step of backpropagation-through-time is computed similarly to Eq. 13 as

$$\frac{\partial L}{\partial \mathbf{r}_t} = W_r^\top \frac{\partial L}{\partial \mathbf{r}_{t+1}} = [W_r^\top]^2 \frac{\partial L}{\partial \mathbf{r}_{t+2}} = \cdots. \tag{18}$$

With an increasing number of time steps in an RNN, weight modifications involve products of many matrices (Eq. 18). An analogous problem is present for very deep feedforward networks (for example, networks with more than 10 layers). The norm of this matrix product, $\|[W_r^\top]^T\|$, can grow exponentially with $T$ if $W_r$ is large (more precisely, if the largest eigenvalue magnitude of $W_r$ exceeds 1), or vanish to zero if $W_r$ is small, making it historically difficult to train recurrent networks [Bengio et al., 1994, Pascanu et al., 2013]. Such exploding and vanishing gradient problems can be substantially alleviated with a combination of modern techniques, including network architectures [Hochreiter and Schmidhuber, 1997, He et al., 2016] and initial network connectivity [Le et al., 2015, He et al., 2015] that tend to preserve the norm of the backpropagated gradient.
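The exploding and vanishing behavior of Eq. 18 can be seen directly in a toy case. The sketch below repeatedly applies $W_r^\top$ to a gradient vector for a two-unit linear RNN whose recurrent matrix is a rotation scaled by a gain. This choice of matrix, along with the function names `scaled_rotation` and `backprop_through_time`, is an assumption made so that the result is easy to check: both eigenvalues have magnitude equal to the gain, so the gradient norm changes by exactly that factor per time step.

```python
import math

def scaled_rotation(gain, angle=0.3):
    """2x2 rotation matrix scaled by gain; eigenvalue magnitudes equal gain."""
    c, s = math.cos(angle), math.sin(angle)
    return [[gain * c, -gain * s], [gain * s, gain * c]]

def backprop_through_time(W_r, grad_T, n_steps):
    """Apply dL/dr_t = W_r^T dL/dr_{t+1} (Eq. 18) n_steps times.

    Returns the gradient norm after each backward step, starting with the
    norm of the gradient at the final time point.
    """
    n = len(grad_T)
    grad = list(grad_T)
    norms = [math.sqrt(sum(g * g for g in grad))]
    for _ in range(n_steps):
        # Multiply by the transpose of W_r.
        grad = [sum(W_r[j][i] * grad[j] for j in range(n)) for i in range(n)]
        norms.append(math.sqrt(sum(g * g for g in grad)))
    return norms

grad_T = [1.0, 0.0]  # assumed gradient at the last time point, with unit norm
norms_grow = backprop_through_time(scaled_rotation(1.1), grad_T, 50)
norms_keep = backprop_through_time(scaled_rotation(1.0), grad_T, 50)
norms_decay = backprop_through_time(scaled_rotation(0.9), grad_T, 50)
```

After 50 steps the gradient norm is about $1.1^{50} \approx 117$ for a gain of 1.1, stays at 1 for a gain of 1, and shrinks to about $0.9^{50} \approx 0.005$ for a gain of 0.9. This is why initializations that keep eigenvalue magnitudes near 1 tend to help.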
Convolutional neural networks. A particularly important type of network architecture is the convolutional neural network (Figure 2D). The use of convolution means that a group of neurons will each process its respective inputs using the same function, in other words, the same set of connection weights. In a typical convolutional neural network processing visual inputs [Fukushima et al., 1983, LeCun et al., 1990, Krizhevsky et al., 2012, He et al., 2016], neurons are organized into $N_{\mathrm{channel}}$ “channels” or “feature maps”. Each channel contains $N_{\mathrm{height}} \times N_{\mathrm{width}}$ neurons with different spatial selectivity. Each neuron in a convolutional layer is indexed by a tuple $i = (i_C, i_H, i_W)$, representing the channel index ($i_C$) and the spatial preference indices ($i_H, i_W$). The $i$-th neuron in layer $l$ is typically driven by neurons in the previous layer (bias term and activation function omitted),

$$r^{(l)}_{i_C i_H i_W} = \sum_{j_C j_H j_W} W^{(l)}_{i_C i_H i_W,\, j_C j_H j_W}\, r^{(l-1)}_{j_C j_H j_W}. \tag{19}$$

Importantly, in convolutional networks, the connection weights do not depend on the absolute spatial location of the $i$-th neuron; instead they depend solely on the spatial displacement $(i_H - j_H, i_W - j_W)$ between the pre- and post-synaptic neurons,

$$W^{(l)}_{i_C i_H i_W,\, j_C j_H j_W} = W^{(l)}_{i_C, j_C}(i_H - j_H, i_W - j_W). \tag{20}$$

Therefore, all neurons within a single channel process different parts of the input space using the same shared set of connection weights, allowing these neurons to have the same stimulus selectivity with receptive fields at different spatial locations. Moreover, neurons only receive inputs from other neurons with similar spatial preferences, i.e., when $|i_H - j_H|$ and $|i_W - j_W|$ values are small (Figure 2D).

This reusing of weights not only dramatically reduces the number of trainable parameters, but also imposes invariance on processing.
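A one-dimensional, single-channel simplification of Eq. 19-20 makes the consequences of weight sharing easy to see. In the sketch below, the function `conv1d`, the edge-detecting kernel, and the toy signal are assumptions made for illustration. Every output unit applies the same three weights to its local input patch, so shifting the input by two positions simply shifts the output by two positions.

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation: every output unit reuses the same kernel,
    so weights depend only on the displacement d between input and output."""
    k = len(kernel)
    return [sum(kernel[d] * signal[i + d] for d in range(k))
            for i in range(len(signal) - k + 1)]

kernel = [1.0, 0.0, -1.0]               # assumed local difference filter
signal = [0, 0, 1, 3, 1, 0, 0, 0, 0]    # a small bump near the left
shifted = [0, 0] + signal[:-2]          # same bump moved 2 steps to the right

out = conv1d(signal, kernel)
out_shifted = conv1d(shifted, kernel)

# Parameter counts: one shared kernel versus a dense layer with an
# independent weight for every (output unit, input unit) pair.
n_params_conv = len(kernel)
n_params_dense = len(out) * len(signal)
```

Here `out_shifted[2:]` equals `out[:-2]`, so the response pattern moves with the stimulus. The convolutional layer uses 3 weights where an unconstrained layer of the same input and output size would use 63.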
For visual processing, convolutional networks typically impose spatial invariance such that objects are processed with the same set of weights regardless of their spatial positions. In a typical convolutional network, across layers the number of neurons per channel ($N_{\mathrm{height}} \times N_{\mathrm{width}}$) decreases (with coarser spatial resolution) while more features are extracted (with an increasing number of channels). A classifier is commonly placed at the end of the system to learn a particular task, such as categorization of visual objects.

Activation function. Most neurons in ANNs, like their biological counterparts, perform nonlinear computations based on their inputs. These neurons are usually point neurons with a single nonlinear activation function $f(\cdot)$ that links the sum of inputs to the output activity. The nonlinearity is essential for the power of ANNs [Hornik et al., 1989]. A common choice of activation function is the Rectified Linear Unit (ReLU) function, $f(x) = \max(x, 0)$ [Glorot et al., 2011]. The derivative of ReLU at $x = 0$ is mathematically undefined, but conventionally set to 0 in practice. ReLU and its variants [Clevert et al., 2015] are routinely used in feedforward networks, while the hyperbolic tangent (tanh) function is often used in recurrent networks [Hochreiter and Schmidhuber, 1997]. ReLU and similar activation functions are asymmetric and non-saturating at high values. Although biological neurons eventually saturate at high rates, they often operate in non-saturating regimes. Therefore, traditional neural circuit models with rate units have also frequently used non-saturating activation functions [Abbott and Chance, 2005, Rubin et al., 2015].

Normalization. Normalization methods are important components of many ANNs, in particular very deep neural networks [Ioffe and Szegedy, 2015, Ba et al., 2016b, Wu and He, 2018].
Similar to normalization in biological neural circuits [Carandini and Heeger, 2012], normalization methods in ANNs keep inputs and/or outputs of neurons in desirable ranges. For example, for inputs $\mathbf{x}$ (e.g., stimulus) to a layer, Layer Normalization [Ba et al., 2016b] amounts to a form of “z-scoring” across units, so that the actual input $\hat{x}_i$ to the $i$-th neuron is

$$\hat{x}_i = \gamma \frac{x_i - \mu}{\sigma} + \beta, \tag{21}$$
$$\mu = \langle x_j \rangle, \tag{22}$$
$$\sigma = \sqrt{\langle (x_j - \mu)^2 \rangle + \epsilon}, \tag{23}$$

where $\langle x_j \rangle$ refers to the average over all units in the same layer; $\mu$ and $\sigma^2$ are the mean and variance of $\mathbf{x}$ (the latter up to the small constant $\epsilon$). After normalization, different external inputs lead to the same mean and variance for $\hat{\mathbf{x}}$, set by the trainable parameters $\gamma$ and $\beta$. The values of $\gamma$ and $\beta$ do not depend on the external inputs. The small constant $\epsilon$ ensures that $\sigma$ is not vanishingly small.

2.4 Variations of training algorithms

Variants of SGD-based methods. Supervised, reinforcement, and unsupervised learning tasks can all be trained with SGD-based methods. Partly due to the stochastic nature of the estimated gradient, directly applying SGD (Eq. 5) often leads to poor training performance. Gradually decaying the learning rate $\eta$ during training can often improve performance, since a smaller learning rate during late training encourages finer tuning of parameters [Bottou et al., 2018]. Various optimization methods based on SGD are used to improve learning [Kingma and Ba, 2014, Sutskever et al., 2013]. One simple and effective technique is momentum [Sutskever et al., 2013, Polyak, 1964], which on step $j$ updates parameters with $\Delta \theta^{(j)}$ based on temporally smoothed gradients $\mathbf{v}^{(j)}$,

$$\mathbf{v}^{(j)} = \mu \mathbf{v}^{(j-1)} + \frac{\partial L^{(j)}}{\partial \theta}, \quad 0 < \mu < 1, \tag{24}$$
$$\Delta \theta^{(j)} = -\eta \mathbf{v}^{(j)}, \tag{25}$$

where $\mu$ here denotes the momentum coefficient (not the mean in Eq. 22). Alternatively, in adaptive learning rate methods [Duchi et al., 2011, Kingma and Ba, 2014], the learning rate of each individual parameter is adjusted based on the statistics (e.g., mean and variance) of its gradient over training steps.
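As a small illustration of the momentum update, the sketch below applies it to a toy quadratic loss whose curvature differs between its two parameters: a velocity that accumulates past gradients with decay `mu` is scaled by the learning rate `lr` and subtracted from the parameters. The loss, the hyperparameter values, and the function name `sgd_momentum` are assumptions made for this example. For simplicity the exact gradient is used at every step, whereas in SGD it would be a noisy minibatch estimate.

```python
def sgd_momentum(grad_fn, theta, lr=0.01, mu=0.9, n_steps=300):
    """Gradient descent with momentum on a list of parameters."""
    theta = list(theta)
    v = [0.0] * len(theta)  # temporally smoothed gradient (velocity)
    for _ in range(n_steps):
        g = grad_fn(theta)
        # Accumulate the new gradient into the decayed velocity.
        v = [mu * vi + gi for vi, gi in zip(v, g)]
        # Move the parameters against the smoothed gradient.
        theta = [ti - lr * vi for ti, vi in zip(theta, v)]
    return theta

# Toy loss L = 0.5 * sum_i a_i * theta_i^2 with unequal curvatures a_i.
curvatures = [1.0, 10.0]

def loss_fn(theta):
    return 0.5 * sum(a * t * t for a, t in zip(curvatures, theta))

def grad_fn(theta):
    return [a * t for a, t in zip(curvatures, theta)]

theta_init = [1.0, 1.0]
theta_final = sgd_momentum(grad_fn, theta_init)
loss_initial = loss_fn(theta_init)
loss_final = loss_fn(theta_final)
```

With these settings the loss falls by many orders of magnitude within 300 steps, including along the low-curvature direction where plain gradient descent with the same learning rate would progress slowly.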
For example, in the Adam method [Kingma and Ba, 2014], the value of a parameter update is magnified if its gradient has been consistent across steps (low variance). Adaptive learning rate methods can be viewed as approximately taking into account the curvature of the loss function [Duchi et al., 2011].

Regularization. Regularization techniques are important during training in order to improve the generalization performance of deep networks. Adding an L2 regularization term, $L_{\mathrm{reg}} = \lambda \sum_{ij} W_{ij}^2$, to the loss function [Tikhonov, 1943] (equivalent to weight decay [Krogh and Hertz, 1992]) discourages the network from using large connection weights, which can improve generalization by implicitly limiting model complexity. Dropout [Srivastava et al., 2014] silences a randomly selected portion of neurons at each step of training. It reduces the network’s reliance on particular neurons or a precise combination of neurons. Dropout can be thought of as loosely approximating spiking noise.

The choice of hyperparameters (learning rate, batch size, network initialization, etc.) is often guided by a combination of theory, empirical evidence, and hardware constraints. For neuroscientific applications, it is important that the scientific conclusions do not rely heavily on the hyperparameter choices. If they do, the dependency should be clearly documented.

3 Examples of building ANNs to address neuroscience questions

In this section, we overview two common usages of ANNs in addressing neuroscience questions.

3.1 Convolutional networks for visual systems

Deep convolutional neural networks are currently the standard tools in computer vision research and applications [Krizhevsky et al., 2012, Simonyan and Zisserman, 2014, He et al., 2016, 2017]. These networks routinely consist of tens, sometimes hundreds, of layers of convolutional processing. Effective training of deep feedforward neural networks used to be difficult.
This trainability problem has been drastically improved by a combination of innovations in various areas. Modern deep networks would be too large and therefore too slow to run, not to mention train, if not for the rapid development of hardware such as general purpose GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units) [Jouppi et al., 2017]. Deep convolutional networks are usually trained with large naturalistic datasets containing millions of high resolution labeled images (e.g., ImageNet [Deng et al., 2009]), using training methods with adaptive learning rates [Kingma and Ba, 2014, Tieleman and Hinton, 2012]. Besides the default use of convolution, a wide range of network architecture innovations improve performance, including the adoption of the ReLU activation function [Glorot et al., 2011], normalization methods [Ioffe and Szegedy, 2015], and the use of residual connections that can provide an architectural shortcut from a network layer’s inputs directly to its outputs [He et al., 2016].

Deep convolutional networks have been proposed as computational models of the visual systems, particularly of the ventral visual stream or the “what pathway” for visual object information processing (Figure 3) [Yamins and DiCarlo, 2016]. These models are typically trained using supervised learning on the same image classification tasks as the ones used in computer vision research, and in many cases, are the exact same convolutional networks developed in computer vision. In comparison, classical models of the visual systems typically rely on hand-designed features (synaptic weights) [Jones and Palmer, 1987, Freeman and Simoncelli, 2011, Riesenhuber and Poggio, 1999], such as Gabor filters, or are trained with unsupervised learning based on the efficient coding principles [Barlow et al., 1961, Olshausen and Field, 1996].
Although classical models have had success at explaining various features of lower-level visual areas, deep convolutional networks surpass them substantially in explaining neural activity in higher-level visual areas in both monkeys [Yamins et al., 2014, Cadieu et al., 2014, Yamins and DiCarlo, 2016] and humans [Khaligh-Razavi and Kriegeskorte, 2014]. Besides being trained to classify objects, convolutional networks can also be trained to directly reproduce patterns of neural activity recorded in various visual areas [McIntosh et al., 2016, Prenger et al., 2004].

Figure 3: Comparing the visual system and deep convolutional neural networks. The same image is passed through a monkey’s visual cortex (top) and a deep convolutional neural network (bottom), allowing for side-by-side comparisons between biological and artificial neural networks. Neural responses from IT are best predicted by responses from the final layer of the convolutional network, while neural responses from V4 are better predicted by an intermediate network layer (green dashed arrows). Figure adapted from Yamins and DiCarlo [2016].

In a classic study comparing convolutional networks with higher visual areas [Yamins et al., 2014], Yamins and colleagues trained thousands of convolutional networks with different architectures on a visual categorization task. To study how similar the artificial and biological visual systems are, they quantified how well the network’s responses to naturalistic images can be used to linearly predict responses from the inferior temporal (IT) cortex of monkeys viewing the same images. They found that this neural predictivity is highly correlated with accuracy on the categorization task, suggesting that better IT-predicting models can be built by developing better performing models on challenging natural image classification tasks.
They further found that, unlike IT, neural responses from the relatively lower visual area V4 are best predicted by intermediate layers of the networks (Figure 3).

As computational models of visual systems, convolutional networks can model complex, high-dimensional inputs to downstream areas, which is useful for large-scale models using pixel-based visual inputs [Eliasmith et al., 2012]. This process has been made particularly straightforward by the easy availability of many pre-trained networks in standard deep learning frameworks like PyTorch [Paszke et al., 2019] and TensorFlow [Abadi et al., 2016].

3.2 Recurrent neural networks for cognitive and motor systems

Recurrent neural networks are common machine learning tools to process sequences, such as speech and text. In neuroscience, they have been used to model various aspects of the cognitive, motor, and navigation systems [Mante et al., 2013, Barak et al., 2013, Sussillo et al., 2015, Yang et al., 2019, Wang et al., 2018, Cueva and Wei, 2018]. Unlike convolutional networks used to model visual systems, which are trained on large-scale image classification tasks, recurrent networks are usually trained on the specific cognitive or motor tasks that neuroscientists are studying. By training RNNs on the same tasks that animals or humans performed, side-by-side comparisons can be made between RNNs and brains. The comparisons can be made at many levels, including single-neuron activity and selectivity, population decoding, state-space dynamics, and network responses to perturbations. We will expand more on how to analyze RNNs in the next section.

An influential work that uses RNNs to model cognition involves a monkey experiment on context-dependent perceptual decision-making [Mante et al., 2013]. In this task, a fraction (called motion coherence) of randomly moving dots move in the same direction (left or right); independently, a fraction (color coherence) of dots are red, and the rest are green.
In a single trial, subjects were cued by a context signal to perform either a motion task (judging whether the net motion direction is right or left) or a color task (deciding whether there are more red dots than green ones). Monkeys performed the task by temporally integrating evidence for the behaviorally relevant information (e.g., color) while ignoring the irrelevant feature (motion direction in the color task). Neurons in the prefrontal cortex recorded from behaving animals displayed complex activity patterns, in which the irrelevant features are still strongly represented even though they only weakly influence behavioral choices. These counter-intuitive activity patterns were nevertheless captured by an RNN [Mante et al., 2013]. Examining the RNN dynamics revealed a novel mechanism by which the irrelevant features are represented, but selectively filtered out and not integrated over time during evidence accumulation.

To better compare neural dynamics between RNNs and biological systems, RNNs used in neuroscience often treat time differently from their counterparts in machine learning. RNNs in machine learning are nearly always discrete-time systems (but see Chen et al. [2018]), where the state at time step $t$ is obtained through a mapping from the state at time step $t-1$ (Eq. 15). The use of a discrete-time system means that stimuli that are separated by several seconds in real life can be provided to the network in consecutive time points. To allow for more biologically realistic neural dynamics, RNNs used in neuroscience are often based on continuous-time dynamical systems [Wilson and Cowan, 1972, Sompolinsky et al., 1988], such as

$$\tau \frac{d\mathbf{r}}{dt} = -\mathbf{r}(t) + f(W_r \mathbf{r}(t) + W_x \mathbf{x}(t) + \mathbf{b}_r). \tag{26}$$

Here $\tau$ is the single-unit time scale.
This continuous-time system can then be discretized using the Euler method with a time step of $\Delta t$ ($< \tau$),

$$\mathbf{r}(t + \Delta t) \approx \mathbf{r}(t) + \frac{\Delta t}{\tau}\left[-\mathbf{r}(t) + f(W_r \mathbf{r}(t) + W_x \mathbf{x}(t) + \mathbf{b}_r)\right]. \tag{27}$$

Besides gradient descent through back-propagation, a different line of algorithms has been used to train RNN models in neuroscience [Sussillo and Abbott, 2009, Laje and Buonomano, 2013, Andalman et al., 2019]. These algorithms are based on the idea of harnessing chaotic systems with weak perturbations [Jaeger and Haas, 2004]. In particular, the FORCE algorithm [Sussillo and Abbott, 2009] allows for rapid learning by modifying the output connections of an RNN to match the target using a recursive least-squares algorithm. The network output $y(t)$ (assumed to be one-dimensional here) is fed back to the RNN through $\mathbf{w}_{fb}$,

$$\tau \frac{d\mathbf{r}}{dt} = -\mathbf{r}(t) + f(W_r \mathbf{r}(t) + W_x \mathbf{x}(t) + \mathbf{w}_{fb}\, y(t) + \mathbf{b}_r), \tag{28}$$

$$y(t) = \mathbf{w}^\top \mathbf{r}(t). \tag{29}$$

Therefore, modifying the output connections amounts to a low-rank modification ($\mathbf{w}_{fb}\mathbf{w}^\top$) of the recurrent connection matrix,

$$\tau \frac{d\mathbf{r}}{dt} = -\mathbf{r}(t) + f([W_r + \mathbf{w}_{fb}\mathbf{w}^\top]\, \mathbf{r}(t) + W_x \mathbf{x}(t) + \mathbf{b}_r). \tag{30}$$

4 Analyzing and understanding ANNs

Common ANNs used in ML or neuroscience are not easily interpretable. For many neuroscience problems, they may serve better as model systems that await further analyses. Successful training of an ANN on a task does not mean knowing how the system works. Therefore, unlike in most ML applications, a trained ANN is not the end goal but merely the prerequisite for analyzing that network to gain understanding.

Most systems neuroscience techniques used to investigate biological neural circuits can be directly applied to understand artificial networks. To facilitate side-by-side comparison between artificial and biological neural networks, the activity of an ANN can be visualized and analyzed with the same dimensionality reduction tools (e.g., PCA) used for biological recordings [Mante et al., 2013, Kobak et al., 2016, Williams et al., 2018].
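As a minimal sketch of the last point, the following NumPy code integrates the continuous-time network of Eq. 26 with the Euler update of Eq. 27 and then projects the resulting activity onto its leading principal components. The network is random and untrained, and the sizes, time constants, input pulse, and helper names (`simulate_rnn`, `pca_project`) are illustrative assumptions rather than values taken from any of the cited studies.

```python
import numpy as np


def simulate_rnn(W_r, W_x, b_r, x, tau=0.1, dt=0.01, f=np.tanh):
    """Euler integration of tau * dr/dt = -r + f(W_r r + W_x x + b_r)."""
    n_steps, n_units = x.shape[0], W_r.shape[0]
    r = np.zeros(n_units)                 # network starts at rest
    rates = np.empty((n_steps, n_units))
    alpha = dt / tau                      # Euler factor, requires dt < tau
    for t in range(n_steps):
        r = r + alpha * (-r + f(W_r @ r + W_x @ x[t] + b_r))
        rates[t] = r
    return rates


def pca_project(rates, n_components=2):
    """Project activity (time x units) onto its leading principal components."""
    centered = rates - rates.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)       # fraction of variance per component
    return centered @ vt[:n_components].T, explained[:n_components]


rng = np.random.default_rng(0)
n_units, n_inputs, n_steps = 50, 2, 200
W_r = rng.normal(0.0, 1.0 / np.sqrt(n_units), size=(n_units, n_units))
W_x = rng.normal(0.0, 1.0, size=(n_units, n_inputs))
b_r = np.zeros(n_units)

x = np.zeros((n_steps, n_inputs))
x[20:60, 0] = 1.0                         # a 0.4 s pulse on the first input channel

rates = simulate_rnn(W_r, W_x, b_r, x)
projection, explained = pca_project(rates)
print(projection.shape, explained)
```

The same `pca_project` call could in principle be applied to a time-by-neuron matrix of recorded firing rates, which is what makes the comparison side by side.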
To understand the causal relationship between neurons and behavior, an arbitrary set of neurons can be lesioned [Yang et al., 2019], or inactivated for a short duration, akin to optogenetic manipulation in physiological experiments. Similarly, connections between two selected groups of neurons can be lesioned to understand the causal contribution of cross-population interactions [Andalman et al., 2019].

In this section, we focus on methods that are particularly useful for analyzing ANNs. These methods include optimization-based tuning analysis [Erhan et al., 2009], fixed-point-based dynamical system analysis [Sussillo and Barak, 2013], quantitative comparisons between a model and experimental data [Yamins et al., 2014], and insights from the perspective of biological evolution [Lindsey et al., 2019, Richards et al., 2019].

Similarity comparison

Analysis methods such as visualization, lesioning, tuning, and fixed-point analysis can offer detailed intuition into the neural mechanisms of individual networks. However, with the relative ease of training ANNs, it is possible to train a large number of neural networks for the same task or dataset [Maheswaranathan et al., 2019, Yamins et al., 2014]. With such a volume of data, it is necessary to take advantage of high-throughput quantitative methods that compare different models at scale. Similarity comparison methods compute a scalar similarity score between the neural activity of two networks performing the same task [Kriegeskorte et al., 2008, Kornblith et al., 2019]. These methods are agnostic about the network form and size, and can be applied to artificial and biological networks alike.

Consider two networks (or two populations of neurons), of sizes $N_1$ and $N_2$ respectively. Their neural activity in response to the same $D$ task conditions can be summarized by a $D$-by-$N_1$ matrix $R_1$ and a $D$-by-$N_2$ matrix $R_2$ (Figure 4A).
Representational similarity analysis (RSA) [Kriegeskorte et al., 2008] first computes the dissimilarity or distances of neural responses between different task conditions within each network, yielding a $D$-by-$D$ dissimilarity matrix for each network (Figure 4B). Next, the correlation between the dissimilarity matrices of the two networks is computed. A higher correlation corresponds to more similar representations.

Another related line of methods uses linear regression (as used in [Yamins et al., 2014]) to predict $R_2$ through a linear transformation of $R_1$, $R_2 \approx R_1 W$, where $W$ is an $N_1$-by-$N_2$ weight matrix. The similarity corresponds to the correlation between $R_2$ and its predicted value $R_1 W$.

Complex tuning analysis

Studying tuning properties of single neurons has been one of the most important analysis techniques in neuroscience [Kuffler, 1953]. Classically, tuning properties are studied in sensory areas by showing stimuli parameterized in a low-dimensional space (e.g., oriented bars or gratings in vision [Hubel and Wiesel, 1959]). This method is most effective when the neurons studied have relatively simple response properties. A new class of methods treats the mapping of tuning as a high-dimensional optimization problem and directly searches for the stimulus that most strongly activates a neuron. Gradient-free methods such as genetic algorithms have been used to study complex tuning of biological neurons [Yamane et al., 2008]. In deep neural networks, gradient-based methods can be used [Erhan et al., 2009, Zeiler and Fergus, 2014]. For a neuron with activity $r(\mathbf{x})$ given input $\mathbf{x}$, a gradient-ascent optimization starts from a random input and proceeds by updating the input $\mathbf{x}$ as

$$\mathbf{x} \rightarrow \mathbf{x} + \Delta\mathbf{x}, \qquad \Delta\mathbf{x} = \eta \frac{\partial r}{\partial \mathbf{x}}. \tag{31}$$

This method can be used to search for the preferred input to any neuron or any population of neurons in a deep network [Erhan et al., 2009, Bashivan et al., 2019] (see Figure 4C for an example).
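The gradient-ascent update of Eq. 31 can be sketched on a toy model neuron. The two-layer network below (random weights `W1`, `w2`), the step size, and the number of steps are illustrative assumptions; in a deep learning framework the gradient would normally come from automatic differentiation rather than from the hand-written chain rule used here.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden = 10, 20
W1 = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_hidden, n_in))
w2 = rng.normal(0.0, 1.0 / np.sqrt(n_hidden), size=n_hidden)


def response(x):
    """Activity r(x) of the model neuron for input x."""
    return w2 @ np.tanh(W1 @ x)


def response_grad(x):
    """Analytic gradient dr/dx, obtained with the chain rule."""
    h = np.tanh(W1 @ x)
    return W1.T @ (w2 * (1.0 - h**2))


def preferred_input(x0, eta=0.05, n_steps=200):
    """Gradient ascent x -> x + eta * dr/dx (Eq. 31)."""
    x = x0.copy()
    for _ in range(n_steps):
        x = x + eta * response_grad(x)
    return x


x0 = rng.normal(0.0, 0.1, size=n_in)      # random starting input
x_pref = preferred_input(x0)
print(response(x0), response(x_pref))
```

In practice the input is often also constrained (for example, to a bounded pixel range or a fixed norm) so that the optimized stimulus stays interpretable; no such constraint is imposed in this sketch.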
This gradient-based method is particularly useful for studying neurons in higher layers that have more complex tuning properties.

The space of $\mathbf{x}$ may be too high-dimensional (e.g., pixel space) for conducting an effective search, especially for gradient-free methods. In that case, we may utilize a lower-dimensional space that is still highly expressive. A generative model learns a function that maps a lower-dimensional latent space to a high-dimensional space such as pixel space [Kingma and Welling, 2013, Goodfellow et al., 2014]. Then the search can be conducted instead in the lower-dimensional latent space [Ponce et al., 2019].

Figure 4: Convolutional neural network responses and tuning. (A) The neural response to an image in a convolutional neural network trained to classify handwritten digits. The network consists of two layers of convolutional processing, followed by two fully-connected layers. (B) Dissimilarity matrices (each $D$-by-$D$) quantifying how dissimilar the neural responses to different input images are. Dissimilarity matrices are computed for neurons in layers 1 and 4 of the network, with $D = 50$. Images are organized by class (0, 1, etc.), 5 images per class. Neural responses to images in the same class are more similar, i.e., the neural representation is more category-based, in layer 4 (right) than in layer 1 (left). (C) Preferred image stimuli found through gradient-based optimization for sample neurons from each layer. Layers 1 and 2 are convolutional, therefore their neurons have localized preferred stimuli. In contrast, neurons from layers 3 and 4 have non-local preferred stimuli.

ANNs can be used to build models of complex behavior that would not be easy to build otherwise, opening up new possibilities such as studying the encoding of more abstract forms of information. For example, Yang et al. [2019] studied neural tuning of task structure, rather than stimuli, in rule-guided problem solving. An ANN was trained to perform many different cognitive tasks commonly used in animal experiments, including perceptual decision making, working memory, inhibitory control, and categorization. Complex network organization is formed by training, in which recurrent neurons display selectivity for a subset of tasks (Figure 5).

Figure 5: Analyzing tuning properties of a neural network trained to perform 20 cognitive tasks. In a network trained on multiple cognitive tasks, the tuning of model units to individual tasks can be quantified. x-axis: recurrent units; y-axis: different tasks. Color measures the degree (between 0 and 1) to which each unit is engaged in a task. Twelve clusters are identified using a hierarchical clustering method (bottom, colored bars). For instance, cluster 3 is highly selective for pro- versus anti-response tasks (Anti) involving inhibitory control; clusters 10 and 11 are involved in delayed match-to-sample (DMS) and delayed non-match-to-sample (DNMS), respectively; cluster 12 is tuned to DMC. Figure adapted from Yang et al. [2019].

Dynamical systems analysis

Tuning properties provide a mostly static view of neural representation and computation. To understand how neural networks compute and process information in time, it is useful to study the dynamics of RNNs [Mante et al., 2013, Sussillo and Barak, 2013, Goudar and Buonomano, 2018, Chaisangmongkon et al., 2017].

One useful method to understand dynamics is to study fixed points and the network dynamics around them [Strogatz, 2001]. In a generic dynamical system,

$$\frac{d\mathbf{r}}{dt} = F(\mathbf{r}), \tag{32}$$

a fixed point $\mathbf{r}_{ss}$ is a steady state where the state does not change in time, $F(\mathbf{r}_{ss}) = 0$. The network dynamics at a state $\mathbf{r} = \mathbf{r}_{ss} + \Delta\mathbf{r}$ around a fixed point $\mathbf{r}_{ss}$ is approximately linear,

$$\frac{d\mathbf{r}}{dt} = F(\mathbf{r}) = F(\mathbf{r}_{ss} + \Delta\mathbf{r}) \approx F(\mathbf{r}_{ss}) + J(\mathbf{r}_{ss})\Delta\mathbf{r}, \qquad \frac{d\Delta\mathbf{r}}{dt} = J(\mathbf{r}_{ss})\Delta\mathbf{r}, \tag{33}$$

where $J$ is the Jacobian of $F$, $J_{ij} = \partial F_i / \partial r_j$, evaluated at $\mathbf{r}_{ss}$. This is a linear system, which can be understood more easily, for example, by studying the eigenvectors and eigenvalues of $J(\mathbf{r}_{ss})$.
In ANNs, these fixed points can be found by gradient-based optimization [Sussillo and Barak, 2013],

$$\arg\min_{\mathbf{r}} \|F(\mathbf{r})\|^2. \tag{34}$$

Fixed points are particularly useful for understanding how networks store memories, accumulate information [Mante et al., 2013], and transition between discrete states [Chaisangmongkon et al., 2017]. This point can be illustrated in a network trained to perform a parametric working memory task [Romo et al., 1999]. In this task, a sample vibrotactile stimulus at frequency $f_1$ is delivered, followed by a delay period of a few seconds; then a test stimulus at frequency $f_2$ is presented, and subjects must decide whether $f_2$ is higher or lower than $f_1$ (Figure 6A). During the delay, neurons in the prefrontal cortex of behaving monkeys showed persistent activity at a rate that varies monotonically with $f_1$. This parametric working memory encoding emerges from training in an RNN (Figure 6B): in the state space of this network, neural trajectories during the delay period converge to different fixed points depending on the stored value. These fixed points form an approximate line attractor [Seung, 1996] during the delay period (Figure 6C).

There is a dearth of examples in computational neuroscience that account for not just a single aspect of neural representation or dynamics, but a sequence of computations to achieve a complex task. ANNs offer a new tool to confront this difficulty. Chaisangmongkon et al. [2017] used this approach to build a model for delayed match-to-category (DMC) tasks. A DMC task (Figure 6D,E) starts with a sample stimulus, say a visual moving pattern, of which a feature (motion direction as an analog quantity from 0 to 360 degrees) is classified into two categories (A in red, B in blue). After a mnemonic delay period, a test stimulus is shown and the task is to decide whether the test has the same category membership as the sample [Freedman and Assad, 2006].
After training to perform this task, a recurrent neural network shows diverse neural activity patterns similar to those of parietal neurons in monkeys doing the same task (Figure 6F). The trajectory of the recurrent neural population in the state space reveals how computation is carried out through the epochs of the task (Figure 6G).

Understanding neural circuits from objectives, architecture, and training

All of the above methods seek a mechanistic understanding of ANNs after training. A more integrative view links the three basic ingredients in deep learning, namely the learning problem (tasks/objectives), the network architecture, and the training algorithm, to the solution after training [Richards et al., 2019]. This approach is similar to an evolutionary or developmental perspective in biology, which links environments to functions in biological organisms. It can help explain the computational benefit or necessity of observed structures or functions. For example, compared to purely feedforward networks, recurrently-connected deep networks are better at predicting responses of higher visual area neurons to behaviorally challenging images of cluttered scenes [Kar et al., 2019]. This suggests a contribution of recurrent connections to classifying difficult images in the brain.

Figure 6: Understanding network computation through state-space and dynamical system analysis. (A-C) In a simple parametric working memory task [Romo et al., 1999], the network needs to memorize the (frequency) value of a stimulus through a delay period (A). The network can achieve such parametric working memory by developing a line attractor (B,C). (B) Trial-averaged neural activity during the delay period in the PCA space for different stimulus values. Triangles indicate the start of the delay period. (C) Fixed points found through optimization (orange cross). The direction of a line attractor can be estimated by finding the eigenvector with a corresponding eigenvalue close to 0.
The orange line shows the line attractor estimated around one of the fixed points. (D-G) Training both recurrent neural networks and monkeys on a delayed-match-to-category task [Freedman and Assad, 2006]. (D) The task is to decide whether the test and sample stimuli (visual moving patterns) belong to the same category. (E) The two categories are defined based on the motion direction of the stimulus (red: category 1; blue: category 2). (F) In an ANN trained to perform this categorization task, the recurrent units of the model display a wide heterogeneity of onset times for category selectivity, similar to single neurons recorded from monkey posterior parietal cortex (lateral intraparietal area, LIP) during the task. (G) Neural dynamics of a recurrent neural network underlying the performance of the DMC task. The final decision, match (AA or BB) or non-match (AB or BA), corresponds to distinct attractor states located at separate positions in the state space. Similar trajectories of population activity have been found in experimental data. Figure adapted from Chaisangmongkon et al. [2017].

While re-running the biological processes of development and evolution may be difficult, re-training networks with different objectives, architectures, and algorithms is fairly straightforward thanks to recent advances in ML. Whenever training of an ANN leads to a conclusion, it is good practice to vary the hyperparameters describing the basic ingredients (to a reasonable degree) to explore the necessary and sufficient conditions for the conclusion [Orhan and Ma, 2019, Yang et al., 2019, Lindsey et al., 2019].

The link from the three ingredients to the network solution is typically not rigorous. However, in certain simplified cases, the link can be firmly established by solving the training process analytically [Saxe et al., 2013, 2019b].
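The fixed-point search of Eq. 34 and the linearization of Eq. 33, discussed earlier in this section, can be sketched as follows. This is a toy illustration under stated assumptions: the network is a small random one with $\tau = 1$ and recurrent weights rescaled to a spectral norm of 0.5, which guarantees a single stable fixed point, and plain gradient descent with a hand-written gradient stands in for the optimizer and automatic differentiation one would normally use. The helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
A = rng.normal(size=(n, n))
W = 0.5 * A / np.linalg.norm(A, 2)   # recurrent weights rescaled to spectral norm 0.5
b = rng.normal(0.0, 0.5, size=n)


def F(r):
    """Right-hand side of dr/dt = F(r) = -r + tanh(W r + b), with tau = 1."""
    return -r + np.tanh(W @ r + b)


def jacobian(r):
    """Jacobian J_ij = dF_i/dr_j evaluated at r."""
    gain = 1.0 - np.tanh(W @ r + b) ** 2
    return -np.eye(n) + gain[:, None] * W


def find_fixed_point(r0, lr=0.3, n_steps=2000):
    """Minimize q(r) = 0.5 * ||F(r)||^2 by gradient descent; grad q = J^T F."""
    r = r0.copy()
    for _ in range(n_steps):
        r = r - lr * jacobian(r).T @ F(r)
    return r


r_ss = find_fixed_point(np.zeros(n))
eigvals = np.linalg.eigvals(jacobian(r_ss))
print(np.linalg.norm(F(r_ss)), eigvals.real.max())
```

All eigenvalues of the Jacobian have negative real part here, so the fixed point is stable. In a trained network, the search is typically repeated from many initial states sampled along task trajectories, and an eigenvalue close to 0 would indicate a slow direction such as the line attractor in Figure 6C.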
5 Biologically realistic network architectures and learning

Although neuroscientists and cognitive scientists have had much success with standard neural network architectures (vanilla RNNs) and training algorithms (e.g., SGD) used in machine learning, for many neuroscience questions it is critical to build network architectures and utilize learning algorithms that are biologically plausible. In this section, we outline methods to build networks with more biologically realistic structures, canonical computations, and plasticity rules.

5.1 Structured connections

Modern neurophysiological experiments routinely record from multiple brain areas and/or multiple cell types during the same animal behavior. Computational efforts modeling these findings can be greatly facilitated by incorporating into neural networks fundamental biological structures, such as currently known cell-type-specific connectivity and long-range connections across model areas/layers.

In common recurrent networks, the default connectivity is all-to-all. In contrast, both local and long-range connectivity in biological neural systems are usually sparse. One way to obtain a sparse connectivity matrix $W$ is by element-wise multiplying a trainable matrix $\widetilde{W}$ with a non-trainable sparse mask $M$, namely $W = \widetilde{W} \odot M$. To encourage sparsity without strictly imposing it, an L1 regularization term $\beta \sum_{ij} |W_{ij}|$ can be added to the loss function. The scalar coefficient $\beta$ controls the strength of the sparsity constraint.

To model cell-type-specific findings, it is important to build neural networks with multiple cell types. A vanilla recurrent network (Eq. 15) (or any other network) can be easily modified to obey Dale’s law by separating excitatory and inhibitory neurons [Song et al., 2016],

$$\tau \frac{d\mathbf{r}^E}{dt} = -\mathbf{r}^E + f_E(W_{EE}\mathbf{r}^E - W_{EI}\mathbf{r}^I + W_{Ex}\mathbf{x} + \mathbf{b}^E), \tag{35}$$

$$\tau \frac{d\mathbf{r}^I}{dt} = -\mathbf{r}^I + f_I(W_{IE}\mathbf{r}^E - W_{II}\mathbf{r}^I + W_{Ix}\mathbf{x} + \mathbf{b}^I), \tag{36}$$

where an absolute-value function $|\cdot|$ constrains the signs of the connection weights, e.g., $W_{EE} = |\widetilde{W}_{EE}|$.
After training an ANN to perform the classical “random dot” task of motion direction discrimination [Roitman and Shadlen, 2002], one can “open the black box” [Sussillo and Barak, 2013] and examine the resulting “wiring diagram” of the recurrent network connectivity pattern (Figure 7). With the incorporation of Dale’s law, the connectivity emerging from training is a heterogeneous version of a biologically-based structured network model of decision-making [Wang, 2002], demonstrating that machine learning brought closer to the brain’s hardware can indeed be used to yield insights into biological neural networks.

Figure 7: Training a network with Dale’s law. Connectivity matrix for a recurrent network trained on a perceptual decision making task. The network respects Dale’s law with separate groups of excitatory (blue) and inhibitory (red) neurons. Only connections between neurons with high stimulus selectivity are shown. Neurons are sorted based on their stimulus selectivity to choice 1 and 2. Recurrent excitatory connections between neurons selective to the same choice are indicated by two black squares. Figure inspired by Song et al. [2016].

The extensive long-range connectivity across brain areas [Felleman and Van Essen, 1991, Markov et al., 2014, Oh et al., 2014] can be included in ANNs. In classical convolutional neural networks [LeCun et al., 1990, Krizhevsky et al., 2012], each layer only receives feedforward inputs from the immediately preceding layer. However, in some recent networks, each layer also receives feedforward inputs from much earlier layers [Huang et al., 2017, He et al., 2016]. In convolutional recurrent networks, neurons in each layer further receive feedback inputs from later layers and local recurrent connections [Nayebi et al., 2018, Kietzmann et al., 2019].
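The masking and sign constraints described in this subsection can be combined in a few lines. The sketch below is one possible parameterization, not the exact implementation of the cited work: a single matrix of free parameters is passed through an absolute value, multiplied by a fixed sparse mask, and given a sign per presynaptic unit, which is equivalent to writing the excitatory and inhibitory blocks separately with explicit minus signs. The function names, network size, mask density, and the removal of self-connections are illustrative assumptions.

```python
import numpy as np


def effective_weights(W_tilde, mask, sign):
    """Sparse, sign-constrained weights W = (|W_tilde| * mask) * sign.

    W[i, j] is the connection from presynaptic unit j to postsynaptic unit i,
    so `sign` (+1 excitatory, -1 inhibitory) multiplies whole columns.
    """
    return np.abs(W_tilde) * mask * sign[np.newaxis, :]


def l1_penalty(W, beta):
    """Soft sparsity term beta * sum_ij |W_ij| to be added to the task loss."""
    return beta * np.abs(W).sum()


rng = np.random.default_rng(3)
n_exc, n_inh = 8, 2
n = n_exc + n_inh

W_tilde = rng.normal(size=(n, n))                    # unconstrained parameters (trained)
mask = (rng.random((n, n)) < 0.3).astype(float)      # fixed mask, about 30% of entries kept
np.fill_diagonal(mask, 0.0)                          # no self-connections (a choice made here)
sign = np.concatenate([np.ones(n_exc), -np.ones(n_inh)])

W = effective_weights(W_tilde, mask, sign)
print(np.round(W, 2))
print(l1_penalty(W, beta=1e-3))
```

Because the constraint is applied inside the forward computation, gradient-based training can update `W_tilde` freely while every outgoing weight of a unit keeps a single sign.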
5.2 Canonical computation

Neuroscientists have identified several canonical computations that are carried out across a wide range of brain areas, including attention, normalization, and gating. Here we discuss how such canonical computations can be introduced into neural networks. They function as modular architectural components that can be plugged into many networks. Interestingly, the canonical computations mentioned above all have their parallels in ML-based neural networks. We will highlight the differences and similarities between purely ML implementations and more biological ones.

Normalization

Divisive normalization is widely observed in biological neural systems [Carandini and Heeger, 2012]. In divisive normalization, the activation of a neuron $r_i$ is no longer determined only by its immediate input $I_i$, $r_i = f(I_i)$. Instead, it is normalized by the sum of inputs $\sum_j I_j$ to a broader pool of neurons called the normalization pool,

$$r_i = f\left(\gamma \frac{I_i}{\sum_j I_j + \sigma}\right). \tag{37}$$

The specific choice of a normalization pool depends on the system studied. Biologically, although synaptic inputs are additive in the drive to neurons, feedback inhibition can effectively produce normalization [Ardid et al., 2007]. This form of divisive normalization is differentiable, so it can be directly incorporated into ANNs.

Normalization is also a critical part of many neural networks in machine learning. Similar to divisive normalization, ML-based normalization methods [Ioffe and Szegedy, 2015, Ba et al., 2016b, Ulyanov et al., 2016, Wu and He, 2018] aim at putting neuronal responses into a range appropriate for downstream areas to process. Unlike in divisive normalization, the mean input to a pool of neurons is usually subtracted from the immediate input rather than dividing it (Eq. 22). These methods also compute the standard deviation of inputs to the normalization pool, a step that may not be biologically plausible.
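The contrast between the two kinds of normalization can be made concrete with a short sketch. The divisive form follows Eq. 37 with the whole input vector as the normalization pool; the constants `gamma` and `sigma`, the rectifying nonlinearity, and the layer-normalization-style second function are assumptions chosen for illustration rather than the specific forms used in the cited work.

```python
import numpy as np


def divisive_normalization(I, gamma=1.0, sigma=1.0, f=lambda z: np.maximum(z, 0.0)):
    """r_i = f(gamma * I_i / (sum_j I_j + sigma)); the pool is the whole vector I."""
    return f(gamma * I / (I.sum() + sigma))


def subtractive_normalization(I, eps=1e-5):
    """ML-style normalization: subtract the pool mean, divide by the pool std."""
    return (I - I.mean()) / np.sqrt(I.var() + eps)


I = np.array([1.0, 2.0, 4.0])
for scale in (1.0, 10.0, 100.0):
    print(scale, divisive_normalization(scale * I), subtractive_normalization(scale * I))
```

As the overall input strength grows, the divisively normalized responses saturate (their sum approaches `gamma`) while the ratios between units are preserved; the subtractive version instead returns responses with zero mean and roughly unit variance at every input scale.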
Different ML-based normalization methods are distinguished based on their choice of a normalization pool.

Attention

Attention has been extensively studied in neuroscience [Desimone and Duncan, 1995, Carrasco, 2011]. Computational models are able to capture various aspects of bottom-up [Koch and Ullman, 1987] and top-down attention [Reynolds and Heeger, 2009]. In computational models, top-down attention usually takes the form of a multiplicative gain field applied to the activity of a specific group of neurons. In the case of spatial attention, consider a group of neurons, each with a preferred spatial location $x_i$ and pre-attention activity $\tilde{r}(x_i)$ for a certain stimulus. The attended spatial location $x_q$ results in attentional weights $\alpha_i(x_q)$, which are higher if $x_q$ is similar to $x_i$. The attentional weights can then be used to modulate the neural response of neuron $i$, $r_i(x_q) = \alpha_i(x_q)\,\tilde{r}(x_i)$. Similarly, feature attention strengthens the activity of neurons that are selective to the attended features (e.g., a specific color). Such top-down spatial and feature attention can be included in convolutional neural networks [Lindsay and Miller, 2018, Yang et al., 2018].

Meanwhile, attention has become widely used in machine learning [Bahdanau et al., 2015, Xu et al., 2015, Lindsay, 2020], constituting a standard component in recent natural language processing models [Vaswani et al., 2017]. Although the ML attention mechanisms appear rather different from attention models in neuroscience, as we will show below, the two mechanisms are very closely related.

In deep learning, attention can be viewed as a differentiable dictionary retrieval process. A regular dictionary stores a number of key-value pairs (e.g., word-explanation pairs) $\{(\mathbf{k}^{(i)}, \mathbf{v}^{(i)})\}$, and using it is similar to looking up the explanation ($\mathbf{v}^{(i)}$) of a word ($\mathbf{k}^{(i)}$). For a given query $\mathbf{q}$, using a dictionary involves searching for the key $\mathbf{k}^{(j)}$ that matches $\mathbf{q}$, $\mathbf{k}^{(j)} = \mathbf{q}$, and retrieving the corresponding value $\mathbf{y} = \mathbf{v}^{(j)}$.
This process can be thought of as modulating each value $\mathbf{v}^{(i)}$ based on an attentional weight $\alpha_i$ that measures the similarity between the key $\mathbf{k}^{(i)}$ and the query $\mathbf{q}$. In the simple binary case,

$$\alpha_i = \begin{cases} 1, & \text{if } \mathbf{k}^{(i)} = \mathbf{q} \\ 0, & \text{otherwise} \end{cases} \tag{38}$$

which modulates the output as

$$\mathbf{y} = \sum_i \alpha_i \mathbf{v}^{(i)}. \tag{39}$$

In the above case of spatial attention, the $i$-th key-value pair is $(x_i, \tilde{r}(x_i))$, while the query is the attended spatial location $x_q$. Each neuron’s response is modulated based on how similar its preferred spatial location (its key) $x_i$ is to the attended location (the query) $x_q$.

The use of ML attention makes the query-key comparison and the value-retrieval process differentiable. A query is compared with every key vector $\mathbf{k}^{(i)}$ to obtain an attentional weight (normalized similarity score) $\alpha_i$,

$$c_i = \mathrm{score}(\mathbf{q}, \mathbf{k}^{(i)}), \tag{40}$$

$$\alpha_1, \cdots, \alpha_N = \mathrm{normalize}(c_1, \cdots, c_N). \tag{41}$$

Here the similarity scoring function can be a simple inner product, $\mathrm{score}(\mathbf{q}, \mathbf{k}^{(i)}) = \mathbf{q}^\top \mathbf{k}^{(i)}$ [Bahdanau et al., 2015], and the normalization function can be the softmax function,

$$\alpha_i = \frac{e^{c_i}}{\sum_j e^{c_j}}, \quad \text{such that } \sum_i \alpha_i = 1. \tag{42}$$

The use of a normalization function is critical, as it effectively forces the network to focus on a few key vectors (a few attended locations in the case of spatial attention).

Gating

An important computation for biological neural systems is gating [Abbott, 2006, Wang and Yang, 2018]. Gating refers to the idea of controlling information flow without necessarily distorting its content. Gating in biological systems can be implemented with various mechanisms. Attention modulation multiplies inputs to neurons by a gain factor, providing a graded mechanism of gating at the level of sensory systems [Salinas and Thier, 2000, Olsen et al., 2012]. Another form of gating may involve several types of inhibitory neurons [Wang et al., 2004, Yang et al., 2016]. At the behavioral level, gating often appears to be all or none, as exemplified by effects such as inattentional blindness.
In deep learning, multiplicative gating is essential for popular recurrent network architectures such as LSTM (Long Short-Term Memory) networks (Eq. 43) [Hochreiter and Schmidhuber, 1997, Gers et al., 2000] and GRU (Gated Recurrent Unit) networks [Cho et al., 2014, Chung et al., 2014]. Gated networks are generally easier to train and more powerful than vanilla RNNs. Gating variables dynamically control information flow within these networks through multiplicative interactions. In an LSTM network, there are three types of gating variables. Input and output gates, $\mathbf{g}^i_t$ and $\mathbf{g}^o_t$, control the inputs to and outputs of the cell state $\mathbf{c}_t$, while the forget gate $\mathbf{g}^f_t$ controls whether the cell state $\mathbf{c}_t$ keeps its memory $\mathbf{c}_{t-1}$.

$$\begin{aligned}
\mathbf{g}^f_t &= \sigma_g(W_f \mathbf{x}_t + U_f \mathbf{r}_{t-1} + \mathbf{b}_f), \\
\mathbf{g}^i_t &= \sigma_g(W_i \mathbf{x}_t + U_i \mathbf{r}_{t-1} + \mathbf{b}_i), \\
\mathbf{g}^o_t &= \sigma_g(W_o \mathbf{x}_t + U_o \mathbf{r}_{t-1} + \mathbf{b}_o), \\
\mathbf{c}_t &= \mathbf{g}^f_t \odot \mathbf{c}_{t-1} + \mathbf{g}^i_t \odot \sigma_c(W_c \mathbf{x}_t + U_c \mathbf{r}_{t-1} + \mathbf{b}_c), \\
\mathbf{r}_t &= \mathbf{g}^o_t \odot \sigma_r(\mathbf{c}_t).
\end{aligned} \tag{43}$$

Here the symbol $\odot$ denotes the element-wise (Hadamard) product of two vectors of the same length ($\mathbf{z} = \mathbf{x} \odot \mathbf{y}$ means $z_i = x_i y_i$). Gating variables are bounded between 0 and 1 by the sigmoid function $\sigma_g$, which can be viewed as a smooth differentiable approximation of a binary step function. A gate is opened or closed when its corresponding gate value is near 1 or 0, respectively. All the weights ($W$ and $U$ matrices) are trained. By introducing these gates, an LSTM can in principle keep a memory in its cell state $\mathbf{c}_t$ indefinitely by having the forget gate $\mathbf{g}^f_t = 1$ and the input gate $\mathbf{g}^i_t = 0$ (Figure 8). In addition, the network can choose when to read out from the memory by setting its output gate $\mathbf{g}^o_t = 0$ or 1.

Despite their great utility to machine learning, LSTMs (and GRUs) cannot be easily related to biological neural circuits. Modifications to LSTMs have been suggested so the gating process could be better explained by neurobiology [Costa et al., 2017].
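A single update of Eq. 43 can be written out directly. The sketch below is a hand-rolled NumPy version, not a framework implementation: it assumes $\sigma_g$ is the logistic sigmoid and $\sigma_c$, $\sigma_r$ are tanh, stores parameters in a plain dictionary, and sets the gate biases by hand (rather than by training) to illustrate how an open forget gate and a closed input gate hold a memory in the cell state.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, r_prev, c_prev, p):
    """One step of Eq. 43 with sigma_g = sigmoid and sigma_c = sigma_r = tanh."""
    g_f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ r_prev + p["b_f"])   # forget gate
    g_i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ r_prev + p["b_i"])   # input gate
    g_o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ r_prev + p["b_o"])   # output gate
    c_t = g_f * c_prev + g_i * np.tanh(p["W_c"] @ x_t + p["U_c"] @ r_prev + p["b_c"])
    r_t = g_o * np.tanh(c_t)
    return r_t, c_t


rng = np.random.default_rng(4)
n_in, n_units = 3, 1
p = {}
for gate in "fioc":
    p["W_" + gate] = rng.normal(0.0, 0.1, size=(n_units, n_in))
    p["U_" + gate] = rng.normal(0.0, 0.1, size=(n_units, n_units))
    p["b_" + gate] = np.zeros(n_units)
p["b_f"][:] = 20.0    # forget gate held open (g_f close to 1)
p["b_i"][:] = -20.0   # input gate held closed (g_i close to 0)

r, c = np.zeros(n_units), np.array([0.7])
for _ in range(100):
    r, c = lstm_step(rng.normal(size=n_in), r, c, p)
print(c)
```

Despite 100 steps of random input, the cell state stays essentially at its initial value of 0.7. In a trained network the gate values depend on the inputs, so the network itself decides when to write, hold, and read out.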
Although both attention and gating utilize multiplicative interactions, a critical difference is that in attention, the neural modulation is normalized (Eq. 41), whereas in gating it is not. Therefore, neural attention often has one focus, while neural gating can open or close gates to all neurons uniformly. An important insight from ML is that gating should be plastic, which should inspire neuroscientists to investigate learning to gate in the brain.

Predictive coding

Another canonical computation proposed for the brain is to compute predictions [Rao and Ballard, 1999, Bastos et al., 2012, Heilbron and Chait, 2018]. In predictive coding, a neural system constantly tries to make inferences about the external world. Brain areas will selectively propagate information that is unpredicted or surprising, while suppressing responses to expected stimuli. To implement predictive coding in ANNs, feedback connections from higher layers can be trained with a separate loss that compares the output of feedback connections with the neural activity in lower layers [Lotter et al., 2016, Sacramento et al., 2018]. In this way, feedback connections will learn to predict the activity of lower areas. The feedback inputs will then be used to inhibit neural activity in lower layers.

5.3 Learning and plasticity

Biological neural systems are products of evolution, development, and learning. In contrast, traditional ANNs are trained with SGD-based rules mostly from scratch.

Figure 8: Visualizing LSTM activity in a simple memory task. (A-C) A simple memory task. (A) The network receives a stream of input stimuli, the value of which is randomly and independently sampled at each time point. (B) When the “memorize input” (red) is active, the network needs to remember the current value of the stimulus (A), and output that value when the “report input” (blue) is next active. (C) After training, a single-unit LSTM can perform the task almost perfectly for modest memory durations.
(D) When the memorize input is active, this network opens the input gate (allowing inputs) and closes the forget gate (forgetting previous memory). It opens the output gate when the report input is active. The back-propagation algorithm of computing gradient descent is well known to be biologically implausible [Zipser and Andersen, 1988]. Incorporating more realistic learning processes can help us build better models of brains. Selective training and continual learning In typical ANNs, all connections are trained. However, in biological neural systems, synapses are not equally modifiable. Many synapses can be stable for years [Grutzendler et al., 2002, Yang et al., 2009]. To implement selective training of connections, the effective connection matrix W can be expressed as a sum of a sparse trainable synaptic weight matrix and a non-trainable one, W = W + W [Rajan et al., 2016, Masse et al., 2018]. train x Or more generally, selective training can be imposed softly by adding to the loss a regularization term L that makes it more difficult to change the weights of certain reg connections, L = M (W W ) : (44) reg ij ij x;ij ij Here, M determine how strongly the connection W should stick close to the value ij ij W . x;ij Selective training of connections through this form of soft constraints has been used by continual learning techniques to combat catastrophic forgetting. The phenomenon of catastrophic forgetting is commonly observed when ANNs are learning new tasks, 27 they tend to rapidly forget previous learned tasks that are not revisited [McCloskey and Cohen, 1989]. One major class of continual learning methods deals with this issue by selectively training synaptic connections that are deemed unimportant for previously learned tasks or knowledge, while protecting the important ones [Kirkpatrick et al., 2017, Zenke et al., 2017]. 
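The soft constraint of Eq. 44 can be sketched in NumPy as a penalty and its gradient added to a task gradient. This is a minimal sketch under stated assumptions: the importance matrix `M` is set by hand here (continual learning methods estimate it from previous tasks), and `task_grad` is a hypothetical stand-in for the gradient of a new task's loss.

```python
import numpy as np

def reg_loss(W, W_fix, M):
    """L_reg = sum_ij M_ij * (W_ij - W_fix_ij)^2, as in Eq. 44."""
    return np.sum(M * (W - W_fix) ** 2)

def reg_grad(W, W_fix, M):
    """Gradient of L_reg with respect to W."""
    return 2.0 * M * (W - W_fix)

rng = np.random.default_rng(0)
W_fix = rng.standard_normal((3, 3))   # values the weights should stay close to
W = W_fix.copy()

# Hypothetical importance values: the first row is strongly protected,
# the remaining rows are free to change.
M = np.zeros((3, 3))
M[0, :] = 100.0

# Hypothetical task gradient that pushes every weight in the same direction.
task_grad = np.ones((3, 3))

lr = 0.004
for _ in range(200):
    W -= lr * (task_grad + reg_grad(W, W_fix, M))

change = np.abs(W - W_fix)
print("change in protected row:", change[0])
print("change in free rows:    ", change[1])
```

In this toy setting the protected weights settle where the task gradient balances the penalty (a shift of $1/(2M_{ij})$), while the unprotected weights move freely, which is the qualitative behavior the text describes.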
Hebbian plasticity. The predominant idea for biological learning is Hebbian plasticity [Hebb, 2005] and its variants [Song et al., 2000, Bi and Poo, 2001]. Hebbian plasticity is an unsupervised learning method that drives learning of connection weights without target outputs or rewards. It is essential for classical models of associative memory such as Hopfield networks [Hopfield, 1982], and has a deep link to modern neural network architectures with explicit long-term memory modules [Graves et al., 2014].

Supervised learning techniques, especially those based on SGD, can be combined with Hebbian plasticity to develop ANNs that are both more powerful for certain tasks and more biologically realistic. There are two methods to combine Hebbian plasticity with SGD. In the first kind, the effective connection matrix $W = \tilde{W} + A$ is the sum of two connection matrices, $\tilde{W}$ trained by SGD and $A$ driven by Hebbian plasticity [Ba et al., 2016a, Miconi et al., 2018],

$$
A(t+1) = A(t) + \eta\, r r^{\top}. \tag{45}
$$

Or in component form,

$$
A_{ij}(t+1) = A_{ij}(t) + \eta\, r_i r_j. \tag{46}
$$

In addition to training a separate matrix, SGD can be used to learn the plasticity rule itself [Bengio et al., 1992, Metz et al., 2018]. Here, the plasticity rule is a trainable function of pre- and post-synaptic activity,

$$
A_{ij}(t+1) = A_{ij}(t) + f(r_i, r_j, \theta). \tag{47}
$$

Since the system is differentiable, the parameters $\theta$, which collectively describe the plasticity rule, can be updated with SGD-based methods. In its simplest form, $f(r_i, r_j, \theta) = \eta r_i r_j$, where $\theta = \{\eta\}$. Here, the system can learn to become Hebbian ($\eta > 0$) or anti-Hebbian ($\eta < 0$). Learning a plasticity rule is a form of meta-learning, in which an algorithm (here, SGD) optimizes an inner learning rule (here, Hebbian plasticity).

Such Hebbian plasticity networks can be extended to include more complex synapses with multiple hidden variables in a “cascade model” of synaptic plasticity [Fusi et al., 2005].
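The Hebbian update of Eqs. 45 and 46 can be sketched in NumPy as follows. This is an illustrative sketch rather than the method of any cited paper: the recurrent update `r = tanh((W_slow + A) r + x)`, the function names, and the numerical values are assumptions made here. The plasticity rate `eta` is fixed by hand; in an automatic-differentiation framework it could instead be a trainable parameter, as described for Eq. 47.

```python
import numpy as np

def hebbian_update(A, r, eta):
    """A(t+1) = A(t) + eta * r r^T; component-wise, A_ij += eta * r_i * r_j."""
    return A + eta * np.outer(r, r)

def run_plastic_rnn(W_slow, inputs, eta):
    """Run a small rate network whose effective weights are W_slow + A.

    W_slow stands in for the slowly trained matrix (held fixed here);
    A is the fast Hebbian component and starts at zero.
    """
    n = W_slow.shape[0]
    A = np.zeros((n, n))
    r = np.zeros(n)
    for x in inputs:
        r = np.tanh((W_slow + A) @ r + x)   # assumed vanilla-RNN update
        A = hebbian_update(A, r, eta)
    return A

rng = np.random.default_rng(0)
W_slow = 0.1 * rng.standard_normal((5, 5))
inputs = rng.standard_normal((20, 5))

A_hebb = run_plastic_rnn(W_slow, inputs, eta=0.05)    # Hebbian (eta > 0)
A_anti = run_plastic_rnn(W_slow, inputs, eta=-0.05)   # anti-Hebbian (eta < 0)

print("diagonal of A, Hebbian:     ", np.round(np.diag(A_hebb), 3))
print("diagonal of A, anti-Hebbian:", np.round(np.diag(A_anti), 3))
```

Because each update adds a multiple of the outer product $r r^{\top}$, the fast matrix stays symmetric, and the sign of `eta` sets the sign of its diagonal (self-coupling) entries.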
In theory, properly designed complex synapses can substantially boost a neural network’s memory capacity [Benna and Fusi, 2016]. Models of such complex synapses are differentiable, and therefore can be incorporated into ANNs [Kaplanis et al., 2018].

Short-term plasticity. In addition to Hebbian plasticity, which acts on time scales from hours to years, biological synapses are subject to short-term plasticity mechanisms operating on time scales of hundreds of milliseconds to seconds [Zucker and Regehr, 2002] that can rapidly modify their effective weights. Classical short-term plasticity rules [Mongillo et al., 2008, Markram et al., 1998] are formulated with spiking neurons, but they can be adapted to rate forms. In these rules, each connection weight $w = \tilde{w} u x$ is a product of an original weight $\tilde{w}$, a facilitating factor $u$, and a depressing factor $x$. The facilitating and depressing factors are both influenced by the pre-synaptic activity $r(t)$,

$$
\frac{dx}{dt} = \frac{1 - x(t)}{\tau_x} - u(t)\,x(t)\,r(t), \tag{48}
$$

$$
\frac{du}{dt} = \frac{U - u(t)}{\tau_u} + U\,(1 - u(t))\,r(t). \tag{49}
$$

High pre-synaptic activity $r(t)$ increases the facilitating factor $u(t)$ and decreases the depressing factor $x(t)$. Again, the equations governing short-term plasticity are fully differentiable, so they can be incorporated into ANNs in the same way as Hebbian plasticity rules [Masse et al., 2019].

Masse et al. [2019] offers an illustration of how ANNs can be used to test new hypotheses in neuroscience. The study was designed to investigate the neural mechanisms of working memory, the brain’s ability to maintain and manipulate information internally in the absence of external stimulation. Working memory has been extensively studied in animal experiments using delayed response tasks, in which a stimulus and its corresponding motor response are separated by a temporal gap during which the stimulus must be retained internally.
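Returning to the short-term plasticity dynamics of Eqs. 48 and 49, a minimal forward-Euler sketch in NumPy shows how a brief burst of pre-synaptic activity leaves a trace in the synaptic variables. The parameter values (`U`, `tau_x`, `tau_u`), the time step, and the stimulus are illustrative assumptions chosen for a facilitating synapse; they are not taken from the text.

```python
import numpy as np

def stp_derivatives(x, u, r, U, tau_x, tau_u):
    """Right-hand sides of Eqs. 48 and 49 in rate form."""
    dx = (1.0 - x) / tau_x - u * x * r          # depression: recover, then deplete
    du = (U - u) / tau_u + U * (1.0 - u) * r    # facilitation: decay, then boost
    return dx, du

# Illustrative parameters (assumed): time in seconds, rate in Hz.
U, tau_x, tau_u = 0.2, 0.2, 1.5
dt = 1e-3
t = np.arange(0.0, 3.0, dt)
r = np.where((t >= 0.5) & (t < 1.0), 20.0, 0.0)   # 0.5 s burst of activity

x = np.empty_like(t)
u = np.empty_like(t)
x[0], u[0] = 1.0, U        # resting values when r = 0
for k in range(len(t) - 1):
    dx, du = stp_derivatives(x[k], u[k], r[k], U, tau_x, tau_u)
    x[k + 1] = x[k] + dt * dx
    u[k + 1] = u[k] + dt * du

efficacy = u * x           # multiplies the original weight in w = w_tilde * u * x
k_off = int(np.searchsorted(t, 1.0))    # end of the burst
k_late = int(np.searchsorted(t, 1.5))   # 0.5 s into the silent delay
print("u, x at burst offset:", u[k_off], x[k_off])
print("efficacy at rest vs. 0.5 s after burst:", efficacy[0], efficacy[k_late])
```

With these assumed time constants, `x` recovers quickly after the burst while `u` decays slowly, so the product `u * x` remains elevated during the silent period. This is the kind of hidden synaptic variable that can carry information without ongoing activity.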
Stimulus-selective self-sustained persistent activity during a mnemonic delay is amply documented and considered to be the neural substrate of working memory representation [Goldman-Rakic, 1995, Wang, 2001]. However, recent studies have suggested that certain short-term memory traces may be realized by hidden variables instead of spiking activity, such as synaptic efficacy that, by virtue of short-term plasticity, represents past events [Stokes, 2015, Mongillo et al., 2008]. When an ANN endowed with short-term synaptic plasticity is trained to perform a delayed response task, it makes no a priori assumption about whether working memory is represented by hidden synaptic efficacy or neural activity. It was found that an activity-silent state can accomplish such a task only when the delay is sufficiently short, whereas persistent activity naturally emerges from training with delay periods longer than the biophysical time constants of short-term synaptic plasticity. More importantly, training always gives rise to persistent activity, even with a short mnemonic delay period, when information must be manipulated internally, such as mentally rotating a directional stimulus by 90 degrees. This work illustrates how ANNs can contribute to resolving important debates in neuroscience.

Biologically-realistic gradient descent. Backpropagation is commonly viewed as biologically unrealistic because the plasticity rule is not local (see Eq. 13). Efforts have been devoted to approximating gradient descent with algorithms more compatible with the brain’s hardware [Lillicrap et al., 2016, Guerguiev et al., 2017, Roelfsema and Holtmaat, 2018, Lillicrap et al., 2020]. In feedforward networks, the backpropagation algorithm can be implemented with synaptic connections feeding back from the final layer [Xie and Seung, 2003]. This implementation assumes that the feedback connections precisely mirror the feedforward connections. This requirement can be relaxed.
If a network uses fixed and random feedback connections, the feedforward connections start to approximately mirror the feedback connections during training (a phenomenon called “feedback alignment”), allowing the training loss to decrease [Lillicrap et al., 2016]. Another challenge of approximating backpropagation with feedback connections is that the feedback inputs carrying loss information need to be processed differently from feedforward inputs carrying stimulus information. This issue can be addressed by introducing multi-compartmental neurons into ANNs [Guerguiev et al., 2017]. In such networks, feedforward and feedback inputs are processed separately because they are received by the model neurons’ soma and dendrites, respectively.

These methods of implementing the backpropagation algorithm through synapses propagating information backwards have so far been used only for feedforward networks. For recurrent networks, the backpropagation algorithm propagates information backwards in time. Therefore, it is not clear how to interpret backpropagation in terms of synaptic connections. Instead, approximations can be made such that the network computes approximate gradient information as it runs forward in time [Williams and Zipser, 1989, Murray, 2019].

For many neuroscientific applications, it is probably not necessary to justify backpropagation by neurobiology. ANNs often start as a “blank slate”; training by backpropagation is thus tasked with accomplishing what for the brain amounts to a combination of genetic programming, development, and plasticity in adulthood.

6 Future directions and conclusion

Recent years have seen a growing impact of ANN models in neuroscience. We have reviewed many of these efforts in the section Biologically realistic network architectures and learning. In this final section, we outline other existing challenges and ongoing work to make ANNs better models of brains.
Spiking neural networks. Most biological neurons communicate with spikes. Harnessing the power of machine learning algorithms for spiking networks remains a daunting challenge. Gradient-descent-based training techniques typically require the system to be differentiable, which makes spiking networks challenging to train because spike generation is non-differentiable. However, several recent methods have been proposed to train spiking networks with gradient-based techniques [Courbariaux et al., 2016, Bellec et al., 2018, Zenke and Ganguli, 2018, Nicola and Clopath, 2017, Huh and Sejnowski, 2018]. These methods generally involve approximating spike generation with a differentiable system during backpropagation [Tavanaei et al., 2019]. Techniques to effectively train spiking networks could prove increasingly important and practical as neuromorphic hardware that operates naturally with spikes becomes more powerful [Merolla et al., 2014, Pei et al., 2019].

Standardized protocols for developing brain-like recurrent networks. In the study of mammalian visual systems, the use of large datasets such as ImageNet [Deng et al., 2009] was crucial for producing neural networks that resemble biological neural circuits in the brain. The same has not been shown for most other systems. Although many studies have shown success using neural networks to model cognitive and motor systems, each work usually has its own set of network architectures, training protocols, and other hyperparameters. Simply applying the most common architectures and training algorithms does not consistently lead to brain-like recurrent networks [Sussillo et al., 2015]. Much work remains to be done to search for datasets/tasks, network architectures, and training regimes that can produce brain-resembling artificial networks across a wide range of experimental tasks.
Detailed behavioral and physiological predictions. Although many studies have reported similarities between brains and ANNs, more detailed comparisons have revealed striking differences [Szegedy et al., 2013, Hénaff et al., 2019, Sussillo et al., 2015]. Deep convolutional networks can achieve performance similar to or better than that of humans on large image classification tasks; however, the mistakes they make can be very different from those made by humans [Szegedy et al., 2013, Rajalingham et al., 2018]. It will be important for future ANN models of brains to aim at simultaneously explaining a wider range of physiological and behavioral phenomena.

Interpreting learned networks and learning processes. With the ease of training neural networks comes the difficulty of analyzing them. Granted, neuroscientists are no strangers to the analysis of complex networks, and ANNs are still technically easier to analyze than biological neural networks. However, compared to network models with built-in regularities and small numbers of free parameters, deep neural networks are notoriously complex to analyze and understand, and will likely become even more so as we build more and more sophisticated neural networks. This difficulty is rooted in the use of optimization algorithms to search for parameter values. Since the optimization process in deep learning has no unique optimum, the results of optimization necessarily lack the degree of regularity built into hand-designed models. Although we can attempt to understand ANNs from the perspective of their objectives, architectures, and training algorithms [Richards et al., 2019], which are described with a much smaller number of hyperparameters, the link from these hyperparameters to network representation, mechanism, and behavior is mostly informal and based on intuition.

Despite the difficulties mentioned above, several lines of research hold promise.
To facilitate understanding of learned networks, one can construct variants of neural networks that are more interpretable. For example, low-rank recurrent neural networks utilize recurrent connectivity matrices with low-dimensional structures [Mastrogiuseppe and Ostojic, 2018], allowing for a more straightforward mapping from network connectivity to dynamics and computation. The dynamics of learning in neural networks can be studied analytically in deep linear networks [Saxe et al., 2013] and in very wide nonlinear networks, i.e., networks with a sufficiently large number of neurons per layer [Jacot et al., 2018]. In another line of work, the Information Bottleneck theory proposes that learning processes in neural networks are characterized by two phases: the first extracts information for output tasks (prediction), and the second discards (excessive) information about inputs (compression) [Shwartz-Ziv and Tishby, 2017]; see also [Saxe et al., 2019a]. Progress in these directions could shed light on why neural networks can generalize to new data despite having many parameters, which would traditionally indicate over-fitting and poor generalization performance.

Conclusion. Artificial neural networks present a novel approach in computational neuroscience. They have already been used, with a certain degree of success, to model various aspects of sensory, cognitive, and motor circuits. Efforts are underway to make ANNs more biologically relevant and applicable to a wider range of neuroscientific questions. In a sense, instead of being viewed as computational models, ANNs can be studied as model systems, like fruit flies, mice, and monkeys, but ones that are easily used to explore new task paradigms and computational ideas. Of course, one can be skeptical about ANNs as model systems, on the grounds that they are not biological organisms.
However, computational models span a wide range of biological realism; there should be no doubt that brain research will benefit from enhanced interactions with machine learning and artificial intelligence. In order for ANNs to have a broad impact in neuroscience, it will be important to devote our efforts to two areas. First, we should continue to bring ANNs closer to neurobiology. Second, we should endeavour to “open the black box” thoroughly after learning, to identify the neural representations, temporal dynamics, and network connectivity that emerge from learning, leading to insights and predictions testable by neurobiological experiments. Recurrent neural dynamics, emphasized in this Primer, represent a salient feature of the brain; further development of strongly recurrent ANNs will help accelerate progress in neuroscience.

Acknowledgments: We thank Vishwa Goudar and Jacob Portes for helpful comments on a draft of this paper. This work was supported by the Simons Foundation, NSF NeuroNex Award DBI-1707398 and the Gatsby Charitable Foundation to GRY; the ONR grant N00014 and Simons Collaboration in the Global Brain (SCGB) (grant 543057SPI) to XJW.

References

M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
L. Abbott. Where are the switches on this thing. 23 problems in systems neuroscience, pages 423–31, 2006.
L. Abbott and F. S. Chance. Drivers and modulators from push-pull and balanced synaptic input. Progress in brain research, 149:147–155, 2005.
L. F. Abbott. Theoretical neuroscience rising. Neuron, 60:489–495, 2008.
A. S. Andalman, V. M. Burns, M. Lovett-Barron, M. Broxton, B. Poole, S. J. Yang, L. Grosenick, T. N. Lerner, R. Chen, T. Benster, et al. Neuronal dynamics regulating brain and behavioral state transitions.
Cell, 177(4):970–985, 2019.
S. Ardid, X.-J. Wang, and A. Compte. An integrated microcircuit model of attentional processing in the neocortex. J. Neurosci., 27:8486–8495, 2007.
J. Ba, G. E. Hinton, V. Mnih, J. Z. Leibo, and C. Ionescu. Using fast weights to attend to the recent past. In Advances in Neural Information Processing Systems, pages 4331–4339, 2016a.
J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016b.
D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations,
O. Barak. Recurrent neural networks as versatile tools of neuroscience research. Current opinion in neurobiology, 46:1–6, 2017.
O. Barak, D. Sussillo, R. Romo, M. Tsodyks, and L. Abbott. From fixed points to chaos: three models of delayed discrimination. Progress in neurobiology, 103:214–222, 2013.
H. B. Barlow et al. Possible principles underlying the transformation of sensory messages. Sensory communication, 1:217–234, 1961.
P. Bashivan, K. Kar, and J. J. DiCarlo. Neural population control via deep image synthesis. Science, 364(6439):eaav9436, 2019.
A. M. Bastos, W. M. Usrey, R. A. Adams, G. R. Mangun, P. Fries, and K. J. Friston. Canonical microcircuits for predictive coding. Neuron, 76:695–711, 2012.
G. Bellec, D. Salaj, A. Subramoney, R. Legenstein, and W. Maass. Long short-term memory and learning-to-learn in networks of spiking neurons. In Advances in Neural Information Processing Systems, pages 787–797, 2018.
S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, volume 2. Univ. of Texas, 1992.
Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166,
M. K. Benna and S. Fusi.
Computational principles of synaptic memory consolidation. Nature neuroscience, 19(12):1697, 2016.
G. Bi and M. Poo. Synaptic modification by correlated activity: Hebb’s postulate revisited. Annu Rev Neurosci, 24:139–166, 2001.
L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
M. Botvinick, J. X. Wang, W. Dabney, K. J. Miller, and Z. Kurth-Nelson. Deep reinforcement learning and its neuroscientific implications. Neuron, 107:603–616,
K. H. Britten, M. N. Shadlen, W. T. Newsome, and J. A. Movshon. The analysis of visual motion: a comparison of neuronal and psychophysical performance. Journal of Neuroscience, 12(12):4745–4765, 1992.
C. F. Cadieu, H. Hong, D. L. Yamins, N. Pinto, D. Ardila, E. A. Solomon, N. J. Majaj, and J. J. DiCarlo. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS computational biology, 10(12):e1003963, 2014.
M. Carandini and D. J. Heeger. Normalization as a canonical neural computation. Nature Reviews Neuroscience, 13(1):51, 2012.
M. Carrasco. Visual attention: The past 25 years. Vision research, 51(13):1484–1525,
W. Chaisangmongkon, S. K. Swaminathan, D. J. Freedman, and X.-J. Wang. Computing by robust transience: how the fronto-parietal network performs sequential, category-based decisions. Neuron, 93(6):1504–1517, 2017.
T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. In Advances in neural information processing systems, pages 6571–6583, 2018.
K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
D.-A. Clevert, T.
Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
J. D. Cohen, K. Dunbar, and J. L. McClelland. On the control of automatic processes: a parallel distributed processing account of the Stroop effect. Psychological review, 97(3):332, 1990.
R. Costa, I. A. Assael, B. Shillingford, N. de Freitas, and T. Vogels. Cortical microcircuits as gated-recurrent neural networks. In Advances in Neural Information Processing Systems, pages 272–283, 2017.
M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
C. J. Cueva and X.-X. Wei. Emergence of grid-like representations by training recurrent neural networks to perform spatial localization. arXiv preprint arXiv:1803.07770, 2018.
N. D. Daw, S. J. Gershman, B. Seymour, P. Dayan, and R. J. Dolan. Model-based influences on humans’ choices and striatal prediction errors. Neuron, 69(6):1204–1215, 2011.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
R. Desimone and J. Duncan. Neural mechanisms of selective visual attention. Annual review of neuroscience, 18(1):193–222, 1995.
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(Jul):2121–2159, 2011.
C. Eliasmith, T. C. Stewart, X. Choo, T. Bekolay, T. DeWolf, Y. Tang, and D. Rasmussen. A large-scale model of the functioning brain. Science, 338(6111):1202–1205, 2012.
J. L. Elman.
Finding structure in time. Cognitive science, 14(2):179–211, 1990.
D. Erhan, Y. Bengio, A. Courville, and P. Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
D. J. Felleman and D. C. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral cortex (New York, NY: 1991), 1(1):1–47, 1991.
D. J. Freedman and J. A. Assad. Experience-dependent representation of visual categories in parietal cortex. Nature, 443(7107):85, 2006.
J. Freeman and E. P. Simoncelli. Metamers of the ventral stream. Nature neuroscience, 14(9):1195, 2011.
K. Fukushima and S. Miyake. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern recognition, 15:455–469, 1982.
K. Fukushima, S. Miyake, and T. Ito. Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE transactions on systems, man, and cybernetics, (5):826–834, 1983.
S. Fusi, P. J. Drew, and L. F. Abbott. Cascade models of synaptically stored memories. Neuron, 45(4):599–611, 2005.
F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. Neural computation, 12(10):2451–2471, 2000.
X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 315–323, 2011.
J. I. Gold and M. N. Shadlen. The neural basis of decision making. Annual review of neuroscience, 30, 2007.
P. S. Goldman-Rakic. Cellular basis of working memory. Neuron, 14:477–485, 1995.
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.
V. Goudar and D. V. Buonomano.
Encoding sensory and motor patterns as time-invariant trajectories in recurrent neural networks. eLife, 7:e31134, 2018.
A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.
J. Grutzendler, N. Kasthuri, and W.-B. Gan. Long-term dendritic spine stability in the adult cortex. Nature, 420(6917):812–816, 2002.
J. Guerguiev, T. P. Lillicrap, and B. A. Richards. Towards deep learning with segregated dendrites. eLife, 6:e22901, 2017.
K. Haroush and Z. M. Williams. Neuronal prediction of opponent’s behavior during cooperative social interchange in primates. Cell, 160(6):1233–1245, 2015.
D. Hassabis, D. Kumaran, C. Summerfield, and M. Botvinick. Neuroscience-inspired artificial intelligence. Neuron, 95:245–258, 2017.
K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
D. O. Hebb. The organization of behavior: A neuropsychological theory. Psychology Press, 2005.
M. Heilbron and M. Chait. Great expectations: is there evidence for predictive coding in auditory cortex? Neuroscience, 389:54–73, 2018.
M. Helmstaedter, K. L. Briggman, S. C. Turaga, V. Jain, H. S. Seung, and W. Denk. Connectomic reconstruction of the inner plexiform layer in the mouse retina. Nature, 500(7461):168, 2013.
O. J. Hénaff, R. L. Goris, and E. P. Simoncelli. Perceptual straightening of natural videos. Nature neuroscience, 22(6):984–991, 2019.
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
J. J.
Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.
K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
D. H. Hubel and T. N. Wiesel. Receptive fields of single neurones in the cat’s striate cortex. The Journal of physiology, 148(3):574–591, 1959.
D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J. Physiol. (Lond.), 160:106–154, 1962.
D. Huh and T. J. Sejnowski. Gradient descent for spiking neural networks. In Advances in Neural Information Processing Systems, pages 1433–1443, 2018.
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
H. Jaeger and H. Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.
M. Januszewski, J. Kornfeld, P. H. Li, A. Pope, T. Blakely, L. Lindsey, J. Maitin-Shepard, M. Tyka, W. Denk, and V. Jain. High-precision automated reconstruction of neurons with flood-filling networks. Nature methods, 15(8):605, 2018.
J. P. Jones and L. A. Palmer. An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of neurophysiology, 58(6):1233–1258, 1987.
N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N.
Boden, A. Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 1–12, 2017.
C. Kaplanis, M. Shanahan, and C. Clopath. Continual reinforcement learning with complex synapses. arXiv preprint arXiv:1802.07239, 2018.
K. Kar, J. Kubilius, K. Schmidt, E. B. Issa, and J. J. DiCarlo. Evidence that recurrent circuits are critical to the ventral stream’s execution of core object recognition behavior. Nature neuroscience, page 1, 2019.
S.-M. Khaligh-Razavi and N. Kriegeskorte. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS computational biology, 10(11):e1003915, 2014.
R. Kiani and M. N. Shadlen. Representation of confidence associated with a decision by neurons in the parietal cortex. Science, 324(5928):759–764, 2009.
T. C. Kietzmann, C. J. Spoerer, L. K. Sörensen, R. M. Cichy, O. Hauk, and N. Kriegeskorte. Recurrence is required to capture the representational dynamics of the human visual system. Proceedings of the National Academy of Sciences, 116(43):21854–21863, 2019.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
D. Kobak, W. Brendel, C. Constantinidis, C. E. Feierstein, A. Kepecs, Z. F. Mainen, X.-L. Qi, R. Romo, N. Uchida, and C. K. Machens. Demixed principal component analysis of neural population data. eLife, 5:e10989, 2016.
C. Koch and S. Ullman. Shifts in selective visual attention: towards the underlying neural circuitry. In Matters of intelligence, pages 115–141. Springer, 1987.
S.
Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representations revisited. arXiv preprint arXiv:1905.00414, 2019.
N. Kriegeskorte. Deep neural networks: a new framework for modeling biological vision and brain information processing. Annual review of vision science, 1:417–446, 2015.
N. Kriegeskorte, M. Mur, and P. A. Bandettini. Representational similarity analysis-connecting the branches of systems neuroscience. Frontiers in systems neuroscience, 2:4, 2008.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In Advances in neural information processing systems, pages 950–957, 1992.
S. W. Kuffler. Discharge patterns and functional organization of mammalian retina. Journal of neurophysiology, 16(1):37–68, 1953.
R. Laje and D. V. Buonomano. Robust timing and motor patterns by taming chaos in recurrent neural networks. Nature neuroscience, 16(7):925–933, 2013.
Q. V. Le, N. Jaitly, and G. E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
Y. LeCun. A theoretical framework for back-propagation. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School, pages 21–28. Burlington, MA: Morgan Kaufmann, 1988.
Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. In M. A. Arbib, editor, The handbook of brain theory and neural networks, pages 255–258. Cambridge, MA: MIT Press, 1995.
Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pages 396–404, 1990.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner.
Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. nature, 521(7553):436, 2015. T. P. Lillicrap, D. Cownden, D. B. Tweed, and C. J. Akerman. Random synap- tic feedback weights support error backpropagation for deep learning. Nature communications, 7:13276, 2016. T. P. Lillicrap, A. Santoro, L. Marris, C. J. Akerman, and G. Hinton. Backpropagation and the brain. Nature Reviews Neuroscience, pages 1–12, 2020. G. W. Lindsay. Attention in psychology, neuroscience, and machine learning. Frontiers in Computational Neuroscience, 14:29, 2020. G. W. Lindsay and K. D. Miller. How biological attention mechanisms improve task performance in a large-scale visual system model. eLife, 7:e38105, 2018. 39 J. Lindsey, S. A. Ocko, S. Ganguli, and S. Deny. A unified theory of early visual representations from retina to cortex through anatomically constrained deep cnns. arXiv preprint arXiv:1901.00945, 2019. W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016. N. Maheswaranathan, A. H. Williams, M. D. Golub, S. Ganguli, and D. Sussillo. Universality and individuality in neural dynamics across large populations of recurrent networks. arXiv preprint arXiv:1907.08549, 2019. V. Mante, D. Sussillo, K. V. Shenoy, and W. T. Newsome. Context-dependent computation by recurrent dynamics in prefrontal cortex. nature, 503(7474):78, N. T. Markov, M. M. Ercsey-Ravasz, A. R. Ribeiro Gomes, C. Lamy, L. Magrou, J. Vezoli, P. Misery, A. Falchier, R. Quilodran, M. A. Gariel, J. Sallet, R. Gamanut, C. Huissoud, S. Clavagnier, P. Giroud, D. Sappey-Marinier, P. Barone, C. Dehay, Z. Toroczkai, K. Knoblauch, D. C. Van Essen, and H. Kennedy. A weighted and directed interareal connectivity matrix for macaque cerebral cortex. Cereb. Cortex, 24:17–36, 2014. H. Markram, Y. Wang, and M. Tsodyks. 
Differential signaling via the same axon of neocortical pyramidal neurons. Proceedings of the National Academy of Sciences, 95(9):5323–5328, 1998. N. Y. Masse, G. D. Grant, and D. J. Freedman. Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. Proceedings of the National Academy of Sciences, 115(44):E10467–E10475, 2018. N. Y. Masse, G. R. Yang, H. F. Song, X.-J. Wang, and D. J. Freedman. Circuit mechanisms for the maintenance and manipulation of information in working memory. Nature neuroscience, page 1, 2019. F. Mastrogiuseppe and S. Ostojic. Linking connectivity, dynamics, and computations in low-rank recurrent neural networks. Neuron, 99(3):609–623, 2018. A. Mathis, P. Mamidanna, K. M. Cury, T. Abe, V. N. Murthy, M. W. Mathis, and M. Bethge. Deeplabcut: markerless pose estimation of user-defined body parts with deep learning. Technical report, Nature Publishing Group, 2018. M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989. L. McIntosh, N. Maheswaranathan, A. Nayebi, S. Ganguli, and S. Baccus. Deep learning models of the retinal response to natural scenes. In Advances in neural information processing systems, pages 1369–1377, 2016. P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014. 40 L. Metz, N. Maheswaranathan, B. Cheung, and J. Sohl-Dickstein. Meta- learning update rules for unsupervised representation learning. arXiv preprint arXiv:1804.00222, 2018. T. Miconi, J. Clune, and K. O. Stanley. Differentiable plasticity: training plastic neural networks with backpropagation. arXiv preprint arXiv:1804.02464, 2018. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. 
Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015. G. Mongillo, O. Barak, and M. Tsodyks. Synaptic theory of working memory. Science, 319(5869):1543–1546, 2008. J. M. Murray. Local online learning in recurrent networks with random feedback. eLife, 8:e43299, 2019. T. Nath, A. Mathis, A. C. Chen, A. Patel, M. Bethge, and M. W. Mathis. Using deeplabcut for 3d markerless pose estimation across species and behaviors. Nature protocols, 14(7):2152–2176, 2019. A. Nayebi, D. Bear, J. Kubilius, K. Kar, S. Ganguli, D. Sussillo, J. J. DiCarlo, and D. L. Yamins. Task-driven convolutional recurrent models of the visual system. In Advances in Neural Information Processing Systems, pages 5290–5301, 2018. W. Nicola and C. Clopath. Supervised learning in spiking neural networks with force training. Nature communications, 8(1):2208, 2017. Y. Niv. Reinforcement learning in the brain. Journal of Mathematical Psychology, 53(3):139–154, 2009. S. W. Oh, J. A. Harris, L. Ng, B. Winslow, N. Cain, S. Mihalas, Q. Wang, C. Lau, L. Kuan, A. M. Henry, et al. A mesoscale connectome of the mouse brain. Nature, 508(7495):207, 2014. E. Oja. Simplified neuron model as a principal component analyzer. Journal of mathematical biology, 15(3):267–273, 1982. S. R. Olsen, D. S. Bortone, H. Adesnik, and M. Scanziani. Gain control by layer six in cortical circuits of vision. Nature, 483(7387):47–52, 2012. B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996. A. E. Orhan and W. J. Ma. A diverse range of factors affect the nature of neural representations underlying short-term memory. Nature neuroscience, page 1, 2019. C. Pandarinath, D. J. O’Shea, J. Collins, R. Jozefowicz, S. D. Stavisky, J. C. Kao, E. M. Trautmann, M. T. Kaufman, S. I. Ryu, L. R. Hochberg, et al. 
Inferring single-trial neural population dynamics using sequential auto-encoders. Nature methods, page 1, 2018. 41 R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318, A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019. J. Pei, L. Deng, S. Song, M. Zhao, Y. Zhang, S. Wu, G. Wang, Z. Zou, Z. Wu, W. He, et al. Towards artificial general intelligence with hybrid tianjic chip architecture. Nature, 572(7767):106–111, 2019. B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964. C. R. Ponce, W. Xiao, P. F. Schade, T. S. Hartmann, G. Kreiman, and M. S. Living- stone. Evolving images for visual neurons using a deep generative network reveals coding principles and neuronal preferences. Cell, 177(4):999–1009, 2019. R. Prenger, M. C.-K. Wu, S. V. David, and J. L. Gallant. Nonlinear v1 responses to natural scenes revealed by neural network analysis. Neural Networks, 17(5-6): 663–679, 2004. R. Rajalingham, E. B. Issa, P. Bashivan, K. Kar, K. Schmidt, and J. J. DiCarlo. Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. Journal of Neuroscience, 38(33):7255–7269, 2018. K. Rajan, C. D. Harvey, and D. W. Tank. Recurrent network models of sequence generation and memory. Neuron, 90(1):128–142, 2016. R. P. Rao and D. H. Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience, 2(1):79, 1999. J. H. Reynolds and D. J. Heeger. The normalization model of attention. 
Neuron, 61 (2):168–185, 2009. B. A. Richards, T. P. Lillicrap, P. Beaudoin, Y. Bengio, R. Bogacz, A. Christensen, C. Clopath, R. P. Costa, A. de Berker, S. Ganguli, et al. A deep learning framework for neuroscience. Nature neuroscience, 22:1761–1770, 2019. M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature neuroscience, 2(11):1019–1025, 1999. M. Rigotti, D. D. Ben Dayan Rubin, X.-J. Wang, and S. Fusi. Internal representation of task rules by recurrent dynamics: the importance of the diversity of neural responses. Frontiers in computational neuroscience, 4:24, 2010. M. Rigotti, O. Barak, M. R. Warden, X.-J. Wang, N. D. Daw, E. K. Miller, and S. Fusi. The importance of mixed selectivity in complex cognitive tasks. Nature, 497(7451):585, 2013. 42 H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951. P. R. Roelfsema and A. Holtmaat. Control of synaptic plasticity in deep cortical networks. Nature Reviews Neuroscience, 19:166, 2018. J. D. Roitman and M. N. Shadlen. Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. J. Neurosci., 22: 9475–9489, 2002. R. Romo, C. D. Brody, A. Hernández, and L. Lemus. Neuronal correlates of parametric working memory in the prefrontal cortex. Nature, 399(6735):470–473, F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958. F. Rosenblatt. Principles of neurodynamics: Perceptions and the theory of brain mechanisms. 1962. D. B. Rubin, S. D. Van Hooser, and K. D. Miller. The stabilized supralinear network: a unifying circuit motif underlying multi-input integration in sensory cortex. Neuron, 85(2):402–417, 2015. D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. nature, 323(6088):533–536, 1986. J. Sacramento, R. P. Costa, Y. 
Bengio, and W. Senn. Dendritic cortical microcircuits approximate the backpropagation algorithm. In Advances in neural information processing systems, pages 8721–8732, 2018. E. Salinas and P. Thier. Gain modulation: a major computational principle of the central nervous system. Neuron, 27(1):15–21, 2000. A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dy- namics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox. On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124020, 2019a. A. M. Saxe, J. L. McClelland, and S. Ganguli. A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, 116(23):11537–11546, 2019b. W. Schultz, P. Dayan, and P. R. Montague. A neural substrate of prediction and reward. Science, 275(5306):1593–1599, 1997. H. S. Seung. How the brain keeps the eyes still. Proc. Natl. Acad. Sci. (USA), 93: 13339–13344, 1996. Y. Shu, A. Hasenstaub, and D. A. McCormick. Turning on and off recurrent balanced cortical activity. Nature, 423(6937):288–293, 2003. 43 R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017. D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hu- bert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. H. Sompolinsky, A. Crisanti, and H.-J. Sommers. Chaos in random neural networks. Physical review letters, 61(3):259, 1988. H. F. Song, G. R. Yang, and X.-J. Wang. 
Training excitatory-inhibitory recurrent neural networks for cognitive tasks: a simple and flexible framework. PLoS computational biology, 12(2):e1004792, 2016. H. F. Song, G. R. Yang, and X.-J. Wang. Reward-based training of recurrent neural networks for cognitive and value-based tasks. Elife, 6:e21492, 2017. S. Song, K. D. Miller, and L. F. Abbott. Competitive hebbian learning through spike-timing-dependent synaptic plasticity. Nature neuroscience, 3(9):919–926, N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014. M. G. Stokes. ‘activity-silent’working memory in prefrontal cortex: a dynamic coding framework. Trends in cognitive sciences, 19(7):394–405, 2015. S. Strogatz. Nonlinear dynamics and chaos: with applications to physics, biology, chemistry, and engineering (studies in nonlinearity). 2001. D. Sussillo. Neural circuits as computational dynamical systems. Current opinion in neurobiology, 25:156–163, 2014. D. Sussillo and L. F. Abbott. Generating coherent patterns of activity from chaotic neural networks. Neuron, 63(4):544–557, 2009. D. Sussillo and O. Barak. Opening the black box: low-dimensional dynamics in high-dimensional recurrent neural networks. Neural computation, 25(3):626–649, D. Sussillo, M. M. Churchland, M. T. Kaufman, and K. V. Shenoy. A neural network that finds a naturalistic solution for the production of muscle activity. Nature neuroscience, 18(7):1025, 2015. I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013. R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 44 C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fer- gus. Intriguing properties of neural networks. 
arXiv preprint arXiv:1312.6199, A. Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier, and A. Maida. Deep learning in spiking neural networks. Neural Networks, 111:47–63, 2019. T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012. A. N. Tikhonov. On the stability of inverse problems. In Dokl. Akad. Nauk SSSR, volume 39, pages 195–198, 1943. D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016. C. Van Vreeswijk and H. Sompolinsky. Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274(5293):1724–1726, 1996. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017. J. Wang, D. Narain, E. A. Hosseini, and M. Jazayeri. Flexible timing by temporal scaling of cortical responses. Nature neuroscience, 21(1):102, 2018. X.-J. Wang. Synaptic reverberation underlying mnemonic persistent activity. Trends in Neurosci., 24:455–463, 2001. X.-J. Wang. Probabilistic decision making by slow reverberation in cortical circuits. Neuron, 36(5):955–968, 2002. X.-J. Wang. Decision making in recurrent neuronal circuits. Neuron, 60(2):215–234, X.-J. Wang and G. R. Yang. A disinhibitory circuit motif and flexible information routing in the brain. Curr. Opin. Neurobiol., 49:75–83, 2018. X.-J. Wang, J. Tegnér, C. Constantinidis, and P. S. Goldman-Rakic. Division of labor among distinct subtypes of inhibitory neurons in a cortical microcircuit of working memory. Proc Natl Acad Sci U S A, 101:1368–1373, 2004. P. J. Werbos. Backpropagation through time: what it does and how to do it. Pro- ceedings of the IEEE, 78(10):1550–1560, 1990. A. H. Williams, T. H. Kim, F. Wang, S. Vyas, S. I. Ryu, K. V. 
Shenoy, M. Schnitzer, T. G. Kolda, and S. Ganguli. Unsupervised discovery of demixed, low-dimensional neural dynamics across multiple timescales through tensor component analysis. Neuron, 98(6):1099–1115, 2018. R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989. 45 H. R. Wilson and J. D. Cowan. Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical journal, 12(1):1–24, 1972. Y. Wu and K. He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018. X. Xie and H. S. Seung. Equivalence of backpropagation and contrastive hebbian learning in a layered network. Neural computation, 15(2):441–454, 2003. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, Y. Yamane, E. T. Carlson, K. C. Bowman, Z. Wang, and C. E. Connor. A neural code for three-dimensional object shape in macaque inferotemporal cortex. Nature neuroscience, 11(11):1352, 2008. D. L. Yamins and J. J. DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature neuroscience, 19(3):356, 2016. D. L. Yamins, H. Hong, C. F. Cadieu, E. A. Solomon, D. Seibert, and J. J. DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619– 8624, 2014. G. Yang, F. Pan, and W.-B. Gan. Stably maintained dendritic spines are associated with lifelong memories. Nature, 462(7275):920–924, 2009. G. R. Yang, J. D. Murray, and X.-J. Wang. A dendritic disinhibitory circuit mecha- nism for pathway-specific gating. Nat Commun, 7:12815, 2016. G. R. Yang, I. Ganichev, X.-J. Wang, J. Shlens, and D. Sussillo. 
A dataset and architecture for visual reasoning with a working memory. In European Conference on Computer Vision, pages 729–745. Springer, 2018. G. R. Yang, M. R. Joglekar, H. F. Song, W. T. Newsome, and X.-J. Wang. Task representations in neural networks trained to perform many cognitive tasks. Nature neuroscience, page 1, 2019. M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014. F. Zenke and S. Ganguli. Superspike: Supervised learning in multilayer spiking neural networks. Neural computation, 30(6):1514–1541, 2018. F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3987–3995. JMLR. org, 2017. C. Zhuang, S. Yan, A. Nayebi, and D. Yamins. Self-supervised neural network models of higher visual cortex development. In 2019 Conference on Cognitive Computational Neuroscience, pages 566–569, 2019. 46 D. Zipser and R. A. Andersen. A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature, 331(6158):679, 1988. R. S. Zucker and W. G. Regehr. Short-term synaptic plasticity. Annual review of physiology, 64(1):355–405, 2002.

Journal

Quantitative Biology, arXiv (Cornell University)

Published: Jun 1, 2020
