Going in circles is the way forward: the role of recurrence in visual inference

Ruben S. van Bergen (1), Nikolaus Kriegeskorte (1–4)

(1) Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY, United States
(2) Department of Psychology, Columbia University, New York, NY, United States
(3) Department of Neuroscience, Columbia University, New York, NY, United States
(4) Affiliated member, Electrical Engineering, Columbia University, New York, NY, United States

arXiv:2003.12128v3 [q-bio.NC] 16 Nov 2020

Biological visual systems exhibit abundant recurrent connectivity. State-of-the-art neural network models for visual recognition, by contrast, rely heavily or exclusively on feedforward computation. Any finite-time recurrent neural network (RNN) can be unrolled along time to yield an equivalent feedforward neural network (FNN). This important insight suggests that computational neuroscientists may not need to engage recurrent computation, and that computer-vision engineers may be limiting themselves to a special case of FNN if they build recurrent models. Here we argue, to the contrary, that FNNs are a special case of RNNs and that computational neuroscientists and engineers should engage recurrence to understand how brains and machines can (1) achieve greater and more flexible computational depth, (2) compress complex computations into limited hardware, (3) integrate priors and priorities into visual inference through expectation and attention, (4) exploit sequential dependencies in their data for better inference and prediction, and (5) leverage the power of iterative computation.

INTRODUCTION

The primate visual cortex uses a recurrent algorithm to process sensory input [1–3]. Anatomically, connectivity is cyclic. Neurons are connected in cycles within local cortical circuits [4–6]. Global inter-area connections are dense and mostly bidirectional [7–9]. Physiologically, the dynamics of neural responses bear temporal signatures indicative of recurrent processing [1,10,11]. Behaviorally, visual perception can be disturbed by carefully timed interventions that coincide with the arrival of re-entrant information to a visual area [12–15]. The evidence for recurrent computation in the primate brain, thus, is unequivocal. What is less obvious, however, is why the brain uses a recurrent algorithm.

This question has recently been brought into sharper focus by the successes of deep feedforward neural network models (FNNs) [16,17]. These models now match or exceed human performance on certain visual tasks [18–20], and better predict primate recognition behavior [21–23] and neural activity [24–29] than current alternative models.

Although computer vision and computational neuroscience both have a long history of recurrent models [30–33], feedforward models have earned a dominant status in both fields. How should we account for this discrepancy between brains and models?

One answer is that the discrepancy reflects the fact that brains and computer-vision systems operate on different hardware and under different constraints on space, time, and energy. Perhaps we have come to a point at which the two fields must go their separate ways. However, this answer is unsatisfying. Computational neuroscience must still find out how visual inference works in brains. And although engineers face quantitatively different constraints when building computer-vision systems, they, too, must care about the spatial, temporal, and energetic limitations their models must operate under when deployed in, for example, a smartphone. Moreover, as long as neural network models continue to dominate computer vision, more efficient hardware implementations are likely to be more similar to biological neural networks than current implementations using conventional processors and graphics processing units (GPUs).

A second explanation for the discrepancy is that the abundance of recurrent connections in cortex belies a superficial role in neural computation. Perhaps the core computations can be performed by a feedforward network, while recurrent processing serves more auxiliary and modulatory functions, such as divisive normalization and attention [36–39]. This perspective is convenient because it enables us to hold on to the feedforward model in our minds. The auxiliary and modulatory functions let us acknowledge recurrence without fundamentally changing the way we envision the algorithm of recognition.

However, there is a third and more exciting explanation for the discrepancy between recurrent brains and feedforward models: Although feedforward computation is powerful, a recurrent algorithm provides a fundamentally superior solution to the problem of visual inference, and this algorithm is implemented in primate visual cortex. This recurrent algorithm explains how primate vision can be so efficient in terms of space, time, energy, and data, while being so rich and robust in terms of the inferences and their generalization to novel environments.

In this review, we argue for the latter possibility, discussing a range of potential computational functions of recurrence and citing the evidence suggesting that the primate brain employs them. We aim to distinguish established from more speculative, and superficial from more profound forms of recurrence, so as to clarify the most exciting directions for future research that will close the gap between models and brains.

UNROLLING A RECURRENT NETWORK

What exactly do we mean when we say that a neural network, whether biological or artificial, is recurrent rather than feedforward? This may seem obvious, but it turns out that the distinction can easily be blurred. Consider the simple network in Fig. 1a. It consists of three processing stages, arranged hierarchically, which we will refer to as areas, by analogy to cortex. Each area contains a number of neurons (real or artificial) that apply fixed operations to their input. Visual input enters in the first area, where it undergoes some transformation, the result of which is passed as input to the second area, and so forth. Information travels exclusively in one direction (the “forward” direction, from input to output), and so this is an example of a feedforward architecture. Notably, the number of transformations between input and output is fixed, and equal to the number of areas in the network.

Now compare this to the architecture in Fig. 1b. Here, we have added lateral and feedback connections to the network. Lateral connections allow the output of an area to be fed back into the same area, to influence its computations in the next processing step. Feedback connections allow the output of an area to influence information processing in a lower area. There is some freedom in the order in which computations may occur in such a network. The order we illustrate here starts with a full feedforward pass through the network. In subsequent time steps, neural activations are updated in ascending order through the hierarchy, based on the activations that were computed in the previous time step. This order of operations can be seen more clearly if we “unroll” the network in time, as shown in Fig. 1c.
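The update order described above can be made concrete with a small sketch. It is an illustration only, not the authors' model: each area is reduced to a single unit, the weights `W_FF`, `W_LAT`, and `W_FB` are hypothetical values, and we assume one possible reading of the update rule, in which the bottom-up drive uses the lower area's freshly updated activity while lateral and feedback inputs use activities from the previous time step.

```python
def relu(v):
    """Rectifying nonlinearity applied by every area."""
    return max(0.0, v)

# Hypothetical scalar weights for a three-area network (one unit per area).
W_FF = [0.8, 1.2, 0.5]    # bottom-up: input->area 1, area 1->area 2, area 2->area 3
W_LAT = [0.3, -0.2, 0.1]  # lateral: area i at t-1 -> area i at t
W_FB = [0.4, -0.5]        # feedback: area i+1 at t-1 -> area i at t

def feedforward(x, w_ff=W_FF):
    """One bottom-up sweep: exactly one transformation per area."""
    acts = []
    drive = x
    for w in w_ff:
        drive = relu(w * drive)
        acts.append(drive)
    return acts

def run_recurrent(x, steps, w_ff=W_FF, w_lat=W_LAT, w_fb=W_FB):
    """Unroll the recurrent network in time and return the final area activities."""
    acts = feedforward(x, w_ff)          # time step 1: full feedforward pass
    for _ in range(steps - 1):           # later steps reuse the same weights
        prev = acts
        new = []
        for i in range(len(w_ff)):       # ascending order through the hierarchy
            bottom_up = x if i == 0 else new[i - 1]
            total = w_ff[i] * bottom_up + w_lat[i] * prev[i]
            if i < len(w_fb):            # the top area receives no feedback
                total += w_fb[i] * prev[i + 1]
            new.append(relu(total))
        acts = new
    return acts
```

Calling `run_recurrent(1.0, steps=3)` applies the same three sets of weights three times, so the depth of the computation grows with `steps` while the parameter count stays fixed. Setting `w_lat` and `w_fb` to zero recovers the output of `feedforward` for any number of steps.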
In Fig. 1c, the network is unrolled for a fixed number of time steps (3). In fact, recurrent processing can be run for any desired duration before its output is read out, a notion we will return to later. Notice how this temporally unrolled, small network resembles a larger feedforward neural network with more connections and areas between its input and output. We can emphasize this recurrent-feedforward equivalence by interpreting the computational graph over time as a spatial architecture, and visually arranging the induced areas and connections in a linear spatial sequence, an operation we call unrolling in space (Fig. 1d). This results in a deep feedforward architecture with many skip connections between areas that are separated by more than one level in this new hierarchy, and with many connections that are exact copies of one another (sharing identical connection weights).

FIG. 1: Unrolling recurrent neural networks. (a) A simple feedforward neural network. (b) The same network with lateral (blue) and feedback (red) connections added, to make it recurrent. (c) “Unrolling” the network in time clarifies the order of its computations. Here, the network is unrolled for three time steps before its output is read out, but we could choose to run the network for more or fewer steps. Areas are staggered from left to right to show the order in which their neural activities are updated. (d) Alternatively, we can unroll the recurrent network’s time steps in space, by arranging the areas and connections from different time steps in a linear spatial sequence. Note how all arrows now once again point in the same (forward) direction, from input to output. Throughout panels (a-b), connections that are identical (sharing the same weight matrices) are indicated by corresponding symbols. (e) If we lift the weight-sharing constraints from the previous network, this induces a deep feedforward “super-model”, which can implement the spatially-unrolled recurrent network as a special case. This more general architecture may include additional connections (examples shown as light gray arrows) not present in the spatially-unrolled recurrent net.

Thus, any finite-time RNN can be transformed into an equivalent FNN. But this should not be taken to mean that RNNs are a special case of FNNs. In fact, FNNs are a special case of finite-time RNNs (Fig. 2a), comprising those which happen to have no cycles. More practically, not every unrolled finite-time RNN is a realistic FNN (Fig. 2b). By realistic networks, we mean networks that conform to the real-world constraints the system must operate under. For computational neuroscience, a realistic network is one that fits in the brain of the animal and does not require a deeper network architecture or more processing steps than the animal can accommodate. For computer vision, a realistic network is one that can be trained and deployed on available hardware at the training and deployment stages.

[Figure 2 diagram labels: finite-time RNNs; feedforward NNs; unrolled ftRNNs; realistically unrollable ftRNNs; realistic; unrealistic.]

FIG. 2: Relationships between recurrent and feedforward networks. This figure illustrates relationships between discrete-time feedforward (FNN) and discrete-time recurrent (RNN) neural network models. (a) The architecture of any RNN can be reduced to an FNN by removing all its recurrent connections (e.g., going from Fig. 1b back to Fig. 1a), or equivalently, setting the weights of these connections to zero. Vice versa, any FNN can be expanded to an infinite variety of RNNs by adding lateral or feedback connections. Feedforward networks, thus, form an architectural subset of RNNs.
Here we specifically consider RNNs that accomplish their task in a finite number of time steps. These finite-time RNNs (ftRNNs) have the special property that they can be unrolled into equivalent FNNs. White points linked by arcs indicate pairs of computationally equivalent architectures. Thus, the feedforward NNs contain a subset of architectures that can be obtained by unrolling a ftRNN. (b) These sets of networks can be further subdivided into subsets that are or are not realistic to implement with the computational resources available for a brain or engineered device (areas below and above the dotted line, respectively). Deeper networks and, more generally, networks with more neurons and connections tend to require more memory and computation to train and run. Some realistic ftRNNs remain realistic when expressed as an FNN (blue ellipse). Others, however, become too complex, when unrolled, to be feasible (black arc crossing the realism line). This is because the unrolling operation induces a much deeper architecture with many more neural connections to be stored. These not-realistically-unrollable ftRNNs are especially interesting, since they correspond to recurrent solutions that cannot be replaced by feedforward architectures.

For example, there may be limits on the storage and energy available, which would limit the complexity of the architecture and computational graph. A realistic finite-time RNN, when unrolled, can yield an unworkably deep FNN. Although the most widely used current method for training RNNs (backpropagation through time) requires unrolling, an RNN is not equivalent to its unrolled FNN twin at the stage of real-world deployment: the RNN’s recurrent connections need not be physically duplicated, but can be reused across cycles of computation.

An important recent observation [40–42] is that the architecture that results from spatially unrolling a recurrent network resembles the architectures of state-of-the-art FNNs used in computer vision, which similarly contain skip connections and can be very deep. These deep FNNs may form a super-class of models (Fig. 1e), which reduce to “recurrent-equivalent” architectures when certain subsets of weights are constrained to be identical. Liao & Poggio showed that deep feedforward architectures known as residual networks (ResNets) are formally equivalent to recurrent architectures when certain connection weights are constrained to be identical. Moreover, when ResNets were trained with such recurrent-equivalent weight-sharing constraints, their performance on computer vision benchmarks was similar to unconstrained ResNets (even though the weight sharing drastically reduces the parameter count and limits the component computations that the network can perform). This is especially noteworthy given that ResNets, and architecturally related DenseNets, are currently among the top-ranking FNNs on prominent computer vision benchmarks [19,43], as well as measures of brain-similarity. Today’s best artificial vision models, thus, actually implement computational graphs closely related to those of recurrent networks, even though these models are strictly feedforward architectures.

CONTINUOUS- VERSUS DISCRETE-TIME DYNAMICS

FNNs used in computer vision do not have meaningful dynamics. Each unit in the network instantaneously transforms its input into an output. This is in contrast to a feedforward network of biological neurons. When given a static input, biological neurons do not immediately produce their final responses. The movement of electric charges and neurotransmitters, and the opening and closing of ion channels, take time, so the network will gradually transition from its initial to its final state, with its trajectory continually perturbed by noise. Such continuous-time dynamics can be described by differential equations. When these cannot be solved analytically (as is typically the case), the dynamics can be simulated in discrete steps. In each step, the current state of each simulated neuron is updated. The future state of the network thus depends on its current state, as it does in an RNN. Consequently, the computational graph of the simulation algorithm contains loops from each neuron back to itself. Running the simulation over time amounts to unrolling this loopy computational graph, even though the network architecture did not contain loops.

Computational neuroscientists commonly study models of feedforward and recurrent neural networks with continuous-time dynamics. Here our focus is on neural network models that are motivated by the goal of capturing computations, rather than their precise neural implementation. The discrete-time behavior of such a model is not derived from a continuous-time description in differential equations. Moreover, the model is optimized in its discrete-time implementation. However, an implicit assumption in the field is that such models could be implemented in biological brains, and thus in continuous-time dynamical systems.

REASONS TO RECUR

We have described how a recurrent network can be unrolled into a deep feedforward architecture. The resulting feedforward super-model offers greater computational flexibility, since weight-sharing constraints can be omitted and additional skip connections added to the network (Fig. 1e). So what would be the benefit of restricting ourselves to recurrent architectures? We will first discuss the benefits of recurrence in terms of overarching principles, before considering more specific implementations of these principles.

Recurrence provides greater and more flexible computational depth

Recurrence enables arbitrary computational depth

One important advantage of recurrent algorithms is that they can be run for any desired length of time before their output is collected. We can define computational depth as the maximum path length (i.e., number of successive connections and nonlinear transformations) between input and output. A recurrent neural network (RNN) can achieve arbitrary computational depth despite having a finite count of parameters and being limited to finite spatial components. In other words, it can multiply its limited spatial resources along time. These deeper computations can serve to expand on the number of hypotheses considered (in generative inference) or on the number of nonlinear features computed (in discriminative inference), or to extend the representation into the future or past, or to iteratively converge to a good estimate of some latent variable of interest.

Recurrence enables more flexible expenditure of energy and time in exchange for inferential accuracy

In addition to enabling an arbitrarily deep computation given enough time, an RNN can adjust its computational depth to the task at hand. The computational depth of a feedforward net, by contrast, is a fixed number determined by the architecture.

Spoerer et al. implemented a recurrent model that terminates computations when it reaches a confidence threshold (defined by the entropy of the posterior, a measure of the model’s uncertainty). The model terminates rapidly for many images, but expends more time and energy on hard images to reach its confidence threshold. Adjusting the confidence threshold enables trading off speed for accuracy in terms of average performance. When compared to a range of FNNs requiring different amounts of computation, the RNN achieved roughly the same accuracy as each of the FNNs when its confidence threshold was set to match the FNN’s computational cost (number of floating point operations) on average across images (Fig. 3). Flexible computational depth would be advantageous for animals, who may need to respond rapidly in some situations, must limit metabolic expenditures in general, and may benefit from slower and more energetically costly inferences when high accuracy is required. Computer vision faces similar requirements in certain applications. For example, a vision algorithm in a smartphone should respond rapidly and conserve energy in general, but should also be able to recognize hard images, and it should allow trading off mean accuracy for speed and energy (e.g., when the battery is low).

[Figure 3 plot. x-axis: computational cost (number of floating-point operations); y-axis: accuracy (proportion top-1 correct); color scale: entropy threshold (nats); series: RNN, FNNs.]

FIG. 3: Recurrence enables a network to trade speed for accuracy while approximately emulating the accuracies of feedforward models on average at matched computational cost. Circles denote the performance of a recurrent neural network (RNN) that was run for different numbers of time steps, until it achieved a desired threshold of image classification confidence (quantified by the entropy of the class probabilities in the final network layer). Squares correspond to three architecturally similar feedforward networks (FNN) with different computational costs. On the x-axis is the computational cost of running these models, measured by the number of floating point operations. For the feedforward models, this cost is fixed by the architecture. For the recurrent models, it is the average number of operations that was required to meet the given entropy threshold. The y-axis shows the classification accuracy achieved by each model. The performance of the recurrent model for different certainty thresholds follows a smooth curve, trading off computational cost (and thus computational speed) and accuracy. Note that this curve passes almost exactly through the cost-accuracy combinations achieved by the feedforward models. Thus, a single recurrent model can emulate the performance of multiple feedforward models as it trades off speed and accuracy. When the confidence threshold of termination was set such that the RNN matched the accuracy of a given FNN, the RNN required a similar number of floating-point operations on average as the FNN. (Figure adapted with permission from the authors.)

Recurrent architectures can compress complex computations in limited hardware

Another benefit of recurrent solutions is that they require fewer components in space when physically implemented in recurrent circuits, such as brains. Compare Figs. 1b and 1e: the recurrent network is anatomically more compact than the feedforward network and has fewer connections. It is easy to see why evolution might have favored recurrent implementations for many brain functions: Space, neural projections, and the energy to develop and maintain them are all costly for the organism. In addition, synaptic efficacies must be either learned from limited experience or encoded in a limited-capacity genome. Beyond saving space, material, and energy, thus, smaller descriptive complexity (or parameter count) might ease development and learning.

Engineered devices face the same set of costs, although their relative weighting changes from application to application. In particular, a larger number of units and weights must either be represented in the memory of a conventional computer or implemented in specialized (e.g., neuromorphic) hardware. The connection weights in an NN model need to be learned from limited data. This requires extensive training, e.g., in a supervised setting, with millions of hand-labeled examples that show the network the desired output for a given input. The larger number of parameters associated with a feedforward solution might overfit the training data. The learned parameters then do not generalize well to new examples of the same task.

FNNs often turn out to generalize surprisingly well even when they have very large numbers of parameters [45–47]. This phenomenon is thought to reflect a regularizing effect of the learning algorithm, stochastic gradient descent. Indeed, the trend is towards ever deeper networks with more connections to be optimized, and this trend is associated with continuing gains in performance on computer vision benchmarks. Nevertheless, it could turn out that recurrent architectures that achieve high computational depth with fewer parameters bring benefits not only in terms of their storage, but also in terms of statistical efficiency, the ability to generalize accurately based on limited experience. This would imply that recurrent networks have an inductive bias that makes up for the limited experiential data. This is explored further in subsequent sections, where we discuss how RNNs can exploit temporal dependency structures, and enable iterative inference.

Energy is another factor to consider in both biology and engineering. Larger FNNs take longer to train on bigger computing clusters, while drawing greater amounts of power, a trend that is not sustainable. In the long run, therefore, computer vision too may benefit from the anatomical compression that can be achieved through clever use of recurrence.

Importantly, however, not every deep feedforward model can be compressed into an equivalent recurrent implementation. This anatomical compression can only be achieved when the same function may be applied iteratively or recursively within the network. The crucial question, therefore, is: what are these functions? What operations can be applied repeatedly in a productive manner? The remainder of this paper will reflect on the various roles that have been proposed for recurrent processing for visual inference, from superficial to increasingly profound forms of recurrence.

Feedback connections are required to integrate information from outside the visual hierarchy

A key, established role of recurrent connections in biological vision is to propagate information from outside the visual cortex, so that it can aid visual inference. Here, we will briefly discuss two such outside influences: attention and expectations.

Attentional prioritization requires feedback connections

Animals have needs and goals that change from moment to moment. Perception is attuned to an animal’s current objectives. For instance, a primate foraging for red berries may be more successful if its visual perception apparatus prioritizes or enhances the processing of red items. Since current goals are represented outside the visual cortex (e.g., in frontal regions), top-down connections are clearly required for this information to influence visual processing. Such top-down effects have been grouped under the label “attention”, and they have been the subject of an entire sub-field of study. For our purposes, it is sufficient to note that the effects and mechanisms of top-down attention are well-documented and pervasive in visual cortex (for review, see [36–38]), and thus there is no question that this is one important function of recurrent connections.

Integrating prior expectations into visual inference requires feedback connections

Organisms may constrain their visual inferences by expectations. Visual input can be ambiguous and unreliable, and thus open to multiple interpretations. To constrain the inference, an observer can make use of prior knowledge [51–53]. One form of prior knowledge is environmental constants (e.g., “light tends to come from above”). Such unvarying knowledge may be stored within visual cortex, especially when it pertains to the overall prevalence of basic visual features (e.g., local edge orientations). Another form of prior knowledge is contextual information specific to the current situation. Such time-varying knowledge may require a flexible representation outside visual cortex (e.g., “I rang the doorbell at my mother’s house, so I expect to see her open the door”). Such expectations, represented in higher cortical regions, require feedback connections to affect processing in visual cortex.

The top-down imposition of attention and expectation must be mediated by feedback connections. However, it is unclear whether these influences fundamentally change the nature of visual representations or merely modulate these representations, adjusting the gain depending on the current relevance of different features of the visual input. As illustrated in Fig. 4a, for a given input this would require only two “sweeps” of computation through the visual processing hierarchy: a feedback sweep that primes visual areas with top-down information, and a bottom-up sweep to interpret the visual input and integrate or modify this interpretation with the top-down signal (not necessarily in that order). Importantly, if the feedback signal merely enhances or suppresses some visual features, then the core inference algorithm need not be fundamentally recurrent: one can imagine that the bottom-up part of such a network is modeled perfectly by an FNN, while an optional recurrent module could be added in order to implement top-down contextual influences.

[Figure 4 panels: (a) two sweeps; (b) Kalman filter; (c) iterative inference.]

FIG. 4: Increasingly profound modes of recurrent processing, unrolled in time. Visual cortex likely combines all three modes of recurrence illustrated here. The left side of each panel shows the computational graph induced by each form of recurrence, while the right side illustrates a (simplified) example of how this recurrence can be used. In these examples, circles correspond to neurons (or neural assemblies) encoding the feature illustrated within the circle, and lines that connect to circles indicate neural connections with significant activity.
(a) Top-down influences from outside the visual processing hierarchy may be incorporated through two computational sweeps: a feedback sweep priming the network with top-down information and a feedforward sweep to interpret visual input and combine this interpretation with the top-down signal. Note that the lateral connections here merely copy neural activities in each area to the next time point; this identity transformation could also be implemented in other ways, such as slow membrane time constants or other forms of local memory. In the example on the right, a top-down signal communicates the expectation that the upcoming input will be horizontal motion. This primes neurons encoding this direction of motion to be more easily or strongly activated, and sharpens the interpretation of the subsequent (ambiguous) visual input. (b) To efficiently perform inference on time-varying visual input, recurrent connections may implement a fixed temporal prediction function akin to the transition kernel in a Kalman filter, extrapolating the ongoing dynamics of the world one time step into the future. For instance, in the example on the right, a downward moving square was perceived at t = 1. This motion is predicted to continue, and this prediction constrains the interpretation of the (ambiguous) visual input at the next time point. For simplicity, only lateral recurrence is shown in this example. Note that each input is mapped onto its corresponding output in a single recurrent time step. (c) Static input may also benefit from recurrent processing that iteratively refines an initial, coarse feedforward interpretation. In this mode of recurrence, there are several processing time steps between input and output, whereas in (b) there was one input and output for each time step. Illustrated on the right is an iterative hierarchical inference algorithm.
Here, a higher-level hypothesis, generated in the first time step, refines the underlying lower-level representation in the next time step, which in turn improves the higher-level hypothesis, and so forth, until the network converges to an optimal interpretation of the input across the entire hierarchy. For simplicity, lateral recurrent interactions are not shown in this example.

Recurrent networks can exploit temporal dependency structure

Contextual constraints on visual inference include not only information from outside the visual hierarchy, such as information from other sensory modalities and memory, as discussed in the previous section. The recent stimulus history within the visual modality also provides context, likely represented within the visual system.

Recurrent networks can dynamically compress the stimulus history

The primate visual system is thought to contain a hierarchy, not only of processing stages and spatial scales, but also of temporal scales [56,57]. Visual representations track the environment moment by moment. However, the duration of a visual moment, the temporal grain, may depend on the level of representation. These principles apply to all sensory modalities and have been empirically explored, in particular, for audition and speech perception. At the simplest level, a neural network could use delay lines to detect spatiotemporal, rather than purely spatial, patterns. Recurrent neural networks have internal states and can represent temporal context across units tuned to different latencies. An RNN could represent a fixed temporal window, by replicating units tuned to different patterns for multiple latencies. However, RNNs trained on sequence processing tasks, such as language translation, learn more sophisticated representations of temporal context. They can represent context at multiple time scales, learning a latent representation that enables them to dynamically compress whatever information from the past is needed for the task. In contrast

Recurrent dynamics can simulate and predict the dynamics of the world

Dynamic compression of the past exploits the temporal dependency structure of the sensory data. The purpose of representing the past is to act well in the future. This suggests that a neural network should exploit temporal dependencies not just to compress the past, but also to predict the future. In fact, an optimal representation of even just the present requires prediction, because the sensory data is delayed and noisy.

Changes in the world are governed by laws of dynamics, which by definition are temporally invariant. An ideal observer will exploit these laws in visual inference and optimally combine previous with present observations to estimate the current state. This implies an extrapolation of the past to generate predictions that improve the interpretation of the present sensory input. When the dynamics are linear and noise is Gaussian, the optimal way to infer the present state by combining past and present evidence is the Kalman filter, an algorithm widely used in engineering applications. A number of authors [60–63] have proposed that the visual cortex may implement an algorithm similar to a Kalman filter. This theory is consistent with temporal biases that are evident in human perceptual judgments [64–66].

Kalman filters employ a fixed temporal transition kernel. This kernel takes a representation of the world (e.g., variables encoding the present state of a physical system, such as positions and velocities) at time t, and transforms it into a predicted representation for time t + 1, to be integrated with new sensory evidence that arrives at that time. While the resulting prediction varies as a function of the kernel’s input, the kernel itself is constant, reflecting the temporal shift-invariance of the laws governing the dynamics. Recurrent neural networks provide a generalization of the Kalman filter and can represent nonlinear dynamical systems with non-Gaussian noise.

Note that this type of recurrent processing is more profound than the two-sweep algorithm (Fig. 4a) that incorporated top-down influences on visual inference. The two-sweep algorithm is trivial to unroll into a feedforward architecture. In contrast, unrolling a Kalman filter-like recurrent algorithm would induce an infinitely deep feedforward network, with a separate set of areas and connections for each time point to be processed. A finite-depth feedforward architecture can only approximate the recurrent algorithm. While the feedforward approximation will have a finite temporal window of memory to constrain its present inferences, the recurrent network can in principle integrate information over arbitrarily long periods.

Due to their advantages for dealing with time-varying (or otherwise ordered) inputs, recurrent neural networks are in fact widely employed in the broader field of machine learning for tasks involving sequential data.
Speech recognition and to a feedforward network, a recurrent network is not limited machine translation are prominent applications that RNNs 58,67–70 by spatial constraints in terms of its retrospective time excel at . Computer vision, too, has embraced RNNs 71–73 horizon. It can maintain task-relevant information indefi- for recognition and prediction of video input . Note nitely, integrating long-term memory into its inferences. that these applications all exploit the dynamics in RNNs to 8 model the dynamics in the data. being subdivided into smaller hypotheses about lower or What if we trained a Kalman filter or sequence-to- intermediate-level features, such as the local edges that make up a larger contour. An iterative visual inference sequence RNN (Fig. 4b) on a train of independently sampled static inputs to be classified? The memory of the algorithm starts with an initial hypothesis, and refines it by incremental improvements. These improvements may preceding inputs would not be useful then, so we expect the recurrent model to revert to using essentially only its include eliminating hypotheses that are mutually exclusive, strengthening compatible causes, or adjusting a hypothesis feedforward weights. The type of recurrent processing we described in this section, thus uses memory to improve based on its ability to predict the data (the visual input). In a probabilistic framework, the optimization objective would visual inference. In the next section, we consider how recurrent processing can help with the inferential compu- be the likelihood (probability of the image given the latent representation) or the posterior probability (probability of tations themselves, even for static inputs. the latent representation given the image). 
Recurrence enables iterative inference Incompatible hypotheses can compete in the representation Recurrent processing can contribute even to inference on static inputs, and regardless of the agent’s goals and expec- There are often multiple plausible explanations for a given tations, by means of an iterative algorithm. An iterative sensory input that are mutually exclusive. The distributed, algorithm is one that employs a computation that improves parallel nature of neural networks enables them to initially an initial guess. Applying the computation again to the activate and represent all of these possible hypotheses simul- improved guess yields a further improvement. This process taneously. Recurrent connectivity between neurons can then can be repeated until a good solution has been achieved implement competitive interactions among hypotheses, so or until we run out of time or energy. Recurrent networks as to converge on the best overall explanation. can implement iterative algorithms, with the same neural There is some evidence that sensory representations are 74–76 network functions applied successively to some internal probabilistic – in this case, the probabilities assigned pattern of activity Fig. 4c). to a set of mutually exclusive hypotheses must sum to 1. In many fields, iterative algorithms are used to solve A strengthening of belief in one hypothesis, thus, should estimation and optimization problems. In each iteration, entail a reduction of the probability of other hypotheses in a small adjustment is made to the problem’s proposed the representation. If neurons encode point estimates rather solution, to improve a mathematically formulated objective. than probability distributions, then only one hypothesis A locally optimal solution is found by making small improve- can win (although that hypothesis may be encoded by ments until further progress is not required or not possible. a population response involving multiple neurons). 
The The algorithm navigates a path in the space of the values winning hypothesis could be the maximum a posteriori to be estimated or the parameters to be optimized, that (MAP) hypothesis or the maximum likelihood hypothesis. leads to a good solution (albeit not necessarily the global Influential models of visual inference involving compet- optimum). itive recurrent interactions include divisive normalization , 36 30,32,77 Much of machine learning involves iterative methods. biased competition , and predictive coding . Gradient descent is an iterative optimization method, whose Recent theoretical work has demonstrated that lateral stochastic variant is the most widely used method for competition can give rise to a robust neural code, and 77,78 training FNNs. Many discrete optimization techniques are can explain certain puzzling neural response properties . iterative. Iterative algorithms are also central to inference This theory considers a spiking neural network setting, in machine learning, for example in variational inference in which different neurons encode highly overlapping or (where inference is achieved by optimization), sampling even identical features in their input. This degeneracy methods (where steps are chosen stochastically such that means that the same signal can be encoded equally well the distribution of samples converges on the posterior distri- by a range of different response patterns. When a bution), and message passing algorithms (such as loopy particular neuron spikes, lateral inhibition ensures that belief propagation). In particular, such iterative inference other competing neurons do not encode the same part of algorithms are used in probabilistic approaches to computer the input again. Which neuron gets to do the encoding 31,33 vision . 
It is somewhat surprising, then, that iterative thus depends on which neuron fires first, because its computation is not widely exploited to perform visual membrane potential happened to be closest to a spiking inference in FNNs. threshold. This leads to trial-to-trial variability in neural Visual inference is naturally understood as an responses that reflects subtle differences in initial condi- optimization problem, where the goal is to find hypotheses tions – conditions that may not be known to an experi- that can explain the current visual input . A hypothesis, menter, who may thus mistake this variability for random in this case, is a proposed set of latent (i.e. unobserved) noise. This could explain the puzzling observation that causes that can jointly explain the image. The hypothe- individual neurons reliably reproduce the same output given sized latent causes could be the identities and positions of the same electrical stimulation, but populations of neurons, objects in the scene. Visual hypotheses are hierarchical, wired together, display apparently random variability under 9 79–81 85 sensory stimulation . Since multiple neurons can encode perceptual grouping operations . Recent examples include the same feature, the resulting code is also robust to neurons Linsley et al., who developed horizontal gated-recurrent being lost or temporarily inactivated. units (hGRUs) that learn local spatial dependencies . A network equipped with this particular recurrent connectivity FNNs do not incorporate lateral connections for compet- was competitive with state-of-the-art feedforward models itive interactions, although they very often include compu- on a contour integration task, while using far fewer free tations that serve a similar purpose. Chief among these parameters. George et al. similarly leveraged lateral inter- are operations known as max-pooling and local response 16,82 normalization (LRN) . 
In max-pooling, only the actions to recognize contiguous contours and surfaces, by modeling these with a conditional random field (CRF), using strongest response within a pool of competing neurons is forwarded to the next processing stage. In LRN, each a message-passing algorithm for inference. This approach made their Recursive Cortical Network (RCN) the first neuron has its response divided by a term that is computed from the sum of activity in its normalization pool. While computer vision algorithm to reliably beat CAPTCHAs – images of letter sequences under a variety of distortions, neither of these mechanisms is mediated by explicit lateral noise and clutter, that are widely used to verify that queries connections in a FNN, a strictly connectionist implemen- to a user interface are made by a person, and not an tation of these mechanisms (e.g., in biological neurons or algorithm. CRFs were also used by Zheng et al. , who neuromorphic hardware) would have to include lateral recur- incorporated them as a recurrent extension of a convolu- rence. This, then, is another way in which apparently tional neural network for image segmentation. The model feedforward FNNs can exhibit a (limited) form of recurrent processing ”under the hood”. Note, though, that each surpassed state-of-the-art performance at the time. Associ- ation rules enforced through lateral connections may also of these operations is carried out only once, rather than allowing competitive dynamics to converge over multiple help to fill in missing information, such as when objects are partially hidden from view by occluders. Lateral connec- iterations. Furthermore, in contrast to the lateral inter- actions in predictive coding or other normative models, tivity has been shown to improve recognition performance 23,89,90 in such settings . Montobbio et al. 
showed that LRN and max-pooling are not derived from normative lateral diffusion of activity between neurons with correlated principles, and do not necessarily select (or enhance) the feedforward filter weights improves robustness to image best hypothesis (however ”best” is defined). perturbations including occlusions . Enhancement of mutually compatible hypotheses (this section) and competition between mutually exclusive Compatible hypotheses can strengthen each other in the hypotheses (previous section) can both contribute to representation inference. A more general perspective is provided by the insight that prior knowledge about what features in a scene In feedforward models of hierarchical visual inference, are mutually compatible or exclusive may be part of an neurons at higher stages selectively respond to combinations overarching generative model, which iterative algorithms of simpler features encoded by lower-level neurons. Higher- can exploit for inference. level neurons thus are sensitive to larger-scale patterns of correlation between subsets of lower-level features. But such larger-scale statistical regularities may not be most Iterative algorithms can leverage generative models for efficiently captured by a set of larger-scale building blocks. inference Instead, they may be more compactly captured by local association rules. Consider, for instance, the problem of contour detection. Many combinations of local edges in an Perceptual inference aims to converge on a set of hypotheses image can form a continuous contour. The resulting space that best explain the sensory data. Typically, a hypothesis is of contours may be too complex to be efficiently represented considered to be a good explanation if it is consistent with with larger-scale templates. What all these contours have in both our prior knowledge and the sensory data. 
A gener- common, however, is that they consist of pairs of edges that ative model is a model of the joint distribution of latent are locally contiguous, with sharper angles occurring with causes and sensory data. Generative models can powerfully lower probability. Thus, the criteria for ’contour-ness’ may constrain perceptual inference because they capture prior be compactly expressed by a set of local association rules: knowledge about the world. In machine learning, defining 83,84 these edges go together; those do not . Contours may generative models enables us to express and exploit what then be pieced together by repeatedly applying the same we know about the domain. A wide range of inference local association rules. Those edge pairs which are most algorithms can be used to compute posterior distributions clearly connected would be identified in early iterations. over variables of interest, given observed variables. The Later inferences can benefit from the context provided by algorithms include variational inference, message passing, and Markov Chain Monte Carlo sampling, all of which earlier inferences, enabling the process to recognize conti- nuity even where it is less locally apparent. require iterative computation. This insight has inspired network models of visual In this section, we focus on a particular approach to lever- inference that implement local association rules through aging generative models in visual inference, in which the lateral connections, to aid contour integration and other joint distribution p(x, z) of the image x and the latents z 10 is factorized as p(x, z) = p(z) · p(x|z), which we refer to either of the categories. An ideal observer should evaluate as the top-down factorization. The architecture contains the likelihood for each hypothesis and adjudicate according components that model p(x|z) and predict the image from to their ratio . 
A feedforward network may instead latch the latents (or more generally lower-level latent representa- on to a few highly discriminative, but subtle image features tions from higher-level latent representations). Compared that don’t explain much and may not generalize to images 93,95 to the alternative factorization p(x, z) = p(x) · p(z|x), the from a different data set . In contrast, visual features top-down factorization has the potential advantage that the that are important for generating or reconstructing images model operates in the causal direction, matching the causal of a given class may be more likely to generalize to other process in the world that generated the image. The top- examples of the same category. In support of this intuition, down model predicts what visual input is likely to result two novel RNN architectures that employ generative models from a scene that has the hypothesized properties. This is for inference were found to be more robust to adversarial 96,97 somewhat similar to the graphics engine of a video game perturbations . Generative inference networks were also or image rendering software. This top-down model can be shown to better align with human perception, compared implemented via feedback connections that translate higher- to discriminative models, when presented with controversial level hypotheses in the network to representations at a lower stimuli – images synthesized to evoke strongly conflicting level of abstraction. classifications from different models . Despite these promising developments, generative Using generative models implemented with top-down inference remains rare in visual FNN models. The predictions for inference is known as analysis-by-synthesis exceptions mentioned above are rather simple networks – an approach that has a long history in theories of 30,32,51 trained on easy classifications problems, and are not (yet) perception . 
Arguably, the goal of perceptual competitive with state-of-the-art performance on more inference, by definition, is to reason back from effects challenging computer vision benchmarks. Within compu- (sensory data) to their causes (unobserved variables of tational neuroscience, by contrast, generative feedback interest), and thus invert the process that generated the connections appear in many network models of visual effects. The crucial question, however, is whether the causal 30,32 inference. Prominent examples are predictive coding process is explicitly represented in the inference algorithm. and hierarchical Bayesian inference . However, these The alternative, which can be achieved with feedforward models have not had much success in explaining visual inference, is to directly approximate the inverse, without inference beyond its earliest stages. A notable exception is ever making predictions in the causal direction. The success work by Wen et al. , which shows that extending super- of the feedforward approach then depends on how well the vised convolutional FNNs with the recurrent dynamics of inverse can be approximated by a fixed mapping of inputs predictive coding can improve classification performance. to hypotheses. To iteratively invert the causal process, The fields of computer vision and computational neuro- a neural network can evaluate the causal model for a science both stand to benefit from the development of more current hypothesis and update the hypothesis in a beneficial powerful generative inference models. direction. This process can then be repeated until conver- gence. This process of analysis by repeated synthesis may be preferable to directly approximating the inverse mapping if the causal process that generates the sensory data is easier Iteration is necessary to close the amortization gap to model than its inverse. 
In particular, the causal process may be more compactly represented, more easily learned, Iterative inference has many advantages. A drawback of more efficient to compute, and more generalizable beyond iteration, however, is that it takes time for the algorithm to the training distribution than its inverse. converge during inference. This is unattractive for animals who need to perform visual inference under time pressure. Another potential advantage of generative inference lies in robustness to variations in the input. While FNNs can It is also a challenge when training a FNN, which already requires many iterations of optimization. If each update of accurately categorize images drawn from the same distri- bution that the training images were drawn from, it does not the network’s connections additionally includes an iterative inner loop to perform inference on each training example, take much to fool them. A slight alteration imperceptible to this lengthens the time required for training. humans can cause a FNN to misclassify an image entirely, with high confidence . State-of-the-art FNNs rely more A complementary inference mechanism is amortized 92 101,102 strongly on texture than humans, who rely more on shape . inference , where a feedforward model approximates More generally, FNNs seem to ignore many image features the mapping from images to their latent causes. FNNs that are relevant to human perception . One hypothe- are eminently suited for learning complicated input-output sized reason for this is that these networks are trained to mappings. A single transformation then replaces the trajec- discriminate images, but not to generate them. Thus, any tories that would be navigated by an iterative inference visual feature that reliably discriminates categories in the algorithm. In some cases, the iterative solution and the training data will be weighted heavily in the network’s classi- best amortized mapping may be exactly equivalent. A fication decisions. 
Importantly, this weight is unrelated to linear model, for instance, can be estimated iteratively, how much variance the feature explains in the image, and by performing gradient descent on the sum of squared to the likelihood, i.e. the probability of the image given prediction errors. However, if a unique solution exists, it 11 can equivalently be found by a linear transformation that illustrates how limited resources (the fovea) can be dynam- directly maps from the data to the optimal coefficients. ically allocated (eye movements) to different portions of the evidence (the visual scene) in temporal sequence. A In general, however, amortized inference incurs some error, compared to the optimal solution that might be found sensory system limited to a finite number of neurons, thus, can multiply its resources along time to achieve a detailed through iterative optimization. This error has been called 103,104 the amortization gap . It is analogous to the poor analysis. The cycle may start with an initial rough analysis of the entire visual field, followed by fixations on locations fit that may result from buying clothes ”off the rack”, compared to a tailored version of the same garment. The likely to yield valuable information. This is an example of an essentially recurrent process whose efficiency cannot amortization gap is defined in the context of variational inference, when the iterative optimization of the varia- be emulated with a feedforward system. The internal mechanisms of visual inference are faced with qualitatively tional approximation to the posterior is replaced by a neural network that maps from the image to the parameters of the similar challenges: Just like our retinae cannot afford foveal resolution throughout the visual field, the ventral stream variational distribution. 
The resulting model suffers from cannot afford to perform all potentially relevant inferences two types of error: (1) error caused be the choice of the variational approximation (variational approximation gap) on the evidence streaming in through the optic nerve in a single feedforward sweep. Internal shifts of attention, like and (2) error caused by the model mapping from images to variational parameters (amortization gap). One recent eye movements, can sequentialize a complex computation and avoid wasting energy on portions of the evidence that study has argued that the amortization gap is often the main source of error in amortized inference models . are uninformative or irrelevant to the current goals of the animal. Amortized and iterative inference define a continuum. At Whereas the outer loop of active vision is largely about one extreme, iterative inference until convergence reaches positioning our eyes relative to the scene and bringing a solution through a trajectory of small improvements, important content into foveal vision, the inner loop of visual explicitly evaluating the quality of the current solution at inference on each glimpse is far more flexible. Beyond covert every iteration. At the other extreme, fully amortized attentional shifts that select locations, features, or objects inference takes a single leap from input to output. In for scrutiny, a recurrent network can decide what computa- between these extremes lies a space for algorithms that use intermediate numbers of steps, to approximate the tions to perform so as to most efficiently reduce uncertainty about the important parts of the scene. In a game of twenty optimal solution through a computational path that is more refined than a leap, but more efficient than full- questions, we choose a question that most reduces our remaining uncertainty at each step. The budget of twenty fledged iterative optimization. 
Models that occupy this space include explicit hybrids of iterative and amortized would not suffice if we had to decide all the questions before 104–106 seeing any answers. The visual system similarly has limited inference , as well as RNNs with arbitrary dynamics computational resources for processing a massive stream of that are trained to converge to a desired objective in a 23,107–109 evidence. It must choose what inferences to pursue on the limited number of time steps (e.g., ). basis of their computational cost and uncertainty-reducing 113–115 benefit as it forages for insight . Recurrence is required for active vision CLOSING THE GAP BETWEEN BIOLOGICAL Vision is an active exploratory process. Our eye movements AND ARTIFICIAL VISION scan the scene through a sequence of well-chosen fixations that bring objects of interest into foveal vision. Moving We have reviewed a number of advantages that recurrence our heads and our bodies enables us to bring entirely new can bring to neural networks for visual inference. Going parts of the scene into view, and closer for inspection at high forward, neural network models of vision should incorporate resolution. Active control of our eyes, heads, and bodies can recurrence; not just to better understand visual inference also help disambiguate 3D structure as fixation on points in the brain, but also to improve its implementation in at different depths changes binocular disparity, and head machines. and body movements create motion parallax. Active vision involves a recurrent cycle of sensory processing and muscle control, a cycle that runs through the environment. Recurrence already improves performance on Our focus here has been on the internal computational challenging visual tasks functions of recurrent processing, and active vision has been 110–112 reviewed elsewhere . 
However, it is important to note that the internal recurrent processes of visual inference from Efforts in this direction are already underway, and turning a single glimpse are embedded within the larger recurrent up promising results. Some of this work has been described process of active visual exploration. Active vision provides in previous sections, such as the use of lateral connec- 86–88 not just the larger behavioral context of visual inference. tions to impose local association rules and generative It also provides a powerful illustration of the fundamental inference for more robust performance outside the training 96,97 advantages that recurrent algorithms offer in general. It distribution . Several other recent findings are worth 12 highlighting here, as they have shown improved performance realism could refer to the real-world constraints faced by on visual tasks, better approximations to biological vision, either biological or artificial visual systems. Future studies or both, through recurrent computations. should compare RNN and FNN implementations for the same visual inference task, while matching the complexity In particular, several studies have found that recurrence of the models in a meaningful way. Setting a realistic is required in order to explain or improve visual inference budget of units, connections, and computational operations in challenging settings. Kar and colleagues identified a is one important approach. To understand the computa- set of ’challenge images’ that required recurrent processing tional differences between RNN and FNN solutions, it is in order to be accurately recognized. A feedforward also interesting to (1) match the parameter count (number FNN struggled to interpret these images, whereas macaque of connection weights that must be learned and stored), monkeys recognized them as accurately as a set of control which requires granting the FNN larger feature kernels, images. 
Challenge images were associated with longer more feature maps per layer, or more layers, or (2) match processing times in the macaque inferior temporal (IT) the computational graph, which equates the distribution of cortex, consistent with recurrent computations. Neural path lengths from input to output and all other statistics responses in IT for images that took longer were well of the graph, but grants the FNN a much larger number of accounted for by a brain-inspired RNN model. In a parameters . different study , this same recurrent architecture was found to account for behavior, and neural data from macaque visual cortex, in object recognition tasks, while also achieving good performance on an important computer Freeing ourselves from the feedforward framework vision benchmark (ImageNet ). In human visual cortex, recurrent interactions were also found to be crucial to Deep feedforward neural networks constitute an essential model the neural dynamics underlying object recognition, building block for visual inference, but they are not the as measured through magnetoencephalography (MEG) . whole story. The missing element, recurrent dynamics, One prominent challenge to visual inference is posed is central to a range of alternative conceptions of visual 31,110–112,129,130 by partial occlusions, which hide part of a target object inference that have been proposed . These from view. In two recent studies, recurrent architec- ideas have a long history, they are essential to under- tures were shown to be more robust to occlusions than standing biological vision, and they have great potential for 89,119 their feedforward counterparts . Interestingly, in both engineering, especially in the context of modern hardware human observers and in an RNN model, object recognition and software. 
The promise of active vision and recurrent under occlusion was impaired by backward masking (the visual inference is, in fact, boosted by the power of presentation of a meaningless noise image, shortly after feedforward networks. 13,15,120 a target stimulus, to disrupt recurrent processing ). However, the beauty, power, and simplicity of feedforward Neural responses to partially occluded shapes in macaque neural networks also makes it difficult to engage and visual cortex are also consistent with recurrent processing, develop the space of recurrent neural network algorithms and were well explained by a predictive coding model in for vision. The feedforward framework, embellished by which prefrontal cortex provide a feedback signal to visual recurrent processes that serve auxiliary and modulatory 121,122 area V4 . functions like normalization and attention, enables compu- Another challenge for human perception is crowding, tational neuroscientists to hold on to the idea of a hierarchy which occurs when the detailed perception of a target of feature detectors. This idea might not be entirely stimulus is disrupted by nearby flanker stimuli . In mistaken. However, it is likely to be severely incomplete certain instances, the target stimulus can be released and ultimately limiting. from crowding if further flankers are added that form The insight that any finite-time recurrent network can a larger, coherent structure with the original flankers. be unrolled compounds the problem by suggesting that the This uncrowding effect may be due to the flankers being feedforward framework is essentially complete. More practi- ’explained away’, thus reducing their interference with the cally, the fact that we train RNNs by unrolling them for 124,125 126 target representation . Recent work has shown that finite time steps might in some ways impede our progress. 
both effects can be explained by architectures known as FNNs are usually trained by stochastic gradient descent 127,128 Capsule Nets , which include recurrent information using the backpropagation algorithm. This method retraces routing mechanisms that may be similar to perceptual in reverse the computational steps that led to the response grouping and segmentation processes in the visual cortex. in the output layer, so as to estimate the influence that Note that, in all of these cases, it may be possible to each connection in the network had on the response. Each develop a feedforward architecture that performs the task connection weight is then adjusted, to bring the network equally well or better. Trivially, and as we discussed previ- output closer to a desired output. The deeper the network, ously, a successful recurrent architecture can always be the longer the computational path that needs to be retraced. unrolled (for a finite number of time steps) into a deep RNNs for visual inference typically are trained through feedforward network with many more learnable connections. a variation on this method, known as backpropagation However, a realistic recurrent model, when unrolled, may through time (BPTT) . To retrace computations in map onto an unrealistic feedforward model (Fig. 2), where reverse through cycles, the RNN is unrolled along time, so 13 as to convert it into a feedforward network whose depth computational path to this state. Marino et al. recently depends on the number of time steps as shown in Fig. 1b- proposed iterative amortized inference, training inference d. This enables the RNN to be trained like an FNN. networks to have recurrent dynamics that improve the BPTT is attractive for enabling us to train RNNs like network’s hypotheses in each iteration, without constraining FNNs on arbitrary objectives. 
When it comes to learning these dynamics to a particular form (such as predictive recurrent dynamics, however, BPTT strictly optimizes the coding). More generally, RNNs whose dynamics converge output at the specific time points evaluated by the objective to a steady state can be optimized through variations on 136–138 (e.g., the output after exactly N steps). Outside of this time an algorithm known as recurrent backpropagation , window, there is no guarantee that the network’s response which avoids retracing the computational graph through will be well-behaved. The RNN might reach the desired time. However, it is often difficult to design RNNs such objective at the desired time, but diverge immediately after. that their dynamics converge to a steady state (within Ideally, we would like a visual RNN presented with a stable the time window for which the model is trained), while image to converge to an attractor that represents the image maintaining expressivity (the ability of the model to learn a and behave stably for arbitrary lengths of time. This would wide range of functions). This challenge is addressed by the be consistent with iterative optimization, in which each recently developed contractor recurrent backpropagation step improves the network’s approximation to its objective. method , which introduces a mathematical penalty that While it is not impossible for BPTT to give rise to such can be imposed while training any RNN, to encourage it to dynamics, it does not specifically favor them. learn convergent dynamics. From a theory perspective, BPTT is limiting because it shackles RNNs to the feedforward framework, in which the goal is still to map inputs to outputs, rather than to discover useful dynamics. 
From a practical and implementa- GOING FORWARD, IN CIRCLES tional perspective, BPTT is computationally cumbersome, as every additional recurrent time step extends the compu- tational path that must be retraced in order to update We started this review with the puzzling observation that, the connections. This complication also renders BPTT whereas biological vision is implemented in a profoundly biologically implausible. Although the case for backpropa- recurrent neural architecture, the most successful neural gation as potentially biologically plausible has recently been network models of vision to date are feedforward. We have 132–134 strengthened , its extension through time is difficult argued, theoretically and empirically, that vision models will to reconcile with biology or implement efficiently in eventually converge to their biological roots and implement a finite engineered system for online learning – precisely more powerful recurrent solutions. This is an appealing because it requires unrolling and keeping track of separate prospect, as it suggests that neuroscientists and engineers copies of each weight as computational cycles are retraced can continue to work synergistically, to make progress on in reverse. common challenges. After all, visual inference, and intel- Given these drawbacks, we speculate that a true ligence more generally, were solved once before, and so breakthrough in recurrent vision models will require a discovering nature’s solutions should go hand in hand with training regime that does not rely on BPTT. Rather than building artificial ones. optimizing an RNN’s state in a finite time window, future RNN training methods might directly target the network’s dynamics, or the states that those dynamics are encouraged ACKNOWLEDGEMENTS to converge to. This approach has some history in RNN models of vision. 
Predictive coding models, for instance, are designed with dynamics that explicitly implement We thank Samuel Lippl, Heiko Sch¨utt, Andrew Zaharia, Tal iterative optimization. Such models can update their Golan and Benjamin Peters for detailed comments on a draft connections through learning rules that require only the of this paper. This work was supported by a Rubicon grant converged network state as input , rather than the entire from the Dutch Research Council (to R.S.v.B.). 1 3 V. A. Lamme, P. R. Roelfsema, The distinct modes of A. Angelucci, P. C. Bressloff, Contribution of feedforward, vision offered by feedforward and recurrent processing, lateral and feedback connections to the classical receptive Trends in Neurosciences 23 (11) (2000) 571–579. field center and extra-classical receptive field surround of doi:10.1016/S0166-2236(00)01657-X. primate V1 neurons, in: Progress in Brain Research, Vol. 154, G. Kreiman, T. Serre, Beyond the feedforward sweep: 2006, pp. 93–120. doi:10.1016/S0079-6123(06)54005-1. feedback computations in the visual cortex, Annals of the J. C. Anderson, R. J. Douglas, K. A. C. Martin, J. C. New York Academy of Sciences 1464 (1) (2020) 222–241. Nelson, Synaptic output of physiologically identified doi:10.1111/nyas.14320. spiny stellate neurons in cat visual cortex, The Journal of Comparative Neurology 341 (1) (1994) 16–24. 14 doi:10.1002/cne.903410103. Computer Society Conference on Computer Vision and K. A. Martin, Microcircuits in visual cortex, Current Pattern Recognition 2016-December (2016) 4873–4882. Opinion in Neurobiology 12 (4) (2002) 418–425. arXiv:1512.00596, doi:10.1109/CVPR.2016.527. doi:10.1016/S0959-4388(02)00343-4. J. Kubilius, S. Bracci, H. P. Op de Beeck, Deep Neural R. J. Douglas, K. A. Martin, Recurrent neuronal circuits Networks as a Computational Model for Human Shape Sensi- in the neocortex, Current Biology 17 (13) (2007) 496–500. tivity, PLOS Computational Biology 12 (4) (2016) e1004896. 
doi:10.1016/j.cub.2007.04.024. doi:10.1371/journal.pcbi.1004896. 7 22 D. J. Felleman, D. C. Van Essen, Distributed hierarchical N. J. Majaj, D. G. Pelli, Deep learning-Using machine processing in the primate cerebral cortex, Cerebral Cortex learning to study biological vision, Journal of Vision 18 (13) 1 (1) (1991) 1–47. doi:10.1093/cercor/1.1.1. (2018) 1–13. doi:10.1167/18.13.2. 8 23 P. A. Salin, J. Bullier, Corticocortical connec- C. J. Spoerer, T. C. Kietzmann, J. Mehrer, I. Charest, tions in the visual system: structure and function, N. Kriegeskorte, Recurrent neural networks can explain Physiological Reviews 75 (1) (1995) 107–154. flexible trading of speed and accuracy in biological vision, doi:10.1152/physrev.1995.75.1.107. PLOS Computational Biology 16 (10) (2020) e1008215. N. T. Markov, M. M. Ercsey-Ravasz, A. R. Ribeiro Gomes, doi:10.1371/journal.pcbi.1008215. C. Lamy, L. Magrou, J. Vezoli, P. Misery, A. Falchier, C. F. Cadieu, H. Hong, D. L. K. Yamins, N. Pinto, R. Quilodran, M. A. Gariel, J. Sallet, R. Gamanut, D. Ardila, E. A. Solomon, N. J. Majaj, J. J. DiCarlo, C. Huissoud, S. Clavagnier, P. Giroud, D. Sappey-Marinier, Deep Neural Networks Rival the Representation of P. Barone, C. Dehay, Z. Toroczkai, K. Knoblauch, D. C. Primate IT Cortex for Core Visual Object Recognition, Van Essen, H. Kennedy, A weighted and directed interareal PLoS Computational Biology 10 (12) (2014) e1003963. connectivity matrix for macaque cerebral cortex, Cerebral doi:10.1371/journal.pcbi.1003963. Cortex 24 (1) (2014) 17–36. doi:10.1093/cercor/bhs270. S. M. Khaligh-Razavi, N. Kriegeskorte, Deep Supervised, but R. J. Douglas, C. Koch, M. Mahowald, K. A. Not Unsupervised, Models May Explain IT Cortical Repre- Martin, H. H. Suarez, Recurrent excitation in neocor- sentation, PLoS Computational Biology 10 (11) (2014). tical circuits, Science 269 (5226) (1995) 981–985. doi:10.1371/journal.pcbi.1003915. doi:10.1126/science.7638624. U. Guclu, M. A. J. van Gerven, Deep Neural H. 
Sup`er, H. Spekreijse, V. A. Lamme, Two distinct modes Networks Reveal a Gradient in the Complexity of of sensory processing observed in monkey primary visual Neural Representations across the Ventral Stream, cortex (VI), Nature Neuroscience 4 (3) (2001) 304–310. Journal of Neuroscience 35 (27) (2015) 10005–10014. doi:10.1038/85170. doi:10.1523/JNEUROSCI.5023-14.2015. 12 27 V. Di Lollo, J. T. Enns, R. A. Rensink, Compe- N. Kriegeskorte, Deep Neural Networks: A New Framework tition for consciousness among visual events: The for Modeling Biological Vision and Brain Information psychophysics of reentrant visual processes, Journal of Exper- Processing, Annual Review of Vision Science 1 (1) (2015) imental Psychology: General 129 (4) (2000) 481–507. 417–446. doi:10.1146/annurev-vision-082114-035447. doi:10.1037/0096-3445.129.4.481. S. R. Kheradpisheh, M. Ghodrati, M. Ganjtabesh, V. A. Lamme, K. Zipser, H. Spekreijse, Masking interrupts T. Masquelier, Deep Networks Can Resemble Human Feed- figure-ground signals in V1, Journal of Vision 1 (3) (2001) forward Vision in Invariant Object Recognition, Scientific 1044–1053. doi:10.1167/1.3.32. Reports 6 (1) (2016) 32672. doi:10.1038/srep32672. 14 29 K. Heinen, J. Jolij, V. A. Lamme, Figure-ground segregation M. Schrimpf, J. Kubilius, H. Hong, N. J. Majaj, R. Rajal- requires two distinct periods of activity in VI: A transcranial ingham, E. B. Issa, K. Kar, P. Bashivan, J. Prescott-Roy, magnetic stimulation study, NeuroReport 16 (13) (2005) F. Geiger, K. Schmidt, D. L. K. Yamins, J. J. DiCarlo, Brain- 1483–1487. doi:10.1097/01.wnr.0000175611.26485.c8. score: Which artificial neural network for object recognition J. J. Fahrenfort, H. S. Scholte, V. A. Lamme, Masking is most brain-like?, bioRxiv (2020). doi:10.1101/407007. disrupts reentrant processing in human visual cortex, R. P. N. Rao, D. H. Ballard, Predictive coding in the visual Journal of Cognitive Neuroscience 19 (9) (2007) 1488–1497. 
cortex: a functional interpretation of some extra-classical doi:10.1162/jocn.2007.19.9.1488. receptive-field effects., Nature neuroscience 2 (1) (1999) 79– Y. Lecun, Y. Bengio, G. Hinton, Deep learning, Nature 87. doi:10.1038/4580. 521 (7553) (2015) 436–444. doi:10.1038/nature14539. A. Yuille, D. Kersten, Vision as Bayesian inference: analysis J. Schmidhuber, Deep learning in neural networks: by synthesis?, Trends in Cognitive Sciences 10 (7) (2006) An overview, Neural Networks 61 (2015) 85–117. 301–308. doi:10.1016/j.tics.2006.05.002. doi:10.1016/j.neunet.2014.09.003. K. Friston, S. Kiebel, Predictive coding under the free- K. He, X. Zhang, S. Ren, J. Sun, Delving Deep into Recti- energy principle, Philosophical Transactions of the Royal Society B: Biological Sciences 364 (1521) (2009) 1211–1221. fiers: Surpassing Human-Level Performance on ImageNet Classification, in: 2015 IEEE International Conference on doi:10.1098/rstb.2008.0300. Computer Vision (ICCV), Vol. 2015 Inter, IEEE, 2015, pp. S. J. D. Prince, Computer Vision: Models, Learning and 1026–1034. doi:10.1109/ICCV.2015.123. Inference, Cambridge University Press, Cambridge, 2012. K. He, X. Zhang, S. Ren, J. Sun, Deep residual doi:10.1017/CBO9780511996504. learning for image recognition, Proceedings of the IEEE J. J. DiCarlo, D. Zoccolan, N. C. Rust, How does the brain Computer Society Conference on Computer Vision and solve visual object recognition?, Neuron 73 (3) (2012) 415– Pattern Recognition 2016-December (2016) 770–778. 434. doi:10.1016/j.neuron.2012.01.010. doi:10.1109/CVPR.2016.90. M. Carandini, D. J. Heeger, Normalization as a canonical I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, neural computation, Nature Reviews Neuroscience 13 (1) E. Brossard, The MegaFace benchmark: 1 million (2012) 51–62. doi:10.1038/nrn3136. faces for recognition at scale, Proceedings of the IEEE 15 36 54 R. Desimone, J. Duncan, Neural Mechanisms of Selective P. Mamassian, R. 
Goutcher, Prior knowledge on the Visual Attention, Annual Review of Neuroscience 18 (1) illumination position, Cognition 81 (1) (2001) 1–9. (1995) 193–222. doi:10.1146/annurev.neuro.18.1.193. doi:10.1016/S0010-0277(01)00116-0. 37 55 S. Kastner, L. G. Ungerleider, Mechanisms of A. R. Girshick, M. S. Landy, E. P. Simoncelli, Cardinal rules: Visual Attention in the Human Cortex, Annual visual orientation perception reflects knowledge of environ- Review of Neuroscience 23 (1) (2000) 315–341. mental statistics., Nature neuroscience 14 (7) (2011) 926– doi:10.1146/annurev.neuro.23.1.315. 32. doi:10.1038/nn.2831. 38 56 J. H. Maunsell, S. Treue, Feature-based attention in visual U. Hasson, E. Yang, I. Vallines, D. J. Heeger, N. Rubin, cortex, Trends in Neurosciences 29 (6) (2006) 317–322. A Hierarchy of Temporal Receptive Windows in Human doi:10.1016/j.tins.2006.04.001. Cortex, Journal of Neuroscience 28 (10) (2008) 2539–2550. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, doi:10.1523/JNEUROSCI.5487-07.2008. A. N. Gomez, L. u. Kaiser, I. Polosukhin, Attention is J. D. Murray, A. Bernacchia, D. J. Freedman, R. Romo, J. D. all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, Wallis, X. Cai, C. Padoa-Schioppa, T. Pasternak, H. Seo, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), D. Lee, X.-J. Wang, A hierarchy of intrinsic timescales across Advances in Neural Information Processing Systems 30, primate cortex, Nature Neuroscience 17 (12) (2014) 1661– Curran Associates, Inc., 2017, pp. 5998–6008. doi:10.1038/nn.3862. 40 58 Q. Liao, T. Poggio, Bridging the Gaps Between Residual I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence Learning, Recurrent Neural Networks and Visual Cortex (047) learning with neural networks, Advances in Neural Infor- (2016) 1–16. arXiv:1604.03640. mation Processing Systems 4 (January) (2014) 3104–3112. S. Jastrz¸ebski, D. Arpit, N. Ballas, V. Verma, T. Che, arXiv:1409.3215. Y. 
Bengio, Residual Connections Encourage Iterative R. E. Kalman, A New Approach to Linear Filtering and Inference (2017). arXiv:1710.04773. Prediction Problems, Journal of Basic Engineering 82 (1) K. Greff, R. K. Srivastava, J. Schmidhuber, Highway and (1960) 35–45. doi:10.1115/1.3662552. Residual Networks learn Unrolled Iterative Estimation, 5th D. Wolpert, Z. Ghahramani, M. Jordan, An internal model International Conference on Learning Representations, ICLR for sensorimotor integration, Science 269 (5232) (1995) 2017 - Conference Track Proceedings (2015) (2016) 1–14. 1880–1882. doi:10.1126/science.7569931. arXiv:1612.07771. R. P. N. Rao, D. H. Ballard, Dynamic model of visual recog- G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, nition predicts neural response properties in the visual cortex, Densely connected convolutional networks, Proceedings - Neural computation 9 (November 1995) (1997) 721–763. 30th IEEE Conference on Computer Vision and Pattern doi:10.1162/neco.1997.9.4.721. Recognition, CVPR 2017 2017-January (2017) 2261–2269. R. P. N. Rao, Bayesian computation in recurrent neural doi:10.1109/CVPR.2017.243. circuits., Neural computation 16 (1) (2004) 1–38. 44 63 P. Dayan, L. F. Abbott, Theoretical Neuroscience, MIT S. Den`eve, J.-R. Duhamel, A. Pouget, Optimal Press, Cambridge, MA, 2001. Sensorimotor Integration in Recurrent Cortical M. S. Advani, A. M. Saxe, High-dimensional dynamics Networks: A Neural Implementation of Kalman Filters, of generalization error in neural networks (2017) 1– Journal of Neuroscience 27 (21) (2007) 5744–5756. 32arXiv:1710.03667. doi:10.1523/JNEUROSCI.3985-06.2007. 46 64 M. Belkin, D. Hsu, S. Ma, S. Mandal, Reconciling modern J.-J. Orban de Xivry, S. Coppe, G. Blohm, P. 
Lefevre, machine-learning practice and the classical bias–variance Kalman Filtering Naturally Accounts for Visually trade-off, Proceedings of the National Academy of Sciences Guided and Predictive Smooth Pursuit Dynamics, of the United States of America 116 (32) (2019) 15849– Journal of Neuroscience 33 (44) (2013) 17301–17313. 15854. doi:10.1073/pnas.1903070116. doi:10.1523/JNEUROSCI.2321-13.2013. 47 65 P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, O.-S. Kwon, D. Tadin, D. C. Knill, Unifying account of I. Sutskever, Deep Double Descent: Where Bigger Models visual motion and position perception, Proceedings of the and More Data Hurt (2019). arXiv:1912.02292. National Academy of Sciences 112 (26) (2015) 8142–8147. W. Rawat, Z. Wang, Deep Convolutional Neural doi:10.1073/pnas.1500361112. Networks for Image Classification: A Comprehensive R. S. van Bergen, J. F. M. Jehee, Probabilistic Represen- Review, Neural Computation 29 (9) (2017) 2352–2449. tation in Human Visual Cortex Reflects Uncertainty in Serial doi:10.1162/neco_a_00990. Decisions, The Journal of neuroscience : the official journal C. D. Gilbert, W. Li, Top-down influences on visual of the Society for Neuroscience 39 (41) (2019) 8164–8176. processing, Nature Reviews Neuroscience 14 (5) (2013) 350– doi:10.1523/JNEUROSCI.3212-18.2019. 363. doi:10.1038/nrn3476. A. Graves, A.-R. Mohamed, G. Hinton, Speech recog- C. Summerfield, T. Egner, Expectation (and attention) in nition with deep recurrent neural networks, in: 2013 visual cognition, Trends in Cognitive Sciences 13 (9) (2009) IEEE International Conference on Acoustics, Speech and 403–409. doi:10.1016/j.tics.2009.06.003. Signal Processing, no. 3, IEEE, 2013, pp. 6645–6649. H. von Helmholtz, Handbuch der physiologischen Optik, doi:10.1109/ICASSP.2013.6638947. Dover (English translation), New York, 1860/1962. H. Sak, A. Senior, F. Beaufays, Long Short-Term Memory Y. Weiss, E. P. Simoncelli, E. H. 
Adelson, Motion illusions Based Recurrent Neural Network Architectures for Large as optimal percepts, Nature Neuroscience 5 (6) (2002) 598– Vocabulary Speech Recognition (2014). arXiv:1402.1128. 604. doi:10.1038/nn858. D. Bahdanau, K. Cho, Y. Bengio, Neural Machine Trans- A. A. Stocker, E. P. Simoncelli, Noise characteristics and lation by Jointly Learning to Align and Translate, 3rd Inter- prior expectations in human visual speed perception, Nature national Conference on Learning Representations, ICLR 2015 Neuroscience 9 (4) (2006) 578–585. doi:10.1038/nn1669. - Conference Track Proceedings (2014). arXiv:1409.0473. 16 K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, breaks text-based CAPTCHAs, Science 358 (6368) (2017). F. Bougares, H. Schwenk, Y. Bengio, Learning Phrase doi:10.1126/science.aag2612. Representations using RNN Encoder-Decoder for Statistical S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Machine Translation, Journal of Clinical Microbiology 28 (4) Z. Su, D. Du, C. Huang, P. H. Torr, Conditional random fields (2014) 828–829. arXiv:1406.1078. as recurrent neural networks, Proceedings of the IEEE Inter- M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, national Conference on Computer Vision 2015 Inter (2015) S. Chopra, Video (language) modeling: a baseline for gener- 1529–1537. doi:10.1109/ICCV.2015.179. ative models of natural videos (2014). arXiv:1412.6604. C. J. Spoerer, P. McClure, N. Kriegeskorte, Recurrent convo- N. Srivastava, E. Mansimov, R. Salakhutdinov, Unsupervised lutional neural networks: A better model of biological object Learning of Video Representations using LSTMs (2015). recognition, Frontiers in Psychology 8 (SEP) (2017) 1–14. arXiv:1502.04681. doi:10.3389/fpsyg.2017.01551. 73 90 W. Lotter, G. Kreiman, D. Cox, Deep Predictive Coding N. Montobbio, L. Bonnasse-Gahot, G. Citti, A. 
Sarti, Networks for Video Prediction and Unsupervised Learning KerCNNs: biologically inspired lateral connections for classi- arXiv:1605.08104. fication of corrupted images (2019). arXiv:1910.08336. (2016). 74 91 A. Pouget, J. Beck, W. J. Ma, P. Latham, Probabilistic C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, brains: knowns and unknowns., Nature neuroscience 16 (9) I. Goodfellow, R. Fergus, Intriguing properties of neural (2013) 1170–8. doi:10.1038/nn.3495. networks, 2nd International Conference on Learning Repre- W. J. Ma, M. Jazayeri, Neural Coding of Uncertainty and sentations, ICLR 2014 - Conference Track Proceedings Probability., Annual Review of Neuroscience 37 (2014) 205– (2014). arXiv:1312.6199. 220. doi:10.1146/annurev-neuro-071013-014017. R. Geirhos, C. Michaelis, F. A. Wichmann, P. Rubisch, G. Orb´an, P. Berkes, J. Fiser, M. Lengyel, Neural M. Bethge, W. Brendel, Imagenet-trained CNNs are biased Variability and Sampling-Based Probabilistic Representa- towards texture; increasing shape bias improves accuracy and tions in the Visual Cortex, Neuron 92 (2) (2016) 530–543. robustness, 7th International Conference on Learning Repre- doi:10.1016/j.neuron.2016.09.038. sentations, ICLR 2019 (c) (2019) 1–22. arXiv:1811.12231. 77 93 M. Boerlin, C. K. Machens, S. Den`eve, Predictive J. H. Jacobsen, J. Behrmann, R. Zemel, M. Bethge, Coding of Dynamical Variables in Balanced Spiking Excessive invariance causes adversarial vulnerability, 7th Networks, PLoS Computational Biology 9 (11) (2013). International Conference on Learning Representations, ICLR doi:10.1371/journal.pcbi.1003258. 2019 (2019). arXiv:1811.00401. 78 94 D. G. Barrett, S. Den`eve, C. K. Machens, Optimal compen- J. Neyman, E. S. Pearson, IX. On the problem of the most sation for neuron loss, eLife 5 (e12454) (2016) 1–36. efficient tests of statistical hypotheses, Philosophical Trans- doi:10.7554/eLife.12454. actions of the Royal Society of London. Series A, Containing P. H. Schiller, B. L. 
Finlay, S. F. Volman, Short-term response Papers of a Mathematical or Physical Character 231 (694- variability of monkey striate neurons., Brain research 105 (2) 706) (1933) 289–337. doi:10.1098/rsta.1933.0009. (1976) 347–9. A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, A. Dean, The variability of discharge of simple cells in the cat A. Madry, Adversarial Examples Are Not Bugs, They Are striate cortex, Experimental Brain Research 44 (4) (1981). Features (2019). arXiv:1905.02175. doi:10.1007/BF00238837. Y. Li, J. Bradshaw, Y. Sharma, Are generative classi- Z. F. Mainen, T. J. Sejnowski, Reliability of spike timing in fiers more robust to adversarial attacks?, 36th International neocortical neurons., Science 268 (5216) (1995) 1503–6. Conference on Machine Learning, ICML 2019 2019-June A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet (2019) 6754–6783. arXiv:1802.06552. Classification with Deep Convolutional Neural Networks, L. Schott, J. Rauber, M. Bethge, W. Brendel, Towards the Advances In Neural Information Processing Systems (2012). first adversarially robust neural network model on MNIST, arXiv:1102.0183. Iclr 3 (2018) 1–16. arXiv:1805.09190. 83 98 D. J. Field, A. Hayes, R. F. Hess, Contour integration T. Golan, P. C. Raju, N. Kriegeskorte, Controversial stimuli: by the human visual system: evidence for a local ”associ- pitting neural networks against each other as models of ation field”., Vision research 33 (2) (1993) 173–93. human recognition (2019). arXiv:1911.09288. doi:10.1016/0042-6989(93)90156-q. T. S. Lee, D. Mumford, Hierarchical Bayesian inference in W. S. Geisler, J. S. Perry, B. J. Super, D. P. Gallogly, the visual cortex., Journal of the Optical Society of America. Edge co-occurrence in natural images predicts contour A, Optics, image science, and vision 20 (7) (2003) 1434–48. grouping performance, Vision Research 41 (6) (2001) 711– H. Wen, K. Han, J. Shi, Y. Zhang, E. Culurciello, Z. Liu, 724. doi:10.1016/S0042-6989(00)00277-7. 
Deep Predictive Coding Network for Object Recognition P. R. Roelfsema, Cortical algorithms for perceptual grouping, (2018). arXiv:1802.04762. Annual Review of Neuroscience 29 (1) (2006) 203–227. V. Srikumar, G. Kundu, D. Roth, On amortizing inference doi:10.1146/annurev.neuro.29.051605.112939. cost for structured prediction, EMNLP-CoNLL 2012 - 2012 D. Linsley, J. Kim, V. Veerabadran, C. Windolf, T. Serre, Joint Conference on Empirical Methods in Natural Language Learning long-range spatial dependencies with horizontal Processing and Computational Natural Language Learning, gated recurrent units, in: S. Bengio, H. Wallach, Proceedings of the Conference (July) (2012) 1114–1124. H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett A. Stuhlmu¨ller, J. Taylor, N. Goodman, Learning stochastic (Eds.), Advances in Neural Information Processing Systems inverses, in: C. J. C. Burges, L. Bottou, M. Welling, 31, Curran Associates, Inc., 2018, pp. 152–164. Z. Ghahramani, K. Q. Weinberger (Eds.), Advances in Neural D. George, W. Lehrach, K. Kansky, M. L´azaro-Gredilla, Information Processing Systems, Vol. 26, Curran Associates, C. Laan, B. Marthi, X. Lou, Z. Meng, Y. Liu, Inc., 2013, pp. 3048–3056. H. Wang, A. Lavin, D. S. Phoenix, A generative C. Cremer, X. Li, D. Duvenaud, Inference suboptimality vision model that trains with high data efficiency and in variational autoencoders, 35th International Conference 17 on Machine Learning, ICML 2018 3 (2018) 1749–1760. is required to capture the representational dynamics of arXiv:1801.03558. the human visual system, Proceedings of the National J. Marino, Y. Yue, S. Mandt, Iterative amortized inference, Academy of Sciences 116 (43) (2019) 201905544. 35th International Conference on Machine Learning, ICML doi:10.1073/pnas.1905544116. 2018 8 (2018) 5444–5462. arXiv:1807.09356. H. Tang, M. Schrimpf, W. Lotter, C. Moerman, R. D. Hjelm, K. Cho, J. Chung, R. Salakhutdinov, A. Paredes, J. O. Caro, W. Hardesty, D. Cox, G. Kreiman, V. 
Calhoun, N. Jojic, Iterative refinement of the approximate Recurrent computations for visual pattern completion, posterior for directed belief networks, Advances in Neural Proceedings of the National Academy of Sciences of the Information Processing Systems (Nips 2016) (2016) 4698– United States of America 115 (35) (2018) 8835–8840. 4706. arXiv:1511.06382. doi:10.1073/pnas.1719397115. 106 120 R. G. Krishnan, D. Liang, M. D. Hoffman, On the J. T. Enns, V. Di Lollo, What’s new in visual masking?, challenges of learning with inference networks on sparse, Trends in Cognitive Sciences 4 (9) (2000) 345–352. high-dimensional data, International Conference on Artificial doi:10.1016/S1364-6613(00)01520-5. Intelligence and Statistics, AISTATS 2018 84 (2018) 143– A. M. Fyall, Y. El-Shamayleh, H. Choi, E. Shea-Brown, 151. arXiv:1710.06085. A. Pasupathy, Dynamic representation of partially occluded objects in primate prefrontal and visual cortex, eLife 6 (2017) M. Liang, X. Hu, Recurrent convolutional neural network for object recognition, Proceedings of the IEEE 1–25. doi:10.7554/eLife.25784. Computer Society Conference on Computer Vision and H. Choi, A. Pasupathy, E. Shea-Brown, Predictive Coding Pattern Recognition 07-12-June (2015) 3367–3375. in Area V4: Dynamic Shape Discrimination under Partial doi:10.1109/CVPR.2015.7298958. Occlusion, Neural Computation 30 (5) (2018) 1209–1257. K. Kar, J. Kubilius, K. Schmidt, E. B. Issa, J. J. doi:10.1162/neco_a_01072. DiCarlo, Evidence that recurrent circuits are critical to D. M. Levi, Crowding—An essential bottleneck for object the ventral stream’s execution of core object recognition recognition: A mini-review, Vision Research 48 (5) (2008) behavior, Nature Neuroscience 22 (6) (2019) 974–983. 635–654. doi:10.1016/j.visres.2007.12.009. doi:10.1038/s41593-019-0392-5. M. Manassi, B. Sayim, M. H. Herzog, Grouping, pooling, and A. Nayebi, D. Bear, J. Kubilius, K. Kar, S. 
Ganguli, when bigger is better in visual crowding, Journal of Vision D. Sussillo, J. J. DiCarlo, D. L. Yamins, Task-driven 12 (10) (2012) 13–13. doi:10.1167/12.10.13. convolutional recurrent models of the visual system, M. Manassi, S. Lonchampt, A. Clarke, M. H. Herzog, What Advances in Neural Information Processing Systems 2018- crowding can tell us about object representations, Journal of Decem (NeurIPS) (2018) 5290–5301. Vision 16 (3) (2016) 35. doi:10.1167/16.3.35. 110 126 D. H. Ballard, Animate vision, Artificial Intelligence 48 (1) A. Doerig, A. Bornet, O. Choung, M. Herzog, Crowding (1991) 57–86. doi:10.1016/0004-3702(91)90080-4. reveals fundamental differences in local vs. global processing J. M. Findlay, I. D. Gilchrist, Active in humans and machines, Vision Research 167 (August 2019) Vision, Oxford University Press, 2003. (2020) 39–45. doi:10.1016/j.visres.2019.12.006. doi:10.1093/acprof:oso/9780198524793.001.0001. S. Sabour, N. Frosst, G. E. Hinton, Dynamic routing R. Bajcsy, Y. Aloimonos, J. K. Tsotsos, Revisiting active between capsules, in: I. Guyon, U. V. Luxburg, S. Bengio, perception, Autonomous Robots 42 (2) (2018) 177–196. H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), doi:10.1007/s10514-017-9615-3. Advances in Neural Information Processing Systems 30, S. J. Russell, Rationality and intelligence, Curran Associates, Inc., 2017, pp. 3856–3866. Artificial Intelligence 94 (1-2) (1997) 57–77. S. Sabour, N. Frosst, G. E. Hinton, Matrix capsules with EM doi:10.1016/S0004-3702(97)00026-X. routing, Iclr 2018 (2011) (2018) 1–12. arXiv:1710.09829. 114 129 S. J. Gershman, E. J. Horvitz, J. B. Tenenbaum, Compu- J. K. O’Regan, A. No¨e, A sensorimotor account of vision and tational rationality: A converging paradigm for intelligence visual consciousness, Behavioral and Brain Sciences 24 (5) in brains, minds, and machines, Science 349 (6245) (2015) (2001) 939–973. doi:10.1017/S0140525X01000115. 273–278. doi:10.1126/science.aac6076. G. 
Buzs´aki, The Brain from Inside Out, Oxford University T. L. Griffiths, F. Lieder, N. D. Goodman, Rational Use of Press, 2019. doi:10.1093/oso/9780190905385.001.0001. Cognitive Resources: Levels of Analysis Between the Compu- P. Werbos, Backpropagation through time: what it does and tational and the Algorithmic, Topics in Cognitive Science how to do it, Proceedings of the IEEE 78 (10) (1990) 1550– 7 (2) (2015) 217–229. doi:10.1111/tops.12142. 1560. doi:10.1109/5.58337. 116 132 J. Kubilius, M. Schrimpf, K. Kar, R. Rajalingham, H. Hong, J. Guerguiev, T. P. Lillicrap, B. A. Richards, Towards deep N. Majaj, E. Issa, P. Bashivan, J. Prescott-Roy, K. Schmidt, learning with segregated dendrites, eLife 6 (2017) 1–37. A. Nayebi, D. Bear, D. L. Yamins, J. J. DiCarlo, Brain- doi:10.7554/eLife.22901. like object recognition with high-performing shallow recurrent J. Sacramento, R. Ponte Costa, Y. Bengio, W. Senn, anns, in: H. Wallach, H. Larochelle, A. Beygelzimer, Dendritic cortical microcircuits approximate the backprop- F. d’Alch´e Buc, E. Fox, R. Garnett (Eds.), Advances agation algorithm, in: S. Bengio, H. Wallach, H. Larochelle, in Neural Information Processing Systems 32, Curran K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Advances Associates, Inc., 2019, pp. 12805–12816. in Neural Information Processing Systems 31, Curran J. Deng, W. Dong, R. Socher, L.-J. Li, Kai Li, Associates, Inc., 2018, pp. 8721–8732. Li Fei-Fei, ImageNet: A large-scale hierarchical image J. C. Whittington, R. Bogacz, Theories of Error Back- database, in: 2009 IEEE Conference on Computer Vision Propagation in the Brain, Trends in Cognitive Sciences 23 (3) and Pattern Recognition, IEEE, 2009, pp. 248–255. (2019) 235–250. doi:10.1016/j.tics.2018.12.005. doi:10.1109/CVPR.2009.5206848. T. P. Lillicrap, A. Santoro, Backpropagation through time T. C. Kietzmann, C. J. Spoerer, L. K. A. S¨orensen, and the brain, Current Opinion in Neurobiology 55 (2019) R. M. Cichy, O. Hauk, N. 
Kriegeskorte, Recurrence 82–89. doi:10.1016/j.conb.2019.01.011.
L. Almeida, A learning rule for asynchronous perceptrons with feedback in a combinatorial environment, Proceedings, 1st First International Conference on Neural Networks 2 (1987) 609–618.
F. J. Pineda, Generalization of back-propagation to recurrent neural networks, Physical Review Letters 59 (19) (1987) 2229–2232. doi:10.1103/PhysRevLett.59.2229.
R. Liao, Y. Xiong, E. Fetaya, L. Zhang, K. J. Yoon, X. Pitkow, R. Urtasun, R. Zemel, Reviving and improving recurrent back-propagation, 35th International Conference on Machine Learning, ICML 2018 7 (2018) 4807–4820. arXiv:1803.06396.
D. Linsley, A. K. Ashok, L. N. Govindarajan, R. Liu, T. Serre, Stable and expressive recurrent vision models (2020). arXiv:2005.11362.

Quantitative Biology arXiv (Cornell University)

Going in circles is the way forward: the role of recurrence in visual inference

Quantitative Biology, Volume 2020 (2003) – Mar 26, 2020

ISSN
0959-4388
eISSN
ARCH-3345
DOI
10.1016/j.conb.2020.11.009

Abstract

Going in circles is the way forward: the role of recurrence in visual inference

Ruben S. van Bergen (1), Nikolaus Kriegeskorte (1–4)

(1) Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY, United States
(2) Department of Psychology, Columbia University, New York, NY, United States
(3) Department of Neuroscience, Columbia University, New York, NY, United States
(4) Affiliated member, Electrical Engineering, Columbia University, New York, NY, United States

arXiv:2003.12128v3 [q-bio.NC] 16 Nov 2020

Biological visual systems exhibit abundant recurrent connectivity. State-of-the-art neural network models for visual recognition, by contrast, rely heavily or exclusively on feedforward computation. Any finite-time recurrent neural network (RNN) can be unrolled along time to yield an equivalent feedforward neural network (FNN). This important insight suggests that computational neuroscientists may not need to engage recurrent computation, and that computer-vision engineers may be limiting themselves to a special case of FNN if they build recurrent models. Here we argue, to the contrary, that FNNs are a special case of RNNs and that computational neuroscientists and engineers should engage recurrence to understand how brains and machines can (1) achieve greater and more flexible computational depth, (2) compress complex computations into limited hardware, (3) integrate priors and priorities into visual inference through expectation and attention, (4) exploit sequential dependencies in their data for better inference and prediction, and (5) leverage the power of iterative computation.

INTRODUCTION

The primate visual cortex uses a recurrent algorithm to process sensory input [1–3]. Anatomically, connectivity is cyclic. Neurons are connected in cycles within local cortical circuits [4–6]. Global inter-area connections are dense and mostly bidirectional [7–9]. Physiologically, the dynamics of neural responses bear temporal signatures indicative of recurrent processing [1,10,11]. Behaviorally, visual perception can be disturbed by carefully timed interventions that coincide with the arrival of re-entrant information to a visual area [12–15]. The evidence for recurrent computation in the primate brain, thus, is unequivocal. What is less obvious, however, is why the brain uses a recurrent algorithm.

This question has recently been brought into sharper focus by the successes of deep feedforward neural network models (FNNs) [16,17]. These models now match or exceed human performance on certain visual tasks [18–20], and better predict primate recognition behavior [21–23] and neural activity [24–29] than current alternative models.

Although computer vision and computational neuroscience both have a long history of recurrent models [30–33], feedforward models have earned a dominant status in both fields. How should we account for this discrepancy between brains and models?

One answer is that the discrepancy reflects the fact that brains and computer-vision systems operate on different hardware and under different constraints on space, time, and energy. Perhaps we have come to a point at which the two fields must go their separate ways. However, this answer is unsatisfying. Computational neuroscience must still find out how visual inference works in brains. And although engineers face quantitatively different constraints when building computer-vision systems, they, too, must care about the spatial, temporal, and energetic limitations their models must operate under when deployed in, for example, a smartphone. Moreover, as long as neural network models continue to dominate computer vision, more efficient hardware implementations are likely to be more similar to biological neural networks than current implementations using conventional processors and graphics processing units (GPUs).

A second explanation for the discrepancy is that the abundance of recurrent connections in cortex belies a superficial role in neural computation. Perhaps the core computations can be performed by a feedforward network, while recurrent processing serves more auxiliary and modulatory functions, such as divisive normalization and attention [36–39]. This perspective is convenient because it enables us to hold on to the feedforward model in our minds. The auxiliary and modulatory functions let us acknowledge recurrence without fundamentally changing the way we envision the algorithm of recognition.

However, there is a third and more exciting explanation for the discrepancy between recurrent brains and feedforward models: Although feedforward computation is powerful, a recurrent algorithm provides a fundamentally superior solution to the problem of visual inference, and this algorithm is implemented in primate visual cortex. This recurrent algorithm explains how primate vision can be so efficient in terms of space, time, energy, and data, while being so rich and robust in terms of the inferences and their generalization to novel environments.

In this review, we argue for the latter possibility, discussing a range of potential computational functions of recurrence and citing the evidence suggesting that the primate brain employs them. We aim to distinguish established from more speculative, and superficial from more profound forms of recurrence, so as to clarify the most exciting directions for future research that will close the gap between models and brains.

UNROLLING A RECURRENT NETWORK

What exactly do we mean when we say that a neural network – whether biological or artificial – is recurrent rather than feedforward? This may seem obvious, but it turns out that the distinction can easily be blurred. Consider the simple network in Fig. 1a. It consists of three processing stages, arranged hierarchically, which we will refer to as areas, by analogy to cortex. Each area contains a number of neurons (real or artificial) that apply fixed operations to their input. Visual input enters in the first area, where it undergoes some transformation, the result of which is passed as input to the second area, and so forth. Information travels exclusively in one direction – the "forward" direction, from input to output – and so this is an example of a feedforward architecture. Notably, the number of transformations between input and output is fixed, and equal to the number of areas in the network.

Now compare this to the architecture in Fig. 1b. Here, we have added lateral and feedback connections to the network. Lateral connections allow the output of an area to be fed back into the same area, to influence its computations in the next processing step. Feedback connections allow the output of an area to influence information processing in a lower area. There is some freedom in the order in which computations may occur in such a network. The order we illustrate here starts with a full feed-forward pass through the network. In subsequent time steps, neural activations are updated in ascending order through the hierarchy, based on the activations that were computed in the previous time step. This order of operations can be seen more clearly if we 'unroll' the network in time, as shown in Fig. 1c.
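The order of operations described above can be made concrete with a small numerical sketch. The following is a minimal illustration under stated assumptions, not the authors' implementation: the layer sizes, the ReLU nonlinearity, the random weights, the choice to present the same static input at every time step, and all names (`W_ff`, `W_lat`, `W_fb`, `run_recurrent`, `unrolled_three_steps`) are hypothetical. It runs a three-area recurrent network for three time steps, then writes the same computation out as a straight-line feedforward pass through nine areas that reuse the same weights, and checks that the two agree.

```python
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_UNITS = 4, 5  # hypothetical sizes, chosen only for illustration


def relu(z):
    return np.maximum(z, 0.0)


def rand(shape):
    return 0.3 * rng.normal(size=shape)


# Bottom-up weights: input -> area 1, area 1 -> area 2, area 2 -> area 3.
W_ff = [rand((N_UNITS, N_IN)), rand((N_UNITS, N_UNITS)), rand((N_UNITS, N_UNITS))]
# Lateral weights: each area onto itself at the next time step.
W_lat = [rand((N_UNITS, N_UNITS)) for _ in range(3)]
# Feedback weights: area 2 -> area 1 and area 3 -> area 2.
W_fb = [rand((N_UNITS, N_UNITS)) for _ in range(2)]


def feedforward_pass(x, W_ff):
    """Activations of areas 1..3 for a purely feedforward sweep."""
    acts, h = [], x
    for W in W_ff:
        h = relu(W @ h)
        acts.append(h)
    return acts


def run_recurrent(x, n_steps, W_ff, W_lat, W_fb):
    """Run the recurrent network for n_steps; return the activity of area 3."""
    acts = feedforward_pass(x, W_ff)  # time step 1: full feedforward pass
    for _ in range(n_steps - 1):
        prev, acts = acts, []
        for area in range(3):  # ascending order through the hierarchy
            # Assumption: the static input x is presented at every time step.
            bottom_up = x if area == 0 else acts[area - 1]
            z = W_ff[area] @ bottom_up + W_lat[area] @ prev[area]
            if area < 2:  # the top area receives no feedback
                z = z + W_fb[area] @ prev[area + 1]
            acts.append(relu(z))
    return acts[-1]


def unrolled_three_steps(x, W_ff, W_lat, W_fb):
    """The same three time steps as a loop-free pass through nine areas."""
    a1 = relu(W_ff[0] @ x)
    a2 = relu(W_ff[1] @ a1)
    a3 = relu(W_ff[2] @ a2)
    a4 = relu(W_ff[0] @ x + W_lat[0] @ a1 + W_fb[0] @ a2)
    a5 = relu(W_ff[1] @ a4 + W_lat[1] @ a2 + W_fb[1] @ a3)
    a6 = relu(W_ff[2] @ a5 + W_lat[2] @ a3)
    a7 = relu(W_ff[0] @ x + W_lat[0] @ a4 + W_fb[0] @ a5)
    a8 = relu(W_ff[1] @ a7 + W_lat[1] @ a5 + W_fb[1] @ a6)
    a9 = relu(W_ff[2] @ a8 + W_lat[2] @ a6)
    return a9


x = rng.normal(size=N_IN)
out_recurrent = run_recurrent(x, 3, W_ff, W_lat, W_fb)
out_unrolled = unrolled_three_steps(x, W_ff, W_lat, W_fb)
print("recurrent == unrolled:", np.allclose(out_recurrent, out_unrolled))

# With all lateral and feedback weights set to zero, the recurrent network
# reduces to the plain feedforward pass, however many steps it is run for.
zeros_lat = [np.zeros_like(W) for W in W_lat]
zeros_fb = [np.zeros_like(W) for W in W_fb]
out_no_cycles = run_recurrent(x, 3, W_ff, zeros_lat, zeros_fb)
print("no cycles == feedforward:",
      np.allclose(out_no_cycles, feedforward_pass(x, W_ff)[-1]))
```

Note that `unrolled_three_steps` reads each weight matrix several times, whereas a general nine-area feedforward network would be free to use a different matrix at each of those places. The recurrent form stores seven matrices however many steps it runs; the loop-free form needs one additional block of areas per extra time step.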
In this illustration, the network is unrolled for a fixed number of time steps (3). In fact, recurrent processing can be run for any desired duration before its output is read out – a notion we will return to later. Notice how this temporally unrolled, small network resembles a larger feedforward neural network with more connections and areas between its input and output. We can emphasize this recurrent-feedforward equivalence by interpreting the computational graph over time as a spatial architecture, and visually arranging the induced areas and connections in a linear spatial sequence – an operation we call unrolling in space (Fig. 1d). This results in a deep feedforward architecture with many skip connections between areas that are separated by more than one level in this new hierarchy, and with many connections that are exact copies of one another (sharing identical connection weights).

Thus, any finite-time RNN can be transformed into an equivalent FNN. But this should not be taken to mean that RNNs are a special case of FNNs. In fact, FNNs are a special case of finite-time RNNs (Fig. 2a), comprising those which happen to have no cycles. More practically, not every unrolled finite-time RNN is a realistic FNN (Fig. 2b). By realistic networks, we mean networks that conform to the real-world constraints the system must operate under. For computational neuroscience, a realistic network is one that fits in the brain of the animal and does not require a deeper network architecture or more processing steps than the animal can accommodate. For computer vision, a realistic network is one that can be trained and deployed on available hardware at the training and deployment stages.

[Figure 1: network diagrams, panels (a) to (e).]

FIG. 1: Unrolling recurrent neural networks. (a) A simple feedforward neural network. (b) The same network with lateral (blue) and feedback (red) connections added, to make it recurrent. (c) "Unrolling" the network in time clarifies the order of its computations. Here, the network is unrolled for three time steps before its output is read out, but we could choose to run the network for more or fewer steps. Areas are staggered from left to right to show the order in which their neural activities are updated. (d) Alternatively, we can unroll the recurrent network's time steps in space, by arranging the areas and connections from different time steps in a linear spatial sequence. Note how all arrows now once again point in the same (forward) direction, from input to output. Throughout panels (a-b), connections that are identical (sharing the same weight matrices) are indicated by corresponding symbols. (e) If we lift the weight-sharing constraints from the previous network, this induces a deep feedforward "super-model", which can implement the spatially-unrolled recurrent network as a special case. This more general architecture may include additional connections (examples shown as light gray arrows) not present in the spatially-unrolled recurrent net.

[Figure 2: set diagram with regions labeled finite-time RNNs, feedforward NNs, unrolled ftRNNs, and realistically unrollable ftRNNs, divided into realistic and unrealistic.]

FIG. 2: Relationships between recurrent and feedforward networks. This figure illustrates relationships between discrete-time feedforward (FNN) and discrete-time recurrent (RNN) neural network models. (a) The architecture of any RNN can be reduced to an FNN by removing all its recurrent connections (e.g., going from Fig. 1b back to Fig. 1a), or equivalently, setting the weights of these connections to zero. Vice versa, any FNN can be expanded to an infinite variety of RNNs by adding lateral or feedback connections. Feedforward networks, thus, form an architectural subset of RNNs.
Here we specifically consider RNNs that accomplish their task in a finite number of time steps. These finite-time RNNs (ftRNNs) have the special property that they can be unrolled into equivalent FNNs. White points linked by arcs indicate pairs of computationally equivalent architectures. Thus, the feedforward NNs contain a subset of architectures that can be obtained by unrolling a ftRNN. (b) These sets of networks can be further subdivided into subsets that are or are not realistic to implement with the computational resources available for a brain or engineered device (areas below and above the dotted line, respectively). Deeper networks and, more generally, networks with more neurons and connections tend to require more memory and computation to train and run. Some realistic ftRNNs remain realistic when expressed as an FNN (blue ellipse). Others, however, become too complex, when unrolled, to be feasible (black arc crossing the realism line). This is because the unrolling operation induces a much deeper architecture with many more neural connections to be stored. These not-realistically-unrollable ftRNNs are especially interesting, since they correspond to recurrent solutions that cannot be replaced by feedforward architectures.

For example, there may be limits on the storage and energy available, which would limit the complexity of the architecture and computational graph. A realistic finite-time RNN, when unrolled, can yield an unworkably deep FNN. Although the most widely used current method for training RNNs (backpropagation through time) requires unrolling, an RNN is not equivalent to its unrolled FNN twin at the stage of real-world deployment: the RNN's recurrent connections need not be physically duplicated, but can be reused across cycles of computation.

An important recent observation [40–42] is that the architecture that results from spatially unrolling a recurrent network resembles the architectures of state-of-the-art FNNs used in computer vision, which similarly contain skip connections and can be very deep. These deep FNNs may form a super-class of models (Fig. 1e), which reduce to "recurrent-equivalent" architectures when certain subsets of weights are constrained to be identical. Liao & Poggio showed that deep feedforward architectures known as residual networks (ResNets) are formally equivalent to recurrent architectures when certain connection weights are constrained to be identical. Moreover, when ResNets were trained with such recurrent-equivalent weight-sharing constraints, their performance on computer vision benchmarks was similar to unconstrained ResNets (even though the weight sharing drastically reduces the parameter count and limits the component computations that the network can perform). This is especially noteworthy given that ResNets, and architecturally related DenseNets, are currently among the top-ranking FNNs on prominent computer vision benchmarks [19,43], as well as measures of brain-similarity. Today's best artificial vision models, thus, actually implement computational graphs closely related to those of recurrent networks, even though these models are strictly feedforward architectures.

CONTINUOUS- VERSUS DISCRETE-TIME DYNAMICS

FNNs used in computer vision do not have meaningful dynamics. Each unit in the network instantaneously transforms its input into an output. This is in contrast to a feedforward network of biological neurons. When given a static input, biological neurons do not immediately produce their final responses. The movement of electric charges and neurotransmitters, and the opening and closing of ion channels takes time, so the network will gradually transition from its initial to its final state, with its trajectory continually perturbed by noise. Such continuous-time dynamics can be described by differential equations. When these cannot be solved analytically (as is typically the case), the dynamics can be simulated in discrete steps. In each step, the current state of each simulated neuron is updated. The future state of the network thus depends on its current state, as it does in an RNN. Consequently, the computational graph of the simulation algorithm contains loops from each neuron back to itself. Running the simulation over time amounts to unrolling this loopy computational graph, even though the network architecture did not contain loops.

Computational neuroscientists commonly study models of feedforward and recurrent neural networks with continuous-time dynamics. Here our focus is on neural network models that are motivated by the goal to capture computations, rather than their precise neural implementation. The discrete-time behavior of such a model is not derived from a continuous-time description in differential equations. Moreover, the model is optimized in its discrete-time implementation. However, an implicit assumption in the field is that such models could be implemented in biological brains, and thus in continuous-time dynamical systems.

REASONS TO RECUR

We have described how a recurrent network can be unrolled into a deep feedforward architecture. The resulting feedforward super-model offers greater computational flexibility, since weight-sharing constraints can be omitted and additional skip connections added to the network (Fig. 1e). So what would be the benefit of restricting ourselves to recurrent architectures? We will first discuss the benefits of recurrence in terms of overarching principles, before considering more specific implementations of these principles.

Recurrence provides greater and more flexible computational depth

Recurrence enables arbitrary computational depth

One important advantage of recurrent algorithms is that they can be run for any desired length of time before their output is collected. We can define computational depth as the maximum path length (i.e. number of successive connections and nonlinear transformations) between input and output. A recurrent neural network (RNN) can achieve arbitrary computational depth despite having a finite count of parameters and being limited to finite spatial components. In other words, it can multiply its limited spatial resources along time. These deeper computations can serve to expand on the number of hypotheses considered (in generative inference) or on the number of nonlinear features computed (in discriminative inference), or to extend the representation into the future or past, or to iteratively converge to a good estimate of some latent variable of interest.

Recurrence enables more flexible expenditure of energy and time in exchange for inferential accuracy

In addition to enabling an arbitrarily deep computation given enough time, an RNN can adjust its computational depth to the task at hand. The computational depth of a feedforward net, by contrast, is a fixed number determined by the architecture.

Spoerer et al. implemented a recurrent model that terminates computations when it reaches a confidence threshold (defined by the entropy of the posterior, a measure of the model's uncertainty). The model terminates rapidly for many images, but expends more time and energy on hard images to reach its confidence threshold. Adjusting the confidence threshold enables trading off speed for accuracy in terms of average performance. When compared to a range of FNNs requiring different amounts of computation, the RNN achieved roughly the same accuracy as each of the FNNs when its confidence threshold was set to match the FNN's computational cost (number of floating point operations) on average across images (Fig. 3). Flexible computational depth would be advantageous for animals, who may need to respond rapidly in some situations, must limit metabolic expenditures in general, and may benefit from slower and more energetically costly inferences when high accuracy is required. Computer vision faces similar requirements in certain applications. For example, a vision algorithm in a smartphone should respond rapidly and conserve energy in general, but should also be able to recognize hard images, and it should allow trading off mean accuracy for speed and energy (e.g., when the battery is low).

Recurrent architectures can compress complex computations in limited hardware

Another benefit of recurrent solutions is that they require fewer components in space when physically implemented in recurrent circuits, such as brains. Compare Figs. 1b and 1e: the recurrent network is anatomically more compact than the feedforward network and has fewer connections. It is easy to see why evolution might have favored recurrent implementations for many brain functions: Space, neural projections, and the energy to develop and maintain them are all costly for the organism. In addition, synaptic efficacies must be either learned from limited experience or encoded in a limited-capacity genome. Beyond saving space, material, and energy, thus, smaller descriptive complexity (or parameter count) might ease development and learning.

Engineered devices face the same set of costs, although their relative weighting changes from application to application. In particular, a larger number of units and weights must either be represented in the memory of a conventional computer or implemented in specialized (e.g., neuromorphic) hardware. The connection weights in an NN model need to be learned from limited data. This requires extensive training, e.g., in a supervised setting, with millions of hand-labeled examples that show the network the desired output for a given input. The larger number of parameters associated with a feedforward solution might overfit the training data. The learned parameters then do not generalize well to new examples of the same task.

FNNs often turn out to generalize surprisingly well even when they have very large numbers of parameters [45–47]. This phenomenon is thought to reflect a regularizing effect of the learning algorithm, stochastic gradient descent. Indeed, the trend is towards ever deeper networks with more connections to be optimized, and this trend is associated with continuing gains in performance on computer vision benchmarks. Nevertheless, it could turn out that recurrent architectures that achieve high computational depth with fewer parameters bring benefits not only in terms of their storage, but also in terms of statistical efficiency, the ability to generalize accurately based on limited experience. This would imply that recurrent networks have an inductive bias that makes up for the limited experiential data. This is explored further in subsequent sections, where we discuss how RNNs can exploit temporal dependency structures, and enable iterative inference.

Energy is another factor to consider in both biology and engineering. Larger FNNs take longer to train on bigger computing clusters, while drawing greater amounts of power – a trend that is not sustainable. In the long run, therefore, computer vision too may benefit from the anatomical compression that can be achieved through clever use of recurrence.

Importantly, however, not every deep feedforward model can be compressed into an equivalent recurrent implementation. This anatomical compression can only be achieved when the same function may be applied iteratively or recursively within the network. The crucial question, therefore, is: what are these functions? What operations can be applied repeatedly in a productive manner? The remainder of this paper will reflect on the various roles that have been proposed for recurrent processing for visual inference, from superficial to increasingly profound forms of recurrence.

[Figure 3: accuracy (proportion top-1 correct) plotted against computational cost (number of floating-point operations) for the RNN at different entropy thresholds (nats) and for the FNNs.]

FIG. 3: Recurrence enables a network to trade speed for accuracy while approximately emulating the accuracies of feedforward models on average at matched computational cost. Circles denote the performance of a recurrent neural network (RNN) that was run for different numbers of time steps, until it achieved a desired threshold of image classification confidence (quantified by the entropy of the class probabilities in the final network layer). Squares correspond to three architecturally similar feedforward networks (FNN) with different computational costs. On the x-axis is the computational cost of running these models, measured by the number of floating point operations. For the feedforward models, this cost is fixed by the architecture. For the recurrent models, it is the average number of operations that was required to meet the given entropy threshold. The y-axis shows the classification accuracy achieved by each model. The performance of the recurrent model for different certainty thresholds follows a smooth curve, trading off computational cost (and thus computational speed) and accuracy. Note that this curve passes almost exactly through the cost-accuracy combinations achieved by the feedforward models. Thus, a single recurrent model can emulate the performance of multiple feedforward models as it trades off speed and accuracy. When the confidence threshold of termination was set such that the RNN matched the accuracy of a given FNN, the RNN required a similar number of floating-point operations on average as the FNN. (Figure adapted with permission from the authors.)

Feedback connections are required to integrate information from outside the visual hierarchy

A key, established role of recurrent connections in biological vision is to propagate information from outside the visual cortex, so that it can aid visual inference. Here, we will briefly discuss two such outside influences: attention and expectations.

Attentional prioritization requires feedback connections

Animals have needs and goals that change from moment to moment. Perception is attuned to an animal's current objectives. For instance, a primate foraging for red berries may be more successful if its visual perception apparatus prioritizes or enhances the processing of red items. Since current goals are represented outside the visual cortex (e.g., in frontal regions), top-down connections are clearly required for this information to influence visual processing. Such top-down effects have been grouped under the label "attention", and they have been the subject of an entire sub-field of study. For our purposes, it is sufficient to note that the effects and mechanisms of top-down attention are well-documented and pervasive in visual cortex (for review, see [36–38]), and thus there is no question that this is one important function of recurrent connections.

Integrating prior expectations into visual inference requires feedback connections

Organisms may constrain their visual inferences by expectations. Visual input can be ambiguous and unreliable, and thus open to multiple interpretations. To constrain the inference, an observer can make use of prior knowledge [51–53]. One form of prior knowledge is environmental constants (e.g., "light tends to come from above"). Such unvarying knowledge may be stored within visual cortex, especially when it pertains to the overall prevalence of basic visual features (e.g., local edge orientations). Another form of prior knowledge is contextual information specific to the current situation. Such time-varying knowledge may require a flexible representation outside visual cortex (e.g., "I rang the doorbell at my mother's house, so I expect to see her open the door"). Such expectations, represented in higher cortical regions, require feedback connections to affect processing in visual cortex.

The top-down imposition of attention and expectation must be mediated by feedback connections. However, it is unclear whether these influences fundamentally change the nature of visual representations or merely modulate these representations, adjusting the gain depending on the current relevance of different features of the visual input. As illustrated in Fig. 4a, for a given input this would require only two "sweeps" of computation through the visual processing hierarchy: a feedback sweep that primes visual areas with top-down information, and a bottom-up sweep to interpret the visual input and integrate or modify this interpretation with the top-down signal (not necessarily in that order). Importantly, if the feedback signal merely enhances or suppresses some visual features, then the core inference algorithm need not be fundamentally recurrent – one can imagine that the bottom-up part of such a network is modeled perfectly by an FNN, while an optional recurrent module could be added in order to implement top-down contextual influences.

[Figure 4: three panels, (a) to (c), each showing areas 1 to 3 unrolled over time steps next to a worked example.]

FIG. 4: Increasingly profound modes of recurrent processing, unrolled in time. Visual cortex likely combines all three modes of recurrence illustrated here. The left side of each panel shows the computational graph induced by each form of recurrence, while the right side illustrates a (simplified) example of how this recurrence can be used. In these examples, circles correspond to neurons (or neural assemblies) encoding the feature illustrated within the circle, and lines that connect to circles indicate neural connections with significant activity. (a) Top-down influences from outside the visual processing hierarchy may be incorporated through two computational sweeps: a feedback sweep priming the network with top-down information and a feedforward sweep to interpret visual input and combine this interpretation with the top-down signal. Note that the lateral connections here merely copy neural activities in each area to the next time point; this identity transformation could also be implemented in other ways, such as slow membrane time constants or other forms of local memory. In the example on the right, a top-down signal communicates the expectation that the upcoming input will be horizontal motion. This primes neurons encoding this direction of motion to be more easily or strongly activated, and sharpens the interpretation of the subsequent (ambiguous) visual input. (b) To efficiently perform inference on time-varying visual input, recurrent connections may implement a fixed temporal prediction function akin to the transition kernel in a Kalman filter, extrapolating the ongoing dynamics of the world one time step into the future. For instance, in the example on the right, a downward moving square was perceived at t = 1. This motion is predicted to continue, and this prediction constrains the interpretation of the (ambiguous) visual input at the next time point. For simplicity, only lateral recurrence is shown in this example. Note that each input is mapped onto its corresponding output in a single recurrent time step. (c) Static input may also benefit from recurrent processing that iteratively refines an initial, coarse feedforward interpretation. In this mode of recurrence, there are several processing time steps between input and output, whereas in (b) there was one input and output for each time step. Illustrated on the right is an iterative hierarchical inference algorithm. Here, a higher-level hypothesis, generated in the first time step, refines the underlying lower-level representation in the next time step, which in turn improves the higher-level hypothesis, and so forth, until the network converges to an optimal interpretation of the input across the entire hierarchy. For simplicity, lateral recurrent interactions are not shown in this example.

Recurrent networks can exploit temporal dependency structure

Contextual constraints on visual inference include not only information from outside the visual hierarchy, such as information from other sensory modalities and memory, as discussed in the previous section. The recent stimulus history within the visual modality also provides context, likely represented within the visual system.

Recurrent networks can dynamically compress the stimulus history

The primate visual system is thought to contain a hierarchy, not only of processing stages and spatial scales, but also of temporal scales [56,57]. Visual representations track the environment moment by moment.

Recurrent dynamics can simulate and predict the dynamics of the world

Dynamic compression of the past exploits the temporal dependency structure of the sensory data. The purpose of representing the past is to act well in the future. This suggests that a neural network should exploit temporal dependencies not just to compress the past, but also to predict the future. In fact, an optimal representation of even just the present requires prediction, because the sensory data is delayed and noisy.

Changes in the world are governed by laws of dynamics, which by definition are temporally invariant. An ideal observer will exploit these laws in visual inference and optimally combine previous with present observations to estimate the current state. This implies an extrapolation of the past to generate predictions that improve the interpretation of the present sensory input. When the dynamics are linear and noise is Gaussian, the optimal way to infer the present state by combining past and present evidence is the Kalman filter – an algorithm widely used in engineering applications. A number of authors [60–63] have proposed that the visual cortex may implement an algorithm similar to a Kalman filter. This theory is consistent with temporal biases that are evident in human perceptual judgments [64–66].

Kalman filters employ a fixed temporal transitional kernel. This kernel takes a representation of the world (e.g., variables encoding the present state of a physical system, such as positions and velocities) at time t, and transforms it into a predicted representation for time t + 1, to be integrated with new sensory evidence that arrives at that time. While the resulting prediction varies as a function of the kernel's input, the kernel itself is constant, reflecting the temporal shift-invariance of the laws governing the dynamics. Recurrent neural networks provide a generalization of the Kalman filter and can represent nonlinear dynamical systems with non-Gaussian noise.

Note that this type of recurrent processing is more
However, the duration profound than the two-sweep algorithm (Fig. 4a) that of a visual moment, the temporal grain, may depend on incorporated top-down influences on visual inference. The the level of representation. These principles apply to all two-sweep algorithm is trivial to unroll into a feedforward sensory modalities and have been empirically explored, in architecture. In contrast, unrolling a Kalman filter- particular, for audition and speech perception. At the like recurrent algorithm would induce an infinitely deep simplest level, a neural network could use delay lines to feedforward network, with a separate set of areas and detect spatiotemporal, rather than purely spatial, patterns. connections for each time point to be processed. A finite- Recurrent neural networks have internal states and can depth feedforward architecture can only approximate the represent temporal context across units tuned to different recurrent algorithm. While the feedforward approximation latencies. An RNN could represent a fixed temporal window, will have a finite temporal window of memory to constrain by replicating units tuned to different patterns for multiple its present inferences, the recurrent network can in principle latencies. However, RNNs trained on sequence processing integrate information over arbitrarily long periods. tasks, such as language translation, learn more sophisticated representations of temporal context . They can represent Due to their advantages for dealing with time-varying (or context at multiple time scales, learning a latent represen- otherwise ordered) inputs, recurrent neural networks are in tation that enables them to dynamically compress whatever fact widely employed in the broader field of machine learning information from the past is needed for the task. In contrast for tasks involving sequential data. 
Speech recognition and to a feedforward network, a recurrent network is not limited machine translation are prominent applications that RNNs 58,67–70 by spatial constraints in terms of its retrospective time excel at . Computer vision, too, has embraced RNNs 71–73 horizon. It can maintain task-relevant information indefi- for recognition and prediction of video input . Note nitely, integrating long-term memory into its inferences. that these applications all exploit the dynamics in RNNs to 8 model the dynamics in the data. being subdivided into smaller hypotheses about lower or What if we trained a Kalman filter or sequence-to- intermediate-level features, such as the local edges that make up a larger contour. An iterative visual inference sequence RNN (Fig. 4b) on a train of independently sampled static inputs to be classified? The memory of the algorithm starts with an initial hypothesis, and refines it by incremental improvements. These improvements may preceding inputs would not be useful then, so we expect the recurrent model to revert to using essentially only its include eliminating hypotheses that are mutually exclusive, strengthening compatible causes, or adjusting a hypothesis feedforward weights. The type of recurrent processing we described in this section, thus uses memory to improve based on its ability to predict the data (the visual input). In a probabilistic framework, the optimization objective would visual inference. In the next section, we consider how recurrent processing can help with the inferential compu- be the likelihood (probability of the image given the latent representation) or the posterior probability (probability of tations themselves, even for static inputs. the latent representation given the image). 
Recurrence enables iterative inference Incompatible hypotheses can compete in the representation Recurrent processing can contribute even to inference on static inputs, and regardless of the agent’s goals and expec- There are often multiple plausible explanations for a given tations, by means of an iterative algorithm. An iterative sensory input that are mutually exclusive. The distributed, algorithm is one that employs a computation that improves parallel nature of neural networks enables them to initially an initial guess. Applying the computation again to the activate and represent all of these possible hypotheses simul- improved guess yields a further improvement. This process taneously. Recurrent connectivity between neurons can then can be repeated until a good solution has been achieved implement competitive interactions among hypotheses, so or until we run out of time or energy. Recurrent networks as to converge on the best overall explanation. can implement iterative algorithms, with the same neural There is some evidence that sensory representations are 74–76 network functions applied successively to some internal probabilistic – in this case, the probabilities assigned pattern of activity Fig. 4c). to a set of mutually exclusive hypotheses must sum to 1. In many fields, iterative algorithms are used to solve A strengthening of belief in one hypothesis, thus, should estimation and optimization problems. In each iteration, entail a reduction of the probability of other hypotheses in a small adjustment is made to the problem’s proposed the representation. If neurons encode point estimates rather solution, to improve a mathematically formulated objective. than probability distributions, then only one hypothesis A locally optimal solution is found by making small improve- can win (although that hypothesis may be encoded by ments until further progress is not required or not possible. a population response involving multiple neurons). 
The The algorithm navigates a path in the space of the values winning hypothesis could be the maximum a posteriori to be estimated or the parameters to be optimized, that (MAP) hypothesis or the maximum likelihood hypothesis. leads to a good solution (albeit not necessarily the global Influential models of visual inference involving compet- optimum). itive recurrent interactions include divisive normalization , 36 30,32,77 Much of machine learning involves iterative methods. biased competition , and predictive coding . Gradient descent is an iterative optimization method, whose Recent theoretical work has demonstrated that lateral stochastic variant is the most widely used method for competition can give rise to a robust neural code, and 77,78 training FNNs. Many discrete optimization techniques are can explain certain puzzling neural response properties . iterative. Iterative algorithms are also central to inference This theory considers a spiking neural network setting, in machine learning, for example in variational inference in which different neurons encode highly overlapping or (where inference is achieved by optimization), sampling even identical features in their input. This degeneracy methods (where steps are chosen stochastically such that means that the same signal can be encoded equally well the distribution of samples converges on the posterior distri- by a range of different response patterns. When a bution), and message passing algorithms (such as loopy particular neuron spikes, lateral inhibition ensures that belief propagation). In particular, such iterative inference other competing neurons do not encode the same part of algorithms are used in probabilistic approaches to computer the input again. Which neuron gets to do the encoding 31,33 vision . 
It is somewhat surprising, then, that iterative thus depends on which neuron fires first, because its computation is not widely exploited to perform visual membrane potential happened to be closest to a spiking inference in FNNs. threshold. This leads to trial-to-trial variability in neural Visual inference is naturally understood as an responses that reflects subtle differences in initial condi- optimization problem, where the goal is to find hypotheses tions – conditions that may not be known to an experi- that can explain the current visual input . A hypothesis, menter, who may thus mistake this variability for random in this case, is a proposed set of latent (i.e. unobserved) noise. This could explain the puzzling observation that causes that can jointly explain the image. The hypothe- individual neurons reliably reproduce the same output given sized latent causes could be the identities and positions of the same electrical stimulation, but populations of neurons, objects in the scene. Visual hypotheses are hierarchical, wired together, display apparently random variability under 9 79–81 85 sensory stimulation . Since multiple neurons can encode perceptual grouping operations . Recent examples include the same feature, the resulting code is also robust to neurons Linsley et al., who developed horizontal gated-recurrent being lost or temporarily inactivated. units (hGRUs) that learn local spatial dependencies . A network equipped with this particular recurrent connectivity FNNs do not incorporate lateral connections for compet- was competitive with state-of-the-art feedforward models itive interactions, although they very often include compu- on a contour integration task, while using far fewer free tations that serve a similar purpose. Chief among these parameters. George et al. similarly leveraged lateral inter- are operations known as max-pooling and local response 16,82 normalization (LRN) . 
In max-pooling, only the actions to recognize contiguous contours and surfaces, by modeling these with a conditional random field (CRF), using strongest response within a pool of competing neurons is forwarded to the next processing stage. In LRN, each a message-passing algorithm for inference. This approach made their Recursive Cortical Network (RCN) the first neuron has its response divided by a term that is computed from the sum of activity in its normalization pool. While computer vision algorithm to reliably beat CAPTCHAs – images of letter sequences under a variety of distortions, neither of these mechanisms is mediated by explicit lateral noise and clutter, that are widely used to verify that queries connections in a FNN, a strictly connectionist implemen- to a user interface are made by a person, and not an tation of these mechanisms (e.g., in biological neurons or algorithm. CRFs were also used by Zheng et al. , who neuromorphic hardware) would have to include lateral recur- incorporated them as a recurrent extension of a convolu- rence. This, then, is another way in which apparently tional neural network for image segmentation. The model feedforward FNNs can exhibit a (limited) form of recurrent processing ”under the hood”. Note, though, that each surpassed state-of-the-art performance at the time. Associ- ation rules enforced through lateral connections may also of these operations is carried out only once, rather than allowing competitive dynamics to converge over multiple help to fill in missing information, such as when objects are partially hidden from view by occluders. Lateral connec- iterations. Furthermore, in contrast to the lateral inter- actions in predictive coding or other normative models, tivity has been shown to improve recognition performance 23,89,90 in such settings . Montobbio et al. 
showed that LRN and max-pooling are not derived from normative lateral diffusion of activity between neurons with correlated principles, and do not necessarily select (or enhance) the feedforward filter weights improves robustness to image best hypothesis (however ”best” is defined). perturbations including occlusions . Enhancement of mutually compatible hypotheses (this section) and competition between mutually exclusive Compatible hypotheses can strengthen each other in the hypotheses (previous section) can both contribute to representation inference. A more general perspective is provided by the insight that prior knowledge about what features in a scene In feedforward models of hierarchical visual inference, are mutually compatible or exclusive may be part of an neurons at higher stages selectively respond to combinations overarching generative model, which iterative algorithms of simpler features encoded by lower-level neurons. Higher- can exploit for inference. level neurons thus are sensitive to larger-scale patterns of correlation between subsets of lower-level features. But such larger-scale statistical regularities may not be most Iterative algorithms can leverage generative models for efficiently captured by a set of larger-scale building blocks. inference Instead, they may be more compactly captured by local association rules. Consider, for instance, the problem of contour detection. Many combinations of local edges in an Perceptual inference aims to converge on a set of hypotheses image can form a continuous contour. The resulting space that best explain the sensory data. Typically, a hypothesis is of contours may be too complex to be efficiently represented considered to be a good explanation if it is consistent with with larger-scale templates. What all these contours have in both our prior knowledge and the sensory data. 
A gener- common, however, is that they consist of pairs of edges that ative model is a model of the joint distribution of latent are locally contiguous, with sharper angles occurring with causes and sensory data. Generative models can powerfully lower probability. Thus, the criteria for ’contour-ness’ may constrain perceptual inference because they capture prior be compactly expressed by a set of local association rules: knowledge about the world. In machine learning, defining 83,84 these edges go together; those do not . Contours may generative models enables us to express and exploit what then be pieced together by repeatedly applying the same we know about the domain. A wide range of inference local association rules. Those edge pairs which are most algorithms can be used to compute posterior distributions clearly connected would be identified in early iterations. over variables of interest, given observed variables. The Later inferences can benefit from the context provided by algorithms include variational inference, message passing, and Markov Chain Monte Carlo sampling, all of which earlier inferences, enabling the process to recognize conti- nuity even where it is less locally apparent. require iterative computation. This insight has inspired network models of visual In this section, we focus on a particular approach to lever- inference that implement local association rules through aging generative models in visual inference, in which the lateral connections, to aid contour integration and other joint distribution p(x, z) of the image x and the latents z 10 is factorized as p(x, z) = p(z) · p(x|z), which we refer to either of the categories. An ideal observer should evaluate as the top-down factorization. The architecture contains the likelihood for each hypothesis and adjudicate according components that model p(x|z) and predict the image from to their ratio . 
A feedforward network may instead latch the latents (or more generally lower-level latent representa- on to a few highly discriminative, but subtle image features tions from higher-level latent representations). Compared that don’t explain much and may not generalize to images 93,95 to the alternative factorization p(x, z) = p(x) · p(z|x), the from a different data set . In contrast, visual features top-down factorization has the potential advantage that the that are important for generating or reconstructing images model operates in the causal direction, matching the causal of a given class may be more likely to generalize to other process in the world that generated the image. The top- examples of the same category. In support of this intuition, down model predicts what visual input is likely to result two novel RNN architectures that employ generative models from a scene that has the hypothesized properties. This is for inference were found to be more robust to adversarial 96,97 somewhat similar to the graphics engine of a video game perturbations . Generative inference networks were also or image rendering software. This top-down model can be shown to better align with human perception, compared implemented via feedback connections that translate higher- to discriminative models, when presented with controversial level hypotheses in the network to representations at a lower stimuli – images synthesized to evoke strongly conflicting level of abstraction. classifications from different models . Despite these promising developments, generative Using generative models implemented with top-down inference remains rare in visual FNN models. The predictions for inference is known as analysis-by-synthesis exceptions mentioned above are rather simple networks – an approach that has a long history in theories of 30,32,51 trained on easy classifications problems, and are not (yet) perception . 
Arguably, the goal of perceptual competitive with state-of-the-art performance on more inference, by definition, is to reason back from effects challenging computer vision benchmarks. Within compu- (sensory data) to their causes (unobserved variables of tational neuroscience, by contrast, generative feedback interest), and thus invert the process that generated the connections appear in many network models of visual effects. The crucial question, however, is whether the causal 30,32 inference. Prominent examples are predictive coding process is explicitly represented in the inference algorithm. and hierarchical Bayesian inference . However, these The alternative, which can be achieved with feedforward models have not had much success in explaining visual inference, is to directly approximate the inverse, without inference beyond its earliest stages. A notable exception is ever making predictions in the causal direction. The success work by Wen et al. , which shows that extending super- of the feedforward approach then depends on how well the vised convolutional FNNs with the recurrent dynamics of inverse can be approximated by a fixed mapping of inputs predictive coding can improve classification performance. to hypotheses. To iteratively invert the causal process, The fields of computer vision and computational neuro- a neural network can evaluate the causal model for a science both stand to benefit from the development of more current hypothesis and update the hypothesis in a beneficial powerful generative inference models. direction. This process can then be repeated until conver- gence. This process of analysis by repeated synthesis may be preferable to directly approximating the inverse mapping if the causal process that generates the sensory data is easier Iteration is necessary to close the amortization gap to model than its inverse. 
In particular, the causal process may be more compactly represented, more easily learned, Iterative inference has many advantages. A drawback of more efficient to compute, and more generalizable beyond iteration, however, is that it takes time for the algorithm to the training distribution than its inverse. converge during inference. This is unattractive for animals who need to perform visual inference under time pressure. Another potential advantage of generative inference lies in robustness to variations in the input. While FNNs can It is also a challenge when training a FNN, which already requires many iterations of optimization. If each update of accurately categorize images drawn from the same distri- bution that the training images were drawn from, it does not the network’s connections additionally includes an iterative inner loop to perform inference on each training example, take much to fool them. A slight alteration imperceptible to this lengthens the time required for training. humans can cause a FNN to misclassify an image entirely, with high confidence . State-of-the-art FNNs rely more A complementary inference mechanism is amortized 92 101,102 strongly on texture than humans, who rely more on shape . inference , where a feedforward model approximates More generally, FNNs seem to ignore many image features the mapping from images to their latent causes. FNNs that are relevant to human perception . One hypothe- are eminently suited for learning complicated input-output sized reason for this is that these networks are trained to mappings. A single transformation then replaces the trajec- discriminate images, but not to generate them. Thus, any tories that would be navigated by an iterative inference visual feature that reliably discriminates categories in the algorithm. In some cases, the iterative solution and the training data will be weighted heavily in the network’s classi- best amortized mapping may be exactly equivalent. A fication decisions. 
Importantly, this weight is unrelated to linear model, for instance, can be estimated iteratively, how much variance the feature explains in the image, and by performing gradient descent on the sum of squared to the likelihood, i.e. the probability of the image given prediction errors. However, if a unique solution exists, it 11 can equivalently be found by a linear transformation that illustrates how limited resources (the fovea) can be dynam- directly maps from the data to the optimal coefficients. ically allocated (eye movements) to different portions of the evidence (the visual scene) in temporal sequence. A In general, however, amortized inference incurs some error, compared to the optimal solution that might be found sensory system limited to a finite number of neurons, thus, can multiply its resources along time to achieve a detailed through iterative optimization. This error has been called 103,104 the amortization gap . It is analogous to the poor analysis. The cycle may start with an initial rough analysis of the entire visual field, followed by fixations on locations fit that may result from buying clothes ”off the rack”, compared to a tailored version of the same garment. The likely to yield valuable information. This is an example of an essentially recurrent process whose efficiency cannot amortization gap is defined in the context of variational inference, when the iterative optimization of the varia- be emulated with a feedforward system. The internal mechanisms of visual inference are faced with qualitatively tional approximation to the posterior is replaced by a neural network that maps from the image to the parameters of the similar challenges: Just like our retinae cannot afford foveal resolution throughout the visual field, the ventral stream variational distribution. 
The resulting model suffers from cannot afford to perform all potentially relevant inferences two types of error: (1) error caused be the choice of the variational approximation (variational approximation gap) on the evidence streaming in through the optic nerve in a single feedforward sweep. Internal shifts of attention, like and (2) error caused by the model mapping from images to variational parameters (amortization gap). One recent eye movements, can sequentialize a complex computation and avoid wasting energy on portions of the evidence that study has argued that the amortization gap is often the main source of error in amortized inference models . are uninformative or irrelevant to the current goals of the animal. Amortized and iterative inference define a continuum. At Whereas the outer loop of active vision is largely about one extreme, iterative inference until convergence reaches positioning our eyes relative to the scene and bringing a solution through a trajectory of small improvements, important content into foveal vision, the inner loop of visual explicitly evaluating the quality of the current solution at inference on each glimpse is far more flexible. Beyond covert every iteration. At the other extreme, fully amortized attentional shifts that select locations, features, or objects inference takes a single leap from input to output. In for scrutiny, a recurrent network can decide what computa- between these extremes lies a space for algorithms that use intermediate numbers of steps, to approximate the tions to perform so as to most efficiently reduce uncertainty about the important parts of the scene. In a game of twenty optimal solution through a computational path that is more refined than a leap, but more efficient than full- questions, we choose a question that most reduces our remaining uncertainty at each step. The budget of twenty fledged iterative optimization. 
Models that occupy this space include explicit hybrids of iterative and amortized would not suffice if we had to decide all the questions before 104–106 seeing any answers. The visual system similarly has limited inference , as well as RNNs with arbitrary dynamics computational resources for processing a massive stream of that are trained to converge to a desired objective in a 23,107–109 evidence. It must choose what inferences to pursue on the limited number of time steps (e.g., ). basis of their computational cost and uncertainty-reducing 113–115 benefit as it forages for insight . Recurrence is required for active vision CLOSING THE GAP BETWEEN BIOLOGICAL Vision is an active exploratory process. Our eye movements AND ARTIFICIAL VISION scan the scene through a sequence of well-chosen fixations that bring objects of interest into foveal vision. Moving We have reviewed a number of advantages that recurrence our heads and our bodies enables us to bring entirely new can bring to neural networks for visual inference. Going parts of the scene into view, and closer for inspection at high forward, neural network models of vision should incorporate resolution. Active control of our eyes, heads, and bodies can recurrence; not just to better understand visual inference also help disambiguate 3D structure as fixation on points in the brain, but also to improve its implementation in at different depths changes binocular disparity, and head machines. and body movements create motion parallax. Active vision involves a recurrent cycle of sensory processing and muscle control, a cycle that runs through the environment. Recurrence already improves performance on Our focus here has been on the internal computational challenging visual tasks functions of recurrent processing, and active vision has been 110–112 reviewed elsewhere . 
However, it is important to note that the internal recurrent processes of visual inference from Efforts in this direction are already underway, and turning a single glimpse are embedded within the larger recurrent up promising results. Some of this work has been described process of active visual exploration. Active vision provides in previous sections, such as the use of lateral connec- 86–88 not just the larger behavioral context of visual inference. tions to impose local association rules and generative It also provides a powerful illustration of the fundamental inference for more robust performance outside the training 96,97 advantages that recurrent algorithms offer in general. It distribution . Several other recent findings are worth 12 highlighting here, as they have shown improved performance realism could refer to the real-world constraints faced by on visual tasks, better approximations to biological vision, either biological or artificial visual systems. Future studies or both, through recurrent computations. should compare RNN and FNN implementations for the same visual inference task, while matching the complexity In particular, several studies have found that recurrence of the models in a meaningful way. Setting a realistic is required in order to explain or improve visual inference budget of units, connections, and computational operations in challenging settings. Kar and colleagues identified a is one important approach. To understand the computa- set of ’challenge images’ that required recurrent processing tional differences between RNN and FNN solutions, it is in order to be accurately recognized. A feedforward also interesting to (1) match the parameter count (number FNN struggled to interpret these images, whereas macaque of connection weights that must be learned and stored), monkeys recognized them as accurately as a set of control which requires granting the FNN larger feature kernels, images. 
Challenge images were associated with longer more feature maps per layer, or more layers, or (2) match processing times in the macaque inferior temporal (IT) the computational graph, which equates the distribution of cortex, consistent with recurrent computations. Neural path lengths from input to output and all other statistics responses in IT for images that took longer were well of the graph, but grants the FNN a much larger number of accounted for by a brain-inspired RNN model. In a parameters . different study , this same recurrent architecture was found to account for behavior, and neural data from macaque visual cortex, in object recognition tasks, while also achieving good performance on an important computer Freeing ourselves from the feedforward framework vision benchmark (ImageNet ). In human visual cortex, recurrent interactions were also found to be crucial to Deep feedforward neural networks constitute an essential model the neural dynamics underlying object recognition, building block for visual inference, but they are not the as measured through magnetoencephalography (MEG) . whole story. The missing element, recurrent dynamics, One prominent challenge to visual inference is posed is central to a range of alternative conceptions of visual 31,110–112,129,130 by partial occlusions, which hide part of a target object inference that have been proposed . These from view. In two recent studies, recurrent architec- ideas have a long history, they are essential to under- tures were shown to be more robust to occlusions than standing biological vision, and they have great potential for 89,119 their feedforward counterparts . Interestingly, in both engineering, especially in the context of modern hardware human observers and in an RNN model, object recognition and software. 
The promise of active vision and recurrent under occlusion was impaired by backward masking (the visual inference is, in fact, boosted by the power of presentation of a meaningless noise image, shortly after feedforward networks. 13,15,120 a target stimulus, to disrupt recurrent processing ). However, the beauty, power, and simplicity of feedforward Neural responses to partially occluded shapes in macaque neural networks also makes it difficult to engage and visual cortex are also consistent with recurrent processing, develop the space of recurrent neural network algorithms and were well explained by a predictive coding model in for vision. The feedforward framework, embellished by which prefrontal cortex provide a feedback signal to visual recurrent processes that serve auxiliary and modulatory 121,122 area V4 . functions like normalization and attention, enables compu- Another challenge for human perception is crowding, tational neuroscientists to hold on to the idea of a hierarchy which occurs when the detailed perception of a target of feature detectors. This idea might not be entirely stimulus is disrupted by nearby flanker stimuli . In mistaken. However, it is likely to be severely incomplete certain instances, the target stimulus can be released and ultimately limiting. from crowding if further flankers are added that form The insight that any finite-time recurrent network can a larger, coherent structure with the original flankers. be unrolled compounds the problem by suggesting that the This uncrowding effect may be due to the flankers being feedforward framework is essentially complete. More practi- ’explained away’, thus reducing their interference with the cally, the fact that we train RNNs by unrolling them for 124,125 126 target representation . Recent work has shown that finite time steps might in some ways impede our progress. 
both effects can be explained by architectures known as FNNs are usually trained by stochastic gradient descent 127,128 Capsule Nets , which include recurrent information using the backpropagation algorithm. This method retraces routing mechanisms that may be similar to perceptual in reverse the computational steps that led to the response grouping and segmentation processes in the visual cortex. in the output layer, so as to estimate the influence that Note that, in all of these cases, it may be possible to each connection in the network had on the response. Each develop a feedforward architecture that performs the task connection weight is then adjusted, to bring the network equally well or better. Trivially, and as we discussed previ- output closer to a desired output. The deeper the network, ously, a successful recurrent architecture can always be the longer the computational path that needs to be retraced. unrolled (for a finite number of time steps) into a deep RNNs for visual inference typically are trained through feedforward network with many more learnable connections. a variation on this method, known as backpropagation However, a realistic recurrent model, when unrolled, may through time (BPTT) . To retrace computations in map onto an unrealistic feedforward model (Fig. 2), where reverse through cycles, the RNN is unrolled along time, so 13 as to convert it into a feedforward network whose depth computational path to this state. Marino et al. recently depends on the number of time steps as shown in Fig. 1b- proposed iterative amortized inference, training inference d. This enables the RNN to be trained like an FNN. networks to have recurrent dynamics that improve the BPTT is attractive for enabling us to train RNNs like network’s hypotheses in each iteration, without constraining FNNs on arbitrary objectives. 
When it comes to learning these dynamics to a particular form (such as predictive recurrent dynamics, however, BPTT strictly optimizes the coding). More generally, RNNs whose dynamics converge output at the specific time points evaluated by the objective to a steady state can be optimized through variations on 136–138 (e.g., the output after exactly N steps). Outside of this time an algorithm known as recurrent backpropagation , window, there is no guarantee that the network’s response which avoids retracing the computational graph through will be well-behaved. The RNN might reach the desired time. However, it is often difficult to design RNNs such objective at the desired time, but diverge immediately after. that their dynamics converge to a steady state (within Ideally, we would like a visual RNN presented with a stable the time window for which the model is trained), while image to converge to an attractor that represents the image maintaining expressivity (the ability of the model to learn a and behave stably for arbitrary lengths of time. This would wide range of functions). This challenge is addressed by the be consistent with iterative optimization, in which each recently developed contractor recurrent backpropagation step improves the network’s approximation to its objective. method , which introduces a mathematical penalty that While it is not impossible for BPTT to give rise to such can be imposed while training any RNN, to encourage it to dynamics, it does not specifically favor them. learn convergent dynamics. From a theory perspective, BPTT is limiting because it shackles RNNs to the feedforward framework, in which the goal is still to map inputs to outputs, rather than to discover useful dynamics. 
From a practical and implementa- GOING FORWARD, IN CIRCLES tional perspective, BPTT is computationally cumbersome, as every additional recurrent time step extends the compu- tational path that must be retraced in order to update We started this review with the puzzling observation that, the connections. This complication also renders BPTT whereas biological vision is implemented in a profoundly biologically implausible. Although the case for backpropa- recurrent neural architecture, the most successful neural gation as potentially biologically plausible has recently been network models of vision to date are feedforward. We have 132–134 strengthened , its extension through time is difficult argued, theoretically and empirically, that vision models will to reconcile with biology or implement efficiently in eventually converge to their biological roots and implement a finite engineered system for online learning – precisely more powerful recurrent solutions. This is an appealing because it requires unrolling and keeping track of separate prospect, as it suggests that neuroscientists and engineers copies of each weight as computational cycles are retraced can continue to work synergistically, to make progress on in reverse. common challenges. After all, visual inference, and intel- Given these drawbacks, we speculate that a true ligence more generally, were solved once before, and so breakthrough in recurrent vision models will require a discovering nature’s solutions should go hand in hand with training regime that does not rely on BPTT. Rather than building artificial ones. optimizing an RNN’s state in a finite time window, future RNN training methods might directly target the network’s dynamics, or the states that those dynamics are encouraged ACKNOWLEDGEMENTS to converge to. This approach has some history in RNN models of vision. 
Predictive coding models, for instance, are designed with dynamics that explicitly implement We thank Samuel Lippl, Heiko Sch¨utt, Andrew Zaharia, Tal iterative optimization. Such models can update their Golan and Benjamin Peters for detailed comments on a draft connections through learning rules that require only the of this paper. This work was supported by a Rubicon grant converged network state as input , rather than the entire from the Dutch Research Council (to R.S.v.B.). 1 3 V. A. Lamme, P. R. Roelfsema, The distinct modes of A. Angelucci, P. C. Bressloff, Contribution of feedforward, vision offered by feedforward and recurrent processing, lateral and feedback connections to the classical receptive Trends in Neurosciences 23 (11) (2000) 571–579. field center and extra-classical receptive field surround of doi:10.1016/S0166-2236(00)01657-X. primate V1 neurons, in: Progress in Brain Research, Vol. 154, G. Kreiman, T. Serre, Beyond the feedforward sweep: 2006, pp. 93–120. doi:10.1016/S0079-6123(06)54005-1. feedback computations in the visual cortex, Annals of the J. C. Anderson, R. J. Douglas, K. A. C. Martin, J. C. New York Academy of Sciences 1464 (1) (2020) 222–241. Nelson, Synaptic output of physiologically identified doi:10.1111/nyas.14320. spiny stellate neurons in cat visual cortex, The Journal of Comparative Neurology 341 (1) (1994) 16–24. 14 doi:10.1002/cne.903410103. Computer Society Conference on Computer Vision and K. A. Martin, Microcircuits in visual cortex, Current Pattern Recognition 2016-December (2016) 4873–4882. Opinion in Neurobiology 12 (4) (2002) 418–425. arXiv:1512.00596, doi:10.1109/CVPR.2016.527. doi:10.1016/S0959-4388(02)00343-4. J. Kubilius, S. Bracci, H. P. Op de Beeck, Deep Neural R. J. Douglas, K. A. Martin, Recurrent neuronal circuits Networks as a Computational Model for Human Shape Sensi- in the neocortex, Current Biology 17 (13) (2007) 496–500. tivity, PLOS Computational Biology 12 (4) (2016) e1004896. 
doi:10.1016/j.cub.2007.04.024. doi:10.1371/journal.pcbi.1004896. 7 22 D. J. Felleman, D. C. Van Essen, Distributed hierarchical N. J. Majaj, D. G. Pelli, Deep learning-Using machine processing in the primate cerebral cortex, Cerebral Cortex learning to study biological vision, Journal of Vision 18 (13) 1 (1) (1991) 1–47. doi:10.1093/cercor/1.1.1. (2018) 1–13. doi:10.1167/18.13.2. 8 23 P. A. Salin, J. Bullier, Corticocortical connec- C. J. Spoerer, T. C. Kietzmann, J. Mehrer, I. Charest, tions in the visual system: structure and function, N. Kriegeskorte, Recurrent neural networks can explain Physiological Reviews 75 (1) (1995) 107–154. flexible trading of speed and accuracy in biological vision, doi:10.1152/physrev.1995.75.1.107. PLOS Computational Biology 16 (10) (2020) e1008215. N. T. Markov, M. M. Ercsey-Ravasz, A. R. Ribeiro Gomes, doi:10.1371/journal.pcbi.1008215. C. Lamy, L. Magrou, J. Vezoli, P. Misery, A. Falchier, C. F. Cadieu, H. Hong, D. L. K. Yamins, N. Pinto, R. Quilodran, M. A. Gariel, J. Sallet, R. Gamanut, D. Ardila, E. A. Solomon, N. J. Majaj, J. J. DiCarlo, C. Huissoud, S. Clavagnier, P. Giroud, D. Sappey-Marinier, Deep Neural Networks Rival the Representation of P. Barone, C. Dehay, Z. Toroczkai, K. Knoblauch, D. C. Primate IT Cortex for Core Visual Object Recognition, Van Essen, H. Kennedy, A weighted and directed interareal PLoS Computational Biology 10 (12) (2014) e1003963. connectivity matrix for macaque cerebral cortex, Cerebral doi:10.1371/journal.pcbi.1003963. Cortex 24 (1) (2014) 17–36. doi:10.1093/cercor/bhs270. S. M. Khaligh-Razavi, N. Kriegeskorte, Deep Supervised, but R. J. Douglas, C. Koch, M. Mahowald, K. A. Not Unsupervised, Models May Explain IT Cortical Repre- Martin, H. H. Suarez, Recurrent excitation in neocor- sentation, PLoS Computational Biology 10 (11) (2014). tical circuits, Science 269 (5226) (1995) 981–985. doi:10.1371/journal.pcbi.1003915. doi:10.1126/science.7638624. U. Guclu, M. A. J. van Gerven, Deep Neural H. 
Sup`er, H. Spekreijse, V. A. Lamme, Two distinct modes Networks Reveal a Gradient in the Complexity of of sensory processing observed in monkey primary visual Neural Representations across the Ventral Stream, cortex (VI), Nature Neuroscience 4 (3) (2001) 304–310. Journal of Neuroscience 35 (27) (2015) 10005–10014. doi:10.1038/85170. doi:10.1523/JNEUROSCI.5023-14.2015. 12 27 V. Di Lollo, J. T. Enns, R. A. Rensink, Compe- N. Kriegeskorte, Deep Neural Networks: A New Framework tition for consciousness among visual events: The for Modeling Biological Vision and Brain Information psychophysics of reentrant visual processes, Journal of Exper- Processing, Annual Review of Vision Science 1 (1) (2015) imental Psychology: General 129 (4) (2000) 481–507. 417–446. doi:10.1146/annurev-vision-082114-035447. doi:10.1037/0096-3445.129.4.481. S. R. Kheradpisheh, M. Ghodrati, M. Ganjtabesh, V. A. Lamme, K. Zipser, H. Spekreijse, Masking interrupts T. Masquelier, Deep Networks Can Resemble Human Feed- figure-ground signals in V1, Journal of Vision 1 (3) (2001) forward Vision in Invariant Object Recognition, Scientific 1044–1053. doi:10.1167/1.3.32. Reports 6 (1) (2016) 32672. doi:10.1038/srep32672. 14 29 K. Heinen, J. Jolij, V. A. Lamme, Figure-ground segregation M. Schrimpf, J. Kubilius, H. Hong, N. J. Majaj, R. Rajal- requires two distinct periods of activity in VI: A transcranial ingham, E. B. Issa, K. Kar, P. Bashivan, J. Prescott-Roy, magnetic stimulation study, NeuroReport 16 (13) (2005) F. Geiger, K. Schmidt, D. L. K. Yamins, J. J. DiCarlo, Brain- 1483–1487. doi:10.1097/01.wnr.0000175611.26485.c8. score: Which artificial neural network for object recognition J. J. Fahrenfort, H. S. Scholte, V. A. Lamme, Masking is most brain-like?, bioRxiv (2020). doi:10.1101/407007. disrupts reentrant processing in human visual cortex, R. P. N. Rao, D. H. Ballard, Predictive coding in the visual Journal of Cognitive Neuroscience 19 (9) (2007) 1488–1497. 
cortex: a functional interpretation of some extra-classical doi:10.1162/jocn.2007.19.9.1488. receptive-field effects., Nature neuroscience 2 (1) (1999) 79– Y. Lecun, Y. Bengio, G. Hinton, Deep learning, Nature 87. doi:10.1038/4580. 521 (7553) (2015) 436–444. doi:10.1038/nature14539. A. Yuille, D. Kersten, Vision as Bayesian inference: analysis J. Schmidhuber, Deep learning in neural networks: by synthesis?, Trends in Cognitive Sciences 10 (7) (2006) An overview, Neural Networks 61 (2015) 85–117. 301–308. doi:10.1016/j.tics.2006.05.002. doi:10.1016/j.neunet.2014.09.003. K. Friston, S. Kiebel, Predictive coding under the free- K. He, X. Zhang, S. Ren, J. Sun, Delving Deep into Recti- energy principle, Philosophical Transactions of the Royal Society B: Biological Sciences 364 (1521) (2009) 1211–1221. fiers: Surpassing Human-Level Performance on ImageNet Classification, in: 2015 IEEE International Conference on doi:10.1098/rstb.2008.0300. Computer Vision (ICCV), Vol. 2015 Inter, IEEE, 2015, pp. S. J. D. Prince, Computer Vision: Models, Learning and 1026–1034. doi:10.1109/ICCV.2015.123. Inference, Cambridge University Press, Cambridge, 2012. K. He, X. Zhang, S. Ren, J. Sun, Deep residual doi:10.1017/CBO9780511996504. learning for image recognition, Proceedings of the IEEE J. J. DiCarlo, D. Zoccolan, N. C. Rust, How does the brain Computer Society Conference on Computer Vision and solve visual object recognition?, Neuron 73 (3) (2012) 415– Pattern Recognition 2016-December (2016) 770–778. 434. doi:10.1016/j.neuron.2012.01.010. doi:10.1109/CVPR.2016.90. M. Carandini, D. J. Heeger, Normalization as a canonical I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, neural computation, Nature Reviews Neuroscience 13 (1) E. Brossard, The MegaFace benchmark: 1 million (2012) 51–62. doi:10.1038/nrn3136. faces for recognition at scale, Proceedings of the IEEE 15 36 54 R. Desimone, J. Duncan, Neural Mechanisms of Selective P. Mamassian, R. 
Goutcher, Prior knowledge on the Visual Attention, Annual Review of Neuroscience 18 (1) illumination position, Cognition 81 (1) (2001) 1–9. (1995) 193–222. doi:10.1146/annurev.neuro.18.1.193. doi:10.1016/S0010-0277(01)00116-0. 37 55 S. Kastner, L. G. Ungerleider, Mechanisms of A. R. Girshick, M. S. Landy, E. P. Simoncelli, Cardinal rules: Visual Attention in the Human Cortex, Annual visual orientation perception reflects knowledge of environ- Review of Neuroscience 23 (1) (2000) 315–341. mental statistics., Nature neuroscience 14 (7) (2011) 926– doi:10.1146/annurev.neuro.23.1.315. 32. doi:10.1038/nn.2831. 38 56 J. H. Maunsell, S. Treue, Feature-based attention in visual U. Hasson, E. Yang, I. Vallines, D. J. Heeger, N. Rubin, cortex, Trends in Neurosciences 29 (6) (2006) 317–322. A Hierarchy of Temporal Receptive Windows in Human doi:10.1016/j.tins.2006.04.001. Cortex, Journal of Neuroscience 28 (10) (2008) 2539–2550. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, doi:10.1523/JNEUROSCI.5487-07.2008. A. N. Gomez, L. u. Kaiser, I. Polosukhin, Attention is J. D. Murray, A. Bernacchia, D. J. Freedman, R. Romo, J. D. all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, Wallis, X. Cai, C. Padoa-Schioppa, T. Pasternak, H. Seo, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), D. Lee, X.-J. Wang, A hierarchy of intrinsic timescales across Advances in Neural Information Processing Systems 30, primate cortex, Nature Neuroscience 17 (12) (2014) 1661– Curran Associates, Inc., 2017, pp. 5998–6008. doi:10.1038/nn.3862. 40 58 Q. Liao, T. Poggio, Bridging the Gaps Between Residual I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence Learning, Recurrent Neural Networks and Visual Cortex (047) learning with neural networks, Advances in Neural Infor- (2016) 1–16. arXiv:1604.03640. mation Processing Systems 4 (January) (2014) 3104–3112. S. Jastrz¸ebski, D. Arpit, N. Ballas, V. Verma, T. Che, arXiv:1409.3215. Y. 
Bengio, Residual Connections Encourage Iterative R. E. Kalman, A New Approach to Linear Filtering and Inference (2017). arXiv:1710.04773. Prediction Problems, Journal of Basic Engineering 82 (1) K. Greff, R. K. Srivastava, J. Schmidhuber, Highway and (1960) 35–45. doi:10.1115/1.3662552. Residual Networks learn Unrolled Iterative Estimation, 5th D. Wolpert, Z. Ghahramani, M. Jordan, An internal model International Conference on Learning Representations, ICLR for sensorimotor integration, Science 269 (5232) (1995) 2017 - Conference Track Proceedings (2015) (2016) 1–14. 1880–1882. doi:10.1126/science.7569931. arXiv:1612.07771. R. P. N. Rao, D. H. Ballard, Dynamic model of visual recog- G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, nition predicts neural response properties in the visual cortex, Densely connected convolutional networks, Proceedings - Neural computation 9 (November 1995) (1997) 721–763. 30th IEEE Conference on Computer Vision and Pattern doi:10.1162/neco.1997.9.4.721. Recognition, CVPR 2017 2017-January (2017) 2261–2269. R. P. N. Rao, Bayesian computation in recurrent neural doi:10.1109/CVPR.2017.243. circuits., Neural computation 16 (1) (2004) 1–38. 44 63 P. Dayan, L. F. Abbott, Theoretical Neuroscience, MIT S. Den`eve, J.-R. Duhamel, A. Pouget, Optimal Press, Cambridge, MA, 2001. Sensorimotor Integration in Recurrent Cortical M. S. Advani, A. M. Saxe, High-dimensional dynamics Networks: A Neural Implementation of Kalman Filters, of generalization error in neural networks (2017) 1– Journal of Neuroscience 27 (21) (2007) 5744–5756. 32arXiv:1710.03667. doi:10.1523/JNEUROSCI.3985-06.2007. 46 64 M. Belkin, D. Hsu, S. Ma, S. Mandal, Reconciling modern J.-J. Orban de Xivry, S. Coppe, G. Blohm, P. 
Lefevre, machine-learning practice and the classical bias–variance Kalman Filtering Naturally Accounts for Visually trade-off, Proceedings of the National Academy of Sciences Guided and Predictive Smooth Pursuit Dynamics, of the United States of America 116 (32) (2019) 15849– Journal of Neuroscience 33 (44) (2013) 17301–17313. 15854. doi:10.1073/pnas.1903070116. doi:10.1523/JNEUROSCI.2321-13.2013. 47 65 P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, O.-S. Kwon, D. Tadin, D. C. Knill, Unifying account of I. Sutskever, Deep Double Descent: Where Bigger Models visual motion and position perception, Proceedings of the and More Data Hurt (2019). arXiv:1912.02292. National Academy of Sciences 112 (26) (2015) 8142–8147. W. Rawat, Z. Wang, Deep Convolutional Neural doi:10.1073/pnas.1500361112. Networks for Image Classification: A Comprehensive R. S. van Bergen, J. F. M. Jehee, Probabilistic Represen- Review, Neural Computation 29 (9) (2017) 2352–2449. tation in Human Visual Cortex Reflects Uncertainty in Serial doi:10.1162/neco_a_00990. Decisions, The Journal of neuroscience : the official journal C. D. Gilbert, W. Li, Top-down influences on visual of the Society for Neuroscience 39 (41) (2019) 8164–8176. processing, Nature Reviews Neuroscience 14 (5) (2013) 350– doi:10.1523/JNEUROSCI.3212-18.2019. 363. doi:10.1038/nrn3476. A. Graves, A.-R. Mohamed, G. Hinton, Speech recog- C. Summerfield, T. Egner, Expectation (and attention) in nition with deep recurrent neural networks, in: 2013 visual cognition, Trends in Cognitive Sciences 13 (9) (2009) IEEE International Conference on Acoustics, Speech and 403–409. doi:10.1016/j.tics.2009.06.003. Signal Processing, no. 3, IEEE, 2013, pp. 6645–6649. H. von Helmholtz, Handbuch der physiologischen Optik, doi:10.1109/ICASSP.2013.6638947. Dover (English translation), New York, 1860/1962. H. Sak, A. Senior, F. Beaufays, Long Short-Term Memory Y. Weiss, E. P. Simoncelli, E. H. 
Adelson, Motion illusions Based Recurrent Neural Network Architectures for Large as optimal percepts, Nature Neuroscience 5 (6) (2002) 598– Vocabulary Speech Recognition (2014). arXiv:1402.1128. 604. doi:10.1038/nn858. D. Bahdanau, K. Cho, Y. Bengio, Neural Machine Trans- A. A. Stocker, E. P. Simoncelli, Noise characteristics and lation by Jointly Learning to Align and Translate, 3rd Inter- prior expectations in human visual speed perception, Nature national Conference on Learning Representations, ICLR 2015 Neuroscience 9 (4) (2006) 578–585. doi:10.1038/nn1669. - Conference Track Proceedings (2014). arXiv:1409.0473. 16 K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, breaks text-based CAPTCHAs, Science 358 (6368) (2017). F. Bougares, H. Schwenk, Y. Bengio, Learning Phrase doi:10.1126/science.aag2612. Representations using RNN Encoder-Decoder for Statistical S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Machine Translation, Journal of Clinical Microbiology 28 (4) Z. Su, D. Du, C. Huang, P. H. Torr, Conditional random fields (2014) 828–829. arXiv:1406.1078. as recurrent neural networks, Proceedings of the IEEE Inter- M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, national Conference on Computer Vision 2015 Inter (2015) S. Chopra, Video (language) modeling: a baseline for gener- 1529–1537. doi:10.1109/ICCV.2015.179. ative models of natural videos (2014). arXiv:1412.6604. C. J. Spoerer, P. McClure, N. Kriegeskorte, Recurrent convo- N. Srivastava, E. Mansimov, R. Salakhutdinov, Unsupervised lutional neural networks: A better model of biological object Learning of Video Representations using LSTMs (2015). recognition, Frontiers in Psychology 8 (SEP) (2017) 1–14. arXiv:1502.04681. doi:10.3389/fpsyg.2017.01551. 73 90 W. Lotter, G. Kreiman, D. Cox, Deep Predictive Coding N. Montobbio, L. Bonnasse-Gahot, G. Citti, A. 
Sarti, Networks for Video Prediction and Unsupervised Learning KerCNNs: biologically inspired lateral connections for classi- arXiv:1605.08104. fication of corrupted images (2019). arXiv:1910.08336. (2016). 74 91 A. Pouget, J. Beck, W. J. Ma, P. Latham, Probabilistic C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, brains: knowns and unknowns., Nature neuroscience 16 (9) I. Goodfellow, R. Fergus, Intriguing properties of neural (2013) 1170–8. doi:10.1038/nn.3495. networks, 2nd International Conference on Learning Repre- W. J. Ma, M. Jazayeri, Neural Coding of Uncertainty and sentations, ICLR 2014 - Conference Track Proceedings Probability., Annual Review of Neuroscience 37 (2014) 205– (2014). arXiv:1312.6199. 220. doi:10.1146/annurev-neuro-071013-014017. R. Geirhos, C. Michaelis, F. A. Wichmann, P. Rubisch, G. Orb´an, P. Berkes, J. Fiser, M. Lengyel, Neural M. Bethge, W. Brendel, Imagenet-trained CNNs are biased Variability and Sampling-Based Probabilistic Representa- towards texture; increasing shape bias improves accuracy and tions in the Visual Cortex, Neuron 92 (2) (2016) 530–543. robustness, 7th International Conference on Learning Repre- doi:10.1016/j.neuron.2016.09.038. sentations, ICLR 2019 (c) (2019) 1–22. arXiv:1811.12231. 77 93 M. Boerlin, C. K. Machens, S. Den`eve, Predictive J. H. Jacobsen, J. Behrmann, R. Zemel, M. Bethge, Coding of Dynamical Variables in Balanced Spiking Excessive invariance causes adversarial vulnerability, 7th Networks, PLoS Computational Biology 9 (11) (2013). International Conference on Learning Representations, ICLR doi:10.1371/journal.pcbi.1003258. 2019 (2019). arXiv:1811.00401. 78 94 D. G. Barrett, S. Den`eve, C. K. Machens, Optimal compen- J. Neyman, E. S. Pearson, IX. On the problem of the most sation for neuron loss, eLife 5 (e12454) (2016) 1–36. efficient tests of statistical hypotheses, Philosophical Trans- doi:10.7554/eLife.12454. actions of the Royal Society of London. Series A, Containing P. H. Schiller, B. L. 
Finlay, S. F. Volman, Short-term response Papers of a Mathematical or Physical Character 231 (694- variability of monkey striate neurons., Brain research 105 (2) 706) (1933) 289–337. doi:10.1098/rsta.1933.0009. (1976) 347–9. A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, A. Dean, The variability of discharge of simple cells in the cat A. Madry, Adversarial Examples Are Not Bugs, They Are striate cortex, Experimental Brain Research 44 (4) (1981). Features (2019). arXiv:1905.02175. doi:10.1007/BF00238837. Y. Li, J. Bradshaw, Y. Sharma, Are generative classi- Z. F. Mainen, T. J. Sejnowski, Reliability of spike timing in fiers more robust to adversarial attacks?, 36th International neocortical neurons., Science 268 (5216) (1995) 1503–6. Conference on Machine Learning, ICML 2019 2019-June A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet (2019) 6754–6783. arXiv:1802.06552. Classification with Deep Convolutional Neural Networks, L. Schott, J. Rauber, M. Bethge, W. Brendel, Towards the Advances In Neural Information Processing Systems (2012). first adversarially robust neural network model on MNIST, arXiv:1102.0183. Iclr 3 (2018) 1–16. arXiv:1805.09190. 83 98 D. J. Field, A. Hayes, R. F. Hess, Contour integration T. Golan, P. C. Raju, N. Kriegeskorte, Controversial stimuli: by the human visual system: evidence for a local ”associ- pitting neural networks against each other as models of ation field”., Vision research 33 (2) (1993) 173–93. human recognition (2019). arXiv:1911.09288. doi:10.1016/0042-6989(93)90156-q. T. S. Lee, D. Mumford, Hierarchical Bayesian inference in W. S. Geisler, J. S. Perry, B. J. Super, D. P. Gallogly, the visual cortex., Journal of the Optical Society of America. Edge co-occurrence in natural images predicts contour A, Optics, image science, and vision 20 (7) (2003) 1434–48. grouping performance, Vision Research 41 (6) (2001) 711– H. Wen, K. Han, J. Shi, Y. Zhang, E. Culurciello, Z. Liu, 724. doi:10.1016/S0042-6989(00)00277-7. 
Deep Predictive Coding Network for Object Recognition (2018). arXiv:1802.04762.

P. R. Roelfsema, Cortical algorithms for perceptual grouping, Annual Review of Neuroscience 29 (1) (2006) 203–227. doi:10.1146/annurev.neuro.29.051605.112939.

D. Linsley, J. Kim, V. Veerabadran, C. Windolf, T. Serre, Learning long-range spatial dependencies with horizontal gated recurrent units, in: S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Advances in Neural Information Processing Systems 31, Curran Associates, Inc., 2018, pp. 152–164.

D. George, W. Lehrach, K. Kansky, M. Lázaro-Gredilla, C. Laan, B. Marthi, X. Lou, Z. Meng, Y. Liu, H. Wang, A. Lavin, D. S. Phoenix, A generative vision model that trains with high data efficiency and [...]

[101] V. Srikumar, G. Kundu, D. Roth, On amortizing inference cost for structured prediction, EMNLP-CoNLL 2012 - 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Proceedings of the Conference (July) (2012) 1114–1124.

[102] A. Stuhlmüller, J. Taylor, N. Goodman, Learning stochastic inverses, in: C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems, Vol. 26, Curran Associates, Inc., 2013, pp. 3048–3056.

[103] C. Cremer, X. Li, D. Duvenaud, Inference suboptimality in variational autoencoders, 35th International Conference on Machine Learning, ICML 2018 3 (2018) 1749–1760. arXiv:1801.03558.

[104] J. Marino, Y. Yue, S. Mandt, Iterative amortized inference, 35th International Conference on Machine Learning, ICML 2018 8 (2018) 5444–5462. arXiv:1807.09356.

[105] R. D. Hjelm, K. Cho, J. Chung, R. Salakhutdinov, V. Calhoun, N. Jojic, Iterative refinement of the approximate posterior for directed belief networks, Advances in Neural Information Processing Systems (NIPS 2016) (2016) 4698–4706. arXiv:1511.06382.

[106] R. G. Krishnan, D. Liang, M. D. Hoffman, On the challenges of learning with inference networks on sparse, high-dimensional data, International Conference on Artificial Intelligence and Statistics, AISTATS 2018 84 (2018) 143–151. arXiv:1710.06085.

[107] M. Liang, X. Hu, Recurrent convolutional neural network for object recognition, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 07-12-June (2015) 3367–3375. doi:10.1109/CVPR.2015.7298958.

[108] K. Kar, J. Kubilius, K. Schmidt, E. B. Issa, J. J. DiCarlo, Evidence that recurrent circuits are critical to the ventral stream’s execution of core object recognition behavior, Nature Neuroscience 22 (6) (2019) 974–983. doi:10.1038/s41593-019-0392-5.

[109] A. Nayebi, D. Bear, J. Kubilius, K. Kar, S. Ganguli, D. Sussillo, J. J. DiCarlo, D. L. Yamins, Task-driven convolutional recurrent models of the visual system, Advances in Neural Information Processing Systems 2018-Decem (NeurIPS) (2018) 5290–5301.

[110] D. H. Ballard, Animate vision, Artificial Intelligence 48 (1) (1991) 57–86. doi:10.1016/0004-3702(91)90080-4.

[111] J. M. Findlay, I. D. Gilchrist, Active Vision, Oxford University Press, 2003. doi:10.1093/acprof:oso/9780198524793.001.0001.

[112] R. Bajcsy, Y. Aloimonos, J. K. Tsotsos, Revisiting active perception, Autonomous Robots 42 (2) (2018) 177–196. doi:10.1007/s10514-017-9615-3.

[113] S. J. Russell, Rationality and intelligence, Artificial Intelligence 94 (1-2) (1997) 57–77. doi:10.1016/S0004-3702(97)00026-X.

[114] S. J. Gershman, E. J. Horvitz, J. B. Tenenbaum, Computational rationality: A converging paradigm for intelligence in brains, minds, and machines, Science 349 (6245) (2015) 273–278. doi:10.1126/science.aac6076.

[115] T. L. Griffiths, F. Lieder, N. D. Goodman, Rational Use of Cognitive Resources: Levels of Analysis Between the Computational and the Algorithmic, Topics in Cognitive Science 7 (2) (2015) 217–229. doi:10.1111/tops.12142.

[116] J. Kubilius, M. Schrimpf, K. Kar, R. Rajalingham, H. Hong, N. Majaj, E. Issa, P. Bashivan, J. Prescott-Roy, K. Schmidt, A. Nayebi, D. Bear, D. L. Yamins, J. J. DiCarlo, Brain-like object recognition with high-performing shallow recurrent ANNs, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 12805–12816.

[117] J. Deng, W. Dong, R. Socher, L.-J. Li, Kai Li, Li Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.

[118] T. C. Kietzmann, C. J. Spoerer, L. K. A. Sörensen, R. M. Cichy, O. Hauk, N. Kriegeskorte, Recurrence is required to capture the representational dynamics of the human visual system, Proceedings of the National Academy of Sciences 116 (43) (2019) 201905544. doi:10.1073/pnas.1905544116.

[119] H. Tang, M. Schrimpf, W. Lotter, C. Moerman, A. Paredes, J. O. Caro, W. Hardesty, D. Cox, G. Kreiman, Recurrent computations for visual pattern completion, Proceedings of the National Academy of Sciences of the United States of America 115 (35) (2018) 8835–8840. doi:10.1073/pnas.1719397115.

[120] J. T. Enns, V. Di Lollo, What’s new in visual masking?, Trends in Cognitive Sciences 4 (9) (2000) 345–352. doi:10.1016/S1364-6613(00)01520-5.

[121] A. M. Fyall, Y. El-Shamayleh, H. Choi, E. Shea-Brown, A. Pasupathy, Dynamic representation of partially occluded objects in primate prefrontal and visual cortex, eLife 6 (2017) 1–25. doi:10.7554/eLife.25784.

[122] H. Choi, A. Pasupathy, E. Shea-Brown, Predictive Coding in Area V4: Dynamic Shape Discrimination under Partial Occlusion, Neural Computation 30 (5) (2018) 1209–1257. doi:10.1162/neco_a_01072.

[123] D. M. Levi, Crowding—An essential bottleneck for object recognition: A mini-review, Vision Research 48 (5) (2008) 635–654. doi:10.1016/j.visres.2007.12.009.

[124] M. Manassi, B. Sayim, M. H. Herzog, Grouping, pooling, and when bigger is better in visual crowding, Journal of Vision 12 (10) (2012) 13–13. doi:10.1167/12.10.13.

[125] M. Manassi, S. Lonchampt, A. Clarke, M. H. Herzog, What crowding can tell us about object representations, Journal of Vision 16 (3) (2016) 35. doi:10.1167/16.3.35.

[126] A. Doerig, A. Bornet, O. Choung, M. Herzog, Crowding reveals fundamental differences in local vs. global processing in humans and machines, Vision Research 167 (August 2019) (2020) 39–45. doi:10.1016/j.visres.2019.12.006.

[127] S. Sabour, N. Frosst, G. E. Hinton, Dynamic routing between capsules, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30, Curran Associates, Inc., 2017, pp. 3856–3866.

[128] S. Sabour, N. Frosst, G. E. Hinton, Matrix capsules with EM routing, ICLR 2018 (2011) (2018) 1–12. arXiv:1710.09829.

[129] J. K. O’Regan, A. Noë, A sensorimotor account of vision and visual consciousness, Behavioral and Brain Sciences 24 (5) (2001) 939–973. doi:10.1017/S0140525X01000115.

[130] G. Buzsáki, The Brain from Inside Out, Oxford University Press, 2019. doi:10.1093/oso/9780190905385.001.0001.

[131] P. Werbos, Backpropagation through time: what it does and how to do it, Proceedings of the IEEE 78 (10) (1990) 1550–1560. doi:10.1109/5.58337.

[132] J. Guerguiev, T. P. Lillicrap, B. A. Richards, Towards deep learning with segregated dendrites, eLife 6 (2017) 1–37. doi:10.7554/eLife.22901.

[133] J. Sacramento, R. Ponte Costa, Y. Bengio, W. Senn, Dendritic cortical microcircuits approximate the backpropagation algorithm, in: S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Advances in Neural Information Processing Systems 31, Curran Associates, Inc., 2018, pp. 8721–8732.

[134] J. C. Whittington, R. Bogacz, Theories of Error Back-Propagation in the Brain, Trends in Cognitive Sciences 23 (3) (2019) 235–250. doi:10.1016/j.tics.2018.12.005.

[135] T. P. Lillicrap, A. Santoro, Backpropagation through time and the brain, Current Opinion in Neurobiology 55 (2019) 82–89. doi:10.1016/j.conb.2019.01.011.

[136] L. Almeida, A learning rule for asynchronous perceptrons with feedback in a combinatorial environment, Proceedings, First International Conference on Neural Networks 2 (1987) 609–618.

[137] F. J. Pineda, Generalization of back-propagation to recurrent neural networks, Physical Review Letters 59 (19) (1987) 2229–2232. doi:10.1103/PhysRevLett.59.2229.

[138] R. Liao, Y. Xiong, E. Fetaya, L. Zhang, K. J. Yoon, X. Pitkow, R. Urtasun, R. Zemel, Reviving and improving recurrent back-propagation, 35th International Conference on Machine Learning, ICML 2018 7 (2018) 4807–4820. arXiv:1803.06396.

[139] D. Linsley, A. K. Ashok, L. N. Govindarajan, R. Liu, T. Serre, Stable and expressive recurrent vision models (2020). arXiv:2005.11362.

Journal

Quantitative Biology, arXiv (Cornell University)

Published: Mar 26, 2020

References