Going in circles is the way forward: the role of recurrence in visual inference

Ruben S. van Bergen (1), Nikolaus Kriegeskorte (1–4)

(1) Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY, United States
(2) Department of Psychology, Columbia University, New York, NY, United States
(3) Department of Neuroscience, Columbia University, New York, NY, United States
(4) Affiliated member, Electrical Engineering, Columbia University, New York, NY, United States

arXiv:2003.12128v3 [q-bio.NC] 16 Nov 2020

Biological visual systems exhibit abundant recurrent connectivity. State-of-the-art neural network models for visual recognition, by contrast, rely heavily or exclusively on feedforward computation. Any finite-time recurrent neural network (RNN) can be unrolled along time to yield an equivalent feedforward neural network (FNN). This important insight suggests that computational neuroscientists may not need to engage recurrent computation, and that computer-vision engineers may be limiting themselves to a special case of FNN if they build recurrent models. Here we argue, to the contrary, that FNNs are a special case of RNNs and that computational neuroscientists and engineers should engage recurrence to understand how brains and machines can (1) achieve greater and more flexible computational depth, (2) compress complex computations into limited hardware, (3) integrate priors and priorities into visual inference through expectation and attention, (4) exploit sequential dependencies in their data for better inference and prediction, and (5) leverage the power of iterative computation.

INTRODUCTION

The primate visual cortex uses a recurrent algorithm to process sensory input [1–3]. Anatomically, connectivity is cyclic. Neurons are connected in cycles within local cortical circuits [4–6]. Global inter-area connections are dense and mostly bidirectional [7–9]. Physiologically, the dynamics of neural responses bear temporal signatures indicative of recurrent processing [1,10,11]. Behaviorally, visual perception can be disturbed by carefully timed interventions that coincide with the arrival of re-entrant information to a visual area [12–15]. The evidence for recurrent computation in the primate brain, thus, is unequivocal. What is less obvious, however, is why the brain uses a recurrent algorithm.

This question has recently been brought into sharper focus by the successes of deep feedforward neural network models (FNNs) [16,17]. These models now match or exceed human performance on certain visual tasks [18–20], and better predict primate recognition behavior [21–23] and neural activity [24–29] than current alternative models.

Although computer vision and computational neuroscience both have a long history of recurrent models [30–33], feedforward models have earned a dominant status in both fields. How should we account for this discrepancy between brains and models?

One answer is that the discrepancy reflects the fact that brains and computer-vision systems operate on different hardware and under different constraints on space, time, and energy. Perhaps we have come to a point at which the two fields must go their separate ways. However, this answer is unsatisfying. Computational neuroscience must still find out how visual inference works in brains. And although engineers face quantitatively different constraints when building computer-vision systems, they, too, must care about the spatial, temporal, and energetic limitations their models must operate under when deployed in, for example, a smartphone. Moreover, as long as neural network models continue to dominate computer vision, more efficient hardware implementations are likely to be more similar to biological neural networks than current implementations using conventional processors and graphics processing units (GPUs).

A second explanation for the discrepancy is that the abundance of recurrent connections in cortex belies a superficial role in neural computation. Perhaps the core computations can be performed by a feedforward network, while recurrent processing serves more auxiliary and modulatory functions, such as divisive normalization and attention [36–39]. This perspective is convenient because it enables us to hold on to the feedforward model in our minds. The auxiliary and modulatory functions let us acknowledge recurrence without fundamentally changing the way we envision the algorithm of recognition.

However, there is a third and more exciting explanation for the discrepancy between recurrent brains and feedforward models: Although feedforward computation is powerful, a recurrent algorithm provides a fundamentally superior solution to the problem of visual inference, and this algorithm is implemented in primate visual cortex. This recurrent algorithm explains how primate vision can be so efficient in terms of space, time, energy, and data, while being so rich and robust in terms of the inferences and their generalization to novel environments.

In this review, we argue for the latter possibility, discussing a range of potential computational functions of recurrence and citing the evidence suggesting that the primate brain employs them. We aim to distinguish established from more speculative, and superficial from more profound forms of recurrence, so as to clarify the most exciting directions for future research that will close the gap between models and brains.

UNROLLING A RECURRENT NETWORK

What exactly do we mean when we say that a neural network – whether biological or artificial – is recurrent rather than feedforward? This may seem obvious, but it turns out that the distinction can easily be blurred. Consider the simple network in Fig. 1a. It consists of three processing stages, arranged hierarchically, which we will refer to as areas, by analogy to cortex. Each area contains a number of neurons (real or artificial) that apply fixed operations to their input. Visual input enters in the first area, where it undergoes some transformation, the result of which is passed as input to the second area, and so forth. Information travels exclusively in one direction – the "forward" direction, from input to output – and so this is an example of a feedforward architecture. Notably, the number of transformations between input and output is fixed, and equal to the number of areas in the network.

Now compare this to the architecture in Fig. 1b. Here, we have added lateral and feedback connections to the network. Lateral connections allow the output of an area to be fed back into the same area, to influence its computations in the next processing step. Feedback connections allow the output of an area to influence information processing in a lower area. There is some freedom in the order in which computations may occur in such a network. The order we illustrate here starts with a full feedforward pass through the network. In subsequent time steps, neural activations are updated in ascending order through the hierarchy, based on the activations that were computed in the previous time step. This order of operations can be seen more clearly if we "unroll" the network in time, as shown in Fig. 1c.
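The update order described above can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' model: each area is reduced to a single scalar unit with a tanh nonlinearity, the input is static, and the weight names (`f1`, `l1`, `b21`, and so on) are hypothetical. The first time step is a full feedforward pass; later steps update the areas in ascending order, with lateral and feedback input taken from the previous time step. The same weights are reused at every step, which is what unrolling in time makes visible.

```python
import math

def feedforward(x, w):
    """Single bottom-up pass through three scalar 'areas' (cf. Fig. 1a)."""
    a1 = math.tanh(w["f1"] * x)
    a2 = math.tanh(w["f2"] * a1)
    a3 = math.tanh(w["f3"] * a2)
    return a1, a2, a3

def run_recurrent(x, w, steps):
    """Run the recurrent net (cf. Fig. 1b) for `steps` >= 1 time steps.

    Returns one (a1, a2, a3) tuple per time step, i.e. the trace of the
    network unrolled in time (cf. Fig. 1c). Weight names are hypothetical:
    f* = forward, l* = lateral, b21 / b32 = feedback from area 2 to 1 / 3 to 2.
    """
    trace = [feedforward(x, w)]           # t = 1: full feedforward pass
    for _ in range(steps - 1):
        p1, p2, p3 = trace[-1]            # activations from the previous step
        # Ascending update: forward input comes from the current step,
        # lateral and feedback input from the previous step.
        a1 = math.tanh(w["f1"] * x + w["l1"] * p1 + w["b21"] * p2)
        a2 = math.tanh(w["f2"] * a1 + w["l2"] * p2 + w["b32"] * p3)
        a3 = math.tanh(w["f3"] * a2 + w["l3"] * p3)
        trace.append((a1, a2, a3))
    return trace

# Example with arbitrary illustrative weights, unrolled for three time steps.
weights = {"f1": 1.0, "f2": 0.8, "f3": 1.2,
           "l1": 0.3, "l2": 0.2, "l3": 0.1,
           "b21": 0.5, "b32": 0.4}
trace = run_recurrent(0.5, weights, steps=3)
```

The loop body is identical at every iteration, so running it for more steps deepens the computation without adding any parameters.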
In Fig. 1c, the network is unrolled for a fixed number of time steps (three). In fact, recurrent processing can be run for any desired duration before its output is read out – a notion we will return to later. Notice how this temporally unrolled, small network resembles a larger feedforward neural network with more connections and areas between its input and output. We can emphasize this recurrent-feedforward equivalence by interpreting the computational graph over time as a spatial architecture, and visually arranging the induced areas and connections in a linear spatial sequence – an operation we call unrolling in space (Fig. 1d). This results in a deep feedforward architecture with many skip connections between areas that are separated by more than one level in this new hierarchy, and with many connections that are exact copies of one another (sharing identical connection weights).

Thus, any finite-time RNN can be transformed into an equivalent FNN. But this should not be taken to mean that RNNs are a special case of FNNs. In fact, FNNs are a special case of finite-time RNNs (Fig. 2a), comprising those which happen to have no cycles. More practically, not every unrolled finite-time RNN is a realistic FNN (Fig. 2b). By realistic networks, we mean networks that conform to the real-world constraints the system must operate under. For computational neuroscience, a realistic network is one that fits in the brain of the animal and does not require a deeper network architecture or more processing steps than the animal can accommodate. For computer vision, a realistic network is one that can be trained and deployed on available hardware at the training and deployment stages.

FIG. 1: Unrolling recurrent neural networks. (a) A simple feedforward neural network. (b) The same network with lateral (blue) and feedback (red) connections added, to make it recurrent. (c) "Unrolling" the network in time clarifies the order of its computations. Here, the network is unrolled for three time steps before its output is read out, but we could choose to run the network for more or fewer steps. Areas are staggered from left to right to show the order in which their neural activities are updated. (d) Alternatively, we can unroll the recurrent network's time steps in space, by arranging the areas and connections from different time steps in a linear spatial sequence. Note how all arrows now once again point in the same (forward) direction, from input to output. Throughout panels (a-b), connections that are identical (sharing the same weight matrices) are indicated by corresponding symbols. (e) If we lift the weight-sharing constraints from the previous network, this induces a deep feedforward "super-model", which can implement the spatially-unrolled recurrent network as a special case. This more general architecture may include additional connections (examples shown as light gray arrows) not present in the spatially-unrolled recurrent net.

FIG. 2: Relationships between recurrent and feedforward networks. This figure illustrates relationships between discrete-time feedforward (FNN) and discrete-time recurrent (RNN) neural network models. (a) The architecture of any RNN can be reduced to an FNN by removing all its recurrent connections (e.g., going from Fig. 1b back to Fig. 1a), or equivalently, setting the weights of these connections to zero. Vice versa, any FNN can be expanded to an infinite variety of RNNs by adding lateral or feedback connections. Feedforward networks, thus, form an architectural subset of RNNs.
Here we specifically consider RNNs that accomplish their task in a finite number of time steps. These finite-time RNNs (ftRNNs) have the special property that they can be unrolled into equivalent FNNs. White points linked by arcs indicate pairs of computationally equivalent architectures. Thus, the feedforward NNs contain a subset of architectures that can be obtained by unrolling an ftRNN. (b) These sets of networks can be further subdivided into subsets that are or are not realistic to implement with the computational resources available for a brain or engineered device (areas below and above the dotted line, respectively). Deeper networks and, more generally, networks with more neurons and connections tend to require more memory and computation to train and run. Some realistic ftRNNs remain realistic when expressed as an FNN (blue ellipse). Others, however, become too complex, when unrolled, to be feasible (black arc crossing the realism line). This is because the unrolling operation induces a much deeper architecture with many more neural connections to be stored. These not-realistically-unrollable ftRNNs are especially interesting, since they correspond to recurrent solutions that cannot be replaced by feedforward architectures.

For example, there may be limits on the storage and energy available, which would limit the complexity of the architecture and computational graph. A realistic finite-time RNN, when unrolled, can yield an unworkably deep FNN. Although the most widely used current method for training RNNs (backpropagation through time) requires unrolling, an RNN is not equivalent to its unrolled FNN twin at the stage of real-world deployment: the RNN's recurrent connections need not be physically duplicated, but can be reused across cycles of computation.

An important recent observation [40–42] is that the architecture that results from spatially unrolling a recurrent network resembles the architectures of state-of-the-art FNNs used in computer vision, which similarly contain skip connections and can be very deep. These deep FNNs may form a super-class of models (Fig. 1e), which reduce to "recurrent-equivalent" architectures when certain subsets of weights are constrained to be identical. Liao & Poggio showed that deep feedforward architectures known as residual networks (ResNets) are formally equivalent to recurrent architectures when certain connection weights are constrained to be identical. Moreover, when ResNets were trained with such recurrent-equivalent weight-sharing constraints, their performance on computer vision benchmarks was similar to unconstrained ResNets (even though the weight sharing drastically reduces the parameter count and limits the component computations that the network can perform). This is especially noteworthy given that ResNets, and architecturally related DenseNets, are currently among the top-ranking FNNs on prominent computer vision benchmarks [19,43], as well as measures of brain-similarity. Today's best artificial vision models, thus, actually implement computational graphs closely related to those of recurrent networks, even though these models are strictly feedforward architectures.

CONTINUOUS- VERSUS DISCRETE-TIME DYNAMICS

FNNs used in computer vision do not have meaningful dynamics. Each unit in the network instantaneously transforms its input into an output. This is in contrast to a feedforward network of biological neurons. When given a static input, biological neurons do not immediately produce their final responses. The movement of electric charges and neurotransmitters, and the opening and closing of ion channels takes time, so the network will gradually transition from its initial to its final state, with its trajectory continually perturbed by noise. Such continuous-time dynamics can be described by differential equations. When these cannot be solved analytically (as is typically the case), the dynamics can be simulated in discrete steps. In each step, the current state of each simulated neuron is updated. The future state of the network thus depends on its current state, as it does in an RNN. Consequently, the computational graph of the simulation algorithm contains loops from each neuron back to itself. Running the simulation over time amounts to unrolling this loopy computational graph, even though the network architecture did not contain loops.

Computational neuroscientists commonly study models of feedforward and recurrent neural networks with continuous-time dynamics. Here our focus is on neural network models that are motivated by the goal to capture computations, rather than their precise neural implementation. The discrete-time behavior of such a model is not derived from a continuous-time description in differential equations. Moreover, the model is optimized in its discrete-time implementation. However, an implicit assumption in the field is that such models could be implemented in biological brains, and thus in continuous-time dynamical systems.

REASONS TO RECUR

We have described how a recurrent network can be unrolled into a deep feedforward architecture. The resulting feedforward super-model offers greater computational flexibility, since weight-sharing constraints can be omitted and additional skip connections added to the network (Fig. 1e). So what would be the benefit of restricting ourselves to recurrent architectures? We will first discuss the benefits of recurrence in terms of overarching principles, before considering more specific implementations of these principles.

Recurrence provides greater and more flexible computational depth

Recurrence enables arbitrary computational depth

One important advantage of recurrent algorithms is that they can be run for any desired length of time before their output is collected. We can define computational depth as the maximum path length (i.e., number of successive connections and nonlinear transformations) between input and output. A recurrent neural network (RNN) can achieve arbitrary computational depth despite having a finite count of parameters and being limited to finite spatial components. In other words, it can multiply its limited spatial resources along time. These deeper computations can serve to expand on the number of hypotheses considered (in generative inference) or on the number of nonlinear features computed (in discriminative inference), or to extend the representation into the future or past, or to iteratively converge to a good estimate of some latent variable of interest.

Recurrence enables more flexible expenditure of energy and time in exchange for inferential accuracy

In addition to enabling an arbitrarily deep computation given enough time, an RNN can adjust its computational depth to the task at hand. The computational depth of a feedforward net, by contrast, is a fixed number determined by the architecture.

Spoerer et al. implemented a recurrent model that terminates computations when it reaches a confidence threshold (defined by the entropy of the posterior, a measure of the model's uncertainty). The model terminates rapidly for many images, but expends more time and energy on hard images to reach its confidence threshold. Adjusting the confidence threshold enables trading off speed for accuracy in terms of average performance. When compared to a range of FNNs requiring different amounts of computation, the RNN achieved roughly the same accuracy as each of the FNNs when its confidence threshold was set to match the FNN's computational cost (number of floating point operations) on average across images (Fig. 3). Flexible computational depth would be advantageous for animals, who may need to respond rapidly in some situations, must limit metabolic expenditures in general, and may benefit from slower and more energetically costly inferences when high accuracy is required. Computer vision faces similar requirements in certain applications. For example, a vision algorithm in a smartphone should respond rapidly and conserve energy in general, but should also be able to recognize hard images, and it should allow trading off mean accuracy for speed and energy (e.g., when the battery is low).

Recurrent architectures can compress complex computations in limited hardware

Another benefit of recurrent solutions is that they require fewer components in space when physically implemented in recurrent circuits, such as brains. Compare Figs. 1b and 1e: the recurrent network is anatomically more compact than the feedforward network and has fewer connections. It is easy to see why evolution might have favored recurrent implementations for many brain functions: Space, neural projections, and the energy to develop and maintain them are all costly for the organism. In addition, synaptic efficacies must be either learned from limited experience or encoded in a limited-capacity genome. Beyond saving space, material, and energy, thus, smaller descriptive complexity (or parameter count) might ease development and learning.

Engineered devices face the same set of costs, although their relative weighting changes from application to application. In particular, a larger number of units and weights must either be represented in the memory of a conventional computer or implemented in specialized (e.g., neuromorphic) hardware. The connection weights in an NN model need to be learned from limited data. This requires extensive training, e.g., in a supervised setting, with millions of hand-labeled examples that show the network the desired output for a given input. The larger number of parameters associated with a feedforward solution might overfit the training data. The learned parameters then do not generalize well to new examples of the same task.

FNNs often turn out to generalize surprisingly well even when they have very large numbers of parameters [45–47]. This phenomenon is thought to reflect a regularizing effect of the learning algorithm, stochastic gradient descent. Indeed, the trend is towards ever deeper networks with more connections to be optimized, and this trend is associated with continuing gains in performance on computer vision benchmarks. Nevertheless, it could turn out that recurrent architectures that achieve high computational depth with fewer parameters bring benefits not only in terms of their storage, but also in terms of statistical efficiency, the ability to generalize accurately based on limited experience. This would imply that recurrent networks have an inductive bias that makes up for the limited experiential data. This is explored further in subsequent sections, where we discuss how RNNs can exploit temporal dependency structures, and enable iterative inference.

Energy is another factor to consider in both biology and engineering. Larger FNNs take longer to train on bigger computing clusters, while drawing greater amounts of power – a trend that is not sustainable. In the long run, therefore, computer vision too may benefit from the anatomical compression that can be achieved through clever use of recurrence.

Importantly, however, not every deep feedforward model can be compressed into an equivalent recurrent implementation. This anatomical compression can only be achieved when the same function may be applied iteratively or recursively within the network. The crucial question, therefore, is: what are these functions? What operations can be applied repeatedly in a productive manner? The remainder of this paper will reflect on the various roles that have been proposed for recurrent processing for visual inference, from superficial to increasingly profound forms of recurrence.

Feedback connections are required to integrate information from outside the visual hierarchy

A key, established role of recurrent connections in biological vision is to propagate information from outside the visual cortex, so that it can aid visual inference. Here, we will briefly discuss two such outside influences: attention and expectations.

Attentional prioritization requires feedback connections

Animals have needs and goals that change from moment to moment. Perception is attuned to an animal's current objectives. For instance, a primate foraging for red berries may be more successful if its visual perception apparatus prioritizes or enhances the processing of red items. Since current goals are represented outside the visual cortex (e.g., in frontal regions), top-down connections are clearly required for this information to influence visual processing. Such top-down effects have been grouped under the label "attention", and they have been the subject of an entire sub-field of study. For our purposes, it is sufficient to note that the effects and mechanisms of top-down attention are well-documented and pervasive in visual cortex (for review, see [36–38]), and thus there is no question that this is one important function of recurrent connections.

Integrating prior expectations into visual inference requires feedback connections

Organisms may constrain their visual inferences by expectations. Visual input can be ambiguous and unreliable, and thus open to multiple interpretations. To constrain the inference, an observer can make use of prior knowledge [51–53]. One form of prior knowledge is environmental constants (e.g., "light tends to come from above"). Such unvarying knowledge may be stored within visual cortex, especially when it pertains to the overall prevalence of basic visual features (e.g., local edge orientations). Another form of prior knowledge is contextual information specific to the current situation. Such time-varying knowledge may require a flexible representation outside visual cortex (e.g., "I rang the doorbell at my mother's house, so I expect to see her open the door"). Such expectations, represented in higher cortical regions, require feedback connections to affect processing in visual cortex.

FIG. 3: Recurrence enables a network to trade speed for accuracy while approximately emulating the accuracies of feedforward models on average at matched computational cost. Circles denote the performance of a recurrent neural network (RNN) that was run for different numbers of time steps, until it achieved a desired threshold of image classification confidence (quantified by the entropy of the class probabilities in the final network layer). Squares correspond to three architecturally similar feedforward networks (FNN) with different computational costs. On the x-axis is the computational cost of running these models, measured by the number of floating point operations. For the feedforward models, this cost is fixed by the architecture. For the recurrent models, it is the average number of operations that was required to meet the given entropy threshold. The y-axis shows the classification accuracy achieved by each model. The performance of the recurrent model for different certainty thresholds follows a smooth curve, trading off computational cost (and thus computational speed) and accuracy. Note that this curve passes almost exactly through the cost-accuracy combinations achieved by the feedforward models.
Thus, a single recurrent model can emulate the performance of multiple feedforward models as it trades off speed and accuracy. When the confidence threshold of termination was set such that the RNN matched the accuracy of a given FNN, the RNN required a similar number of floating-point operations on average as the FNN. (Figure adapted with permission from the authors.)

The top-down imposition of attention and expectation must be mediated by feedback connections. However, it is unclear whether these influences fundamentally change the nature of visual representations or merely modulate these representations, adjusting the gain depending on the current relevance of different features of the visual input. As illustrated in Fig. 4a, for a given input this would require only two "sweeps" of computation through the visual processing hierarchy: a feedback sweep that primes visual areas with top-down information, and a bottom-up sweep to interpret the visual input and integrate or modify this interpretation with the top-down signal (not necessarily in that order). Importantly, if the feedback signal merely enhances or suppresses some visual features, then the core inference algorithm need not be fundamentally recurrent – one can imagine that the bottom-up part of such a network is modeled perfectly by an FNN, while an optional recurrent module could be added in order to implement top-down contextual influences.

[Figure 4 panels: (a) Two sweeps; (b) Kalman filter; (c) Iterative inference.]

FIG. 4: Increasingly profound modes of recurrent processing, unrolled in time. Visual cortex likely combines all three modes of recurrence illustrated here. The left side of each panel shows the computational graph induced by each form of recurrence, while the right side illustrates a (simplified) example of how this recurrence can be used. In these examples, circles correspond to neurons (or neural assemblies) encoding the feature illustrated within the circle, and lines that connect to circles indicate neural connections with significant activity. (a) Top-down influences from outside the visual processing hierarchy may be incorporated through two computational sweeps: a feedback sweep priming the network with top-down information and a feedforward sweep to interpret visual input and combine this interpretation with the top-down signal. Note that the lateral connections here merely copy neural activities in each area to the next time point; this identity transformation could also be implemented in other ways, such as slow membrane time constants or other forms of local memory. In the example on the right, a top-down signal communicates the expectation that the upcoming input will be horizontal motion. This primes neurons encoding this direction of motion to be more easily or strongly activated, and sharpens the interpretation of the subsequent (ambiguous) visual input. (b) To efficiently perform inference on time-varying visual input, recurrent connections may implement a fixed temporal prediction function akin to the transition kernel in a Kalman filter, extrapolating the ongoing dynamics of the world one time step into the future. For instance, in the example on the right, a downward moving square was perceived at t = 1. This motion is predicted to continue, and this prediction constrains the interpretation of the (ambiguous) visual input at the next time point. For simplicity, only lateral recurrence is shown in this example. Note that each input is mapped onto its corresponding output in a single recurrent time step. (c) Static input may also benefit from recurrent processing that iteratively refines an initial, coarse feedforward interpretation. In this mode of recurrence, there are several processing time steps between input and output, whereas in (b) there was one input and output for each time step. Illustrated on the right is an iterative hierarchical inference algorithm. Here, a higher-level hypothesis, generated in the first time step, refines the underlying lower-level representation in the next time step, which in turn improves the higher-level hypothesis, and so forth, until the network converges to an optimal interpretation of the input across the entire hierarchy. For simplicity, lateral recurrent interactions are not shown in this example.

Recurrent networks can exploit temporal dependency structure

Contextual constraints on visual inference include not only information from outside the visual hierarchy, such as information from other sensory modalities and memory, as discussed in the previous section. The recent stimulus history within the visual modality also provides context, likely represented within the visual system.

Recurrent networks can dynamically compress the stimulus history

The primate visual system is thought to contain a hierarchy, not only of processing stages and spatial scales, but also of temporal scales [56,57]. Visual representations track the environment moment by moment. However, the duration of a visual moment, the temporal grain, may depend on the level of representation. These principles apply to all sensory modalities and have been empirically explored, in particular, for audition and speech perception. At the simplest level, a neural network could use delay lines to detect spatiotemporal, rather than purely spatial, patterns. Recurrent neural networks have internal states and can represent temporal context across units tuned to different latencies. An RNN could represent a fixed temporal window, by replicating units tuned to different patterns for multiple latencies. However, RNNs trained on sequence processing tasks, such as language translation, learn more sophisticated representations of temporal context. They can represent context at multiple time scales, learning a latent representation that enables them to dynamically compress whatever information from the past is needed for the task. In contrast [...]

Recurrent dynamics can simulate and predict the dynamics of the world

Dynamic compression of the past exploits the temporal dependency structure of the sensory data. The purpose of representing the past is to act well in the future. This suggests that a neural network should exploit temporal dependencies not just to compress the past, but also to predict the future. In fact, an optimal representation of even just the present requires prediction, because the sensory data is delayed and noisy.

Changes in the world are governed by laws of dynamics, which by definition are temporally invariant. An ideal observer will exploit these laws in visual inference and optimally combine previous with present observations to estimate the current state. This implies an extrapolation of the past to generate predictions that improve the interpretation of the present sensory input. When the dynamics are linear and noise is Gaussian, the optimal way to infer the present state by combining past and present evidence is the Kalman filter – an algorithm widely used in engineering applications. A number of authors [60–63] have proposed that the visual cortex may implement an algorithm similar to a Kalman filter. This theory is consistent with temporal biases that are evident in human perceptual judgments [64–66].

Kalman filters employ a fixed temporal transitional kernel. This kernel takes a representation of the world (e.g., variables encoding the present state of a physical system, such as positions and velocities) at time t, and transforms it into a predicted representation for time t + 1, to be integrated with new sensory evidence that arrives at that time. While the resulting prediction varies as a function of the kernel's input, the kernel itself is constant, reflecting the temporal shift-invariance of the laws governing the dynamics. Recurrent neural networks provide a generalization of the Kalman filter and can represent nonlinear dynamical systems with non-Gaussian noise.

Note that this type of recurrent processing is more profound than the two-sweep algorithm (Fig. 4a) that incorporated top-down influences on visual inference. The two-sweep algorithm is trivial to unroll into a feedforward architecture. In contrast, unrolling a Kalman filter-like recurrent algorithm would induce an infinitely deep feedforward network, with a separate set of areas and connections for each time point to be processed. A finite-depth feedforward architecture can only approximate the recurrent algorithm. While the feedforward approximation will have a finite temporal window of memory to constrain its present inferences, the recurrent network can in principle integrate information over arbitrarily long periods.

Due to their advantages for dealing with time-varying (or otherwise ordered) inputs, recurrent neural networks are in fact widely employed in the broader field of machine learning for tasks involving sequential data.
Speech recognition and to a feedforward network, a recurrent network is not limited machine translation are prominent applications that RNNs 58,67–70 by spatial constraints in terms of its retrospective time excel at . Computer vision, too, has embraced RNNs 71–73 horizon. It can maintain task-relevant information indeﬁ- for recognition and prediction of video input . Note nitely, integrating long-term memory into its inferences. that these applications all exploit the dynamics in RNNs to 8 model the dynamics in the data. being subdivided into smaller hypotheses about lower or What if we trained a Kalman ﬁlter or sequence-to- intermediate-level features, such as the local edges that make up a larger contour. An iterative visual inference sequence RNN (Fig. 4b) on a train of independently sampled static inputs to be classiﬁed? The memory of the algorithm starts with an initial hypothesis, and reﬁnes it by incremental improvements. These improvements may preceding inputs would not be useful then, so we expect the recurrent model to revert to using essentially only its include eliminating hypotheses that are mutually exclusive, strengthening compatible causes, or adjusting a hypothesis feedforward weights. The type of recurrent processing we described in this section, thus uses memory to improve based on its ability to predict the data (the visual input). In a probabilistic framework, the optimization objective would visual inference. In the next section, we consider how recurrent processing can help with the inferential compu- be the likelihood (probability of the image given the latent representation) or the posterior probability (probability of tations themselves, even for static inputs. the latent representation given the image). 
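The Kalman-filter-like recurrence discussed in this section (Fig. 4b) can be sketched for the simplest possible case: a single scalar state with linear dynamics and Gaussian noise. This is a minimal sketch under stated assumptions, not a model of cortical computation. The transition coefficient `a` plays the role of the fixed temporal kernel, and `q` and `r` are assumed process and observation noise variances; all names and numerical values are illustrative.

```python
def kalman_step(mean, var, observation, a=1.0, q=0.01, r=1.0):
    """One recurrent time step of a scalar Kalman filter.

    a: fixed transition kernel (assumed linear dynamics x[t+1] = a * x[t] + noise)
    q: assumed process noise variance
    r: assumed observation noise variance
    """
    # Predict: extrapolate the previous estimate one time step into the future.
    predicted_mean = a * mean
    predicted_var = a * a * var + q
    # Update: integrate the prediction with the new sensory evidence.
    gain = predicted_var / (predicted_var + r)
    new_mean = predicted_mean + gain * (observation - predicted_mean)
    new_var = (1.0 - gain) * predicted_var
    return new_mean, new_var


# The same function is applied at every time step; only the state changes.
mean, var = 0.0, 1.0  # initial belief
for observation in [1.0] * 20:  # a constant stimulus, observed repeatedly
    mean, var = kalman_step(mean, var, observation)
```

Note that the loop carries only `mean` and `var` forward, so the memory cost is constant regardless of how many observations have been integrated, whereas an unrolled feedforward equivalent would need one stage per time step.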
Recurrence enables iterative inference Incompatible hypotheses can compete in the representation Recurrent processing can contribute even to inference on static inputs, and regardless of the agent’s goals and expec- There are often multiple plausible explanations for a given tations, by means of an iterative algorithm. An iterative sensory input that are mutually exclusive. The distributed, algorithm is one that employs a computation that improves parallel nature of neural networks enables them to initially an initial guess. Applying the computation again to the activate and represent all of these possible hypotheses simul- improved guess yields a further improvement. This process taneously. Recurrent connectivity between neurons can then can be repeated until a good solution has been achieved implement competitive interactions among hypotheses, so or until we run out of time or energy. Recurrent networks as to converge on the best overall explanation. can implement iterative algorithms, with the same neural There is some evidence that sensory representations are 74–76 network functions applied successively to some internal probabilistic – in this case, the probabilities assigned pattern of activity Fig. 4c). to a set of mutually exclusive hypotheses must sum to 1. In many ﬁelds, iterative algorithms are used to solve A strengthening of belief in one hypothesis, thus, should estimation and optimization problems. In each iteration, entail a reduction of the probability of other hypotheses in a small adjustment is made to the problem’s proposed the representation. If neurons encode point estimates rather solution, to improve a mathematically formulated objective. than probability distributions, then only one hypothesis A locally optimal solution is found by making small improve- can win (although that hypothesis may be encoded by ments until further progress is not required or not possible. a population response involving multiple neurons). 
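The notion of an iterative algorithm described here, in which the same computation is applied repeatedly to improve an initial guess with respect to a mathematically formulated objective, can be illustrated with gradient descent on a sum of squared errors. The sketch below is only a toy example: the data, the learning rate `lr`, the number of iterations, and the function names `objective` and `refine` are assumptions chosen for illustration.

```python
def objective(w, xs, ys):
    """Sum of squared prediction errors for the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))


def refine(w, xs, ys, lr=0.01):
    """One iteration: a small adjustment of the current guess along the negative gradient."""
    grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))
    return w - lr * grad


# Illustrative data (assumed), roughly following y = 2 * x.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

w = 0.0  # initial guess
history = [objective(w, xs, ys)]
for _ in range(50):
    w = refine(w, xs, ys)  # the same computation, applied again to the improved guess
    history.append(objective(w, xs, ys))

# For this simple problem a closed-form solution exists, which lets us check convergence.
w_closed_form = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

In a recurrent network, the role of `refine` would be played by a fixed set of connections applied at every time step, with the current guess held in the pattern of activity.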
The The algorithm navigates a path in the space of the values winning hypothesis could be the maximum a posteriori to be estimated or the parameters to be optimized, that (MAP) hypothesis or the maximum likelihood hypothesis. leads to a good solution (albeit not necessarily the global Inﬂuential models of visual inference involving compet- optimum). itive recurrent interactions include divisive normalization , 36 30,32,77 Much of machine learning involves iterative methods. biased competition , and predictive coding . Gradient descent is an iterative optimization method, whose Recent theoretical work has demonstrated that lateral stochastic variant is the most widely used method for competition can give rise to a robust neural code, and 77,78 training FNNs. Many discrete optimization techniques are can explain certain puzzling neural response properties . iterative. Iterative algorithms are also central to inference This theory considers a spiking neural network setting, in machine learning, for example in variational inference in which diﬀerent neurons encode highly overlapping or (where inference is achieved by optimization), sampling even identical features in their input. This degeneracy methods (where steps are chosen stochastically such that means that the same signal can be encoded equally well the distribution of samples converges on the posterior distri- by a range of diﬀerent response patterns. When a bution), and message passing algorithms (such as loopy particular neuron spikes, lateral inhibition ensures that belief propagation). In particular, such iterative inference other competing neurons do not encode the same part of algorithms are used in probabilistic approaches to computer the input again. Which neuron gets to do the encoding 31,33 vision . 
It is somewhat surprising, then, that iterative thus depends on which neuron ﬁres ﬁrst, because its computation is not widely exploited to perform visual membrane potential happened to be closest to a spiking inference in FNNs. threshold. This leads to trial-to-trial variability in neural Visual inference is naturally understood as an responses that reﬂects subtle diﬀerences in initial condi- optimization problem, where the goal is to ﬁnd hypotheses tions – conditions that may not be known to an experi- that can explain the current visual input . A hypothesis, menter, who may thus mistake this variability for random in this case, is a proposed set of latent (i.e. unobserved) noise. This could explain the puzzling observation that causes that can jointly explain the image. The hypothe- individual neurons reliably reproduce the same output given sized latent causes could be the identities and positions of the same electrical stimulation, but populations of neurons, objects in the scene. Visual hypotheses are hierarchical, wired together, display apparently random variability under 9 79–81 85 sensory stimulation . Since multiple neurons can encode perceptual grouping operations . Recent examples include the same feature, the resulting code is also robust to neurons Linsley et al., who developed horizontal gated-recurrent being lost or temporarily inactivated. units (hGRUs) that learn local spatial dependencies . A network equipped with this particular recurrent connectivity FNNs do not incorporate lateral connections for compet- was competitive with state-of-the-art feedforward models itive interactions, although they very often include compu- on a contour integration task, while using far fewer free tations that serve a similar purpose. Chief among these parameters. George et al. similarly leveraged lateral inter- are operations known as max-pooling and local response 16,82 normalization (LRN) . 
In max-pooling, only the actions to recognize contiguous contours and surfaces, by modeling these with a conditional random ﬁeld (CRF), using strongest response within a pool of competing neurons is forwarded to the next processing stage. In LRN, each a message-passing algorithm for inference. This approach made their Recursive Cortical Network (RCN) the ﬁrst neuron has its response divided by a term that is computed from the sum of activity in its normalization pool. While computer vision algorithm to reliably beat CAPTCHAs – images of letter sequences under a variety of distortions, neither of these mechanisms is mediated by explicit lateral noise and clutter, that are widely used to verify that queries connections in a FNN, a strictly connectionist implemen- to a user interface are made by a person, and not an tation of these mechanisms (e.g., in biological neurons or algorithm. CRFs were also used by Zheng et al. , who neuromorphic hardware) would have to include lateral recur- incorporated them as a recurrent extension of a convolu- rence. This, then, is another way in which apparently tional neural network for image segmentation. The model feedforward FNNs can exhibit a (limited) form of recurrent processing ”under the hood”. Note, though, that each surpassed state-of-the-art performance at the time. Associ- ation rules enforced through lateral connections may also of these operations is carried out only once, rather than allowing competitive dynamics to converge over multiple help to ﬁll in missing information, such as when objects are partially hidden from view by occluders. Lateral connec- iterations. Furthermore, in contrast to the lateral inter- actions in predictive coding or other normative models, tivity has been shown to improve recognition performance 23,89,90 in such settings . Montobbio et al. 
showed that LRN and max-pooling are not derived from normative lateral diﬀusion of activity between neurons with correlated principles, and do not necessarily select (or enhance) the feedforward ﬁlter weights improves robustness to image best hypothesis (however ”best” is deﬁned). perturbations including occlusions . Enhancement of mutually compatible hypotheses (this section) and competition between mutually exclusive Compatible hypotheses can strengthen each other in the hypotheses (previous section) can both contribute to representation inference. A more general perspective is provided by the insight that prior knowledge about what features in a scene In feedforward models of hierarchical visual inference, are mutually compatible or exclusive may be part of an neurons at higher stages selectively respond to combinations overarching generative model, which iterative algorithms of simpler features encoded by lower-level neurons. Higher- can exploit for inference. level neurons thus are sensitive to larger-scale patterns of correlation between subsets of lower-level features. But such larger-scale statistical regularities may not be most Iterative algorithms can leverage generative models for eﬃciently captured by a set of larger-scale building blocks. inference Instead, they may be more compactly captured by local association rules. Consider, for instance, the problem of contour detection. Many combinations of local edges in an Perceptual inference aims to converge on a set of hypotheses image can form a continuous contour. The resulting space that best explain the sensory data. Typically, a hypothesis is of contours may be too complex to be eﬃciently represented considered to be a good explanation if it is consistent with with larger-scale templates. What all these contours have in both our prior knowledge and the sensory data. 
A gener- common, however, is that they consist of pairs of edges that ative model is a model of the joint distribution of latent are locally contiguous, with sharper angles occurring with causes and sensory data. Generative models can powerfully lower probability. Thus, the criteria for ’contour-ness’ may constrain perceptual inference because they capture prior be compactly expressed by a set of local association rules: knowledge about the world. In machine learning, deﬁning 83,84 these edges go together; those do not . Contours may generative models enables us to express and exploit what then be pieced together by repeatedly applying the same we know about the domain. A wide range of inference local association rules. Those edge pairs which are most algorithms can be used to compute posterior distributions clearly connected would be identiﬁed in early iterations. over variables of interest, given observed variables. The Later inferences can beneﬁt from the context provided by algorithms include variational inference, message passing, and Markov Chain Monte Carlo sampling, all of which earlier inferences, enabling the process to recognize conti- nuity even where it is less locally apparent. require iterative computation. This insight has inspired network models of visual In this section, we focus on a particular approach to lever- inference that implement local association rules through aging generative models in visual inference, in which the lateral connections, to aid contour integration and other joint distribution p(x, z) of the image x and the latents z 10 is factorized as p(x, z) = p(z) · p(x|z), which we refer to either of the categories. An ideal observer should evaluate as the top-down factorization. The architecture contains the likelihood for each hypothesis and adjudicate according components that model p(x|z) and predict the image from to their ratio . 
A feedforward network may instead latch the latents (or more generally lower-level latent representa- on to a few highly discriminative, but subtle image features tions from higher-level latent representations). Compared that don’t explain much and may not generalize to images 93,95 to the alternative factorization p(x, z) = p(x) · p(z|x), the from a diﬀerent data set . In contrast, visual features top-down factorization has the potential advantage that the that are important for generating or reconstructing images model operates in the causal direction, matching the causal of a given class may be more likely to generalize to other process in the world that generated the image. The top- examples of the same category. In support of this intuition, down model predicts what visual input is likely to result two novel RNN architectures that employ generative models from a scene that has the hypothesized properties. This is for inference were found to be more robust to adversarial 96,97 somewhat similar to the graphics engine of a video game perturbations . Generative inference networks were also or image rendering software. This top-down model can be shown to better align with human perception, compared implemented via feedback connections that translate higher- to discriminative models, when presented with controversial level hypotheses in the network to representations at a lower stimuli – images synthesized to evoke strongly conﬂicting level of abstraction. classiﬁcations from diﬀerent models . Despite these promising developments, generative Using generative models implemented with top-down inference remains rare in visual FNN models. The predictions for inference is known as analysis-by-synthesis exceptions mentioned above are rather simple networks – an approach that has a long history in theories of 30,32,51 trained on easy classiﬁcations problems, and are not (yet) perception . 
Arguably, the goal of perceptual competitive with state-of-the-art performance on more inference, by deﬁnition, is to reason back from eﬀects challenging computer vision benchmarks. Within compu- (sensory data) to their causes (unobserved variables of tational neuroscience, by contrast, generative feedback interest), and thus invert the process that generated the connections appear in many network models of visual eﬀects. The crucial question, however, is whether the causal 30,32 inference. Prominent examples are predictive coding process is explicitly represented in the inference algorithm. and hierarchical Bayesian inference . However, these The alternative, which can be achieved with feedforward models have not had much success in explaining visual inference, is to directly approximate the inverse, without inference beyond its earliest stages. A notable exception is ever making predictions in the causal direction. The success work by Wen et al. , which shows that extending super- of the feedforward approach then depends on how well the vised convolutional FNNs with the recurrent dynamics of inverse can be approximated by a ﬁxed mapping of inputs predictive coding can improve classiﬁcation performance. to hypotheses. To iteratively invert the causal process, The ﬁelds of computer vision and computational neuro- a neural network can evaluate the causal model for a science both stand to beneﬁt from the development of more current hypothesis and update the hypothesis in a beneﬁcial powerful generative inference models. direction. This process can then be repeated until conver- gence. This process of analysis by repeated synthesis may be preferable to directly approximating the inverse mapping if the causal process that generates the sensory data is easier Iteration is necessary to close the amortization gap to model than its inverse. In particular, the causal process may be more compactly represented, more easily learned, Iterative inference has many advantages. 
A drawback of more eﬃcient to compute, and more generalizable beyond iteration, however, is that it takes time for the algorithm to the training distribution than its inverse. converge during inference. This is unattractive for animals who need to perform visual inference under time pressure. Another potential advantage of generative inference lies in robustness to variations in the input. While FNNs can It is also a challenge when training a FNN, which already requires many iterations of optimization. If each update of accurately categorize images drawn from the same distri- bution that the training images were drawn from, it does not the network’s connections additionally includes an iterative inner loop to perform inference on each training example, take much to fool them. A slight alteration imperceptible to this lengthens the time required for training. humans can cause a FNN to misclassify an image entirely, with high conﬁdence . State-of-the-art FNNs rely more A complementary inference mechanism is amortized 92 101,102 strongly on texture than humans, who rely more on shape . inference , where a feedforward model approximates More generally, FNNs seem to ignore many image features the mapping from images to their latent causes. FNNs that are relevant to human perception . One hypothe- are eminently suited for learning complicated input-output sized reason for this is that these networks are trained to mappings. A single transformation then replaces the trajec- discriminate images, but not to generate them. Thus, any tories that would be navigated by an iterative inference visual feature that reliably discriminates categories in the algorithm. In some cases, the iterative solution and the training data will be weighted heavily in the network’s classi- best amortized mapping may be exactly equivalent. A ﬁcation decisions. 
Importantly, this weight is unrelated to linear model, for instance, can be estimated iteratively, how much variance the feature explains in the image, and by performing gradient descent on the sum of squared to the likelihood, i.e. the probability of the image given prediction errors. However, if a unique solution exists, it 11 can equivalently be found by a linear transformation that illustrates how limited resources (the fovea) can be dynam- directly maps from the data to the optimal coeﬃcients. ically allocated (eye movements) to diﬀerent portions of the evidence (the visual scene) in temporal sequence. A In general, however, amortized inference incurs some error, compared to the optimal solution that might be found sensory system limited to a ﬁnite number of neurons, thus, can multiply its resources along time to achieve a detailed through iterative optimization. This error has been called 103,104 the amortization gap . It is analogous to the poor analysis. The cycle may start with an initial rough analysis of the entire visual ﬁeld, followed by ﬁxations on locations ﬁt that may result from buying clothes ”oﬀ the rack”, compared to a tailored version of the same garment. The likely to yield valuable information. This is an example of an essentially recurrent process whose eﬃciency cannot amortization gap is deﬁned in the context of variational inference, when the iterative optimization of the varia- be emulated with a feedforward system. The internal mechanisms of visual inference are faced with qualitatively tional approximation to the posterior is replaced by a neural network that maps from the image to the parameters of the similar challenges: Just like our retinae cannot aﬀord foveal resolution throughout the visual ﬁeld, the ventral stream variational distribution. 
The resulting model suﬀers from cannot aﬀord to perform all potentially relevant inferences two types of error: (1) error caused be the choice of the variational approximation (variational approximation gap) on the evidence streaming in through the optic nerve in a single feedforward sweep. Internal shifts of attention, like and (2) error caused by the model mapping from images to variational parameters (amortization gap). One recent eye movements, can sequentialize a complex computation and avoid wasting energy on portions of the evidence that study has argued that the amortization gap is often the main source of error in amortized inference models . are uninformative or irrelevant to the current goals of the animal. Amortized and iterative inference deﬁne a continuum. At Whereas the outer loop of active vision is largely about one extreme, iterative inference until convergence reaches positioning our eyes relative to the scene and bringing a solution through a trajectory of small improvements, important content into foveal vision, the inner loop of visual explicitly evaluating the quality of the current solution at inference on each glimpse is far more ﬂexible. Beyond covert every iteration. At the other extreme, fully amortized attentional shifts that select locations, features, or objects inference takes a single leap from input to output. In for scrutiny, a recurrent network can decide what computa- between these extremes lies a space for algorithms that use intermediate numbers of steps, to approximate the tions to perform so as to most eﬃciently reduce uncertainty about the important parts of the scene. In a game of twenty optimal solution through a computational path that is more reﬁned than a leap, but more eﬃcient than full- questions, we choose a question that most reduces our remaining uncertainty at each step. The budget of twenty ﬂedged iterative optimization. 
Models that occupy this space include explicit hybrids of iterative and amortized would not suﬃce if we had to decide all the questions before 104–106 seeing any answers. The visual system similarly has limited inference , as well as RNNs with arbitrary dynamics computational resources for processing a massive stream of that are trained to converge to a desired objective in a 23,107–109 evidence. It must choose what inferences to pursue on the limited number of time steps (e.g., ). basis of their computational cost and uncertainty-reducing 113–115 beneﬁt as it forages for insight . Recurrence is required for active vision CLOSING THE GAP BETWEEN BIOLOGICAL Vision is an active exploratory process. Our eye movements AND ARTIFICIAL VISION scan the scene through a sequence of well-chosen ﬁxations that bring objects of interest into foveal vision. Moving We have reviewed a number of advantages that recurrence our heads and our bodies enables us to bring entirely new can bring to neural networks for visual inference. Going parts of the scene into view, and closer for inspection at high forward, neural network models of vision should incorporate resolution. Active control of our eyes, heads, and bodies can recurrence; not just to better understand visual inference also help disambiguate 3D structure as ﬁxation on points in the brain, but also to improve its implementation in at diﬀerent depths changes binocular disparity, and head machines. and body movements create motion parallax. Active vision involves a recurrent cycle of sensory processing and muscle control, a cycle that runs through the environment. Recurrence already improves performance on Our focus here has been on the internal computational challenging visual tasks functions of recurrent processing, and active vision has been 110–112 reviewed elsewhere . 
However, it is important to note that the internal recurrent processes of visual inference from Eﬀorts in this direction are already underway, and turning a single glimpse are embedded within the larger recurrent up promising results. Some of this work has been described process of active visual exploration. Active vision provides in previous sections, such as the use of lateral connec- 86–88 not just the larger behavioral context of visual inference. tions to impose local association rules and generative It also provides a powerful illustration of the fundamental inference for more robust performance outside the training 96,97 advantages that recurrent algorithms oﬀer in general. It distribution . Several other recent ﬁndings are worth 12 highlighting here, as they have shown improved performance realism could refer to the real-world constraints faced by on visual tasks, better approximations to biological vision, either biological or artiﬁcial visual systems. Future studies or both, through recurrent computations. should compare RNN and FNN implementations for the same visual inference task, while matching the complexity In particular, several studies have found that recurrence of the models in a meaningful way. Setting a realistic is required in order to explain or improve visual inference budget of units, connections, and computational operations in challenging settings. Kar and colleagues identiﬁed a is one important approach. To understand the computa- set of ’challenge images’ that required recurrent processing tional diﬀerences between RNN and FNN solutions, it is in order to be accurately recognized. A feedforward also interesting to (1) match the parameter count (number FNN struggled to interpret these images, whereas macaque of connection weights that must be learned and stored), monkeys recognized them as accurately as a set of control which requires granting the FNN larger feature kernels, images. 
Challenge images were associated with longer processing times in the macaque inferior temporal (IT) cortex, consistent with recurrent computations. Neural responses in IT for images that took longer were well accounted for by a brain-inspired RNN model. In a different study, this same recurrent architecture was found to account for behavior, and neural data from macaque visual cortex, in object recognition tasks, while also achieving good performance on an important computer vision benchmark (ImageNet). In human visual cortex, recurrent interactions were also found to be crucial to model the neural dynamics underlying object recognition, as measured through magnetoencephalography (MEG).

One prominent challenge to visual inference is posed by partial occlusions, which hide part of a target object from view. In two recent studies, recurrent architectures were shown to be more robust to occlusions than their feedforward counterparts [89,119]. Interestingly, in both human observers and in an RNN model, object recognition under occlusion was impaired by backward masking (the presentation of a meaningless noise image, shortly after a target stimulus, to disrupt recurrent processing [13,15,120]). Neural responses to partially occluded shapes in macaque visual cortex are also consistent with recurrent processing, and were well explained by a predictive coding model in which prefrontal cortex provides a feedback signal to visual area V4 [121,122].

Another challenge for human perception is crowding, which occurs when the detailed perception of a target stimulus is disrupted by nearby flanker stimuli. In certain instances, the target stimulus can be released from crowding if further flankers are added that form a larger, coherent structure with the original flankers. This uncrowding effect may be due to the flankers being ‘explained away’, thus reducing their interference with the target representation [124,125]. Recent work [126] has shown that both effects can be explained by architectures known as Capsule Nets [127,128], which include recurrent information routing mechanisms that may be similar to perceptual grouping and segmentation processes in the visual cortex.

Note that, in all of these cases, it may be possible to develop a feedforward architecture that performs the task equally well or better. Trivially, and as we discussed previously, a successful recurrent architecture can always be unrolled (for a finite number of time steps) into a deep feedforward network with many more learnable connections. However, a realistic recurrent model, when unrolled, may map onto an unrealistic feedforward model (Fig. 2), where

more feature maps per layer, or more layers, or (2) match the computational graph, which equates the distribution of path lengths from input to output and all other statistics of the graph, but grants the FNN a much larger number of parameters.

Freeing ourselves from the feedforward framework

Deep feedforward neural networks constitute an essential building block for visual inference, but they are not the whole story. The missing element, recurrent dynamics, is central to a range of alternative conceptions of visual inference that have been proposed [31,110–112,129,130]. These ideas have a long history, they are essential to understanding biological vision, and they have great potential for engineering, especially in the context of modern hardware and software. The promise of active vision and recurrent visual inference is, in fact, boosted by the power of feedforward networks.

However, the beauty, power, and simplicity of feedforward neural networks also make it difficult to engage and develop the space of recurrent neural network algorithms for vision. The feedforward framework, embellished by recurrent processes that serve auxiliary and modulatory functions like normalization and attention, enables computational neuroscientists to hold on to the idea of a hierarchy of feature detectors. This idea might not be entirely mistaken. However, it is likely to be severely incomplete and ultimately limiting.

The insight that any finite-time recurrent network can be unrolled compounds the problem by suggesting that the feedforward framework is essentially complete. More practically, the fact that we train RNNs by unrolling them for finite time steps might in some ways impede our progress. FNNs are usually trained by stochastic gradient descent using the backpropagation algorithm. This method retraces in reverse the computational steps that led to the response in the output layer, so as to estimate the influence that each connection in the network had on the response. Each connection weight is then adjusted, to bring the network output closer to a desired output. The deeper the network, the longer the computational path that needs to be retraced. RNNs for visual inference typically are trained through a variation on this method, known as backpropagation through time (BPTT). To retrace computations in reverse through cycles, the RNN is unrolled along time, so as to convert it into a feedforward network whose depth depends on the number of time steps as shown in Fig. 1b-d. This enables the RNN to be trained like an FNN.

BPTT is attractive for enabling us to train RNNs like FNNs on arbitrary objectives. When it comes to learning recurrent dynamics, however, BPTT strictly optimizes the output at the specific time points evaluated by the objective (e.g., the output after exactly N steps). Outside of this time window, there is no guarantee that the network’s response will be well-behaved. The RNN might reach the desired objective at the desired time, but diverge immediately after. Ideally, we would like a visual RNN presented with a stable image to converge to an attractor that represents the image and behave stably for arbitrary lengths of time. This would be consistent with iterative optimization, in which each step improves the network’s approximation to its objective. While it is not impossible for BPTT to give rise to such dynamics, it does not specifically favor them.

From a theory perspective, BPTT is limiting because it shackles RNNs to the feedforward framework, in which the goal is still to map inputs to outputs, rather than to discover useful dynamics. From a practical and implementational perspective, BPTT is computationally cumbersome, as every additional recurrent time step extends the computational path that must be retraced in order to update the connections. This complication also renders BPTT biologically implausible. Although the case for backpropagation as potentially biologically plausible has recently been strengthened [132–134], its extension through time is difficult to reconcile with biology or implement efficiently in a finite engineered system for online learning, precisely because it requires unrolling and keeping track of separate copies of each weight as computational cycles are retraced in reverse.

Given these drawbacks, we speculate that a true breakthrough in recurrent vision models will require a training regime that does not rely on BPTT. Rather than optimizing an RNN’s state in a finite time window, future RNN training methods might directly target the network’s dynamics, or the states that those dynamics are encouraged to converge to. This approach has some history in RNN models of vision. Predictive coding models, for instance, are designed with dynamics that explicitly implement iterative optimization. Such models can update their connections through learning rules that require only the converged network state as input, rather than the entire computational path to this state. Marino et al. recently proposed iterative amortized inference, training inference networks to have recurrent dynamics that improve the network’s hypotheses in each iteration, without constraining these dynamics to a particular form (such as predictive coding). More generally, RNNs whose dynamics converge to a steady state can be optimized through variations on an algorithm known as recurrent backpropagation [136–138], which avoids retracing the computational graph through time. However, it is often difficult to design RNNs such that their dynamics converge to a steady state (within the time window for which the model is trained), while maintaining expressivity (the ability of the model to learn a wide range of functions). This challenge is addressed by the recently developed contractor recurrent backpropagation method, which introduces a mathematical penalty that can be imposed while training any RNN, to encourage it to learn convergent dynamics.

GOING FORWARD, IN CIRCLES

We started this review with the puzzling observation that, whereas biological vision is implemented in a profoundly recurrent neural architecture, the most successful neural network models of vision to date are feedforward. We have argued, theoretically and empirically, that vision models will eventually converge to their biological roots and implement more powerful recurrent solutions. This is an appealing prospect, as it suggests that neuroscientists and engineers can continue to work synergistically, to make progress on common challenges. After all, visual inference, and intelligence more generally, were solved once before, and so discovering nature’s solutions should go hand in hand with building artificial ones.

ACKNOWLEDGEMENTS

We thank Samuel Lippl, Heiko Schütt, Andrew Zaharia, Tal Golan and Benjamin Peters for detailed comments on a draft of this paper. This work was supported by a Rubicon grant from the Dutch Research Council (to R.S.v.B.).

1. V. A. Lamme, P. R. Roelfsema, The distinct modes of vision offered by feedforward and recurrent processing, Trends in Neurosciences 23 (11) (2000) 571–579. doi:10.1016/S0166-2236(00)01657-X.
2. G. Kreiman, T. Serre, Beyond the feedforward sweep: feedback computations in the visual cortex, Annals of the New York Academy of Sciences 1464 (1) (2020) 222–241. doi:10.1111/nyas.14320.
3. A. Angelucci, P. C. Bressloff, Contribution of feedforward, lateral and feedback connections to the classical receptive field center and extra-classical receptive field surround of primate V1 neurons, in: Progress in Brain Research, Vol. 154, 2006, pp. 93–120. doi:10.1016/S0079-6123(06)54005-1.
4. J. C. Anderson, R. J. Douglas, K. A. C. Martin, J. C. Nelson, Synaptic output of physiologically identified spiny stellate neurons in cat visual cortex, The Journal of Comparative Neurology 341 (1) (1994) 16–24. doi:10.1002/cne.903410103.
5. K. A. Martin, Microcircuits in visual cortex, Current Opinion in Neurobiology 12 (4) (2002) 418–425. doi:10.1016/S0959-4388(02)00343-4.
6. R. J. Douglas, K. A. Martin, Recurrent neuronal circuits in the neocortex, Current Biology 17 (13) (2007) 496–500. doi:10.1016/j.cub.2007.04.024.
7. D. J. Felleman, D. C. Van Essen, Distributed hierarchical processing in the primate cerebral cortex, Cerebral Cortex 1 (1) (1991) 1–47. doi:10.1093/cercor/1.1.1.
8. P. A. Salin, J. Bullier, Corticocortical connections in the visual system: structure and function, Physiological Reviews 75 (1) (1995) 107–154. doi:10.1152/physrev.1995.75.1.107.
9. N. T. Markov, M. M. Ercsey-Ravasz, A. R. Ribeiro Gomes, C. Lamy, L. Magrou, J. Vezoli, P. Misery, A. Falchier, R. Quilodran, M. A. Gariel, J. Sallet, R. Gamanut, C. Huissoud, S. Clavagnier, P. Giroud, D. Sappey-Marinier, P. Barone, C. Dehay, Z. Toroczkai, K. Knoblauch, D. C. Van Essen, H. Kennedy, A weighted and directed interareal connectivity matrix for macaque cerebral cortex, Cerebral Cortex 24 (1) (2014) 17–36. doi:10.1093/cercor/bhs270.
10. R. J. Douglas, C. Koch, M. Mahowald, K. A. Martin, H. H. Suarez, Recurrent excitation in neocortical circuits, Science 269 (5226) (1995) 981–985. doi:10.1126/science.7638624.
11. H. Supèr, H. Spekreijse, V. A. Lamme, Two distinct modes of sensory processing observed in monkey primary visual cortex (VI), Nature Neuroscience 4 (3) (2001) 304–310. doi:10.1038/85170.
12. V. Di Lollo, J. T. Enns, R. A. Rensink, Competition for consciousness among visual events: The psychophysics of reentrant visual processes, Journal of Experimental Psychology: General 129 (4) (2000) 481–507. doi:10.1037/0096-3445.129.4.481.
13. V. A. Lamme, K. Zipser, H. Spekreijse, Masking interrupts figure-ground signals in V1, Journal of Vision 1 (3) (2001) 1044–1053. doi:10.1167/1.3.32.
14. K. Heinen, J. Jolij, V. A. Lamme, Figure-ground segregation requires two distinct periods of activity in VI: A transcranial magnetic stimulation study, NeuroReport 16 (13) (2005) 1483–1487. doi:10.1097/01.wnr.0000175611.26485.c8.
15. J. J. Fahrenfort, H. S. Scholte, V. A. Lamme, Masking disrupts reentrant processing in human visual cortex, Journal of Cognitive Neuroscience 19 (9) (2007) 1488–1497. doi:10.1162/jocn.2007.19.9.1488.
16. Y. Lecun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444. doi:10.1038/nature14539.
17. J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks 61 (2015) 85–117. doi:10.1016/j.neunet.2014.09.003.
18. K. He, X. Zhang, S. Ren, J. Sun, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, in: 2015 IEEE International Conference on Computer Vision (ICCV), Vol. 2015 Inter, IEEE, 2015, pp. 1026–1034. doi:10.1109/ICCV.2015.123.
19. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2016-December (2016) 770–778. doi:10.1109/CVPR.2016.90.
20. I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, E. Brossard, The MegaFace benchmark: 1 million faces for recognition at scale, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2016-December (2016) 4873–4882. arXiv:1512.00596, doi:10.1109/CVPR.2016.527.
21. J. Kubilius, S. Bracci, H. P. Op de Beeck, Deep Neural Networks as a Computational Model for Human Shape Sensitivity, PLOS Computational Biology 12 (4) (2016) e1004896. doi:10.1371/journal.pcbi.1004896.
22. N. J. Majaj, D. G. Pelli, Deep learning-Using machine learning to study biological vision, Journal of Vision 18 (13) (2018) 1–13. doi:10.1167/18.13.2.
23. C. J. Spoerer, T. C. Kietzmann, J. Mehrer, I. Charest, N. Kriegeskorte, Recurrent neural networks can explain flexible trading of speed and accuracy in biological vision, PLOS Computational Biology 16 (10) (2020) e1008215. doi:10.1371/journal.pcbi.1008215.
24. C. F. Cadieu, H. Hong, D. L. K. Yamins, N. Pinto, D. Ardila, E. A. Solomon, N. J. Majaj, J. J. DiCarlo, Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition, PLoS Computational Biology 10 (12) (2014) e1003963. doi:10.1371/journal.pcbi.1003963.
25. S. M. Khaligh-Razavi, N. Kriegeskorte, Deep Supervised, but Not Unsupervised, Models May Explain IT Cortical Representation, PLoS Computational Biology 10 (11) (2014). doi:10.1371/journal.pcbi.1003915.
26. U. Guclu, M. A. J. van Gerven, Deep Neural Networks Reveal a Gradient in the Complexity of Neural Representations across the Ventral Stream, Journal of Neuroscience 35 (27) (2015) 10005–10014. doi:10.1523/JNEUROSCI.5023-14.2015.
27. N. Kriegeskorte, Deep Neural Networks: A New Framework for Modeling Biological Vision and Brain Information Processing, Annual Review of Vision Science 1 (1) (2015) 417–446. doi:10.1146/annurev-vision-082114-035447.
28. S. R. Kheradpisheh, M. Ghodrati, M. Ganjtabesh, T. Masquelier, Deep Networks Can Resemble Human Feed-forward Vision in Invariant Object Recognition, Scientific Reports 6 (1) (2016) 32672. doi:10.1038/srep32672.
29. M. Schrimpf, J. Kubilius, H. Hong, N. J. Majaj, R. Rajalingham, E. B. Issa, K. Kar, P. Bashivan, J. Prescott-Roy, F. Geiger, K. Schmidt, D. L. K. Yamins, J. J. DiCarlo, Brain-score: Which artificial neural network for object recognition is most brain-like?, bioRxiv (2020). doi:10.1101/407007.
30. R. P. N. Rao, D. H. Ballard, Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects., Nature neuroscience 2 (1) (1999) 79–87. doi:10.1038/4580.
31. A. Yuille, D. Kersten, Vision as Bayesian inference: analysis by synthesis?, Trends in Cognitive Sciences 10 (7) (2006) 301–308. doi:10.1016/j.tics.2006.05.002.
32. K. Friston, S. Kiebel, Predictive coding under the free-energy principle, Philosophical Transactions of the Royal Society B: Biological Sciences 364 (1521) (2009) 1211–1221. doi:10.1098/rstb.2008.0300.
33. S. J. D. Prince, Computer Vision: Models, Learning and Inference, Cambridge University Press, Cambridge, 2012. doi:10.1017/CBO9780511996504.
34. J. J. DiCarlo, D. Zoccolan, N. C. Rust, How does the brain solve visual object recognition?, Neuron 73 (3) (2012) 415–434. doi:10.1016/j.neuron.2012.01.010.
35. M. Carandini, D. J. Heeger, Normalization as a canonical neural computation, Nature Reviews Neuroscience 13 (1) (2012) 51–62. doi:10.1038/nrn3136.
36. R. Desimone, J. Duncan, Neural Mechanisms of Selective Visual Attention, Annual Review of Neuroscience 18 (1) (1995) 193–222. doi:10.1146/annurev.neuro.18.1.193.
37. S. Kastner, L. G. Ungerleider, Mechanisms of Visual Attention in the Human Cortex, Annual Review of Neuroscience 23 (1) (2000) 315–341. doi:10.1146/annurev.neuro.23.1.315.
38. J. H. Maunsell, S. Treue, Feature-based attention in visual cortex, Trends in Neurosciences 29 (6) (2006) 317–322. doi:10.1016/j.tins.2006.04.001.
39. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30, Curran Associates, Inc., 2017, pp. 5998–6008.
40. Q. Liao, T. Poggio, Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex (047) (2016) 1–16. arXiv:1604.03640.
41. S. Jastrzębski, D. Arpit, N. Ballas, V. Verma, T. Che, Y. Bengio, Residual Connections Encourage Iterative Inference (2017). arXiv:1710.04773.
42. K. Greff, R. K. Srivastava, J. Schmidhuber, Highway and Residual Networks learn Unrolled Iterative Estimation, 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings (2015) (2016) 1–14. arXiv:1612.07771.
43. G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 2017-January (2017) 2261–2269. doi:10.1109/CVPR.2017.243.
44. P. Dayan, L. F. Abbott, Theoretical Neuroscience, MIT Press, Cambridge, MA, 2001.
45. M. S. Advani, A. M. Saxe, High-dimensional dynamics of generalization error in neural networks (2017) 1–32. arXiv:1710.03667.
46. M. Belkin, D. Hsu, S. Ma, S. Mandal, Reconciling modern machine-learning practice and the classical bias–variance trade-off, Proceedings of the National Academy of Sciences of the United States of America 116 (32) (2019) 15849–15854. doi:10.1073/pnas.1903070116.
47. P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, I. Sutskever, Deep Double Descent: Where Bigger Models and More Data Hurt (2019). arXiv:1912.02292.
48. W. Rawat, Z. Wang, Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review, Neural Computation 29 (9) (2017) 2352–2449. doi:10.1162/neco_a_00990.
49. C. D. Gilbert, W. Li, Top-down influences on visual processing, Nature Reviews Neuroscience 14 (5) (2013) 350–363. doi:10.1038/nrn3476.
50. C. Summerfield, T. Egner, Expectation (and attention) in visual cognition, Trends in Cognitive Sciences 13 (9) (2009) 403–409. doi:10.1016/j.tics.2009.06.003.
51. H. von Helmholtz, Handbuch der physiologischen Optik, Dover (English translation), New York, 1860/1962.
52. Y. Weiss, E. P. Simoncelli, E. H. Adelson, Motion illusions as optimal percepts, Nature Neuroscience 5 (6) (2002) 598–604. doi:10.1038/nn858.
53. A. A. Stocker, E. P. Simoncelli, Noise characteristics and prior expectations in human visual speed perception, Nature Neuroscience 9 (4) (2006) 578–585. doi:10.1038/nn1669.
54. P. Mamassian, R. Goutcher, Prior knowledge on the illumination position, Cognition 81 (1) (2001) 1–9. doi:10.1016/S0010-0277(01)00116-0.
55. A. R. Girshick, M. S. Landy, E. P. Simoncelli, Cardinal rules: visual orientation perception reflects knowledge of environmental statistics., Nature neuroscience 14 (7) (2011) 926–32. doi:10.1038/nn.2831.
56. U. Hasson, E. Yang, I. Vallines, D. J. Heeger, N. Rubin, A Hierarchy of Temporal Receptive Windows in Human Cortex, Journal of Neuroscience 28 (10) (2008) 2539–2550. doi:10.1523/JNEUROSCI.5487-07.2008.
57. J. D. Murray, A. Bernacchia, D. J. Freedman, R. Romo, J. D. Wallis, X. Cai, C. Padoa-Schioppa, T. Pasternak, H. Seo, D. Lee, X.-J. Wang, A hierarchy of intrinsic timescales across primate cortex, Nature Neuroscience 17 (12) (2014) 1661– doi:10.1038/nn.3862.
58. I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, Advances in Neural Information Processing Systems 4 (January) (2014) 3104–3112. arXiv:1409.3215.
59. R. E. Kalman, A New Approach to Linear Filtering and Prediction Problems, Journal of Basic Engineering 82 (1) (1960) 35–45. doi:10.1115/1.3662552.
60. D. Wolpert, Z. Ghahramani, M. Jordan, An internal model for sensorimotor integration, Science 269 (5232) (1995) 1880–1882. doi:10.1126/science.7569931.
61. R. P. N. Rao, D. H. Ballard, Dynamic model of visual recognition predicts neural response properties in the visual cortex, Neural computation 9 (November 1995) (1997) 721–763. doi:10.1162/neco.1997.9.4.721.
62. R. P. N. Rao, Bayesian computation in recurrent neural circuits., Neural computation 16 (1) (2004) 1–38.
63. S. Denève, J.-R. Duhamel, A. Pouget, Optimal Sensorimotor Integration in Recurrent Cortical Networks: A Neural Implementation of Kalman Filters, Journal of Neuroscience 27 (21) (2007) 5744–5756. doi:10.1523/JNEUROSCI.3985-06.2007.
64. J.-J. Orban de Xivry, S. Coppe, G. Blohm, P. Lefevre, Kalman Filtering Naturally Accounts for Visually Guided and Predictive Smooth Pursuit Dynamics, Journal of Neuroscience 33 (44) (2013) 17301–17313. doi:10.1523/JNEUROSCI.2321-13.2013.
65. O.-S. Kwon, D. Tadin, D. C. Knill, Unifying account of visual motion and position perception, Proceedings of the National Academy of Sciences 112 (26) (2015) 8142–8147. doi:10.1073/pnas.1500361112.
66. R. S. van Bergen, J. F. M. Jehee, Probabilistic Representation in Human Visual Cortex Reflects Uncertainty in Serial Decisions, The Journal of neuroscience : the official journal of the Society for Neuroscience 39 (41) (2019) 8164–8176. doi:10.1523/JNEUROSCI.3212-18.2019.
67. A. Graves, A.-R. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, no. 3, IEEE, 2013, pp. 6645–6649. doi:10.1109/ICASSP.2013.6638947.
68. H. Sak, A. Senior, F. Beaufays, Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition (2014). arXiv:1402.1128.
69. D. Bahdanau, K. Cho, Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings (2014). arXiv:1409.0473.
70. K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Journal of Clinical Microbiology 28 (4) (2014) 828–829. arXiv:1406.1078.
71. M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, S. Chopra, Video (language) modeling: a baseline for generative models of natural videos (2014). arXiv:1412.6604.
72. N. Srivastava, E. Mansimov, R. Salakhutdinov, Unsupervised Learning of Video Representations using LSTMs (2015). arXiv:1502.04681.
73. W. Lotter, G. Kreiman, D. Cox, Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning (2016). arXiv:1605.08104.
74. A. Pouget, J. Beck, W. J. Ma, P. Latham, Probabilistic brains: knowns and unknowns., Nature neuroscience 16 (9) (2013) 1170–8. doi:10.1038/nn.3495.
75. W. J. Ma, M. Jazayeri, Neural Coding of Uncertainty and Probability., Annual Review of Neuroscience 37 (2014) 205–220. doi:10.1146/annurev-neuro-071013-014017.
76. G. Orbán, P. Berkes, J. Fiser, M. Lengyel, Neural Variability and Sampling-Based Probabilistic Representations in the Visual Cortex, Neuron 92 (2) (2016) 530–543. doi:10.1016/j.neuron.2016.09.038.
77. M. Boerlin, C. K. Machens, S. Denève, Predictive Coding of Dynamical Variables in Balanced Spiking Networks, PLoS Computational Biology 9 (11) (2013). doi:10.1371/journal.pcbi.1003258.
78. D. G. Barrett, S. Denève, C. K. Machens, Optimal compensation for neuron loss, eLife 5 (e12454) (2016) 1–36. doi:10.7554/eLife.12454.
79. P. H. Schiller, B. L. Finlay, S. F. Volman, Short-term response variability of monkey striate neurons., Brain research 105 (2) (1976) 347–9.
80. A. Dean, The variability of discharge of simple cells in the cat striate cortex, Experimental Brain Research 44 (4) (1981). doi:10.1007/BF00238837.
81. Z. F. Mainen, T. J. Sejnowski, Reliability of spike timing in neocortical neurons., Science 268 (5216) (1995) 1503–6.
82. A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, Advances In Neural Information Processing Systems (2012). arXiv:1102.0183.
83. D. J. Field, A. Hayes, R. F. Hess, Contour integration by the human visual system: evidence for a local “association field”., Vision research 33 (2) (1993) 173–93. doi:10.1016/0042-6989(93)90156-q.
84. W. S. Geisler, J. S. Perry, B. J. Super, D. P. Gallogly, Edge co-occurrence in natural images predicts contour grouping performance, Vision Research 41 (6) (2001) 711–724. doi:10.1016/S0042-6989(00)00277-7.
85. P. R. Roelfsema, Cortical algorithms for perceptual grouping, Annual Review of Neuroscience 29 (1) (2006) 203–227. doi:10.1146/annurev.neuro.29.051605.112939.
86. D. Linsley, J. Kim, V. Veerabadran, C. Windolf, T. Serre, Learning long-range spatial dependencies with horizontal gated recurrent units, in: S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Advances in Neural Information Processing Systems 31, Curran Associates, Inc., 2018, pp. 152–164.
87. D. George, W. Lehrach, K. Kansky, M. Lázaro-Gredilla, C. Laan, B. Marthi, X. Lou, Z. Meng, Y. Liu, H. Wang, A. Lavin, D. S. Phoenix, A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs, Science 358 (6368) (2017). doi:10.1126/science.aag2612.
88. S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, P. H. Torr, Conditional random fields as recurrent neural networks, Proceedings of the IEEE International Conference on Computer Vision 2015 Inter (2015) 1529–1537. doi:10.1109/ICCV.2015.179.
89. C. J. Spoerer, P. McClure, N. Kriegeskorte, Recurrent convolutional neural networks: A better model of biological object recognition, Frontiers in Psychology 8 (SEP) (2017) 1–14. doi:10.3389/fpsyg.2017.01551.
90. N. Montobbio, L. Bonnasse-Gahot, G. Citti, A. Sarti, KerCNNs: biologically inspired lateral connections for classification of corrupted images (2019). arXiv:1910.08336.
91. C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings (2014). arXiv:1312.6199.
92. R. Geirhos, C. Michaelis, F. A. Wichmann, P. Rubisch, M. Bethge, W. Brendel, Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness, 7th International Conference on Learning Representations, ICLR 2019 (c) (2019) 1–22. arXiv:1811.12231.
93. J. H. Jacobsen, J. Behrmann, R. Zemel, M. Bethge, Excessive invariance causes adversarial vulnerability, 7th International Conference on Learning Representations, ICLR 2019 (2019). arXiv:1811.00401.
94. J. Neyman, E. S. Pearson, IX. On the problem of the most efficient tests of statistical hypotheses, Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 231 (694-706) (1933) 289–337. doi:10.1098/rsta.1933.0009.
95. A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, A. Madry, Adversarial Examples Are Not Bugs, They Are Features (2019). arXiv:1905.02175.
96. Y. Li, J. Bradshaw, Y. Sharma, Are generative classifiers more robust to adversarial attacks?, 36th International Conference on Machine Learning, ICML 2019 2019-June (2019) 6754–6783. arXiv:1802.06552.
97. L. Schott, J. Rauber, M. Bethge, W. Brendel, Towards the first adversarially robust neural network model on MNIST, Iclr 3 (2018) 1–16. arXiv:1805.09190.
98. T. Golan, P. C. Raju, N. Kriegeskorte, Controversial stimuli: pitting neural networks against each other as models of human recognition (2019). arXiv:1911.09288.
99. T. S. Lee, D. Mumford, Hierarchical Bayesian inference in the visual cortex., Journal of the Optical Society of America. A, Optics, image science, and vision 20 (7) (2003) 1434–48.
100. H. Wen, K. Han, J. Shi, Y. Zhang, E. Culurciello, Z. Liu, Deep Predictive Coding Network for Object Recognition (2018). arXiv:1802.04762.
101. V. Srikumar, G. Kundu, D. Roth, On amortizing inference cost for structured prediction, EMNLP-CoNLL 2012 - 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Proceedings of the Conference (July) (2012) 1114–1124.
102. A. Stuhlmüller, J. Taylor, N. Goodman, Learning stochastic inverses, in: C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems, Vol. 26, Curran Associates, Inc., 2013, pp. 3048–3056.
103. C. Cremer, X. Li, D. Duvenaud, Inference suboptimality in variational autoencoders, 35th International Conference on Machine Learning, ICML 2018 3 (2018) 1749–1760. arXiv:1801.03558.
104. J. Marino, Y. Yue, S. Mandt, Iterative amortized inference, 35th International Conference on Machine Learning, ICML 2018 8 (2018) 5444–5462. arXiv:1807.09356.
105. R. D. Hjelm, K. Cho, J. Chung, R. Salakhutdinov, V. Calhoun, N. Jojic, Iterative refinement of the approximate posterior for directed belief networks, Advances in Neural Information Processing Systems (Nips 2016) (2016) 4698–4706. arXiv:1511.06382.
106. R. G. Krishnan, D. Liang, M. D. Hoffman, On the challenges of learning with inference networks on sparse, high-dimensional data, International Conference on Artificial Intelligence and Statistics, AISTATS 2018 84 (2018) 143–151. arXiv:1710.06085.
107. M. Liang, X. Hu, Recurrent convolutional neural network for object recognition, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 07-12-June (2015) 3367–3375. doi:10.1109/CVPR.2015.7298958.
108. K. Kar, J. Kubilius, K. Schmidt, E. B. Issa, J. J. DiCarlo, Evidence that recurrent circuits are critical to the ventral stream’s execution of core object recognition behavior, Nature Neuroscience 22 (6) (2019) 974–983. doi:10.1038/s41593-019-0392-5.
109. A. Nayebi, D. Bear, J. Kubilius, K. Kar, S. Ganguli, D. Sussillo, J. J. DiCarlo, D. L. Yamins, Task-driven convolutional recurrent models of the visual system, Advances in Neural Information Processing Systems 2018-Decem (NeurIPS) (2018) 5290–5301.
110. D. H. Ballard, Animate vision, Artificial Intelligence 48 (1) (1991) 57–86. doi:10.1016/0004-3702(91)90080-4.
111. J. M. Findlay, I. D. Gilchrist, Active Vision, Oxford University Press, 2003. doi:10.1093/acprof:oso/9780198524793.001.0001.
112. R. Bajcsy, Y. Aloimonos, J. K. Tsotsos, Revisiting active perception, Autonomous Robots 42 (2) (2018) 177–196. doi:10.1007/s10514-017-9615-3.
113. S. J. Russell, Rationality and intelligence, Artificial Intelligence 94 (1-2) (1997) 57–77. doi:10.1016/S0004-3702(97)00026-X.
114. S. J. Gershman, E. J. Horvitz, J. B. Tenenbaum, Computational rationality: A converging paradigm for intelligence in brains, minds, and machines, Science 349 (6245) (2015) 273–278. doi:10.1126/science.aac6076.
115. T. L. Griffiths, F. Lieder, N. D. Goodman, Rational Use of Cognitive Resources: Levels of Analysis Between the Computational and the Algorithmic, Topics in Cognitive Science 7 (2) (2015) 217–229. doi:10.1111/tops.12142.
116. J. Kubilius, M. Schrimpf, K. Kar, R. Rajalingham, H. Hong, N. Majaj, E. Issa, P. Bashivan, J. Prescott-Roy, K. Schmidt, A. Nayebi, D. Bear, D. L. Yamins, J. J. DiCarlo, Brain-like object recognition with high-performing shallow recurrent anns, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 12805–12816.
117. J. Deng, W. Dong, R. Socher, L.-J. Li, Kai Li, Li Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.
118. T. C. Kietzmann, C. J. Spoerer, L. K. A. Sörensen, R. M. Cichy, O. Hauk, N. Kriegeskorte, Recurrence is required to capture the representational dynamics of the human visual system, Proceedings of the National Academy of Sciences 116 (43) (2019) 201905544. doi:10.1073/pnas.1905544116.
119. H. Tang, M. Schrimpf, W. Lotter, C. Moerman, A. Paredes, J. O. Caro, W. Hardesty, D. Cox, G. Kreiman, Recurrent computations for visual pattern completion, Proceedings of the National Academy of Sciences of the United States of America 115 (35) (2018) 8835–8840. doi:10.1073/pnas.1719397115.
120. J. T. Enns, V. Di Lollo, What’s new in visual masking?, Trends in Cognitive Sciences 4 (9) (2000) 345–352. doi:10.1016/S1364-6613(00)01520-5.
121. A. M. Fyall, Y. El-Shamayleh, H. Choi, E. Shea-Brown, A. Pasupathy, Dynamic representation of partially occluded objects in primate prefrontal and visual cortex, eLife 6 (2017) 1–25. doi:10.7554/eLife.25784.
122. H. Choi, A. Pasupathy, E. Shea-Brown, Predictive Coding in Area V4: Dynamic Shape Discrimination under Partial Occlusion, Neural Computation 30 (5) (2018) 1209–1257. doi:10.1162/neco_a_01072.
123. D. M. Levi, Crowding—An essential bottleneck for object recognition: A mini-review, Vision Research 48 (5) (2008) 635–654. doi:10.1016/j.visres.2007.12.009.
124. M. Manassi, B. Sayim, M. H. Herzog, Grouping, pooling, and when bigger is better in visual crowding, Journal of Vision 12 (10) (2012) 13–13. doi:10.1167/12.10.13.
125. M. Manassi, S. Lonchampt, A. Clarke, M. H. Herzog, What crowding can tell us about object representations, Journal of Vision 16 (3) (2016) 35. doi:10.1167/16.3.35.
126. A. Doerig, A. Bornet, O. Choung, M. Herzog, Crowding reveals fundamental differences in local vs. global processing in humans and machines, Vision Research 167 (August 2019) (2020) 39–45. doi:10.1016/j.visres.2019.12.006.
127. S. Sabour, N. Frosst, G. E. Hinton, Dynamic routing between capsules, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30, Curran Associates, Inc., 2017, pp. 3856–3866.
128. S. Sabour, N. Frosst, G. E. Hinton, Matrix capsules with EM routing, Iclr 2018 (2011) (2018) 1–12. arXiv:1710.09829.
129. J. K. O’Regan, A. Noë, A sensorimotor account of vision and visual consciousness, Behavioral and Brain Sciences 24 (5) (2001) 939–973. doi:10.1017/S0140525X01000115.
130. G. Buzsáki, The Brain from Inside Out, Oxford University Press, 2019. doi:10.1093/oso/9780190905385.001.0001.
131. P. Werbos, Backpropagation through time: what it does and how to do it, Proceedings of the IEEE 78 (10) (1990) 1550–1560. doi:10.1109/5.58337.
132. J. Guerguiev, T. P. Lillicrap, B. A. Richards, Towards deep learning with segregated dendrites, eLife 6 (2017) 1–37. doi:10.7554/eLife.22901.
133. J. Sacramento, R. Ponte Costa, Y. Bengio, W. Senn, Dendritic cortical microcircuits approximate the backpropagation algorithm, in: S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Advances in Neural Information Processing Systems 31, Curran Associates, Inc., 2018, pp. 8721–8732.
134. J. C. Whittington, R. Bogacz, Theories of Error Back-Propagation in the Brain, Trends in Cognitive Sciences 23 (3) (2019) 235–250. doi:10.1016/j.tics.2018.12.005.
135. T. P. Lillicrap, A. Santoro, Backpropagation through time and the brain, Current Opinion in Neurobiology 55 (2019) 82–89. doi:10.1016/j.conb.2019.01.011.
136. L. Almeida, A learning rule for asynchronous perceptrons with feedback in a combinatorial environment., Proceedings, 1st First International Conference on Neural Networks 2 (1987) 609–618.
137. F. J. Pineda, Generalization of back-propagation to recurrent neural networks, Physical Review Letters 59 (19) (1987) 2229–2232. doi:10.1103/PhysRevLett.59.2229.
138. R. Liao, Y. Xiong, E. Fetaya, L. Zhang, K. J. Yoon, X. Pitkow, R. Urtasun, R. Zemel, Reviving and improving recurrent back-propagation, 35th International Conference on Machine Learning, ICML 2018 7 (2018) 4807–4820. arXiv:1803.06396.
139. D. Linsley, A. K. Ashok, L. N. Govindarajan, R. Liu, T. Serre, Stable and expressive recurrent vision models (2020). arXiv:2005.11362.
Quantitative Biology – arXiv (Cornell University)
Published: Mar 26, 2020