# Nonlinear convergence preserves stimulus information

Gabrielle J. Gutierrez, Fred Rieke*, and Eric Shea-Brown*
*co-senior authors

##### The layered circuitry of the retina compresses and reformats visual inputs before passing them on to the brain. The optic nerve has the channel capacity of an internet connection from the early 90s, yet the brain somehow receives enough information to reconstruct our high-definition world. The goal of this study was to learn something about the compression algorithm of the retina by modeling aspects of its circuit architecture and neural response properties.

The retina compresses a high-dimensional input into a low-dimensional representation. This is supported by converging and diverging circuit architectures (Fig. 1), along with nonlinear neuron responses. Hundreds of photoreceptors converge to tens of bipolar cells which converge to a single ganglion cell. At the same time, inputs diverge onto many different ganglion cell types that have overlapping receptive fields.

Figure 1: Schematic of retina circuitry illustrating divergence into ON and OFF pathways and convergence within a pathway.

If you look at these circuit components, though, it’s hard to see how they manage to preserve enough information for the brain to work with. For example, converging two inputs can result in ambiguities. In Figure 2, the neural response is simply the sum of the input dimensions. This means that all of the stimuli in the top plot that lie along the orange line are represented by the same response shown by the orange circle in the bottom plot. There’s no telling those stimuli apart, so information is lost by convergence here – down to 12.50 bits in the response from 19.85 bits in the stimulus.

Figure 2: Convergence creates ambiguities, causing information about the stimulus to be lost.
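To make the loss from convergence concrete, here is a minimal sketch that estimates entropy by binning samples on a fixed grid. The Gaussian stimulus, the bin width of 0.1, and the helper name `entropy_bits` are all assumptions made for illustration, so the bit counts it prints will not match the 19.85 and 12.50 quoted above; only the direction of the change is the point.

```python
import numpy as np

def entropy_bits(samples, bin_width=0.1):
    """Plug-in entropy (bits) of samples binned on a fixed grid.
    Accepts shape (n,) for 1-D responses or (n, d) for d-D responses."""
    samples = np.asarray(samples, dtype=float)
    if samples.ndim == 1:
        samples = samples[:, None]
    bins = np.floor(samples / bin_width).astype(np.int64)
    _, counts = np.unique(bins, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
stimulus = rng.normal(size=(200_000, 2))   # assumed 2-D Gaussian stimulus
response = stimulus.sum(axis=1)            # convergence: sum the two inputs

h_stimulus = entropy_bits(stimulus)
h_response = entropy_bits(response)
print(f"stimulus entropy: {h_stimulus:.2f} bits")
print(f"response entropy: {h_response:.2f} bits")
```

Every stimulus on a line of constant sum lands in the same response bin, which is why the second number comes out smaller.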

Divergence is another common neural circuit motif. Diverging a stimulus input into two neurons (Fig. 3) expands a 1-dimensional stimulus into a 2-dimensional response but this leads to redundant signals. Here, divergence creates an inefficient neural architecture because it uses two neurons to give you as much information as just one neuron.

Figure 3: Diverging a single input into two outputs can produce redundancies.
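The redundancy can be checked with the same kind of binned entropy estimate. This is only a sketch under assumed settings (a Gaussian stimulus, two identical linear neurons, a bin width of 0.1); `entropy_bits` is a hypothetical helper, not something from the study.

```python
import numpy as np

def entropy_bits(samples, bin_width=0.1):
    """Plug-in entropy (bits) of samples binned on a fixed grid."""
    samples = np.asarray(samples, dtype=float)
    if samples.ndim == 1:
        samples = samples[:, None]
    bins = np.floor(samples / bin_width).astype(np.int64)
    _, counts = np.unique(bins, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
stimulus = rng.normal(size=200_000)             # 1-D stimulus
one_neuron = stimulus                           # a single linear neuron
two_neurons = np.column_stack([stimulus, stimulus])  # divergence into two copies

h_one = entropy_bits(one_neuron)
h_two = entropy_bits(two_neurons)
print(f"one neuron:  {h_one:.2f} bits")
print(f"two neurons: {h_two:.2f} bits")
```

The two-neuron response occupies a 2-D space, but all of its points sit on the diagonal, so it carries no more entropy than the single neuron does.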

Nonlinear response functions are common in neurons and can make a neuron more efficient at encoding its inputs by spreading its responses around so that information about the stimulus is maximized. Nonlinear response functions can otherwise make a neuron selective to certain stimulus features, but selectivity and efficiency can be in conflict with each other. Figure 4 shows what a rectified linear (ReLU) nonlinearity does to a gaussian stimulus distribution. It compresses the left side of the gaussian so that there is only one response to encode all of the stimuli that fall below the threshold. A lot of information is lost this way. If the stimulus distribution described luminance values in an image, the ReLU would cut out much of the detail from that image.

Figure 4: Nonlinear transformation of a gaussian distributed input with a ReLU can distort the distribution, producing a compressed response where some portion of the stimulus information is discarded.
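A quick numerical sketch of this effect, assuming a standard Gaussian stimulus, a threshold at zero, and a binned entropy estimate (the bin width and the helper `entropy_bits` are illustrative choices, not taken from the study):

```python
import numpy as np

def entropy_bits(samples, bin_width=0.1):
    """Plug-in entropy (bits) of 1-D samples binned on a fixed grid."""
    bins = np.floor(np.asarray(samples, dtype=float) / bin_width).astype(np.int64)
    _, counts = np.unique(bins, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
stimulus = rng.normal(size=200_000)
response = np.maximum(stimulus, 0.0)        # ReLU with threshold at zero

fraction_at_zero = float(np.mean(response == 0.0))
h_stimulus = entropy_bits(stimulus)
h_response = entropy_bits(response)
print(f"fraction of stimuli mapped to zero: {fraction_at_zero:.2f}")
print(f"entropy before ReLU: {h_stimulus:.2f} bits")
print(f"entropy after ReLU:  {h_response:.2f} bits")
```

Roughly half of the samples collapse onto a single response value, and the entropy drops accordingly.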

Given that all of these information-problematic elements make up neural circuits, we wondered: how much information can a compressive neural circuit retain when its neurons are nonlinear? We were surprised to find that a convergent, divergent circuit can preserve more information when its subunits are nonlinear than when its subunits are linear (Fig. 5) – even though the individual linear subunits are lossless and the nonlinear subunits are not.

Figure 5: A convergent, divergent circuit with nonlinear subunits (right) preserves more information about the stimulus than a circuit with linear subunits (left).

To explain this, we’ll start out with a reduced version of the circuit. It has only 2 converging subunits and no divergence. Figure 6 shows how a 2-dimensional stimulus is encoded by each layer of the two circuits being compared. The dark purple band represents stimuli whose two inputs sum to the same value. These stimuli are represented by the same output response in the linear subunits circuit as demonstrated by the dark purple that fills a single histogram bin (left, 3rd and 4th rows). Those same stimuli are represented in a more distributed way for the nonlinear subunits circuit (right, 3rd and 4th rows) – meaning that they are represented more distinctly in the output response.

Figure 6: The encoding of the stimulus space at each circuit layer when the subunits are linear (left) and nonlinear (right).
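The two-subunit comparison can be sketched in a few lines. The assumptions here are mine: Gaussian inputs, ReLU subunits with a threshold at zero, a ReLU on the output of both circuits, unit weights, and a binned entropy estimate with bin width 0.1.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def entropy_bits(samples, bin_width=0.1):
    """Plug-in entropy (bits) of 1-D samples binned on a fixed grid."""
    bins = np.floor(np.asarray(samples, dtype=float) / bin_width).astype(np.int64)
    _, counts = np.unique(bins, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
stimulus = rng.normal(size=(200_000, 2))

# Linear subunits: inputs pass through unchanged, then converge, then output ReLU.
linear_out = relu(stimulus.sum(axis=1))
# Nonlinear subunits: each input is rectified before convergence.
nonlinear_out = relu(relu(stimulus).sum(axis=1))

h_linear = entropy_bits(linear_out)
h_nonlinear = entropy_bits(nonlinear_out)
print(f"linear subunits:    {h_linear:.2f} bits")
print(f"nonlinear subunits: {h_nonlinear:.2f} bits")
```

In the linear subunits circuit, every stimulus with a negative sum maps to zero. In the nonlinear subunits circuit, only stimuli with both inputs negative do, so the responses are spread over more distinct values.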

With two subunits, the nonlinear subunits circuit retains more information than the linear subunits circuit, but what happens when there are more than two subunits? The more subunits you compress together, the more difficult it should be to distinguish between different stimuli. Indeed, this is true, but we wondered if the nonlinear subunits circuit would continue to have an advantage over the linear subunits circuit as more subunits are converged. Figure 7 shows that it does. With more subunits, the output response distribution becomes more gaussian, spreading responses out and shifting them towards more positive values (Fig. 7B). The more nonlinear subunits that are converged, the more the nonlinear subunits circuit gains an advantage, up to a saturation point (Fig. 7C). In essence, the convergence of increasing numbers of nonlinear subunits allows the circuit to escape from the compression imposed by the thresholds of the individual nonlinearities themselves.

Figure 7: (A) The output distribution for the linear subunits circuit does not change with more subunits. (B) The output distribution for the nonlinear subunits circuit shifts away from zero and becomes more gaussian. (C) The information entropy for the nonlinear subunits circuit increases with more subunits [subunits undergo an identical normalization regardless of their linearity or nonlinearity].
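Here is a rough sketch of the trend in Figure 7C. The normalization is an assumption on my part: each subunit is weighted by 1/sqrt(N), chosen because it leaves the linear circuit's output distribution unchanged as N grows. Gaussian inputs, zero thresholds, and the binned entropy estimate are likewise illustrative.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def entropy_bits(samples, bin_width=0.1):
    """Plug-in entropy (bits) of 1-D samples binned on a fixed grid."""
    bins = np.floor(np.asarray(samples, dtype=float) / bin_width).astype(np.int64)
    _, counts = np.unique(bins, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def output_entropy(n_subunits, nonlinear, n_samples=200_000, seed=0):
    """Entropy of the circuit output for a given number of converging subunits."""
    rng = np.random.default_rng(seed)
    stimulus = rng.normal(size=(n_samples, n_subunits))
    subunits = relu(stimulus) if nonlinear else stimulus
    # Assumed normalization: identical 1/sqrt(N) weights for both circuits.
    output = relu(subunits.sum(axis=1) / np.sqrt(n_subunits))
    return entropy_bits(output)

subunit_counts = [1, 2, 4, 8, 16]
h_linear = [output_entropy(n, nonlinear=False) for n in subunit_counts]
h_nonlinear = [output_entropy(n, nonlinear=True) for n in subunit_counts]
for n, hl, hn in zip(subunit_counts, h_linear, h_nonlinear):
    print(f"N={n:2d}  linear: {hl:.2f} bits   nonlinear: {hn:.2f} bits")
```

With one subunit the two circuits are identical. As subunits are added, fewer stimuli sum to exactly zero in the nonlinear circuit, so less of the output is pinned at the output threshold.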

It would seem that this explains it all; however, there is something subtle to consider. In Figure 5, the circuits had two complementary, diverging pathways – an ON and an OFF pathway. You might have expected the divergence to redeem the linear subunits circuit since the OFF pathway can encode all the stimulus information that the ON pathway discarded. So why is the nonlinear subunits circuit still better? The explanation is in Figure 8 which tracks a distribution of 2-dimensional stimuli through circuits with 2 diverging pathways (ON and OFF) and two subunits in each pathway.

The points are all color-coded by the stimulus quadrant they originated from. The linear subunits don’t meaningfully transform the stimuli (Fig. 8A), although the OFF subunit space is rotated because the OFF subunits put a minus sign on the inputs. When the linear subunits are converged within their respective pathways, the ON and OFF responses compress everything onto a diagonal line because they are perfectly anti-correlated (Fig. 8B). When the output nonlinearities are applied, this linear manifold gets folded into an L-shape (Fig. 8C). Notice how the information entropy for the output response of the linear subunits circuit with diverging pathways is higher than it was with just a single pathway (Fig. 7C, black) – but it has only gone up enough to match the information entropy of a single pathway response without any nonlinearities in either the subunits or the output (Fig. 7C, grey dashed). In other words, the OFF pathway in the linear subunits circuit with output nonlinearities (Fig. 8C) is indeed rescuing the information discarded by the ON pathway, but it cannot do any better than an ON pathway with no nonlinearities anywhere. So how is the nonlinear subunits circuit able to preserve even more information?

Figure 8: Geometrical exploration of the compressive transformations that take place in the linear and nonlinear subunits circuits.

Well, first notice how the nonlinear subunits transform the inputs (Fig. 8D). The nonlinearities actually compress the subunit space, but they do so in complementary ways for the ON and OFF subunits. When these subunits are converged in their respective pathways (Fig. 8E), the output response has some similarities to that for the linear subunits circuit (Fig. 8C). The L-shaped manifold is still there, but the orange and purple points have been projected off of it. These points represent the stimulus inputs with mixed sign. By virtue of having these points leave the manifold and fill out the response space, information entropy is increased. In fact, as more nonlinear subunits are converged in a circuit that also has divergence, the information entropy continues to increase until saturation (Fig. 8F). It even increases beyond that of the fully linear response (shown in Fig. 8B) where there are no nonlinearities anywhere.
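The geometry described for Figure 8 can be reproduced in a small sketch with two subunits per pathway. As before, the Gaussian stimulus, zero thresholds, unit weights, and the 2-D binned entropy estimate are assumptions made for illustration.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def entropy_bits(samples, bin_width=0.1):
    """Plug-in entropy (bits) of samples binned on a fixed grid, shape (n,) or (n, d)."""
    samples = np.asarray(samples, dtype=float)
    if samples.ndim == 1:
        samples = samples[:, None]
    bins = np.floor(samples / bin_width).astype(np.int64)
    _, counts = np.unique(bins, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
stimulus = rng.normal(size=(200_000, 2))
total = stimulus.sum(axis=1)

# Linear subunits, rectified outputs: ON and OFF trace out an L-shape.
linear_response = np.column_stack([relu(total), relu(-total)])
# Nonlinear subunits: OFF subunits rectify the sign-flipped inputs.
nonlinear_response = np.column_stack(
    [relu(stimulus).sum(axis=1), relu(-stimulus).sum(axis=1)]
)

h_fully_linear = entropy_bits(total)             # one pathway, no nonlinearities
h_linear = entropy_bits(linear_response)
h_nonlinear = entropy_bits(nonlinear_response)
print(f"fully linear, single pathway: {h_fully_linear:.2f} bits")
print(f"linear subunits, ON + OFF:    {h_linear:.2f} bits")
print(f"nonlinear subunits, ON + OFF: {h_nonlinear:.2f} bits")
```

Mixed-sign stimuli drive both ON and OFF outputs in the nonlinear subunits circuit, so those points leave the L-shaped manifold and fill the 2-D response space.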

Manifold-schmanifold. Does the nonlinear subunits circuit encode something meaningful for the retina or what? Figure 9 shows that it does! The nonlinear subunits circuit encodes both mean luminance and local contrast whereas the linear subunits circuit is only able to encode the mean luminance of the stimulus. So the convergence of nonlinear subunits not only preserves more quantifiable information, it also preserves more qualitatively useful stimulus information.

Figure 9: (A) The stimulus space is color-coded by bands of mean luminance. A banded structure is preserved in the output response spaces of both the linear and nonlinear subunits circuits. The red square is a reference point. The cyan square has the same mean as the red square, but a different contrast. The red circle has the same contrast as the red square but a different mean. There is no overlap of these shapes in the response space of the nonlinear subunits circuit. (B) The stimulus space is color-coded by contrast levels. The response space of the linear subunits circuit overlaps these levels, providing no distinction between them. The nonlinear subunits circuit preserves separate contrast bands in its response space.
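A tiny worked example, in the spirit of the red and cyan squares: two stimuli with the same mean but different contrast. The specific stimulus values, and treating contrast as the difference between the two inputs, are assumptions for illustration.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def linear_subunits_circuit(stimulus):
    """ON/OFF responses when subunits are linear and only outputs are rectified."""
    total = stimulus.sum()
    return np.array([relu(total), relu(-total)])

def nonlinear_subunits_circuit(stimulus):
    """ON/OFF responses when each subunit is rectified before convergence."""
    return np.array([relu(stimulus).sum(), relu(-stimulus).sum()])

low_contrast = np.array([0.5, 0.5])     # mean 0.5, inputs equal
high_contrast = np.array([1.5, -0.5])   # same mean 0.5, inputs differ by 2

lin_low = linear_subunits_circuit(low_contrast)
lin_high = linear_subunits_circuit(high_contrast)
nl_low = nonlinear_subunits_circuit(low_contrast)
nl_high = nonlinear_subunits_circuit(high_contrast)

print("linear subunits:   ", lin_low, lin_high)   # identical responses
print("nonlinear subunits:", nl_low, nl_high)     # distinct responses
```

In this sketch the difference ON minus OFF of the nonlinear subunits circuit still equals the summed input, so the mean is not sacrificed to gain the contrast information.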

Taken together, what this means for the retina is that the compression algorithm it uses might also be the same one that maximizes information about the stimulus distribution. This is especially noticeable when we focus on the nonlinearity. Nonlinear transformations can induce selectivity, or they can produce an efficient encoding of the stimulus. We’re not used to thinking of them as doing both at the same time though because an efficient code indicates that information about the stimulus is maximized whereas selective coding means that some information about the stimulus will have to be discarded or minimized. My study suggests that selective coding at the single cell level may be leveraged to efficiently encode as much information about the stimulus as possible at the level of the whole circuit.

* A full manuscript of this work is now on bioRxiv.

# E/I balance rescues the decoded representation that is corrupted by adaptation

Gabrielle J. Gutierrez and Sophie Deneve

* Update: this work is now published in the journal eLife.

#### Spike-frequency adaptation is part of an efficient code, but how do neural networks deal with the adverse effects on the encoded representations they produce? We use a normative framework to resolve this paradox.

Fig. 1: Adaptation shifts response curve. The shift in neural responses maintains a constant response range for an equivalent area under the stimulus PD curve.

The range of firing rates that a neuron can maintain is limited by biophysical constraints and available metabolic resources. Yet, neurons have to represent inputs whose strength varies by orders of magnitude. Early work by Barlow1 and Laughlin2 hypothesized and demonstrated that sensory neurons in early processing centers adapt their response gain as a function of recent input properties (Fig. 1). This work was instrumental in uncovering a principle of neural encoding in which adapting neural responses maximize information transfer. However, the natural follow-up question concerns the decoding of neural responses after they’ve been subject to adaptation. There’s no question that this kind of adaptation has to result in profound changes to the mapping of neural responses to stimuli3,4 – so how are adapting neural responses interpreted by downstream areas?

By using a normative approach to build a neural network, we show that adapted neural activity can be accurately decoded by a fixed readout unit. This doesn’t require any synaptic plasticity – or re-weighting of the synaptic weights. What it does require, as we’ll show, is a recurrent synaptic structure that promotes E/I balance.

Our approach rests on the premise that nothing is known from the outset about the structure of the network. All we know is the input/output transformation that the network performs. For this study, that I/O function is simply a linear integration of the feedforward input the network receives. Given some input, c(t), we expect some output, x(t), such that $\dot{x}(t) = Ax(t) + c(t)$. The variable, x(t), is called the target signal because it is what we expect the network to produce given the input, but what the network actually puts out is denoted as x̂(t). We assume that the true network output is a linear sum of the activity of the network units, $\hat{x}(t) = \sum_n w_n r_n(t)$, where $r_n(t)$ is the activity of neuron n and $w_n$ is its readout weight. It is this actual network output, x̂(t), that will be compared to the target output, x(t).
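As a concrete reading of these two equations, here is a minimal sketch that integrates the target dynamics with forward Euler and evaluates the linear readout. The scalar value of A, the constant input, the time step, and the example weights and activities are all made-up numbers for illustration.

```python
import numpy as np

dt = 1e-3                      # assumed time step (seconds)
A = -10.0                      # assumed scalar dynamics: a leaky integrator
t = np.arange(0.0, 1.0, dt)
c = np.full_like(t, 10.0)      # assumed constant feedforward input

# Target signal: forward-Euler integration of dx/dt = A x + c.
x = np.zeros_like(t)
for k in range(1, len(t)):
    x[k] = x[k - 1] + dt * (A * x[k - 1] + c[k - 1])

def readout(w, r):
    """Network estimate x_hat = sum_n w_n r_n for readout weights w and activities r."""
    return float(np.dot(w, r))

w = np.array([0.1, 0.2])       # hypothetical readout weights
r = np.array([4.0, 3.0])       # hypothetical neuron activities
x_hat = readout(w, r)

print(f"target at end of run: {x[-1]:.4f}")
print(f"readout estimate:     {x_hat:.4f}")
```

With a negative A, the target settles near -c/A; the network's job is to choose spikes so that the readout stays close to that target.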

With these assumptions, we set up an objective function, E, to be minimized. We want to minimize the representation error of the network as well as the overall neural activity. In other words, we derive a network that from the outset has the imperative to be as accurate as possible while also being efficient. The representation error is the squared difference between the decoded estimate that’s read out from the network, x̂, and the output signal we should expect, x, given the input. The metabolic cost is a quadratic penalty on the firing activity of all the neurons in the network. So the objective looks like this: $E(t) = [x(t) - \hat{x}(t)]^2 + \mu \sum_n r_n(t)^2$

To derive a voltage equation from this objective (see these notes for detailed derivation), we rely on the greedy minimization approach from Boerlin et al.5, which involves setting up an inequality between the objective expression that results when a neuron in the network spikes versus when no spike is fired in the network: E(t|no spike) > E(t|spike). This forces spikes to be informative. A spike may fire only if it lowers the objective, that is, only if firing leaves the combined representation error and metabolic cost lower than staying silent at that time step would.

Fig. 2: Spike-frequency adaptation. A history dependent spiking threshold (green) increases and decays with each spike fired (blue) in response to a constant stimulus (pink).

Knowing that the voltage of a spiking neuron needs to cross a threshold before a spike is fired, we let this inequality represent that concept so that after some algebra, the left-hand side expression is taken to be the voltage and the right-hand side is the spiking threshold. In other words, V > threshold is the condition for spiking. Therefore, $V_i = w_i(x - \hat{x}) > \frac{w_i^2 + \mu}{2} + \mu r_i = \text{threshold}$.
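The algebra can be sketched as follows. It assumes, in the spirit of the greedy scheme of Boerlin et al., that a spike from neuron i raises its activity $r_i$ by one and therefore shifts the readout $\hat{x}$ by $w_i$; these update rules are assumptions of this sketch rather than something spelled out above.

```latex
\begin{aligned}
E(t \mid \text{no spike}) &> E(t \mid \text{spike by } i) \\
(x - \hat{x})^2 + \mu \sum_n r_n^2
  &> (x - \hat{x} - w_i)^2 + \mu \sum_{n \neq i} r_n^2 + \mu (r_i + 1)^2 \\
0 &> -2 w_i (x - \hat{x}) + w_i^2 + \mu (2 r_i + 1) \\
w_i (x - \hat{x}) &> \frac{w_i^2 + \mu}{2} + \mu r_i
\end{aligned}
```

The third line comes from expanding both squares and cancelling the terms shared by the two sides. Dividing by two and rearranging gives the voltage on the left and the threshold on the right.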

Let’s first take a look at the spiking threshold. Notice how it is a function of the neuron activity variable, r(t). This means we’ve derived a dynamic spiking threshold that increases as a function of past spiking activity (Fig. 2). Thus, spike-frequency adaptation fell into our lap from first principles. The dynamic part of this threshold is a direct result of the metabolic cost that was included in the objective function.
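To see the spike rule and the dynamic threshold in action, here is a minimal discrete-time sketch. Several choices are assumptions of mine and not taken from the study: the activity r decays exponentially with time constant `tau`, at most one neuron spikes per time step, the target is supplied directly as a pulse, and the weights and `mu` are arbitrary. It does not try to reproduce the firing order or the currents of the full derived network.

```python
import numpy as np

def simulate(x_target, w, mu=1e-4, tau=0.1, dt=1e-3):
    """Greedy spiking: a neuron fires only if V_i exceeds its dynamic threshold."""
    n_steps, n_neurons = len(x_target), len(w)
    r = np.zeros(n_neurons)                      # filtered spike trains r_i(t)
    x_hat = np.zeros(n_steps)
    spikes = np.zeros((n_steps, n_neurons), dtype=bool)
    delta_E = []                                 # change in objective per spike
    for k in range(n_steps):
        r *= 1.0 - dt / tau                      # assumed exponential decay of activity
        err = x_target[k] - w @ r
        V = w * err                              # voltage V_i = w_i (x - x_hat)
        threshold = (w**2 + mu) / 2 + mu * r     # rises with recent activity
        i = int(np.argmax(V - threshold))        # assumed: one spike per step at most
        if V[i] > threshold[i]:
            E_before = err**2 + mu * np.sum(r**2)
            r[i] += 1.0                          # spike increments activity by one
            E_after = (x_target[k] - w @ r) ** 2 + mu * np.sum(r**2)
            delta_E.append(E_after - E_before)
            spikes[k, i] = True
        x_hat[k] = w @ r
    return x_hat, spikes, np.array(delta_E)

dt = 1e-3
t = np.arange(0.0, 1.0, dt)
x_target = np.where((t >= 0.2) & (t < 0.8), 1.0, 0.0)   # pulse stimulus
w = np.array([0.05, 0.10, 0.15, 0.20])                   # hypothetical weights

x_hat, spikes, delta_E = simulate(x_target, w, dt=dt)
window = (t >= 0.4) & (t < 0.8)
mean_abs_err = float(np.mean(np.abs(x_target[window] - x_hat[window])))
print("spikes per neuron:", spikes.sum(axis=0))
print(f"mean |x - x_hat| during pulse: {mean_abs_err:.3f}")
```

Because a spike is only allowed when V exceeds the threshold, every spike in this sketch lowers the objective E, which is the sense in which spikes are forced to be informative.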

Fig. 3: Schematic of derived network.

Taking the derivative of the voltage expression gives us an equation where each term can be interpreted as a current source to the neuron. The resulting network is diagrammed in Figure 3 where you’ll see that the input weight to a given neuron is the same as its readout weight and proportional to the recurrent weights it receives as well as its own self-connection (i.e. autapse). Because our optimization procedure didn’t specify values for these weights – just the relationships between them – the weight parameter, wi, for any given neuron i is a free parameter. But the value of that parameter has consequences for the adaptation properties of the neuron in question (Fig. 4).

Fig. 4: Adaptation profiles for heterogeneous neurons. The weight parameter determines how excitable a neuron is and its time constant of adaptation.

Neurons with a large weight not only have higher baseline firing thresholds than their small weight counterparts, they have stronger self-inhibition. In contrast, small weight neurons are intrinsically closer to threshold, so they have a higher firing frequency out of the gate, but they burn out quickly because of spike-frequency adaptation. From here on, I’ll refer to the neurons with a small weight as excitable and the large weight neurons as mellow. These heterogeneous adaptation profiles have an important role to play in the network we’ve derived.

Fig. 5: Network response to a stimulus pulse. Neurons fire in response to the stimulus (top, raster) with the most excitable neurons firing first (light green) and the mellower neurons pitching in later (dark blue). Despite time-varying activity in the individual neurons, the network output (orange) tracks the target signal (grey).

To illustrate how this panoply of diverse neurons work together to represent a stimulus, take a look at Figure 5 in which a pulse stimulus has been presented to the network. For the duration of the pulse, the network as a whole does a great job of tracking the stimulus, forming a stable representation over that time. Any single neuron individually does not maintain a stable representation of the stimulus, but the network neurons coordinate as an ensemble. The excitable neurons are the first responders, valiantly taking on the early part of the representation. But they quickly become fatigued. That’s when the mellow neurons kick in to take up the slack. This coordinated effort is all thanks to the recurrent connectivity. When a neuron is firing, it is simultaneously inhibiting other neurons, basically informing other neurons that the stimulus has been accounted for and reported to the readout. But when adaptation starts to fatigue that neuron, it dis-inhibits the other neurons. At some point the amount of input that is going unrepresented outweighs the amount of inhibition coming from the active neuron, causing a mellower neuron to be recruited in carrying the stimulus representation.

Fig. 6: E/I balanced currents reduce error. Left, excitatory and inhibitory currents impinging on an example neuron in response to three different stimulus presentations. The neuron in the top plot belongs to a network with random recurrent connections that are not E/I balanced. In the bottom plot, that neuron is part of an E/I balanced network. Right, the representation error for the unbalanced network (grey) is higher than for the balanced network (black).

This connectivity scheme is inherently E/I balanced, meaning that excitatory currents to an individual neuron are closely tracked by the inhibitory currents entering that same neuron (as shown in the left panel in Fig. 6). When the network takes on a random recurrent structure, even though the currents are somewhat balanced over a long time, they aren’t as tightly balanced as in the recurrent connectivity structure that we derived. The balanced recurrent connectivity scheme is also what’s keeping things accurate (Fig. 6, right plot). In fact, the connectivity structure is entirely derived from the error term in the objective.

Now that we have a model with adaptation and E/I balanced connectivity, we use it to model a network that encodes orientation, such as in area V1 in visual cortex. To do this, we made a neural network with two cell types: mellow and excitable.

Fig. 7: Schematic of orientation coding network. Each orientation is represented by a pair of neurons, one excitable and one mellow neuron. Only a few connections coming from the outlined neuron are shown. Inhibitory connections terminate in a bar and excitatory connections terminate in a prong.

Each neuron has a preferred orientation which is set by the complement of input weights it receives. The preferred orientation of each mellow neuron overlaps with the preference of one other excitable neuron. That means that each orientation is preferred by a pair of network neurons, one excitable and one mellow (Fig. 7). It’s worth pointing out how the derived connectivity interacts with neuron preferences. Specifically, neurons with similar preferences inhibit each other most strongly, whereas neurons with opposing preferences excite each other. This seems counterintuitive – and even contrary to the experimental data – but it reflects the effective encoding strategy at work here. Neurons with similar preferences are competing with each other for the chance to report the stimulus to the readout. If all of the neurons reported at once, the readout would be overwhelmed and unable to decode the stimulus as accurately because the representation would too often reflect the intrinsic properties of the active neurons. On the other hand, neurons with opposite preferences can afford to excite each other because it’s almost like a game of chicken. The active neuron is betting that the opposing neuron isn’t receiving strong input and can therefore feel confident that exciting that neuron won’t be enough to bring it to spike. This setup keeps all neurons relatively close to their baseline spiking thresholds so that any given neuron is ready to be recruited at the drop of a hat.

Fig. 8: Tuning curves. The excitable neurons have broader tuning curves (light green) than the mellow neurons (dark blue).

The tuning curves for the excitable and mellow subpopulations reveal their particular characteristics (Fig. 8). Excitable neurons have a broader tuning curve than their mellow counterparts. Their tuning curves are also higher in magnitude than the mellow ones, but both tuning curves were normalized to unity in the figure. These tuning curves represent the early responses of the network neurons to a series of stimulus presentations. By comparing them to the late part of the response to those same stimuli, we can see how the tuning curves change to accommodate the effects of adaptation (Fig. 9). The tuning curve for the late responses in the excitable neurons shows a decrease in the amplitude of the curve near the preferred orientation (left, Fig. 9). This is what most people would expect to see as a result of adaptation. However, the situation for the mellow neurons is counter to those expectations (right, Fig. 9). Their late responses show an increase in activity at the preferred orientation. This is because the excitable neurons are adapted more strongly than the mellow neurons, which means that the mellow neurons have to pitch in to save the representation after the excitable neurons burn out. Thus the mellow neurons’ tuning curves are facilitated due to adaptation, not suppressed.

Fig. 9: Tuning curves change after adaptation. Tuning curves for early responses as shown in Fig. 8 are in grey. After adaptation, the tuning curves are suppressed for the excitable neurons (left, light green), but facilitated for the mellow neurons (right, dark blue).

We showed that E/I balance works hand-in-hand with adaptation to produce a representation that is both efficient and accurate. Sure, we could’ve allowed adaptation to result in a perceptual bias. Our model doesn’t exclude that possibility, but we paid particular attention to the short-term effects of adaptation, and to the subtle changes that adaptation produces in neuron tuning without degrading the network’s ability to accurately encode the stimulus. The bigger picture here is that variability is part of the optimal solution rather than a problem.

Fig. 10: Variability in network neuron responses. The spike rasters from the network are color coded for each stimulus presentation. The stimulus was identical across trials but preceded by a different randomized stimulus sequence. Individual neuron rasters are organized horizontally so that each line represents the spikes from a given neuron.

We illustrate that principle with the overlaid spike rasters in Figure 10 in which the network is presented with the same stimulus on three separate occasions. The only difference between those presentations is the randomized stimulus sequence presented before each one. The history dependence of spike-frequency adaptation produces highly variable neuron responses to the same stimulus over different trials. Despite that variability in the spike timing and firing rate of individual neurons, the network output is very accurate across those three presentations of the stimulus. Adaptation is the catalyst for the redistribution of spikes, while E/I balance is the means by which spiking activity is redistributed in a manner that will preserve the representation. With adaptation enforcing an efficient encoding and E/I balance maintaining an accurate representation, the network can have its cake and eat it too.

1. Barlow, H. B. Reconstructing the visual image in space and time. Nature 279, 189–190 (1979).
2. Laughlin, S. A Simple Coding Procedure Enhances a Neuron’s Information Capacity. Z. Naturforsch., C, Biosci. 36, 910–912 (1981).
3. Series, P., Stocker, A. A. & Simoncelli, E. P. Is the Homunculus ‘Aware’ of Sensory Adaptation? Neural Comput 21, 3271–3304 (2009).
4. Solomon, S. G. & Kohn, A. Moving Sensory Adaptation beyond Suppressive Effects in Single Neurons. Current Biology 24, R1012–R1022 (2014).
5. Boerlin, M., Machens, C. K. & Denève, S. Predictive Coding of Dynamical Variables in Balanced Spiking Networks. PLoS Comput Biol 9, e1003258–16 (2013).