Mutual information based retinal coding model derivation

This post aims to derive the mutual information object for the retinal coding model originally introduced in 2011 by Yan Karklin and Eero Simoncelli in a paper titled Efficient coding of natural images with a population of noisy Linear- Nonlinear neurons. It is a fantastic exemplar of theoretical neuroscience: biology guides the model’s constraints, and theory guides its objective, ultimately resulting in the functional properties of the retina.

Subsequent papers have used the model originally introduced by Karkin and Simoncelli to explore other aspects of retinal coding. I list the following, as they have guided the form of the model and its derivation:

  • Inter-mosaic coordination of retinal receptive fields Nature 2021 by Roy et al.
  • Scene statistics and noise determine the relative arrangement of receptive field mosaics PNAS 2021 by Jun, Field, and Pearson.

Given my interest in retinal coding, I decided it would be worthwhile to implement the model myself since I had some ideas for extensions. However, when it came down to translating equations into code, I found myself in a pickle. And this pickle was a major exercise of error correction.

In this post, I’m going to attempt to derive the model and it’s final object from first principles, which will serve as a reference for myself and, hopefully, anyone else who decides they’re also going to implement a similar model.


We assume a simple linear-nonlinear model for the transformation of input images to retinal ganglion cell (RGC) outputs. Mathematically, this model is realized as below:

\[r_i = f_i(\mathbf{w}_i^T(\mathbf{x}+\mathbf{n}_{in})) + n_{out}\]

Here, $\mathbf{x}$ denotes the input image vector, and $i$ denotes in neuron index. The neuron computes is response $r_i$, by taking the inner product between noisy input image $\mathbf{x} + \mathbf{n}{in}$ and its recptive field $\mathbf{w}_i$. The inner product can be thought as a membrane potential, which then goes through non-linearity $f_i$, which then has additional noise added. Both noise sources (image noise and neural noise) are modelled as isotropic Gaussians with variance $\sigma{in}$, and $\sigma_{out}$, respectively.

The neuron non-linearity takes the form of:

\[f_i(y_i) = \gamma_i \log(1+e^{\beta(y_i-\theta_i)})/\beta\]

where the scaling parameter, $\gamma_i$, and shifting parameter, $\theta_i$, allow each neuron adapt its nonlinearity to the distribution of its inputs, $\mathbf{w}i^T(\mathbf{x}+\mathbf{n}{in})$. The free parameters of the model are: ${\mathbf{w}_i}$, ${\gamma_i}$, and ${\theta_i}$, and these are learned by maximizing the mutual information between the input images, $X = {\mathbf{x}}$, and their corresponding output responses, $R= {\mathbf{r}}$ Mathematically, this is realized below:

\[I(X;R) = H(X) - H(X|R)\]
Where $H(X)$ is the entropy of the data, and $H(X R)$ is the entropy of $X$ conditioned on $R$. The entropy of the data is a fixed value, thus maximizing the mutual information is tantamount to minimizing the conditional entropy $H(X R)$. Intuitively, we can think of $H(X R)$ as a metric over the variance between elements of $X$ and their resulting $\mathbf{r}$s. If $H(X R)$ is large, then there are a large number of $\mathbf{x}$s that explain response $\mathbf{r}$, but if the conditional entropy is low, then the opposite is true: Given an $\mathbf{r}$, we can state its corresponding $\mathbf{x}$ with confidence. In other words, the posterior, $p(\mathbf{x} \mathbf{r})$, is highly peaked. However, in the current formulation of output noise in the linear-nonlinear model, $n_{out}$ can be easily overcome by increasing the magnitude of $ f(\mathbf{w}i^T(\mathbf{x}+\mathbf{n}{in}))$. To prevent this, a mean firing constraint is placed on $r_i$. Thus the objective of the model is:
\[\min H(X|R)\] \[\text{s.t. } \mathbb{E}_X [r_i]=1, \ \ i = 1,...,n\]

Updated: