Mutual information based retinal coding model derivation

This post aims to derive the mutual information objective for the retinal coding model originally introduced in 2011 by Yan Karklin and Eero Simoncelli in a paper titled Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons. It is a fantastic exemplar of theoretical neuroscience: biology guides the model’s constraints, theory guides its objective, and together they give rise to the functional properties of the retina.

Subsequent papers have used the model originally introduced by Karklin and Simoncelli to explore other aspects of retinal coding. I list the following, as they have heavily guided the form of the model in this post and its derivation:

  • Inter-mosaic coordination of retinal receptive fields Nature 2021 by Roy et al.
  • Scene statistics and noise determine the relative arrangement of receptive field mosaics PNAS 2021 by Jun, Field, and Pearson.

Given my interest in retinal coding, I decided it would be worthwhile to implement the model myself, since I had some ideas for extensions. However, when it came down to translating equations into code, I found myself in a pickle, and that pickle turned into a major exercise in error correction.

In this post, I’m going to attempt to derive the model and its final objective from first principles, which will serve as a reference for myself and, hopefully, anyone else who decides to implement a similar model.

Model definition

We assume a simple linear-nonlinear model for the transformation of input images to retinal ganglion cell (RGC) outputs, which can be seen above. Mathematically, this model is realized as below:

\[r_i = f_i(\mathbf{w}_i^T(\mathbf{x}+\mathbf{n}_{in})) + n_{out}\]

Here, $\mathbf{x}$ denotes the input image vector, and $i$ denotes the neuron index. The neuron computes its response $r_i$ by taking the inner product between the noisy input image \(\mathbf{x}+\mathbf{n}_{in}\) and its receptive field $\mathbf{w}_i$. This inner product can be thought of as a membrane potential, which is passed through the nonlinearity $f_i$ and then corrupted by additional noise. Both noise sources (image noise and neural noise) are modeled as isotropic Gaussians with variances \(\sigma_{in}^2\) and \(\sigma_{out}^2\), respectively.

The neuron non-linearity takes the form of a softplus rectifying function:

\[f_i(y_i) = \gamma_i \log(1+e^{\beta(y_i-\theta_i)})/\beta\]

where the scaling parameter, $\gamma_i$, and shifting parameter, $\theta_i$, allow each neuron to adapt its nonlinearity to the distribution of its inputs, \(\mathbf{w}_i^T(\mathbf{x}+\mathbf{n}_{in})\). The \(\beta\) parameter determines how “pointy” the elbow of the rectifying unit is, and can be fixed to a constant value. This nonlinearity’s derivative conveniently takes the form of a sigmoid, which will come in handy later when we make a first-order approximation to $f$. Thus, the free parameters of the model are \(\{\mathbf{w}_i\}\), \(\{\gamma_i\}\), and \(\{\theta_i\}\), and they are learned by maximizing the mutual information between the input images, $\mathbf{x}\in X$, and their corresponding output responses, ${\mathbf{r}} \in R$. Mathematically, this is realized below:

\[I(X;R) = H(X) - H(X\vert R)\]

Where \(H(X)\) is the entropy of the data, and \(H(X\vert R)\) is the entropy of \(X\) conditioned on \(R\). The entropy of the data is a fixed value, thus maximizing the mutual information is tantamount to minimizing the conditional entropy \(H(X\vert R)\). Intuitively, \(H(X\vert R)\) measures how much uncertainty about \(\mathbf{x}\) remains once we have observed its response \(\mathbf{r}\). If \(H(X\vert R)\) is large, then many different \(\mathbf{x}\) could explain a given response \(\mathbf{r}\); if the conditional entropy is low, then the opposite is true: given an \(\mathbf{r}\), we can state its corresponding \(\mathbf{x}\) with confidence. In other words, the posterior, \(P(\mathbf{x}\vert \mathbf{r})\), is highly peaked. However, in the current formulation, the output noise on \(r_i\) can easily be overcome by increasing the magnitude of \(f_i(\mathbf{w}_i^T(\mathbf{x}+\mathbf{n}_{in}))\). To prevent this, a mean firing-rate constraint is placed on $r_i$. Thus the constrained optimization objective of the model is:

\[\min H(X\vert R) \quad \text{s.t. } \ \mathbb{E}_X [r_i]=1, \ \ i = 1,...,n\] \[\def\*#1{\mathbf{#1}}\]
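Before turning to how we compute \(H(X\vert R)\), here is a minimal PyTorch sketch of the forward model defined above. This is just a reference implementation of my own; the function names, the default \(\beta\), and the noise levels are placeholders, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def softplus_nl(y, gamma, theta, beta=2.5):
    """f_i(y) = gamma_i * log(1 + exp(beta * (y - theta_i))) / beta."""
    return gamma * F.softplus(y - theta, beta=beta)

def softplus_nl_deriv(y, gamma, theta, beta=2.5):
    """df_i/dy = gamma_i * sigmoid(beta * (y - theta_i)); this sigmoid shows up later as G(x)."""
    return gamma * torch.sigmoid(beta * (y - theta))

def ln_response(x, W, gamma, theta, sigma_in=0.4, sigma_out=2.0, beta=2.5):
    """Noisy LN response r = f(W(x + n_in)) + n_out for a single image x.

    Shapes: x is (n_pixels,), W is (n_neurons, n_pixels), gamma and theta are (n_neurons,).
    """
    u = W @ (x + sigma_in * torch.randn_like(x))   # noisy membrane potentials w_i^T (x + n_in)
    r = softplus_nl(u, gamma, theta, beta)
    return r + sigma_out * torch.randn_like(r)     # add output (neural) noise
```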

Now the question is: how do we compute \(H(X\vert R)\)?

Computing conditional entropy

Because the posterior over images given responses is Gaussian (see Appendix 1), we can compute the conditional entropy via the entropy of a multivariate Gaussian,

\[H(X\mid R)=\mathbb{E}_{X}\left[\frac{1}{2} \log \text{det}\left(2 \pi e \mathbf{C}_{\mathbf{x} \mid \mathbf{r}}\right)\right]\]

This might look complicated, but there is one key component that will guide intuition: \(\text{det}\left(2 \pi e \mathbf{C}_{\mathbf{x} \mid \mathbf{r}}\right)\). Recall that the determinant measures the volume spanned by the column vectors of the matrix it acts on. Thus, minimizing \(\text{det}\left(2 \pi e \mathbf{C}_{\mathbf{x} \mid \mathbf{r}}\right)\) shrinks the posterior covariance, making the posterior maximally peaked and minimizing uncertainty in \(\*x\).

With that, what is the equation for \(\*C_{\*x\mid\*r}\)? It is as follows:

\[\*C_{\*x\mid\*r} = \left(\*C_{\*x}^{-1}+\*W^{\top} \*G(\*x)\, \*C_{\*r \mid \*x}^{-1}\, \*G(\*x)\, \*W\right)^{-1}\]

Where did that monstrosity come from?? Find out in Appendix 1, but I can assure you that we already have, or can compute, every one of these matrices given an input image. A rather undesirable property of directly computing \(\*C_{\*x\mid\*r}\) and then taking the \(\log\text{det}\) is that we have to compute two matrix inverses. We can get around this by using some properties of the determinant and applying the matrix inversion lemma, sometimes known as the Woodbury matrix identity. With the finagling in Appendix 2, we can remove these inverses and arrive at the following expression for the conditional entropy (dropping additive constants and an overall factor of \(1/2\), neither of which affects the optimization),

\[\begin{align} H(X\mid R) &= -\mathbb{E}_{X}\left[ \log {\text{det} (\mathbf{A})} -\log{\text{det}( \mathbf{B})}\right] \\ \mathbf{A} &= \mathbf{G}(\mathbf{x}) \mathbf{W}(\mathbf{C_x}+\mathbf{C}_{in}) \mathbf{W}^{\top} \mathbf{G}(\mathbf{x})+\mathbf{C}_{out} \\ \mathbf{B} &= \mathbf{G}(\mathbf{x}) \mathbf{W} \mathbf{C}_{in} \mathbf{W}^{\top} \mathbf{G}(\mathbf{x})+\mathbf{C}_{out} \end{align}\]

Great! We no longer have to take any matrix inverses, and can instead just compute two log-determinants.
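To make the bookkeeping concrete, here is a minimal sketch of this per-image computation in PyTorch. The function name, shapes, and default noise levels are my own assumptions; \(\mathbf{G}(\mathbf{x})\) is the diagonal matrix of nonlinearity derivatives defined in Appendix 1.

```python
import torch

def cond_entropy_term(x, W, gamma, theta, C_x, sigma_in=0.4, sigma_out=2.0, beta=2.5):
    """Per-image contribution to H(X|R): -(log det A - log det B), up to the
    additive constants and the 1/2 factor dropped in the text.

    Shapes: x is (n_pixels,), W is (n_neurons, n_pixels), C_x is (n_pixels, n_pixels).
    """
    n, d = W.shape
    C_in = sigma_in**2 * torch.eye(d, dtype=W.dtype)
    C_out = sigma_out**2 * torch.eye(n, dtype=W.dtype)
    # G(x): diagonal of f'(Wx), a sigmoid (see Appendix 1)
    G = torch.diag(gamma * torch.sigmoid(beta * (W @ x - theta)))
    A = G @ W @ (C_x + C_in) @ W.T @ G + C_out
    B = G @ W @ C_in @ W.T @ G + C_out
    return -(torch.logdet(A) - torch.logdet(B))
```

Averaging this term over a batch of images gives a Monte Carlo estimate of the conditional entropy (up to the constants noted above).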

Going back to our constrained optimization problem (minimizing the conditional entropy subject to each neuron’s mean firing rate being one), we can convert it to an unconstrained optimization problem using a quadratic penalty. See Chapter 17 of Numerical Optimization by Nocedal and Wright for more details on this approach and the closely related augmented Lagrangian method.

Doing so results in the following objective:

\[\min_{\{\mathbf{w}_i\},\ \{\gamma_i\},\ \{\theta_i\}} H(X\mid R) + \mu\sum_i (\langle r_i\rangle-1)^2\]

Here, \(\mu\) is a parameter that weights the constraint term and is selected by the user. One approach to solving this problem is to compute the gradients of the objective w.r.t. the parameters you’re optimizing over by hand… I have not attempted this, as this whole process has been hairy enough, and instead just use PyTorch autograd. In Karklin and Simoncelli’s words: “We omit the gradients here, as they are obtained using standard methods but do not yield easily interpretable update rules.” I’ll take their word for it.
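For completeness, here is a rough sketch of what that autograd-based optimization might look like. Everything here is an assumption on my part: `cond_entropy_term` is the helper sketched earlier, `image_patches` is a hypothetical (num_images, n_pixels) tensor of zero-mean patches, and the hyperparameters are arbitrary.

```python
import torch
import torch.nn.functional as F

n_neurons, n_pixels, mu, batch_size = 100, 256, 1.0, 64
W = torch.nn.Parameter(0.01 * torch.randn(n_neurons, n_pixels))
gamma = torch.nn.Parameter(torch.ones(n_neurons))
theta = torch.nn.Parameter(torch.zeros(n_neurons))
optimizer = torch.optim.Adam([W, gamma, theta], lr=1e-3)

# image_patches: hypothetical (num_images, n_pixels) tensor of zero-mean image patches
C_x = image_patches.T @ image_patches / image_patches.shape[0]   # empirical image covariance

for step in range(10000):
    batch = image_patches[torch.randint(image_patches.shape[0], (batch_size,))]
    # Conditional entropy term, averaged over the batch
    H_term = torch.stack([cond_entropy_term(x, W, gamma, theta, C_x) for x in batch]).mean()
    # Quadratic penalty pushing each neuron's mean firing rate toward 1
    rates = gamma * F.softplus(batch @ W.T - theta, beta=2.5)
    penalty = mu * ((rates.mean(dim=0) - 1.0) ** 2).sum()
    loss = H_term + penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```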

Appendix 1: Derivation of posterior covariance

Let \(\*u = \*W(\*x + \*n_x)\), where the rows of \(\*W\) are the receptive fields \(\*w_1^T,...,\*w_n^T\) and \(\*n_x \equiv \*n_{in}\) is the input noise. Since \(\*u\) is a Gaussian random variable, we can decompose it into its mean (given \(\*x\)) and a fluctuation about that mean: \[\*u = \bar{\*u} + \Delta \*u\]

Where,

\[\begin{align} \bar{\*u} &= \*W\*x \\ \Delta{\*u} &= \*W\*n_x \\ \*C_{\*u\vert \*x} = <\Delta{\*u}\Delta{\*u}^T> &= \*W\*C_{in}\*W^T \\ \end{align}\]

Here, \(\*C_{in} = \sigma_{in}^2\mathbf{I}\). Now consider the response through the non-linearity, \(f\), which can be approximated via a first order Taylor expansion.

\[\begin{align} \*r &= f(\bar{\*u} + \Delta \*u) + \*n_r \\ \*r &\approx f(\bar{\*u}) + \*G(\mathbf{x})\Delta\*u + \*n_r \end{align}\]

Where \(\*G(\mathbf{x})=\text{diag}(f'(\*W\*x)) = \text{diag}(f'(\bar{\*u}))\), since the expansion is taken about \(\bar{\*u}\), and \(\*n_r \equiv \*n_{out}\) is the output noise. Recall that \(df_i/dy\) is a sigmoid function (see the above plot)! To characterize the properties of \(\*r\), consider its decomposition into a mean and a fluctuation, i.e. \(\*r = \bar{\*r} + \Delta\*r\).

\[\begin{align} \bar{\*r} &= f(\bar{\*u}) \\ \Delta\*r &= \*G(\mathbf{x})\Delta\*u + \*n_r \\ \*C_{\*r\vert \*x} = <\Delta \*r \Delta \*r^T> &= <(\*G(\mathbf{x})\Delta\*u + \*n_r)(\*G(\mathbf{x})\Delta\*u + \*n_r)^T> \\ \*C_{\*r\vert \*x} &= \*G(\mathbf{x})\*W\*C_{in}\*W^T\*G(\mathbf{x}) + \*C_{out} \\ \end{align}\]

Where \(\*C_{out} = \sigma_{out}^2\*I\). Now that we have computed the likelihood covariance \(\*C_{\*r\vert \*x}\), we can put it together with our prior over the images, which is also Gaussian. The prior covariance, \(\mathbf{C_x}\), is computed empirically from the dataset of images, which we’ll assume has zero mean. Since both the likelihood and the prior are Gaussian, the posterior will also be Gaussian:

\[\begin{align} P(\*x|\*r) &\propto P(\*r|\*x)P(\*x) \\ &\propto \frac{1}{Z_1} \exp \left[ -\frac{1}{2}(\*r-f(\*W\*x))^T\*C_{\*r|\*x}^{-1}(\*r-f(\*W\*x)) \right] \cdot \frac{1}{Z_2}\exp\left[-\frac{1}{2}\*x^T\*C_{\*x}^{-1}\*x \right] \end{align}\]

Let’s now consider fixing \(\*r\) to a value and writing the image in terms of its mean value and the variation around that mean:

\[\*x = \bar{\*x} + \Delta\*x\]

Plugging this into our linearized non-linearity, we get

\[\begin{align} f(\*W\*x) &= f(\*W(\bar{\*x} + \Delta\*x)) \\ &\approx f(\*W\bar{\*x}) + \*G({\*x})\*W\Delta\*x \\ \end{align}\]

Taking the negative log of the posterior \(P(\*x\mid\*r)\), dropping the constants, and plugging in the result above, we can read off the covariance of the posterior (recall that the images are assumed zero mean):

\[\begin{align} (\*r-f(\*W\*x))^T&\*C_{\*r\mid\*x}^{-1}(\*r-f(\*W\*x)) + \*x^T\*C_{\*x}^{-1}\*x \\ \approx (\*r-f(\*W\bar{\*x}) - \*G({\*x})\*W\Delta\*x)^T&\*C_{\*r\mid\*x}^{-1}(\*r-f(\*W\bar{\*x}) - \*G({\*x})\*W\Delta\*x) + (\bar{\*x} + \Delta\*x)^T\*C_{\*x}^{-1}(\bar{\*x} + \Delta\*x) \end{align}\]

Collecting only the terms that are quadratic in \(\Delta\*x\) (the remaining terms set the posterior mean, not its covariance),

\[\begin{align} (\*G({\*x})\*W\Delta\*x)^T&\*C_{\*r\mid\*x}^{-1}(\*G({\*x})\*W\Delta\*x) + \Delta\*x^T\*C_{\*x}^{-1}\Delta\*x \\ = \Delta\*x^T(\*W^T\*G({\*x})&\*C_{\*r\mid\*x}^{-1}\*G({\*x})\*W + \*C_{\*x}^{-1})\Delta\*x \end{align}\]

Matching this against the standard Gaussian quadratic form \(\Delta\*x^T\*C_{\*x\mid\*r}^{-1}\Delta\*x\), we can read off the posterior covariance:

\[\*C_{\*x \mid \*r} = (\*W^T\*G({\*x})\*C_{\*r\mid\*x}^{-1}\*G({\*x})\*W + \*C_{\*x}^{-1})^{-1}\]

Appendix 2: Applying the matrix inversion lemma (Woodbury matrix identity)

Given the covariance of the posterior, \(\*C_{\*x\vert\*r}\), derived in Appendix 1, the conditional entropy requires

\[\log\text{det}\, \*C_{\*x\mid\*r} = -\log\text{det}\left(\*C_{\*x}^{-1} + \*W^T\*G(\*x)\*C_{\*r\mid\*x}^{-1}\*G(\*x)\*W\right)\]

Applying the matrix determinant lemma (the determinant counterpart of the Woodbury identity) to the right-hand side gives

\[\text{det}\left(\*C_{\*x}^{-1} + \*W^T\*G(\*x)\*C_{\*r\mid\*x}^{-1}\*G(\*x)\*W\right) = \text{det}\left(\*C_{\*r\mid\*x} + \*G(\*x)\*W\*C_{\*x}\*W^T\*G(\*x)\right)\text{det}(\*C_{\*r\mid\*x}^{-1})\,\text{det}(\*C_{\*x}^{-1})\]

Substituting \(\*C_{\*r\mid\*x} = \*G(\*x)\*W\*C_{in}\*W^T\*G(\*x) + \*C_{out}\), the first factor is exactly \(\text{det}(\mathbf{A})\) from the main text, and \(\text{det}(\*C_{\*r\mid\*x}) = \text{det}(\mathbf{B})\). Therefore,

\[\log\text{det}\, \*C_{\*x\mid\*r} = \log\text{det}\,\*C_{\*x} + \log\text{det}(\mathbf{B}) - \log\text{det}(\mathbf{A})\]

Plugging this into \(H(X\mid R)=\mathbb{E}_{X}\left[\frac{1}{2} \log \text{det}\left(2 \pi e \*C_{\*x \mid \*r}\right)\right]\) and dropping the terms that do not depend on the model parameters, along with the overall factor of \(1/2\), we arrive at the inverse-free expression used in the main text:

\[H(X\mid R) = -\mathbb{E}_{X}\left[\log\text{det}(\mathbf{A}) - \log\text{det}(\mathbf{B})\right] + \text{const.}\]

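As a sanity check on the algebra above, here is a small self-contained test of my own, with arbitrary random matrices, verifying that the posterior covariance from Appendix 1 and the inverse-free log-determinant expression agree numerically:

```python
import torch

torch.manual_seed(0)
n, d = 5, 8                                          # neurons, pixels
W = torch.randn(n, d, dtype=torch.float64)
G = torch.diag(torch.rand(n, dtype=torch.float64) + 0.1)
L = torch.randn(d, d, dtype=torch.float64)
C_x = L @ L.T + torch.eye(d, dtype=torch.float64)    # random SPD image covariance
C_in = 0.2 * torch.eye(d, dtype=torch.float64)
C_out = 0.5 * torch.eye(n, dtype=torch.float64)

# Appendix 1: posterior covariance, computed with two explicit inverses
C_r_given_x = G @ W @ C_in @ W.T @ G + C_out
C_x_given_r = torch.inverse(torch.inverse(C_x) + W.T @ G @ torch.inverse(C_r_given_x) @ G @ W)

# Appendix 2: inverse-free identity, log det C_{x|r} = log det C_x + log det B - log det A
A = G @ W @ (C_x + C_in) @ W.T @ G + C_out
B = C_r_given_x
lhs = torch.logdet(C_x_given_r)
rhs = torch.logdet(C_x) + torch.logdet(B) - torch.logdet(A)
print(torch.allclose(lhs, rhs))   # should print True (exact up to floating-point error)
```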