
Recurrent model of visual attention

This post implements the Recurrent Models of Visual Attention paper by Mnih et al., 2014. Like the authors, we refer to this algorithm as the Recurrent Attention Model (RAM).

RAM

The RAM continues the line of work on attention models for images. Conventional approaches to image classification scale poorly with image size. Humans do not absorb images in one shot: we scan the image and attend to the parts of interest. The RAM models this attention-seeking behavior.

Our previous posts treat attention models too. The DRAW post attends to images via Gaussian filters. Another post shows three implementations of attention using feature keys. In the RAM paper, the authors observe that humans decide where to look. The eye then focuses on a thumb-sized patch, and all other information gets blurred.

This processing forms the basis of the RAM. A fovea-like extract centers on a point that is conditioned on the state of a network. Concretely, the hidden state of an LSTM maps to a coordinate vector, and the fovea-like extract at that coordinate forms the input to the LSTM at the next time-step.
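To make this loop concrete, here is a minimal numpy sketch of such a fovea-like extract. The function name extract_glimpse, the 8x8 patch size, and the strided downsampling are illustrative assumptions rather than the code in this repository; TensorFlow ships a comparable op as tf.image.extract_glimpse.

```python
import numpy as np

def extract_glimpse(image, loc, size=8, num_scales=3):
    """Crop num_scales patches centered at loc (in [-1, 1] coordinates),
    each twice as wide as the previous, and downsample all of them to
    size x size. The innermost patch stays sharp; the outer patches get
    blurred by the downsampling, mimicking the fovea."""
    h, w = image.shape
    cy = int((loc[0] + 1) / 2 * h)  # map [-1, 1] to pixel rows
    cx = int((loc[1] + 1) / 2 * w)  # map [-1, 1] to pixel columns
    patches = []
    for s in range(num_scales):
        half = size * (2 ** s) // 2
        # Pad so that glimpses near the border remain valid crops
        padded = np.pad(image, half, mode='constant')
        patch = padded[cy:cy + 2 * half, cx:cx + 2 * half]
        stride = max(1, patch.shape[0] // size)
        patches.append(patch[::stride, ::stride][:size, :size])
    return np.stack(patches)  # shape: (num_scales, size, size)

glimpse = extract_glimpse(np.random.rand(28, 28), loc=np.zeros(2))
print(glimpse.shape)  # (3, 8, 8)
```

Stacking the scales gives the network a sharp view of the center and a coarse view of the surroundings, so a small glimpse can still summarize a large image.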

Our model of the fovea is non-differentiable, so we cannot backpropagate through the coordinate vector. Instead, the stochasticity in the coordinate vector allows us to use the REINFORCE rule. For this reinforcement-learning agent, the reward follows from a correct prediction after a fixed number of time-steps.
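As a sketch of that rule, assume the coordinates are sampled from a fixed-variance Gaussian centered on the network's output; the sigma, reward, and baseline values below are assumptions for illustration. REINFORCE weights the log-likelihood gradient by the baseline-corrected reward.

```python
import numpy as np

def reinforce_grad(mean, sampled_loc, reward, baseline, sigma=0.1):
    """REINFORCE gradient w.r.t. the policy mean for a Gaussian
    location policy N(mean, sigma^2 I):
        grad = (R - b) * d/d(mean) log p(sampled_loc | mean)
    and for a Gaussian, d log p / d mean = (loc - mean) / sigma**2."""
    return (reward - baseline) * (sampled_loc - mean) / sigma ** 2

mean = np.array([0.1, -0.2])                   # network's proposed location
sampled_loc = mean + 0.1 * np.random.randn(2)  # sample with sigma = 0.1
grad = reinforce_grad(mean, sampled_loc, reward=1.0, baseline=0.5)
```

The baseline does not change the expected gradient, but it reduces its variance, which matters because the reward only arrives at the final time-step.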

Results

The implementation builds upon this work, and our visualization resembles his plots.

[RAM_gif: animation of the glimpse trajectory]

The red squares highlight where the network centers its glimpses. I trained this network on an old laptop for a couple of hours. With more computing power, you might decrease the glimpse size and allow for more time-steps. Those changes would make the RAM resemble the human visual system more closely. The corresponding classifier achieves 87% accuracy on MNIST.

This result makes us wonder whether the attention follows the digit or merely follows from random perturbations. Therefore, we translate the digits within a larger 60x60 image.

[show_translate: example of digits translated on the 60x60 canvas]
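As a minimal sketch of this preprocessing step, the snippet below pastes a 28x28 digit at a uniformly random position on a blank 60x60 canvas; the function name translate_digit and the uniform placement are assumptions for illustration.

```python
import numpy as np

def translate_digit(digit, canvas_size=60, rng=np.random):
    """Place a digit (e.g. 28x28 MNIST) at a random position on a
    blank canvas_size x canvas_size canvas."""
    h, w = digit.shape
    canvas = np.zeros((canvas_size, canvas_size), dtype=digit.dtype)
    top = rng.randint(0, canvas_size - h + 1)
    left = rng.randint(0, canvas_size - w + 1)
    canvas[top:top + h, left:left + w] = digit
    return canvas
```

If the glimpses keep landing on the digit rather than on the empty canvas, the attention is genuinely following the content.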

Discussion

Initially, I learned from this implementation. This list summarizes my main changes:

As always, I am curious to hear any comments and questions. Reach me at romijndersrob@gmail.com