AIE

How do capsule neural networks work?

asked 3 months ago

21.7K

Geoffrey Hinton has been researching something he calls "capsules theory" in neural networks. What is it? How do capsule neural networks work?

neural-networks

capsule-neural-network

6 Answers

AIE

• answered 3 months ago

To supplement the previous answer: there is a paper on this that is mostly about learning low-level capsules from raw data, but explains Hinton's conception of a capsule in its introductory section: http://www.cs.toronto.edu/~fritz/absps/transauto6.pdf

It's also worth noting that the link to the MIT talk in the answer above seems to be working again.

According to Hinton, a "capsule" is a subset of neurons within a layer that outputs both the probability that an entity is present within a limited domain and a set of "instantiation parameters", which may include the pose of the entity relative to a canonical version.

The parameters output by low-level capsules are converted into predictions for the pose of the entities represented by higher-level capsules, which are activated if the predictions agree and output their own parameters (the higher-level pose parameters being averages of the predictions received).
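As a rough sketch of that last step (not code from the paper): suppose each low-level capsule has already produced a prediction for the pose of one higher-level entity. The higher-level capsule switches on only if the predictions cluster tightly, and its own pose is their average. The function name, the spread measure and the `tolerance` threshold are all assumptions made for illustration.

```python
def higher_level_capsule(predictions, tolerance=0.1):
    """Return (active, pose) for a parent capsule given child pose predictions.

    predictions: list of equal-length pose vectors, one per lower-level capsule.
    tolerance:   hypothetical threshold on the mean squared deviation.
    """
    n, dims = len(predictions), len(predictions[0])
    # Parent pose: per-dimension average of the predictions received.
    pose = [sum(p[d] for p in predictions) / n for d in range(dims)]
    # Disagreement: mean squared distance of the predictions from that average.
    spread = sum(
        sum((p[d] - pose[d]) ** 2 for d in range(dims)) for p in predictions
    ) / n
    return spread <= tolerance, pose


agreeing = [[1.0, 2.0, 0.5], [1.1, 1.9, 0.5], [0.9, 2.1, 0.5]]
conflicting = [[1.0, 2.0, 0.5], [4.0, -1.0, 0.0], [-3.0, 5.0, 2.0]]

print(higher_level_capsule(agreeing))     # active, pose close to [1, 2, 0.5]
print(higher_level_capsule(conflicting))  # not active
```

Because the predictions live in a multi-dimensional pose space, a tight cluster is unlikely to arise by chance, which is why agreement is treated as strong evidence.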

Hinton speculates that this high-dimensional coincidence detection is what mini-column organization in the brain is for. His main goal seems to be replacing the max pooling used in convolutional networks, in which deeper layers lose information about pose.

AIE

• answered 3 months ago

It does not appear to be published yet; the best material available online is these slides for this talk. (Several people reference an earlier talk with this link, but sadly it's broken at the time of writing this answer.)

My impression is that it's an attempt to formalize and abstract the creation of subnetworks inside a neural network. That is, if you look at a standard neural network, layers are fully connected (that is, every neuron in layer 1 has access to every neuron in layer 0, and is itself accessed by every neuron in layer 2). But this isn't obviously useful; one might instead have, say, n parallel stacks of layers (the 'capsules') that each specializes on some separate task (which may itself require more than one layer to complete successfully).

If I'm imagining its results correctly, this more sophisticated graph topology seems like something that could easily increase both the effectiveness and the interpretability of the resulting network.

AIE

• answered 3 months ago

Capsule Networks have two key ideas:

The first idea is how to represent multi-dimensional entities. Capsule Networks do this by grouping the properties of a feature together ("capsules").
The second is that higher-level features are activated by agreement between lower-level features ("routing by agreement").

First, Capsule Networks partition the image into regions.

For each of these regions, the network assumes that there is at most one instance of a given feature, and that instance is represented by a Capsule.

A Capsule is able to represent an instance of a feature (but only one) and is able to represent all the different properties of that feature, e.g., its (x,y) coordinates, its colour, its movement etc.

The difference from Convolutional Neural Networks (CNNs) is that the Capsules bundle the neurons into groups with multi-dimensional properties, whereas in CNNs the neurons represent single, unrelated scalar properties.

This structured Capsule representation allows you to do "routing by agreement".

To understand this, let's look at the example of a face detector. Here, you could have capsules representing "mouth", "eye", "nose" etc. Since the Capsules are multi-dimensional, you also train them to predict the parameters for the entire face.

Now, if the "mouth", "nose" and "eye" Capsules agree about the parameters of the face, we have a very strong signal that this is a good prediction, since accidental agreement in a high-dimensional space is very unlikely.

You use this to stack the Capsules into deep networks where the activation of a higher-level Capsule is conditioned on agreement between the lower-level Capsules (e.g. the Face capsule being activated by agreement on the face position between the Nose, Mouth and Eye Capsules in the earlier, lower-level layer).
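To make the face example concrete, here is a toy sketch. Each part capsule carries only an (x, y) position, and the "learned" part-to-face transformation is reduced to a fixed offset. The offsets, coordinates and function names are invented for illustration (y grows downward, as in image coordinates); a real capsule network learns transformation matrices instead.

```python
import math

# Hypothetical learned offsets from each part's position to the face centre.
PART_TO_FACE_OFFSET = {"eye": (0.0, 2.0), "nose": (0.0, 0.0), "mouth": (0.0, -2.0)}


def predict_face_centres(parts):
    """Each part capsule votes for the face centre implied by its own (x, y)."""
    return [
        (x + PART_TO_FACE_OFFSET[name][0], y + PART_TO_FACE_OFFSET[name][1])
        for name, (x, y) in parts.items()
    ]


def disagreement(votes):
    """Largest pairwise distance between votes; small means the parts agree."""
    return max(math.dist(a, b) for a in votes for b in votes)


intact_face = {"eye": (5.0, 3.0), "nose": (5.0, 5.0), "mouth": (5.0, 7.0)}
scrambled_face = {"eye": (5.0, 7.0), "nose": (5.0, 5.0), "mouth": (5.0, 3.0)}

print(disagreement(predict_face_centres(intact_face)))     # 0.0
print(disagreement(predict_face_centres(scrambled_face)))  # 8.0
```

Both inputs contain the same parts, so a detector that only checks for their presence would fire on both. Only the intact arrangement makes the votes coincide.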

In contrast to regular feed-forward nets, this requires a bit of iteration in the forward pass through the network, but you can still use back-propagation to train it.

It is an interesting way to add a bit of structure to the data. So far, it looks like they are able to provide better generalization from limited training data.

AIE

• answered 3 months ago

Capsule networks try to mimic, in a machine, Hinton's observations about the human brain. The motivation is that neural networks need better modeling of the spatial relationships between parts. Instead of modeling only the co-occurrence of parts while disregarding their relative positioning, capsule nets try to model the relative transformations of the sub-parts along a hierarchy. This is the equivariance vs. invariance trade-off, as explained above by others.

These networks therefore have a degree of viewpoint/orientation awareness and respond differently to different orientations. This property makes them more discriminative and potentially lets them perform pose estimation, since the latent-space features contain interpretable, pose-specific details.

All this is accomplished by nesting groups of neurons, called capsules, within a layer, instead of stacking yet another layer onto the network. Each capsule outputs a vector instead of the single scalar that an ordinary node outputs.

The crucial contribution of the paper is dynamic routing, which replaces standard max-pooling with a smarter strategy. The algorithm performs something akin to mean-shift clustering on the capsule outputs, so that each output is sent to the appropriate parent in the layer above.
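A minimal pure-Python sketch of the routing loop, following the procedure described in the Dynamic Routing paper as I understand it: coupling coefficients come from a softmax over routing logits, each parent's output is a "squashed" weighted sum of the predictions it receives, and the logits grow by the scalar product between prediction and parent output. The learned transformation matrices are left out; the prediction vectors `u_hat` are given directly and their values are made up.

```python
import math


def squash(s):
    """Shrink s so its length lies in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(b - m) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is child i's prediction vector for parent j.

    Returns the parent outputs v and the coupling coefficients c that were
    used to compute them in the final iteration.
    """
    n_children, n_parents, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_parents for _ in range(n_children)]  # routing logits
    for _ in range(iterations):
        # Each child distributes its output over the parents.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_parents):
            s_j = [
                sum(c[i][j] * u_hat[i][j][d] for i in range(n_children))
                for d in range(dim)
            ]
            v.append(squash(s_j))
        # Raise the logit wherever a prediction agrees with the parent output.
        for i in range(n_children):
            for j in range(n_parents):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c


# Three children, two parents: the children agree about parent 0 only.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.1], [-1.0, 0.0]],
    [[0.9, 0.0], [0.0, 1.0]],
]
v, c = dynamic_routing(u_hat)
print([math.hypot(*vec) for vec in v])  # parent 0 ends up with the longer vector
print(c)                                # every child favours parent 0
```

The length of each parent vector plays the role of an activation, so the parent whose incoming predictions agree ends up more "present" than the one receiving conflicting votes.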

The authors also combine these contributions with a margin loss and a reconstruction loss, which together help the network learn the task better, and they report state-of-the-art results on MNIST.
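For reference, the margin loss can be sketched as follows. It takes the lengths of the output capsule vectors (one per class) and penalises a short vector for the correct class and long vectors for the wrong ones. The defaults 0.9, 0.1 and 0.5 are the values given in the paper; the function name and the example lengths are my own, and the reconstruction loss is not shown.

```python
def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum over classes of the per-class margin loss.

    lengths: length of each class capsule's output vector, in [0, 1).
    target:  index of the correct class.
    """
    total = 0.0
    for k, length in enumerate(lengths):
        if k == target:
            # The correct class should have length at least m_plus.
            total += max(0.0, m_plus - length) ** 2
        else:
            # Other classes should stay below m_minus; lam down-weights them.
            total += lam * max(0.0, length - m_minus) ** 2
    return total


print(margin_loss([0.95, 0.05, 0.02], target=0))  # confident and correct: 0.0
print(margin_loss([0.20, 0.80, 0.05], target=0))  # wrong class is long: about 0.735
```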

The recent paper is named Dynamic Routing Between Capsules and is available on arXiv: https://arxiv.org/pdf/1710.09829.pdf

AIE

• answered 3 months ago

In the abstract of the paper Dynamic Routing Between Capsules (November 7, 2017), which formally introduced capsule neural networks, the authors write:

A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher level capsule becomes active. We show that a discriminatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits. To achieve these results we use an iterative routing-by-agreement mechanism: A lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule.

AIE

• answered 3 months ago

One of the major advantages of convolutional neural networks is their invariance to translation. However, this invariance comes at a price: the network does not consider how different features are related to each other. For example, given a picture of a face, a CNN will have difficulty capturing the relationship between the "mouth" feature and the "nose" feature. The max-pooling layers are the main reason for this effect: when we use max-pooling, we lose the precise locations of the mouth and nose, so we cannot say how they are related to each other.
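A tiny illustration of how max-pooling discards position, using a one-dimensional feature map for brevity. The "mouth detector" activations below are made-up numbers, and `max_pool_1d` is a hand-written helper rather than a library function.

```python
def max_pool_1d(row, size):
    """Non-overlapping max pooling over a 1-D feature map."""
    return [max(row[i:i + size]) for i in range(0, len(row), size)]


# The same feature firing at two different positions inside one pooling window.
mouth_left = [0.0, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
mouth_right = [0.0, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0]

print(max_pool_1d(mouth_left, 4))   # [0.9, 0.0]
print(max_pool_1d(mouth_right, 4))  # [0.9, 0.0]
```

After pooling, the two inputs are indistinguishable: later layers know that the feature is present somewhere in the window, but not where.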

Capsules try to keep the advantage of CNNs and fix this drawback in two ways:

Invariance: quoting from this paper

When the capsule is working properly, the probability of the visual entity being present is locally invariant – it does not change as the entity moves over the manifold of possible appearances within the limited domain covered by the capsule.

In other words, a capsule takes into account the existence of the specific feature that we are looking for, like the mouth or nose. This property makes sure that capsules are translation-invariant in the same way that CNNs are.

Equivariance: instead of making the feature representation translation-invariant, the capsule makes it translation-equivariant or viewpoint-equivariant. In other words, as the feature moves and changes its position in the image, the feature vector representation changes in the same way, which makes it equivariant. This property of capsules tries to solve the drawback of max-pooling layers that I mentioned at the beginning.
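Both properties can be seen in a toy, hand-coded "capsule" that outputs a presence score together with a position. This is only an analogy: a real capsule learns its pose parameters, whereas here the template, the signal and the helper names are all invented.

```python
def toy_capsule(signal, template):
    """Return (presence, position): the best template match and where it occurs."""
    width = len(template)
    scores = [
        sum(s * t for s, t in zip(signal[i:i + width], template))
        for i in range(len(signal) - width + 1)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


def shift_right(signal, k):
    """Translate the signal k steps to the right, padding with zeros."""
    return [0.0] * k + signal[:len(signal) - k]


template = [1.0, 2.0, 1.0]
signal = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]

print(toy_capsule(signal, template))                  # (6.0, 1)
print(toy_capsule(shift_right(signal, 3), template))  # (6.0, 4)
```

The presence score stays the same when the input moves (invariance), while the position output moves by exactly the same three steps as the input (equivariance).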
