Around the clock – capsule networks and image transformations

Capsule networks are a promising new model class for image processing tasks. We investigate and compare their classification performance with convolutional neural networks on a new entry-level image data set. In our experiments the capsules are able to learn the effect of rotations on input images from data and outperform a comparable convolutional architecture. The results also show that both convolutional and capsule networks need structural adjustment to respond to transformations that are not included in the training set.


Introduction
In the last few years, convolutional neural networks (CNN) have become the state-of-the-art for many image processing tasks. Recently, capsule networks [1] were introduced as a natural extension of CNNs. Capsules are supposed to have several advantages over classical networks. These include an improved adaptation to transformations in the input data and better interpretability. In this paper we examine and compare capsule networks and CNNs empirically on a new entry-level data set, called CloCk. The task is to determine the displayed time from synthetic images of an analog clock. On these images we investigate the influence of rotations on the model performance and the need for transformed data samples during training.

Capsule networks and image transformations
One of the main building blocks of artificial neural networks are the so-called neurons. They are inspired by their biological equivalents in the brain and consist of an affine transformation followed by a non-linear function.

Definition 2.1. Let x ∈ R^d be an input vector. A neuron n : R^d → R is defined as

    n(x) = ϕ(w^T x + b),

where w ∈ R^d is a weight vector, b ∈ R the bias and ϕ : R → R is a non-linear activation function.
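The definition above translates directly into a few lines of code. The following is a minimal sketch; the choice of tanh as the activation ϕ and the concrete values are illustrative, not taken from the paper:

```python
import numpy as np

def neuron(x, w, b, phi=np.tanh):
    """A single neuron (Definition 2.1): affine transformation
    w^T x + b followed by a non-linearity phi."""
    return phi(w @ x + b)

x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.1, -0.4])
y = neuron(x, w, b=0.2)  # phi(0.3 - 0.2 - 0.2 + 0.2) = tanh(0.1) ≈ 0.0997
```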
Capsules can be defined as a generalization of neurons.

Definition 2.2. Let {x_i ∈ R^d : i = 1, ..., I} be a set of input vectors. A capsule C : R^{d×I} → R^m applies affine transformations followed by a non-linear routing process ν:

    C(x_1, ..., x_I) = ν(W_1 x_1, ..., W_I x_I).

Here, W_i ∈ R^{m×d} are weight matrices and the transformed inputs u_i = W_i x_i are called votes. The routing process ν combines the votes to generate the capsule vector and can be understood as a clustering.
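As a concrete illustration of Definition 2.2, the sketch below implements a single capsule with a simplified, Dynamic-Routing-style [1] process as ν. Note the simplification: with only one output capsule, the softmax in [1] runs over output capsules and would be trivial, so this toy version reweights the input votes by their agreement instead; it illustrates the "routing as clustering" intuition rather than reproducing the exact algorithm:

```python
import numpy as np

def squash(s, eps=1e-9):
    # Non-linearity from [1]: preserves direction, maps length into [0, 1).
    n2 = np.sum(s ** 2)
    return (n2 / (1.0 + n2)) * s / (np.sqrt(n2) + eps)

def capsule(xs, Ws, iters=3):
    """One capsule C(x_1, ..., x_I) = nu(W_1 x_1, ..., W_I x_I),
    with a simplified iterative routing as nu (toy sketch)."""
    votes = np.array([W @ x for W, x in zip(Ws, xs)])  # u_i = W_i x_i
    b = np.zeros(len(votes))                           # routing logits
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum()                # coupling coefficients
        v = squash(c @ votes)                          # squashed weighted sum
        b = b + votes @ v                              # upweight agreeing votes
    return v

rng = np.random.default_rng(0)
xs = [rng.standard_normal(4) for _ in range(3)]        # I = 3 inputs, d = 4
Ws = [rng.standard_normal((2, 4)) for _ in range(3)]   # m = 2 output dims
v = capsule(xs, Ws)
```

The squash function guarantees that the capsule vector has length below one, which is what allows its norm to be read as a probability in [1].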
A layer of a neural network generally consists of multiple neurons or capsules. Definition 2.2 allows for an easy adaption of popular layers, like fully connected and convolution layers, to the capsule setting [1].
The exact choices of the routing process and the parameter spaces are free. This is beneficial for the representative power of the network. Nevertheless, it does not give any mathematical guarantees on how the network will react to transformations of the input. To get an equivariant network, i.e. one whose output changes in a predictable manner with the input, one has to limit these choices based on the transformation class. There exist several derivations of equivariance restrictions for CNNs and capsules [2]. However, the transformation class must be known in advance and follow a specific (group) structure.
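For reference, the standard notion of equivariance used in works such as [2] can be stated as follows; the representation symbols ρ_in and ρ_out are notational assumptions, not taken from this paper:

```latex
% A network f : X -> Y is equivariant with respect to a group G,
% acting on inputs via rho_in and on outputs via rho_out, if
f(\rho_{\mathrm{in}}(g)\,x) = \rho_{\mathrm{out}}(g)\,f(x)
\quad \text{for all } g \in G,\ x \in X.
% Invariance is the special case rho_out(g) = id.
```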

CloCk data set
The Classify on the Clock (CloCk) [3] data set is composed of 43 200 synthetic images of an analog clock. Each time step between 00:00:00 and 11:59:59 is represented as a single image. The data set contains versions with different levels of detail for each image. In this paper we use the "full" version, which includes markings and numbers (cf. Figure 1).
The movement of the hands is coupled by a linear relationship. By introducing an additional rotation ω, the forward system for determining the angles α of the hands becomes non-linear.
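The coupling of the hands can be sketched as follows. This is an illustrative formulation (angles measured in degrees, clockwise from the 12 o'clock position), not necessarily the exact one used to generate [3]; the wrap-around modulo 360° introduced by the extra rotation ω is what breaks the linearity of the forward system:

```python
def hand_angles(h, m, s, omega=0.0):
    """Angles of the hour, minute and second hand for time h:m:s,
    with an additional image rotation omega (degrees).
    Without the final mod 360 the map from (h, m, s) to angles is linear."""
    sec = s * 6.0                                # 360° / 60 s
    mnt = m * 6.0 + s * 0.1                      # minute hand advances with seconds
    hr = (h % 12) * 30.0 + m * 0.5 + s / 120.0   # hour hand advances with minutes/seconds
    return tuple((a + omega) % 360.0 for a in (hr, mnt, sec))
```

For example, at 3:00:00 the hands sit at (90°, 0°, 0°); with ω = 350° the hour hand wraps around to 80°.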

Setup and results
We create four different test scenarios from the CloCk test set to investigate the effect of transformed inputs on the model performance (cf. Table 1). For each image and scenario the rotation angle ω is uniformly drawn from the respective range. The ranges are discretized in steps of 1° to decrease the difficulty for the models. The task is to classify the displayed hours, minutes and seconds.

We compare two similar convolutional and capsule architectures with around 3.5 million parameters each. The capsule architecture uses Dynamic Routing [1] as its routing process. There are no restrictions on the weights of the models that would account for the rotated inputs. Therefore, the networks have to learn the effect of the transformation from data. For each architecture we train three different models to study the need for transformed samples in the training set. CONV and CAPS are trained on unrotated images. The suffixes "12" and "360" indicate rotation ranges [−12°, 12°] and [0°, 360°) of the training data.

The results in Table 1 indicate that both architectures benefit from examples of transformed data in the training set. They are unable to generalize the concept of rotations from smaller angle ranges in the training set to full rotations in the test case. This underlines the importance of equivariance restrictions if the transformation class is known (cf. Section 2).
The capsule architecture outperforms the CNN in all test scenarios. The model CAPS360 is able to learn the effect of the rotations and shows consistent performance across all test cases.

Conclusion
We compared the classification performance of two similar convolutional and capsule architectures on rotated images of a clock. Both architectures required examples of the transformation in the training set to work best in the respective setting. The capsule models outperformed the CNNs in all test cases. This indicates an improved ability to adapt to transformations. Together with the concept of equivariance, capsule networks are a promising model class for handling transformations that commonly arise in image processing tasks.