There is strong evidence that humans rely on coordinate frames, reference lines, and curves to determine the position of points in space. Widespread computer vision algorithms, by contrast, distinguish objects by numerical representations of their properties.
Researchers from Google, Alphabet subsidiary DeepMind, and Oxford University propose what they call a Stacked Capsule Autoencoder (SCAE), which reasons about objects based on the geometric relationships between their parts. Since these relationships do not depend on the position from which the model views the objects, the model classifies objects with high accuracy even when the viewing angle changes.
The SCAE and other capsule systems make sense of objects by geometrically interpreting organized sets of their parts. Sets of mathematical functions (capsules) responsible for analyzing various object properties (such as position, size, and hue) are attached to a type of AI model commonly used to analyze visual data, and several of the capsules' predictions are reused to form representations of parts. Because these representations are preserved throughout the SCAE's analysis, capsule systems can use them to identify objects even if the positions of parts are swapped or transformed.
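The viewpoint-independence of these part-object relationships can be illustrated with a small sketch. The poses and transforms below are made-up toy values, not anything from the SCAE paper; the point is only that the object-to-part relation a capsule would learn is unchanged by a viewpoint change, because the change transforms object and part poses alike.

```python
import numpy as np

def pose(theta, tx, ty):
    """A 2-D rigid transform (rotation + translation) as a 3x3 homogeneous matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0,  0,  1]])

# Hypothetical object pose and a fixed object-to-part relationship.
object_pose = pose(0.3, 2.0, 1.0)
object_to_part = pose(0.1, 0.5, -0.2)   # the geometric relation a capsule would learn

# The part's pose in the image is the composition of the two.
part_pose = object_pose @ object_to_part

# A viewpoint change transforms object and part poses alike ...
viewpoint = pose(1.2, -3.0, 4.0)
new_object_pose = viewpoint @ object_pose
new_part_pose = viewpoint @ part_pose

# ... so the recovered object-to-part relation is unchanged.
recovered = np.linalg.inv(new_object_pose) @ new_part_pose
print(np.allclose(recovered, object_to_part))  # True
```

This is why a model that stores object-to-part relations, rather than raw appearances, does not need to relearn an object for every viewing angle.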
Another unique feature of capsule systems is that they work with attention. As in all deep neural networks, the capsules' functions are arranged in interconnected layers that transmit "signals" from the input data and gradually adjust the synaptic strength (weight) of each connection; in this way they extract features and learn to make predictions. For capsules, however, the weights are computed dynamically, based on how well the functions in the previous layer predict the output of the next layer.
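A minimal sketch of this dynamic weighting, in the spirit of the routing-by-agreement scheme from earlier capsule work (the SCAE's actual attention mechanism differs in detail): lower-level capsules make predictions for a higher-level capsule, and the coupling weights are recomputed from how well each prediction agrees with the weighted consensus. All sizes and values here are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical setup: 4 lower-level capsules each predict a 3-D
# output vector for one higher-level capsule.
rng = np.random.default_rng(0)
predictions = rng.normal(size=(4, 3))

coupling = np.full(4, 1 / 4)            # start with uniform weights
for _ in range(3):                      # a few agreement iterations
    output = coupling @ predictions     # weighted consensus vector
    agreement = predictions @ output    # how well each prediction fits it
    coupling = softmax(agreement)       # recompute weights dynamically

print(coupling.round(3))  # capsules that agree with the consensus get larger weights
```

The key contrast with a standard layer is that `coupling` is not a learned constant: it is recomputed at inference time from the agreement between predictions.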
The SCAE consists of several stages. In the first, the Constellation Capsule Autoencoder (CCAE) processes the pixels of the images to be analyzed. The second stage, the Part Capsule Autoencoder (PCAE), segments an image into components and infers their poses before the image is reconstructed. Finally, the Object Capsule Autoencoder (OCAE) attempts to organize the discovered parts and their poses into a smaller set of objects, which it then attempts to reconstruct. The SCAE achieves industry-leading unsupervised image classification results on two open-source datasets: SVHN (images of small, cropped house-number digits) and MNIST (handwritten digits). After class labels were assigned to the resulting clusters, it achieved an accuracy of 55% on SVHN and 98.7% on MNIST, which were further improved to 67% and 99%, respectively.
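Evaluating an unsupervised model this way typically means giving each discovered cluster the label of the class most common among its members and then scoring the induced labelling. The helper below is a generic sketch of that evaluation step, not code from the paper; the toy clusters and labels are invented.

```python
import numpy as np

def cluster_accuracy(cluster_ids, true_labels):
    """Assign each cluster the majority class of its members, then
    score the induced labelling against the ground truth -- a common
    way to evaluate unsupervised classification."""
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    predicted = np.empty_like(true_labels)
    for c in np.unique(cluster_ids):
        members = cluster_ids == c
        values, counts = np.unique(true_labels[members], return_counts=True)
        predicted[members] = values[counts.argmax()]   # majority vote
    return (predicted == true_labels).mean()

# Toy example: two clean clusters and one mixed one.
clusters = [0, 0, 1, 1, 2, 2, 2]
labels   = [7, 7, 3, 3, 5, 5, 3]
print(cluster_accuracy(clusters, labels))  # 6 of 7 correct, about 0.857
```

Note that the labels are used only for scoring after training, so the classification itself remains unsupervised.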