Vision Transformers Have Taken the Field of Computer Vision by Storm, But What Do Vision Transformers Learn?

Vision transformers (ViTs) are a neural network architecture that has become hugely popular for vision tasks such as image classification, semantic segmentation, and object detection. The main difference from the original transformer is that the discrete text tokens are replaced with continuous pixel values extracted from image patches. A ViT extracts features from an image by attending to different regions of it and combining them to make a prediction. Despite this widespread use, little is known about the inductive biases of ViTs or the features they tend to learn. Feature visualizations and image reconstructions have helped explain the workings of convolutional neural networks (CNNs), but these methods have been less successful on ViTs, which are difficult to visualize.
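To make the patch idea concrete, here is a minimal NumPy sketch of how an image can be cut into non-overlapping patches and flattened into one vector per patch. The function name `patchify`, the 224x224 input, and the 16-pixel patch size are illustrative assumptions, not details taken from the paper. In an actual ViT, a learned linear projection and position embeddings are applied to these vectors before the first attention block.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

# Hypothetical input: a random 224x224 RGB image (assumed sizes).
image = np.random.default_rng(0).random((224, 224, 3))
tokens = patchify(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```

Each row of `tokens` holds the raw pixel values of one patch, which is the continuous counterpart of a text token.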

Recent work from researchers at the University of Maryland, College Park and New York University adds to the ViT literature with an in-depth study of how these models behave and process information internally. The authors established a visualization framework that synthesizes images that maximally activate neurons in a ViT. The method starts from random noise and takes gradient steps to maximize feature activations, applying regularization techniques such as penalizing total variation and using augmentation ensembling to improve the quality of the generated images.
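The general recipe (start from noise, follow the gradient of an activation, regularize for smoothness, and average over augmentations) can be illustrated with a toy example. This is only a sketch under strong assumptions and not the authors' implementation: a fixed linear filter `w` stands in for a ViT neuron, a squared-difference smoothness penalty stands in for total variation, random circular shifts stand in for the augmentations, and all hyperparameters are made up. Gradients are written out by hand so that the example needs only NumPy.

```python
import numpy as np

def smoothness_penalty_grad(x):
    """Gradient of the sum of squared differences between neighboring pixels."""
    dh = x[:, 1:] - x[:, :-1]   # horizontal neighbor differences
    dv = x[1:, :] - x[:-1, :]   # vertical neighbor differences
    grad = np.zeros_like(x)
    grad[:, 1:] += 2 * dh
    grad[:, :-1] -= 2 * dh
    grad[1:, :] += 2 * dv
    grad[:-1, :] -= 2 * dv
    return grad

def activation(w, x):
    """Toy 'neuron' (assumption): a fixed linear filter over the whole image."""
    return float((w * x).sum())

def synthesize(w, steps=200, lr=0.05, penalty_weight=0.1,
               n_aug=4, max_shift=2, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.random(w.shape)  # start from random noise
    for _ in range(steps):
        # Augmentation ensembling (toy version): average the activation
        # gradient over several randomly shifted copies of the image.
        act_grad = np.zeros_like(x)
        shifts = rng.integers(-max_shift, max_shift + 1, size=(n_aug, 2))
        for sy, sx in shifts:
            # Gradient of activation(w, np.roll(x, (sy, sx))) with respect to x.
            act_grad += np.roll(w, shift=(-sy, -sx), axis=(0, 1))
        act_grad /= n_aug
        # Ascend the activation while descending the smoothness penalty.
        x = x + lr * (act_grad - penalty_weight * smoothness_penalty_grad(x))
        x = np.clip(x, 0.0, 1.0)  # keep pixel values in a valid range
    return x

# Hypothetical filter: responds positively to a bright central square.
w = -np.ones((16, 16))
w[4:12, 4:12] = 1.0

noise = np.random.default_rng(0).random(w.shape)  # same seed as the start point
result = synthesize(w)
print(activation(w, noise), activation(w, result))
```

With a real ViT, the hand-written gradient would be replaced by automatic differentiation through the network, and the target would be the activation of a chosen feature in a chosen layer.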

The analysis found that patch tokens in ViTs preserve spatial information throughout all layers except the last attention block, which learns a token-mixing operation similar to the average pooling operation widely used in CNNs. The authors observed that the representations remain local, even for individual channels in deep layers of the network.
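The comparison with average pooling can be made concrete with an idealized limiting case: if every token attended to all tokens equally, the attention output for each token would be exactly the mean of the value vectors. The sketch below shows only this limiting case. The token count, feature dimension, and uniform scores are assumptions for illustration, and the paper describes the learned behavior as similar to average pooling, not identical to it.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(scores, values):
    """Mix value vectors using softmax-normalized attention scores."""
    return softmax(scores) @ values

rng = np.random.default_rng(0)
num_tokens, dim = 196, 8  # assumed sizes, for illustration only
values = rng.normal(size=(num_tokens, dim))

# Hypothetical limiting case: every token attends to all tokens equally.
uniform_scores = np.zeros((num_tokens, num_tokens))
mixed = attention(uniform_scores, values)

# Every output row equals the mean over tokens, i.e. global average pooling.
print(np.allclose(mixed, values.mean(axis=0)))  # True
```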

Consistent with this, the CLS token seems to play a relatively minor role throughout the network and is not used for globalization until the last layer. The authors tested this hypothesis by performing inference on images without the CLS token in layers 1-11 and then inserting a value for the CLS token at layer 12. The resulting ViT still correctly classified 78.61% of the ImageNet validation set, compared with 84.20% for the original model.

Like CNNs, ViTs exhibit a progressive specialization of features: early layers recognize basic image features such as color and edges, while deeper layers recognize more complex structures. However, the authors found an important difference in how ViTs and CNNs rely on background and foreground image features. The study observed that ViTs are significantly better than CNNs at using the background information in an image to identify the correct class, and that they also suffer less when the background is removed. Additionally, ViT predictions are more resilient to the removal of high-frequency texture information than those of ResNet models (results are shown in Table 2 of the paper).
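One common way to remove high-frequency information from an image is a low-pass filter in the Fourier domain. The sketch below is an illustration of that general technique, not the paper's exact procedure: the ideal circular mask, the `cutoff` value, and the synthetic test image are all assumptions.

```python
import numpy as np

def low_pass(image: np.ndarray, cutoff: float) -> np.ndarray:
    """Keep only spatial frequencies within `cutoff` cycles per image of zero."""
    h, w = image.shape
    fy = np.fft.fftfreq(h) * h  # vertical frequencies in cycles per image
    fx = np.fft.fftfreq(w) * w  # horizontal frequencies in cycles per image
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    mask = radius <= cutoff     # ideal (hard-edged) circular low-pass mask
    spectrum = np.fft.fft2(image)
    return np.real(np.fft.ifft2(spectrum * mask))

# Synthetic grayscale image: a smooth pattern plus a fine checkerboard texture.
n = 32
rows, cols = np.mgrid[0:n, 0:n]
smooth = np.cos(2 * np.pi * 2 * cols / n)  # 2 cycles across the image
texture = 0.5 * (-1.0) ** (rows + cols)    # highest representable frequency
image = smooth + texture

filtered = low_pass(image, cutoff=8)
print(np.allclose(filtered, smooth))  # True
```

Feeding images filtered this way to a classifier and comparing accuracy against the unfiltered images gives a rough measure of how much the model depends on fine texture.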

Finally, the study also briefly analyzes the representations learned by ViT models trained in the Contrastive Language-Image Pretraining (CLIP) framework, which connects images and text. Interestingly, the authors found that CLIP-trained ViTs, unlike ViTs trained as classifiers, produce features in deeper layers that are activated by objects in clearly discernible conceptual categories. This may seem surprising at first, but it is reasonable because text available on the internet provides targets for abstract and semantic concepts like "morbidity" (examples are shown in Figure 11 of the paper).


Lorenzo Brigato is a Postdoctoral Researcher at the ARTORG center, a research institution affiliated with the University of Bern, and is currently involved in the application of AI to health and nutrition. He holds a Ph.D. degree in Computer Science from the Sapienza University of Rome, Italy. His Ph.D. thesis focused on image classification problems with sample- and label-deficient data distributions.
