MultiMAE: Multi-modal Multi-task Masked Autoencoders

Swiss Federal Institute of Technology Lausanne (EPFL)
* Equal contribution
ECCV 2022

We introduce Multi-modal Multi-task Masked Autoencoders (MultiMAE), an efficient and effective pre-training strategy for Vision Transformers. Given a small random sample of visible patches from multiple modalities, the MultiMAE pre-training objective is to reconstruct the masked-out regions.

Abstract

We propose a pre-training strategy called Multi-modal Multi-task Masked Autoencoders (MultiMAE). It differs from standard Masked Autoencoding in two key aspects:

  1. It can optionally accept additional modalities of information in the input besides RGB images (hence "multi-modal").
  2. Its training objective accordingly includes predicting multiple outputs besides reconstructing the RGB image (hence "multi-task").

We make use of masking (across image patches and input modalities) to make training MultiMAE tractable as well as to ensure cross-modality predictive coding is indeed learned by the network. We show this pre-training strategy leads to a flexible, simple, and efficient framework with improved transfer results to downstream tasks. In particular, the exact same pre-trained network can be flexibly used when additional information besides RGB images is available or when no information other than RGB is available – in all configurations yielding results that are competitive with or significantly better than the baselines. To avoid needing training datasets with multiple modalities and annotated tasks, we train MultiMAE entirely using pseudo labeling, which makes the framework widely applicable to any RGB dataset.

The experiments are performed on multiple transfer tasks (image classification, semantic segmentation, depth estimation) and datasets (ImageNet, ADE20K, Taskonomy, Hypersim, NYUv2). The results show that the model has an intriguingly strong capability for cross-modal/task predictive coding and transfer. Please see our GitHub repository for code and pre-trained models.

MultiMAE pre-training

Standard Masked Autoencoding is an effective pre-training strategy, but so far has been limited to a single modality, namely RGB images. Practically speaking, often more than just a single modality is available (either from sensors or pseudo labeled), and pre-training strategies should be able to optionally make use of them. In addition, pre-training on more than just a single task is a powerful tool to steer what representations a model will learn and has been shown to improve performance on transfer tasks.

The MultiMAE pre-training objective is to reconstruct masked-out patches of multiple modalities. To do that, a small set of visible patches is sampled from all modalities and processed jointly using a Transformer encoder. Task-specific decoders reconstruct the masked-out patches by first performing a single cross-attention step from queries to all encoded tokens, followed by a shallow Transformer. The queries consist of mask tokens (in gray), with the task-specific encoded tokens added at their respective positions. By keeping the number of encoded patches constant, the bulk of the computational cost does not scale with the number of modalities, making MultiMAE a highly efficient pre-training method.
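As a rough illustration of how the decoder queries described above could be assembled, the following sketch places the encoded tokens at their visible positions and fills every other position with a shared mask token. It is a toy version under stated assumptions, not the released implementation: the function name `build_decoder_queries`, the list-of-floats token format, and the example sizes are all hypothetical, and the positional embeddings, the cross-attention step, and the shallow Transformer are omitted.

```python
def build_decoder_queries(num_patches, visible_positions, encoded_tokens, mask_token):
    """Return one query vector per patch position for a single task decoder.

    Positions that were visible to the encoder receive their encoded token;
    every other position receives a copy of the shared mask token.
    (Hypothetical helper for illustration only.)
    """
    if len(visible_positions) != len(encoded_tokens):
        raise ValueError("one encoded token is needed per visible position")
    # Start with a mask token everywhere.
    queries = [list(mask_token) for _ in range(num_patches)]
    # Overwrite the visible positions with the task's encoded tokens.
    for pos, token in zip(visible_positions, encoded_tokens):
        queries[pos] = list(token)
    return queries


# Toy example: 6 patch positions, 2 of them visible, 3-dimensional tokens.
mask_token = [0.0, 0.0, 0.0]
encoded = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
queries = build_decoder_queries(6, [1, 4], encoded, mask_token)
```

In this toy setup the decoder always sees one query per patch, while the encoder only ever processed the two visible tokens, which is where the efficiency argument above comes from.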

MultiMAE pre-training architecture.

MultiMAE pre-training consists of reconstructing the masked-out patches of different modalities. Only the small set of visible patches is processed by a shared Transformer encoder, after which they are decoded with task-specific heads.

We pre-train MultiMAE on two additional tasks besides RGB images, namely scene depth and semantic segmentation. Using powerful pre-trained networks, we pseudo label the entire ImageNet-1K dataset with these tasks. This has the advantage that no aligned multi-task dataset is required – all one needs is a large unlabeled RGB dataset and off-the-shelf pseudo labeling networks.

Masked multi-modal reconstructions

During pre-training, MultiMAE is always given the same number of visible input tokens, but they are sampled randomly across the different modalities using a symmetric Dirichlet distribution. In this interactive demo, explore how different random mask samples affect the MultiMAE predictions. Use the buttons to cycle through different images and to sample new random masks. Note that since we only compute the loss on the masked-out regions, we overlay the input patches on the predictions.
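The sampling scheme described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' code: the function names, the concentration parameter `alpha=1.0`, the budget of 98 visible tokens, and the 196 patches per modality are illustrative values chosen here. A symmetric Dirichlet sample is obtained by normalizing independent Gamma draws, and the fixed token budget is then split according to the sampled proportions.

```python
import random


def sample_tokens_per_modality(modalities, num_visible, alpha=1.0, rng=random):
    """Split a fixed budget of visible tokens across modalities.

    Proportions follow a symmetric Dirichlet(alpha) distribution, drawn by
    normalizing independent Gamma(alpha, 1) samples.
    """
    gammas = [rng.gammavariate(alpha, 1.0) for _ in modalities]
    total = sum(gammas)
    shares = [g / total * num_visible for g in gammas]
    counts = [int(s) for s in shares]  # round down first
    # Hand out the tokens lost to rounding, largest remainder first,
    # so that the counts always add up to num_visible.
    leftovers = num_visible - sum(counts)
    order = sorted(range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftovers]:
        counts[i] += 1
    return dict(zip(modalities, counts))


def sample_visible_positions(counts, patches_per_modality, rng=random):
    """Pick which patch positions stay visible in each modality.

    Counts are clipped to the number of patches; this sketch does not
    redistribute tokens that are lost to clipping.
    """
    return {
        name: sorted(rng.sample(range(patches_per_modality), min(k, patches_per_modality)))
        for name, k in counts.items()
    }


# Illustrative values (assumptions): 3 modalities, 196 patches each, 98 visible tokens.
rng = random.Random(0)
counts = sample_tokens_per_modality(["rgb", "depth", "semseg"], num_visible=98, rng=rng)
positions = sample_visible_positions(counts, patches_per_modality=196, rng=rng)
```

With a small `alpha`, most of the budget tends to land in a single modality; with a large `alpha`, the split approaches uniform. Either way the total number of visible tokens stays fixed.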


Interactive demo: masked inputs, MultiMAE predictions, and original reference for each of RGB, depth, and semantic segmentation.



Masking percentage

How does the percentage of masked patches affect MultiMAE predictions? Use the slider to choose how many patches are given as input to the MultiMAE model.

Interactive demo (the slider changes the number of masked tokens): masked inputs, MultiMAE predictions, and original reference for each of RGB, depth, and semantic segmentation.


MultiMAE downstream transfer

In the RGB-only transfer case (left part of figure), the encoder of a pre-trained MultiMAE can directly be used as a ViT with a newly initialized decoder head for the given downstream task. Since MultiMAE was pre-trained on more than just RGB, it can optionally accept additional modalities at transfer time (right part of figure).

MultiMAE fine-tuning modes.

MultiMAE can either be used directly as a ViT (left) for single-modal downstream transfer, or can optionally be fine-tuned using multiple available modalities (right).

RGB-only fine-tuning

MultiMAE is an effective pre-training strategy even when RGB is the only available downstream modality. It retains the benefits of MAE, an already powerful pre-training strategy, and can sometimes even outperform it. Notice how MAE and MultiMAE can both significantly surpass supervised ImageNet-1K pre-training.

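| Method | Arch. | ImageNet-1K classification (top-1 acc. ↑) | ADE20K semantic segmentation (mIoU ↑) | Hypersim semantic segmentation (mIoU ↑) | NYUv2 semantic segmentation (mIoU ↑) | NYUv2 depth (δ1 ↑) |
|---|---|---|---|---|---|---|
| Supervised (DeiT) | ViT-B | 81.8 | 45.8 | 33.9 | 50.1 | 80.7 |
| MAE | ViT-B | 83.3 | 46.2 | 36.5 | 50.8 | 85.1 |
| MultiMAE | ViT-B | 83.3 | 46.2 | 37.0 | 52.0 | 86.4 |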
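For readers unfamiliar with the two dense-prediction metrics reported above, here is a minimal sketch of how they are commonly computed. It follows the usual definitions (δ1 is the share of pixels whose ratio max(pred/gt, gt/pred) is below 1.25; mIoU averages the per-class intersection over union), but the function names, the flat-list inputs, and the handling of invalid pixels and absent classes are assumptions made here. The exact evaluation protocol behind the reported numbers may differ, for example in how statistics are accumulated across images.

```python
def delta1(pred_depths, gt_depths, threshold=1.25):
    """Percentage of valid pixels with max(pred/gt, gt/pred) below threshold."""
    hits = 0
    valid = 0
    for pred, gt in zip(pred_depths, gt_depths):
        if gt <= 0:
            continue  # assumption: non-positive ground truth marks an invalid pixel
        valid += 1
        if pred > 0 and max(pred / gt, gt / pred) < threshold:
            hits += 1
    return 100.0 * hits / valid if valid else float("nan")


def mean_iou(pred_labels, gt_labels, num_classes):
    """Mean intersection over union (in percent) over classes that occur."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, g in zip(pred_labels, gt_labels):
        if p == g:
            intersection[g] += 1
            union[g] += 1
        else:
            # A mismatched pixel enlarges the union of both classes involved.
            union[p] += 1
            union[g] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return 100.0 * sum(ious) / len(ious) if ious else float("nan")


# Toy inputs, flattened to one value per pixel.
pred_depth = [2.0, 1.0, 4.0, 3.0]
gt_depth = [2.0, 1.2, 2.0, 3.5]
print(delta1(pred_depth, gt_depth))  # 75.0: three of the four ratios are below 1.25

pred_seg = [0, 0, 1, 1]
gt_seg = [0, 1, 1, 1]
print(round(mean_iou(pred_seg, gt_seg, num_classes=2), 2))  # 58.33: mean of 1/2 and 2/3
```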

Using additional modalities

Making use of additionally available modalities during fine-tuning has the potential to significantly increase performance. On NYUv2 and Hypersim semantic segmentation, using either RGB-only or depth-only inputs works well, but when combining the two, we see large increases in performance – especially so when the ground-truth depth is of high quality, as in Hypersim.

Figure panels: NYUv2 semantic segmentation (mIoU ↑) and Hypersim semantic segmentation (mIoU ↑).

If ground-truth depth is not available at transfer time, MultiMAE can also make use of pseudo labeled depth (and even pseudo labeled semantic segmentation 😉). Here, pD = pseudo depth, and pS = pseudo semantic segmentation. While the improvement is not as large as with ground-truth depth, adding pseudo labeled depth and semantic segmentation still provides a slight performance increase.

Figure panels: ADE20K semantic segmentation (mIoU ↑), NYUv2 semantic segmentation (mIoU ↑), and Hypersim semantic segmentation (mIoU ↑).

Cross-modal predictive coding

MultiMAE demonstrates intriguing capabilities for cross-modal predictive coding since not only was it trained to predict each modality from any other modality, but also to combine information from multiple modalities. In the following interactive demonstrations, you can explore some of these capabilities.

Mask transitions

How do MultiMAE predictions change if you give it more RGB patches than depth patches, or even only semantic information? In this interactive demo, you can use the slider to change the proportion of visible tokens per modality. Notice how semantically stable the predictions stay as tokens are completely removed from certain modalities and added to others, suggesting that MultiMAE has learned a shared representation for these modalities.

Interactive demo (the slider changes the ratio of visible tokens per modality): masked inputs, MultiMAE predictions, and original reference for each of RGB, depth, and semantic segmentation.




Hue change

In this demo, we give as input the full depth map and two RGB patches, and we show the full reconstructed RGB image. No semantic maps are given as input. Using the slider, you can change the hue of a single RGB patch, while the other is locked. Notice how only the color of the selected objects changes, while the rest stays constant. MultiMAE effectively propagates color information from two RGB patches to only the relevant objects in the image, which it extracts from the depth map. In the images showing fruit and vegetables, color information is not only propagated to the same object, but to all objects that MultiMAE implicitly assigns to the same semantic class – even over larger distances.
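To make the input manipulation concrete, the following sketch shows one way the hue of a single RGB patch could be rotated before the patch is handed to the model. It only illustrates the input side of the experiment and is an assumption about how such a demo might be prepared: the function name `shift_patch_hue`, the nested-list image format, and the toy image are hypothetical, and no MultiMAE inference is performed.

```python
import colorsys


def shift_patch_hue(image, top, left, patch_size, hue_shift):
    """Return a copy of `image` with the hue of one square patch rotated.

    `image` is a list of rows, each row a list of (r, g, b) floats in [0, 1].
    `hue_shift` is a fraction of the full color wheel (0.5 means 180 degrees).
    """
    shifted = [list(row) for row in image]  # copy rows so the input stays untouched
    for y in range(top, top + patch_size):
        for x in range(left, left + patch_size):
            h, s, v = colorsys.rgb_to_hsv(*image[y][x])
            # Rotate the hue and wrap around the color wheel; keep saturation and value.
            shifted[y][x] = colorsys.hsv_to_rgb((h + hue_shift) % 1.0, s, v)
    return shifted


# Toy example: a 4x4 pure-red image; rotate the top-left 2x2 patch by half the wheel.
red = (1.0, 0.0, 0.0)
image = [[red] * 4 for _ in range(4)]
shifted = shift_patch_hue(image, top=0, left=0, patch_size=2, hue_shift=0.5)
# Pixels inside the patch turn cyan; pixels outside the patch stay red.
```

In the demo, only one of the two visible RGB patches is modified in this way, so any color change in the reconstruction can be attributed to that patch.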

Interactive demo (the slider changes the hue of a single RGB patch; the buttons switch between images): full depth input, masked RGB input, RGB reconstruction, and original RGB reference.



BibTeX

@article{bachmann2022multimae,
  author    = {Roman Bachmann and David Mizrahi and Andrei Atanov and Amir Zamir},
  title     = {{MultiMAE}: Multi-modal Multi-task Masked Autoencoders},
  journal   = {arXiv preprint arXiv:2204.01678},
  year      = {2022},
}