MIMIC: Masked Image Modeling with Image Correspondences

06/27/2023
by   Kalyani Marathe, et al.
0

Many pixelwise dense prediction tasks-depth estimation and semantic segmentation in computer vision today rely on pretrained image representations. Therefore, curating effective pretraining datasets is vital. Unfortunately, the effective pretraining datasets are those with multi-view scenes and have only been curated using annotated 3D meshes, point clouds, and camera parameters from simulated environments. We propose a dataset-curation mechanism that does not require any annotations. We mine two datasets: MIMIC-1M with 1.3M and MIMIC-3M with 3.1M multi-view image pairs from open-sourced video datasets and from synthetic 3D environments. We train multiple self-supervised models with different masked image modeling objectives to showcase the following findings: Representations trained on MIMIC-3M outperform those mined using annotations on multiple downstream tasks, including depth estimation, semantic segmentation, surface normals, and pose estimation. They also outperform representations that are frozen and when downstream training data is limited to few-shot. Larger dataset (MIMIC-3M) significantly improves performance, which is promising since our curation method can arbitrarily scale to produce even larger datasets. MIMIC code, dataset, and pretrained models are open-sourced at https://github.com/RAIVNLab/MIMIC.

READ FULL TEXT

page 14

page 15

page 16

page 18

page 19

page 20

page 21

page 22

research
08/11/2023

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models

Current deep networks are very data-hungry and benefit from training on ...
research
11/03/2022

Could Giant Pretrained Image Models Extract Universal Representations?

Frozen pretrained models have become a viable alternative to the pretrai...
research
06/29/2023

MIS-FM: 3D Medical Image Segmentation using Foundation Models Pretrained on a Large-Scale Unannotated Dataset

Pretraining with large-scale 3D volumes has a potential for improving th...
research
06/15/2023

COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

Due to the limited scale and quality of video-text training corpus, most...
research
02/04/2022

StandardSim: A Synthetic Dataset For Retail Environments

Autonomous checkout systems rely on visual and sensory inputs to carry o...
research
10/19/2022

A Unified View of Masked Image Modeling

Masked image modeling has demonstrated great potential to eliminate the ...
research
06/10/2021

3D Semantic Mapping from Arthroscopy using Out-of-distribution Pose and Depth and In-distribution Segmentation Training

Minimally invasive surgery (MIS) has many documented advantages, but the...

Please sign up or login with your details

Forgot password? Click here to reset