Curator: Creating Large-Scale Curated Labelled Datasets using Self-Supervised Learning

12/28/2022
by Tarun Narayanan, et al.

Applying machine learning to domains such as the Earth sciences is impeded by a lack of labeled data, even though these domains have a large corpus of raw data available. For instance, training a wildfire classifier on satellite imagery requires curating a massive and diverse dataset, an expensive and time-consuming process that can take weeks to months. Finding relevant examples in over 40 petabytes of unlabeled data requires researchers to hunt for such images manually, much like looking for a needle in a haystack. We present Curator, a no-code, end-to-end pipeline that dramatically reduces the time taken to curate an exhaustive labeled dataset. Curator searches massive amounts of unlabeled data by combining self-supervision, scalable nearest neighbor search, and active learning to learn and differentiate image representations. The pipeline can also be readily applied to problems in other domains. Overall, it makes it practical for researchers to go from a single reference image to a comprehensive dataset in a short span of time.
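The search and labeling steps named in the abstract can be illustrated with a minimal sketch. It is not the Curator implementation: the function names (`normalize`, `cosine_top_k`, `most_uncertain`) are hypothetical, random vectors stand in for embeddings from a self-supervised encoder, a brute-force cosine search stands in for a scalable nearest neighbor index, and uncertainty sampling is assumed as the active learning strategy.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_top_k(query, embeddings, k):
    """Return (index, similarity) pairs for the k embeddings closest to the query.

    Brute force, O(N) per query; a real system would use an approximate
    nearest neighbor index instead (assumption, not from the abstract).
    """
    q = normalize(query)
    scored = []
    for idx, emb in enumerate(embeddings):
        e = normalize(emb)
        scored.append((idx, sum(a * b for a, b in zip(q, e))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def most_uncertain(probabilities, n):
    """Uncertainty sampling: pick the n items whose predicted probability
    of being relevant is closest to 0.5, to send to a human for labeling."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:n]


# Toy data: random vectors stand in for self-supervised image embeddings.
rng = random.Random(0)
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(100)]
reference = embeddings[42]  # the single reference image

# Step 1: retrieve candidates similar to the reference image.
neighbors = cosine_top_k(reference, embeddings, k=5)

# Step 2: given hypothetical classifier scores for five candidates,
# choose the two the model is least sure about.
to_label = most_uncertain([0.9, 0.48, 0.1, 0.55, 0.7], n=2)
```

In this sketch the reference image is its own nearest neighbor, and the labeling step selects the candidates scored 0.48 and 0.55, since those are the ones a classifier would learn most from.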


research
01/20/2022

CELESTIAL: Classification Enabled via Labelless Embeddings with Self-supervised Telescope Image Analysis Learning

A common class of problems in remote sensing is scene classification, a ...
research
10/13/2022

Evaluating the Label Efficiency of Contrastive Self-Supervised Learning for Multi-Resolution Satellite Imagery

The application of deep neural networks to remote sensing imagery is oft...
research
12/23/2020

Self-supervised self-supervision by combining deep learning and probabilistic logic

Labeling training examples at scale is a perennial challenge in machine ...
research
01/04/2023

MoBYv2AL: Self-supervised Active Learning for Image Classification

Active learning(AL) has recently gained popularity for deep learning(DL)...
research
07/21/2022

A Wavelet Transform and self-supervised learning-based framework for bearing fault diagnosis with limited labeled data

Traditional supervised bearing fault diagnosis methods rely on massive l...
research
08/10/2021

Scalable Reverse Image Search Engine for NASAWorldview

Researchers often spend weeks sifting through decades of unlabeled satel...
research
07/07/2021

Scalable Data Balancing for Unlabeled Satellite Imagery

Data imbalance is a ubiquitous problem in machine learning. In large sca...
