Large-scale noisy web image-text datasets have been proven to be efficie...
Self-supervised learning on large-scale multi-modal datasets allows lear...
Recent models such as XLS-R and Whisper have made multilingual speech
te...
We present ISAAC (Input-baSed ApproximAte Curvature), a novel method tha...
Though research has shown the complementarity of camera- and inertial-ba...
Spatio-temporal grounding describes the task of localizing events in spa...
Most approaches for self-supervised learning (SSL) are optimised on cura...
Large scale Vision-Language (VL) models have shown tremendous success in...
Temporal action segmentation in untrimmed videos has gained increased
at...
Contrastive learning has become a prominent ingredient in learning
repre...
Although action recognition systems can achieve top performance when
eva...
Recently, research has increasingly focused on developing efficient neur...
Multilingual text-video retrieval methods have improved significantly in...
Vision-language models trained on large, randomly collected data had
sig...
Recently, a number of new Semi-Supervised Learning methods have emerged....
Transformers for visual-language representation learning have been getti...
The top-k classification accuracy is one of the core metrics in machine
...
Although action recognition has achieved impressive results over recent
...
Differentiable sorting algorithms allow training with sorting and rankin...
Multi-modal learning from video data has seen increased attention recent...
The ability to generalize learned representations across significantly
d...
The task of multimodal learning has seen a growing interest recently as ...
In this paper, we explore self-supervised audio-visual models that learn...
Reconstructing the 3D geometry of an object from an image is a major
cha...
The integration of algorithmic components into neural architectures has
...
Both generalized and incremental few-shot learning have to deal with thr...
The problem of grounding VQA tasks has seen an increased attention in th...
Sorting and ranking supervision is a method for training neural networks...
Action recognition and detection in the context of long untrimmed video
...
Multimodal self-supervised learning is getting more and more attention a...
Nowadays, there is an abundance of data involving images and surrounding...
Understanding the structure of complex activities in videos is one of th...
Current state-of-the-art models for video action recognition are mostly ...
Action recognition has become a rapidly developing research field within...
Action recognition is so far mainly focusing on the problem of classific...
The task of temporally detecting and segmenting actions in untrimmed vid...
Video learning is an important task in computer vision and has experienc...
Action recognition is a fundamental problem in computer vision with a lo...
Action detection and temporal segmentation of actions in videos are topi...
We present an approach for weakly supervised learning of human actions. ...
We present an approach for weakly supervised learning of human actions f...
We describe an end-to-end generative approach for the segmentation and
r...