Referring Video Object Segmentation (RVOS) requires segmenting the objec...
Unsupervised Video Object Segmentation (VOS) aims at identifying the con...
The sparsity of Deep Neural Networks is well investigated to maximize th...
Spiking Neural Networks (SNNs) provide an energy-efficient deep learning...
Large Language Models (LLMs) have shown the potential to revolutionize n...
The integration of self-attention mechanisms into Spiking Neural Network...
Editing real facial images is a crucial task in computer vision with sig...
Biologically inspired spiking neural networks (SNNs) have garnered consi...
This work studies how to transform an album to vivid and coherent storie...
Text-video retrieval is a challenging cross-modal task, which aims to al...
Large language models (LLMs) based on the generative pre-training transf...
Few-shot class-incremental learning (FSCIL) aims at learning to classify...
Contrastive learning-based video-language representation learning approa...
Interactive segmentation enables users to segment as needed by providing...
Existing text-video retrieval solutions are, in essence, discriminant mo...
Unified visual grounding pursues a simple and generic technical route to...
Robot grasping is subject to an inherent tradeoff: Grippers with a large...
Unsupervised domain adaption has been widely adopted in tasks with scarc...
Multimodal named entity recognition (MNER) and multimodal relation extra...
Weakly supervised semantic segmentation is typically inspired by class a...
Recently, the ability of self-supervised Vision Transformer (ViT) to rep...
We consider two biologically plausible structures, the Spiking Neural Ne...
While the Vision Transformer (VT) architecture is becoming trendy in com...
The Savage-Hutter (SH) equations are a hyperbolic system of nonlinear pa...
The transformer models have shown promising effectiveness in dealing wit...
Recently, MLP-like vision models have achieved promising performances on...
Recent Transformer-based methods have achieved advanced performance in p...
Recently, DETR pioneered the solution of vision tasks with transformers,...
Visual recognition has been dominated by convolutional neural networks (...
In this paper, we present Vision Permutator, a conceptually simple and d...
Continual learning tackles the setting of learning different tasks seque...
This paper provides a strong baseline for vision transformers on the Ima...
It is well-known that stochastic gradient noise (SGN) acts as implicit r...
Transformers, which are popular for language modeling, have been explore...
Deep artificial neural networks have been proposed as a model of primate...
Since the wide employment of deep learning frameworks in video salient o...
Video-based human pose estimation in crowded scenes is a challenging pro...
Detecting and recognizing human action in videos with crowded scenes is ...
This paper presents our solution to ACM MM challenge: Large-scale Human-...
Video summarization is an effective way to facilitate video searching an...
In recent years, the growing ubiquity of Internet memes on social media ...
Knowledge Distillation (KD) aims to distill the knowledge of a cumbersom...
Hashing has been widely used for efficient large-scale multimedia data r...
State-of-the-art CNN based recognition models are often computationally ...
In this paper, we present a novel unsupervised video summarization model...
To mitigate the detection performance drop caused by domain shift, we ai...
This paper presents a summary of the 2019 Unconstrained Ear Recognition ...
Detecting the relations among objects, such as "cat on sofa" and "person...