In this paper, we present VideoGen, a text-to-video generation approach,...
In this paper, we study Text-to-3D content generation leveraging 2D diff...
We analyze the DETR-based framework on semi-supervised object detection ...
One of the mainstream schemes for 2D human pose estimation (HPE) is lear...
Structured text extraction is one of the most valuable and challenging a...
The issue of detecting deepfakes has garnered significant attention in t...
Multi-modal 3D object detection has received growing attention as the in...
Despite recent advances in syncing lip movements with any audio waves, c...
Multi-object tracking (MOT) aims at estimating bounding boxes and identi...
With basic Semi-Supervised Object Detection (SSOD) techniques, one-stage...
In this paper, we address the problem of detecting 3D objects from multi...
Existing methods of multi-person video 3D human Pose and Shape Estimatio...
It is widely agreed that reference-based super-resolution (RefSR) achiev...
Neural Radiance Fields (NeRF) have constituted a remarkable breakthrough...
In this paper, we present StrucTexTv2, an effective document image pre-t...
Creating the photo-realistic version of people's sketched portraits is use...
In the field of skeleton-based action recognition, current top-performin...
In this paper, we propose a cross-modal distillation method named Stereo...
Previous studies have explored generating accurately lip-synced talking ...
Current domain adaptation methods for face anti-spoofing leverage labele...
Masked image modeling (MIM) learns visual representation by masking and ...
DETR is a novel end-to-end transformer architecture object detector, whi...
We present a strong object detector with encoder-decoder pretraining and...
High resolution and advanced semantic representation are both vital for ...
Recently, transformer-based networks have shown impressive results in se...
Current lane detection methods are struggling with the invisibility lane...
Vision Transformer and its variants have demonstrated great potential in...
The human brain can effortlessly recognize and localize objects, whereas...
In this paper, we study the problem of one-shot skeleton-based action re...
Video-text retrieval (VTR) is an attractive yet challenging task for mul...
Despite encouraging progress in deepfake detection, generalization to un...
This paper proposes a novel Unified Feature Optimization (UFO) paradigm ...
We propose a novel image retouching method by modeling the retouching pr...
3D object detection task from lidar or camera sensors is essential for a...
Recent advances in face forgery techniques produce nearly visually untra...
Recently, Neural Radiance Fields (NeRF) are revolutionizing the task of n...
Freezing the pre-trained backbone has become a standard paradigm to avoi...
In this paper, we present a model pretraining technique, named MaskOCR, ...
Human-Object Interaction Detection tackles the problem of joint localiza...
Bird's-eye-view (BEV) semantic segmentation is critical for autonomous dr...
Most existing unsupervised person re-identification (Re-ID) methods use ...
While local-window self-attention performs notably in vision tasks, it s...
Visual appearance is considered to be the most important cue to understa...
Concurrent perception datasets for autonomous driving are mainly limited...
A common challenge posed to robust semantic segmentation is the expensiv...
Advanced face swapping methods have achieved appealing results. However,...
Low-cost monocular 3D object detection plays a fundamental role in auton...
Monocular 3D object detection is a critical yet challenging task for aut...
To achieve disentangled image manipulation, previous works depend heavil...
Despite superior performance on many computer vision tasks, deep convolu...