Despite the rapid advancement of unsupervised learning in visual
represe...
Open-world instance-level scene understanding aims to locate and recogni...
Precise and controllable image editing is a challenging task that has
at...
Text-guided diffusion models (TDMs) are widely applied but can fail
unex...
Previous works have shown that increasing the window size for
Transforme...
Despite significant efforts, cutting-edge video segmentation methods sti...
Recent advances in generative adversarial networks (GANs) have demonstra...
Open-vocabulary scene understanding aims to localize and recognize unsee...
Modern deep networks can be better generalized when trained with noisy
s...
This technical report describes our 2nd-place solution for the ECCV 2022...
This paper presents Holistically-Attracted Wireframe Parsing (HAWP) for ...
Federated learning aims to train models collaboratively across different...
Scene text recognition has attracted increasing interest in recent years...
This report presents our winner solution to ECCV 2022 challenge on
Out-o...
Occlusion poses a great threat to monocular multi-person 3D human pose
e...
Most existing scene text detectors focus on detecting characters or word...
In recent years, video instance segmentation (VIS) has been largely adva...
There are two mainstreams for object detection: top-down and bottom-up. ...
Temporal action detection (TAD) is an important yet challenging task in ...
State-of-the-art document dewarping techniques learn to predict 3-dimens...
Large-scale pre-training has been proven to be crucial for various compu...
Recently, Vision-Language Pre-training (VLP) techniques have greatly
ben...
In this work, we present SeqFormer, a frustratingly simple model for vid...
Class Incremental Learning (CIL) aims at learning a multi-class classifi...
A typical pipeline for multi-object tracking (MOT) is to use a detector ...
Mixup-based augmentation has been found to be effective for generalizing...
Artificial intelligence (AI) provides a promising substitution for
strea...
Although deep learning methods have achieved advanced video object
recog...
Video instance segmentation aims to detect, segment, and track objects i...
Learning 3D representations by fusing point cloud and multi-view data ha...
This paper presents a neural network built upon Transformers, namely Pla...
Human vision is able to capture the part-whole hierarchical information ...
Temporal action detection (TAD) aims to determine the semantic label and...
Leveraging the advances of natural language processing, most recent scen...
Object detection, instance segmentation, and pose estimation are popular...
Person search aims to simultaneously localize and identify a query perso...
In this work we present SwiftNet for real-time semi-supervised video obj...
Can our video understanding systems perceive objects when a heavy occlus...
Current developments in temporal event or action localization usually ta...
Semantic segmentation (SS) is an important perception manner for self-dr...
In this paper, we focus on the semantic image synthesis task that aims a...
Semantic segmentation is important for many real-world systems, e.g.,
au...
We present a novel Bipartite Graph Reasoning GAN (BiGraphGAN) for the
ch...
The goal of object detection is to determine the class and location of
o...
While scene text recognition techniques have been widely used in commerc...
We propose a novel Generative Adversarial Network (XingGAN or CrossingGA...
Non-Local (NL) blocks have been widely studied in various vision tasks.
...
This paper presents a fast and parsimonious parsing method to accurately...
Crowd counting in images is a widely explored but challenging task. Thou...
This paper presents regional attraction of line segment maps, and hereby...