Current ASR systems are mainly trained and evaluated at the utterance le...
Advancements in the generation quality of various Generative Models (GMs...
Recent text-to-image diffusion models are able to generate convincing re...
Videos are created to express emotion, exchange information, and share e...
Image manipulation detection algorithms are often trained to discriminat...
State-of-the-art (SOTA) Generative Models (GMs) can synthesize photo-rea...
Recent advances in OCR have shown that an end-to-end (E2E) training pipe...
Entity synonym discovery is crucial for entity-leveraging applications....
We propose real-time, six degrees of freedom (6DoF), 3D face pose estima...
In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and...
It is highly desirable yet challenging to generate image captions that c...
In this paper, we propose an algorithm, named hashing-based non-maximum ...
Large-scale pre-training methods of learning cross-modal representations...
This paper studies face recognition (FR) and normalization in surveillan...
Gait, the walking pattern of individuals, is one of the most important b...
Real-world face recognition datasets exhibit long-tail characteristics, ...
Pedestrian detection is a critical problem in computer vision with signi...
The large pose discrepancy between two face images is one of the fundame...
This paper proposes a novel framework for fluorescence plant video proce...