In this paper, we propose a novel cross-modal distillation method, calle...
The rudimentary adversarial attacks utilize additive noise to attack fac...
Unrestricted adversarial attacks present a serious threat to deep learni...
Multi-view clustering (MVC) has gained broad attention owing to its capa...
Table Detection (TD) is a fundamental task towards visually rich documen...
The convergence of text, visual, and audio data is a key step towards
hu...
Table Detection has become a fundamental task for visually rich document...
Adversarial attacks can mislead deep neural networks (DNNs) by adding
im...
Naked-eye recognition of age is usually based on comparison with the age...
With the development of various applications, such as social networks an...
The global Information and Communications Technology (ICT) supply chain ...
Few-shot learning (FSL) has attracted considerable attention recently. A...
Due to the characteristics of Information and Communications Technology ...
Mining attacks aim to gain an unfair share of extra rewards in the block...
Large-scale multi-modal contrastive pre-training has demonstrated great
...
Vision transformer (ViT) has recently drawn great attention in computer
...
Lung nodule detection in chest X-ray (CXR) images is common to early
scr...
3D face recognition systems have been widely employed in intelligent
ter...
Human intelligence is multimodal; we integrate visual, linguistic, and
a...
Cross-modal encoders for vision-language (VL) tasks are often pretrained...
Vision Transformer (ViT) models have recently drawn much attention in
co...
In this work, we introduce Dual Attention Vision Transformers (DaViT), a...
Visual recognition is recently learned via either supervised learning on...
Tabular data in digital documents is widely used to express compact and
...
Contrastive language-image pretraining (CLIP) links vision and language
...
Unsignalized intersection driving is challenging for automated vehicles....
Automated visual understanding of our diverse and open world demands com...
Utilizing 3D point cloud data has become an urgent need for the deployme...
Recently, Vision Transformer and its variants have shown great promise o...
Meta-learning models can quickly adapt to new tasks using few-shot labele...
This paper investigates two techniques for developing efficient
self-sup...
The complex nature of combining localization and classification in objec...
Physical-layer key generation (PKG) establishes cryptographic keys from
...
We present an efficient high-resolution network, Lite-HRNet, for human p...
In this paper, we are interested in the bottom-up paradigm of estimating...
We present in this paper a new architecture, named Convolutional vision
...
This paper presents a new Vision Transformer (ViT) architecture Multi-Sc...
Local binary pattern (LBP) as a kind of local feature has shown its
simp...
Recently, many detection methods based on convolutional neural networks
...
The use of a few examples for each class to train a predictive model tha...
The typical bottom-up human pose estimation framework includes two stage...
Physical-layer key generation (PKG) in multi-user massive MIMO networks ...
In this paper, we are interested in bottom-up multi-person human pose
es...
High-resolution representations are essential for position-sensitive vis...
High-resolution representation learning plays an essential role in many
...
This is an official PyTorch implementation of Deep High-Resolution
Repre...
There has been significant progress on pose estimation and increasing
in...
State-of-the-art human pose estimation methods are dominated by complex
...
In this paper, we present a simple and modularized neural network
archit...