Knowledge Amalgamation for Object Detection with Transformers

by   Haofei Zhang, et al.

Knowledge amalgamation (KA) is a novel deep model reusing task aiming to transfer knowledge from several well-trained teachers to a multi-talented and compact student. Currently, most of these approaches are tailored for convolutional neural networks (CNNs). However, there is a tendency that transformers, with a completely different architecture, are starting to challenge the domination of CNNs in many computer vision tasks. Nevertheless, directly applying the previous KA methods to transformers leads to severe performance degradation. In this work, we explore a more effective KA scheme for transformer-based object detection models. Specifically, considering the architecture characteristics of transformers, we propose to dissolve the KA into two aspects: sequence-level amalgamation (SA) and task-level amalgamation (TA). In particular, a hint is generated within the sequence-level amalgamation by concatenating teacher sequences instead of redundantly aggregating them to a fixed-size one as previous KA works. Besides, the student learns heterogeneous detection tasks through soft targets with efficiency in the task-level amalgamation. Extensive experiments on PASCAL VOC and COCO have unfolded that the sequence-level amalgamation significantly boosts the performance of students, while the previous methods impair the students. Moreover, the transformer-based students excel in learning amalgamated knowledge, as they have mastered heterogeneous detection tasks rapidly and achieved superior or at least comparable performance to those of the teachers in their specializations.


page 2

page 4

page 10

page 11

page 14


Searching Intrinsic Dimensions of Vision Transformers

It has been shown by many researchers that transformers perform as well ...

SeqCo-DETR: Sequence Consistency Training for Self-Supervised Object Detection with Transformers

Self-supervised pre-training and transformer-based networks have signifi...

Wake Word Detection with Streaming Transformers

Modern wake word detection systems usually rely on neural networks for a...

A Comprehensive Study of Vision Transformers on Dense Prediction Tasks

Convolutional Neural Networks (CNNs), architectures consisting of convol...

Toward Transformer-Based Object Detection

Transformers have become the dominant model in natural language processi...

PanoSwin: a Pano-style Swin Transformer for Panorama Understanding

In panorama understanding, the widely used equirectangular projection (E...

Few-Shot Object Detection with Fully Cross-Transformer

Few-shot object detection (FSOD), with the aim to detect novel objects u...

Please sign up or login with your details

Forgot password? Click here to reset