Distilled Mid-Fusion Transformer Networks for Multi-Modal Human Activity Recognition

05/05/2023
by Jingcheng Li, et al.

Human Activity Recognition is an important task in many human-computer collaborative scenarios and has a wide range of practical applications. Although uni-modal approaches have been studied extensively, they are sensitive to data quality and require modality-specific feature engineering, making them insufficiently robust and effective for real-world deployment. By drawing on various sensors, multi-modal Human Activity Recognition can exploit complementary information to build models that generalize well. While deep learning methods have shown promising results, their potential for extracting salient multi-modal spatial-temporal features and fusing complementary information more effectively has not been fully explored. Reducing the complexity of multi-modal approaches for edge deployment is a further unresolved problem. To address these issues, we propose DMFT, a knowledge distillation-based multi-modal mid-fusion approach that performs informative feature extraction and fusion to solve the multi-modal Human Activity Recognition task efficiently. DMFT first encodes the multi-modal input data into a unified representation. The DMFT teacher model then applies an attentive multi-modal spatial-temporal transformer module to extract salient spatial-temporal features, and a temporal mid-fusion module further fuses the temporal features. Knowledge distillation is then used to transfer the learned representation from the teacher to a simpler DMFT student model, which consists of a lightweight version of the multi-modal spatial-temporal transformer module and produces the final results. DMFT was evaluated on two public multi-modal human activity recognition datasets against various state-of-the-art approaches. The experimental results demonstrate that the model achieves competitive performance in terms of effectiveness, scalability, and robustness.
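As a rough illustration of the approach the abstract describes, the sketch below combines a transformer-based temporal mid-fusion classifier with a standard soft-target knowledge-distillation loss. All names, layer sizes, the temperature T, and the weight alpha are illustrative assumptions, not the paper's actual DMFT configuration or hyperparameters.

```python
# Minimal sketch of the pipeline described above: per-modality encoding,
# transformer-based temporal mid-fusion, and teacher-to-student distillation.
# Hypothetical module names and sizes; DMFT itself is more elaborate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MidFusionNet(nn.Module):
    """Encode each modality separately, then fuse token sequences mid-network."""

    def __init__(self, in_dims, d_model=64, n_heads=4, n_layers=2, n_classes=12):
        super().__init__()
        # One linear projection per modality maps raw features to a shared space.
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in in_dims])
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Mid-fusion: self-attention runs over the concatenated token sequence,
        # so temporal features from different modalities attend to each other.
        self.fusion = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, xs):  # xs: list of (batch, time, features) tensors
        tokens = torch.cat([p(x) for p, x in zip(self.proj, xs)], dim=1)
        fused = self.fusion(tokens)
        return self.head(fused.mean(dim=1))  # pool over time, then classify

def distillation_loss(s_logits, t_logits, labels, T=2.0, alpha=0.5):
    """Soft-target KL loss (teacher -> student) blended with hard-label CE."""
    # Soften both distributions with temperature T; scale by T^2 so gradient
    # magnitudes stay comparable to the cross-entropy term.
    soft = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    return alpha * soft + (1 - alpha) * F.cross_entropy(s_logits, labels)

# Example: two modalities (e.g. accelerometer and gyroscope windows).
xs = [torch.randn(8, 50, 6), torch.randn(8, 50, 3)]
labels = torch.randint(0, 12, (8,))
teacher = MidFusionNet([6, 3], d_model=64, n_layers=2)              # larger teacher
student = MidFusionNet([6, 3], d_model=32, n_heads=4, n_layers=1)   # lite student
with torch.no_grad():
    t_logits = teacher(xs)
loss = distillation_loss(student(xs), t_logits, labels)
loss.backward()
```

In the paper's setting the teacher is the full attentive multi-modal spatial-temporal transformer with the temporal mid-fusion module and the student is its lite counterpart; here both are instances of the same toy module at different capacities, which is only meant to show how the distillation objective ties the two together.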
