Lifting Transformer for 3D Human Pose Estimation in Video

by Wenhao Li, et al.

Despite great progress in video-based 3D human pose estimation, it is still challenging to learn a discriminative single-pose representation from redundant sequences. To this end, we propose a novel Transformer-based architecture, called Lifting Transformer, which lifts a sequence of 2D joint locations to a 3D pose. Specifically, a vanilla Transformer encoder (VTE) is adopted to model long-range dependencies of 2D pose sequences. To reduce the redundancy of the sequence and aggregate information from local context, the fully-connected layers in the feed-forward network of VTE are replaced with strided convolutions that progressively reduce the sequence length. The modified VTE is termed the strided Transformer encoder (STE), and it is built upon the outputs of VTE. STE not only significantly reduces the computation cost but also effectively aggregates information into a single-vector representation in both a global and a local fashion. Moreover, a full-to-single supervision scheme is employed at both the full-sequence scale and the single-target-frame scale, applied to the outputs of VTE and STE, respectively. This scheme imposes extra temporal smoothness constraints in conjunction with the single-target-frame supervision. The proposed architecture is evaluated on two challenging benchmark datasets, namely Human3.6M and HumanEva-I, and achieves state-of-the-art results with far fewer parameters.
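To make the progressive length reduction concrete, the following minimal sketch tracks how a sequence shrinks as it passes through a stack of strided convolutions. It uses the standard output-length formula for a 1D convolution without dilation. The frame count, kernel sizes, strides, and the helper names `conv1d_out_len` and `strided_lengths` are illustrative assumptions, not values or code taken from the abstract.

```python
def conv1d_out_len(length, kernel_size, stride, padding=0):
    """Output length of a 1D convolution (standard formula, dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def strided_lengths(seq_len, layers):
    """Sequence length before and after each (kernel_size, stride) layer."""
    lengths = [seq_len]
    for kernel_size, stride in layers:
        lengths.append(conv1d_out_len(lengths[-1], kernel_size, stride))
    return lengths


# Hypothetical configuration (an assumption for illustration):
# 27 input frames, three encoder layers, each with kernel size 3 and stride 3.
lengths = strided_lengths(27, [(3, 3), (3, 3), (3, 3)])
print(lengths)  # [27, 9, 3, 1]
```

Under these assumed settings the sequence collapses to length 1 after three layers, which matches the idea of aggregating the whole sequence into a single-vector representation for the target frame.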


