VDTR: Video Deblurring with Transformer

by Mingdeng Cao et al.
Tsinghua University

Video deblurring remains an unsolved problem due to the challenging spatio-temporal modeling it requires. Existing convolutional neural network-based methods, however, show limited capacity for effective spatial and temporal modeling in video deblurring. This paper presents VDTR, an effective Transformer-based model that makes the first attempt to adapt the Transformer for video deblurring. VDTR exploits the superior long-range and relation modeling capabilities of the Transformer for both spatial and temporal modeling. Designing an appropriate Transformer-based model for video deblurring is nevertheless challenging, owing to complicated non-uniform blurs, misalignment across multiple frames, and the high computational cost of high-resolution spatial modeling. To address these problems, VDTR performs attention within non-overlapping windows and exploits a hierarchical structure to model long-range dependencies. For frame-level spatial modeling, we propose an encoder-decoder Transformer that utilizes multi-scale features for deblurring. For multi-frame temporal modeling, we adapt the Transformer to fuse multiple spatial features efficiently. Compared with CNN-based methods, the proposed method achieves highly competitive results on both synthetic and real-world video deblurring benchmarks, including DVD, GOPRO, REDS and BSD. We hope such a Transformer-based architecture can serve as a powerful alternative baseline for video deblurring and other video restoration tasks. The source code will be available at <https://github.com/ljzycmd/VDTR>.
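The abstract's key efficiency idea is restricting self-attention to non-overlapping local windows, so the cost scales with window area rather than full image resolution. Below is a minimal PyTorch sketch of that idea; all names, the window size (8), and the head count (4) are illustrative assumptions, not details taken from the VDTR paper.

```python
import torch
import torch.nn as nn

def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into non-overlapping ws x ws windows.
    Returns (num_windows * B, ws * ws, C). Assumes H and W are divisible by ws."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_merge(windows, ws, H, W):
    """Inverse of window_partition: (num_windows * B, ws * ws, C) -> (B, H, W, C)."""
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class WindowAttention(nn.Module):
    """Self-attention computed independently inside each local window,
    avoiding the quadratic cost of global attention over the full frame."""
    def __init__(self, dim, num_heads=4, ws=8):
        super().__init__()
        self.ws = ws
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                          # x: (B, H, W, C)
        B, H, W, C = x.shape
        win = window_partition(x, self.ws)         # (nW * B, ws * ws, C)
        out, _ = self.attn(win, win, win)          # attention within each window
        return window_merge(out, self.ws, H, W)    # back to (B, H, W, C)

# Toy shape check on a small feature map
feat = torch.randn(2, 16, 16, 32)
layer = WindowAttention(dim=32, num_heads=4, ws=8)
out = layer(feat)
```

A hierarchical model would stack such layers at multiple feature scales (downsampling between stages), which is how window-based Transformers typically recover long-range context despite the local attention.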




