Choose a Transformer: Fourier or Galerkin

05/31/2021
by Shuhao Cao, et al.

In this paper, we apply the self-attention from the state-of-the-art Transformer in Attention Is All You Need for the first time to a data-driven operator learning problem related to partial differential equations. We put together an effort to explain the heuristics of, and to improve the efficacy of, the self-attention mechanism: we demonstrate that the softmax normalization in the scaled dot-product attention is sufficient but not necessary, and prove the approximation capacity of a linear variant as a Petrov-Galerkin projection. A new layer normalization scheme is proposed to allow a scaling to propagate through attention layers, which helps the model achieve remarkable accuracy in operator learning tasks with unnormalized data. Finally, we present three operator learning experiments: the viscid Burgers' equation, an interface Darcy flow, and an inverse interface coefficient identification problem. All experiments validate the improvements of the newly proposed, simple attention-based operator learner over its softmax-normalized counterparts.
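
For intuition, the sketch below shows one way a softmax-free, linear-complexity attention of the kind the abstract describes can be written in PyTorch: layer normalization is applied to the keys and values instead of applying softmax to QK^T, and the K^T V contraction is formed first so the cost is linear in the sequence length. This is a minimal, single-head illustration, not the authors' reference implementation; the class name, dimensions, and scaling are assumptions made here for clarity.

```python
import torch
import torch.nn as nn

class GalerkinTypeAttention(nn.Module):
    """Minimal softmax-free attention sketch (hypothetical, single head).

    Keys and values are layer-normalized in place of the softmax, and the
    (d x d) contraction K^T V is computed first, so the cost is O(n d^2)
    in the number of grid points n rather than O(n^2 d).
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.to_q = nn.Linear(d_model, d_model)
        self.to_k = nn.Linear(d_model, d_model)
        self.to_v = nn.Linear(d_model, d_model)
        self.norm_k = nn.LayerNorm(d_model)  # normalization replaces softmax
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_model), n = number of sample points of the input function
        n = x.shape[1]
        q = self.to_q(x)
        k = self.norm_k(self.to_k(x))
        v = self.norm_v(self.to_v(x))
        # "basis interaction" matrix of shape (d, d), averaged over the n points
        kv = torch.einsum("bnd,bne->bde", k, v) / n
        # project the queries onto that learned basis interaction
        return torch.einsum("bnd,bde->bne", q, kv)

# usage: a batch of 4 functions sampled on 2048 grid points with 64 channels
attn = GalerkinTypeAttention(64)
out = attn(torch.randn(4, 2048, 64))
print(out.shape)  # torch.Size([4, 2048, 64])
```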


Related research

SOFT: Softmax-free Transformer with Linear Complexity (10/22/2021)
Vision transformers (ViTs) have pushed the state-of-the-art for various ...

Transformer for Partial Differential Equations' Operator Learning (05/26/2022)
Data-driven learning of partial differential equations' solution operato...

Scalable Transformer for PDE Surrogate Modeling (05/27/2023)
Transformer has shown state-of-the-art performance on various applicatio...

Sinkformers: Transformers with Doubly Stochastic Attention (10/22/2021)
Attention based models such as Transformers involve pairwise interaction...

Scaling Local Self-Attention For Parameter Efficient Visual Backbones (03/23/2021)
Self-attention has the promise of improving computer vision systems due ...

Lipschitz Normalization for Self-Attention Layers with Application to Graph Neural Networks (03/08/2021)
Attention based neural networks are state of the art in a large range of...

Scaling TransNormer to 175 Billion Parameters (07/27/2023)
We present TransNormerLLM, the first linear attention-based Large Langua...
