PASTS: Progress-Aware Spatio-Temporal Transformer Speaker For Vision-and-Language Navigation

05/19/2023
by   Liuyi Wang, et al.
0

Vision-and-language navigation (VLN) is a crucial but challenging cross-modal navigation task. One powerful technique to enhance the generalization performance in VLN is the use of an independent speaker model to provide pseudo instructions for data augmentation. However, current speaker models based on Long-Short Term Memory (LSTM) lack the ability to attend to features relevant at different locations and time steps. To address this, we propose a novel progress-aware spatio-temporal transformer speaker (PASTS) model that uses the transformer as the core of the network. PASTS uses a spatio-temporal encoder to fuse panoramic representations and encode intermediate connections through steps. Besides, to avoid the misalignment problem that could result in incorrect supervision, a speaker progress monitor (SPM) is proposed to enable the model to estimate the progress of instruction generation and facilitate more fine-grained caption results. Additionally, a multifeature dropout (MFD) strategy is introduced to alleviate overfitting. The proposed PASTS is flexible to be combined with existing VLN models. The experimental results demonstrate that PASTS outperforms all existing speaker models and successfully improves the performance of previous VLN models, achieving state-of-the-art performance on the standard Room-to-Room (R2R) dataset.

READ FULL TEXT

page 3

page 10

page 11

page 12

research
07/25/2023

Kefa: A Knowledge Enhanced and Fine-grained Aligned Speaker for Navigation Instruction Generation

We introduce a novel speaker model Kefa for navigation instruction gener...
research
06/09/2022

FOAM: A Follower-aware Speaker Model For Vision-and-Language Navigation

The speaker-follower models have proven to be effective in vision-and-la...
research
05/26/2023

Spatio-Temporal Transformer-Based Reinforcement Learning for Robot Crowd Navigation

The social robot navigation is an open and challenging problem. In exist...
research
09/01/2023

Learning-based NLOS Detection and Uncertainty Prediction of GNSS Observations with Transformer-Enhanced LSTM Network

The global navigation satellite systems (GNSS) play a vital role in tran...
research
08/26/2021

SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments

This paper presents a novel approach for the Vision-and-Language Navigat...
research
06/07/2018

Speaker-Follower Models for Vision-and-Language Navigation

Navigation guided by natural language instructions presents a challengin...
research
08/24/2022

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

In vision-and-language navigation (VLN), an embodied agent is required t...

Please sign up or login with your details

Forgot password? Click here to reset