FCM: Forgetful Causal Masking Makes Causal Language Models Better Zero-Shot Learners

10/24/2022
by Hao Liu, et al.

Large language models (LLMs) trained with the next-token-prediction objective, such as GPT-3 and PaLM, have revolutionized natural language processing in recent years by showing impressive zero-shot and few-shot capabilities across a wide range of tasks. In this work, we propose a simple technique that significantly boosts the performance of LLMs without adding computational cost. Our key observation is that performing next-token prediction with randomly selected past tokens masked out improves the quality of the learned representations for downstream language understanding tasks. We hypothesize that randomly masking past tokens prevents over-attending to recent tokens and encourages attention to tokens in the distant past. By randomly masking input tokens in PaLM, we show that we can significantly improve the zero-shot performance of the 1B and 8B PaLM models on the SuperGLUE benchmark, from 55.7 to 59.2 and from 61.6 to 64.0, respectively. Our largest 8B model matches the score of PaLM with an average score of 64, even though PaLM is trained on a much larger dataset (780B tokens) of high-quality conversation and webpage data, while ours is trained on the smaller C4 dataset (180B tokens). Experimental results show that our method also improves PaLM's zero- and few-shot performance on a diverse suite of tasks, including commonsense reasoning, natural language inference, and cloze completion. Moreover, we show that our technique also helps representation learning, significantly improving PaLM's finetuning results.
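The core mechanism described above, randomly dropping attention to past tokens during next-token prediction, can be sketched as a modification of the standard causal attention mask. Below is a minimal JAX sketch; the function name forgetful_causal_mask, the fixed mask_prob hyperparameter, the choice to share one random mask across all query positions, and the exemption of the diagonal are illustrative assumptions for this sketch, not the paper's exact implementation (the paper may, for example, sample the mask ratio rather than fix it).

import jax
import jax.numpy as jnp

def forgetful_causal_mask(rng, seq_len, mask_prob=0.1):
    """Causal attention mask with randomly 'forgotten' past tokens.

    Returns a [seq_len, seq_len] boolean mask (True = may attend).
    """
    # Standard causal mask: query position i may attend to keys j <= i.
    causal = jnp.tril(jnp.ones((seq_len, seq_len), dtype=bool))
    # Sample which key positions to forget; here one mask is shared
    # across all query positions (an assumption of this sketch).
    keep = jax.random.bernoulli(rng, 1.0 - mask_prob, (seq_len,))
    # Never mask a token's attention to itself (keep the diagonal).
    return causal & (keep[None, :] | jnp.eye(seq_len, dtype=bool))

# Usage: convert the boolean mask to an additive bias on attention logits.
mask = forgetful_causal_mask(jax.random.PRNGKey(0), seq_len=8, mask_prob=0.1)
bias = jnp.where(mask, 0.0, -1e9)  # added to QK^T scores before softmax

Because the random masking is applied only during training, inference uses the plain causal mask, which is consistent with the abstract's claim that the technique adds no computational cost.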
