Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance

by Yao Fu, et al.
University of Washington
Allen Institute for Artificial Intelligence

As large language models (LLMs) are continuously being developed, their evaluation becomes increasingly important yet challenging. This work proposes Chain-of-Thought Hub, an open-source evaluation suite for the multi-step reasoning capabilities of large language models. We are interested in this setting for two reasons: (1) from the behavior of the GPT and PaLM model families, we observe that complex reasoning is likely to be a key differentiator between weaker and stronger LLMs; (2) we envisage large language models becoming the next-generation computational platform and fostering an ecosystem of new LLM-based applications. This naturally requires the foundation models to perform complex tasks that often involve the composition of linguistic and logical operations. Our approach is to compile a suite of challenging reasoning benchmarks to track the progress of LLMs. Our current results show that: (1) model scale clearly correlates with reasoning capabilities; (2) as of May 2023, Claude-v1.3 and PaLM-2 are the only two models comparable with GPT-4, while open-source models still lag behind; (3) LLaMA-65B performs closely to code-davinci-002, indicating that with successful further development, such as reinforcement learning from human feedback (RLHF), it has great potential to come close to GPT-3.5-Turbo. Our results also suggest that for the open-source efforts to catch up, the community may focus more on building better base models and exploring RLHF.
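The abstract describes scoring models on multi-step reasoning benchmarks. A minimal sketch of how such benchmarks are typically scored is final-answer extraction: the model is allowed to produce a free-form chain of thought, and only the last numeric answer is compared against the reference (as in GSM8K-style evaluation). The function names and toy outputs below are illustrative assumptions, not the paper's actual harness:

```python
import re

def extract_final_answer(completion: str) -> str:
    """Pull the last number out of a chain-of-thought completion.

    GSM8K-style benchmarks score only the final answer, so the
    intermediate reasoning steps are ignored at evaluation time.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else ""

def accuracy(completions, references):
    """Fraction of completions whose final answer matches the reference."""
    correct = sum(
        extract_final_answer(c) == str(r)
        for c, r in zip(completions, references)
    )
    return correct / len(references)

# Toy example with hypothetical model outputs (not real benchmark data):
outs = [
    "There are 3 boxes with 4 apples each, so 3 * 4 = 12. The answer is 12.",
    "Half of 10 is 5, plus 2 gives 7. The answer is 7.",
]
print(accuracy(outs, [12, 8]))  # first matches, second does not -> 0.5
```

Comparing only the extracted final answer keeps the metric robust to how verbose or differently worded each model's reasoning trace is.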



