Mathematical Capabilities of ChatGPT

by Simon Frieder et al.

We investigate the mathematical capabilities of ChatGPT by testing it on publicly available datasets, as well as hand-crafted ones, and measuring its performance against other models trained on a mathematical corpus, such as Minerva. We also test whether ChatGPT can be a useful assistant to professional mathematicians by emulating various use cases that come up in the daily professional activities of mathematicians (question answering, theorem searching). In contrast to formal mathematics, where large databases of formal proofs are available (e.g., the Lean Mathematical Library), current datasets of natural-language mathematics, used to benchmark language models, only cover elementary mathematics. We address this issue by introducing a new dataset: GHOSTS. It is the first natural-language dataset made and curated by working researchers in mathematics that (1) aims to cover graduate-level mathematics and (2) provides a holistic overview of the mathematical capabilities of language models. We benchmark ChatGPT on GHOSTS and evaluate performance against fine-grained criteria. We make this new dataset publicly available to assist a community-driven comparison of ChatGPT with (future) large language models in terms of advanced mathematical comprehension. We conclude that contrary to many positive reports in the media (a potential case of selection bias), ChatGPT's mathematical abilities are significantly below those of an average mathematics graduate student. Our results show that ChatGPT often understands the question but fails to provide correct solutions. Hence, if your goal is to use it to pass a university exam, you would be better off copying from your average peer!

Measuring and Improving BERT's Mathematical Abilities by Predicting the Order of Reasoning

Imagine you are in a supermarket. You have two bananas in your basket an...

Evaluating Language Models for Mathematics through Interactions

The standard methodology of evaluating large language models (LLMs) base...

Towards a Mathematics Formalisation Assistant using Large Language Models

Mathematics formalisation is the task of writing mathematics (i.e., defi...

Extracting Mathematical Concepts with Large Language Models

We extract mathematical concepts from mathematical text using generative...

Math Agents: Computational Infrastructure, Mathematical Embedding, and Genomics

The advancement in generative AI could be boosted with more accessible m...

A Mathematical Abstraction for Balancing the Trade-off Between Creativity and Reality in Large Language Models

Large Language Models have become popular for their remarkable capabilit...

VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models

The VNHSGE (VietNamese High School Graduation Examination) dataset, deve...