ToolQA: A Dataset for LLM Question Answering with External Tools

by Yuchen Zhuang, et al.

Large Language Models (LLMs) have demonstrated impressive performance in various NLP tasks, but they still suffer from challenges such as hallucination and weak numerical reasoning. To overcome these challenges, external tools can be used to enhance LLMs' question-answering abilities. However, current evaluation methods do not distinguish between questions that can be answered using LLMs' internal knowledge and those that require external information through tool use. To address this issue, we introduce a new dataset called ToolQA, which is designed to faithfully evaluate LLMs' ability to use external tools for question answering. Our development of ToolQA involved a scalable, automated process for dataset curation, along with 13 specialized tools designed for interaction with external knowledge in order to answer questions. Importantly, we strive to minimize the overlap between our benchmark data and LLMs' pre-training data, enabling a more precise evaluation of LLMs' tool-use reasoning abilities. We conducted an in-depth diagnosis of existing tool-use LLMs to highlight their strengths, weaknesses, and potential improvements. Our findings set a new benchmark for evaluating LLMs and suggest new directions for future advancements. Our data and code are freely available to the broader scientific community on GitHub.
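To make the tool-use setting concrete, the sketch below shows a minimal question-answering loop of the kind ToolQA evaluates: the model proposes a sequence of tool calls, a dispatcher executes each named tool, and the observations are chained toward a final answer. The tool names (`calculator`, `table_lookup`), the action format, and the replayed action list are all illustrative assumptions, not ToolQA's actual interface or any of its 13 tools.

```python
# Hypothetical sketch of a tool-use QA loop: a dispatcher maps tool names
# to functions and replays a sequence of (tool, argument) actions, standing
# in for model-generated calls. All names here are illustrative.

def calculator(expr: str) -> str:
    """Toy numerical tool: evaluate a simple arithmetic expression."""
    # Restricted eval is enough for this sketch; a real tool would parse safely.
    return str(eval(expr, {"__builtins__": {}}))

def table_lookup(key: str) -> str:
    """Toy retrieval tool backed by a tiny in-memory table."""
    table = {"flight AA100 distance_km": "4139"}
    return table.get(key, "not found")

TOOLS = {"calculator": calculator, "table_lookup": table_lookup}

def run_episode(actions):
    """Execute each (tool_name, argument) action and collect observations."""
    observations = []
    for tool_name, arg in actions:
        observations.append(TOOLS[tool_name](arg))
    return observations

# Example: answer "how far is flight AA100 in miles?" by chaining a lookup
# (external knowledge) with a calculation (numerical reasoning).
obs = run_episode([
    ("table_lookup", "flight AA100 distance_km"),
    ("calculator", "4139 * 0.621371"),
])
```

The point of evaluating on such chains, rather than on closed-book answers, is that the final answer depends on information the model cannot have memorized, so success requires genuine tool-use reasoning.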


