Can ChatGPT and Bard Generate Aligned Assessment Items? A Reliability Analysis against Human Performance

04/09/2023
by Abdolvahab Khademi, et al.

ChatGPT and Bard are AI chatbots based on Large Language Models (LLMs) that promise applications in diverse areas. In education, these AI technologies have been tested for applications in assessment and teaching. In assessment, AI has long been used in automated essay scoring and automated item generation. One psychometric property that these tools must have to assist or replace humans in assessment is high reliability in terms of agreement between AI scores and human raters. In this paper, we measure the reliability of the OpenAI ChatGPT and Google Bard LLM tools against experienced and trained humans in perceiving and rating the complexity of writing prompts. Intraclass correlation (ICC) as a performance metric showed that the inter-rater reliability of both OpenAI ChatGPT and Google Bard was low against the gold standard of human ratings.
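For readers unfamiliar with the metric, the following is a minimal sketch of how an intraclass correlation can be computed from a table of ratings. The abstract does not say which ICC form the authors used, so this assumes the two-way random-effects, absolute-agreement, single-rater form, commonly written ICC(2,1). The function name `icc2_1` and the example ratings are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def icc2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` has one row per rated item (e.g. a writing prompt) and one
    column per rater (e.g. a human rater, an LLM). This form of the ICC is
    an assumption; the abstract does not specify the one used.
    """
    n = len(ratings)        # number of rated items
    k = len(ratings[0])     # number of raters
    if n < 2 or k < 2:
        raise ValueError("need at least 2 items and 2 raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-item mean square
    msc = ss_cols / (k - 1)              # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square

    # Raises ZeroDivisionError if every rating in the table is identical.
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


if __name__ == "__main__":
    # Made-up complexity ratings for five prompts: column 0 is a human
    # rater, column 1 is a chatbot. Purely illustrative values.
    example = [[3, 1], [4, 4], [2, 5], [5, 2], [1, 3]]
    print(f"ICC(2,1) = {icc2_1(example):.3f}")
```

Values near 1 indicate that raters assign nearly the same scores to each item, while values near 0 (or below) indicate poor agreement, which is the sense in which the abstract describes the chatbots' reliability as low.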


