Socratis: Are large multimodal models emotionally aware?

by Katherine Deng, et al.

Existing emotion prediction benchmarks use coarse emotion labels that do not capture the diversity of emotions an image and its text can elicit in humans, or the varied reasons behind them. Learning diverse reactions to multimodal content is important as intelligent machines take a central role in generating and delivering content to society. To address this gap, we propose Socratis, a societal reactions benchmark in which each image-caption (IC) pair is annotated with multiple emotions and the reasons for feeling them. Socratis contains 18K free-form reactions for 980 emotions on 2075 IC pairs from 5 widely-read news and image-caption datasets. We benchmark the capability of state-of-the-art multimodal large language models to generate the reasons for feeling an emotion given an IC pair. Based on a preliminary human study, we observe that humans prefer human-written reasons more than twice as often as machine-generated ones. This shows that our task is harder than standard generation tasks, in stark contrast to recent findings that, for instance, humans cannot tell machine-written news articles from human-written ones. We further find that current captioning metrics based on large vision-language models also fail to correlate with human preferences. We hope that these findings and our benchmark will inspire further research on training emotionally aware models.




