Diffusion Based Augmentation for Captioning and Retrieval in Cultural Heritage

by Dario Cioni et al.

Cultural heritage applications and advanced machine learning models are creating a fruitful synergy, providing effective and accessible ways of interacting with artworks. Smart audio-guides, personalized art-related content, and gamification approaches are just a few examples of how technology can add value to artworks or exhibitions. Nonetheless, from a machine learning point of view, the amount of available artistic data is often not enough to train effective models. Off-the-shelf computer vision modules can still be exploited to some extent, yet a severe domain shift exists between art images and the standard natural-image datasets used to train such models, which can lead to degraded performance. This paper introduces a novel approach to address the challenges of limited annotated data and domain shift in the cultural heritage domain. By leveraging generative vision-language models, we augment art datasets with diverse variations of artworks generated conditioned on their captions. This augmentation strategy enhances dataset diversity, bridges the gap between natural images and artworks, and improves the alignment of visual cues with knowledge from general-purpose datasets. The generated variations help train vision-and-language models that acquire a deeper understanding of artistic characteristics and produce better captions with appropriate jargon.
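The paper's augmentation pipeline is not reproduced here, but its core loop can be sketched as follows. This is a minimal, hedged illustration: `generate_variation` stands in for a caption-conditioned image-to-image diffusion model (e.g. a Stable Diffusion img2img pipeline), and all names (`ArtworkSample`, `augment_dataset`, `variations_per_image`) are illustrative, not from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ArtworkSample:
    image: object  # in practice a PIL.Image; kept generic for the sketch
    caption: str


def augment_dataset(
    samples: List[ArtworkSample],
    generate_variation: Callable[[object, str], object],
    variations_per_image: int = 2,
) -> List[ArtworkSample]:
    """Expand an artwork dataset with diffusion-generated variations.

    Each artwork is passed, together with its caption, to an
    image-to-image generator. The synthetic images inherit the original
    caption, so the downstream captioning/retrieval model sees more
    visual diversity per textual description.
    """
    augmented = list(samples)  # keep the original artworks
    for sample in samples:
        for _ in range(variations_per_image):
            variation = generate_variation(sample.image, sample.caption)
            augmented.append(ArtworkSample(image=variation, caption=sample.caption))
    return augmented
```

In a real setting, `generate_variation` would wrap a pretrained diffusion model conditioned on the caption text; the key design choice is that captions are reused verbatim, so augmentation only adds visual, not textual, variety.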


