Cats and Captions vs. Creators and the Clock: Comparing Multimodal Content to Context in Predicting Relative Popularity

by Jack Hessel, et al.

The content of today's social media is becoming increasingly rich, mixing text, images, videos, and audio. An intriguing research question is how these different modes interact to attract user attention and engagement. But to pursue this study of multimodal content, we must also account for context: timing effects, community preferences, and social factors (e.g., which authors are already popular) also affect the amount of feedback and reaction that social-media posts receive. In this work, we separate out the influence of these non-content factors in several ways. First, we focus on ranking pairs of submissions posted to the same community in quick succession (e.g., within 30 seconds); this framing encourages models to focus on time-agnostic and community-specific content features. Within that setting, we determine the relative performance of author vs. content features. We find that victory usually belongs to "cats and captions," as visual and textual features together tend to outperform identity-based features. Moreover, our experiments show that, when considered in isolation, simple unigram text features and deep neural network visual features yield the highest individual accuracies, and that combining the two modalities generally yields the best accuracy overall.
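The pairwise framing in the abstract can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation. Only the same-community, 30-second pairing and the use of unigram text features come from the description above. Everything else is hypothetical: the `Post` record and its fields, the helper names (`make_pairs`, `build_vocab`, `train_pairwise`, `predict_first_wins`), the use of a `score` field as the popularity signal, dropping tied pairs, and the choice of a bias-free logistic regression on feature differences.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Post:
    # Hypothetical record layout; field names are illustrative assumptions.
    post_id: str
    community: str
    timestamp: float  # seconds since some fixed epoch
    title: str
    score: int  # popularity signal (assumed, e.g. net upvotes), used only for labels


def make_pairs(posts, window=30.0):
    """Pair posts from the same community posted at most `window` seconds apart.

    Each pair is (earlier, later, label), where label is 1 if the earlier
    post scored higher. Tied pairs are dropped (an assumption).
    """
    by_community = defaultdict(list)
    for p in posts:
        by_community[p.community].append(p)
    pairs = []
    for group in by_community.values():
        group.sort(key=lambda p: p.timestamp)
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if b.timestamp - a.timestamp > window:
                    break  # group is sorted, so later posts are even further away
                if a.score != b.score:
                    pairs.append((a, b, 1 if a.score > b.score else 0))
    return pairs


def unigrams(text):
    # Naive lowercase whitespace tokenization (hypothetical preprocessing).
    return text.lower().split()


def build_vocab(posts):
    vocab = {}
    for p in posts:
        for tok in unigrams(p.title):
            vocab.setdefault(tok, len(vocab))
    return vocab


def featurize(post, vocab):
    # Binary unigram indicators over the title; unknown tokens are ignored.
    vec = [0.0] * len(vocab)
    for tok in unigrams(post.title):
        if tok in vocab:
            vec[vocab[tok]] = 1.0
    return vec


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _diff(a, b, vocab):
    return [xa - xb for xa, xb in zip(featurize(a, vocab), featurize(b, vocab))]


def train_pairwise(pairs, vocab, epochs=200, lr=0.5):
    """Logistic regression on feature differences x_a - x_b (no bias term)."""
    data = [(_diff(a, b, vocab), y) for a, b, y in pairs]
    w = [0.0] * len(vocab)
    for _ in range(epochs):
        for d, y in data:
            p = _sigmoid(sum(wi * di for wi, di in zip(w, d)))
            g = p - y  # gradient of the log loss with respect to the score
            w = [wi - lr * g * di for wi, di in zip(w, d)]
    return w


def predict_first_wins(a, b, w, vocab):
    # True if post `a` is predicted to receive more feedback than post `b`.
    return sum(wi * di for wi, di in zip(w, _diff(a, b, vocab))) > 0
```

Because the model scores the difference of the two feature vectors and has no bias term, swapping the two posts flips the sign of the score, so the predicted winner does not depend on which post is listed first (barring an exact tie). Under the same assumptions, deep visual features (for example, a precomputed image embedding) could be concatenated onto the unigram vector to approximate combining the two modalities.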


A Multimodal Approach to Predict Social Media Popularity

Multiple modalities represent different aspects by which information is ...

TIB-VA at SemEval-2022 Task 5: A Multimodal Architecture for the Detection and Classification of Misogynous Memes

The detection of offensive, hateful content on social media is a challen...

How Community Feedback Shapes User Behavior

Social media systems rely on user feedback and rating mechanisms for per...

Protecting Anonymous Speech: A Generative Adversarial Network Methodology for Removing Stylistic Indicators in Text

With Internet users constantly leaving a trail of text, whether through ...

Leveraging Community and Author Context to Explain the Performance and Bias of Text-Based Deception Detection Models

Deceptive news posts shared in online communities can be detected with N...

VICSOM: VIsual Clues from SOcial Media for psychological assessment

Sharing multimodal information (typically images, videos or text) in Soc...

C-CLIP: Contrastive Image-Text Encoders to Close the Descriptive-Commentative Gap

The interplay between the image and comment on a social media post is on...
