Enhancing Subtask Performance of Multi-modal Large Language Model

by   Yongqiang Zhao, et al.

Multi-modal Large Language Model (MLLM) refers to a model expanded from a Large Language Model (LLM) that possesses the capability to handle and infer multi-modal data. Current MLLMs typically begin by using LLMs to decompose tasks into multiple subtasks, then employing individual pre-trained models to complete specific subtasks, and ultimately utilizing LLMs to integrate the results of each subtasks to obtain the results of the task. In real-world scenarios, when dealing with large projects, it is common practice to break down the project into smaller sub-projects, with different teams providing corresponding solutions or results. The project owner then decides which solution or result to use, ensuring the best possible outcome for each subtask and, consequently, for the entire project. Inspired by this, this study considers selecting multiple pre-trained models to complete the same subtask. By combining the results from multiple pre-trained models, the optimal subtask result is obtained, enhancing the performance of the MLLM. Specifically, this study first selects multiple pre-trained models focused on the same subtask based on distinct evaluation approaches, and then invokes these models in parallel to process input data and generate corresponding subtask results. Finally, the results from multiple pre-trained models for the same subtask are compared using the LLM, and the best result is chosen as the outcome for that subtask. Extensive experiments are conducted in this study using GPT-4 annotated datasets and human-annotated datasets. The results of various evaluation metrics adequately demonstrate the effectiveness of the proposed approach in this paper.


page 1

page 3


Revisiting Pre-training in Audio-Visual Learning

Pre-training technique has gained tremendous success in enhancing model ...

Team Yao at Factify 2022: Utilizing Pre-trained Models and Co-attention Networks for Multi-Modal Fact Verification

In recent years, social media has enabled users to get exposed to a myri...

Grafting Pre-trained Models for Multimodal Headline Generation

Multimodal headline utilizes both video frames and transcripts to genera...

MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

The recent popularity of text-to-image diffusion models (DM) can largely...

Multi-Dataset Co-Training with Sharpness-Aware Optimization for Audio Anti-spoofing

Audio anti-spoofing for automatic speaker verification aims to safeguard...

What are the best systems? New perspectives on NLP Benchmarking

In Machine Learning, a benchmark refers to an ensemble of datasets assoc...

What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

Recent advancements in Large Language Models (LLMs) such as GPT4 have di...

Please sign up or login with your details

Forgot password? Click here to reset