HiLM-D: Towards High-Resolution Understanding in Multimodal Large Language Models for Autonomous Driving

by Xinpeng Ding et al.

Autonomous driving systems generally employ separate models for different tasks, resulting in intricate designs. For the first time, we leverage a single multimodal large language model (MLLM) to consolidate multiple autonomous driving tasks from videos, i.e., the Risk Object Localization and Intention and Suggestion Prediction (ROLISP) task. ROLISP uses natural language to simultaneously identify and interpret risk objects, understand ego-vehicle intentions, and provide motion suggestions, eliminating the need for task-specific architectures. However, lacking high-resolution (HR) information, existing MLLMs often miss small objects (e.g., traffic cones) and focus overly on salient ones (e.g., large trucks) when applied to ROLISP. We propose HiLM-D (Towards High-Resolution Understanding in MLLMs for Autonomous Driving), an efficient method to incorporate HR information into MLLMs for the ROLISP task. Specifically, HiLM-D integrates two branches: (i) the low-resolution reasoning branch, which can be any MLLM, processes low-resolution videos to caption risk objects and discern ego-vehicle intentions/suggestions; (ii) the high-resolution perception branch (HR-PB), the core of HiLM-D, ingests HR images to enhance detection by capturing vision-specific HR feature maps and by prioritizing all potential risks over merely salient objects. Our HR-PB serves as a plug-and-play module, fitting seamlessly into current MLLMs. Experiments on the ROLISP benchmark reveal HiLM-D's notable advantage over leading MLLMs, with improvements of 4.8% in captioning and in detection.
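The two-branch design described above can be sketched as follows. This is a minimal illustrative Python sketch, not the authors' implementation: all class names, method signatures, and the simple feature-fusion step are hypothetical stand-ins for the real MLLM and perception modules.

```python
# Hypothetical sketch of HiLM-D's two-branch design (illustrative only;
# not from the paper's released code).

class LowResReasoningBranch:
    """Stand-in for any MLLM: consumes low-resolution video frames and
    produces a natural-language ROLISP answer (risk-object caption,
    ego-vehicle intention, motion suggestion)."""
    def forward(self, lr_video_frames, hr_features=None):
        # A real MLLM would fuse hr_features into its visual tokens;
        # here we only record whether HR guidance was supplied.
        guided = hr_features is not None
        return {"caption": "risk object ahead, slow down", "hr_guided": guided}

class HighResPerceptionBranch:
    """Plug-and-play HR-PB stand-in: ingests one high-resolution frame and
    returns a vision-specific feature map plus candidate risk regions."""
    def forward(self, hr_image):
        # Placeholder computation; a real branch would run a visual encoder.
        feature_map = [[sum(row) / len(row)] for row in hr_image]
        boxes = [(0, 0, 4, 4)]  # dummy candidate risk boxes (x1, y1, x2, y2)
        return feature_map, boxes

def hilm_d_forward(lr_video_frames, hr_image):
    """Run both branches: HR perception guides low-resolution reasoning."""
    hr_features, boxes = HighResPerceptionBranch().forward(hr_image)
    answer = LowResReasoningBranch().forward(lr_video_frames,
                                             hr_features=hr_features)
    return answer, boxes

# Usage with dummy inputs:
lr_frames = [[[0.1] * 8 for _ in range(8)] for _ in range(4)]  # 4 LR frames
hr_frame = [[0.5] * 32 for _ in range(32)]                     # one HR frame
answer, boxes = hilm_d_forward(lr_frames, hr_frame)
print(answer["hr_guided"])  # → True
```

The point of the sketch is the plug-and-play structure: the reasoning branch is unchanged except for an optional `hr_features` input, so any existing MLLM could slot into that role.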

