Avoiding the Crash: A Vision-Language Model Evaluation of Critical Traffic Scenarios

Published as an SAE Technical Paper, 2025

Autonomous Vehicles (AVs) aim to transform transportation by reducing human error and improving traffic efficiency. These systems typically rely on deep neural networks (DNNs) for critical perception tasks such as image classification and object detection. However, DNN performance can degrade over time without retraining, potentially leading to dangerous misinterpretations of road scenes.

In this work, we evaluate two Vision-Language Models (VLMs), LLaVA-7B and MoE-LLaVA, for their ability to reason about and interpret real-world AV crash footage. Because these models can align visual cues with textual reasoning, they offer a semantically richer understanding of a scene than traditional task-specific DNNs.

We created a dataset of real-world crash videos, decomposed each video into a sequence of frames, and tested the VLMs' ability to detect anomalies, reason about causality, and relate scene outcomes to road regulations. Results show that the VLMs generalize more robustly than conventional DNN pipelines to high-risk, unseen scenarios and provide interpretable explanations for their decisions.
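For readers who want to reproduce a frame-level query loop like the one described above, the sketch below shows one minimal way to wire it up. It assumes the public `llava-hf/llava-1.5-7b-hf` checkpoint from Hugging Face as a stand-in for LLaVA-7B, OpenCV for frame extraction, and an illustrative prompt and sampling rate; the exact prompts, models, and video files used in the paper may differ.

```python
# Minimal sketch: sample frames from a crash clip and query a LLaVA-style VLM
# about hazards. The checkpoint, prompt wording, sampling rate, and file name
# are illustrative assumptions, not the paper's exact experimental setup.
import cv2
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # assumed public LLaVA-7B checkpoint
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def sample_frames(video_path: str, every_n: int = 30) -> list[Image.Image]:
    """Decompose a video into a sparse sequence of RGB frames."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    ok, frame = cap.read()
    while ok:
        if idx % every_n == 0:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        idx += 1
        ok, frame = cap.read()
    cap.release()
    return frames

# Illustrative prompt probing anomaly detection, causality, and rule violations.
PROMPT = (
    "USER: <image>\nDescribe any anomaly or imminent collision risk in this "
    "driving scene, and state which road rule, if any, is being violated. ASSISTANT:"
)

for i, frame in enumerate(sample_frames("crash_clip.mp4")):  # hypothetical file
    inputs = processor(images=frame, text=PROMPT, return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)
    out = model.generate(**inputs, max_new_tokens=128)
    print(f"frame {i}: {processor.decode(out[0], skip_special_tokens=True)}")
```

The loop above scores each frame independently; reasoning over full frame sequences, as done in the paper, would feed multiple frames (or their per-frame descriptions) into a single prompt instead.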

👉 Read the full paper (PDF)