Exploring the capabilities of LLMs in imputing missing industry-grade accident data for scenario generation

Abstract

Autonomous driving (AD) and advanced driver-assistance systems (ADAS) require extensive testing to ensure safety and reliability. Industrial-grade, structured accident databases are well-suited sources for generating realistic test scenarios. However, these databases often contain missing or incomplete values, such as GPS coordinates or vehicle speed, which limits their utility in scenario generation. Large language models (LLMs) offer a promising solution to this problem due to their ability to process natural language.

This study investigates the capabilities of LLMs to infer missing values within industrial-grade accident databases, with a focus on the variables deemed necessary for scenario reconstruction. Specifically, the study addresses three research questions: the usability of accident data for scenario generation, the reliability of accident databases with respect to key variables, and the inference performance of LLMs on numeric and categorical fields.

The study follows a Design Science Research methodology with four experimental cycles, varying the input and output constraints in each cycle to explore their impact on inference quality. The results show that while the database structure does contain the variables necessary for a basic and accurate reconstruction of the accident scenario, a majority of the cases are missing essential variables. LLMs demonstrated strong potential for imputing such values, particularly when flexibility in the output was introduced (e.g., ranges for numeric data and LLM-knowledge-based lists for categorical data). Sanity checks confirm that the LLM was not merely guessing the value but rather making informed decisions based on the data provided.
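
The flexible output constraints mentioned above can be pictured as a simple scoring rule: a numeric imputation counts as correct when the held-out value falls inside the predicted range, and a categorical imputation counts as correct when the held-out value appears in the predicted list. The following Python snippet is a minimal sketch of that idea, not the study's evaluation code. The function names (`range_hit`, `list_hit`, `hit_rate`) and all sample values are hypothetical and chosen purely for illustration.

```python
from typing import Sequence


def range_hit(true_value: float, low: float, high: float) -> bool:
    """Return True when the held-out numeric value lies inside the predicted range."""
    if low > high:
        low, high = high, low  # tolerate a range given in reversed order
    return low <= true_value <= high


def list_hit(true_value: str, candidates: Sequence[str]) -> bool:
    """Return True when the held-out category appears in the predicted candidate list."""
    # Compare case-insensitively and ignore surrounding whitespace.
    normalised = {c.strip().lower() for c in candidates}
    return true_value.strip().lower() in normalised


def hit_rate(hits: Sequence[bool]) -> float:
    """Fraction of successful imputations; 0.0 for an empty sequence."""
    return sum(hits) / len(hits) if hits else 0.0


# Hypothetical held-out values and model outputs (illustrative only, not study data).
speed_hits = [
    range_hit(52.0, 40.0, 60.0),  # true speed inside the predicted range
    range_hit(30.0, 35.0, 50.0),  # true speed outside the predicted range
]
road_hits = [
    list_hit("Wet", ["wet", "icy"]),   # true category is among the candidates
    list_hit("dry", ["wet", "snow"]),  # true category is missing from the candidates
]
print(hit_rate(speed_hits), hit_rate(road_hits))
```

Scoring against a range or a list is more lenient than exact matching, so a hit rate obtained this way should be read alongside the width of the ranges and the length of the lists the model is allowed to return.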

These findings informed the development of a design artefact: a framework for imputing missing values in accident databases. Future work should explore fine-tuning LLMs on domain-specific datasets and combining them with traditional imputation methods to further improve the accuracy and generalisability of accident data imputation.
