Creating Synthetic Dialogue Datasets for NLU Training. An Approach Using Large Language Models

Abstract

This thesis investigates using the GPT-4 large language model to generate high-quality, diverse synthetic dialogue datasets for training Natural Language Understanding (NLU) models in task-oriented dialogue systems. Using a schema-guided framework and prompt engineering, the study examines whether synthetic data can replace real-world data. The research focuses on three NLU tasks: domain classification, active intent classification, and slot multi-labelling. Results show that while synthetic datasets can moderately match real-world data, problems with data quality and annotation inconsistency persist.

Keywords

Language Technology
