Creating Synthetic Dialogue Datasets for NLU Training. An Approach Using Large Language Models
Abstract
This thesis investigates the use of the GPT-4 large language model to generate high-quality,
diverse synthetic dialogue datasets for training Natural Language Understanding (NLU) models
in task-oriented dialogue systems.
Using a schema-guided framework and prompt engineering, the study examines
whether synthetic data can replace real-world data. The research focuses on three NLU tasks:
domain classification, active intent classification, and slot multi-labelling. Results show that
synthetic datasets can match real-world data to a moderate degree, but problems with data
quality and annotation consistency persist.
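
As a rough illustration of what schema-guided prompting can look like, the sketch below turns a small service schema into a generation prompt. It is not the thesis's actual pipeline: the `ServiceSchema` and `Intent` classes, the `build_prompt` helper, the prompt wording, and the example restaurant schema are all hypothetical, and no call to GPT-4 is made.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    """One intent in a (hypothetical) service schema."""
    name: str
    description: str
    slots: list[str] = field(default_factory=list)


@dataclass
class ServiceSchema:
    """A domain together with the intents it supports (hypothetical structure)."""
    domain: str
    intents: list[Intent]


def build_prompt(schema: ServiceSchema, num_turns: int = 6) -> str:
    """Render the schema as a plain-text prompt asking for an annotated dialogue."""
    lines = [
        f"Generate a task-oriented dialogue in the '{schema.domain}' domain.",
        f"The dialogue should have about {num_turns} turns.",
        "Annotate every user turn with its active intent and slot labels.",
        "Available intents and slots:",
    ]
    for intent in schema.intents:
        slots = ", ".join(intent.slots) if intent.slots else "none"
        lines.append(f"- {intent.name}: {intent.description} (slots: {slots})")
    return "\n".join(lines)


# Illustrative schema only; the names below are assumptions, not thesis data.
example_schema = ServiceSchema(
    domain="Restaurants",
    intents=[
        Intent("FindRestaurant", "Search for a restaurant", ["city", "cuisine"]),
        Intent("ReserveTable", "Book a table", ["restaurant_name", "time", "party_size"]),
    ],
)

prompt = build_prompt(example_schema)
print(prompt)
```

Listing the allowed domain, intents, and slots in the prompt is one way to constrain the labels a model may emit, which is relevant to the annotation inconsistency the abstract mentions; the resulting string would then be sent to whichever language model is used.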
Keywords
Language Technology