Creating Synthetic Dialogue Datasets for NLU Training. An Approach Using Large Language Models
Abstract
This thesis explores the use of the GPT-4 large language model to generate high-quality,
diverse synthetic dialogue datasets for training Natural Language Understanding (NLU) models
in task-oriented dialogue systems.
Using a schema-guided framework and prompt engineering, the study investigates
whether synthetic data can replace real-world data. The research focuses on three tasks: domain
classification, active intent classification, and slot multi-labelling. Results show that while
synthetic datasets can moderately match real-world data, problems with data quality and
annotation inconsistency persist.
Degree
Student essay
Date
2024-06-20
Author
Laszlo, Bogdan
Keywords
Language Technology
Language
eng