|dc.description.abstract||Background: Natural Language Understanding (NLU) is an important component in Dialogue Systems
(DS) which makes the utterances of humans understandable by machines. A central aspect of NLU is
intent classification. In intent classification, an NLU receives a user utterance, and outputs a list of N
ranked hypotheses (an N-best list) of the predicted intent along with a confidence estimation (a real
number between 0 and 1) that is assigned to each hypothesis.
Objectives: In this study, we perform an in-depth evaluation of the confidence estimation of 5 NLUs,
namely Watson Assistant, Language Understanding Intelligent Service (LUIS), Snips.ai and Rasa in two
different configurations (Sklearn and DIET). We measure the calibration on two levels: rank level (results
for specific ranks) and model level (aggregated results across ranks), as well as the performance on a
model level. Calibration here refers to the relation between confidence estimates and true likelihood, i.e.
how useful the confidence estimate associated with a certain hypothesis is for assessing its likelihood of
being correct.
Methodology: We conduct an exploratory case study on the NLUs. We train the NLUs using a subset of a
multi-domain dataset proposed by Liu et al. (2021) on intent classification tasks. We assess the calibration
of the NLUs on the model and rank levels using reliability diagrams and the correlation coefficient with respect
to instance-level accuracy, while we measure the performance through accuracy and F1-score.
Results: The evaluation results show that on a model level, the best calibrated NLU is Rasa-Sklearn and
the least calibrated NLU is Snips, while Watson is the best-performing NLU and Rasa-Sklearn is the
worst-performing NLU. The rank-level results are consistent with the model-level results.
However, on lower ranks, some measures become less informative due to low variation of the confidence
estimates.
Conclusion: Our findings indicate that when choosing an NLU for a dialogue system, there is a trade-off
between calibration and performance, that is, a well-performing NLU is not necessarily well-calibrated,
and vice versa. While the chosen metrics of calibration are clearly useful, we also note some limitations
and conclude that further investigation is needed to find the optimal metric of calibration. Also, it should
be noted that to some extent, our results rest on the assumption that the chosen metrics of calibration are
suitable for our purposes.||en_US