DEPARTMENT OF PHILOSOPHY, LINGUISTICS AND THEORY OF SCIENCE

EVALUATING CONFIDENCE ESTIMATION IN NLU FOR DIALOGUE SYSTEMS

Ranim Khojah

Master's Thesis: 30 credits
Programme: Master's Programme in Language Technology
Level: Advanced level
Semester and year: Spring, 2022
Supervisors: Staffan Larsson and Alexander Berman
Examiner: Eleni Gregoromichelaki
Keywords: Natural Language Understanding (NLU), Intent ranking, Confidence Calibration

Abstract

Background: Natural Language Understanding (NLU) is an important component in Dialogue Systems (DS) which makes the utterances of humans understandable by machines. A central aspect of NLU is intent classification. In intent classification, an NLU receives a user utterance and outputs a list of N ranked hypotheses (an N-best list) of the predicted intent, along with a confidence estimate (a real number between 0 and 1) assigned to each hypothesis.

Objectives: In this study, we perform an in-depth evaluation of the confidence estimation of 5 NLU models, namely Watson Assistant, Language Understanding Intelligent Service (LUIS), Snips.ai and Rasa in two different configurations (Sklearn and DIET). We measure the calibration on two levels: rank level (results for specific ranks) and model level (aggregated results across ranks), as well as the performance on a model level. Calibration here refers to the relation between confidence estimates and true likelihood, i.e. how useful the confidence estimate associated with a certain hypothesis is for assessing its likelihood of being correct.

Methodology: We conduct an exploratory case study on the NLUs. We train the NLUs on intent classification tasks using a subset of a multi-domain dataset proposed by Liu et al. (2021). We assess the calibration of the NLUs on model and rank levels using reliability diagrams and the correlation coefficient with respect to instance-level accuracy, while we measure the performance through accuracy and F1-score.

Results: The evaluation results show that on a model level, the best calibrated NLU is Rasa-Sklearn and the least calibrated NLU is Snips, while Watson is the best performing NLU and Rasa-Sklearn the worst performing one. The rank-level results resonate with the model-level results. However, on lower ranks, some measures become less informative due to low variation of the confidence estimates.

Conclusion: Our findings convey that when choosing an NLU for a dialogue system, there is a trade-off between calibration and performance; that is, a well-performing NLU is not necessarily well-calibrated, and vice versa. While the chosen metrics of calibration are clearly useful, we also note some limitations and conclude that further investigation is needed to find the optimal metric of calibration. It should also be noted that, to some extent, our results rest on the assumption that the chosen metrics of calibration are suitable for our purposes.

Preface

On my first day at university, a teacher told me: "You'll know that you're learning once your brain starts hurting." Five years later, my brain still hurts. I've learned a lot in this thesis and enjoyed every bit of it! I'd like to thank my patient supervisors, Staffan Larsson and Alexander Berman, for their great guidance and feedback. I'd also like to express my gratitude to my examiner, Eleni Gregoromichelaki, for providing detailed feedback to shape the thesis. Thank you, mama, baba, and ablacım, for supporting me and always being there to listen to me complain.

Contents

1 Introduction
  1.1 Research Questions
  1.2 Aims and Contribution
2 Background
  2.1 Related Work
  2.2 Terminology
  2.3 Explanatory Example
3 Materials and Technical Details
  3.1 NLU Services and Frameworks
    3.1.1 Watson Assistant
    3.1.2 LUIS
    3.1.3 Snips
    3.1.4 Rasa Opensource
  3.2 Dataset
4 Methodology
  4.1 Case Study Setup
  4.2 Intent Classification
  4.3 Evaluation of Confidence Estimation
    4.3.1 Confidence Calibration
    4.3.2 Performance
5 Results and Analysis
  5.1 Reliability Diagrams
    5.1.1 Model-level Results
    5.1.2 Rank-level Results
  5.2 Spearman's Correlation Coefficient
    5.2.1 Model-level Results
    5.2.2 Rank-level Results
  5.3 Performance
6 Discussion
  6.1 Validity of Calibration Measures
  6.2 Interpreting Calibration and Potential Applications
  6.3 Performance vs. Calibration
7 Ethics and Validity Threats
  7.1 Construct Validity
  7.2 Internal Validity
  7.3 External Validity
  7.4 Data Fallacies
8 Conclusion and Future Work
References
9 Appendix
  9.1 Rasa Pipelines
  9.2 Reliability Diagrams with Standard Deviation
  9.3 Rank-level Ranks vs Spearman's Correlations Plot with Standard Deviation
  9.4 Results of Rank-level Spearman's Correlation
  9.5 Post-hoc Analysis: t-test Calculations

1 Introduction

Dialogue Systems (DS) have received much attention in academia and industry in recent years. The main goal of work on dialogue systems is to improve the quality of human-computer dialogues by making them more natural. This is achieved in part through the development of the Natural Language Understanding (NLU) component, which is responsible for understanding the semantics of users' utterances. Present-day NLUs typically apply machine learning models to unstructured data (i.e., the user utterances) to extract features (e.g., keywords, word counts and word embeddings) and predict the intent of the user accordingly (Jung, 2019; Shridhar et al., 2019). NLU services and frameworks (henceforth NLUs) are widely used by dialogue developers to allow them to create and train NLU models for dialogue systems. The task of choosing an NLU may be informed by knowledge about how well-performing and well-calibrated the NLU is in a specific domain or context. In work on classification in machine learning, calibration is a property that illustrates how estimated confidences reflect real-world probabilities or the true likelihood of predictions (Guo et al., 2017). In the context of dialogue systems, well-calibrated NLUs may have an impact on the performance of the DS by providing reliable output to other components in the DS (e.g., the Dialogue Manager (DM)), whereas miscalibration can cause problems. Specifically, over-confidence can cause undesired actions from the DS, and under-confidence can cause undesired control questions and clarification questions which disrupt the flow of the dialogue. These disruptions can lead to decreased efficiency and increased distraction and thus pose serious risks in critical applications and domains, such as healthcare and automotive.

1.1 Research Questions

This study addresses methodological questions regarding how to measure calibration, and aims to answer the following research questions:

RQ1: To what extent are state-of-the-art NLUs well-calibrated?
  RQ1.1: How well-calibrated are the NLUs on a model level?
  RQ1.2: How well-calibrated are the NLUs on a rank level?
RQ2: How do state-of-the-art NLUs perform in intent classification tasks?
RQ3: To what extent are calibration and performance of an NLU correlated?

Note that in RQ1, we are interested in comparing calibration across the NLUs, rather than judging the overall calibration as, e.g., high or low.

1.2 Aims and Contribution

This study's main contribution is C1) providing an evaluation of confidence calibration for state-of-the-art NLUs, something that – to the best of our knowledge – hasn't been previously done.
C2) We propose and test new ways of measuring calibration for all hypotheses rather than only for the top hypothesis, on both rank and model levels. Finally, we C3) make our evaluation scripts publicly available on GitHub [1], along with the dataset we have used, to allow replication of the study and to ease building on it.

Our evaluation aims to help dialogue developers choose an appropriate NLU and also to adapt their dialogue system to specific NLUs. Since our evaluation results describe the reliability of the NLUs in terms of their calibration and performance, dialogue system developers will have more information both about the properties of the NLUs and on how to interpret the output of a given NLU. For example, depending on the degree of calibration in the NLU, grounding behaviours such as positive and negative feedback (indicating understanding or lack thereof) or contextual or interactive disambiguation (clarification requests) can be motivated. If the confidence estimates reflect true likelihood, then if two (or more) hypotheses have similar confidence estimates, this may indicate the presence of an ambiguity in the user input (from the perspective of the NLU, i.e., disregarding the dialogue context) that needs to be resolved. Conversely, if confidence estimates (especially for non-top ranks) do not reflect true likelihood, then even if the top two (or more) hypotheses have similar estimates, this may not be a reliable indication of ambiguity but rather be caused by noise.

2 Background

2.1 Related Work

In prior work, benchmarks and evaluations have been performed to identify the best NLU service in different domains like Software Engineering (Abdellatif et al., 2021), Meteorology (Canonico & De Russis, 2018), Question Answering (QA) (Braun et al., 2017) and others (McTear et al., 2016; Stoyanchev et al., 2016; Kar & Haldar, 2016; Koetter et al., 2018). Generally, these evaluation studies have been conducted to draw the trade-off line between different NLU services in terms of the usability of their user interfaces (Gregori, 2017), the technical features the NLUs provide (e.g., language and device support) (Koetter et al., 2018) and performance as regards identifying the correct intent of a user's utterance (Braun et al., 2017; Liu et al., 2021). NLU performance is usually evaluated with accuracy or F1 score (Braun et al., 2017), both of which depend only on the top hypothesis returned by the NLU and disregard the associated confidence estimates. For example, an NLU that predicts 3 out of 10 intents incorrectly with high confidence estimates has the same performance as an NLU that predicts 3 out of 10 intents incorrectly with low confidence estimates.

In addition, various methods for visualizing and measuring confidence calibration (the extent to which confidence estimates reflect true likelihoods) have been discussed in work on machine learning for classification tasks. For example, Liu et al. (2021) and Vasudevan et al. (2019) visualize the calibration of neural network models through reliability diagrams. Another proposed calibration measure is the correlation coefficient of the confidence estimate with respect to the F1 score through Spearman's correlation (Dong et al., 2018) and with respect to instance-level accuracy through Pearson's correlation (Vasudevan et al., 2019). Expected Calibration Error (ECE) (Naeini et al., 2015) has also been used to measure the calibration of many models (Guo et al., 2017; Kuleshov et al., 2018). Furthermore, Nixon et al.
(2019) extend ECE for deep learning models by looking into the probabilities of all predictions rather than only the top one. In this study, we apply two of the previously proposed calibration assessment methods to NLUs, that is, reliability diagrams and the correlation between confidence estimates and instance-level accuracy. On a model level, we follow Nixon et al. (2019) in considering confidence estimates of all hypotheses returned by an NLU – including hypotheses with lower ranks (2-N) – but using another measure. We also measure calibration on a rank level, enabling a more fine-grained analysis.

[1] https://github.com/ranimkhojah/confidence-estimation-benchmark

2.2 Terminology

NLUs are commonly used to perform intent classification and entity recognition tasks. In intent classification, an NLU takes a user utterance as input and parses it into a machine-readable representation in order to return a prediction of the user's intent accordingly (Tur & De Mori, 2011; Wang et al., 2005). In dialogue systems, intent classification data is used to train an NLU model, that is, user utterances with a corresponding expected intent for each utterance. When using an NLU, an utterance U is fed to the trained NLU model, and the output normally includes the information in Listing 1.

{
    'utterance': 'U',
    'top_intent': 'intent_1',
    'intent_ranking': {
        'intent_1': 'conf_1',  # hypothesis on rank 1
        'intent_2': 'conf_2',  # rank 2
        ...,
        'intent_N': 'conf_N'   # rank N
    }
}

Listing 1: Abstract example of a JSON response from an NLU when parsing an utterance.

Given an utterance U, the response from an NLU includes the user utterance and a prediction consisting of the top intent and an intent ranking. The intent ranking consists of the N-best intent hypotheses with their corresponding confidence estimates. The confidence estimates reflect how confident the NLU model is regarding each hypothesis. Different NLUs may have different ways of computing these estimates, and they may represent slightly different notions of confidence. However, for the purpose of using the estimates in a dialogue system, we are interested in how well they reflect true probabilities. We will note variations in how confidence estimates are computed, but we will not take them into account in our quantitative tests.

2.3 Explanatory Example

We present Figure 1 as a simplified demonstration of how NLUs are used in dialogue systems, using a scenario where a user asks a dialogue system to perform a task within the home domain. The user interacts and communicates with the dialogue system through a user interface. The user utterance is transferred to an NLU component which uses the utterance as input to perform intent classification. Next, it returns a prediction with the top intent and the intent ranking, which is sent to a Dialogue Manager (DM) that steers the dialogue accordingly. In case of a high estimated confidence for the top hypothesis in the intent ranking, the DM integrates the user's intent, and information is sent to the Natural Language Generation (NLG) unit, which generates a response and sends it to the user.
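To make this flow concrete, the following minimal sketch shows one way a dialogue manager could act on the N-best list: integrate the top intent when its confidence is high, ask a clarification question when the top hypotheses are close, and otherwise ask the user to rephrase. The thresholds, margin and function names are illustrative assumptions and are not part of any of the evaluated NLUs or of this study's implementation.

# Illustrative sketch only: thresholds and names are assumptions, not taken
# from any evaluated NLU or from the thesis' implementation.
from typing import List, Tuple

def decide_dialogue_move(intent_ranking: List[Tuple[str, float]],
                         accept_threshold: float = 0.8,
                         ambiguity_margin: float = 0.1) -> str:
    """Choose a dialogue move from an N-best list of (intent, confidence) pairs."""
    # Sort hypotheses by confidence, highest first.
    ranked = sorted(intent_ranking, key=lambda h: h[1], reverse=True)
    top_intent, top_conf = ranked[0]
    if top_conf >= accept_threshold:
        # High confidence: integrate the intent and act on it.
        return f"perform({top_intent})"
    if len(ranked) > 1 and top_conf - ranked[1][1] <= ambiguity_margin:
        # Two hypotheses are close: the input looks ambiguous to the NLU,
        # so ask a clarification question.
        return f"clarify({top_intent} / {ranked[1][0]})"
    # Otherwise: low confidence, give negative feedback / ask to rephrase.
    return "request_rephrase()"

# Example with the N-best list from Figure 1:
print(decide_dialogue_move([("tv_on", 0.91), ("play_music", 0.08),
                            ("call_contact", 0.01)]))   # -> perform(tv_on)

How much such thresholds and margins can be trusted depends precisely on how well-calibrated the confidence estimates are, which is what the remainder of this thesis evaluates.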
Figure 1: An explanatory example that demonstrates the role of the NLU in a dialogue system. (In the diagram, the user's utterance "Turn on the television, please." passes from the User Interface to the NLU, which performs intent classification and returns top_intent tv_on with the intent ranking tv_on: 0.91, play_music: 0.08, call_contact: 0.01; the Dialogue Manager integrates the intent and the Natural Language Generation component answers "TV on.")

3 Materials and Technical Details

3.1 NLU Services and Frameworks

NLU services and frameworks (henceforth NLUs) can be used to construct the NLU component in a dialogue system. In this study, we choose which NLUs to evaluate based on the following criteria: NLUs that i) can perform intent classification and ii) return at least the 10-best hypotheses in the output. We evaluate 5 commonly-used NLUs: Watson Assistant (IBM, 2010), Language Understanding Intelligent Service (LUIS) (Microsoft, 2017), Snips (Snips, 2013) and Rasa (Rasa, 2016) with two different pipelines. In this section, we explain how each of the 5 NLUs is trained and tested through a user interface and/or an API (summarized in Table 1).

3.1.1 Watson Assistant

Watson Assistant (henceforth Watson) is a cloud-based NLU developed by IBM. It enables managing and building an NLU through the IBM cloud interface, which doesn't require any programming experience. Building an NLU includes creating intents and corresponding examples, detecting and resolving conflicts between intents, and training and testing an NLU model. Training an NLU with Watson can only be done through the user interface by manually uploading the training set in a specific format, while testing is possible through the interface and the API. When parsing an utterance, Watson returns the top 10 hypotheses along with their confidence estimates. Confidence estimates are not normalized, as they are calculated independently for each intent that the NLU model has been trained on – in other words, Watson is a multiple-binary classifier which assumes that there is a possibility for an utterance to have more than one correct intent. In addition, Watson has an optional built-in "Irrelevant" intent that corresponds to an out-of-scope (OOS) input [2].

[2] https://cloud.ibm.com/docs/assistant?topic=assistant-irrelevance-detection

3.1.2 LUIS

Language Understanding Intelligent Service (LUIS) is provided by Microsoft and runs on the Azure cloud platform. Similarly to Watson, training is performed via the user interface. Given a training set of intents and their respective user utterances, LUIS trains an intent I using its user utterances as positive examples and the user utterances of other intents as negative examples [3] of intent I. There is no limit to the number of hypotheses that LUIS returns; in other words, if the NLU is trained on N intents, the intent ranking is of length N. In the intent ranking, confidence estimates are not normalized, and a "None" intent is included as a representation of an out-of-scope intent. The "None" intent is trained on 0 examples by default and requires the user to train it on example utterances [4].

3.1.3 Snips

Snips.ai is an AI voice platform for connected devices (since acquired by Sonos [5]) which provides an NLU called Snips NLU (henceforth Snips). In this study, we use version v0.20.2 of Snips NLU. By default, Snips returns hypotheses for all intents, with confidence estimates that are not normalized, in addition to a built-in "None" intent that is pre-trained by Snips on examples of noise text to cover out-of-scope utterances (Coucke et al., 2018).

3.1.4 Rasa Opensource

Rasa is an open-source NLU provided by Rasa Technologies. It can run on different configurable pipelines, which increases the flexibility of the NLU (Bocklisch et al., 2017).
Training and testing are performed by sending requests to the Rasa Open Source server through the Rasa HTTP API [6], or through Rasa NLU's SDK for Python. Rasa returns the top 10 hypotheses by default, along with their corresponding normalized confidence estimates. In addition, Rasa does not offer a built-in out-of-scope intent. In this study, we use Rasa v2.4.3 to create two NLU models with two different pipelines (see Appendix 9.1). We use the pre-configured pipelines offered by Rasa. The first pipeline uses an Sklearn Intent Classifier and a pre-trained bag-of-words (BoW) model where one feature vector is assigned to one input utterance (see Listing 2). In contrast, the second pipeline is built on a Dual Intent and Entity Transformer (DIET) architecture and is based on a sequence model, meaning that it considers the order of the words present in an utterance (Bunk et al., 2020). For featurization, this pipeline uses bag of words and n-grams (see Listing 3). We refer to the two pipelines above as Rasa-Sklearn and Rasa-DIET respectively.

NLU    | Packaging             | Classifier Type         | Version/Invoked on    | OOS Intent
Watson | Cloud-based service   | Multi-binary classifier | Invoked in April 2022 | Yes
LUIS   | Cloud-based service   | Multi-class classifier  | Invoked in April 2022 | Yes
Snips  | Open-source framework | Multi-class classifier  | v0.20.2               | Yes
Rasa   | Open-source framework | Multi-class classifier  | v2.4.3                | No

Table 1: Summary of NLUs. (OOS: Out-of-Scope).

[3] https://docs.microsoft.com/en-us/azure/cognitive-services/luis/luis-concept-model#intents-classify-utterances
[4] https://docs.microsoft.com/en-us/azure/cognitive-services/luis/concepts/intents#none-intent
[5] https://www.sonos.com
[6] https://rasa.com/docs/rasa/http-api/

3.2 Dataset

To conduct intent classification as part of our evaluation, we use the dataset proposed by Liu et al. (2021). The dataset consists of 25716 annotated user utterances for human-robot interaction and covers 64 intents, 18 scenarios and 21 domains. The user utterances in the dataset are annotated with, inter alia, the intent, the scenario and a normalized version of the user utterance which removes noise and punctuation and converts numbers to text. For instance, the user utterance "Olly, wake me up at 7am!" is annotated with the intent set and the scenario alarm, and is normalized to "wake me up at seven am". For the NLU evaluation, we generalize the 21 domains and combine them into the 6 high-level domains in Table 2. We then select the 10 intents with the greatest number of examples (see Table 3). Since Watson and Rasa-Sklearn return only the top 10 hypotheses in the intent ranking, training the NLUs on 10 intents ensures consistency across the evaluated NLUs and guarantees that the NLUs will return all the intents they were trained on. This allows us to include all possible hypotheses in the evaluation.
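As a rough illustration of this selection step, the sketch below counts the examples per intent and keeps the 10 largest ones. The file name and column name are assumptions about how the dataset export is organised; they may not match the published release of Liu et al. (2021) or the preprocessing scripts in our repository.

# Sketch only: "nlu_dataset.csv" and the "intent" column are assumed names,
# not the thesis' actual preprocessing code.
import pandas as pd

data = pd.read_csv("nlu_dataset.csv")                 # one row per annotated utterance
counts = data["intent"].value_counts()                # number of examples per intent
top_intents = counts.nlargest(10).index.tolist()      # the 10 intents with most examples

subset = data[data["intent"].isin(top_intents)]       # keep only the selected intents
print(top_intents)
print(len(subset), "utterances in the selected subset")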
Domain                      | Scenarios                                   | Intents
Personal assistance         | calendar, datetime, weather, lists, alarm   | query, set, remove, convert, createoradd, sendemail, querycontact, addcontact
General                     | general                                     | quirky, greet, negate, dontcare, repeat, affirm, commandstop, confirm, explain, praise
Audio and vision programmes | music, audio, news, play                    | query, settings, volume_mute, volume_down, volume_up, music, radio, audiobook, podcasts, game
IoT                         | iot                                         | hue_lightchange, hue_lightoff, hue_lighton, hue_lightdim, cleaning, hue_lightup, coffee, wemo_on, wemo_off
Outdoor activities          | transport, recommendation, social, takeaway | query, ticket, traffic, taxi, locations, events, movies, post, order
Q&A                         | qa                                          | stock, factoid, definition, maths, currency

Table 2: Domains, scenarios and intents in Liu et al. (2021)'s dataset.

Intent    | Size | Normalized Example
query     | 5981 | what's the time in australia
set       | 1748 | wake me up at nine am on friday
music     | 1205 | start playing music from favourites
quirky    | 1088 | i am not tired i am actually happy
factoid   | 1052 | tell me comics of charlie chaplin
remove    | 986  | cancel my seven am alarm
negate    | 939  | you don't understand it right
sendemail | 694  | send a group mail to lookafter
explain   | 684  | could you clarify me on it further
repeat    | 585  | please let's start over
Total     | 14962 examples |

Table 3: Selected intents for the case study with their respective sizes (i.e. number of example utterances) and example utterances from the original dataset.

Figure 2: The evaluation process. (Panels: Data Preparation – raw data and data processing; Data Collection – x10 cross-validation over training and test data, NLU service, NLU model, predictions; Data Analysis – statistical analysis and evaluation.) This process was iterated 5 times, once per NLU. The highlighted box (i.e. Evaluation) is the core component of the study where we assess the confidence estimation of NLUs.

4 Methodology

4.1 Case Study Setup

We conduct an exploratory case study to examine the calibration and performance of the 5 commonly-used NLUs described in Section 3.1 using the multi-domain dataset in Section 3.2. Furthermore, since confidence estimation is the main focus of our study, we only evaluate NLUs that return at least the top 10 hypotheses. We summarize the components of our study in Table 4. An overview of our study's execution is illustrated in Figure 2. We start by selecting the data for the evaluation, applying a cross-validation method (with 10 iterations) and then training an NLU model (see Section 4.2). During data collection, we test the NLU models and obtain results in the form of predictions, which we use to perform the evaluation of the confidence calibration and performance (see Section 4.3). Finally, we verify our evaluation results using a post-hoc statistical analysis (see Section 5).

Objective              | Explore
The Context            | Natural Language Understanding for Dialogue Systems
The Cases              | NLUs: IBM Watson Assistant, Microsoft LUIS, Rasa and Snips.ai
Theory                 | Confidence estimation of NLUs
Research Questions     | RQ1, RQ2 and RQ3 (incl. sub-RQs)
NLU Selection Strategy | Cloud-based and open source NLUs that return at least 10 hypotheses
Units of Analysis      | Calibration and Performance

Table 4: Case Study Planning (template from Runeson & Höst (2009)).

4.2 Intent Classification

We select the 10 intents with the greatest number of examples (see Table 3) to maximize the size of the dataset and minimize data sparsity. These 10 intents cover 5 domains and 15 scenarios (see Table 2) and have a total of 14962 utterances.
We perform Repeated Random Sub-sampling (Dubitzky et al., 2007) with 10 iterations to generate 10 datasets, where each dataset is split into training and test sets with a 2:1 ratio. Next, the training sets are processed by first cleaning the example utterances, e.g., removing special characters and examples with missing fields. We train and test 10 NLU models using the 10 splits, resulting in 10 trained NLU models for each NLU. In the test results, we ignore hypotheses with the "None"/"Irrelevant" intent in the intent ranking to ensure that all NLUs have the same intent ranking length and make their results comparable. Also, we do not normalize the confidence estimates for any of the NLUs (clarified in Section 7).

4.3 Evaluation of Confidence Estimation

The evaluation of confidence estimation is performed on two levels: rank and model. On rank level, results are obtained for specific prediction ranks. For example, results for rank 1 pertain to the top-ranked (most confident) prediction hypotheses. On model level, the results of all ranks are aggregated for each NLU model. The main focus of the evaluation is the calibration of the NLUs. However, we also assess performance in order to investigate the correlation between the NLUs' calibration and performance. The former is measured using reliability diagrams and the correlation coefficient with respect to instance-level accuracy, and the latter is measured through accuracy and F1-score. Furthermore, the evaluation is conducted for each split and then averaged across splits.

4.3.1 Confidence Calibration

Confidence calibration is the extent to which a model is able to produce confidence estimates that reflect the accuracy (true likelihood) of the respective intent hypotheses (Guo et al., 2017). For example, a model is well-calibrated if hypotheses with a confidence estimate of 0.7 are correct in 70% of the cases. In this study, we visualize calibration using reliability diagrams and numerically estimate calibration using Spearman's correlation coefficient with respect to instance-level accuracy, as outlined below.

Reliability diagrams: Reliability diagrams are visualizations of a model's calibration (Guo et al., 2017). They plot the true likelihood (accuracy) of a prediction as a function of the confidence estimate. Hence, a perfectly-calibrated model is visualized as the identity function, and any deviation indicates miscalibration. In detail, reliability diagrams are plotted by partitioning predictions into bins, each of which represents a confidence range. In our study, we use 10 uniformly distributed bins, i.e. [0.0-0.1], [0.1-0.2], ..., [0.9-1.0]. For each bin, the mean accuracy is calculated – in other words, the proportion of correct predictions out of the total number of predictions in the bin (code on GitHub [7]).

Correlation coefficient with respect to instance-level accuracy: In order to numerically measure the degree of calibration, we compare confidence estimates (scores in the range 0-1) with instance-level accuracies (1 for correct classifications, 0 for incorrect classifications). More specifically, we measure the extent to which an increase in score correlates with an increase in instance-level accuracy – in other words, the monotonicity of the relationship between confidence and accuracy. The degree of monotonicity is measured using Spearman's correlation coefficient (Xiao et al., 2016) [8].
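The following simplified sketch illustrates both measures, assuming the predictions have been flattened into an array of confidence estimates and a parallel array of instance-level accuracies (1 for correct, 0 for incorrect). The published notebook in the GitHub repository is the authoritative implementation; the input numbers below are placeholders only.

# Simplified sketch of the two calibration measures; placeholder inputs,
# not real evaluation data.
import numpy as np
from scipy.stats import spearmanr

def reliability_bins(confidences, correct, n_bins=10):
    """Mean confidence, mean accuracy and size per confidence bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)       # 1.0 = correct, 0.0 = incorrect
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences >= lo) & (confidences < hi)
        if hi == 1.0:                                 # include estimates of exactly 1.0
            in_bin |= confidences == 1.0
        if in_bin.any():
            bins.append((confidences[in_bin].mean(), correct[in_bin].mean(), int(in_bin.sum())))
        else:
            bins.append((None, None, 0))              # empty bin
    return bins

# Toy N-best output flattened into two parallel arrays (placeholder values).
confs = np.array([0.91, 0.08, 0.01, 0.62, 0.30, 0.75, 0.15, 0.55])
hits  = np.array([1,    0,    0,    1,    0,    1,    0,    1])

print(reliability_bins(confs, hits))
rho, p_value = spearmanr(confs, hits)                 # monotonicity of confidence vs. accuracy
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f})")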
[7] https://github.com/ranimkhojah/confidence-estimation-benchmark/blob/master/scripts/calibration_evaluation_all_splits.ipynb
[8] We perform Spearman's correlation rather than Pearson's correlation since our data is not normally distributed.

Figure 3: Gold standard confidence estimates: the distribution of confidence estimates (x-axis) with respect to instance-level accuracy (y-axis) of a perfectly-calibrated model (Spearman's correlation of 1.0).

Given two variables X and Y of size n each (x_1, x_2, ..., x_n and y_1, y_2, ..., y_n respectively), Spearman's correlation coefficient (ρ) is calculated through the formula:

    ρ = 1 − (6 Σ d_i²) / (n(n² − 1))

where n is the number of samples and d_i is the difference between the ranks of x_i and y_i. In intent classification, a perfectly-calibrated NLU has a Spearman's correlation coefficient of 1.0, and it always estimates a confidence of 1 for correct hypotheses and a confidence of 0 for incorrect hypotheses, as shown in Figure 3. Note that other approaches to numerically estimating calibration have been discussed in the literature besides Spearman's correlation (Dong et al., 2018), e.g. negative log likelihood and Brier score (Kull et al., 2017) and expected calibration error (Nixon et al., 2019). Different measurement approaches have different advantages and weaknesses (Ashukha et al., 2020; Nixon et al., 2019), and no gold standard seems to exist. In this study, we have opted for Spearman's correlation because monotonicity in the relation between confidence and accuracy is an important characteristic of good calibration. Also, it seems intuitive that an NLU with a more monotonic relation between confidence and accuracy is easier to recalibrate (i.e., to improve its calibration) (Nixon et al., 2019).

4.3.2 Performance

Since performance considers only first-ranked hypotheses, it cannot be measured on a rank level. To measure performance, we use F1-score and accuracy. We use F1-score since it accounts for false positives and false negatives through precision and recall. Another reason is the unbalanced distribution of the example utterances across intents. However, the complexity of F1-score makes it harder to interpret; therefore, we also include accuracy, since in this particular multi-domain dataset false negatives carry no major risks.

Given true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) in the classification results, the following formulas are used to calculate the performance measures:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)
    Precision = TP / (TP + FP)
    Recall = TP / (TP + FN)
    F1 = (2 · Precision · Recall) / (Precision + Recall) = 2 · TP / (2 · TP + FP + FN)

5 Results and Analysis

In this section we present our results (averaged across the 10 splits). Our collected data are qualitative in the form of visualizations (reliability diagrams) and quantitative (Spearman's correlation, accuracy, F1-score). For the quantitative results, we provide the average across splits along with the standard deviation (SD), whereas for the qualitative results and all visualizations, we plot the averaged results without SD to avoid cluttered diagrams. However, we provide the visualizations with the SD of the mean accuracies of the splits in Appendix 9.2.

Moreover, to verify our quantitative results, post-hoc statistical hypothesis testing is conducted to determine whether there is a statistically significant difference (SSD) between the NLUs' results.
We are interested in the SSD between all pairs of NLUs. Therefore, we run a parametric statistical test (t-test) on NLU pairs. Each NLU has 10 rows that represent the results (e.g., Spearman's correlation, accuracy or F1-score) obtained from each of the 10 splits. We then use Cohen's d to judge the effect size of the statistically significant differences (SSD) between each pair of NLUs and represent the magnitude with the notation: L – Large, M – Moderate, S – Small, N – Negligible.

5.1 Reliability Diagrams

We get a visual overview of the NLUs' calibration through reliability diagrams on a model level (Figure 4) and on a rank level (Figures 5, 6, 7, 8). In the rank-level reliability diagrams, we merge ranks 4-10 due to observed signs of data sparsity; in detail, most of the confidence estimates were within a small range and ended up in 1-2 bins. We also include a histogram with each reliability diagram to visualize the sizes of the bins.

5.1.1 Model-level Results

On a model level (Figure 4), all NLUs show a generally monotonic relationship between confidence and accuracy, except for Watson in lower ranges. In general, the calibration of the NLUs is better in larger bins than in smaller bins, except for Watson in the range [0.2-0.3] (shown in Figure 4b). In particular, Rasa-Sklearn (the purple curve) is the closest to the gold standard, and is thus the best calibrated NLU according to this analysis. Moreover, Snips underestimates the true likelihood of predictions, while LUIS overestimates predictions. In contrast, Rasa-DIET's calibration varies depending on the confidence estimate – more specifically, it is both over- and under-confident, for different parts of the confidence range.

Watson underestimates predictions when the confidence estimate is between 0.1 and 0.2, and overestimates otherwise. Furthermore, we observe a discrepancy in Watson's second and third bins in the reliability diagram (Figure 4): a sudden underestimation followed by a drop that indicates an extreme overestimation. According to the bin sizes shown in Figure 4b, Watson's confidences mostly lie within the range 0.2-0.3, whereas the other NLUs' confidence estimates are mostly within the range 0.0-0.1.

Figure 4: Model-level reliability diagram (a) and corresponding histogram of bin sizes (b). In the reliability diagram (a), the x-axis shows the mean confidence estimates in 10 bins, while the y-axis shows the mean accuracy of the confidence estimates in each bin (averaged across splits). The black diagonal line plots the identity function representing the gold standard of a perfectly-calibrated model.

5.1.2 Rank-level Results

On the first rank (Figure 5), the NLUs are fairly well-calibrated in general, although miscalibration is observed in the first two bins of rank 1 for Watson (confidence estimates between 0.0 and 0.2), which is also confirmed by the high standard deviation in these two bins (see Appendix 9.2). On ranks 2 (Figure 6) and 3 (Figure 7), the degree of calibration decreases (in comparison with the previous rank) for three of the NLUs (Watson, LUIS and Snips – all over-confident), while for the Rasa NLUs, the trend seems inverted. Moreover, on ranks 4-10 (Figure 8), the reliability diagrams are difficult to interpret due to the data sparsity shown in Histogram 8b.
Figure 5: Rank-level reliability diagram (a) and corresponding histogram of bin sizes (b) on rank 1. Note that Rasa-Sklearn and Rasa-DIET don't have any first-ranked hypotheses with a confidence estimate within the first two and three bins respectively. This is because Rasa returns normalized confidence estimates whereas the other NLUs don't.

5.2 Spearman's Correlation Coefficient

5.2.1 Model-level Results

The calculated Spearman's correlations between the confidence estimates and instance-level accuracy (Table 5) show that Rasa-Sklearn has the highest Spearman's correlation with a score of ∼0.510, followed by LUIS, Rasa-DIET, Watson and Snips, the latter with the lowest Spearman's correlation of ∼0.507. In addition, LUIS and Rasa-DIET are not significantly different, while every other pair of NLUs differs significantly with a large effect size, as detailed in the t-test results in Table 9 in Appendix 9.5.

Since the reliability diagrams show a high degree of monotonicity in the relation between confidence and accuracy, one may expect the Spearman's correlations to be close to 1.0. To this end, the actual range of Spearman's correlation coefficients (0.50-0.51) could be seen as an indication of a conflict between the two types of measurement. However, this apparent conflict can be explained by the distribution of the confidence estimates with respect to the instance-level accuracies. Figure 9 plots the confidence estimates of hypotheses returned by Rasa-Sklearn. It shows that the confidence estimates assigned to both correct and incorrect hypotheses are within a wide range (i.e., 0.0-1.0). As illustrated by the gold standard in Figure 3, a model can achieve a Spearman's correlation of 1.0 only when it always estimates a confidence of 1.0 for correct hypotheses and a confidence of 0.0 for incorrect hypotheses – in other words, when it has perfect accuracy. Hence, in cases with imperfect overall accuracy – e.g. in the presence of genuine ambiguity – a Spearman's correlation of around 0.5 is not necessarily an indication of poor calibration.

Figure 6: Rank-level reliability diagram (a) and corresponding histogram of bin sizes (b) on rank 2. Note that the ranges of non-empty bins for Rasa-Sklearn and Rasa-DIET become narrower in comparison with rank 1 (Figure 5). In particular, second-ranked hypotheses are not assigned a confidence estimate higher than 0.5.

5.2.2 Rank-level Results

As for the rank-level Spearman's correlation results in Appendix 9.4, a trend is observed across all ranks.
To illustrate the trend, we propose a new kind of diagram in Figure 10 (with SD in Appendix 9.3), which to our knowledge has not been used in the literature before: the NLUs' correlation coefficients drop after rank 1 at a decreasing rate, then level off and reach a minimum on the lowest rank. Furthermore, the results of each NLU (in Table 8), together with the post-hoc analysis (in Table 10), show that the NLU with the significantly highest Spearman's correlations is Rasa-Sklearn (on ranks 1-3) with a large effect size, and LUIS (on ranks 4-7) with a moderate/large effect size. On ranks 8, 9 and 10, the results are either not significant or significant with a small or negligible effect size. In summary, Rasa-Sklearn shows significantly the best calibration on ranks 1-3, while LUIS is significantly best-calibrated on ranks 4-7.

Figure 7: Rank-level reliability diagram (a) and corresponding histogram of bin sizes (b) on rank 3. Note that the ranges of non-empty bins for Rasa-Sklearn and Rasa-DIET become narrower in comparison with rank 2 (Figure 6). In particular, third-ranked hypotheses are not assigned a confidence estimate higher than 0.3.

The decrease of Spearman's correlation in lower ranks indicates that the monotonicity drops for all NLUs the lower we go in the intent ranking. In particular, for hypotheses lower than rank 4, the confidence estimates are neither informative nor significant. Third-ranked hypotheses are roughly half as monotonic (Spearman's correlation around 0.2) as top-ranked ones (Spearman's correlation around 0.4), with second-ranked ones in between (Spearman's correlation around 0.3). However, there are other factors that can cause such a pattern, such as the variation of the confidence estimates (further discussed in Section 6).

Figure 8: Rank-level reliability diagram (a) and corresponding histogram of bin sizes (b) on ranks 4-10.

5.3 Performance

In order to investigate the correlation between the calibration and performance of the NLUs, we measure the performance of the NLUs in intent classification by evaluating the accuracy and the F1-score. In this section, we present the accuracy and F1-score results averaged across the 10 splits for each NLU.

Accuracy: The results in Table 6 show that Watson surpasses all NLUs with ∼0.92 accuracy, followed by Rasa-DIET, Snips, LUIS and Rasa-Sklearn, the latter with the lowest accuracy score of ∼0.87.

F1-score: The results of the F1-scores in Table 7 show that Watson achieves the highest F1-score of ∼0.92, followed by Snips, LUIS, Rasa-DIET and Rasa-Sklearn, the latter with a score of ∼0.79.
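As a rough illustration of the post-hoc analysis described at the beginning of this section (a pairwise comparison of the 10 per-split scores of two NLUs, with Cohen's d for the effect size), a minimal sketch is given below. It assumes an independent two-sample t-test and a pooled-standard-deviation Cohen's d, which may differ in detail from the exact variant used in our analysis; the per-split scores are randomly generated placeholders, not real results.

# Sketch of a pairwise post-hoc comparison over per-split scores.
# Placeholder data; the exact test variant in the thesis may differ.
import numpy as np
from scipy.stats import ttest_ind

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled_sd

# Randomly generated placeholder per-split scores for two NLUs (not real results).
rng = np.random.default_rng(0)
nlu_a = rng.normal(loc=0.92, scale=0.004, size=10)
nlu_b = rng.normal(loc=0.87, scale=0.004, size=10)

t_stat, p_value = ttest_ind(nlu_a, nlu_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}, Cohen's d = {cohens_d(nlu_a, nlu_b):.2f}")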
Figure 9: Rasa-Sklearn confidence estimates: the distribution of confidence estimates with respect to instance-level accuracy for Rasa-Sklearn, which is the best calibrated model across the evaluated NLUs (Spearman's correlation of ∼0.51).

Also, according to the analysis in Table 11, the performance scores of LUIS and Snips are not significantly different, meaning that LUIS and Snips can possibly achieve similar results. All other pairwise differences in performance between the NLUs are significant with a large effect size.

NLU     | Watson  | LUIS     | Snips    | Rasa-Sklearn | Rasa-DIET
Mean    | 0.50838 | 0.50935  | 0.50669  | 0.51024      | 0.50906
Median  | 0.50851 | 0.50934  | 0.506491 | 0.51026      | 0.50888
p-value | <0.0001 | <0.0001  | <0.0001  | <0.0001      | <0.0001
SD      | 0.00075 | 0.00055  | 0.00064  | 0.00046      | 0.00074

Table 5: Model-level Spearman's correlation coefficient ρ: the mean, median, p-value and standard deviation (SD).

Figure 10: Rank-level Spearman's correlation for Watson, LUIS, Snips, Rasa-Sklearn and Rasa-DIET on ranks 1-10.

NLU    | Watson  | LUIS     | Snips   | Rasa-Sklearn | Rasa-DIET
Mean   | 0.92287 | 0.88726  | 0.88991 | 0.87263      | 0.90376
Median | 0.91997 | 0.890405 | 0.89060 | 0.87866      | 0.89973
SD     | 0.00225 | 0.00417  | 0.00414 | 0.00386      | 0.003860

Table 6: (Averaged) accuracy scores of the NLUs, in addition to the median and the standard deviation (SD).

NLU    | Watson  | LUIS    | Snips   | Rasa-Sklearn | Rasa-DIET
Mean   | 0.92144 | 0.88890 | 0.89029 | 0.79020      | 0.81890
Median | 0.91972 | 0.89300 | 0.89166 | 0.79561      | 0.81716
SD     | 0.00234 | 0.00373 | 0.00407 | 0.00358      | 0.00331

Table 7: (Averaged) F1-scores of the NLUs, in addition to the median and the standard deviation (SD).

Overall, the performance results are consistent with the results of the following studies that have 3 NLUs in common with ours (i.e., Watson, LUIS and Rasa-Sklearn): 1) Liu et al. (2021), who use the complete version of our dataset, and 2) Abdellatif et al. (2021), who use two datasets from the software engineering domain. However, our results differ from those in Braun et al. (2017), who use Telegram chatbot and StackExchange corpora in the QA domain; their results show that Watson is the worst performing NLU, and Rasa and LUIS are on top. In addition, Coucke et al. (2018) build on Braun et al. (2017)'s evaluation by adding Snips. Their results show that Snips outperforms Rasa-Sklearn, as well as Watson, using the chatbot and askUbuntu datasets.

6 Discussion

In this section, we focus on interpreting the results of the evaluation and analysis to better understand the NLUs' confidence estimation and to gain more insight into our calibration measures.

6.1 Validity of Calibration Measures

In this study, we conduct an evaluation on model and rank levels by applying i) reliability diagrams and ii) correlation of confidence estimates with respect to instance-level accuracy. In order to assess the validity and relevance of the chosen evaluation measurements, we look at the extent to which results from the different measurements resonate with each other. The model-level reliability diagram seems to resonate with the model-level Spearman's correlations. For instance, Rasa-Sklearn shows the best calibration in the reliability diagram as well as the strongest monotonicity across NLUs.
In addition, Watson shows an obvious miscalibration in the reliability diagrams and weak monotonicity. However, we note that the Spearman's correlation is lower on a rank level than on a model level. This can be explained by the fact that ranks extend across smaller ranges of confidence estimates, which increases the noise (see the model-level histogram 4b in comparison with the rank-level histograms 5b, 6b, 7b and 8b). Generally, when using Spearman's correlation between confidence estimates and instance-level accuracy, a higher Spearman's correlation coefficient may be caused by stronger monotonicity, but it can also be due to a larger variation in the confidence estimates, and hence less noise.

Additionally, a dissonance has been observed in rank-level calibration. While the Spearman's correlation results suggest a decrease in calibration with descending rank, the rank-level reliability diagrams show that Rasa-Sklearn and Rasa-DIET have better calibration in lower ranks. This dissonance can be caused by the same range factor that causes lower coefficients on a rank level than on a model level, that is, lower ranks have smaller confidence ranges and less variance. Therefore, comparisons of the monotonicity and differences in Spearman's correlation may be hard to interpret across ranks, especially when the range and variation of confidence estimates vary per rank.

6.2 Interpreting Calibration and Potential Applications

Can our model-level quantitative results be used to rate calibration in some absolute sense (e.g. coefficient X is good and Y is bad)? This seems difficult at this stage, since no previous research – as far as we know – has used Spearman's correlation to measure the correlation of confidence with respect to instance-level accuracy [9]. Nevertheless, the results enable comparisons of the degree of calibration between NLUs. On a model level, monotonicity is viewed as a characteristic of well-calibrated NLUs. The stronger the monotonicity, the more reliable the ranking of hypotheses in a prediction. This enables, inter alia, thresholding of hypotheses when processing and applying semantic clarification on the output of the NLU.

[9] Dong et al. (2018) use Spearman's correlation between confidence and instance-level F1 score rather than accuracy, and Vasudevan et al. (2019) use Pearson's correlation rather than Spearman's correlation between confidence and instance-level accuracy.

Differences in the degree of calibration across ranks have been observed for all NLUs. Specifically, several of the NLUs are better calibrated for higher-ranked hypotheses than for lower-ranked ones. For dialogue system developers, we may interpret this as indicating that it may be useful to look at the top two or three hypotheses when trying to detect ambiguity in input utterances. Looking at hypotheses ranked lower than 4 is likely not very informative. Fortunately, ambiguities are much more frequently 2-way (i.e. there are two possible interpretations of an input) or 3-way than 4-way or more. Knowledge about calibration is potentially useful for any downstream task that relies on confidence estimates associated with NLU hypotheses, such as the choice of grounding strategies, ambiguity detection, or re-scoring of hypotheses based on contextual information not available to the NLU but to the dialogue manager (such as dialogue state).

6.3 Performance vs. Calibration

When plotting the model-level performance against the correlation, no clear pattern can be seen (see Figure 11).
Our results suggest that the best calibrated NLU is Rasa-Sklearn and the poorest calibrated NLU is Snips, while Watson outperforms the other NLUs and Rasa-Sklearn shows the worst performance. A consequence of this is that when it comes to choosing an NLU for a dialogue system, there is likely to be a trade-off between performance (good for getting the right interpretation) and calibration (good for deciding on grounding strategies and detecting input that is ambiguous from the NLU perspective). In fact, such a trade-off between calibration and performance has been previously observed for neural networks (Guo et al., 2017). A potential choice between good performance and good calibration can depend on the domain and context of the dialogue system, where some domains may require better performance while other domains prioritize a reliable estimation of the hypotheses' likelihood.

Figure 11: (Model-level) Accuracy (a) and F1-score (b) vs. Spearman's correlation.

7 Ethics and Validity Threats

7.1 Construct Validity

Since there is no previous research – to the best of our knowledge – that has evaluated the calibration of NLUs, we faced a construct validity threat when planning the evaluation process and choosing the evaluation methods. To mitigate this threat, we extend established calibration measurement methods that have been successfully used in classification tasks, and adapt them to NLUs on model and rank levels. However, we observe some limitations of Spearman's correlation on a rank level, especially on lower ranks, where the correlation coefficient doesn't necessarily represent the calibration due to the presence of other factors like the range and variance of the confidence estimates (explained in Section 6.1). Therefore, further exploration is needed into how the calibration of NLUs can be measured.

7.2 Internal Validity

The selection of NLUs in this study introduced many parameters that could possibly change our results. The NLUs in general, and especially Rasa, can change their behaviour when their configurations change. Therefore, we use the default configurations for consistency across NLUs. We argue that our configuration choice is what an average dialogue developer would apply when using the evaluated NLUs. Furthermore, None/Irrelevant intents were removed from the output of the NLUs, decreasing internal validity since the absence of the "None" hypothesis may have an impact on the model-level confidence estimates. Normalizing the confidence estimates in the intent ranking (after removing the None intent) may seem like a mitigation. However, this caused issues due to Watson being a multi-binary classifier, and we lack knowledge of how confidence estimates are calculated for each NLU, making the scores not suitable for normalization.

7.3 External Validity

Evaluating machine-learning based NLUs involves splitting data into training and test sets. This introduces a risk that the obtained results depend on a specific split.
In order to estimate and partly mitigate this risk, we perform repeated random sub-sampling as a cross-validation method with 10 iterations (splits); we then average the results across iterations and provide the standard deviation. Our results show a low standard deviation across splits, which increases the generalizability of our findings. Moreover, we enable reproducibility and replicability of our evaluation by providing all the scripts, training and test sets, and requirements.

7.4 Data Fallacies

Throughout our study, we acknowledge the importance of being transparent, fair and unbiased to make the evaluation trustworthy, and to avoid inflicting damage on the evaluated parties. We attempt to minimize sampling bias by i) considering a multi-domain dataset rather than a context-specific dataset, and ii) selecting a representative subset that maximizes the number of domains, scenarios, intents and examples covered. In addition, we mitigate the danger of summary metrics by i) considering the raw results of all splits (Appendix 9.2, 9.3, 9.4) along with the standard deviations in our data analysis and discussion, and ii) analysing the results on a rank level, which has provided a more detailed view of the model-level evaluation. Finally, we do not claim or draw a final conclusion for RQ3 regarding the correlation between calibration and performance, due to the small sample size, which did not allow us to perform a statistical test. We only present a scatter plot which shows the absence of a clear trend or pattern, and thus does not indicate a correlation between NLU calibration and performance.

8 Conclusion and Future Work

We took a methodology for evaluating the calibration of neural networks (Vasudevan et al., 2019; Guo et al., 2017) and applied it to NLUs on intent classification tasks, in order to measure the calibration of 5 state-of-the-art NLUs and evaluate their performance. We also extended the methodology to look at hypotheses on all ranks in the intent ranking, on a rank level (results per rank) and a model level (results aggregated across ranks). Our findings show that on a model level, Rasa-Sklearn is the best calibrated NLU and Snips the poorest calibrated one. On a rank level, the calibration decreases in lower ranks for Watson, LUIS and Snips, and vice versa for Rasa. We also highlight a trade-off between calibration and performance, where Rasa-Sklearn – the best calibrated model – had the worst performance.

Future work can involve adapting our evaluation methods to detect ambiguity. For example, given an utterance that has two possible intents (hence ambiguous), a well-calibrated NLU should be able to assign similar confidence estimates to the two possible intents. We also encourage further improvement of our rank-level quantitative analysis by applying other measures of calibration, e.g., Brier score (square loss), log loss (Kull et al., 2017) or the improved ECE approach by Nixon et al. (2019). The results of the loss and/or error of the NLU confidence estimates with respect to the instance-level accuracy would widen the scope of the results and may be more relevant for assessing rank-level calibration of the NLUs.

References

Abdellatif, A., Badran, K., Costa, D., & Shihab, E. (2021). A comparison of natural language understanding platforms for chatbots in software engineering. IEEE Transactions on Software Engineering.
Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. arXiv preprint arXiv:2002.06470.
Bocklisch, T., Faulkner, J., Pawlowski, N., & Nichol, A. (2017). Rasa: Open source language understanding and dialogue management. arXiv preprint arXiv:1712.05181.
Braun, D., Mendez, A. H., Matthes, F., & Langen, M. (2017). Evaluating natural language understanding services for conversational question answering systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue (pp. 174–185).
Bunk, T., Varshneya, D., Vlasov, V., & Nichol, A. (2020). DIET: Lightweight language understanding for dialogue systems. arXiv preprint arXiv:2004.09936.
Canonico, M. & De Russis, L. (2018). A comparison and critique of natural language understanding tools. Cloud Computing, 2018, 120.
Coucke, A., Saade, A., Ball, A., Bluche, T., Caulier, A., Leroy, D., Doumouro, C., Gisselbrecht, T., Caltagirone, F., Lavril, T., et al. (2018). Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.
Dong, L., Quirk, C., & Lapata, M. (2018). Confidence modeling for neural semantic parsing. arXiv preprint arXiv:1805.04604.
Dubitzky, W., Granzow, M., & Berrar, D. P. (2007). Fundamentals of data mining in genomics and proteomics. Springer Science & Business Media.
Gregori, E. (2017). Evaluation of modern tools for an OMSCS advisor chatbot. SMARTech: smartech.gatech.edu.
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. In International Conference on Machine Learning (pp. 1321–1330). PMLR.
IBM (2010). IBM Watson. Online available at: https://www.ibm.com/watson. Accessed on: 2022-04-14.
Jung, S. (2019). Semantic vector learning for natural language understanding. Computer Speech & Language, 56, 130–145.
Kar, R. & Haldar, R. (2016). Applying chatbots to the Internet of Things: Opportunities and architectural elements. arXiv preprint arXiv:1611.03799.
Koetter, F., Blohm, M., Kochanowski, M., Goetzer, J., Graziotin, D., & Wagner, S. (2018). Motivations, classification and model trial of conversational agents for insurance companies. arXiv preprint arXiv:1812.07339.
Kuleshov, V., Fenner, N., & Ermon, S. (2018). Accurate uncertainties for deep learning using calibrated regression. In International Conference on Machine Learning (pp. 2796–2804). PMLR.
Kull, M., Silva Filho, T., & Flach, P. (2017). Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Artificial Intelligence and Statistics (pp. 623–631). PMLR.
Liu, X., Eshghi, A., Swietojanski, P., & Rieser, V. (2021). Benchmarking natural language understanding services for building conversational agents. In Increasing Naturalness and Flexibility in Spoken Dialogue Interaction (pp. 165–183). Springer.
McTear, M., Callejas, Z., & Griol, D. (2016). The conversational interface: Talking to smart devices. Springer International Publishing. DOI: https://doi.org/10.1007/978-3-319-32967-3.
Microsoft (2017). LUIS (Language Understanding) - Cognitive Services. Online available at: https://www.luis.ai/home. Accessed on: 2022-04-14.
Naeini, M. P., Cooper, G., & Hauskrecht, M. (2015). Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
Nixon, J., Dusenberry, M. W., Zhang, L., Jerfel, G., & Tran, D. (2019). Measuring calibration in deep learning. CVPR Workshops, 2(7).
Rasa (2016).
Rasa: Open source conversational AI. Online available at: https://rasa.com/. Accessed on: 2022-04-14.
Runeson, P. & Höst, M. (2009). Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering, 14(2), 131–164.
Shridhar, K., Dash, A., Sahu, A., Pihlgren, G. G., Alonso, P., Pondenkandath, V., Kovács, G., Simistira, F., & Liwicki, M. (2019). Subword semantic hashing for intent classification on small datasets. In 2019 International Joint Conference on Neural Networks (IJCNN) (pp. 1–6). IEEE.
Snips (2013). Snips.ai. Online available at: https://snips.ai/. Accessed on: 2022-04-14.
Stoyanchev, S., Lison, P., & Bangalore, S. (2016). Rapid prototyping of form-driven dialogue systems using an open-source framework. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue (pp. 216–219).
Tur, G. & De Mori, R. (2011). Spoken language understanding: Systems for extracting semantic information from speech. John Wiley & Sons.
Vasudevan, V. T., Sethy, A., & Ghias, A. R. (2019). Towards better confidence estimation for neural models. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7335–7339). IEEE.
Wang, Y.-Y., Deng, L., & Acero, A. (2005). Spoken language understanding. IEEE Signal Processing Magazine, 22(5), 16–31.
Xiao, C., Ye, J., Esteves, R. M., & Rong, C. (2016). Using Spearman's correlation coefficients for exploratory data analysis on big dataset. Concurrency and Computation: Practice and Experience, 28(14), 3866–3878.

9 Appendix

9.1 Rasa Pipelines

version: '2.0'

language: en

pipeline:
  - name: SpacyNLP
    model: "en_core_web_md"
    case_sensitive: False
  - name: SpacyTokenizer
    "intent_tokenization_flag": False
    "intent_split_symbol": "_"
    "token_pattern": None
  - name: SpacyFeaturizer
    "pooling": "mean"
  - name: "SklearnIntentClassifier"
    kernels: ["linear"]
    "gamma": [0.1]
    "max_cross_validation_folds": 1
    "scoring_function": "f1_weighted"

Listing 2: The configuration of the Rasa-Sklearn pipeline used in this study.

version: '2.0'

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 10

Listing 3: The configuration of the Rasa-DIET pipeline used in this study.

9.2 Reliability Diagrams with Standard Deviation
We include the reliability diagrams presented in Section 5 with the standard deviation of the mean accuracies across splits plotted. We provide the model-level reliability diagram in Figure 12 and the rank-level reliability diagrams in Figure 13.

9.3 Rank-level Ranks vs Spearman's Correlations Plot with Standard Deviation
This appendix includes the plot of the rank-level Spearman's correlation presented in Section 5 (Figure 10), with the standard deviation across splits, in Figure 14.

Figure 12: Model-level reliability diagram for Watson, LUIS, Snips, Rasa-Sklearn and Rasa-DIET with standard deviation (x-axis: mean confidence estimates; y-axis: average accuracy of bin).
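To make the construction of the reliability diagrams in Figures 12 and 13 explicit, the following is a minimal sketch of the underlying computation, assuming equal-width confidence bins: the confidence estimates are grouped into bins and, for each bin, the mean accuracy is plotted against the mean confidence. The function and parameter names (including the choice of 10 bins) are illustrative and do not necessarily match our evaluation scripts.

import numpy as np

def reliability_points(confidences, correct, n_bins=10):
    # 'correct' holds 1.0 where the hypothesis matched the gold intent, else 0.0.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    mean_conf, mean_acc = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if hi < 1.0:
            in_bin = (confidences >= lo) & (confidences < hi)
        else:
            in_bin = (confidences >= lo) & (confidences <= hi)  # include 1.0 in the last bin
        if in_bin.any():
            mean_conf.append(confidences[in_bin].mean())
            mean_acc.append(correct[in_bin].mean())
    return mean_conf, mean_acc  # plot mean_acc against mean_conf per NLU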
9.4 Results of Rank-level Spearman's Correlation
In Section 5, we plot the rank-level Spearman's correlation of all ranks in Figure 10. In this section, we provide the raw data of the averaged Spearman's correlation across splits for ranks 1-10 in Table 8, along with the standard deviation across splits.

NLU            Rank   Spearman's Correlation (ρ)   p-value   Standard Deviation
Watson         1      0.36478                      <0.0001   0.00588
LUIS           1      0.39553                      <0.0001   0.01253
Snips          1      0.3908                       <0.0001   0.00661
Rasa-Sklearn   1      0.41435                      <0.0001   0.01068
Rasa-DIET      1      0.40187                      <0.0001   0.00785
Watson         2      0.27888                      <0.0001   0.01051
LUIS           2      0.29684                      <0.0001   0.01383
Snips          2      0.25592                      <0.0001   0.00891
Rasa-Sklearn   2      0.34423                      <0.0001   0.01026
Rasa-DIET      2      0.31463                      <0.0001   0.00987
Watson         3      0.15082                      <0.0001   0.01208
LUIS           3      0.18616                      <0.0001   0.00713
Snips          3      0.15376                      <0.0001   0.01087
Rasa-Sklearn   3      0.19943                      <0.0001   0.00566
Rasa-DIET      3      0.18482                      <0.0001   0.00728
Watson         4      0.08652                      <0.0001   0.00791
LUIS           4      0.12456                      <0.0001   0.00807
Snips          4      0.10944                      <0.0001   0.01031
Rasa-Sklearn   4      0.10235                      <0.0001   0.01723
Rasa-DIET      4      0.11088                      <0.0001   0.01128
Watson         5      0.05189                      <0.0001   0.01222
LUIS           5      0.08779                      <0.0001   0.01159
Snips          5      0.07635                      <0.0001   0.0137
Rasa-Sklearn   5      0.06645                      <0.0001   0.01872
Rasa-DIET      5      0.08253                      <0.0001   0.01207
Watson         6      0.05453                      <0.0001   0.01664
LUIS           6      0.07276                      <0.0001   0.0116
Snips          6      0.04989                      <0.0001   0.00622
Rasa-Sklearn   6      0.04359                      <0.0001   0.01043
Rasa-DIET      6      0.06002                      <0.0001   0.01147
Watson         7      0.05031                      <0.0001   0.00933
LUIS           7      0.05709                      <0.0001   0.00833
Snips          7      0.03705                      <0.0001   0.01007
Rasa-Sklearn   7      0.04221                      <0.0001   0.00658
Rasa-DIET      7      0.04454                      <0.0001   0.01123
Watson         8      0.04464                      <0.0001   0.01699
LUIS           8      0.03457                      <0.0001   0.0088
Snips          8      0.02296                      <0.0001   0.00871
Rasa-Sklearn   8      0.0272                       <0.0001   0.01406
Rasa-DIET      8      0.03761                      <0.0001   0.00858
Watson         9      0.0271                       <0.0001   0.01933
LUIS           9      0.02164                      <0.0001   0.00826
Snips          9      0.0277                       <0.0001   0.00801
Rasa-Sklearn   9      0.02503                      <0.0001   0.00858
Rasa-DIET      9      0.03419                      <0.0001   0.0112
Watson         10     0.02725                      <0.0001   0.01198
LUIS           10     0.01272                      <0.0001   0.02247
Snips          10     0.01393                      <0.0001   0.0141
Rasa-Sklearn   10     0.01372                      <0.0001   0
Rasa-DIET      10     0.02131                      <0.0001   0.0218

Table 8: Spearman's Correlation Coefficient: averaged rank-level NLU correlation for 10 splits, with the p-value and the standard deviation of the splits' correlation coefficients.
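To clarify how values of this kind can be obtained, the following is a minimal, illustrative sketch (not our actual evaluation script): for each split, Spearman's ρ is computed between the confidence estimates at a given rank and a binary indicator of whether the corresponding hypothesis was correct, and the coefficients are then averaged across the 10 splits. All function and variable names are assumptions introduced for illustration.

import numpy as np
from scipy import stats

def averaged_rank_level_spearman(per_split_pairs):
    # per_split_pairs: one (confidences, correct) pair per split, where 'confidences'
    # holds the confidence estimates of the hypotheses at one rank and 'correct'
    # holds 1/0 depending on whether each hypothesis matched the gold intent.
    rhos = []
    for confidences, correct in per_split_pairs:
        rho, _p_value = stats.spearmanr(confidences, correct)
        rhos.append(rho)
    # Report the mean of rho over the splits and its standard deviation.
    return float(np.mean(rhos)), float(np.std(rhos, ddof=1))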
9.5 Post-hoc Analysis: t-test Calculations
In this appendix, we present our statistical test results using the t-test and Cohen's d on the Spearman's correlation results (model level in Table 9 and rank level in Table 10), and on the performance results in Table 11.

Figure 13: Rank-level reliability diagrams for ranks 1 (top), 2, 3 and 4-10 (bottom) with standard deviation (x-axis: mean confidence estimates; y-axis: average accuracy of bin).

Figure 14: Spearman's Correlation for ranks 1-10 (with standard deviation) (x-axis: ranks; y-axis: average Spearman's correlation).

Pairwise Comp.              t-Statistic   p-value     df   Effect Size   SSD (p<.05)
(Watson, LUIS)              -3.1645       0.01147     9    L             Yes
(Watson, Snips)             4.9025        0.00084     9    L             Yes
(Watson, Rasa-Sklearn)      -5.4977       0.0003813   9    L             Yes
(Watson, Rasa-DIET)         -2.9555       0.01608     9    L             Yes
(LUIS, Snips)               25.569        <0.00001    9    L             Yes
(LUIS, Rasa-Sklearn)        -3.8306       0.00402     9    L             Yes
(LUIS, Rasa-DIET)           -78.645       0.2895      9    S             No
(Snips, Rasa-Sklearn)       -16.545       <0.00001    9    L             Yes
(Snips, Rasa-DIET)          -7.8118       <0.00001    9    L             Yes
(Rasa-DIET, Rasa-Sklearn)   -4.1319       0.002552    9    L             Yes

Table 9: t-test on pairwise NLUs' Spearman's correlation scores on a model level to determine whether there is a statistically significant difference (SSD), and Cohen's d to measure the effect size (L = Large, M = Moderate, S = Small, N = Negligible).

Pairwise Comp.              t-Statistic   p-value     df   Effect Size   SSD (p<.05)
Rank 1
(Watson, LUIS)              -7.6715       <0.00001    9    L             Yes
(Watson, Snips)             -9.7613       <0.00001    9    L             Yes
(Watson, Rasa-Sklearn)      -11.441       <0.00001    9    L             Yes
(Watson, Rasa-DIET)         -10.782       <0.00001    9    L             Yes
(LUIS, Snips)               1.2402        0.2463      9    S             No
(LUIS, Rasa-Sklearn)        -4.45         0.0016      9    L             Yes
(LUIS, Rasa-DIET)           -1.8668       0.09477     9    M             No
(Snips, Rasa-Sklearn)       -5.7598       0.0002729   9    L             Yes
(Snips, Rasa-DIET)          -3.0576       0.01362     9    L             Yes
(Rasa-DIET, Rasa-Sklearn)   -6.754        <0.00001    9    L             Yes
Rank 2
(Watson, LUIS)              -3.2206       0.01048     9    L             Yes
(Watson, Snips)             6.4881        0.000113    9    L             Yes
(Watson, Rasa-Sklearn)      -17.398       <0.00001    9    L             Yes
(Watson, Rasa-DIET)         -8.6273       <0.00001    9    L             Yes
(LUIS, Snips)               9.9936        <0.00001    9    L             Yes
(LUIS, Rasa-Sklearn)        -9.7455       <0.00001    9    L             Yes
(LUIS, Rasa-DIET)           -3.7508       <0.00001    9    L             Yes
(Snips, Rasa-Sklearn)       -17.882       <0.00001    9    L             Yes
(Snips, Rasa-DIET)          -12.898       <0.00001    9    L             Yes
(Rasa-DIET, Rasa-Sklearn)   -11.323       <0.00001    9    L             Yes
Rank 3
(Watson, LUIS)              -6.7607       <0.00001    9    L             Yes
(Watson, Snips)             -0.6851       0.5105      9    S             No
(Watson, Rasa-Sklearn)      -13.616       <0.00001    9    L             Yes
(Watson, Rasa-DIET)         -6.2648       0.000147    9    L             Yes
(LUIS, Snips)               7.0407        <0.00001    9    L             Yes
(LUIS, Rasa-Sklearn)        -6.3356       0.0001352   9    L             Yes
(LUIS, Rasa-DIET)           0.46202       0.655       9    N             No
(Snips, Rasa-Sklearn)       -11.323       -7.0872     9    L             Yes
(Snips, Rasa-DIET)          -7.0872       <0.00001    9    L             Yes
(Rasa-DIET, Rasa-Sklearn)   -4.6652       0.001177    9    L             Yes
Rank 4-10
(Watson, LUIS)              -5.9362       <0.00001    49   L             Yes
(Watson, Snips)             -0.72951      0.4692      49   N             No
(Watson, Rasa-Sklearn)      0.078179      0.938       49   N             No
(Watson, Rasa-DIET)         -3.3111       0.00175     49   S             Yes
(LUIS, Snips)               9.1052        <0.00001    49   L             Yes
(LUIS, Rasa-Sklearn)        8.087         <0.00001    49   L             Yes
(LUIS, Rasa-DIET)           3.9641        0.0002393   49   M             Yes
(Snips, Rasa-Sklearn)       1.2524        0.2164      49   N             No
(Snips, Rasa-DIET)          -4.1725       0.0001228   49   M             Yes
(Rasa-DIET, Rasa-Sklearn)   5.2551        <0.00001    49   M             Yes

Table 10: t-test on pairwise NLUs' Spearman's correlation scores on a rank level to determine whether there is a statistically significant difference (SSD), and Cohen's d to measure the effect size (L = Large, M = Moderate, S = Small, N = Negligible).
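For transparency, the following is a rough sketch of how pairwise comparisons of this kind can be computed, assuming a paired t-test over the per-split scores (consistent with the df = 9 reported for 10 splits) and Cohen's d on the paired differences with the conventional 0.2/0.5/0.8 thresholds. All names are illustrative, and the exact procedure behind Tables 9, 10 and 11 may differ in detail.

import numpy as np
from scipy import stats

def compare_nlus(scores_a, scores_b, alpha=0.05):
    # scores_a, scores_b: per-split scores of two NLUs (e.g. Spearman's rho or accuracy).
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    t_stat, p_value = stats.ttest_rel(a, b)        # paired t-test, df = len(a) - 1
    diff = a - b
    d = diff.mean() / diff.std(ddof=1)             # Cohen's d on the paired differences
    size = "N" if abs(d) < 0.2 else "S" if abs(d) < 0.5 else "M" if abs(d) < 0.8 else "L"
    return {"t": float(t_stat), "p": float(p_value), "effect_size": size, "ssd": bool(p_value < alpha)}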
Pairwise Comp.              t-Statistic   p-value     df   Effect Size   SSD (p<.05)
Accuracy
(Watson, LUIS)              18.462        <0.00001    9    L             Yes
(Watson, Snips)             29.325        <0.00001    9    L             Yes
(Watson, Rasa-Sklearn)      25.059        <0.00001    9    L             Yes
(Watson, Rasa-DIET)         12.82         <0.00001    9    L             Yes
(LUIS, Snips)               -0.62904      0.545       9    N             No
(LUIS, Rasa-Sklearn)        11.672        <0.00001    9    L             Yes
(LUIS, Rasa-DIET)           -7.2468       <0.00001    9    L             Yes
(Snips, Rasa-Sklearn)       13.889        <0.00001    9    L             Yes
(Snips, Rasa-DIET)          -7.7684       <0.00001    9    L             Yes
(Rasa-DIET, Rasa-Sklearn)   18.968        <0.00001    9    L             Yes
F1-score
(Watson, LUIS)              15.437        <0.00001    9    L             Yes
(Watson, Snips)             25.432        <0.00001    9    L             Yes
(Watson, Rasa-Sklearn)      79.213        <0.00001    9    L             Yes
(Watson, Rasa-DIET)         73.47         <0.00001    9    L             Yes
(LUIS, Snips)               1.1095        0.296       9    S             No
(LUIS, Rasa-Sklearn)        95.383        <0.00001    9    L             Yes
(LUIS, Rasa-DIET)           49.549        <0.00001    9    L             Yes
(Snips, Rasa-Sklearn)       135.47        <0.00001    9    L             Yes
(Snips, Rasa-DIET)          88.435        <0.00001    9    L             Yes
(Rasa-DIET, Rasa-Sklearn)   18.098        <0.00001    9    L             Yes

Table 11: t-test on pairwise NLUs' performance to determine whether there is a statistically significant difference (SSD), and Cohen's d to measure the effect size (L = Large, M = Moderate, S = Small, N = Negligible).