Modelling rare events using non-parametric machine learning classifiers —Under what circumstances are support vector machines preferable to conventional parametric classifiers? by Lukas Ma∗ Bachelor’s Thesis in Statistics (15 credits ECTS) Department of Economics, School of Business, Economics and Law, University of Gothenburg February 2021 Supervisor: Mattias Sunde´n ∗Email: lukas.ma96@yahoo.com Acknowledgements I want to express my sincere thanks to my supervisor, Mattias Sunde´n, who guided me through the entire writing phase with continued support and patience. He has provided me with invaluable assistance in constructing the simulation study and offered insightful advice on my work. My gratitude is also extended to my fellow students, Tim Emanuelsson, Sofia Hartelius, Omid Raisi, and Yohanes Wolde-Senbet, for their constructive feedback on my thesis. i Abstract Rare event modelling is an important topic in quantitative social science research. However, despite the fact that traditional classifiers based upon general linear models (GLM) might lead to biased results, little attention in the social science community is devoted to methodological studies aimed at alleviating such bias, even fewer of them have considered the use of machine learning methods to tackle analytical problems imposed by rare events. In this thesis, I compared the classification performance of the SVMs – a group of machine learning classification algorithms – with that of the GLMs under the presence of imbalanced classes and rare events. The results of this study shows that the standard SVMs have no better classification performance than the traditional GLMs. In addition, the standard SVMs also tend to have low sensitivity, rendering it inappropriate for rare event modelling. Although the cost-sensitive SVMs could lead to more rare events be identified, these methods tend to suffer from overfitting as the events become rarer. Finally, the results of the empirical analysis using the Military Interstate Dispute (MID) data imply that the probabilistic outputs produced by Platt scaling are not reliable. For the above reasons, a wider application of SVMs in rare event modelling is not supported by the results of this study. ii Abbreviations AUC Area under ROC curve BA Balanced accuracy cdf Cumulative distribution function DEC Different error costs KNN k -nearest neighbours LR Logistic regression MID Military interstate dispute ML Maximum likelihood MLE Maximum likelihood estimators MSE Mean square errors pdf Probability density function pmf Probability mass function ROC Receiver operating characteristic RVM Relevance vector machines SMOTE Synthetic minority oversampling technique SVM Support vector machines iii Contents 1 Introduction 1 1.1 Aims and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Scope and constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Outline of chapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Theory 5 2.1 Modelling binary response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Parametric and non-parametric methods . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.2 The linear function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.3 Binary response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Logistic Regression (LR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1 Model formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.2 Inference and estimation of the model parameters . . . . . . . . . . . . . . . . . . 9 2.2.3 LR from the perspective of Generalised Linear Models (GLM) . . . . . . . . . . . 11 2.2.4 Classification using LR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3 Support Vector Machine (SVM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.1 Classification using a separating hyperplane . . . . . . . . . . . . . . . . . . . . . . 15 2.3.2 Soft margin and support vector classifier . . . . . . . . . . . . . . . . . . . . . . . . 19 2.3.3 Extension to non-linear cases using kernels . . . . . . . . . . . . . . . . . . . . . . 20 2.3.4 Obtaining calibrated probabilistic outputs for SVM . . . . . . . . . . . . . . . . . . 22 3 Rare events bias 27 3.1 Problem description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2 Conventional solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.2.1 ReLogit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.2.2 Skewed link function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.2.3 Cost-sensitive SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.3 Sampling methods for alleviating rare events bias . . . . . . . . . . . . . . . . . . . . . . . 33 4 Methodology 35 4.1 Statistical models and methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.2 Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.2.1 Experiment setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.3 Empirical analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.3.2 Initial preparation before analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 iv 4.4 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.4.1 Conventional evaluation metrics for classification performance . . . . . . . . . . . . 39 4.4.2 Evaluation metrics for imbalanced classification . . . . . . . . . . . . . . . . . . . . 41 5 Results from the simulation study 43 5.1 Scenario I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.2 Scenario II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.3 Scenario III . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 6 Results from the empirical analysis 50 6.1 Classification performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 6.2 Probabilistic estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 6.2.1 Reliability diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 6.2.2 ROC curve and AUC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.2.3 Precision-Recall curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 7 Discussions 61 7.1 Empirical implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 7.1.1 Modelling rare events using SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 7.1.2 The reliability issue of probabilities produced by Platt scaling . . . . . . . . . . . . 61 7.2 Methodological limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 8 Conclusions 64 References 65 Appendix A Additional tables 68 Appendix B Additional figures 70 v Chapter 1 Introduction There are many types of events in our world that, despite being very rare in terms of occurrence, could induce a substantial impact on human society once they actually take place. Examples of such events can be wars, pandemics, financial crises, economic depressions, etc. Statisticians often use the term rare events1 to describe such highly improbable outcomes, which usually appear in data as a binary outcome variable characterised by an overwhelming number of zeros representing the non-events, and a tiny fraction of ones representing the events [King & Zeng, 2001a]. In various social science disciplines, rare events are considered as a topic of great importance because of its potential impact on our society, and hence are extensively studied. For instance, researchers in business administration and innovation studies might be interested in building a model to predict technical breakthroughs ex ante among millions of registered patents [Hain & Jurowetzki, 2020], while scholars in international conflict studies might want to construct a predictive model for military conflicts [King & Zeng, 2001a]. However, the quantitative modelling of rare events is a challenging task. As noted by [King & Zeng, 2001a, 2001b], the use of logistic regression (LR), a method commonly used by quantitative social science researchers, might systematically underestimate the probability of rare event occurrence, yielding bias results, also known as rare event bias (more on this issue in Section 3.1 below). Although the correction proposed by King & Zeng [2001a, 2001b] could alleviate the rare event bias to some extent, the need for alternative statistical procedures specifically targeted at rare event modelling is still paramount. Following from the technological development in computer science, non-parametric statistical pro- cedures based upon machine learning algorithms emerged and now receive increasing attention among researchers outside the computer science community. Comparing with the traditional classification ap- proaches, machine learning classifiers are more capable of handling complex data and demonstrate better predictive accuracy in general [Hain & Jurowetzki, 2020]. This argues strongly in favour of a wider ap- plication of machine learning methods in social science research, since quantitative research in this field often involve the modelling of complex real-world relationships [Hain & Jurowetzki, 2020]. Additionally, the improvement in predictive accuracy provided by machine learning methods also makes these meth- ods attractive candidates for the rare event modelling. However, as noted by James et al. [2013] and Hain & Jurowetzki [2020], results obtained from non-parametric machine learning methods are often hard to interpret, some of these methods, such as support vector machines (SVM), do not even produce probabilistic outputs. This highlights the need for more research concerning the application of machine learning methods in quantitative social science studies. My thesis is intended to contribute to this kind of research. 1Or black swan events, a term used in Taleb’s [2007] bestseller The Black Swan: The Impact of the Highly Improbable. 1 1.1 Aims and objectives The aim of this study is to explore the possibility of a wider application of non-parametric machine learning classification algorithms in quantitative social science research, especially research in those fields where rare events are common, such as economics, political science, and conflict studies. This can be done by conducting a comparative study between one such machine learning classifier and classifiers that are commonly used in quantitative social science research. In this thesis, I chose to compare different variants of SVM with methods from the general linear model (GLM) family such as logistic regression (or GLM with logit link). This comparative study was performed in two parts: a simulation study and an empirical analysis. The objective of the simulation study is to evaluate the classification performance of different methods, trying to determine in which type of data the use of a particular classification method is preferred, while the empirical analysis has the objective to extend the results from the simulation study to the context of social science, paving the way for a discussion about the applicability of different methods in social science research concerning rare events. Therefore, this thesis seeks to answer the following research question: 1. Under the presence of rare events, do SVMs generally outperform the conventional parametric classifiers in terms of classification performance? In addition, in quantitative social science research, we are also interested in constructing a probability model for rare event prediction [King & Zeng, 2001b]. This means that we not only should devote more attention to out-of-sample accuracy than to in-sample accuracy [James et al., 2013], but also focus on whether the classifier is able to produce reliable probability estimates, which can then be used in rare event prediction. Since the outputs from SVM is are not probabilistic and must be converted to probabilities using calibration techniques such as Platt scaling (more on this issue in Section 2.3.4 below), it is also the objective of this thesis to investigate whether the use of such probability output can be motivated. Therefore, this thesis also seeks to address the following question: 2. Can we motivate the use of the probabilistic outputs produced by Platt scaling in rare event mod- elling? 1.2 Related work To date, there already exist a considerable amount of studies in different levels dedicated to the modelling of class imbalance and rare events. According to Ferna´ndez et al. [2018], it is well documented in statistical literature that the presence of class imbalance and rare events would hamper the performance of nearly all classifiers in their standard form, including those that based upon machine learning. This is because all standard classifiers are designed to maximise the overall accuracy. In such setting, the classification accuracy of the majority class is prioritised at the cost of the accuracy of the minority class. Special techniques – such as modifications in the standard classifiers – are required for rare event modelling [Ferna´ndez et al., 2018]. For classifiers based upon the GLM, such as logistic and probit regression, the common modification approach is to either adjust the regression output ex post or to apply another link function. An example for the former case is the ReLogit proposed by King & Zeng [2001a, 2001b]. Their strategy for alleviating rare event bias was to first estimate the bias before running logistic regression, and then subtract this estimated bias from the intercept of the regression output. As for the link function approach, Van der Paal 2 [2014] had compared the ability of different link functions in modelling rare events. Using four real-world data sets from the UCI Machine Learning Repository, Van der Paal [2014] provided empirical evidence that GLMs with skew link function tend to perform better than GLMs with symmetric link functions. Additionally, he also compared the classification performance of the GLMs with two machine learning classifiers – Random forest and SVM – and found that both machine learning classifiers outperformed the GLMs. In quantitative social science research, comparisons between the GLMs and machine learning methods of the kind mentioned above are, to my knowledge, very limited – not to mention that existing studies on this topic focus predominantly on the use of neural networks. For instance, Zeng [1999] compared the classification performance of two GLM models (logit and probit) with that of a ten-hidden-unit neural network, using both synthetic data generated from a non-linear model with various noise levels and empirical data from previous research in international relations. His conclusion is that the neural network model outperforms the GLMs. Beck et al. [2000] did a similar comparison using the Militarised Interstate Dispute (MID) data, which has a structure typical for a rare event data set. Using the militarised conflicts in 1947–85 as the training set and conflicts in 1986–89 as the test set, Beck et al. [2000] concluded not only that the neural network model outperforms the conventional logistic regression model, but also achieved 16.7 percent accuracy in predicting militarised conflicts and 99.42 percent accuracy in non-conflicts in the test set. Finally, in a recent working paper, [Hain & Jurowetzki, 2020] explored the use of autoencoder, a type of neural network, in detecting breakthrough patents. Among the 2,722 breakthrough patents in their test set (corresponding to a rarity equal to 0.6 percent), the trained autoencoder correctly classified 1,402 of them. However, this comes with the cost of large number of false positives, yielding a precision score equal to 0.0328. 1.3 Scope and constraints The problem of rare event bias can be addressed in many ways. This thesis is limited to the internal methods, which is a set of methods created from modifying the the formulation or algorithm of certain statistical procedures so that they become more suitable for rare event data. The sampling methods, i.e., techniques that alleviate rare event bias by altering the data structure, are not included in this study. However, some of the sampling methods were briefly described in Section 3.3 for informational purposes. Additionally, this study focuses only on a few number of classifiers from the SVM and the GLM family, respectively. Other popular classifiers, such as Artificial neural networks, elastic nets, and random forest, are not included in this study. Moreover, other modifications in the SVM and the GLM family than those listed in Table 4.1 are not included in this study. Finally, in evaluating the output from different statistical procedures, I focused only on their classifi- cation performance and probabilistic output. Other subjects for evaluation, such as sparsity and model selection, are outside the scope of this study. 1.4 Outline of chapters This thesis is organised as followed: Chapter 2 provides the mathematical details behind the statistical methods evaluated in this study; Chapter 3 describes the rare event bias using the mathematical details behind LR introduced in the previous chapter, with focus on why such bias is so problematic and what remedies are available to alleviate it; Chapter 4 presents the evaluation metrics used in this thesis and describes how the simulation study and the empirical analysis was constructed and performed; Chapter 3 5 and 6 report the results of simulation study and empirical analysis, respectively; Chapter 7 is devoted to the discussion of the results in previous chapters and finally, Chapter 8 concludes. 4 Chapter 2 Theory 2.1 Modelling binary response 2.1.1 Parametric and non-parametric methods In the field of statistics, the term supervised learning refers to a set of methods that are used to model a certain outcome, or output, based on the values of a number of features, or inputs [James et al., 2013]. More formally, let X = (X1, X2, . . . , Xh) be a vector consisting h independent input variables – also known as predictors – and Y be the response variable. Both X and Y are real-valued random variables, whose values are determined by a certain probability distribution. Since our goal is to learn about Y given the information provided by X, we try to find a function, f(·), that transform the values of X into Y . That is, we assume the following relationship between our predictors and response Y = f (X) +  (2.1) where  is called the error term, i.e. the error between Y , the true response, and f(X), the mapping based on the information provided by the predictors [Friedman et al., 2009; James et al., 2013]. The reason why the error term exists is that we do not expect to gain full knowledge about the response through the values of a set of predictor selected a priori. The mapping f (X) is not the same as Y , it only represents the systematic information provided by X on Y . In other words, f(X) is merely a prediction of Y . In addition, it can be shown that the best prediction of Y at any point X = x is the conditional mean of the response Y given X (see Friedman et al. [2009, p.18] for more details), that is f(x) = E [Y |X = x] (2.2) which also means that f(X) = E [Y |X] . (2.3) In reality, the mapping f(·) is unknown to us. The essence of supervised learning is therefore to estimate this function and then use the estimation fˆ (·) to either make prediction of the future outcome when encountering a new set of values of X or draw inference about the relationship between the response and the predictors, as described above, or both [James et al., 2013]. According to James et al. [2013], there are two sets of methods for estimating f (·) – parametric models and non-parametric models. When estimating f(·) using the parametric approach, we first assume that f(·) has certain functional form and determine a fixed number of parameters to this function. After that, we estimate fˆ(·) by fitting or training a model to our observed data. These observed data points we use 5 to estimate fˆ(·) are also called training set. The non-parametric approach, on the other hand, does not make such an assumption on f(·) as in the parametric case. For this reason, the non-parametric models are more flexible and hence more accurate than the parametric ones. However, the superiority of the non-parametric methods in predictive accuracy does not come without a price: the more flexible one method is, the harder the interpretation of its resulted output will be. Therefore, we have a trade-off between prediction accuracy and model interpretability [James et al., 2013]. As illustrated in figure 2.1, parametric approaches, such as LR, are easier to interpret compared to the non-parametric approaches, such as SVM. Meanwhile, the relatively high flexibility in methods from the non-parametric family might result in higher prediction accuracy in comparison to those from the parametric family. The advantages and disadvantages of LR and SVM, the two groups of methods that are the focus of this study, are discussed further in the methods’ respective sections, i.e., Sections 2.2 and 2.3. Figure 2.1: The trade-off between model flexibility and model interpretability. This figure is a modified version based on two similar figures included in James et al. [2013] and Hain & Jurowetzki [2020], respectively. 2.1.2 The linear function Regarding the parametric methods, one of the most common approaches is to model the response Y as a linear function of the inputs X = (X1, X2, . . . , Xh). The linear function is defined as followed2 f (X) = 〈W,X〉+ b (2.4) where the term b denotes the intercept of the linear model [Friedman et al., 2009]. In some occasions, especially in machine learning literature, the quantity b is termed as the bias since it can be considered as the residual error between the linear model’s prediction and the true response [Friedman et al., 2009; Murphy, 2012]. Meanwhile, the term 〈W,X〉 is the inner products between the weight vector3 2The notation expressing the linear function varies among different literature. In Murphy [2012], for instance, the following notation is used y (x) = wTx +  where the T above w indicates the transpose. In this thesis, I choose to use the version with angle brackets 〈W,X〉 for the purpose of highlighting the element of inner product in this equation. 3In some literature, such as Friedman et al. [2009] and James et al. [2013], the weight vector is often express as the vectors of model parameters and denoted as β (sometimes even θ). In these works, the linear function is denoted as Y = XTβ 6 W = (w1, w2, . . . , wh) and the inputs X. Using the definition of an inner product4, we can rewrite Eq.(2.4) as f (X) = h∑ j=1 wjXj + b. (2.5) As we will see in Sections 2.2 and 2.3, both the LR and SVM can be expressed in the same manner as Eq.(2.5) above or closely related to it. 2.1.3 Binary response By binary response we mean a discrete random variable Y that only has two outcomes, for instance Y ∈ {0, 1}. In classification setting, this means that we are dealing with a qualitative response such that the number of classes is equal to 2 (i.e., K = 2). The random variable Y is said to have a Bernoulli distribution with success probability p, which can be denoted as Y ∼ Bernoulli(p)5 where p = Pr(Y = 1) [Murphy, 2012]. When we model the binary response, often with the aim of predicting the probability that a certain event occurs (i.e., when Y = 1), we want to find a probability function p(X) that returns the conditional probability of the random variable Y given the values of X, that is [James et al., 2013; Murphy, 2012] p(X) = Pr (Y = 1|X) = E [Y |X] . (2.6) More formally, assume that we have a sample consisting n independent observations. Let yi ∈ {0, 1} denote the value of ith observation for the response variable Yi and xij the value of ith observation for predictor Xj , where i = 1, . . . , n, and j = 1, . . . , h. We have (Yi = yi|Xj = xij) ∼ Bernoulli(pi) as followed from the fact that Y is binary [Friedman et al., 2009; Wasserman, 2004]. The probability model that predicts the the binary response Y at every points X = x will be p (x; θ) = Pr (Y = yi|X = x; θ) (2.7) where θ = (θ0, θ1, . . . , θh) is the vector of parameters for the probability model. In addition, as followed from Eq.(2.6) and Eq.(2.7), we have p (X; θ) = Pr (Y = 1|X; θ) (2.8) 1− p (X; θ) = Pr (Y = 0|X; θ) . (2.9) Note that the mapping p(·) corresponds to f(·) above, only that the former is a function such that p : X → [0, 1] given a set of model parameters θ. In the classification setting, the model p (X; θ) is a probabilistic classifier. We can obtain p (X; θ) either by first constructing a joint probability model of the 4Let a and b be two vectors, the inner product of these two vectors, 〈a, b〉 are defined as 〈a, b〉 = n∑ i=1 aibi 5In some literature, the response variable is indexed as Y = (Y1, . . . , Yn) and represents a data set composed of n independent Bernoulli trials. In such case, Y is said to have a binomial distribution Y ∼ Bin(n, p). 7 form p(Y,X) and then deriving the conditional probability from it (generative approach), or by fitting a model of the form p (X; θ) to our data directly (discriminative approach) [Murphy, 2012]. Regarding the subsequent classification procedures, we used to impose a decision rule by selecting some threshold τ that assign the inputs X to one class (e.g. Y = 1) when p(X; θ) = Pr (Y = 1|X; θ) > τ and to the other class when p(X; θ) ≤ τ . Such procedure is called latent variable threshold model [Agresti, 2015]. More detail about the threshold model in the case of logistic regression is provided in Section 2.2.4. 2.2 Logistic Regression (LR) 2.2.1 Model formulation The probability model p(X; θ) can have many candidates, one of them is the so-called logistic or sigmoid function shown in Figure (2.2). The sigmoid function is defined as sigm (z) , 1 1 + exp(−z) = ez 1 + ez (2.10) where z is some arbitrary variable. By replacing z with the combination of inputs X and the model parameters θ, we obtain the model formulation for the logistic regression (LR) p(X; θ) = sigm(X; θ) = exp ( θTX ) 1 + exp (θTX) (2.11) where the quantity θTX is assumed to be linear, that is [Agresti, 2015; Friedman et al., 2009] θTX = h∑ j=0 θjXj . (2.12) In addition, by replacing the intercept θ0 with b and the rest of model parameters (θ1, . . . , θh) with W = (w1, . . . , wh), we can rewrite the model formulation of LR in the same manner as in Eq.(2.5), that is, the inner product of the weights and the predictors p(X) = sigm(X) = exp (∑h j=1wjXj + b ) 1 + exp (∑h j=1wjXj + b ) . (2.13) Alternatively, we can formulate the LR model in terms of odds, i.e. the quantity p(X)/[1 − p(X)]. Note that given the definition of the sigmoids function in Eq.(2.10), we can rewrite Eq.(2.13) as p(X) = 1 1 + exp ( − (∑h j=1wjXj + b )) . 8 Then we have p(X) + exp  −   h∑ j=1 wjXj + b    p(X) = 1 p(X) exp (∑h j=1wjXj + b ) = 1− p(X) p(X) 1− p(X) = exp   h∑ j=1 wjXj + b  . Taking the log of the last equation above, we obtain the logit function, also known as the log-odds [James et al., 2013; Wasserman, 2004] logit (p(X)) = log ( p(X) 1− p(X) ) = h∑ j=1 wjXj + b. (2.14) In other words, the LR is essentially a linear model, where the linearity lies in the log-odds. Figure 2.2: The Sigmoid function 2.2.2 Inference and estimation of the model parameters As shown in Eq.(2.13) and Eq.(2.14), the formulation of the logistic regression model is entirely depending on the quantity ∑h j=1wjXj + b (or θ TX in the matrix form). The sigmoid function sigm(·) is merely a function that projects the the value of θTX – which could range between (−∞,∞) – into the interval [0,1] and hence ensures that the probabilities generated from the model sum to 1 [Friedman et al., 2009]. In classification setting, the above feature means that the quantity θTX also determines the class to which we assign an observation given to the observation’s values in X. In other words, the set of model parameters6 θ reveals the relationship between the outputs and the inputs. For this reason, the inference and estimation of the model parameters θ are of paramount importance. As implied by Eq.(2.14), for each predictor Xi, i = 1, . . . , h, the magnitude of its corresponding parameter θi is the instantaneous effect on the log-odds followed by a one-unit change in the particular Xi in question [James et al., 2013]. However, the instantaneous effect of some predictor Xi on the 6For convenience, we use θ = (θ0, θ1, . . . , θh) to denote the model parameters instead of wj , j = 1, . . . , h, and b 9 probability p(X) is not fixed and might vary depending on the value of Xi. But regardless of the value of Xi, if θi > 0, then increase in Xi will also increase the probability p(X); In the case where θi < 0, an increase in Xi will induce a decrease in p(X) [James et al., 2013]. Furthermore, if θi = 0, then it would imply that the response Y is conditionally independent of Xi, given the other predictors [Agresti, 2015]. In addition, note that θ0 is the model parameter for X0, a column vector of ones. The quantity corresponds to the intercept (or bias) discussed in Section 2.1.2 and equals to E [Y |X = 0]. In other words, we can interpret θ0 as our expectation about the log-odds in absence of information provided by the predictors [James et al., 2013]. In practice, the set of model parameters θ is unknown to us and has to be estimated. One way of doing that is to find the set of estimated parameters θˆ that maximises the likelihood function of our sample. Such approach is called maximum likelihood estimation (MLE). More formally, suppose we have collected a sample of size n. Let y1, y2, . . . , yn be the observed values of the corresponding random variables Yi, i = 1, . . . , n. Since we only consider the discrete case in this thesis, the joint distribution of Yi is given by a probability mass function (pmf) p(y1, y2, . . . , yn). Assuming the probability distribution of the observations yi can be model by some unknown parameters θ, we specify the likelihood function of our observation as L (θ|y1, y2, . . . , yn) = p (θ|y1, y2, . . . , yn) = p (θ|y1) p (θ|y2) . . . p (θ|yn) = n∏ i=1 p (θ|yn) . (2.15) Thus, the likelihood function, if formulated as above, provides us with the probability or likelihood of observing the events {Yi = yi}, given the values of θ [Wackerly et al., 2008]. That is to say, the most reasonable value – the best estimate – of the unknown parameter θ is the one that maximises the probability of observing the same events {Yi = yi} as in our sample [Friedman et al., 2009] θˆMLE = arg max θ Pr ( Observing the events {Yi = yi} ) . We consider the case with h predictors as in Section 2.1.3. Let xi = (xi0, xi1, . . . , xih) be a vector of observed values of the h inputs in ith sample observation. Then the (conditional) likelihood function for sample observations (Yi = yi|xi) ∼ Bernoulli(pi), where pi = p(xi), is L(θ|yi) = n∏ i=1 p(xi; θ)yi (1− p(xi; θ)) 1−yi . (2.16) Instead of maximising Eq.(2.16) directly, it is usually easier to maximise the logarithm of it, the log-likelihood function. The log-likelihood function ` (θ|yi) has the same global maximum as the original function while still depends on θ, as shown in Eq.(2.17) below [Friedman et al., 2009; Greene, 2018] `(θ|yi) = n∑ i=1 yi log p(xi; θ) + n∑ i=1 (1− yi) log(1− p(xi; θ)). (2.17) As followed from Eq.(2.10) and the fact that 1 − p(X; θ) = 1/[1 + exp(θTX)], the log-likelihood for 10 the logistic regression model is equal to `(θ|yi) = n∑ i=1 [ yiθ Txi − log ( 1 + exp ( θTxi )) ] . (2.18) To find the set of θˆMLE that maximises Eq.(2.18), we derive the partial derivative ∂`(θ) ∂θ and set it equal to 0, that is [Agresti, 2012] ∂`(θ) ∂θ = n∑ i=1 xiyi − n∑ i=1 xi exp ( θTxi ) 1 + exp (θTxi) = 0 (2.19) The Eq.(2.19) above has a numerical solution, which can be obtained by using the Newton-Raphson algorithm, an iterative process for solving non-linear equations. The mathematical details behind the method are beyond the scope of this study. We therefore refer readers who are interested to the works of Agresti [2012] and Friedman et al. [2009] for further reading. 2.2.3 LR from the perspective of Generalised Linear Models (GLM) The logistic regression model discussed previously is, in fact, a special case of the generalised linear models (GLM). The GLM has three essential parts: 1) random component, 2) linear predictor, and 3) link function. LR differs from the other members of the GLM family, such as linear regression and Poisson regression, in the distribution of the random component and the link function – a function that connects the random component to the linear predictor [Agresti, 2015]. We begin with defining the exponential family distribution, which encompasses some of the well-known probability distributions such as Gaussian, Poisson, and Bernoulli (or Binomial). Let yi, i = 1, . . . , n, be the observed values of the random variable Y and assume that the quantity yi, which is also the random component of a GLM, has the following conditional distribution f(yi|θi, φ) = exp ( yiθi − b(θi) a(φ) + c(yi, φ) ) (2.20) where θi is the natural parameter7 and φ the dispersion or scale parameter. In addition, the quantity c(yi, φ) is a normalising constant and a(·) and b(·) are some functions of the parameters φ and θi, respectively. By choosing certain combination of a(·) and b(·), we can rewrite Eq.(2.20) into the pmf or pdf of different distributions [Agresti, 2015]. The quantities b(θi) and a(φ) are also essential to the GLM in the sense that they determine the expected value and variance of the random component, that is E [Yi] = b′ (θi) (2.21) Var (Yi) = b ′′ (θi) a (φ) . (2.22) In GLM, we assume the observed values of inputs xi = (xi0, xi1, . . . , xih) relate to the expected value of random component, E [Yi] = µi, through a link function g(·) such that g(µi) = ηi (2.23) where the quantity ηi is called the linear predictor and is defined in a similar way as the linear function 7Note that θi is indexed here, that is because this parameter is observation specific in GLM setting. In other words, it can be expressed as a function of the observed inputs θi = θ(xi)[Agresti, 2012]. 11 in Eq.(2.5), i.e. as the linear combination of the unknown parameters wj and their related predictors Xj ηi = h∑ j=1 wjxij (2.24) where xij is the observed values for predictors Xj , j = 1, . . . , h. Eq.(2.23) also suggests that the functions used in linear models – such as sigm(·) mentioned above – are equivalent to the inverse of the link function g−1(·), which maps the values of a linear combination of inputs and model parameters to the expected response [Agresti, 2015]. As a member of the exponential family, the pmf of a Bernoulli distributed random variable can be easily transformed into the format specified in Eq.(2.20). Suppose that we have a sample which (Yi = yi|xi) ∼ Bernoulli(pi) with the pmf defined as f(yi; pi) = p yi i (1− pi) 1−yi (2.25) which is the same as f(yi; pi) = exp (log f (yi; pi)) = exp (yi log pi + (1− yi) log (1− pi)) = exp ( yi log ( pi 1− pi ) + log (1− pi) ) . (2.26) The equation above is the same as Eq.(2.20) with a(φ) = 1, c(yi, φ) = 1 and natural parameter θ and the function b(θ) take the following form θi = log ( pi 1− pi ) (2.27) b (θi) = − log (1− pi) = − log ( 1− exp (θi) 1 + exp (θi) ) = log (1 + exp (θi)) . (2.28) In addition, as followed from Eq. (2.21) and (2.22) above, we have µi = exp (θi) 1 + exp (θi) = pi (2.29) Var [Yi] = exp(θi) (1 + exp(θi)) 2 = pi(1− pi) (2.30) As mentioned previously, the logistic regression model is linear in log-odds, meaning that the linear predictor ηi = logit(pi). Hence, using the result of Eq.(2.29), we can define the link function g(·) as g (µi) = ηi = log ( pi 1− pi ) = θi (2.31) The link function defined above is also called the logit link. In other words, the LR is equivalent to a GLM that uses logit link to model a Bernoulli distributed random component. As a sidenote, Eq.(2.31) also indicates that the logit link is also a canonical link, i.e. a link function g(·) that transform the mean µi to the natural parameter θi [Agresti, 2015]. 12 2.2.4 Classification using LR Concepts such as linear predictor and link function discussed in the previous section are closely related to the latent variable model for the classification of binary responses. In such model, we assume that there is an unobserved continuous response y∗i that for each observation i = 1, . . . , n satisfies [Agresti, 2015] y∗i = h∑ j=1 wjxij + i = ηi + i (2.32) where ηi is the linear predictor, while i is the error term (or bias) described in Eq.(2.1) and has a distribution with zero mean and the cumulative distribution function (cdf) F. By imposing a threshold τ as decision rule, we can construct a linear classifier with the following formulation yi = { 1 if y∗ > τ 0 if y∗ ≤ τ (2.33) Figure 2.3: Data of size n = 50 simulated from a logistic model; Blue dots are the original values; Red dots are the predicted values; Black solid line is the decision boundary with the threshold set to pˆ = sigm(X; θˆ) = 0.5, or equivalently, τ = 0. Graphically, imposing τ means that we are drawing a vertical decision boundary at the point that y∗ = τ . We then classify observations to the left of this line as 0 and those to the right as 1 [Murphy, 2012], as shown in Figure 2.3. The choice of the threshold τ is arbitrary. One common approach is to set the predicted probability cut-off, pˆ0, to 0.5. In the case of LR, this is equivalent to setting τ = 0. Another common approach is to set pˆ0 = y¯, i.e. the fraction of 1 in the data [Agresti, 2015]. In the following chapters of this thesis, the term y¯ is also referred as rarity. Meanwhile, the classifier in Eq.(2.33) connects to the probability model p(xi) = Pr(yi = 1|xi) in the 13 sense that Pr(yi = 1|xi) = Pr(y∗i > τ |xi) = Pr (ηi + i > τ |xi) = 1− Pr (i ≤ τ − ηi|xi) = 1− F (τ − ηi|xi) = 1− F  τ − h∑ j=1 wjxij ∣ ∣ ∣ ∣xi   . (2.34) Since the choice of τ is arbitrary, this quantity is unrelated to our observed data. The result does not lose its generality if we set τ = 0. Meanwhile, because F (z) = 1−F (−z), as followed from the definition of cdf, we have [Agresti, 2015] Pr(yi = 1) = F   h∑ j=1 wjxij   , and F−1 ( Pr (yi = 1) ) = h∑ j=1 wjxij = ηi. (2.35) This shows that the inverse of the cdf F−1 returns the linear predictor ηi, making it equivalent to the link function g(·) described in previous section. Hence, the latent variable model for classification can be treated as a version of GLM. And LR is the case when F is the cdf of the standard logistic distribution, that is F(z) = ez 1 + ez = sigm(z), and F−1 (z) = g(z) = logit(z). (2.36) Followed from the result above, we can obtain the pdf of the standard logistic distribution as f(z) = ez/(1 + ez)2. As Figure 2.4 shows, the logistic density is symmetry. Hence, one important feature of the LR is that, following a change in the inputs xi, the probability pi approaches to 1 at the same rate as it approaches to 0 [Agresti, 2015]. Figure 2.4: Probability density of standard logistic distribution 14 2.3 Support Vector Machine (SVM) 2.3.1 Classification using a separating hyperplane In this section, we discuss the classification method that uses a separating hyperplane as decision bound- ary. The hyperplane is defined as a flat affine subspace of dimension p− 1 in a p-dimensional space. In other words, if we have a plane (i.e., R2), then the hyperplane is a straight line (i.e., R1) that not neces- sarily passes through the origin [James et al., 2013]. The concept of a separating hyperplane is essential to the support vector machines (SVM). Although the approach of using a hyperplane for classification might be seen completely different from the previously mentioned LR at first glance, these two methods are – as shown in the following sections – closely related in many aspects. Suppose that we have a training data set S consisting n ordered pairs (x1, y1), (x2, y2), . . . , (xn, yn), where the dimension of the feature spaces xi is equal to the number of predictors h. Assume that our data S is linearly separable, that is to say, there exists at least one separating hyperplane that perfectly separates the training observations (i.e., classification with zero error). We can then model such separating hyperplane using a linear function similar to Eq.(2.5), hence defining a separating hyperplane with respect to training set S by the following equation [Cristianini & Shawe-Taylor, 2000; Friedman et al., 2009] f(x;W, b) = 〈W,x〉+ b = h∑ j=1 wjxij + b = 0. (2.37) The equation above means that the separating hyperplane is basically a set of data points xi on the feature space whose linear combination – defined by the parameters (W, b) – equals to zero. Hence we can denote a separating hyperplane by the set of parameters (W, b) that defines it. From here we can see that the definition of separating hyperplanes is similar to the model formulation of LR shown in Eq.(2.14), although the inference of parameters (W, b) in the former case has a more geometrical nature: the weight vector W = (w1, . . . , wh) is a direction orthogonal to the hyperplane and the bias parameter b is a vector controlling the distance between the hyperplane and the origin, i.e. it has the ability of making parallel ”shift” in the hyperplane [Murphy, 2012]. In addition, since the hyperplane is formulated as an equation that equals to zero, it would be convenient to redefine our binary outcome variable as yi ∈ {−1, 1} [James et al., 2013]. Note that despite change in notation, the binary outcome defined as yi ∈ {−1, 1} is not different from the one defined as {0, 1}, since the value of these numbers has no real meaning – it is merely a label created to distinguish between two classes. With that said, we can construct a classifier based upon a separating hyperplane as follows [Wasserman, 2004] sgn(z) =    −1 if z < 0 0 if z = 0 1 if z > 0 (2.38) where sgn(·) is the sign function and z = f(x;W, b) = ∑h j=1wjxij + b corresponds to the latent variable y∗ in Eq.(2.33) above. The classifier defined in Eq.(2.38) works as followed: the separation hyperplane f(x;W, b) = 0 divides the feature space in one positive side and one negative side, so for a new observation x∗ whose linear combination f(x∗) > 0, we classify it to the positive side of the feature space; if f(x∗) < 0, we classify it to the negative side. This is also the reason why sgn(·) is used. In addition, we can use the magnitude of f(xi;W, b) as a measure for the confidence we have for our classification. If f(x∗  0), i.e this new observation lies far away from the separating hyperplane, then we say that we are ”confident” about 15 that our classification is correct [James et al., 2013]. (a) Maximum margin classifier, separable case (b) Support vector classifier, overlapping classes Figure 2.5: Illustration of separating hyperplane, geometric margin and slack variable. In subfigures (a) and (b), the solid line is the separating hyperplane f(xi;W, b); the black arrows are the geometric margins γ; the filled-in points/triangles are the support vectors. The red arrows in subfigure (b) are the slack variables ξ. Since there might exist more than one (or indefinitely many) separating hyperplanes for a given linearly separable data set, as subfigure 2.5a shows, we have to find a way to determine which of these hyperplane is the best one. It reveals that the choice of the optimal hyperplane is closely related to the magnitude of f(xi;W, b) we discussed previously. Note that Eq.(2.38) implies that an input in the sample xi is assign to the correct class if and only if its linear combination f(xi;W, b) has the same sign as yi. In other words, for a given training set S, we want to ensure that [Friedman et al., 2009] yif(xi;W, b) > 0, ∀i = 1, . . . , n. (2.39) The quantity yif(xi;W, b) in the constraint above is also known as the functional margin of a hy- perplane (W, b) with respect to the sample observation (xi, yi). We denote such a functional margin as γ˜i, in order to distinguish it with the geometrical margin defined in Eq.(2.41). In addition, the func- tional margin of a hyperplane with respect to the entire training set S is defined as the minimum of the functional margins across all data points in the sample, that is [Cristianini & Shawe-Taylor, 2000] γ˜ = min i=1,...,n γ˜i (2.40) As the definition above implies, the magnitude of the functional margin for a single observation, γ˜i, reflects the degree of confidence we have towards our classification of that observation. It is therefore reasonable to consider γ˜ – the functional margin of the observation which we are least confident about its classification – to be the quality measure for our classifier. Hence, of all possible candidates, the hyperplane that maximises γ˜ must be the optimal one. However, this approach is problematic since scaling up the parameters (W, b) by some factor k will undoubtedly increase the functional margin, but the hyperplane in question will remain unchanged because it is defined to be an equation equal to zero. Normalisation of the parameters (W, b) is therefore needed. This leads us to the geometric margin of a 16 hyperplane (W, b) with respect to the sample observation (xi, yi) [Herbrich, 2002] γi = yif(xi;W, b) ‖W‖ = γ˜i ‖W‖ (2.41) where the norm ‖W‖ = √ 〈W,W 〉. As in the case of functional margin, the geometric margin of (W, b) with respect to the training set S is γ = min i=1,...,n γi. (2.42) Thus, we are facing the following optimisation problem [Friedman et al., 2009] max γ,W,b γ (2.43) subject to yi   h∑ j=1 wjxij + b   ≥ γ, i = 1, . . . , n ‖W‖ = 1. In other words, we want to find the separating hyperplane that maximises the geometric margin with respect to our training data, subject to each of the observations having a functional margin at least the size of the geometric margin. Unfortunately, this optimisation problem is hard to solve, since the objective function is neither linear nor quadratic. And the constraint ‖W‖ = 1, which ensures the functional margin equals to the geometric margin, is non-convex [Herbrich, 2002]. Luckily, we can use the definition of geometric margin in Eq.(2.41) to eliminate the constraint ‖W‖ = 1 and rearrange Eq.(2.43) above as max γ˜,W,b γ˜ ‖W‖ (2.44) subject to yi   h∑ j=1 wjxij + b   ≥ γ, i = 1, . . . , n. The objective function above is still non-convex. But we can get rid of this non-convexity by imposing a scaling constraint γ˜ = 1. This is possible because the norm ‖W‖ ensures that the distance of any point to the hyperplane will not change after the re-scaling of parameters (W, b) [Murphy, 2012]. As followed from Eq.(2.41), the geometric margin which we aim to maximise can be expressed as γ = 1/‖W‖. Since maximising the quantity 1/‖W‖ is the same as minimising ‖W‖2, we have8 min W,b 1 2 ‖W‖2 (2.45) subject to yi   h∑ j=1 wjxij + b   ≥ 1, i = 1, . . . , n. To solve the (primal) optimisation problem above, we have to first transform it to an equivalent dual problem, using the Lagrange duality, and then perform dual optimisation9. The Lagrange function for 8Note that the quantity 1/2 is added for the purpose of computational convenience and does not change the optimisation result [Murphy, 2012] 9Readers are referred to the works of Bishop [2006] and Friedman et al. [2009] for a more detailed description about this subject. 17 the (primal) optimisation problem in Eq.(2.46) is [Friedman et al., 2009] LP (W, b, α) = 1 2 ‖W‖2 − n∑ i=1 αi  yi   h∑ j=1 wjxij + b  − 1   (2.46) where αi is the Lagrange multiplier. Setting the partial derivatives of the parameters W and b to zero, we obtain ∂LP ∂W = 0 ⇒ W = n∑ i=1 αiyixi (2.47) ∂LP ∂b = 0 ⇒ b = n∑ i=1 αiyi = 0. (2.48) By plugging the results above back to the Lagrange function in Eq.(2.46), we can obtain the dual form (so-called Wolfe dual) which is defined by αi only LD(α) = n∑ i=1 αi − 1 2 n∑ i=1 n∑ k=1 αiαkyiykx T i xk. (2.49) Note that the quantity xTi xk = 〈xi,xk〉, i.e. the inner product of two training observations. Finally, we have the following dual optimisation problem max α n∑ i=1 αi − 1 2 n∑ i=1 n∑ k=1 αiαkyiyk 〈xi,xk〉 (2.50) subject to αi ≥ 0 and b = n∑ i=1 αiyi = 0 i = 1, . . . , n. In other words, we first find the Lagrange multiplier α that maximises the objective function LD(α), then use this α to compute the optimal set of parameters (W, b), thereby also the optimal separating hyperplane. The solution of the optimisation problem in Eq.(2.50) can be obtained by implementing the sequential minimal optimisation algorithm (SMO) developed by Platt [1998]. In addition, this solution must also satisfy the Karush-Kuhn-Tucker conditions, in which the following constraint is included [Friedman et al., 2009] αi  yi   h∑ j=1 wjxij + b  − 1   = 0, ∀i. (2.51) This constraint implies that the optimal hyperplane (W, b) will only be determined by those training observations xi with functional margin γ˜i = 1. Such observations are called support vectors. Observations that are not support vectors will have αi = 0 and hence can not influence our choice of the optimal set of (W, b) [Friedman et al., 2009]. Once we have solved the dual optimisation problem, we obtain a separating hyperplane defined as 18 the following, using the relationship between W and αi stated in Eq.(2.47) fˆ(x; Wˆ , bˆ) = 〈 Wˆ ,x 〉 + bˆ = 〈 n∑ i=1 αiyixi , x 〉 + bˆ = n∑ i=1 αiyi 〈xi,x〉+ bˆ = 0 (2.52) which leads us to the maximum margin classifier defined by parameters (Wˆ , bˆ). In practice, however, it is nearly impossible to obtain such classifier. This is because real world data usually contain overlapping classes and hence are not linearly separable. In order to still use separating hyperplanes for classification in such cases, we have to make some compromises and allow some of the observations to be at the wrong side of the decision boundary. This leads us to the soft margin and hence support vector classifier [James et al., 2013]. 2.3.2 Soft margin and support vector classifier As mentioned previously, the maximum margin classifier can only apply to data that are linearly separa- ble10. To extend its application to data with overlapping classes, we have to allow some observations to have γ˜ < 1 and perhaps even violate the constraint in Eq.(2.39). With such relaxations, the optimisation problem in Eq.(2.45) becomes min W,b,ξ1,...,ξn 1 2 ‖W‖2 + C n∑ i=1 ξi (2.53) subject to yi   h∑ j=1 wjxij + b   ≥ 1− ξi, i = 1, . . . , n ξi ≥ 0, i = 1, . . . , n where the quantity ξi in the objective function is called the slack variable and the constant C is a pre- specified, non-negative tuning parameter [James et al., 2013]. The size of the slack variable ξi determines whether an observation is misclassified: if 0 < ξi < 1, then the observation are still at the correct side of the decision boundary even if it has a functional margin less than 1; if ξi > 1, then the observation appears at the wrong side of decision boundary. For this reason, the sum ∑n i=1 ξi sets the upper bound on the number of misclassifications. And the constraint that includes the slack variable, yif(xi) ≥ 1−ξi, is called the soft margin constraint [Murphy, 2012]. Meanwhile, the tuning parameter C controls the trade-off between the goal of maximising the margin (i.e., minimising ‖W‖2) and minimising training errors ∑n i=1 ξi [Bishop, 2006]. Violations are less tolerated when C is small, meaning that we fit the training data closely and it will result in a model that has low bias but large variance. On the contrary, large value of C means that we allow more violations to the margin, making the fitting procedure less hard and resulting in a model of low variance and high bias [James et al., 2013]. Hence, the tuning parameter C is closely connected with the bias-variance trade-off. For this reason, we usually use k-fold cross validation to choose the optimal value of C [Friedman et al., 2009]. As in the case of the maximal margin classifier, the optimisation problem above can also be rewritten to dual form using Lagrange function[Friedman et al., 2009]. The dual optimisation problem using soft margin is expressed as 10More specifically, if the training set is not linearly separable, then the algorithm that we use to compute such separating hyperplane – the perceptron algorithm developed by Rosenblatt [1958] – will not converge. 19 max α n∑ i=1 αi − 1 2 n∑ i=1 n∑ k=1 αiαkyiyk 〈xi,xk〉 (2.54) subject to 0 ≤ αi ≤ C and b = n∑ i=1 αiyi = 0 i = 1, . . . , n. And the constraint defined in Eq.(2.51) became αi  yi   h∑ j=1 wjxij + b  − (1− ξi)   = 0, ∀i. (2.55) Hence, the support vectors in the linearly non-separable case are defined to be those that satisfy yi   h∑ j=1 wjxij + b  − (1− ξi) ≥ 0, ∀i (2.56) These support vectors define the separating hyperplane (Wˆ , bˆ). And the classifier based on this hyperplane are called support vector classifier. Subfigure 2.5b shows an illustration of the relationship between the hyperplane (Wˆ , bˆ), the slack variable ξi and the support vectors. We can see that some of the support vectors are lying on the line where γ˜ = 1 (i.e., ξi = 0) with its Lagrange multiplier falling in the interval 0 < αi < C. Those support vectors that are inside the functional margin (i.e., ξi > 0) are all characterised by αi = C [Friedman et al., 2009]. 2.3.3 Extension to non-linear cases using kernels Like the maximum margin classifier, the support vector classifier described in the previous section is a linear classifier. For this reason, the support vector classifier is most suitable for data in which the relationship between the response and the predictors is – or assumed to be – linear. In general, linear classifiers perform poorly in non-linear relationships, where the decision boundary could be polynomial or circular. If we still want to use the linear classifier to model non-linear relationships, what we can do is to enlarge our data to a higher dimension. One (informal) way of doing that is to add quadratic or cubic terms to our linear model [James et al., 2013]. More formally, we introduce some function φ(·) that maps the data points in our sample to a higher dimension. Since the the non-linear relationship will become linear in the higher dimension, we can apply our intended linear classifier in the enlarged feature space and compute the decision boundary. After this is completed, we map the result back to the original feature space [James et al., 2013]. Transforming the training set to a higher dimension using φ(·) implies that Eq.(2.49), the objective function for the dual optimisation problem, becomes LD(α) = n∑ i=1 αi − 1 2 n∑ i=1 n∑ k=1 αiαkyiyk 〈φ(xi), φ(xk)〉 (2.57) 20 and the solution function fˆ(x; Wˆ , bˆ) in Eq.(2.52) becomes fˆ(x; Wˆ , bˆ) = 〈 Wˆ , φ(x) 〉 + bˆ = n∑ i=1 αiyi 〈φ(xi), φ(x)〉+ bˆ. (2.58) Although the solution above can lead us to an non-linear decision boundary, the inner product 〈φ(xi), φ(x)〉 will be very difficult to compute explicitly. As the number of predictors increases, the dimension of the feature space will also increase. In the end, we might have to deal with an feature space of infinite dimension [Friedman et al., 2009]. The solution to this dimensional problem is the so-called kernel trick, which allows us to replace the 〈φ(xi), φ(x)〉 with a kernel K (xi,x), making the computation more comprehensible for us [Murphy, 2012]. More formally, a kernel or kernel function is defined as a function K that takes two arguments x,x′ ∈ X , where X is some input space, and maps X × X → R satisfying the following condition11 K ( x,x′ ) , 〈 φ(x), φ(x′) 〉 . (2.59) In the most cases, the kernel function is symmetric (i.e., K (x,x′) = K (x′,x) and non-negative [Murphy, 2012]. Furthermore, in order to be a valid kernel, the mapping K(·, ·) must also satisfy the requirement that its corresponding kernel matrix (so-called Gram matrix ), defined as K =     K (x1,x1) · · · K (xn,x1) ... . . . ... K (x1,xn) · · · K (xn,xn)     must be positive semi-definite for all sets of inputs xi, i = 1, . . . , n. Those K(·, ·) that fulfill this re- quirement are also called Mercer kernel [Murphy, 2012]. One of the most popular mercer kernel is the Gaussian kernel, defined as K ( x,x′ ) = exp ( − ‖x− x′‖2 2σ2 ) . (2.60) In addition to that, polynomial kernels and hyperbolic tangent (Sigmoid) kernel12 are also commonly used [Bishop, 2006; Friedman et al., 2009] d-th Degree polynomial kernel: K ( x,x′ ) = (1 + 〈xi,x〉)d (2.61) Sigmoid kernel: K ( x,x′ ) = tanh(a 〈xi,x〉+ b). (2.62) Because of the space limit, I will not describe these kernels in more detail in this thesis. The readers are therefore referred to the works of Friedman et al. [2009] and Murphy [2012] for a more thorough description about different kernel functions as well as the proof of their validity. To sum up, by replacing the inner product component in Eq.(2.58) with some of the valid kernel 11Note that in from Eq.(2.49) above, we hinted the expression of the Lagrange dual function using inner product. This is the reason why: the inner product notation creates a natural way for us to incorporate the kernels into our optimisation problem. 12Note that the Gram matrix of the Sigmoid kernel is not positive (semi) definite. The reason why we still use this kernel is because it is closely connected to the neural network, an another very popular method in machine learning [Bishop, 2006]. 21 functions listed above, we can write our solution function as [Friedman et al., 2009] fˆ(x; Wˆ , bˆ) = n∑ i=1 αiyiK (xi,x) + bˆ (2.63) and classifiers that are based upon a separating hyperplane defined as this equation are called support vector machines. Note that the use of identity mapping φ(x) = x yields K (xi,x) = 〈xi,x〉, which will return the same linear decision boundary specified in Eq.(2.58). Such kernel function are called linear kernel. In practice, the linear kernel is useful when our original data is already high dimensional and hence the data is linearly separable in this high dimensional feature space, or when we want to compare the SVM output generated from different kernel functions. In the latter case, the SVM output from linear kernel can serve as a reference point [Bishop, 2006; Murphy, 2012]. 2.3.4 Obtaining calibrated probabilistic outputs for SVM As indicated in Eq.(2.38), for a new observation x∗, the SVM classifier will first return the value of linear combination f(x∗) – a real number that is proportional to the distance of x∗ to the separating hyperplane – and then assign x∗ to one of the two classes based on the sign of f(x∗) [James et al., 2013; Pedregosa et al., 2020b]. This means that the SVM classifier, unlike LR, does not produce any probabilistic outputs [Murphy, 2012; Tipping, 2001]. In addition, recall that the weights W is interpreted as a vector orthogonal to the separating hyperplane. Hence, the coefficients of estimated parameters (Wˆ , bˆ) that define the decision boundary can not be interpreted in the same way as the coefficients of a logistic model. The lack of probabilistic output is regarded as one of the major drawbacks of SVM, since a probability model expresses the uncertainty (or confidence) in the prediction and hence is essential to real-world classification tasks, especially those that involve asymmetric misclassification cost and varying class proportions13 [Tipping, 2001]. Luckily, there are methods that we can use to obtain probabilistic outputs from SVM. The simplest way of doing that is to map the SVM outputs into the range [0, 1] using the following normalisation formula [Niculescu-Mizil & Caruana, 2005] pˆi = si − smin smax − smin (2.64) where si = fˆ(xi; Wˆ , bˆ) denotes the score predicted by a SVM classifier for the observation xi in the data set and is proportional to the distance between xi and the seperating hyperplane. Under the assumption that we can interpret the SVM scores as probabilities14, the normalisation above would give us the posterior probabilities for SVM. However, such ”probabilities” are often considered as uncalibrated. According to Kull et al. [2017], a probability model for classification is considered as well-calibrated if its probabilistic predictions for a certain class match the observed distribution of that class in the data. More formally, let s(·) denote the scoring function of a classifier and s(x) = s the score produced by this scoring function15 for each of the observations. To keep things simple, we restrict ourselves to the binary classification case and the case where s ∈ [0, 1]. In this setting, the classifier is considered as 13i.e., imbalanced data and rare events, which is the subject of this thesis. 14This assumption might be doubtful since the argmax of the SVM scores is not always the same as argmax of the probabilities. In other words, the fact that a outcome receives the highest score, smax, does not necessarily imply that this outcome would have pˆ = 1 [Pedregosa et al., 2020b]. 15For instance, s(x) = sigm(x; θˆ) for LR and s(x) = fˆ(x; Wˆ , bˆ) for SVM. To avoid confusion, I will henceforth use the term s(x) to denote the scores produced by an arbitrary classifier, while the term fˆ(x) refers specifically to the scores produced by a SVM. 22 well-calibrated if the conditional probability Pr (y = 1|s (x) = s) converges to the scores s (x) = s, as the number of observations classified approaches infinity [Zadrozny & Elkan, 2002]. In other words, if the classifier gives the score s (x) = 0.8 to a number of observations, then approximately 80 percent of these observations should belong to the class y = 1 [Kull et al., 2017; Zadrozny & Elkan, 2002]. The probability calibration of different classifiers can be visualised with reliability diagrams, also known as calibration curves. With the predicted probabilities for the positive class on its x-axis and the observed frequency of the actual positive cases given the predictions on its y-axis, the reliability diagram illustrates how close a classifier’s probability predictions are to the actual probabilities observed in the data [Ferna´ndez et al., 2018; Prati et al., 2011]. Note that the variables on the reliability diagram’s x and y-axis correspond to the score s(x) and the conditional probability Pr (y = 1|s (x) = s), respectively, as mentioned above. For this reason, the convergence of Pr (y = 1|s (x) = s) and s (x) = s – hence also the ideal reliable prediction – is represented by an upward diagonal line [Prati et al., 2011]. To plot the reliability diagram for a certain classifier, one must first discretise the predicted probabilities into bins16 and then for each bins compute the mean predicted value. Plotting these values against the true faction of the positive cases in each of the bins yields the calibration curve of the evaluated classifier [Ferna´ndez et al., 2018; Niculescu-Mizil & Caruana, 2005]. Figure 2.6 shows two reliability diagrams, which illustrate the out-of-sample performance of a linear SVM trained on synthesised datasets. As shown in subfigure 2.6a, the purple line, which represents the probabilistic predictions computed via the normalisation formula in Eq.(2.64), does not converge to the diagonal line of the reliability diagram. This is more clear in the case of imbalanced classes. As shown in subfigure 2.6b, the entire purple line is under the diagonal line, indicating a poor agreement between uncalibrated probabilities and their mean observed frequencies. In both subfigures, the red line representing the calibrated probabilities shows a better convergence to the diagonal line than does the purple line. To produce calibrated probabilities, we generally apply a calibration function (or a calibrator) to the scores s(x) so that it returns a set of calibrated probability between 0 and 1 [Kull et al., 2017; Pedregosa et al., 2020a]. The calibrated probabilities in subfigures 2.6a and 2.6b were generated by a calibration method called Platt scaling, which uses the sigmoid function as the calibrator. For this reason, the Platt scaling is also called logistic calibration [Kull et al., 2017; Pedregosa et al., 2020b]. Platt scaling was originally introduced as a method for obtaining probabilistic outputs from SVM. The intuition behind Platt scaling is to interpret the SVM scores as the log-odds defined in Eq.(2.14) and thereafter map the scores into probabilities using the following sigmoid function [Murphy, 2012; Platt, 2000] Pr(y = 1|fˆ(x)) = 1 1 + exp(Afˆ(x) +B) (2.65) where fˆ(x) is the output of SVM, while A and B are parameters to the sigmoid function. According 16The convention is to discretise the predicted probabilities into 10 bins, that is, the observations with scores (or predicted probabilities) 0 ≤ s < 0.1 form the first bin, and those with scores 0.1 ≤ s < 0.2 form the second bin, and so on [Ferna´ndez et al., 2018; Niculescu-Mizil & Caruana, 2005]. The reliability diagrams in this study, however, depart from this convention and were plotted on the basis of different numbers of bins. There are two reasons for that: 1) the evalm() function from the MLeval package, the function I used to generate reliability plots, does not support the discretisation I mentioned above and is only able to divide scores into evenly sized bins; and 2) data sets used in this study are mainly data with imbalanced classes, meaning that the probabilities predicted by the classifiers included in this study would be concentrated below 0.5 (more about this in the following chapters). The second reason also implies that the calibration curve of a classifier trained on imbalanced data would be relatively ”short”. Therefore, discretising such predicted probabilities with 10 bins would result in a short calibration curve that fluctuate largely between 0 and 1. Such reliability diagram is misleading in my opinion. And to avoid this outcome, I used less than 10 bins to plot the reliability diagrams for classifiers trained on imbalanced data. 23 to Niculescu-Mizil & Caruana [2005] and Kull et al. [2017], the sigmoid transformation above can also be applied to other classifiers, such as boosted trees and Naive Bayes. In such cases, the term fˆ(x) in Eq.(2.65) above will be replaced with scores s (x) that correspond to the outputs of these classifiers. In addition, Kull et al. [2017] has also shown that the sigmoid function is exactly the right calibrator to use, assuming that the scores produced by a classifier are Gaussian distributed within each class around the class means and with equal variance. (a) Reliability diagram of a linear SVM classifier (balanced data) (b) Reliability diagram of a linear SVM classifier (imbalanced data) Figure 2.6: Reliability diagrams showing the out-of-sample performance of a linear SVM. In both subfigures (a) and (b), the purple line is the uncalibrated probabilistic predictions computed by normalising the scores of the linear SVM, while the red line is the calibrated probabilistic predictions produced by Platt scaling. The datasets used to obtain the diagrams were both simulated from a logistic model. The only thing that these two simulated data sets differ is the balance of classes: in subfigure (a), the data set used has 50 percent of its observations belong to the positive class; in subfigure (b), only 25 percent of the observations in the data set are positive instances. Additionally, both datasets contains 500 observations, but they both were split to a training and a test set using the 50:50 ratio. In practice, using Platt scaling to obtain probabilistic outputs from SVM means that we first train a SVM for our data, and then fit the sigmoid function defined in Eq.(2.65) to the SVM output. More formally, let fˆi = fˆ(xi) denote the SVM outputs, which together with the outcomes yi in the data on which the SVM is trained form the training set (fˆi, yi). Note that the outcome variable in this training set is still labelled as yi ∈ {−1, 1}. Because our objective is to obtain probabilities, we replace the 24 outcome variable yi with a variable for target probabilities, ti, which defined as [Platt, 2000] ti = yi + 1 2 . (2.66) It is not difficult to show that ti = 1 when yi = 1 and ti = 0 when yi = −1. We can therefore proceed to define a new training set (fˆi, ti), from which we find the logistic parameters A and B in Eq.(2.65) above using MLE. More specifically, we find the value of A and B that minimise the following function, which is the negative log likelihood of the training set (fˆi, ti) [Niculescu-Mizil & Caruana, 2005; Platt, 2000] arg min A,B ( − ∑ i ti log(pi) + (1− ti) log(1− pi) ) (2.67) where pi = 1 1 + exp(Afˆi +B) . (2.68) Note that the parameters A and B in the equation above must be estimated in a separate validation set. Since if we estimate (A,B) in the same data set we used to train the SVM, we will have a severe overfitting problem [Murphy, 2012; Platt, 2000]. The solution suggested by Platt [2000] is to apply a k- fold cross-validation in the training set17. In particular, we first split the training data into k evenly sized folds and then keep the first fold for validation of the SVM scores fˆi (i.e., to evaluate the classification performance of the classifier), while train the SVM on the remaining k − 1 folds. After repeating this procedure k times so that each of the k folds had been treated as validation set once, we form the training set (fˆi, ti) for the estimation of parameters (A,B) by calculating the union of the k sets of fˆi [James et al., 2013; Platt, 2000]. As a sidenote, it is also worth to point out that we must have A < 0 to guarantee that Eq.(2.65) is a monotonically non-decreasing function. This means that we also assume the observations with higher SVM scores have a higher probability of belonging to the positive class [Kull et al., 2017; Platt, 2000]. However, as a method of obtaining calibrated posterior probabilities from SVM, Platt scaling does have some drawbacks. Firstly, according to Niculescu-Mizil & Caruana [2005], Platt scaling is most effective when the distortion in the predicted probabilities18 have a shape similar to the sigmoid function. If this is not the case, such as the uncalibrated probabilities shown in subfigure 2.6a, the improvements provided by Platt scaling will be minimal. Furthermore, if the shape of the uncalibrated probabilities differ considerably from that of the sigmoid, the Platt scaling might even produce a set of ”calibrated” probabilities that are more ill-fitted to the true probabilities than the uncalibrated one [Kull et al., 2017]. Secondly, as mentioned above, the sigmoid function is the right calibrator to use if the scores within each class are Gaussian distributed with equal variance19. This would imply that Platt scaling might deliver suboptimal probabilities if the SVM was trained on imbalanced data – a circumstance where the assumption of equal variance is violated [Kull et al., 2017; Pedregosa et al., 2020a]. As shown in subfigure 2.6b, the Platt scaling had failed to provide probability estimates above the 0.5 level. Thirdly, according to Tipping [2001], Platt’s transformation of SVM score could also be poorly calibrated because there are nothing in the SVM that could justify the interpretation of the SVM scores 17Platt [2000] sets the number of folds to k = 3 in his article about Platt scaling. 18That is, the error between the uncalibrated probability estimates and the fraction of positive instances observed in the test set. 19Or at least in those cases where the ratio of scores distribution on the positive class to that on the negative class ”behaves similarly to the ratio of Gaussians with equal variance” [Kull et al., 2017, p.5058]. 25 as log-odds. For this reason, Tipping [2001] developed the relevance vector machine (RVM), a method based on Bayesian learning framework and has same functional form as SVM. In other words, the RVM can be treated as a Bayesian treatment of Eq.(2.63). According to Xu et al. [2007], the RVM not only produces probabilistic outputs, but also has nearly equal training efficiency and classification accuracy as the SVM. Since the RVM involves Bayesian statistics, which is out of the scope of this thesis, I will only use the Platt scaling to obtain probabilistic outputs for SVM. 26 Chapter 3 Rare events bias In statistical literature, the term rare events refers to those binary random variables that the outcome Y = 1 occurs much less frequent than Y = 0. In other words, the binary data is heavily skewed towards one class – the class of Y = 0 or non-events [King & Zeng, 2001b]. This type of rareness is also called relative rareness. There is another type of rareness – absolute rareness – which is basically a problem induced by small sample [Van der Paal, 2014]. The small sample problem in binary modelling is presented when our sample consists fewer than 200 observations. In this case, fitting a logistic model to the data will yield biased parameter estimates [King & Zeng, 2001b]. Likewise, the presence of rare events can also cause such biased estimates. We describe this rare event bias in more detail in the following section. 3.1 Problem description In order to understand the rare event bias intuitively, consider the following bivariate logistic model [King & Zeng, 2001b] Pr ( Y = 1|X; θˆ ) = pˆ(X; θˆ) = exp ( θˆTX ) 1 + exp ( θˆTX ) (3.1) where θˆ = (θˆ0, θˆ1) is the MLE of the logistic parameter θ = (θ0, θ1). It would be convenient to express them using the notation of weights and intercept, i.e. θˆ0 = bˆ and θˆ1 = wˆ1. We further assume that the predictor X is positively correlated to the outcome, i.e. w > 0. The conditional densities20 piX|Y (X|Y = 0) and piX|Y (X|Y = 1), respectively, will appear in the way illustrated in Figure 3.1 below. Since we have w > 0, the conditional density piX|Y (X|Y = 1) is located to the right of piX|Y (X|Y = 0) 21 [King & Zeng, 2001b]. In binary classification, the ideal decision boundary should be set somewhere between the the right tail of the conditional density piX|Y (X|Y = 0) and the left tail of piX|Y (X|Y = 1). If rare events are presented, then the estimation of the conditional density piX|Y (X|Y = 1) – hence the left tail of it – would be a difficult task for the reason that such estimation is based upon only a few observed events. On the contrary, the estimation of piX|Y (X|Y = 0) and its right tail would be relatively easy because of the abundance of non-events. As a consequence, the decision boundary would be set in the vicinity of those 20For readers that are not familiar with probability theory, the conditional density (or more generally, the conditional probability distribution) of X given Y is the probability distribution of X, a continuous random variable, given that we have observed Y = y. A term related to the conditional density is joint probability distribution, which is the probability distribution of observing a certain pair of values for the random variables X and Y , denoted as piX,Y (X = x, Y = y) [Wasserman, 2004]. 21This is because X and Y is positively correlated, meaning that increase in X will increase Pr (Y = 1). Since the observations are ordered in the X-axis, the conditional density piX|Y (X|Y = 1) must be to the right of piX|Y (X|Y = 0). 27 observed events with largest or second largest value of X (we can see this phenomenon in both Figure 3.1 and Figure 3.2), causing the probability of event occurrence, Pr (Y = 1|X), to be systematically underestimated22 [King & Zeng, 2001a, 2001b]. Figure 3.1: An illustration of rare event bias. The dotted red line is the conditional density piX|Y (X|Y = 0), and the solid red line is the conditional density piX|Y (X|Y = 1); the blue dots lying on the x-axis are the observations with outcomes Y = 1, i.e. events that occurred; the vertical black solid line illustrates the decision boundary in Figure (3.2) Regarding the computation of the probability defined in Eq.(3.1), the presence of rare events would result in: 1) that the θˆ = (wˆ, bˆ) become biased estimates of the parameters of the true model θ = (w, b); and 2) that even if θˆ = (wˆ, bˆ) are unbiased, the probability computed on the basis of such MLE, Pr(Y = 1|θˆ), would still be an inferior estimator of the true Pr (Y = 1|θ) [King & Zeng, 2001b]. Figure 3.2: An illustration of rare event bias from another perspective. The blue dots are data of size n=100 simulated from a logistic model; the orange dots are the predicted values generated by fitting a LR to the simulated data; the solid black line is the decision boundary with the threshold set to τ = 0, or equivalently, sigm(X; θ) = 0.5. The overlap between the data and the predicted values in left bottom corner shows that estimation of piX|Y (X|Y = 0) is easier and more accurate than estimation of piX|Y (X|Y = 1). The first kind of problem, i.e., the bias in the MLE, can be expressed using the general definition of 22Note that the systematic underestimation of Pr (Y = 1|X) is equivalent to the systematic overestimation of Pr (Y = 0|X) [King & Zeng, 2001b]. 28 bias [Wackerly et al., 2008] bias(θˆ) = E[θˆ − θ] = E[θˆ]− θ. (3.2) According to King & Zeng [2001b], the bias term defined above tends to directly affect bˆ, the estimate of the intercept parameter, which then passes the effect on to other parameters in the logistic model. For this reason, the bias correction purposed by King & Zeng – as we can see in next section – concerns mainly the correction of bias(bˆ). The second kind of problem is induced by the uncertainties in the estimation of true θ in general. Such uncertainty can be caused by – for instance – sampling errors, which are then then amplified by the presence rare events. As a consequence, even if the MLE of θ is corrected for possible bias (King & Zeng denote such bias-corrected MLE as θ˜), the probability function based upon such estimators, p˜(xi; θ˜), will still be suboptimal [King & Zeng, 2001b]. More specifically, falling to take into account the uncertainty in estimation will result in the conditional distribution of the latent variable y∗, which is defined in Eq.(2.33) above, has smaller variance than what it should has if we incorporated the uncertainty into our estimation. This is illustrated in the Figure 3.3, where the right tail of the distribution of y∗ with uncertainty (the blue area) is much heavier than the one without uncertainty (the red area marked with Pr(Y = 1|θ˜)). Since the area to the right of the decision boundary (dotted line located at y∗ = τ = 0) is the probability of Y = 1, we can conclude that ignoring the uncertainty in the estimation would systematically underestimate the probability that rare events occur [King & Zeng, 2001b]. Figure 3.3: An illustration showing the uncertainty in estimation. The red area is the estimated probability using the unbiased parameter estimates θ˜, while the blue area is the true probability. 3.2 Conventional solutions 3.2.1 ReLogit As mentioned above, King & Zeng [2001b] found that the rare event bias affects the intercept directly. Therefore, one correction strategy is to first estimate the bias in bˆ, expressed as E[bˆ−b], and then subtract the estimated bias term from bˆ. Successful application of this strategy will give us the bias-corrected 29 estimate b˜ = bˆ− bias(bˆ). King & Zeng [2001b] have also shown that bias(bˆ) can be approximated as bias(bˆ) = E[bˆ− b] ≈ p¯− 0.5 np¯(1− p¯) (3.3) where p¯ = 1/n ∑n i=1 pi. Because of the presence of rare events, we have an average probability p¯ < 0.5, hence must Eq.(3.3) be negative. A negative bias means that the MLE bˆ is too small. As a consequence, logistic models that use such intercept estimate tend to underestimate the true Pr(Y = 1). This is consistent with the theory mentioned in previously section [King & Zeng, 2001b]. Furthermore, Eq.(3.3) also implies that increase in sample size n reduces the bias. This seems to support McCullagh & Nelder’s [1989] claim that the bias would become negligible with a larger sample size. However, the presence of p¯(1 − p¯) shows the bias-reducing effect of larger sample diminishes as p¯ → 0, i.e. as the event becomes more rare [King & Zeng, 2001b]. Concerning the issue of underestimation caused by estimation uncertainty, King & Zeng [2001b] proposed a Bayesian solution, which eliminates uncertainty by computing the following integral23 Pr (Yi = 1) = ∫ Pr (Yi = 1|θ)piθ (θ) dθ (3.4) where piθ(θ) is the density of parameter θ. We could also interpret Eq.(3.4) as the expected value of p˜ = Pr(Yi = 1|θ˜) by replacing θ with θ˜24. Although Eq.(3.4) above can be computed using iterative simulations, King & Zeng [2001b] showed that it can also be approximated as Pr (Yi = 1) ≈ p˜i + Ci (3.5) where p˜i is the probability computed from the bias-corrected estimators θ˜ and Ci is a correction term, which is defined as Ci = (0.5− p˜i) p˜i (1− p˜i) xiVar(θ˜)xTi (3.6) In Eq.(3.6) above, the estimation uncertainty in θ˜ is captured by the variance of θ˜. We can therefore view King & Zeng’s [2001b] correction for estimation uncertainty as a methods that adds more variance to the probability model p˜i(xi; θ˜). This correction is also consistent with what it is shown in Figure 3.3. According to King & Zeng [2001b], as an estimator of the true p, p˜i +Ci is superior to both p˜i and pˆi in the sense that p˜i + Ci has the smallest mean square error (MSE), which is defined as [Wackerly et al., 2008] MSE(θˆ) = E [ (θˆ − θ)2 ] = Var(θˆ) + bias(θˆ)2. (3.7) However, as Eq.(3.7) suggests, an estimator that minimises the MSE is not necessarily unbiased. This also applies to p˜i + Ci. According to King & Zeng [2001b], if we choose to interpret Eq.(3.4) as the expected value of p˜ = Pr(Yi = 1|θ˜), as mention above, the result will be E[p˜i] = pi + Ci, meaning that the correction term Ci now become the bias between p˜ and the true p. In this scenario, the bias-corrected estimator would be p˜i −Ci. We now have two superior estimators with different characteristics: p˜i +Ci 23Using integration to alleviate estimation uncertainty is common in Bayesian statistics, see for instance Wackerly et al. [2008, Sect. 16] 24Recall that for a continuous random variable Y , the expected value of a function of Y is defined as an integral [Wackerly et al., 2008]. Since Pr(Yi = 1|θ˜) can be treated as a function of θ˜, we have E[g(θ˜)] = ∫ ∞ −∞ g(θ˜)piθ˜(θ˜)dθ˜ where pi(θ˜) is the probability density function of θ˜. 30 that is superior in terms of MSE, and p˜i−Ci that is unbiased. In choosing between these two estimators, King & Zeng [2001b] suggested that we should use p˜i + Ci in the majority of cases. 3.2.2 Skewed link function In Section 2.2.4, we mentioned that in LR, the probability pi approaches 1 at the same rate as it approaches 0. This is because the symmetric form of the logistic density, as shown in Figure (2.4). Given the special structure of the rare event data, i.e data that contains very few ones and many zeros, it is reasonable to model rare events with a GLM using link functions that are non-symmetrical and skewed towards 1 [Van der Paal, 2014]. One such candidate is the complementary log-log link, a non-symmetrical link with the characteristic of approaching 1 at a mush faster rate than the logit link do, as shown in the Figure 3.4 below. The complementary log-log link (henceforth referred as ’the Cloglog link’) has the following formulation [Agresti, 2015; McCullagh & Nelder, 1989] g(µi) = ηi = log (− log (1− pi)) (3.8) which yields the following probability model p(X) = 1− exp  − exp   h∑ j=1 wjXj + b     . (3.9) In Section 2.2.4, we discussed the latent variable formulation of the LR, showing that the logit link is tied to the logistic distribution. As for the GLM with Cloglog link, it can be shown that the Cloglog link is closely related to the Type I extreme value distribution (also known as Gumbel distribution) [Agresti, 2015]. Note that the cdf of the standard extreme value distribution is formulated as Pr(yi = 1) = pi = F (ηi) = exp(− exp(−ηi)). (3.10) By inverting the cdf above, we obtain F−1 (pi) = ηi = − log (− log (pi)) . (3.11) The expression at the right-hand side of Eq.(3.11) is known as the log-log link. We can see that this link function is reversely related to the Cloglog link defined in Eq.(3.8) above. Therefore, the GLM with Cloglog link can be viewed as the case when the error term of the latent variable y∗ has a reverse extreme value distribution [Agresti, 2015; McCullagh & Nelder, 1989]. Concerning the interpretation of the parameter coefficient of the Cloglog model, Eq.(3.9) implies that, with wj > 0, the complement probability p(Xj + 1) is equal to the complement probability p(Xj) raised to the exp(wj) power [Agresti, 2015]. With some arrangements, we can express exp(wj) as exp(wj) = log p(Xj + 1) log p(Xj) . (3.12) That is, the term exp(wj) can be interpret as the relative increase in the log probability when Xj is increased by 1, given other predictors are fixed. Fitting a GLM with Cloglog link requires the use of Fisher scoring algorithm, a method slightly different from the Newton-Raphson algorithm used in fitting logistic models [Agresti, 2015]. 31 Figure 3.4: Comparison of the logit link and complementary log-log link 3.2.3 Cost-sensitive SVM In contrast to other classifiers, SVM is expected to have an inherent advantage in handling imbalanced data. This advantage is originated from the Lagrange multipliers αi. Recall that according to Eq.(2.63), the optimal decision boundary is determined by the function fˆ(x; Wˆ , bˆ), in which the weight parameter Wˆ is replaced by αi. Because of the constraint ∑n i=1 αiyi = 0 in the optimisation problem in Eq.(2.50), the magnitude of those αi associated with training observations coming from the minority class must be larger than that of those associated with observations from the majority class. As a result, support vectors from the minority class are given more weights than those from the majority class in our computation of the optimal decision boundary [Akbani et al., 2004; Batuwita & Palade, 2013]. However, several studies have reported that running SVM on imbalanced or rare event data would still yield suboptimal decision boundaries, despite the method’s inherent weighting characteristics described above [Akbani et al., 2004; Batuwita & Palade, 2010; Veropoulos et al., 1999]. Such outcome occurs because of the tuning parameter C, which also can be interpreted as the penalty for misclassification, assigns the same cost to the classification errors in both the minority and the majority class. Recall that the optimisation problem in Eq.(2.54) has two conflicting goals: maximise the margin and minimise the training error. In an imbalanced training set, observations from the majority class are more dense than those from the minority class around the class boundary region. Therefore, in the setting that misclassifications in both classes are penalised equally, prioritising the minimisation of classification errors in the majority class and ignoring the minority class would yield the best result [Batuwita & Palade, 2013; Ferna´ndez et al., 2018]. In addition, as shown in Figure 3.2, it is common that in rare event data, the minority event class overlaps with the majority non-event class. This could amplify the overrepresentation of the non-event class around the decision boundary, leading to a decision boundary that favours the non-events even more [Ferna´ndez et al., 2018]. For these reasons, if rare events are presented in the training set, it is possible that the standard SVM results in a model that classifies all observations as non-events [Akbani et al., 2004]. One way to avoid such suboptimal decision boundaries is to construct a cost-sensitive SVM with different error costs (DEC) [Batuwita & Palade, 2013; Ferna´ndez et al., 2018]. This means that we use 32 two separate misclassification costs for each of the classes. Denoting the misclassification cost for the positive class as C+ and that for the negative class as C−, we can modify Eq.(2.53) to min W,b,ξ1,...,ξn 1 2 ‖W‖2 + C+ n∑ i|yi=+1 ξi + C − n∑ i|yi=−1 ξi (3.13) subject to yi   h∑ j=1 wjxij + b   ≥ 1− ξi, i = 1, . . . , n ξi ≥ 0, i = 1, . . . , n which corresponds to the following dual optimisation problem max α n∑ i=1 αi − 1 2 n∑ i=1 n∑ k=1 αiαkyiyk 〈xi,xk〉 (3.14) subject to 0 ≤ α+i ≤ C + , 0 ≤ α−i ≤ C − and b = n∑ i=1 αiyi = 0 i = 1, . . . , n. where α+i and α − i are the Lagrange multipliers or weights for the positive and negative class, respectively [Batuwita & Palade, 2013]. According to Akbani et al. [2004], a good role of thumb is to set misclas- sification costs according to the minority-to-majority ratio. That is, assuming the positive class is the minority class, the ratio C+/C− should be equal to the fraction of positive class to the negative class [Akbani et al., 2004; Batuwita & Palade, 2013]. 3.3 Sampling methods for alleviating rare events bias Although this thesis concerns the capability of different statistical methods (and their modifications) in handling rare event data, the reader should bear in mind that it is also possible to address the problems caused by the rare events using data-level preprocessing, or more specifically, sampling methods. In contrast to the modifications described in Section 3.2, which also known as the internal methods, the sampling methods are often termed as external methods [Batuwita & Palade, 2013]. The aim of such methods is to use various sampling techniques to generate a more balanced data set. These techniques includes: 1) undersampling, which means that we balance the data by creating a subset of the original data through randomly dropping observations that belong to the majority class; 2) oversampling, which means that we increase the observations in the minority class using replication or interpolation; and 3) hybrids, which combines the two sampling techniques above [Ferna´ndez et al., 2018]. Since sampling methods alter the structure of the original data, fitting a model to such altered data is often accompanied by adjustments in some of the parameter estimates [King & Zeng, 2001a, 2001b]. Methods for oversampling and undersampling are common in econometrics. The choice-based sam- pling, for instance, is a method of drawing independent random samples from the classes Y = 1 and Y = 0 separately. Another example is the case-cohort study, a method of collecting all ones in the data set and randomly selecting zeros [King & Zeng, 2001a]. In the case of rare events, the case-cohort format is equivalent to randomly dropping the non-events while keeping all the events [King & Zeng, 2001a], i.e. a version of undersampling methods. According to King & Zeng [2001a, 2001b], if we fit a logistic 33 regression model to such undersampled data, we have to adjust for the intercept estimate25, bˆ. They therefore proposed the following adjustment bˆ− log (( 1− κ κ )( y¯ 1− y¯ )) (3.15) where κ is fraction of events in the population, and y¯ is the fraction of events (or rarity) in the sample. As a rule of thumb, King & Zeng [2001a] suggested that the number of non-events should not exceed 2/3 or 5/6 of the data set. Synthetic minority oversampling technique (SMOTE) is another popular sampling method [Chawla et al., 2002; Ferna´ndez et al., 2018]. In contrast to the conventional oversampling technique, i.e. oversam- pling with replacement, SMOTE oversamples by generating synthetic data that assigned to the minority class. The procedure of generating new data involves the use of k-nearest neighbours (KNN) algorithm and can be informally described as followed. Before running SMOTE, we prespecify the number of ob- servation that have to be generated for a balanced data. The SMOTE-algorithm then randomly select an observation xi from the minority class, and compute the KNN of that observation in the training set26. After that, the algorithm proceeds by calculating the distances between xi and its KNN. New data points are then generated within these distances. This part of generation procedure is done by multiplying each of the k distances by a random number that ranges from 0 to 1. In sum, for a single observation xi, k synthetic data points are generated. The algorithm then repeats itself and stops when the number of the synthetic observations equals the number we prespecified [Chawla et al., 2002; Ferna´ndez et al., 2018]. To achieve the best possible classification performance, Chawla et al. [2002], the developers of SMOTE, suggest a combination of SMOTE and undersampling of the majority class. 25The fact that we only need to adjust bˆ is because that, for a logistic model, the parameter estimates associated with the predictors, wˆ1, . . . , wˆh, are statistically consistent estimates of w1, . . . , wh. The adjustment is therefore not needed. Note that this does not apply to other models such as probit and linear regression [King & Zeng, 2001a] 26The SMOTE sets k = 5 by default, meaning that the five nearest neighbour of xi is computed. 34 Chapter 4 Methodology In this chapter, I present a full list of statistical methods evaluated in this study and explain how the evaluation was performed. Generally speaking, the methodology of this thesis combines a simulation study with an empirical analysis. The purpose of the simulation study is to investigate how changes in rarity, number of predic- tors, and sample size would influence the predictive performance of the selected classifiers, while the aim of the empirical analysis is not only to investigate how different classifiers perform in real data, which seldom has the ideal distribution required by these methods (e.g. normality assumption), but also to add a practical dimension to the study. As Hand [2006] had pointed out, there is no single classifier that is universally preferable to the others. This is especially true in those cases where real data is used for comparison: depending on the underlying nature of the outcome variable, the cost of misclassification may differ from case to case. Therefore, the empirical analysis and the simulation study are of equal importance. The reader must not treat the former as merely a complement to the latter. 4.1 Statistical models and methods The statistical methods evaluated in this study can be grouped into two families: the GLM family and the SVM family, as shown in Table 4.1 below. In general, the fitting procedure of the methods from the GLM family was performed in R, while models from the SVM family were fitted in Python using the scikit-learn package (Version 0.24.0)27. More specifically, among the methods from the GLM family, the LR and Cloglog were both fitted using the glm() function in Base R, while the ReLogit model was fitted using the zelig() function provided by the Zelig or zeligverse package. As for the methods from the SVM family, the training and calibration procedures were performed using the functions SVC, GridSearchCV, and CalibratedClassifierCV combined. The function SVC was used to define the SVM model. By specifying the kernel parameter in the SVC function differently, I con- structed four SVMs with different kernels. The cost-sensitive versions of SVM were obtained by activat- ing the option class_weight = "balanced", which automatically adjust weights inversely proportional to the class frequency [Pedregosa et al., 2020b]. Both the standard and the cost-sensitive SVMs were trained using the function GridSearchCV, which also performed a 5-fold cross-validation simultaneously 27The reasons behind my choice of dividing the fitting procedures of the GLMs and SVMs into two programming languages – instead of fitting both GLMs and SVMs using one of these languages alone – are twofold. Firstly, the function that implements the ReLogit model exists only in R and STATA. And secondly, fitting the SVM requires a huge amount of time. It therefore is more convenient to train a SVM in Python, since the Python IDE allows me to run multiple session in parallel (this feature is not available for the free version of RStudio), so that I can use the waiting time more effectively. 35 to find the optimal value of parameter C among the following five candidates28: {0.001, 0.1, 1, 5, 10, 50}. The scores of the four SVMs can be obtained by using the decision_function method (included in both SVC and GridSearchCV) after the model training was completed. Finally, I used the the function CalibratedClassifierCV to implement Platt scaling and obtain the calibrated probabilistic outputs for the SVMs. Unlike Platt [2000], who used a 3-fold cross-validation for obtaining parameters (A,B), I used a 5-fold cross-evaluation to obtain more accurate estimates for the parameters. As for the other pa- rameters in CalibratedClassifierCV, the default setting was used. Running the function on its default setting assures that the Platt scaling, especially that part of the procedure concerning the estimation of parameters (A,B), is implemented in accordance with the description provided in Section 2.3.4. Table 4.1: List of methods included in this study Name Description GLM LR Standard logistic regression (or GLM with logit link) ReLogit Logistic regression for rare event, with correction de- veloped by King & Zeng [2001a, 2001b] Cloglog GLM with complementary log-log link SVM Linear SVM Standard SVM with linear kernel Gaussian SVM Standard SVM with Gaussian kernel Sigmoid SVM Standard SVM with Sigmoid kernel Polynomial SVM Standard SVM with third-degree polynomial kernel Linear SVM-DEC The cost-sensitive version of Linear SVM Gaussian SVM-DEC The cost-sensitive version of Gaussian SVM Sigmoid SVM-DEC The cost-sensitive version of Sigmoid SVM Polynomial SVM-DEC The cost-sensitive version of Polynomial SVM 4.2 Simulation study 4.2.1 Experiment setup Data sets used in the simulation study were generated in R29. These synthetic data sets vary in three aspects: sample size, rarity (i.e., the fraction of 1 in the data set) and dimensions (i.e., the number of predictors). Since both sample size and the fraction of events can incur rare event bias, as noted in Section 3.1, it is natural to include both these two aspects into the study. As for the dimensional aspect, the point here is to investigate whether the rare events bias can be amplified by increased model complexity. Additionally, all predictors in the simulated data sets are continuous, Gaussian distributed random variables with different means and variances. To generate the predictors, I first specified a covariance matrix by randomly generating numbers from a uniform distribution and then use these numbers as the covariances between different predictors. The means of different predictors are randomly selected integers. I then used the vector of means and the covariance matrix to construct a design matrix 28The number of candidates was limited to 5 because training a SVM classifier on large data sets is extremely time- consuming, not to mention performing a 5-fold cross-validation at the same time. 29Codes and data files for replication can be accessed in this link: https://github.com/LukasHMa/bsc thesis in statistics 36 using the mvrnorm() function from the MASS package. By using mvrnorm(), I ensured that all predictors are Gaussian distributed random variables. After generating the predictors, I proceed to the simulation of rare events data. For this task, I followed King & Zeng’s [2001a, 2001b] approach, by using the logistic model as the true model and testing different values of the intercept b to control the fraction of 1 in the data set. The levels of rarity presented in the simulated data sets can be informally described as: moderate (y¯ ≈ 16%), severe (y¯ ≈ 3.4%), and extreme (y¯ = 0.8%). While the values of the intercept are prespecified numbers, the vector of parameters associated with different predictors, θ, was generated randomly from a Gaussian distribution. The linear combination θTX, once being specified, was transformed to probabilities using the sigmoid function. Finally, to generate the outcome variable y, I first generated a uniformly distributed random number ranging from 0 to 1, and then compared it with the probabilities generated from the logistic model. In this setting, if the random number is smaller than or equal to the probability generated from the true model, then it will be coded as y = 0, and y = 1 otherwise. A rare events data is generated after completing all these steps. A total of seven distinctive synthetic data sets were generated. These data sets can be grouped into three different scenarios based on their size and the number of predictors used to create them. The characteristics of the seven simulated data sets are shown in Table 4.1 below. Table 4.2: Charateristics of simulated data Name Intercept Rarity (Full) Rarity (Train) Rarity (Test) Obs Predictors Scenario I small_a −3.6 16% 16.19% 15.56% 150 5 small_b −6.08 3.33% 3.77% 2.27% 150 5 Scenario II med_a −3 16.2% 16.29% 16.00% 500 10 med_b −5 3.4% 3.43% 3.33% 500 10 med_c −7 0.8% 0.857% 0.666% 500 10 Scenario III large_a −3.32 16.2% 16.14% 16.33% 1000 20 large_b −5 3.5% 3.43% 3.67% 1000 20 In the subsequent analysis, the data was divided into a training set and a test set. Dividing data in this way would enable us to evaluate both the in-sample and out-of-sample classification performance of different methods. I decided to reserve 70 percent of the observations for the training set, while the remaining 30 percent were used as testing. Considering the special characteristic of rare events data, the 70:30 ratio is the most appropriate choice in order to ensure that some observations from the event class were included in the test set. Additionally, to ensure that the training set has the structure similar to the original data set, I applied stratified sampling when selecting observations for the training set. Generally speaking, stratified sampling is a method that selects samples by first dividing the original data into several mutually exclusive, exhaustive strata – usually performed on the basis of certain categorical variables in the original data – and then drawing random samples within each of these strata [Levy & Lemeshow, 2008]. Since I only have continuous predictors in my simulated data sets, the stratification in my case was performed on the basis of the outcome variable. This means that the events and non-events in the training set were sampled simultaneously but independently from each other, as in the case of case-control studies. After the sampling for the training set is completed, the remaining observations form the test set. 37 4.3 Empirical analysis 4.3.1 Data Table 4.3: Variables in the MID data Variables Descriptions Militarised dispute y = 1 if the dyad (pair of countries) was engaged in a militarised interstate dispute Contiguous y = 1 if the pair of countries are neighbours Allies y = 1 if the pair of countries are allies Foreign policy Score measuring similarity between two countries in their foreign policy portfolio Balance of Power Score measuring the balance of the military power between two countries Max. democracy The maximum degree of democracy in the pair of countries, indicat- ing which of the two countries has the higher degree of democracy Min. democracy The minimum degree of democracy in the pair of countries, indicat- ing which of the two countries has the lower degree of democracy Max. trade The maximum degree of trade dependency in the pair of countries, indicating which of the two countries has the higher degree of trade dependency Min. trade The minimum degree of trade dependency in the pair of countries, indicating which of the two countries has the lower degree of trade dependency Year since dispute The number of years since the last dispute Major Power y = 1 if the pair of countries includes a major power The data I used for the empirical analysis is a version of Militarized Interstate Dispute Data(MID) compiled by King & Zeng [2001a, 2001b]. In general, the MID data is associated with the Correlation of War project and has many versions. The version of MID data used in King & Zeng [2001a, 2001b] was based upon MID Version 3.0, which is presented in the non-directed dyad-year30 format [Ghosn et al., 2004]. King & Zeng’s version of MID consists 303,814 dyads (i.e., pair of countries) covering the period 1946–1992. The outcome variable is the event ”militarised interstate dispute”, which is coded as y = 1 if two particular states had engage in military conflict and y = 0 otherwise. A total of 1,042 military conflicts were recorded in this period, implying a rarity of 0.3423 percent. Other than variables that already existed in the original MID data, King & Zeng had also included variables that are typical for the research in peace and conflict studies, such as the balance of military power, degree of democracy and trade dependency etc. A full list of variables included in King & Zeng’s [2001a] version of MID data is provided in Table 4.331. 30The use of dyad-year data is very popular in the quantitative analysis of international conflicts. The main difference between the non-directed dyad-year and the directed one is the latter codes the same military interstate dispute twice. In other words, if Country X and Country Y engaged in a military conflict in a certain year, then this conflict is recorded once in Country X, and once in Country Y. The non-directed dyad-year data does not make such distinction. For this reason, the directed dyad-year data could be very useful in studying the deliberate initiation of military conflict [Bennett & Stam, 2000]. 31Variables in the original data set are unlabeled. The labels of the variables in Table 4.3 were recreated based on King & Zeng [2001a, 2001b]. There are more variables included in the data set than what Table 4.3 shows, but because King & Zeng [2001a, 2001b] did not used all variables in the original data (they had even created several new variables from the original data) in their analysis, there is no way for me to determine what these variables are. For this reason, the variables not included in King & Zeng [2001a, 2001b] were omitted from Table 4.3. The original data set can be obtained from Gary King’s dataverse, through the following link: https://doi.org/10.7910/DVN/RNSU7V. 38 4.3.2 Initial preparation before analysis As mentioned in Section 2.3 above, the fitting procedure of SVM is equivalent to solving a dual optimi- sation problem, and it is well documented in the statistical literature that such optimisation procedure is exceedingly time consuming (see for instance Platt [1998]; Tsang et al. [2005] and Pedregosa et al. [2020b]). With a data set consisting over 300,000 observations, the amount of time required for training a SVM would be substantial, not to mention the additional time needed for the subsequent tuning of the parameter C, which is performed using cross-validation. A reasonable choice is therefore to drop 90 percent of the observations in the MID data and only use the remaining 10 percent for the empirical analysis. As in the case of the simulation study, I implemented stratified sampling to ensure that the 10 percent subset data contains the same statistical information, most importantly the event rarity, y¯, as the full data. As showed in Table 4.4, the y¯ in the 10 percent subset data is nearly identical to that in the full data. By comparing the Table 4.4 and Table A.1 in Appendix A, it could also be shown that the 10 percent subset data generally maintained the structure of the full data, indicating that the stratified sampling is a reliable strategy in creating subsets of the original data. For this reason, I implemented the stratified sampling once again in the train-test-split of the empirical data, using the same 70:30 ratio as in the case of the simulation study. Table 4.4: Descriptive statistics for 10% of the MID data Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max Militarised dispute 30,381 0.0034 0.0584 0 0 0 1 Contiguous 30,381 0.0402 0.1963 0 0 0 1 Allies 30,381 0.0885 0.2840 0 0 0 1 Foreign policy 30,381 0.7728 0.1726 −0.0450 0.6667 0.9098 1.0000 Balance of Power 30,381 0.3633 0.2985 0.0002 0.0935 0.6011 1.0000 Max. democracy 30,381 3.2838 7.2816 −10 −6 10 10 Min. democracy 30,381 −5.0190 5.6587 −10 −9 −5 10 Max. trade 30,381 −4.8227 2.5771 −8.6840 −8.6840 −2.8145 0.2954 Min. trade 30,381 −5.7636 2.5249 −9.5361 −9.5361 −3.7720 −0.8995 Year since dispute 30,381 17.6006 11.5144 0 8 26 46 Major Power 30,381 0.0837 0.2769 0 0 0 1 4.4 Evaluation metrics 4.4.1 Conventional evaluation metrics for classification performance When evaluating the classification performance of a classifier, a common approach is to summarise the classification result in a confusion matrix. Table 4.5 below shows a typical confusion matrix, where the rows shows the predicted class labels and the columns the true labels [Agresti, 2015; James et al., 2013]. Using the information provided by the confusion matrix, we can compute a number of different evaluation metrics. Nearly all evaluation metrics used in this study are derived from the confusion matrix. We can start with measures that are frequently used in the statistical literature. These include sensitivity, specificity, receiver operating characteristic curve (ROC) and area under the ROC curve (AUC). 39 Table 4.5: A confusion matrix T ru e la b el Predicted label True Positive False Negative False Positive True Negative Sensitivity and specificity Sensitivity, also known as recall and true positive rate, shows the prediction accuracy of the positive class. It is defined as the fraction of correctly classified ones to true ones (i.e., true positives plus false negatives), that is [Agresti, 2015] Sensitivity = TP TP + FN = Pr(Yˆ = 1|Y = 1) (4.1) On the contrary, specificity shows the prediction accuracy of the negative class and defined as the fraction of correctly classified zeros to true zeros (i.e., true negatives plus false positives), that is [Agresti, 2015] Specificity = TN TN + FP = Pr(Yˆ = 0|Y = 0) (4.2) ROC/AUC In Section 2.2.4, we mentioned that we could obtain different classification results by altering the value of the threshold τ (or pˆ0 if expressed as predicted probability). Hence, the sensitivity and specificity for a certain classifier might vary depending on the threshold which we selected. This variation can be visualised using a ROC curve, which displays the true positive rate (i.e., sensitivity) and the false positive rate (i.e., 1 − specificity, also known as Type I error) for all possible thresholds [James et al., 2013]. With the true positive rate on the y-axis and the the false positive rate on the x-axis, the perfect classification performance is illustrated by the top-left corner of the ROC space. A good classifier should therefore have its ROC curve bent towards this point as close as possible [Ferna´ndez et al., 2018]. In addition, when evaluating a certain classifier, we only consider those classifiers whose ROC curve lies above the upward diagonal line in the ROC space, since this line represents the expected performance of a ”no information” classifier, that is, a classifier that makes random predictions [James et al., 2013; Kelleher et al., 2020]. Other than evaluating visually, we can also quantify the results shown in the ROC curve by calculating the integral of the curve. This yields the area under the ROC curve (AUC), a numeric measure that ranges from 0 to 1. Note that since AUC = 0.5 corresponds to the upward diagonal line in the ROC space, a good classifier should therefore have a AUC score above this value [James et al., 2013] or, following from the recommendation of Kelleher et al. [2020], a AUC score above 0.7. 40 4.4.2 Evaluation metrics for imbalanced classification According to Ferna´ndez et al. [2018], the conventional evaluation metrics for classification performance, such as the overall accuracy and the ROC/AUC, could be misleading under the presence of class imbal- ance and rare events. As mentioned in Section 3.1, the abundance of zeros in the data set makes the estimation of piX|Y (X|Y = 0) very easy. This would cause the overall classification accuracy and the ROC curve to be dominated by the success in classifying non-events, while the contribution of successful classifications of the event class to these two metrics is merely marginal. Since we are more interested in finding a classifier that can correctly identify the event class in rare event modelling, the use of overall accuracy and ROC curve as evaluation metrics would be inappropriate [Ferna´ndez et al., 2018]. When choosing the evaluation metrics for classification performance, it is useful to distinguish be- tween metrics for nominal class prediction and for probabilistic prediction. The former evaluates a classifier by comparing the number of predicted class labels to the true labels, while the latter compares the probabilities generated from the classification model to the true outcomes [Ferna´ndez et al., 2018]. My choice of evaluation metrics in this thesis was based upon the recommendations of Ferna´ndez et al. [2018]. Among nominal metrics, Ferna´ndez et al. [2018] suggested the use of precision and recall,F1-score and balanced accuracy (BA); Among probabilistic metrics, they suggested the use of Brier score and precision-recall curve. Detailed descriptions about these metrics are provided below. Precision and recall Precision and recall can be calculated using the information provided by the confusion matrix. As mentioned above, recall is merely an alternative name for sensitivity and hence is defined as in Eq.(4.1). The precision is defined as the fraction of correctly classified ones to the total number of observations that predicted as one (i.e., true positives plus false positives), that is Precision = TP TP + FP = Pr(Y = 1|Yˆ = 1). (4.3) As mentioned above, when modelling rare events data, we pay more attention to the classification rate of the positive event class than the negative non-event class. Metrics such as precision and recall contain only information about the positives, making them appropriate for our task [Ferna´ndez et al., 2018; James et al., 2013; Murphy, 2012]. F1 score Other than precision and recall, we may also be interested in the trade-off between these two metrics, or more specifically, the trade-off between correctness (i.e., Pr(Y = 1|Yˆ = 1)) and coverage (i.e. Pr(Yˆ = 1|Y = 1)) in the prediction of ones [Ferna´ndez et al., 2018]. One way of evaluating this trade-off is to calculate the F1 score, which is defined as F1 = 2 · Precision ·Recall Precision + Recall (4.4) The F1 score is also known as the harmonic mean of precision and recall. It has the range [0, 1], with F1 = 1 as the perfect trade-off between the two metrics. Therefore, a good classifier should have a F1 score close to 1 [Ferna´ndez et al., 2018; Murphy, 2012]. 41 Precision-recall curve Trade-off between precision and recall can also be visualised graphically using the precision-recall curve. Similar to the ROC curve, the precision-recall curve illustrates the variations in precision and recall caused by our choice of thresholds by plotting the two measurements against each other, with the precision on the y-axis and the recall on the x-axis. The top-right corner of the plot symbolises the perfect trade-off between the two metrics. Therefore, a good classifier should have its precision-recall curve bent towards that corner [Ferna´ndez et al., 2018; Murphy, 2012]. Since the choice of threshold could only be meaningful when put into a context, the precision-recall curve is only presented in the empirical analysis. Balanced accuracy The balanced accuracy (BA) or average class accuracy is defined as the arithmetic mean of sensitivity and specificity, as shown in the following equation [Brodersen et al., 2010; Ferna´ndez et al., 2018] BA = Sensitivity + Specificity 2 = 1 2 ( TP TP + FN + TN TN + FP ) (4.5) In other words, the BA shows the average of prediction accuracy obtained on either class [Brodersen et al., 2010]. The BA is equivalent to the overall accuracy when the classes are balanced. Under class imbalance or rare events, the BA tends to be smaller than the overall accuracy. Because of its sensitivity to class imbalance, BA is considered a more appropriate measure than the overall accuracy in evaluating imbalanced classification [Brodersen et al., 2010; Ferna´ndez et al., 2018; Kelleher et al., 2020]. Similar to other measures for prediction accuracy, the BA has the range [0, 1], with 1 as the perfect score. Note that only classifiers that achieved BA > 0.5 are worth to consider, since BA = 0.5 implies that the classifier evaluated always assigns the observations to one class [Kelleher et al., 2020]. Brier score Brier score is a commonly used metric in evaluating the probabilistic output of a certain classifier. It is usually formulated as the MSE between the true outcome and the predicted probability [Ferna´ndez et al., 2018] BS = 1 n n∑ i=1 (pˆi − yi) 2 (4.6) where pˆi is the predicted probability of the i-th observation. The MSE format also implies that the Brier score is a loss function, meaning that a good classifier should have a low Brier score [Ferna´ndez et al., 2018]. 42 Chapter 5 Results from the simulation study 5.1 Scenario I The classification performance of the GLMs and SVMs in this scenario are summarised in Table 5.1 and 5.2, respectively. Data sets in this scenario are characterised by a small sample size and a few number of predictors. The class imbalance of the data sets included in this scenario can be described as moderate (y¯ = 16%) and severe (y¯ = 3.33%). As shown in Table 5.1 below, the GLMs using the alternative threshold pˆ0 = y¯ achieved a better in- sample performance in recall, F1 score, and BA than did those using the conventional threshold pˆ0 = 0.5. As for the out-of-sample performance, the results are mixed. Additionally, at least in the small_a data set, the GLMs using the alternative threshold had a lower precision in general than did the GLMs using the conventional threshold. Among the GLMs evaluated in the small_a data set, the ReLogit model using the conventional threshold had a better in-sample classification performance than did the other two GLMs that used the same threshold. However, in terms of out-of-sample prediction, the GLM with Cloglog link had the best result in all evaluation metrics except for the Brier score. In addition, the Cloglog model also performed better than the ReLogit when using the alternative threshold. One might be tempted to draw the conclusion that the Cloglog model is the best GLM in this data set, but such a conclusion is premature, considering the fact that all GLMs have the same performance in recall regardless of the threshold they used and that the out-of-sample Brier score of the Cloglog model is the highest among the three GLMs evaluated. Therefore, the results from Table 5.1 provide little evidence for the claim that the Cloglog model is better than other GLMs in a data set characterised by a small sample size and a moderate class imbalance. As for the SVMs, Table 5.2 shows that all standard SVMs, except for the one with third-degree polynomial kernel, did not detect a single event in both the training and the test set. This is an unexpected result and seems to contradict the theoretical result in Akbani et al. [2004] about the inherent advantage of SVM over other classifiers32 in handling imbalanced data. Even if the SVM with polynomial kernel – the only standard SVM that managed to detect some events in this case – achieved better precision (both in-sample and out-of-sample) than the GLMs, the classifier still had a much lower recall (or sensitivity) than its GLM counterparts. The cost-sensitive SVMs performed much better than the standard versions. They all managed to 32The original sentence in Akbani et al. [2004, p.42] is ”This [the inherent weighting through α] shows why SVM does not perform too badly compared to other machine learning algorithms for moderately skewed datasets”. Since machine learning classifiers are more sophisticated and flexible than the traditional classification methods, I expected this also holds for the GLMs. 43 Table 5.1: Classification performance the GLMs (Scenario I) In-sample Out-of-sample Precision Recall F1 BA Brier Precision Recall F1 BA Brier Data: small_a LR 0.5833 0.4118 0.4828 0.6775 0.094748 0.3333 0.2857 0.3077 0.5902 0.167916 (0.4286) (0.8824) (0.5769) (0.8275) (0.2000) (0.4286) (0.2727) (0.5564) ReLogit 0.6364 0.4118 0.5000 0.6832 0.094953 0.3333 0.2857 0.3077 0.5902 0.166617 (0.4167) (0.8824) (0.5660) (0.8219) (0.1765) (0.4286) (0.2500) (0.5301) Cloglog 0.5833 0.4118 0.4828 0.6775 0.094000 0.4000 0.2857 0.3333 0.6034 0.172697 (0.4412) (0.8824) (0.5882) (0.8332) (0.2000) (0.4286) (0.2727) (0.5564) Data: small_b LR 0 0 0 0.5000 0.034375 0 0 0 0.4767 0.040913 (0.1000) (0.7500) (0.1765) (0.7426) (0.0476) (1) (0.0909) (0.7674) ReLogit 0 0 0 0.5000 0.035614 0 0 0 0.4767 0.040754 (0.0678) (1) (0.1270) (0.7304) (0.0294) (1) (0.0571) (0.6163) Cloglog 0 0 0 0.5000 0.034299 0 0 0 0.4767 0.046425 (0.1000) (0.7500) (0.1765) (0.7426) (0.0476) (1) (0.0909) (0.7674) Note: Values without brackets are the classifications results obtained using the conventional threshold pˆ0 = 0.5. Values in brackets are classifications results obtained using the threshold pˆ0 = y¯. The best values in each column given to the specified thresholds are in boldface. correctly classify some events in this data set, especially the one with Gaussian kernel, which managed to classify 94 percent of the true events in-sample. However, in terms of out-of-sample prediction, the performance of the Gaussian SVM-DEC dropped sharply, indicating a possible overfitting in the model training procedure. Since the Gaussian SVM-DEC also has the highest value of C among the cost-sensitive SVMs, meaning that the method was fitted less closely to the data than did the others, the issue of overfitting might be originated from the Gaussian kernel. Meanwhile, the out-of-sample performance of the sigmoid SVM-DEC is even better than its in-sample performance. Nevertheless, this could be attributed partly to chance and partly to the small sample size – with only 7 true events in the test set, one additional correctly classified event would impose a large impact on the sensitivity. The data set small_b is characterised by small sample size and severe class imbalance. In the presence of an event rarity equal to 3.33 percent, all GLMs that used the conventional threshold failed to identify observations from the event class in the training set and hence also those in the test set. This has also occurred in the standard SVM. Meanwhile, comparing with these two versions of classifiers, the GLMs using the alternative threshold and the cost-sensitive SVMs had much better in-sample classification performance. Among these classifiers, all cost-sensitive SVMs and the Relogit model have even achieved the perfect classification performance. However, both GLMs with alternative threshold and cost-sensitive SVMs showed poor performance in terms of precision. There are also some signs of severe overfitting for the Gaussian SVM-DEC, whose F1 score dropped from 0.73 to 0. Compared to the Gaussian version, the polynomial SVM-DEC had a rather moderate drop in the F1 score, from 0.73 to 0.5. However, as in the previous case, this drop in out-of-sample F1 score might also be caused by the small sample size, since the test set in this setting only contains one event. 44 Table 5.2: Classification performance the SVMs (Scenario I) In-sample Out-of-sample Precision Recall F1 BA Brier Precision Recall F1 BA Brier Data: small_a Linear SVM C = 0.001 0 0 0 0.5000 0.102113 0 0 0 0.5000 0.148378 Gaussian SVM C = 0.001 0 0 0 0.5000 0.096858 0 0 0 0.5000 0.133019 Sigmoid SVM C = 0.001 0 0 0 0.5000 0.133561 0 0 0 0.5000 0.130851 Polynomial SVM C = 0.1 1 0.29 0.45 0.6471 0.098911 0.33 0.14 0.2 0.5451 0.139052 Linear SVM-DEC C = 0.1 0.44 0.88 0.59 0.8332 0.101795 0.17 0.43 0.24 0.5169 0.141892 Gaussian SVM-DEC C = 10 0.59 0.94 0.73 0.9081 0.103642 0.14 0.14 0.14 0.4925 0.135437 Sigmoid SVM-DEC C = 5 0.15 0.47 0.23 0.4853 0.134791 0.25 0.71 0.37 0.6598 0.132932 Polynomial SVM-DEC C = 0.001 0.56 0.29 0.38 0.6243 0.104309 0.33 0.14 0.20 0.5451 0.126985 Data: small_b Linear SVM C = 0.001 0 0 0 0.5000 0.036713 0 0 0 0.5000 0.023466 Gaussian SVM C = 0.001 0 0 0 0.5000 0.032298 0 0 0 0.5000 0.024147 Sigmoid SVM C = 0.001 0 0 0 0.5000 0.036110 0 0 0 0.5000 0.022768 Polynomial SVM C = 0.001 0 0 0 0.5000 0.037755 0 0 0 0.5000 0.020313 Linear SVM-DEC C = 50 0.11 1 0.21 0.8480 0.036468 0.06 1 0.11 0.8023 0.022785 Gaussian SVM-DEC C = 5 0.57 1 0.73 0.9853 0.037441 0 0 0 0.4767 0.023346 Sigmoid SVM-DEC C = 0.001 0.04 1 0.07 0.5000 0.037380 0.02 1 0.04 0.5000 0.023031 Polynomial SVM-DEC C = 5 0.57 1 0.73 0.9853 0.039142 0.33 1 0.50 0.9767 0.019630 Note: The best values in each column are in boldface. 5.2 Scenario II The classification performance of the GLMs and SVMs in this scenario are summarised in Table 5.3 and 5.4, respectively. Compare to Scenario I, data sets in this scenario have more predictors – hence increased dimensions – and larger sample size. The class imbalance presented in the three data sets in this scenario can be described as moderate (y¯ = 16.2%), severe (y¯ = 3.4%), and extreme (y¯ = 0.8%). Concerning the classification performance of the GLMs, Table 5.3 shows that the GLMs only managed to identify some events in the data set med_a, which is characterised by moderate class imbalance, while failed to do so in others. In data set med_a, all three GLMs have achieved a better in-sample and out-of- sample precision than what they did in the data set small_a. Their performances in all other metrics, however, have decreased in both the training and the test set. It is likely that this drop of classification performance was caused by the increased model complexity in the data set med_a, since the rarities of the data sets med_a and small_a only differ in 0.2 percentage points. As in Scenario I, it is hard to tell which of these three GLMs is superior, since they all produced identical performance in out-of-sample prediction using the conventional threshold. Using the alternative threshold, the Cloglog model managed to achieve a slightly better out-of-sample performance in precision, F1 score and BA. From Table 5.3, we can also observe that the LR and ReLogit models have identical classification performance. These two methods only differed in the Brier score, with King & Zeng’s [2001a] ReLogit model performing slightly better than the LR. In addition, the results from data sets med_b and med_c seem to suggest that the GLMs using the conventional threshold are not suitable for modelling events that occur less than 5 percent of the time. There are some improvements in their classification performance if the alternative threshold is used, most noticeably in the data set med_b. As for data set med_c, only the ReLogit model managed to achieve a out-of-sample BA above 0.5. As for the SVMs, their classification results in Scenario II are to some extent different from those 45 Table 5.3: Classification performance the GLMs (Scenario II) In-sample Out-of-sample Precision Recall F1 BA Brier Precision Recall F1 BA Brier Data: med_a LR 0.6818 0.2632 0.3798 0.6196 0.107824 0.6667 0.0833 0.1482 0.5377 0.110802 (0.3419) (0.7018) (0.4598) (0.7195) (0.3333) (0.7083) (0.4533) (0.7192) ReLogit 0.6818 0.2632 0.3798 0.6196 0.108114 0.6667 0.0833 0.1482 0.5377 0.110790 (0.3228) (0.7193) (0.4457) (0.7129) (0.3091) (0.7083) (0.4304) (0.7034) Cloglog 0.6842 0.2281 0.3421 0.6038 0.108631 0.6667 0.0833 0.1482 0.5377 0.110705 (0.3276) (0.6667) (0.4393) (0.7002) (0.3617) (0.7083) (0.4789) (0.7351) Data: med_b LR 0 0 0 0.5000 0.031599 0 0 0 0.5000 0.031944 (0.0826) (0.7500) (0.1488) (0.7271) (0.0571) (0.4000) (0.1000) (0.5862) ReLogit 0 0 0 0.5000 0.031931 0 0 0 0.5000 0.032661 (0.0692) (0.9167) (0.1287) (0.7394) (0.0500) (0.6000) (0.0923) (0.6034) Cloglog 0 0 0 0.5000 0.031634 0 0 0 0.5000 0.031884 (0.0833) (0.7500) (0.1500) (0.7286) (0.0556) (0.4000) (0.0976) (0.5828) Data: med_c LR 0 0 0 0.5000 0.008453 0 0 0 0.5000 0.008799 (0.0566) (1) (0.1071) (0.9280) (0) (0) (0) (0.3725) ReLogit 0 0 0 0.4986 0.012813 0 0 0 0.4966 0.017448 (0.0124) (1) (0.0245) (0.6556) (0.0090) (1) (0.0179) (0.6300) Cloglog 0 0 0 0.5000 0.008396 0 0 0 0.5000 0.009002 (0.0566) (1) (0.1071) (0.9280) (0) (0) (0) (0.3691) Note: Values without brackets are the classifications results obtained using the conventional threshold pˆ0 = 0.5. Values in brackets are classifications results obtained using the threshold pˆ0 = y¯. The best values in each column given to the specified thresholds are in boldface. in Scenario I. In the data set med_a, there were two standard SVMs that managed to identify some events: the SVM with Gaussian kernel and then one with polynomial kernel. The in-sample precision and recall of these two standard SVMs have also increased, indicating that the kernelised SVMs might be better in handling complex data than did the GLMs. However, the performance of these two SVMs in out-of-sample recall remained on a relatively low level, with the best of these two – the Gaussian SVM – slightly surpassing the 20 percent level. The cost-sensitive SVMs produced a better performance than the standard SVM in this data set, especially in terms of recall. The cost-sensitive SVM with sigmoid kernel has even reached the maximum score of recall in both the training and the test set. However, as indicated by its result in BA, the sigmoid SVM-DEC achieved the maximum performance in recall simply because it always assign the observations to the positive class. This renders the sigmoid SVM-DEC as an unreliable method in rare event modelling. Additionally, the results in Table 5.4 also show that the strong performance of the cost- sensitive SVMs comes at the price of increased false positives and hence low precision. As a consequence, in this data set, none of the cost-sensitive SVMs have an out-of-sample F1 score that equaled or surpassed the 0.5 level. In other words, the cost-sensitive SVMs did not deliver a prediction performance superior to the best-performing GLMs (in this case, GLMs that used the alternative threshold) in data set med_a. In the data sets med_b and med_c, the standard SVMs had all failed to identify the events, while its 46 Table 5.4: Classification performance the SVMs (Scenario II) In-sample Out-of-sample Precision Recall F1 BA Brier Precision Recall F1 BA Brier Data: med_a Linear SVM C = 0.001 0 0 0 0.5000 0.114565 0 0 0 0.5000 0.116475 Gaussian SVM C = 10 0.89 0.42 0.57 0.7054 0.088602 0.62 0.21 0.31 0.5923 0.116688 Sigmoid SVM C = 0.001 0 0 0 0.5000 0.136337 0 0 0 0.5000 0.134420 Polynomial SVM C = 1 0.86 0.33 0.48 0.6615 0.102705 0.75 0.12 0.21 0.5585 0.119579 Linear SVM-DEC C = 0.001 0.32 0.67 0.44 0.6985 0.111129 0.35 0.71 0.47 0.7272 0.113676 Gaussian SVM-DEC C = 1 0.50 0.68 0.58 0.7756 0.094952 0.38 0.54 0.45 0.6875 0.110362 Sigmoid SVM-DEC C = 5 0.16 1 0.28 0.5000 0.131892 0.16 1 0.28 0.5000 0.131209 Polynomial SVM-DEC C = 1 0.44 0.74 0.55 0.7763 0.107354 0.31 0.62 0.41 0.6776 0.114890 Data: med_b Linear SVM C = 0.001 0 0 0 0.5000 0.033220 0 0 0 0.5000 0.032595 Gaussian SVM C = 0.001 0 0 0 0.5000 0.031901 0 0 0 0.5000 0.032023 Sigmoid SVM C = 0.001 0 0 0 0.5000 0.033148 0 0 0 0.5000 0.032273 Polynomial SVM C = 0.001 0 0 0 0.5000 0.032928 0 0 0 0.5000 0.032607 Linear SVM-DEC C = 1 0.10 0.83 0.18 0.7835 0.032331 0.05 0.4 0.09 0.5724 0.031994 Gaussian SVM-DEC C = 50 0.80 1 0.89 0.9956 0.032854 0 0 0 0.4897 0.032144 Sigmoid SVM-DEC C = 5 0.03 1 0.07 0.5000 0.033148 0.03 1 0.07 0.5000 0.032273 Polynomial SVM-DEC C = 50 0.86 1 0.92 0.9970 0.032890 0 0 0 0.4759 0.032301 Data: med_c Linear SVM C = 0.001 0 0 0 0.5000 0.008518 0 0 0 0.5000 0.006661 Gaussian SVM C = 0.001 0 0 0 0.5000 0.008559 0 0 0 0.5000 0.006767 Sigmoid SVM C = 0.001 0 0 0 0.5000 0.008529 0 0 0 0.5000 0.006678 Polynomial SVM C = 0.001 0 0 0 0.5000 0.008475 0 0 0 0.5000 0.006716 Linear SVM-DEC C = 1 0.15 1 0.26 0.9755 0.008421 0 0 0 0.4631 0.006679 Gaussian SVM-DEC C = 50 1 1 1 1 0.007916 0 0 0 0.4765 0.006723 Sigmoid SVM-DEC C = 1 0.01 1 0.02 0.5000 0.008502 0.01 1 0.01 0.5000 0.006700 Polynomial SVM-DEC C = 5 0.60 1 0.75 0.9971 0.008424 0 0 0 0.4765 0.006689 Note: The best values in each column are in boldface. cost-sensitive versions varied in classification performance. The problem of overfitting in the Gaussian and polynomial SVM-DEC might be augmented as the event class becomes more rare. Classification results for the data sets med_b and med_c shows that both classifiers have achieved high F1 scores in the training set, indicating great harmony in the precision-recall trade-off. In the test set, however, all nominal evaluation metrics suddenly drop to zero. This would imply that the model resulted from the Gaussian and polynomial SVM-DEC have no practical value in these data sets, since they were unable to predict any future outcomes. The classification performances of linear SVM-DEC and of sigmoid SVM-DEC appeared to be relatively stable between the training and the test set. However, they are still poor results, especially considering the large number of out-of-sample false positives resulted from these two cost-sensitive SVMs. 5.3 Scenario III The classification performance of the GLMs and SVMs in this scenario are summarised in Table 5.5 and 5.6, respectively. In Scenario III, the number of observations included in the data sets have doubled from that in Scenario II. And the number of predictors have increased to 20. The event rarity in the 47 data sets included in this scenario can be described as moderate (y¯ = 16.2%) and severe (y¯ = 3.5%). Similar to the previous scenarios, the GLMs using the conventional threshold failed to identify events in the data set that characterised by severe rarity. In the data set large_b, all GLMs that used the threshold pˆ0 = 0.5 obtained zeros in in-sample classification. For this reason, the unusual out-of-sample performance of the Cloglog model seems to occur merely by chance. The classification performance of these GLMs was improved when the threshold pˆ0 = y¯ was used, with the ReLogit model outperforming the other two models in terms of recall. However, the models’ low scores in precision showed that this strong performance comes at the cost of increased false positives. In the large_a data set, the in-sample classification performances of the GLMs have decreased compared to their performances in similar data sets in previous scenarios, suggesting that the learning capacity of GLM are negatively correlated to the dimensions of data. However, the out-of-sample recall, F1 score, and BA for all GLMs are slightly higher in the data set large_a than that in med_a, a phenomenon which we have not observed between med_a and small_a. It is highly unlikely that the difference in rarity caused this phenomenon, since the difference in the event rarity presented in the test sets of small_a, med_a, and large_a are less than 1 percentage point. Meanwhile, as shown in Table 5.5, all three GLMs delivered similar classification performance in the large_a data set. The LR and the ReLogit model had even delivered identical classification performance when using the convectional threshold. Table 5.5: Classification performance the GLMs (Scenario III) In-sample Out-of-sample Precision Recall F1 BA Brier Precision Recall F1 BA Brier Data: large_a LR 0.5000 0.1150 0.1871 0.5464 0.115578 0.4444 0.1633 0.2388 0.5617 0.129612 (0.3089) (0.7080) (0.4301) (0.7015) (0.2946) (0.6735) (0.4099) (0.6794) ReLogit 0.5000 0.1150 0.1871 0.5464 0.115565 0.4444 0.1633 0.2388 0.5617 0.129394 (0.3004) (0.7257) (0.4249) (0.7001) (0.2917) (0.7143) (0.4142) (0.6878) Cloglog 0.5600 0.1239 0.2029 0.5526 0.115502 0.4118 0.1429 0.2121 0.5515 0.131194 (0.3160) (0.6991) (0.4353) (0.7039) (0.2897) (0.6327) (0.3974) (0.6649) Data: large_b LR 0 0 0 0.5000 0.030129 0 0 0 0.5000 0.032901 (0.0784) (0.6667) (0.1404) (0.6943) (0.0633) (0.4545) (0.1111) (0.5992) ReLogit 0 0 0 0.5000 0.030469 0 0 0 0.4983 0.033487 (0.0669) (0.7917) (0.1234) (0.6998) (0.0696) (0.7273) (0.1270) (0.6785) Cloglog 0 0 0 0.5000 0.030013 0.5000 0.0909 0.1539 0.5437 0.032978 (0.0804) (0.6667) (0.1435) (0.6980) (0.0750) (0.5455) (0.1319) (0.6447) Note: Values without brackets are the classifications results obtained using the conventional threshold pˆ0 = 0.5. Values in brackets are classifications results obtained using the threshold pˆ0 = y¯. The best values in each column given to the specified thresholds are in boldface. As for the standard SVMs, the results are similar to that in Scenario II. In the large_b data set, all standard SVMs failed to identify the events. Meanwhile, in the large_a data set, the Gaussian and polynomial SVMs were once again the only two standard SVMs that managed to detect some observations from the event class. As in the med_a data set, in this data set, the Gaussian and polynomial SVMs demonstrated the flexibility of kernelised SVMs in handling increased model complexity, outperforming 48 GLMs that used the conventional threshold. However, this did not lead to more correctly predicted events in the test set. For both the Gaussian and polynomial SVMs, only 6 percent of the events in the test set were correctly predicted, fewer than the GLMs with pˆ0 = 0.5 threshold, not to mention those that uses the threshold pˆ0 = y¯. Table 5.6: Classification performance the SVMs (Scenario III) In-sample Out-of-sample Precision Recall F1 BA Brier Precision Recall F1 BA Brier Data: large_a Linear SVM C = 0.001 0 0 0 0.5000 0.126496 0 0 0 0.5000 0.132109 Gaussian SVM C = 10 0.97 0.32 0.48 0.6584 0.110489 0.38 0.06 0.11 0.5207 0.127116 Sigmoid SVM C = 0.001 0 0 0 0.5000 0.135370 0 0 0 0.5000 0.136658 Polynomial SVM C = 1 0.91 0.28 0.43 0.6390 0.100920 0.50 0.06 0.11 0.5246 0.126924 Linear SVM-DEC C = 50 0.31 0.75 0.44 0.7177 0.118064 0.30 0.69 0.42 0.6916 0.126674 Gaussian SVM-DEC C = 50 0.70 1 0.82 0.9591 0.099250 0.30 0.47 0.37 0.6291 0.129519 Sigmoid SVM-DEC C = 5 0.16 1 0.28 0.5000 0.123883 0.16 1 0.28 0.5000 0.126332 Polynomial SVM-DEC C = 50 0.76 1 0.86 0.9693 0.107780 0.26 0.39 0.31 0.5883 0.131405 Data: large_b Linear SVM C = 0.001 0 0 0 0.5000 0.032870 0 0 0 0.5000 0.035180 Gaussian SVM C = 0.001 0 0 0 0.5000 0.031444 0 0 0 0.5000 0.034210 Sigmoid SVM C = 0.001 0 0 0 0.5000 0.033114 0 0 0 0.5000 0.035322 Polynomial SVM C = 0.001 0 0 0 0.5000 0.031198 0 0 0 0.5000 0.035038 Linear SVM-DEC C = 0.001 0.09 0.75 0.17 0.7456 0.032171 0.09 0.73 0.16 0.7235 0.034741 Gaussian SVM-DEC C = 0.001 0 0 0 0.5000 0.031822 0 0 0 0.5000 0.034916 Sigmoid SVM-DEC C = 0.001 0 0 0 0.5000 0.031779 0 0 0 0.5000 0.034595 Polynomial SVM-DEC C = 10 0.86 1 0.92 0.9970 0.032152 0.05 0.09 0.06 0.5109 0.035161 Note: The best values in each column are in boldface. Regarding the cost-sensitive SVMs, the results we observed in Table 5.6 are also similar to that in the previous scenarios. They achieved a better classification performance than the GLMs and the standard SVMs in the data sets large_a, although the improvements compared to the result of GLMs using the alternative threshold are merely marginal. We can also observe some degrees of overfitting in Gaussian and polynomial SVM-DEC, as well as the phenomenon that the sigmoid SVM-DEC assigns all observations to the positive class. Similar results were also obtained in the the data sets large_b. Generally, in the presence of moderate class imbalance, the kernelised cost-senstive SVMs did not perform much better than the non-kernelised (linear) version in terms of out-of-sample metrics. Under the presence of severe class imbalance, as in the case of large_b, the linear SVM-DEC even performed better than those kernelised SVM-DECs in the test set, despite the large number of false positives. 49 Chapter 6 Results from the empirical analysis 6.1 Classification performance Through the simulation study, we have obtained some knowledge about the classification performance of GLMs and SVMs under the presence of various levels of class imbalance. In the empirical analysis, I investigated whether the classification patterns of different methods observed in the simulation study still hold. The MID data set has a event rarity around 0.34 percent and contains 10 predictors. For this reason, I expected that the classification results obtained from the MID data would be similar to those obtained from the med_b7 data set in Scenario II. Table 6.1: Classification performance the GLMs (MID) In-sample Out-of-sample Precision Recall F1 BA Brier Precision Recall F1 BA Brier LR 0.5000 0.0137 0.0267 0.5068 0.003198 0.5000 0.0323 0.0606 0.5161 0.003343 (0.0295) (0.8767) (0.0570) (0.8886) (0.0283) (0.8065) (0.0547) (0.8589) ReLogit 0.3333 0.0137 0.0263 0.5068 0.003201 0.5000 0.0323 0.0606 0.5161 0.003355 (0.0271) (0.8767) (0.0526) (0.8841) (0.0255) (0.8065) (0.0494) (0.8506) Cloglog 0.2000 0.0137 0.0256 0.5068 0.003200 0.3333 0.0323 0.0588 0.5160 0.003336 (0.0295) (0.8767) (0.0572) (0.8888) (0.0284) (0.8065) (0.0548) (0.8561) Note: Values without brackets are the classifications results obtained using the conventional threshold pˆ0 = 0.5. Values in brackets are classifications results obtained using the threshold pˆ0 = y¯. The best values in each column given to the specified thresholds are in boldface. The classification performance of the GLMs and the SVMs are shown in Tables 6.1 and 6.2, respec- tively. The results show that the GLMs using the conventional threshold managed to identify some of the events in both the training and the test set, although the low recall score indicates that all these GLMs had low sensitivity. The in-sample and out-of-sample BA scores of these GLMs show that they are only marginally better than a classifier that always assigns observations to one class. As expected, the GLMs using the alternative threshold achieved a better result in recall at the cost of decreased precision. The BA for such GLMs had also improved. But the low F1 score still indicates a poor trade-off between the ability of detecting the actual events and the ability of making the correct decision. The LR and the ReLogit models yielded nearly identical results as expected. Using the conventional threshold, the GLM with Cloglog link seems to be an inferior model relative to LR and ReLogit in this data set because of its poor performance in both in-sample and out-of-sample precision. However, when the alternative threshold was adopted, the classification performances of the three GLMs listed in 6.1 hardly differed. 50 Regarding the SVMs, the standard SVMs had, as expected, all failed to identify the events. The cost- sensitivity versions achieved better performance than both the standard SVMs and the GLMs using the conventional threshold. As shown in Table 6.2, the cost-sensitive SVMs in general managed to identify more than 75 percent of the actual military disputes in the training set and more than 60 percent of the same outcome in the test set. However, such high accuracies in identifying actual conflicts come at the price of the large number of false positives and hence a low F1 score both in-sample and out-of-sample. Table 6.2 also shows that the sigmoid SVM-DEC achieved a better classification performance in the test set than in the training set. Since all other cost-sensitive SVMs showed the reversed case, it is unlikely that such anomaly was caused by the data structure of MID. Therefore, this phenomenon would imply that the sigmoid SVM-DEC tends to classify the observation randomly when rare events are presented in the data. Meanwhile, the issue of overfitting in the Gaussian and the polynomial SVM-DEC is also presented in the empirical analysis, so does the relative stable performance of the linear SVM-DEC across the training and the test set. Since I did not split the MID data by time periods as in Beck et al. [2000], but randomly select observations from the event and the non-event class separately, we can also eliminate the possibility that the reductions in out-of-sample recall for SVMs with Gaussian and polynomial kernels were caused by exogenous changes in the real world after a certain point of time. In addition, I had also attempted to prevent overfittning by tunning the parameter C when fitting the SVMs to the MID data using GridSearchCV. As shown in Table 6.2, the optimal value of C chosen for the Gaussian and the polynomial SVM-DEC is the largest of the five candidates, hence reaching the maximum level of violations to the training set I am prepared to tolerate. Still, the issue of overfitting remained. Thus, the results in Table 6.2 provide some evidence for the claim that Gaussian and polynomial SVM-DEC tend to overfit the data. Table 6.2: Classification performance the SVMs (MID) In-sample Out-of-sample Precision Recall F1 BA Brier Precision Recall F1 BA Brier Linear SVM C = 0.001 0 0 0 0.5000 0.003349 0 0 0 0.5000 0.003336 Gaussian SVM C = 0.001 0 0 0 0.5000 0.003418 0 0 0 0.5000 0.003390 Sigmoid SVM C = 0.001 0 0 0 0.5000 0.003418 0 0 0 0.5000 0.003388 Polynomial SVM C = 0.001 0 0 0 0.5000 0.003413 0 0 0 0.5000 0.003382 Linear SVM-DEC C = 0.001 0.03 0.90 0.05 0.8925 0.003287 0.03 0.87 0.05 0.8813 0.003409 Gaussian SVM-DEC C = 50 0.07 0.99 0.13 0.9721 0.003203 0.04 0.61 0.08 0.7819 0.003312 Sigmoid SVM-DEC C = 0.001 0.01 0.78 0.01 0.6962 0.003382 0.01 0.84 0.01 0.7283 0.003355 Polynomial SVM-DEC C = 50 0.04 0.99 0.07 0.9482 0.003390 0.03 0.74 0.05 0.8245 0.003395 Note: The best values in each column are in boldface. 6.2 Probabilistic estimates As mentioned in the Section 2.3.4, the SVMs do not produce any probabilistic outputs, such as the regression output of GLM shown in Table A.2. The probabilistic evaluation metrics, such as the Brier scores presented in the tables above, were computed using Platt scaling. To compare the probabilistic output of the SVMs with that of the GLMs in a fair manner, I used the the reliability diagram (Figures 6.1–6.3), the ROC curve (Figures 6.4 and 6.5), and the precision-recall curve (Figures 6.6–6.8). 51 6.2.1 Reliability diagram Comparing Figure 6.1 with Figures 6.2 and 6.3, it is clear that the probabilistic outputs produced by the GLMs are superior than those by the SVMs. The three curves in subfigures 6.1a and 6.1b all demonstrated the tendency of converging towards the diagonal line, which, as mentioned in Section 2.3.4 above, represents the perfect match between the actual probabilities and the predicted probabilities. We can also observe that the probability estimates of LR and that of ReLogit are nearly identical to each other, while the predicted probabilities from the Cloglog model extend far more beyond the 0.5 threshold than do the predicted probabilities from LR and ReLogit. This is coherent with the theoretical claim that the Cloglog link approaches to 1 at a faster rate than it approaches to 0.33 (a) GLM (In-sample) (b) GLM (Out-of-sample) Figure 6.1: Reliability diagram of the GLMs applied to the MID data. As for the standard SVMs, Figure 6.2 shows that only the linear SVM managed to produce a calibration curve similar to that of the GLMs, while the kernelised versions all ended in some points near the origin, meaning that the classifier predicted nearly zero probability for all observations in the data set. In contrast to the standard SVMs, the cost-sensitive SVMs did produce predicted probabilities 33As a side note, Figure 6.1 also shows that the curves for the LR, ReLogit, and Cloglog model lie under the diagonal line, indicating that the predicted probabilities from these models are too high. It is worth to mention that this has nothing to do with the overestimation of the (conditional) probability of event occurrence, Pr (Y = 1|X), which we discussed in Section 3.1. The reason is simple: as mentioned in Section 2.3.4, the y-axis of the reliability diagram is the observed frequency of positive cases conditional to the predictions, Pr (y = 1|s(x) = s), which is not the same as Pr (Y = 1|X), which is conditional to the predictors, X. Therefore, the calibration curves of the LR, ReLogit, and Cloglog model shown in Figure 6.1 do not contradict the theoretical results presented in Section 3.1. 52 (a) Standard SVM (In-sample) (b) Standard SVM (Out-of-sample) Figure 6.2: Reliability diagram of the standard SVMs applied to the MID data. considerably greater than zero, most noticeably, the probabilities predicted by the Gaussian SVM-DEC, which, as the reliability diagram for in-sample prediction in subfigure 6.3a shows, surpassed the 0.5 threshold. However, in terms of out-of-sample predictions, the probability estimates produced by cost- sensitive SVMs are still inferior to those produced by GLMs. In addition, subfigures 6.3a and 6.3b also shows that the calibration curves of all four classifiers diverge from the diagonal line in various extents. Comparing the results shown in Table 6.2 with those in Figure 6.2, we might see some indications of inconsistency between the classification performance of SVM measured in nominal terms and the method’s probabilistic outputs provided through Platt scaling. From Table 6.2, we know that all standard SVMs had failed to detect events in the MID data and classified all observations as non-events. The probabilistic outputs of the kernelised standard SVMs can be regarded as coherent with this result. However, this is not the case for the linear SVM. The calibration curves presented in Figure 6.2 imply that, using the predicted probabilities produced by Platt scaling, the linear SVM is capable of identifying some events in both the training and the test set – if we set the threshold to pˆ0 = y¯. However, as we have seen from Table 6.2, this implication is not consistent with the nominal metrics. 53 To explain this inconsistency, recall that the predicted probabilities produced by Platt scaling are converted from the SVM scores, which in turn are related to the distance of each of the observations to the separating hyperplane. In the case of the linear SVM, it seems that the Platt scaling had given those observations that lie in close distance to the separating hyperplane – probably support vectors – a probability of belonging to the event class higher than zero. As mentioned in Section 2.3.4, this relies on the assumption that the distance of an observation to the separating hyperplane – the ”confidence” we have for our classification – reflects the probability of that observation belonging to the positive class. However, this assumption might not always hold, for the simple reason that ”confidence” is not equivalent to probability. The fact that an observation lies in close distance to the separating hyperplane could only imply that we are uncertain about our classification of that observation (i.e., about whether the decision boundary is correctly placed), it does not necessarily imply that this observation has a higher probability of belonging to the other class. It is probably here the nominal classification results of the base classifier and the predicted probabilities produced by Platt scaling begin to diverge. (a) Cost sensitive SVM (In-sample) (b) Cost sensitive SVM (Out-of-sample) Figure 6.3: Reliability diagram of the cost sensitive SVMs applied to the MID data. 54 6.2.2 ROC curve and AUC As mentioned in Section 4.4.2, the use of ROC/AUC has many shortcomings in evaluating different classifiers’ classification performance under class imbalance. However, the ROC curve is still a convenient tool for assessing the reliability of the probability estimates generated from Platt scaling. In Figures 6.4 and 6.5, I plotted the ROC curves of the standard and of the cost-sensitive SVMs, respectively, and compared these curves alongside the ROC curves of the GLMs. The AUC was also computed at the same time. (a) GLMs vs. Standard SVMs (In-sample) (b) GLMs vs. Standard SVMs (out-of-sample) Figure 6.4: The ROC curve of the GLMs and the standard SVMs applied to the MID data. Figure 6.4 shows that the standard SVMs – after being converted to probabilistic models through Platt scaling – are generally inferior than the GLMs. In both the training and the test set, the ROC curves of all four standard SVMs are located under those of the GLMs. The curves of the kernelised SVMs are also not distant from the diagonal line, which represents the performance of a ”no information” classifier that classifies observations randomly. For this reason, the AUCs of these standard SVMs are also lower than those of the GLMs. The Gaussian SVM, in particular, delivered a AUC score slightly above 0.5 in the test set, implying a performance nearly the same as a ”no information” classifier. The 55 only standard SVM that achieved ROC/AUC similar to those of the GLMs is the linear SVM. Since the MID data has 10 predictors, I had expected that the kernelised SVMs are better than the linear SVM in handling data with this level of model complexity. Therefore, the result that the linear SVM performs better than its kernelised counterparts is truly striking. One possible explanation of this could be the inconsistency between the nominal classification results and the probability predictions provided through Platt scaling mentioned in the previous section. As shown in the reliability diagrams in Figure 6.2, the linear SVM is the only standard SVM that managed to obtain – through Platt scaling – probability predictions that are not zeros. For this reason, the linear SVM also became the only standard SVM that is capable of classifying some observations to the event class in the MID data set, and hence capable of achieving a better ROC/AUC than its kernelised counterparts. (a) GLMs vs. Cost sensitive SVMs (In-sample) (b) GLMs vs. Cost sensitive (out-of-sample) Figure 6.5: The ROC curve of the GLMs and the cost-sensitive SVMs applied to the MID data. In Table 6.2 above, we observe that the cost-sensitive SVMs generally showed a better classification performance than the standard SVMs. As shown in Figure 6.5, this is still the case after these cost- sensitive SVMs were converted to probabilistic models through Platt scaling. Although the ROC curve for sigmoid SVM-DEC, as the standard SVM using the same kernel, is still located under the ROC 56 curves of GLMs, the curves of all the other cost-sensitive SVMs are either located above or at the same level as the curves for GLMs. The Gaussian and the polynomial SVM-DEC, in particular, have achieved a AUC score equal to 0.99 and 0.96, respectively, in the training set. The AUC scores of these two cost-sensitive SVMs dropped below 0.9 eventually, probably because of the overfitting issue linked to the base classifiers using the same kernel. Still, the AUC scores of these two cost-sensitive SVMs are considered as high and hence indicate strong prediction performance. However, the reader must not forget that the ROC/AUC results could be misleading when rare events are presented in the data. In this case, the high AUC score of the Gaussian and the polynomial SVM-DEC might be resulted from the abundance of non-events, which masked the high number of false positives resulted from these two cost-sensitive SVMs and hence led to a low false positive rate that, in turn, produced a seemingly good sensitivity-specificity trade-off. 6.2.3 Precision-Recall curve As mentioned in Section 4.4 above, the precision-recall curve illustrates the trade-off between precision and recall for different classification methods, with good classifiers having their curves bent towards the top-right corner. Like the ROC curve, the precision-recall curve also requires probabilistic outputs. As in the previous cases, both the standard SVMs and the cost-sensitive SVMs were converted to probabilistic models using Platt scaling. Figures 6.6–6.8 show the precision-recall curves for all classifiers evaluated in this study. The results shown in these figures are unequivocal: all classifiers have their precision-recall curve bent to the left bottom corner – the precise opposite of what a ”good” classifier ought to be. For the GLMs, Figure 6.6 shows that there are almost no differences between the three models in terms of the precision-recall trade-off, as their curves overlap each other. Among the standard SVMs, the linear SVM once again outperformed its kernelised counterpart in this category – an expected result given to what we have observed in the reliability diagrams and the ROC curves. However, compared with that of the GLMs, the precision-recall curve of the linear SVM shows a rather poor trade-off between the two metrics, not to mention the curves of the kernelised versions, which are nearly horizontal lines that lie at the bottom of the diagram. Additionally, Figure 6.8 shows that for SVMs, the trade-off between precision and recall slightly improved when the cost-sensitive versions were used. However, despite the improvements, the curves of all four cost-sensitive SVMs show no better performance than the curves of the GLMs. The issue of inconsistency between the nominal metrics and the probabilistic outputs is also presented in the precision-recall curves. Comparing the curves of the GLMs with the classification results presented in Table 6.1, it is evident that the nominal metrics and the probability estimates are coherent with each other. This, however, seems not to be the case for the SVMs. For instance, in Table 6.2, the linear SVM showed no better classification performance in terms of precision and recall than its kernelised counterparts, but in Figure 6.7, the linear SVM appears to outperform other SVMs in terms of precision- recall trade-off. In sum, the inconsistent results shown in this section and the above sections cast doubt on whether the predicted probabilities produced through Platt scaling truly reflect the classification performance of the SVMs. Nevertheless, even if such probabilities are consistent with the classification performance of the base classifier, the precision-recall curves in Figures 6.7 and 6.8 still show that both the standard SVMs and the cost-sensitive version had a poor precision-recall trade-off, indicating not only that these probability estimates are not reliable for rare event prediction, but also that the SVMs might not be an appropriate method to model rare events. 57 (a) GLM (In-sample) (b) GLM (Out-of-sample) Figure 6.6: Precision-recall curve of the GLMs applied to the MID data. 58 (a) Standard SVM (In-sample) (b) Standard SVM (Out-of-sample) Figure 6.7: Precision-recall curve of the standard SVMs applied to the MID data. 59 (a) Cost sensitive SVM (In-sample) (b) Cost sensitive SVM (Out-of-sample) Figure 6.8: Precision-recall curve of the cost-sensitive SVMs applied to the MID data. 60 Chapter 7 Discussions 7.1 Empirical implications 7.1.1 Modelling rare events using SVM The results from the simulation study and the empirical analysis show that the standard SVMs, in the majority of cases, did not achieve a better classification performance than did the GLMs. In Scenario III of the simulation study and the empirical analysis, the standard SVMs had even delivered a weaker performance in out-of-sample predictions than did the GLMs using the conventional threshold, not to mention those using the alternative one. This implies that the standard SVMs might be more sensitive to imbalanced or rare event data than Akbani et al. [2004] had suggested. In addition, under the presence of moderate class imbalance, the standard SVMs tend to have low out-of-sample sensitivity – an undesirable characteristic in the modelling of rare events. As mentioned in the introduction, rare events (or black swan events) often have a major impact in the real world, implying that the cost of false negatives (i.e., misclassifying the events as non-events) could be much greater than the cost of false positives. This is especially true in forecasting financial crises, or in the context of the MID data – forecasting military conflicts. In the latter case, using a classifier that has low sensitivity would most probably lead to signals indicating the break of war be ignored. Such outcome is not only undesirable from a military point of view, but also from a social one because of the high costs associated with warfare [Beck et al., 2000]. As for the cost-sensitive SVMs, the results from the simulation study and empirical analysis show that they achieved a better classification performance than the GLMs and the standard SVMs under the presence of moderate class imbalance. However, as the events become rarer, the out-of-sample classification performance of the cost-sensitive SVMs dropped sharply due to the rising number of false positives. Although the cost-sensitive SVMs have much higher sensitivity than the standard versions – a characteristic that could be beneficial in forecasting military conflicts – the method also created a large number of false positives when applied to the the MID data. This raises some concerns about using the cost-sensitive SVMs in conflict research. Meanwhile, the tendency of overfitting occurs in some kernelised cost-sensitive SVMs, especially those with Gaussian and polynomial kernels. This issue might also cast some doubts on using such methods in other real-world applications. 7.1.2 The reliability issue of probabilities produced by Platt scaling The results from the empirical analysis in this study indicate that the probabilistic outputs produced by Platt scaling might have a reliability issue, as these probabilities are not consistent with the nominal 61 classification performance of the SVMs. Additionally, the results from ROC/AUC and from the precision- recall curves also show that these probability estimates are inferior to that produced by the GLMs. One can argue that it was the Platt scaling and the inconsistency it induced that caused the weak performance of the SVMs in ROC/AUC and prediction-recall curves. Although we can not entirely rule out the possibility of this scenario, it might still be imprudent to attribute the weak performance of SVMs to Platt scaling alone, since the inferior probability estimates shown in the ROC and the precision-recall curves are most likely caused by the rare events presented in the data set, which affected both the base classifier (i.e., the SVMs) and the calibrator (i.e, the Platt scaling). As mentioned in Section 2.3.4, Platt scaling is essentially a calibrator that takes the SVM scores, fˆi, as inputs, and then transformed fˆi to probabilities by fitting a sigmoid to the training set (fˆi, ti). Note that the scores fˆi in this training set are selected through cross-validation. As described in Section 3.2.3, the SVM has a tendency of favouring the majority class when applied to a data set in which the imbalanced classes or rare events are presented. In the case of MID data, this would lead to only those fˆi with a negative sign (which represent observations that were classified as non-events) are passed on to the training set (fˆi, ti). Fitting a sigmoid to this training set would most certainly yield a set of inferior probability predictions that are not capable of predicting a single event – ”garbage in, garbage out”. Therefore, the inferior probability estimates shown in the ROC and the precision-recall curves have more to do with the base classifiers’ inability of handling imbalanced data, as implied by the results of the simulation study. However, we must also acknowledge that the Platt scaling might also suffer from imbalanced classes. In Section 2.3.4, we mentioned that the method’s use of the sigmoid function to calibrate probabilities (or scores produced by the base classifier) is only justified if the scores within each class are Gaussian distributed with equal variance. For the MID data, which has an event rarity around 0.34 percent, the equal variance requirement is unlikely to be satisfied. We had also mentioned that the Platt scaling is most effective when the score distortions have a sigmoidal shape. However, as shown in Figure B.1 in the appendix, this is not the case for the MID data. Thus, it is not unexpected that the probability estimates produced by Platt scaling become inferior to those produced by the GLMs. Therefore, regarding the question whether we can use the probabilistic outputs produced by Platt scaling in rare event modelling, the results of this study suggest that such outputs might be unreliable if the base classifier is sensitive to imbalanced dataset. Considering the fact that the SVM outputs are generally hard to interpret and that statistical inference is an important part of quantitative social science research, it would be appropriate to avoid using the SVM and the Platt scaling in building a prediction model for rare events. 7.2 Methodological limitations There are several methodological limitations in this study. Firstly, the data sets in the simulation study were all generated using the LR as the true model. In other words, I did not consider other options, such as the probit model and linear discriminant analysis, in simulating rare event data. Additionally, rare event data could also be simulated using stochastic methods such as the Markov process [Bucklew, 2004]. For this reason, the classification performance of the GLMs and the SVMs on data simulated using the above-mentioned techniques might differ from the results presented in this thesis. Secondly, the simulated data sets in this study assumed a linear relationship between the outcome variable and the predictors. The case of non-linear relationship is not included in this study. As men- tioned in Section 2.3.3, the SVMs, especially the kernelised versions, might achieve a better classification 62 performance on data characterised by such non-linear relationship. Thirdly, even though the thesis aims to provide evidence for a wider application of the use of SVMs in social science research, the reader should still interpret the results of the empirical analysis with caution. Such caution is also advised for future extrapolations of the results of the empirical analysis to other social science disciplines, as the empirical analysis of this study is based upon a data set typical for peace and conflict research. And the relationships between the outcome variable and the predictors presented in the MID data set might not be necessarily representative for data in other fields such as economics and finance. In addition, the signal-to-noise ratio, which is a measure of the amount of information contained in the data, might also vary in data typical to different social science disciplines. 63 Chapter 8 Conclusions Using both simulated and real-world data, this thesis is intended to provide more insight to the research concerning the use of machine learning techniques in social science. More specifically, I evaluated the classification performance of the SVMs under the presence of class imbalance and rare events, and compared it with the classification performance of the traditional GLMs. The main conclusion of this study is that the use of SVMs – both the standard version and the cost-sensitive version – in rare event modelling could not be motivated in any case. There are several reasons for this conclusion. Firstly, the standard SVMs showed no better performance than the GLMs in classifying rare events. The results from both the simulation study and the empirical analysis also showed that the standard SVMs have a relatively low sensitivity, a characteristic that brings more harm than benefit to the prediction of rare events in real applications. Secondly, as in the case of the standard SVMs, the cost-sensitive SVMs could not be a satisfactory classifier in rare event modelling either – for the reason that they generally tend to overfit the training data, leading to a poor out-of-sample precision and hence a poor F1 score. Other than overfitting, cost- sensitive SVMs also result in a large number of false positives, which occur in connection with severe and extreme class imbalance. Thirdly, as a machine learning classifier, the SVM has already an inherent drawback in model in- terpretability in comparison to the GLMs. In addition, the fact that the SVMs do not produce a prob- abilistic output makes issue with interpretability even more problematic, as the probability estimates are important for quantitative analysis in social science. The Platt scaling is designed for mitigating this lack of probabilistic output. However, as the empirical analysis of this thesis suggests, the predicted probabilities produced by Platt scaling are not consistent with the nominal evaluation metrics and might suffer from class imbalance as well. Hence, such probabilities are not reliable. The reliability issue of probabilities produced by Platt scaling could be a further argument against the use of SVMs in rare event modelling. In sum, the results from both the simulation study and the empirical analysis provided little support for the use of the SVMs in rare event modelling in general, not to mention in social science research. Perhaps a better way of addressing the rare event bias is to create a more balanced data set, by applying the sampling method described in the Section 3.3. However, the results of this thesis do not suggest that we should completely abandon the internal methods. They are rather a call for more research focused on the development of novel methods that addresses the problems associated with rare events, as well as a call for more studies concerning the application of machine learning methods in the field of social science. 64 References Agresti, A. (2012). Categorical data analysis (3rd ed.). Hoboken, N.J.: Wiley. Agresti, A. (2015). Foundations of linear and generalized linear models. Hoboken, New Jersey: John Wiley & Sons. Akbani, R., Kwek, S., & Japkowicz, N. (2004). Applying support vector machines to imbalanced datasets. In European conference on machine learning (pp. 39–50). Batuwita, R., & Palade, V. (2010). Efficient resampling methods for training support vector ma- chines with imbalanced datasets. In The 2010 International Joint Conference on Neural Networks (IJCNN) (pp. 1–8). Batuwita, R., & Palade, V. (2013). Class imbalance learning methods for support vector machines. In H. He & Y. Ma (Eds.), Imbalanced learning: Foundations, algorithms, and applications. Wiley- IEEE Press. Beck, N., King, G., & Zeng, L. (2000). Improving quantitative studies of international conflict: A conjecture. American Political science review , 21–35. Bennett, D. S., & Stam, A. C. (2000). Research design and estimator choices in the analysis of interstate dyads: When decisions matter. Journal of conflict resolution, 44 (5), 653–685. Bishop, C. M. (2006). Pattern recognition and machine learning. New York, NY: Springer. Brodersen, K. H., Ong, C. S., Stephan, K. E., & Buhmann, J. M. (2010). The balanced accuracy and its posterior distribution. In 2010 20th International Conference on Pattern Recognition (p. 3121- 3124). Bucklew, J. A. (2004). Introduction to rare event simulation. New York: Springer. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16 , 321–357. Cristianini, N., & Shawe-Taylor, J. (2000). Linear learning machines. In An introduction to support vector machines and other kernel-based learning methods (p. 9–25). Cambridge University Press. Ferna´ndez, A., Krawczyk, B., Garc´ıa, S., Galar, M., Herrera, F., & Prati, R. C. (2018). Learning from imbalanced data sets. Springer. Friedman, J., Hastie, T., & Tibshirani, R. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer. Ghosn, F., Palmer, G., & Bremer, S. A. (2004). The mid3 data set, 1993—2001: Procedures, coding rules, and description. Conflict management and peace science, 21 (2), 133–154. Greene, W. H. (2018). Econometric analysis (Global edition; 8th ed.). Harlow, Essex: Pearson. Hain, D., & Jurowetzki, R. (2020). Introduction to rare-event predictive modeling for inferential statis- ticians – a hands-on application in the prediction of breakthrough patents. (https://arxiv.org/ abs/2003.13441[Accessed: 2020-12-27]) Hand, D. J. (2006). Classifier technology and the illusion of progress. Statistical science, 1–14. Herbrich, R. (2002). Learning kernel classifiers : theory and algorithms. Cambridge, Mass.: MIT Press. 65 James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: With Applications in R (1st ed.). New York: Springer. Kelleher, J. D., Mac Namee, B., & D’Arcy, A. (2020). Fundamentals of machine learning for pre- dictive data analytics : algorithms, worked examples, and case studies (2. ed. ed.). Cambridge, Massachusetts: The MIT Press. King, G., & Zeng, L. (2001a). Explaining rare events in international relations. International Organiza- tion, 55 (3), 693–715. King, G., & Zeng, L. (2001b). Logistic regression in rare events data. Political analysis, 9 (2), 137–163. Kull, M., Silva Filho, T. M., Flach, P., et al. (2017). Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration. Electronic Journal of Statistics, 11 (2), 5052–5080. Levy, P. S., & Lemeshow, S. (2008). Sampling of populations : methods and applications (4th ed.). Hoboken, N.J.: Wiley. McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). London: Chapman & Hall. Murphy, K. P. (2012). Machine learning: a probabilistic perspective. Cambridge, MA: MIT press. Niculescu-Mizil, A., & Caruana, R. (2005). Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning (pp. 625–632). Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . E´douard Duch- esnay (2020a). 1.16. Probability calibration. (https://scikit-learn.org/stable/modules/ calibration.html[Accessed: 2021-02-21]) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . E´douard Duch- esnay (2020b). 1.4. Support Vector Machines. (https://scikit-learn.org/stable/modules/ svm.html[Accessed: 2020-12-27]) Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines (Tech. Rep. No. MSR-TR-98-14). Redmond, Washington, United States: Microsoft Research. Platt, J. (2000). Probabilities for sv machines. In A. J. Smola, P. Bartlett, B. Scho¨lkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers. Cambridge, MA: The MIT Press. Prati, R. C., Batista, G. E., & Monard, M. C. (2011). A survey on graphical methods for classifica- tion predictive performance evaluation. IEEE Transactions on Knowledge and Data Engineering , 23 (11), 1601–1618. Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review , 65 (6), 386. Taleb, N. N. (2007). The black swan: The impact of the highly improbable (Vol. 2). Random house. Tipping, M. E. (2001). Sparse bayesian learning and the relevance vector machine. Journal of machine learning research, 1 (Jun), 211–244. Tsang, I. W., Kwok, J. T., & Cheung, P.-M. (2005). Core vector machines: Fast svm training on very large data sets. Journal of Machine Learning Research, 6 (Apr), 363–392. Van der Paal, B. (2014). A comparison of different methods for modelling rare events data (Unpublished master’s thesis). Ghent University. Veropoulos, K., Campbell, C., & Cristianini, N. (1999). Controlling the sensitivity of support vector machines. In Proceedings of the international joint conference on AI (Vol. 55, p. 60). Wackerly, D. D., Mendenhall, W. I., & Scheaffer, R. L. (2008). Mathematical statistics with applications (7th ed.). Southbank: Thomson Learning. Wasserman, L. (2004). All of statistics : a concise course in statistical inference. New York: Springer. Xu, X., Mao, Y., Xiong, J., & Zhou, F. (2007). Classification performance comparison between RVM 66 and SVM. In 2007 International Workshop on Anti-Counterfeiting, Security and Identification (ASID) (pp. 208–211). Zadrozny, B., & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 694–699). Zeng, L. (1999). Prediction and classification with neural network models. Sociological methods & research, 27 (4), 499–524. 67 Appendix A Additional tables Table A.1: Descriptive statistics for the full MID data Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max Militarised dispute 303,814 0.0034 0.0585 0 0 0 1 Contiguous 303,814 0.0414 0.1991 0 0 0 1 Allies 303,814 0.0903 0.2866 0 0 0 1 Foreign policy 303,814 0.7726 0.1736 −0 0.7 0.9 1 Balance of Power 303,814 0.3644 0.2989 0.0002 0.0949 0.6018 1.0000 Max. democracy 303,814 3.2815 7.2791 −10 −6 10 10 Min. democracy 303,814 −5.0347 5.6620 −10 −9 −5 10 Max. trade 303,814 −4.8147 2.5843 −8.6840 −8.6840 −2.7884 0.2954 Min. trade 303,814 −5.7591 2.5303 −9.5361 −9.5361 −3.7558 −0.7376 Year since dispute 303,814 17.6841 11.5093 0 8 26 46 Major Power 303,814 0.0831 0.2760 0 0 0 1 68 Table A.2: Model for Militarised Interstate Dispute Dependent variable: Militarised Interstate Dispute LR ReLogit Cloglog (1) (2) (3) Contiguous 3.686∗∗∗ 3.665∗∗∗ 3.557∗∗∗ (0.302) (0.303) (0.304) Allies −0.642∗ −0.614∗ −0.591 (0.385) (0.385) (0.371) Foreign policy 0.387 0.358 0.388 (0.860) (0.860) (0.825) Balance of Power 0.733 0.747 0.732 (0.517) (0.517) (0.489) Max. democracy 0.083∗∗∗ 0.082∗∗∗ 0.076∗∗∗ (0.021) (0.021) (0.020) Min. democracy −0.095∗∗∗ −0.091∗∗∗ −0.087∗∗∗ (0.026) (0.026) (0.025) Max. trade 0.090 0.104 0.118 (0.224) (0.224) (0.218) Min trade −0.077 −0.093 −0.108 (0.232) (0.232) (0.226) Years since dispute −0.078∗∗∗ −0.076∗∗∗ −0.076∗∗∗ (0.024) (0.024) (0.023) Major power 1.950∗∗∗ 1.935∗∗∗ 1.794∗∗∗ (0.329) (0.329) (0.312) Constant −7.468∗∗∗ −7.375∗∗∗ −7.387∗∗∗ (0.836) (0.837) (0.813) Observations 21,267 21,267 21,267 Log Likelihood −328.019 −328.019 −329.285 Akaike Inf. Crit. 678.037 678.037 680.570 Note: Parameters are estimated based upon training data (70% of the observations). Numbers in brackets are robust standard errors. ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01 69 Appendix B Additional figures (a) Scores produced by the standard SVMs, normalised (b) Scores produced by the cost-sensitive SVMs, normalised Figure B.1: Reliability diagram of the uncalibrated probabilities (i.e., normalised classification scores) produced by the standard SVMs and the cost-sensitive SVMs. 70