Detection of software incidents from large log material with the use of unsupervised machine learning

Master’s thesis in Applied Data Science

DIMITRIOS ANASTASIADIS
JAKUB LENART

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2022

© DIMITRIOS ANASTASIADIS, 2022.
© JAKUB LENART, 2022.

Supervisor: Niklas Zechner, Department of Swedish, multilingualism, language technology, University of Gothenburg
Advisor: Daniel Dalevi, Centiro Solutions AB
Examiner: Richard Johansson, Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg

Master’s Thesis 2022
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2022

Abstract

Computer systems generate log files, which contain information on the various operations performed by these systems. This information can support the process of error/failure detection and debugging. Therefore, anomalies can be spotted in the system through its produced log material. The task of anomaly detection can be treated as a binary classification of log files, with the two classes being anomalous and non-anomalous.
Due to the sheer volume of data and the complexity of the task, it cannot be performed manually by humans, which creates the need for automation. Centiro, a Swedish software company, has decided to follow a machine learning approach for automating the task of software incident detection. In this thesis, we apply four machine learning models in order to detect anomalies: the Local Outlier Factor (LOF), the Isolation Forest (IF), Principal Component Analysis (PCA) and the LSTM-Autoencoder. We make use of four publicly available datasets as well as a dataset gathered from the logs produced by the computer systems of the company. Preprocessing of the data and selection of the appropriate features are two tasks that needed to be carefully performed for the successful implementation of the models. Precision, Recall and F-Score were used as evaluation metrics to measure the performance of the models on the different datasets. The model with the best and most stable overall performance on the publicly available datasets is the LSTM-Autoencoder; we therefore decided to apply it to the data of the company in order to detect any possible software incidents.

Keywords: binary classification, log, anomaly detection, machine learning, Local Outlier Factor, Isolation Forest, PCA, LSTM-Autoencoder.

Acknowledgements

We would like to take this opportunity to express our gratitude to our academic supervisor Niklas Zechner at the Department of Swedish, Multilingualism, Language Technology for his guidance and valuable input throughout this thesis project. We would also like to thank Centiro for providing us with the opportunity as well as the resources to work on this thesis. Specifically, we want to thank our supervisor Daniel Dalevi at Centiro, who was responsible for initiating this thesis project with us, and Viktor Ingemarsson for his important contribution and support throughout our project.
We would also like to express our appreciation to our examiner Richard Johansson at the Department of Computer Science and Engineering for his time, feedback and input on the thesis. Finally, special thanks to our families, who supported us throughout our studies.

Dimitrios Anastasiadis & Jakub Lenart, Gothenburg, June 2022

List of Acronyms

Below is the list of acronyms that have been used throughout this thesis, listed in alphabetical order:

AE  Autoencoder
CCFD  Credit Card Fraud Detection
DAGMM  Deep Autoencoding Gaussian Mixture Model
FN  False Negative
FP  False Positive
IF  Isolation Forest
IIS  Internet Information Services
LOF  Local Outlier Factor
LSTM  Long Short-Term Memory
LSTM-VAE  Long Short-Term Memory Variational Autoencoder
PCA  Principal Component Analysis
RNN  Recurrent Neural Network
SKAB  Skoltech Anomaly Benchmark
SMD  Server Machine Dataset
SVM  Support Vector Machine
SWaT  Secure Water Treatment
TP  True Positive

Contents

List of Acronyms
List of Figures
List of Tables
1 Introduction
  1.1 Problem Statement
  1.2 Research Questions & Novelty of the Project
  1.3 Limitations
2 Background
  2.1 Machine Learning
    2.1.1 Supervised Machine Learning
    2.1.2 Unsupervised Machine Learning
    2.1.3 Semi-supervised Machine Learning
  2.2 Log files
  2.3 Anomaly Detection
  2.4 Classification Models
    2.4.1 Local Outlier Factor
    2.4.2 Isolation Forest
    2.4.3 Principal Component Analysis
    2.4.4 Autoencoder
    2.4.5 Long Short-Term Memory
  2.5 Performance Evaluation Metrics
    2.5.1 Precision
    2.5.2 Recall
    2.5.3 F-Score
  2.6 Related Work
    2.6.1 Unsupervised cluster evolution approach
    2.6.2 Real-time anomaly detection with unsupervised methods
    2.6.3 Anomaly Detection in Access Logs
    2.6.4 System Log Analysis for Anomaly Detection
    2.6.5 USAD: Unsupervised Anomaly Detection on Multivariate Time Series
3 Methods
  3.1 Dataset
    3.1.1 Features of the dataset of the company
  3.2 Preprocessing
    3.2.1 Data scaling
    3.2.2 Anomaly rate adjustment
    3.2.3 One-hot encoding
    3.2.4 Data Windows
  3.3 Model Implementation
    3.3.1 Train-Test Ratio
    3.3.2 Local Outlier Factor
    3.3.3 Isolation Forest
    3.3.4 PCA
    3.3.5 LSTM-Autoencoder
  3.4 Evaluation Method
  3.5 Used Software and Hardware
4 Results
  4.1 SWaT dataset
  4.2 SKAB dataset
  4.3 SMD dataset
  4.4 CCFD dataset
  4.5 Consistency check
  4.6 Additional Precision-Recall Curve test
  4.7 Dataset of Centiro
5 Discussion
  5.1 Approaches & Results
  5.2 Future Work
6 Conclusion
Bibliography

List of Figures

2.1 Screenshot of Graylog’s function.
2.2 Example of anomaly detection with one feature.
2.3 Anomaly Detection with the use of Local Outlier Factor.
2.4 Anomaly Detection with the use of Isolation Forest.
2.5 Dimensionality Reduction with the use of PCA.
2.6 Autoencoder architecture.
2.7 LSTM architecture. [29]
3.1 Example of One-Hot Encoding.
3.2 Example of choosing the number of principal components based on the percentage of cumulative variance.
3.3 Example of loss distribution plot for choosing the classification threshold.
3.4 Division of the test set into 10 subsets and iterative evaluation on each one of them.
4.1 Performance results on the SWaT dataset.
4.2 Performance results on the SKAB dataset.
4.3 Performance results on the SMD dataset.
4.4 Performance results on the CCFD dataset.
4.5 Performing consistency check on the AE model.
4.6 Precision-Recall Curve test on the standardized SWaT dataset, without the use of time windows.
4.7 Precision-Recall Curve test on the standardized SWaT dataset, with the use of time windows.
4.8 AE output on 5 seconds time windows.
4.9 AE output on 30 seconds time windows.
4.10 AE output on 1 minute time windows.

List of Tables

3.1 sc-status codes and their corresponding messages
4.1 Performance results on SWaT dataset.
4.2 Performance results on SKAB dataset.
4.3 Performance results on SMD dataset.
4.4 Performance results on CCFD dataset.
4.5 Performance results on the dataset of the company.

1 Introduction

Anomaly detection is a task that has been researched and investigated thoroughly over the past few years, since both industry and academia have recognized its increasing importance. This is because spotting anomalies on time can help in the mitigation or even prevention of major damage or dysfunction in the system.
Therefore, several methods have recently been applied by various companies and research teams in order to detect anomalies that occur in computer systems in real time [1]. The need for automation of this process is clear, as the sheer amount of data produced and managed every day by the computer systems of companies is not amenable to manual processing by humans, due to the time it would take and the complexity of the task.

Centiro, a software company based in Borås, Sweden, provided us with log data so that we could detect possible anomalies in their computer systems in the framework of this thesis project. Logs that are produced by the computer systems contain useful runtime information that can be analysed in order to extract relevant knowledge about any possible anomalies that might have occurred over a certain period of time [2]. For this reason, Centiro has taken a machine learning approach to gain insights and detect anomalies that can be spotted in the log material of the system.

Selecting and preprocessing the appropriate log files are two crucial tasks for the correct training and testing of the machine learning models that are used for anomaly detection. Logs from different applications might provide us with different knowledge regarding the detection of software incidents. Furthermore, the selective use of different features of the logs might increase or decrease the performance of the machine learning models, as the features are not equally correlated with whether a log is anomalous or not. The selection of machine learning techniques is also of high importance, as some models yield better performance than others, depending on the dataset and the selected features that are used [1]. Therefore, in the framework of this thesis project, we investigate various alternatives in order to find an appropriate and efficient way of detecting anomalies from large log material.
1.1 Problem Statement

Anomalies may occur in several stages of an operation that is running in a computer system. The information that can be found in the log files of a computer system can provide us with insights about the root cause of the problem, thus helping us get one step closer to being able to detect a software incident on time or even prevent it from happening in the first place. When it comes to software companies like Centiro, being able to detect anomalies in real time is important for the correct function of the system and the successful operation of the company.

The main aim of this thesis project is the research and development of a model that will be able to spot potential anomalies by making use of Centiro’s produced log material. In order to achieve that, we first apply four different machine learning models to publicly available datasets, and then we apply the best performing one to the dataset of the company. This is a contribution in the field of Machine Learning with the use of four different approaches.

The idea behind the development of such a system is to provide an accurate detection of the anomalies of the system, so that the company can gain a clear insight and try to fix or avoid system disruption or any damage that could be caused in the future. This task requires effort in analysing the log files and extracting information that might be useful for the process of spotting anomalies. This effort is focused mainly on collecting the appropriate dataset, preprocessing the relevant log information, developing the models for the task, applying them and finally measuring and evaluating their performance.

1.2 Research Questions & Novelty of the Project

This thesis addresses the question of whether we can detect software incidents based on the content of Internet Information Services (IIS) log files using techniques of unsupervised machine learning.
This question is answered by performing tests on the log data of Centiro.

Another research contribution of this project is the comparison among four algorithms on the task of anomaly detection on a combination of five datasets that have different features and have been preprocessed in four different ways. Four of these datasets are publicly available and the other one is the dataset of the company. The four models are the Local Outlier Factor (LOF), the Isolation Forest (IF), Principal Component Analysis (PCA) and the LSTM-Autoencoder (LSTM-AE). From that comparison, we can get a clear overview of the model with the most universally stable performance, one that can work decently for a variety of datasets.

Additionally, by performing the mentioned tests, we also investigate how the performance of the models changes as the amount of anomalies included in the dataset changes.

Furthermore, another element of novelty of this thesis is the investigation of the usability of the feature sc-status in anomaly detection. This feature is present in all IIS logs and we investigate whether it could lead us to detect anomalies that are triggered by specific applications in the system. The sc-status is a protocol status code, which is explained further later.

Lastly, a more general yet interesting question that this project deals with is how machine learning models are utilized in order to detect software incidents. This question is largely addressed based on a literature review and on our own implementation of machine learning models on publicly available datasets as well as the dataset of Centiro.

1.3 Limitations

This thesis work is limited as far as time and resources are concerned. The amount of log data produced every day by the software of the company is extremely large and therefore, it is computationally infeasible to deal with all of it using only the available hardware resources. As a result, we deal with a subset of the data.
This may lead to lower performance results than if we took all the data into account. Also, specific features of the data were selected experimentally and based on a literature review.

The logs of the company are not labeled, and the process of labeling them would be labour intensive and would require a lot of time. For that reason, we were unable to obtain a sufficient labeled dataset from the company. Therefore, we used four publicly available labeled datasets that are suitable for anomaly detection tasks. We do this in order to observe and evaluate the performance of the models. Then, we apply the best performing model to the unlabeled dataset of the company. All four datasets come from different sources and differ in terms of dimensionality and anomaly rate; however, they do share some characteristics that allow us to compare them. Each of the four datasets is a multivariate time series dataset with only numerical values, containing outliers defined as anomalies. The diversity between these datasets can make the comparison more interesting as well as more universal.

Furthermore, we do not know what anomaly rate can be expected in the data of the company. This depends, among other things, on how we divide the data into time windows, which will be further explained in the following sections. The log dataset is highly unbalanced, in the sense that the number of anomalous logs is very small compared to non-anomalous ones. Nevertheless, since we are only considering a subset of the full data, the anomaly rate may be larger.

We evaluate the performance of the four models on the publicly available datasets in order to select the model that will be used for detecting anomalies in the dataset of the company. The purpose of using four datasets is to demonstrate empirically that the chosen model works well on different types of datasets.
Therefore, it is reasonable to apply this model to the final dataset of the company, which we are unable to evaluate without the assistance of company experts.

2 Background

The task of unsupervised anomaly detection can be performed in several ways. In this chapter, we present the theory behind it as well as the proposed methods. General concepts of machine learning and its categories are presented in Section 2.1. Section 2.2 describes log files and what they are used for, both in general and in the specific case of Centiro. Section 2.3 describes the task of anomaly detection, its basic features and the definition of an anomaly. A general description of the algorithms and models that are used can be found in Section 2.4, and the metrics used to evaluate them in Section 2.5. Finally, Section 2.6 describes some related work.

2.1 Machine Learning

Machine Learning is a subfield of Computer Science that is closely associated with Artificial Intelligence (AI), and it has been increasingly used in various tasks over the past years. From social media services and language translation to Medicine and Life Sciences, machine learning is involved in numerous applications over a wide range of fields [3]. Consequently, it is applied in several methods and solutions in the field of anomaly detection, which is the main research object of the present thesis.

Machine learning includes algorithms and statistical models that can be used in order to perform certain tasks without having to be explicitly programmed [4]. Prediction and early detection of incidents (e.g. health disorders) can be achieved through machine learning models that are trained on relevant available data. Machine learning can be divided into three main categories: Supervised, Unsupervised and Semi-Supervised. This division is based on the presence or absence of data that is tagged with one or more labels (labeled data).
Unsupervised machine learning is used for the purposes of the present project.

2.1.1 Supervised Machine Learning

Supervised machine learning requires a training dataset that has been labeled [5]. This means that in order to perform supervised learning, a data sample with correct classification labels is needed [6]. An observation regarding the anomaly detection task with the use of supervised machine learning is that the anomaly class is usually much smaller than the non-anomaly one and therefore the dataset is highly unbalanced between the two classes. The general concept of supervised machine learning includes the application of past knowledge gained from previous labeled data to new data [5].

2.1.2 Unsupervised Machine Learning

In contrast with supervised machine learning, performing unsupervised machine learning does not require the use of a labeled dataset. When it comes to the task of anomaly detection, we can assume that normal events occur much more frequently than anomalous ones. If this assumption does not hold, we risk numerous false alarms (false positives) [7]. A common approach in unsupervised machine learning is clustering, and there are many unsupervised models that follow this technique. An example of clustering is dividing the data points into several clusters according to their distance from the centers of the clusters [5]. There are more methods and approaches of unsupervised machine learning, and later on we analyze the ones used for anomaly detection in the framework of the present thesis.

2.1.3 Semi-supervised Machine Learning

Semi-supervised machine learning combines both supervised and unsupervised learning. That is because small amounts of labeled data might be used in semi-supervised learning, but generally speaking, most of the data is unlabeled. Semi-supervised learning was introduced in order to overcome the disadvantages of both supervised and unsupervised machine learning.
The drawback of supervised learning is that it requires large amounts of labeled training data in order to accurately classify the test data, and the drawback of unsupervised learning is that clustering unknown data can sometimes be relatively inaccurate. This is why semi-supervised learning is used in some cases where it is reasonable for the models to learn based on a small amount of labeled training data [8].

2.2 Log files

Generally speaking, logs are generated records that derive from a number of sources and keep track of the history of the tasks that have been performed. Software developers often use the log files in the debugging/troubleshooting process of applications. In the case of a small-scale system, it is relatively easy for a human to read the produced logs and understand whether the behavior of the system is the expected one. However, the system of a company might be of a larger scale, and therefore the size of the log material can make it impossible for a person to read and process it in order to figure out whether the system behaves as it is supposed to. Apart from that, the complexity of the contents of the log files might be higher when dealing with the computer system of a big company. Thus, the need for automation of this process has grown over the past few years, leading to innovative ways of log data manipulation that often include the use of machine learning. Some preprocessing and transformation of the data is necessary in most cases in order for the computer to be able to process it [9].

The most typical form of logs is timestamp-based data that contain information about the various processes that took place at specific instants [5]. They usually include both messages and numerical values in the different fields of information that they contain. As far as the specific logs of Centiro are concerned, their structure may vary from one application to another.
The log files that are used in the framework of the present thesis project derive from the Internet Information Services (IIS). This is a modular network server application from Microsoft. The IIS web server runs on Windows systems and it provides a platform that hosts and manages web applications and serves the requests of HTML pages [10]. The log files that we use contain various fields of information, such as timestamp, source code, level, time-taken and IP addresses of the client and the server.

The platform used for accessing and querying Centiro’s log file database is Graylog. This platform is used to store and perform real-time search and analysis in large amounts of log material produced every day by the company’s computer systems [11]. Graylog makes use of a three-tier architecture and a scalable storage that is based on Elasticsearch and MongoDB [12].

Figure 2.1: Screenshot of Graylog’s function.

2.3 Anomaly Detection

Generally speaking, an anomaly is an abnormal instance that indicates any behavior different from the expected one [5]. The importance of anomaly detection lies in the fact that by predicting or spotting anomalies, various errors in applications can be fixed in short periods of time, thus helping to mitigate the total damage that could be done to the system’s performance.

More specifically, in the framework of the present thesis project, an anomaly in a computer system is an instance that does not follow the normally expected patterns. Also, we can assume that anomalies occur in a very small percentage of the total instances. They can be spotted in logs produced by different applications and they are usually connected to operations that have been executed in a way different from what was expected. Anomalies might be of various levels of importance, according to which they can cause various degrees of damage to the system and affect its interaction with humans.
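To make the two assumptions above concrete (anomalies are rare, and they deviate from the expected pattern), the following is a minimal sketch on a single numerical feature. It uses a simple z-score rule, which is not one of the four models studied in this thesis and is chosen only for illustration. The `time_taken_ms` values, the function name `flag_anomalies` and the threshold of 3 standard deviations are all hypothetical.

```python
from statistics import mean, stdev


def flag_anomalies(values, threshold=3.0):
    """Return the indices of values whose z-score magnitude exceeds `threshold`."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical response times in milliseconds, one reading per time step.
time_taken_ms = [102, 98, 101, 97, 103, 99, 100, 104, 96, 100,
                 101, 99, 950, 102, 98, 100, 103, 97, 101, 99]

anomalous_steps = flag_anomalies(time_taken_ms)
print(anomalous_steps)
```

A rule like this only works for one feature with a roughly stable level, which is why the models in Section 2.4 are needed for multivariate log data.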
Anomaly detection can be based on different machine learning techniques: supervised, unsupervised and semi-supervised learning. In this project, unsupervised learning techniques are utilized and they are described in later sections. The following figure shows a simple example of anomaly detection based on the value of one feature plotted against time.

Figure 2.2: Example of anomaly detection with one feature.

2.4 Classification Models

Four unsupervised machine learning models were used in this thesis. These are the Local Outlier Factor (LOF), the Isolation Forest (IF), Principal Component Analysis (PCA) and the Long Short-Term Memory Autoencoder (LSTM-AE). These models follow different methods of classifying rows of the dataframes, which makes some of them more suitable for training and testing on specific datasets than others. The task is binary classification, as one row (or a group of rows) of the dataset can be classified either as anomalous or non-anomalous.

2.4.1 Local Outlier Factor

Local Outlier Factor is a density-based outlier detection algorithm which is able to spot outliers (anomalies) by calculating the local deviation of a given data point. It can be used, among other things, for outlier detection in unbalanced datasets. Determining whether a data point is an anomaly depends on the density between the data point and its neighbors. In particular, the lower the density, the higher the chance for the data point to be classified as an anomaly [13].

The key definitions of the LOF algorithm are the k-distance of a data point p, the k-nearest neighbors of p, the reachability distance of p with respect to o, the local reachability density of p and the LOF score of p [14]. In order to define the k-distance, we need to explain how the distance between two data points p and o is calculated.
This distance is calculated in an n-dimensional Euclidean space with the following formula [14]:

\[ d(p, o) = \sqrt{\sum_{i=1}^{n} (p_i - o_i)^2} \tag{2.1} \]

Let k-distance(p) be the distance of a data point p to its k-th nearest neighbor, where k is a positive integer. The k-nearest neighbors of p are all the data points whose distance from p does not exceed k-distance(p) [5].

The reachability distance of p with respect to o is defined with the following formula [14]:

\[ \mathrm{reachdist}_k(p, o) = \max\bigl(k\text{-distance}(o),\; d(p, o)\bigr) \tag{2.2} \]

Regarding the local reachability density (LRD) of a data point p, we define m as the minimum number of data points and write N_m(p) for the set of the m nearest neighbors of p. The LRD of a data point is then calculated as follows [14]:

\[ \mathrm{LRD}_m(p) = 1 \Big/ \left( \frac{\sum_{o \in N_m(p)} \mathrm{reachdist}_m(p, o)}{|N_m(p)|} \right) \tag{2.3} \]

In the above equation, we make use of the average reachability distance based on the m nearest neighbors of the data point p. Taking all the above into account, the LOF score is calculated as follows [14]:

\[ \mathrm{LOF}_m(p) = \frac{\sum_{o \in N_m(p)} \frac{\mathrm{LRD}_m(o)}{\mathrm{LRD}_m(p)}}{|N_m(p)|} \tag{2.4} \]

If the LOF score has a value of approximately 1, this is an indication that the data point is comparable to its neighbors and therefore it is not considered an outlier. If the value is below 1, this is an indication that the data point is an inlier. If the value is larger than 1, we have an indication of an outlier (anomaly) [15].

Figure 2.3: Anomaly Detection with the use of Local Outlier Factor.

2.4.2 Isolation Forest

We use the term "isolation" here to describe the process of separating a (potentially anomalous) data point from the rest of the data points. Anomalies generally form a small percentage of the total samples of a dataset and they differ from the rest, therefore we are able to separate them from the normal data points [16].

When detecting anomalies with an Isolation Forest, the data is sub-sampled and processed within some tree structures (isolation trees).
These trees perform random cuts in the values of the features of the data points. These features are selected randomly. The data points that go through the branches of the isolation trees and end up deeper in the trees have a smaller chance of being anomalous. Data points that travel along shorter branches of the trees are more likely to be anomalous [17].

This happens because anomalies will have shorter paths through this random partitioning, as they have more distinguishable values of their features and therefore they are more likely to be isolated earlier than the rest of the data points [16]. Therefore, when an isolation forest produces short paths for isolating some specific data points, these data points have a high chance of being anomalies and thus, they are classified as such. The aggregated length of the tree branches gives us the so-called “anomaly score” of a given data point, which serves as a measure of anomaly for this data point [17].

In summary, in the process of anomaly detection with the use of Isolation Forest, the isolation trees are built first, then the data points pass through the isolation trees that have been formed, and an anomaly score for each data point is obtained [16]. This anomaly score is used to classify each data point as an anomaly or not.

Figure 2.4: Anomaly Detection with the use of Isolation Forest.

2.4.3 Principal Component Analysis

Principal Component Analysis (PCA) is a method that can be used for dimensionality reduction. Given a dataset, PCA is based on determining the principal directions of the data distribution. The construction of the data covariance matrix as well as the calculation of its dominant eigenvectors are needed in order to determine the principal directions. The dominant eigenvectors, i.e. those with the largest eigenvalues, are taken as the principal directions, since they capture the largest share of the variance in the data [18].
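The two steps just named, building the covariance matrix and extracting its dominant eigenvectors, can be sketched with NumPy as follows. This is only an illustration on synthetic data, not the implementation used in this thesis; the function name `principal_directions` and the toy dataset are assumptions.

```python
import numpy as np


def principal_directions(X):
    """Return the eigenvalues and eigenvectors of the covariance of X, largest first."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)           # features are columns
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], eigenvectors[:, order]


rng = np.random.default_rng(0)
# Synthetic 2-D data stretched along the direction (1, 1), plus a little noise.
t = rng.normal(size=500)
X = np.column_stack([t, t]) + 0.1 * rng.normal(size=(500, 2))

eigenvalues, directions = principal_directions(X)
explained = eigenvalues / eigenvalues.sum()  # share of variance per direction
print(directions[:, 0], explained)
```

On this toy data the first column of `directions` is, up to sign, close to (1, 1)/√2, and it accounts for nearly all of the variance.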
In other words, PCA is a process through which the principal components of a collection of data points are computed. These principal components are a sequence of p unit vectors. The i-th vector is orthogonal to the first i-1 vectors and is the direction of the line that best fits the data, where the best fit is the line that minimizes the average squared distance from the data points to the line. PCA uses the computed principal components to change the basis of the data [19]. It is able to reduce a large number of correlated features to a much smaller number of uncorrelated principal components [20]. According to the literature, there are cases where PCA takes into account just the first few principal components and ignores the rest [19].

However, in this thesis project, PCA is used for anomaly detection. Therefore, it is enhanced with a function that computes the anomaly scores of the data points of a given dataset. Based on [21], this algorithm reduces the dimensionality of the data while trying to minimize the reconstruction error. It therefore attempts to capture the most valuable information of the original features, so that they can be reconstructed from the reduced features as accurately as possible. However, PCA cannot capture all the information contained in the original features when they are moved to a lower-dimensional space. Thus, some error is observed in the reconstruction process, which is called the reconstruction error. The data points that occur the least often are very likely to yield the largest reconstruction error and to be considered as anomalies. The reconstruction error of each data point is computed by summing the squared differences between its row in the original feature matrix and its row in the reconstructed matrix. We also refer to the reconstruction error of a data point as its anomaly score.
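The reconstruction-error score described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: it keeps only the first principal component, finds it by power iteration on the covariance matrix rather than through a library call, and the function names (`mean_center`, `first_component`, `anomaly_scores`) and the toy data are hypothetical.

```python
import math

def mean_center(rows):
    """Subtract the column means so the data is centered at the origin."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def first_component(centered, iterations=200):
    """Dominant eigenvector of the covariance matrix via power iteration."""
    n, d = len(centered), len(centered[0])
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(d)]
           for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d  # arbitrary unit start vector (assumption)
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def anomaly_scores(rows):
    """Squared reconstruction error after projecting onto one component."""
    centered = mean_center(rows)
    v = first_component(centered)
    scores = []
    for r in centered:
        z = sum(a * b for a, b in zip(r, v))   # reduced (1-D) feature
        recon = [z * b for b in v]             # mapped back to the original space
        scores.append(sum((a - b) ** 2 for a, b in zip(r, recon)))
    return scores

# Toy data: five points close to the line y = x and one point far from it.
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.1], [4.0, 4.0], [5.0, 4.9], [3.0, 0.0]]
scores = anomaly_scores(data)
most_anomalous = max(range(len(data)), key=lambda i: scores[i])
```

In this toy example the point that does not follow the dominant direction of the data receives the largest score, which is the behaviour the anomaly score relies on.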
Figure 2.5: Dimensionality Reduction with the use of PCA.

2.4.4 Autoencoder

An Autoencoder (AE) is an unsupervised machine learning model: a feed-forward neural network with an encoder-decoder structure [23]. The main objective of the AE is to replicate input vectors {x(1), x(2), . . . , x(m)} as output vectors {x̂(1), x̂(2), . . . , x̂(m)} while minimizing the reconstruction error, which is determined by the difference between the input and output data [24].

The Autoencoder model consists mainly of two phases. The first is the encoding phase, which maps the input data to the model's hidden layer. During this process, the model reduces the dimensionality of the input data, resulting in a latent representation of the data. The second phase is the decoding phase, which maps the hidden layer to the output layer and thereby restores the transformed data to its original dimensionality [25]. The following formula can be used to explain this procedure:

\[ \tilde{X} = D(E(X)) \tag{2.5} \]

where X is the input data, E denotes the encoding step, D denotes the above-described decoding step, and X̃ is the model's output. The model's overall goal is to train E and D to minimize the deviation between X and X̃ [25]. An Autoencoder model can thus be considered as a solution to the following optimisation problem:

\[ \min_{D, E} \left\| X - D(E(X)) \right\| \tag{2.6} \]

In Figure 2.6, the architecture of the AE model is presented. According to [25], an Autoencoder that consists of more than one hidden layer can also be called a deep Autoencoder. The AE that is implemented in this project consists of hidden layers that are built based on the Long Short-Term Memory architecture, which is further explained in the next section.
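The optimisation problem in Equation 2.6 can be illustrated with a deliberately small sketch. It assumes a linear encoder and decoder with a one-dimensional latent space trained by plain gradient descent, which is far simpler than the LSTM-based layers used in this project; the names `encode`, `decode`, `train`, the learning rate, and the toy data are all assumptions made for the example.

```python
def encode(x, a):
    """E(x): map a 2-D input to a 1-D latent value."""
    return a[0] * x[0] + a[1] * x[1]

def decode(z, b):
    """D(z): map the latent value back to 2-D."""
    return [z * b[0], z * b[1]]

def reconstruction_error(x, a, b):
    """Squared norm of x - D(E(x)), used here as the anomaly score."""
    x_hat = decode(encode(x, a), b)
    return sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat))

def mean_error(data, a, b):
    return sum(reconstruction_error(x, a, b) for x in data) / len(data)

def train(data, a, b, lr=0.01, epochs=2000):
    """Full-batch gradient descent on the mean reconstruction error."""
    a, b = list(a), list(b)
    n = len(data)
    for _ in range(epochs):
        grad_a, grad_b = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            z = encode(x, a)
            r = [xi - z * bi for xi, bi in zip(x, b)]  # residual x - D(E(x))
            r_dot_b = r[0] * b[0] + r[1] * b[1]
            for j in range(2):
                grad_a[j] += -2.0 * r_dot_b * x[j] / n  # d(error)/d(a_j)
                grad_b[j] += -2.0 * z * r[j] / n        # d(error)/d(b_j)
        a = [a[j] - lr * grad_a[j] for j in range(2)]
        b = [b[j] - lr * grad_b[j] for j in range(2)]
    return a, b

# Train only on "normal" points, which lie close to the line y = x.
normal = [[1.0, 1.0], [2.0, 2.1], [-1.0, -0.9], [-2.0, -2.0], [0.5, 0.4], [-0.5, -0.6]]
a0, b0 = [0.5, 0.0], [0.5, 0.0]
initial_error = mean_error(normal, a0, b0)
a, b = train(normal, a0, b0)
final_error = mean_error(normal, a, b)

# A point that breaks the learned pattern is reconstructed poorly.
score_normal = reconstruction_error([1.5, 1.5], a, b)
score_anomalous = reconstruction_error([2.0, -2.0], a, b)
```

Because the model only ever sees normal data during training, it learns to reconstruct that pattern well, and inputs that deviate from it end up with a noticeably larger reconstruction error.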
Figure 2.6: Autoencoder architecture.

2.4.5 Long Short-Term Memory

Long Short-Term Memory, or LSTM for short, is a type of recurrent neural network (RNN) that was developed in order to address the problem of long-term dependency caused by vanishing gradients. This issue was encountered during the training of RNNs [26]. While the model is being trained, the weights of the network are updated. In the situation of a vanishing gradient, the gradient decreases gradually until it eventually becomes vanishingly small, preventing the weights from changing their values and thus stopping the neural network from training further [27]. LSTM was a solution to this issue, and its main objective was to control the information flow within the neurons of the network. Long Short-Term Memory introduces a gating mechanism that controls the process of adding, storing, and deleting information from the iteratively propagated cell state [28]. The gating mechanism uses three gates: the forget gate, the input gate, and the output gate. In Figure 2.7, the architecture of the LSTM is presented.

Figure 2.7: LSTM architecture. [29]

In Figure 2.7, C_t stands for the cell state, h_t for the hidden state, which is the LSTM output, and f_t, i_t, and o_t denote the forget, input, and output gates, respectively. The forget gate is in charge of determining how much information should be kept and how much should be forgotten from the network. The decision of how much of the information is kept is based on the sigmoid function σ, which results in values between 0 and 1. The LSTM unit remembers more from the past if the sigmoid result is near 1. Similarly, the closer the outcome is to 0, the less memory from the past is retained by the LSTM unit [30]. The following formula presents the operation of the forget gate:

\[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{2.7} \]

The input gate, on the other hand, is in charge of determining how much information will be added to the cell state.
The sigmoid function, as in the forget gate, is applied to the new input and the previous state to make this decision. The result is multiplied by C̃_t, a vector of new candidate values that can be added to the current cell state [31]. Both operations can be described by the following formulas:

\[ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{2.8} \]

\[ \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{2.9} \]

Before reaching the last gate, the old cell state has to be updated. This is done by multiplying C_{t-1} by f_t and then adding i_t multiplied by C̃_t [29]. This step can be shown with the following formula:

\[ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \tag{2.10} \]

The output gate is the LSTM unit's third and final gate. This gate affects the h_t value and determines the LSTM output. The sigmoid function is also applied in this gate and works the same way as in the previous gates. First, the output gate is activated, and based on its result, h_t is computed by multiplying the output gate's output by the tanh of the previously updated cell state [29]. Both operations can be described with the following formulas:

\[ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{2.11} \]

\[ h_t = o_t * \tanh(C_t) \tag{2.12} \]

2.5 Performance Evaluation Metrics

In order to measure how well the models perform in the task of detecting anomalies, we make use of some well-known evaluation metrics. In this project, we consider Precision (also called positive predictive value), Recall (also known as sensitivity) and F-Score (also called F1-Score or F-Measure). We avoid using Accuracy as an evaluation metric in this task of anomaly detection, since the datasets that are used are highly unbalanced. We already know that anomalies make up a very small percentage of the total dataset. Therefore, in an example of a dataset that consists of 99.5% non anomalies and only 0.5% anomalies, a dummy classifier that classifies every incident as a non anomaly would end up yielding an accuracy of 99.5%.
At first glance, that would seem like an outstanding performance, but it does not actually provide us with any useful knowledge about the anomalies that are to be detected.

2.5.1 Precision

Precision takes into account the True Positives (TP) of the predictions and the sum of TP and False Positives (FP). More specifically, when computing Precision, the number of successful predictions of anomalies (i.e., TP) is counted and divided by the total number of predicted anomalies (i.e., TP+FP). In other words, Precision penalizes FPs, which are the incidents that the model wrongly classified as anomalies (false alarms) [32]. Precision is computed with the following formula:

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2.13} \]

2.5.2 Recall

Recall is computed by counting the number of successful predictions of anomalies (i.e., TP) in proportion to the total number of anomalies (i.e., TP+FN). Therefore, the key difference between Precision and Recall is that Recall penalizes False Negatives (FN), which are the incidents that the model wrongly classified as non anomalies even though they are anomalies [32]. Recall is computed with the following formula:

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{2.14} \]

2.5.3 F-Score

F-Score is computed as the product of Precision and Recall, divided by the sum of Precision and Recall, and multiplied by two. F-Score is the harmonic mean of Precision and Recall and therefore provides a combined view of the performance. F-Score has a highest possible value of 1.0 and a lowest possible value of 0.0. If the F-Score is 1.0, Precision and Recall are both perfect (both have a value of 1.0), and if the F-Score is 0.0, either Precision or Recall is equal to 0.0 [33].
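The three metrics of this section, and the reason Accuracy is avoided, can be made concrete with a short sketch. The function name `precision_recall_fscore`, the convention of returning 0.0 when a denominator is zero, and the toy labels are assumptions for illustration only; label 1 marks an anomaly and label 0 a non anomaly.

```python
def precision_recall_fscore(y_true, y_pred):
    """Precision, Recall and F-Score for the anomalous class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore

def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# 1000 samples with 0.5% anomalies, as in the example of Section 2.5.
y_true = [0] * 995 + [1] * 5

# Dummy classifier: everything is labelled as a non anomaly.
dummy = [0] * 1000
dummy_accuracy = accuracy(y_true, dummy)
dummy_metrics = precision_recall_fscore(y_true, dummy)

# Hypothetical detector: 4 false alarms, 4 of the 5 anomalies found.
detector = [1] * 4 + [0] * 991 + [1] * 4 + [0] * 1
detector_metrics = precision_recall_fscore(y_true, detector)
```

The dummy classifier reaches an accuracy of 0.995 while all three of its metrics are 0.0, whereas the hypothetical detector has TP = 4, FP = 4 and FN = 1, which gives a Precision of 0.5 and a Recall of 0.8.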
F-Score is computed with the following formula:

\[ \text{F-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{2.15} \]

2.6 Related Work

Based on the literature review conducted for this thesis, we observed that there has been extensive research that achieved valuable progress in the field of anomaly detection over the past few years. Here we describe the methods and results of some papers that are relevant to our thesis work.

2.6.1 Unsupervised cluster evolution approach

In the paper [34], M. Landauer et al. present an online anomaly detection approach that uses forecasting models to spot instances that differ from normally expected behaviors. A clustering model creates log line clusters based on several static cluster maps. In that way, it detects the transitions among the different clusters. They evaluate the models by making use of security-related metrics. In the approach described in this paper, some log lines do not align with their expected periodicity, correlation and average frequency. These are called contextual anomalies. With this approach, lines that were dissimilar to the others could be detected, and if they occurred only once, they were treated as outliers. Apart from that, changes in the system's behavior over time were treated as temporal anomalies. It is worth mentioning that no prior knowledge of the log data's structure or of the percentage of anomalies is needed, as this approach is self-learning, similarly to the one we are applying in this thesis with the use of unsupervised learning models.

2.6.2 Real-time anomaly detection with unsupervised methods

Ahmad et al., in the paper [35], present an unsupervised approach for detecting anomalies in real time using streaming data. They compare various online unsupervised algorithms and models with the use of a dataset that is called Numenta Anomaly Benchmark.
This dataset has been formed from data gathered from real-world data streams. Eleven different methods were compared in this paper, among them Relative Entropy, Bayesian Changepoint, Etsy Skyline, Hierarchical Temporal Memory, Sliding Threshold and Twitter ADVec. The highest evaluation metrics were achieved by the Hierarchical Temporal Memory model, which uses neural networks and estimates the likelihood of the anomalies. The numeric data that the models were tested and evaluated on included the frequency of high-latency requests, the temperature of the machine system and the CPU utilization of the client. It can be concluded from this paper that the need for efficient anomaly detection algorithms is increasing, since the number of data streams is also increasing and the process needs to be automated, as humans are unable to detect anomalies manually in such large amounts of data.

2.6.3 Anomaly Detection in Access Logs

In the paper [36], M. Thrashini et al. reported a comparison of supervised learning models against unsupervised learning ones regarding the task of anomaly detection. The data that was used in this project derived from web access logs. They analyzed various web access log files in order to be able to spot attacks that might appear as anomalous incidents in the system. Detecting intrusions in real time is crucial for the effective protection of the system, and therefore developing a model that is able to do so is an important task. In this paper, access logs were considered to contain useful indications of attacks, which is the main reason they were chosen for this task. Naive Bayes Multinomial Text, a supervised learning approach, was the method that achieved the highest evaluation results. Regarding the unsupervised approach, a clustering model based on the K-means clustering algorithm was utilized.

2.6.4 System Log Analysis for Anomaly Detection

H.
Shilin et al., in the paper [37], which is an experience report, describe some methods of anomaly detection and their use in different cases, and provide a comparison of six state-of-the-art log-based anomaly detection methods, three of them supervised and three unsupervised. The three supervised learning methods that were evaluated are Logistic Regression, Decision Trees and Support Vector Machines (SVM). The three unsupervised learning methods that were compared are Log Clustering, Principal Component Analysis (PCA) and Invariant Mining. Taking into account the F-Scores of the methods, we can say that supervised learning methods performed better in general. Invariant mining was reported to be the best performing method. In terms of execution time, the supervised learning methods were executed much faster than the unsupervised ones. Of course, the downside of the supervised learning methods is that their use requires a labeled dataset. Among the unsupervised learning methods, PCA was the fastest one, and it is also used in this thesis project.

2.6.5 USAD: Unsupervised Anomaly Detection on Multivariate Time Series

In this article [38], Audibert et al. propose a new method for unsupervised anomaly detection on multivariate time series. This method is based on the AE architecture, and the way the model learns is inspired by the Generative Adversarial Network (GAN). The main objective of this method was to use adversarial training, which allowed the model to learn how to amplify the reconstruction error of inputs containing anomalies while also improving the model's stability, which was superior to that of methods based on the GAN architecture. The architecture of the USAD model differs from that of a typical AE mainly due to the presence of two decoder networks. The final architecture consists of two autoencoders that share a common encoder but have separate decoders.
The model was trained in two phases. In the first phase, both AEs were trained to reconstruct normal inputs (not containing anomalies). In the second phase, the AEs were trained in an adversarial manner, which means that one autoencoder attempted to fool the other. The goal here was to train the second model to recognize whether the data was real, that is, whether it was input from raw data or reconstructed output from the first AE.

The model was evaluated empirically on five public datasets and one internal dataset (the internal dataset comes from Orange). Its performance was then compared with that of other models on the same datasets. The state-of-the-art models used in the comparison were IF, AE, LSTM Variational Autoencoder (LSTM-VAE), Deep Autoencoding Gaussian Mixture Model (DAGMM), and OmniAnomaly, a stochastic recurrent neural network model. Precision, Recall, and F1-Score were the metrics used in the comparison. The tests revealed that USAD performs very well, achieving the highest F1-scores on the majority of the datasets, but also that AE performs exceptionally well. Audibert et al. also investigate how different parameters such as down-sampling rate, window size, anomaly percentage rate, and latent space Z dimension affect model performance. The proposed model achieved the following scores on the final, internal dataset: Precision 74%, Recall 64%, and F1-score 69%, and it was able to detect all significant incidents that occurred in the dataset.

3 Methods

In this chapter, we present the methodology that was followed in order to complete the tasks needed for this thesis project. In Section 3.1, the dataset selection, processing and representation of the data are presented. In Section 3.2, the selection of the used features is described. In Section 3.3, we describe the implementation of the selected machine learning models that are used for anomaly detection.
In Section 3.4, the methods used for the evaluation of the models' performance are described. Finally, in Section 3.5, we describe the software and hardware that we used in the framework of this thesis.

3.1 Dataset

As mentioned in the Introduction, the datasets that were used in this thesis project derived both from online sources and from the produced log material of the company.

The first publicly available dataset that we used is called the Credit Card Fraud Detection (CCFD) dataset. We obtained this dataset from Kaggle. It serves the purpose of enabling credit card companies to spot fraudulent transactions, so that their customers are protected from being charged for transactions that they did not make [39]. The dataset contains transactions made by cardholders in Europe in September 2013. It is highly unbalanced, since only 492 of the total 284,807 transactions are fraudulent. This translates to an anomaly rate of 0.172%. The dataset contains numerical input values resulting from a transformation with PCA. It contains, among others, the "Time", "Amount" and "Class" features. Time is the number of seconds that elapsed between each transaction and the first one. Amount is the monetary amount of the transaction, and Class indicates whether a transaction is fraudulent (value 1) or normal (value 0) [39].

The second dataset that derived from online sources is called the SWaT (Secure Water Treatment) dataset. This dataset was collected from a scaled-down version of a real-world industrial water treatment plant producing filtered water. The SWaT dataset contains 11 days of continuous operation, of which 7 days were collected under normal operation and 4 days include attack scenarios. The values of the dataset were obtained from 51 sensors and actuators. The data was labelled according to normal and abnormal behaviour. The anomaly rate of the SWaT dataset is 11.98% [40].
Next, we used a publicly available dataset called SMD (Server Machine Dataset). This dataset consists of data that has been gathered over a period of five weeks from the computer systems of a large Internet company. The data comes from 28 different machines, so the dataset consists of 28 different subsets. This means that the models should be trained and tested on each subset separately. The subsets are all divided into training and testing parts of equal length. The train set is the first half of each subset and the test set is the second half. The testing parts include the labels of whether each point is an anomaly or not [41].

Finally, the last dataset that was retrieved from online sources is called SKAB (Skoltech Anomaly Benchmark). This dataset contains 35 individual data files in CSV format. Each file represents a single experiment and includes a single anomaly. The data was collected from sensors as multivariate time series. The dataset has 11 columns (variables), the last one being the anomaly indication (0 for a non-anomalous and 1 for an anomalous data point) [42].

As far as the dataset of the company is concerned, it has been formed by collecting log data that internet-based applications produced over a period of 12 hours on a weekday. We focused on applications that fall under the Internet Information Services (IIS) in order to create a dataset of logs that contain information about the source status of the operations. Each log also contains information about the time of production (timestamp). All the collected logs are put in chronological order. The data that is used for training the model consists of about 24.6 million rows and the data that is used for testing the model consists of about 22 million rows, before applying the time windows division.
Below, we provide an instance of the dataset and its structure, firstly as seen through Microsoft Excel and secondly as seen when loaded and managed as a dataframe in the Python development environment Jupyter Notebook.

3.1.1 Features of the dataset of the company

In order to detect anomalies in the dataset of the company, the proper features need to be selected. Forming a dataset that contains useful information and can be used for anomaly detection is of high importance, therefore particular attention was given to this task.

First, a timestamp is included in the dataset, which holds the time at which each log was created. The timestamp is also used as an index after dividing the data into time windows, which is explained later in Subsection 3.2.4.

Furthermore, the id of the logs is included in the dataset. This feature holds a unique value for each log and is used to speed up the process of identifying particular logs in the dataset.

Additionally, the dataset includes a feature called cs-uri-stem, which holds the files that are requested by specific applications in the system. This feature is useful for identifying the applications that could be responsible for the detected anomalies.

The last and most important feature of the dataset of the company is called sc-status. This is a protocol status code that is returned from a running application to the client. It indicates whether a request was successful. In some cases of unsuccessful requests, we are able to associate the returned sc-status code with a specific reason of failure. For instance, an sc-status code of the form 4XX, where each X is a digit, indicates a client-side error. The following table shows the messages indicated by the specific codes of the sc-status [43].
Table 3.1: sc-status codes and their corresponding messages

Status code   Message
1XX           Informational
100           Continue
101           Switching protocols
2XX           Success
200           OK-Succeeded
201           Created
202           Accepted
203           Nonauthoritative information
204           No content
205           Reset content
206           Partial content
3XX           Redirection
301           Moved permanently
302           Object moved
304           Not modified
305           Use proxy
307           Temporary redirect
4XX           Client Error
400           Bad request
401           Unauthorized
402           Payment Required
403           Forbidden
404           Not Found
405           Method Not Allowed
406           Not Acceptable
407           Proxy Authentication Required
408           Request Time-Out
409           Conflict
410           Gone
411           Length Required
412           Precondition Failed
413           Request Entity Too Large
414           Request-URL Too Large
415           Unsupported Media Type
5XX           Server Error
500           Internal server error
501           Not Implemented
502           Bad Gateway
503           Out of Resources
504           Gateway Time-Out
505           HTTP Version not supported

In some cases, the sc-status code might be 0, which indicates that a connection reset may have occurred and that, as a result, the application did not send the actual sc-status code.

3.2 Preprocessing

In order to train the models and to detect anomalies, we needed to preprocess the datasets so that they can be fed to the different models. In this project, data preprocessing is divided into three main tasks, namely data scaling, anomaly rate adjustment and one-hot encoding.

3.2.1 Data scaling

Data scaling is the first task, and it is applied to all five datasets. Previous studies have used different ways of scaling data for anomaly detection tasks. Normalization of data was performed, for instance, in [38], yielding good results. Normalization is recommended when the features do not follow a Gaussian distribution [44]. However, we can see in [45] that normalization is also very sensitive to outliers, which can have a large impact in anomaly detection tasks, as outliers might be considered anomalies.
As a result, standardization can also be used to get satisfying results, as shown in [46]. Standardization is an especially suitable choice for multivariate time series datasets, according to Shanker et al. [47]. We therefore decided to use both approaches in this research. This means that each model is first trained on normalized data, and its evaluation scores are then compared with the results achieved by the model trained on a standardized dataset. The Scikit-learn library was used in both scenarios, as it provides two scaling functions: StandardScaler and MinMaxScaler. The MinMaxScaler rescales the data to a fixed range, by default between 0 and 1, or between -1 and 1 if negative values should be allowed. The operation of normalization can be expressed by the following formula:

\[ X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} \tag{3.1} \]

The StandardScaler transforms the data so that the mean µ is zero and the standard deviation σ is one. In the standardization formula 3.2, Z represents the standardized data points and X the original data points.

\[ Z = \frac{X - \mu}{\sigma} \tag{3.2} \]

3.2.2 Anomaly rate adjustment

The anomaly rate adjustment is only applied to one dataset, namely the SKAB dataset. This procedure was used to reduce the proportion of anomalies in the SKAB dataset, which was originally roughly 53% when all subsets were combined. This problem was handled by creating a function that takes each subset of the SKAB dataset, computes the anomaly rate in that subset, and adjusts it to our desired rate of 8%. In simple terms, the function goes through the rows of a subset, identifies where the anomalies begin, and then cuts off the remaining rows so that the anomalies that are left make up 8% of the subset.

With the above process, we can ensure that, after the data is split into train and test, the train set will not contain any anomalies. However, a downside of it is that some non anomalous data was lost throughout this process.
Nevertheless, as the purpose of anomaly detection is to detect abnormalities as soon as possible, removing a portion of the subsets after an anomaly has happened has little impact on the performance of the models. We decided to reduce the anomaly rate of this dataset to 8% in order to have datasets with evenly increasing anomaly rates, i.e. 0.17, 4, 8, and 12 percent.

3.2.3 One-hot encoding

One-hot encoding is the final task of data preprocessing. This task is also performed on just one dataset. Specifically, it is carried out on the dataset of Centiro using the Pandas library in order to compute sc-status code occurrences during particular time periods (data windows). This operation results in a dataset with 21 new columns, where each one corresponds to a specific code. Figure 3.1 shows an example of a one-hot encoding operation.

3.2.4 Data Windows

As part of preprocessing, the data has been divided into windows, and each window includes a number of rows of the dataset. In this way, we do not classify each row, but a group of rows, as anomalous or non anomalous. When the division into data windows is implemented based on the timestamps of the data points, we refer to these windows as time windows.

Figure 3.1: Example of One-Hot Encoding.

However, this division into data windows cannot be applied in the same way to the different datasets that were used in this thesis. That is because the datasets contain different information and are used for different tasks. Below, we describe the way that data windows are created in each of the datasets.

The CCFD dataset is the only dataset where division into windows was not applied. The reason for this is that the dataset contains transactions from different cardholders that are not connected to each other in any way. Therefore, two transactions that follow one another in the dataset are completely unrelated to each other, and it just happened that they took place at the same time.
Thus, we did not find it reasonable to split the data into time windows, and instead ran the models on the dataset row by row.

Next, we have the SWaT dataset. In this dataset, we applied time windows division based on the timestamp feature. Each time window contains 5 seconds of data. We find it reasonable to split the data into time windows, as data points that are close to each other in time are also connected to each other in terms of indicating normal or abnormal behavior. Therefore, when we feed the models with the data, they treat each time window as a data point and assign a label to it. This label is assigned based on whether the model predicts that the time window is anomalous or non anomalous.

Moving on to the SMD, this dataset does not contain any timestamp feature, therefore division into time windows was not feasible. However, we divided each of the 28 subsets of the dataset into data windows by number of rows. More specifically, we included 20 rows in each window. The windows were treated by the models as single data points, and the models were evaluated based on their predictions of the window labels.

The SKAB dataset includes the timestamp feature, therefore we created time windows similarly to the SWaT dataset. The difference here is that the SKAB dataset consists of 35 subsets, so each of these subsets is divided into time windows. These time windows are also treated as data points and classified as anomalous or non anomalous by the models.

Lastly, regarding the dataset of the company, we divided it into time windows of 5 seconds, 30 seconds and 1 minute. Applying three different window lengths allowed for a better evaluation of the model.

3.3 Model Implementation

As described above, we have utilized four models in total for this thesis work.
These are the Local Outlier Factor (LOF), the Isolation Forest, the Principal Component Analysis (PCA) and the LSTM Autoencoder.

3.3.1 Train-Test Ratio

In this thesis, two methodologies were used to partition the datasets into training and test sets. To begin, two datasets, namely SWaT and SMD, were already separated into training and test sets. As a result, splitting was not required in this case. The Scikit-learn library was used to divide the Credit Card Fraud Detection and SKAB datasets. The split ratio for both datasets was 80/20, meaning that the training set contained 80% of the total data and the test set 20%.

Even though we performed unsupervised learning, splitting the datasets into a train and a test set was useful, since our goal was to train the models on data that is free of anomalies and evaluate them on data that contains anomalies. In that way, the model was trained on the normal behavior of the system, where anomalous incidents do not take place, so that later it would be able to recognise any abnormal behavior and spot the anomalies in the test dataset.

3.3.2 Local Outlier Factor

As described in 2.4.1, the LOF algorithm detects anomalies by taking into account the density between the data points and their neighbors. Here we make use of the sklearn.neighbors package of the Scikit-learn library [48], and the most important parameters that we need to investigate are n_neighbors and contamination [49]. The n_neighbors parameter is the number of neighboring data points that LOF uses in its nearest-neighbor queries. The contamination parameter is the proportion of the most isolated data points (the points with the highest LOF score in the dataset) that will be considered by the model as anomalous. The contamination should be in the range (0, 0.5] [48]. Also, since we want to train the LOF model on the train dataset and predict the labels of the test dataset, we set the value of the parameter novelty to True.
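To make the train-then-score setting concrete, the LOF score of Section 2.4.1 can be sketched in plain Python. This is not the Scikit-learn implementation and not the code used in the thesis: the helper names (`dist`, `neighbors`, `fit`, `lof_score`) are hypothetical, exactly k neighbors are used with ties broken by index, k is set to 3 instead of the library default of 20 because the toy training set is tiny, and the training points are assumed to contain no duplicates.

```python
import math

def dist(p, o):
    """Euclidean distance, Eq. (2.1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, o)))

def neighbors(p, points, k, skip=None):
    """The k nearest (distance, index) pairs of p among points."""
    pairs = sorted((dist(p, o), i) for i, o in enumerate(points) if i != skip)
    return pairs[:k]

def fit(train, k):
    """Precompute the k-distance and the LRD of every training point."""
    nn = [neighbors(p, train, k, skip=i) for i, p in enumerate(train)]
    k_dist = [pairs[-1][0] for pairs in nn]
    lrd = []
    for pairs in nn:
        reach = [max(k_dist[j], d) for d, j in pairs]  # Eq. (2.2)
        lrd.append(len(reach) / sum(reach))            # Eq. (2.3)
    return k_dist, lrd

def lof_score(p, train, k, k_dist, lrd):
    """LOF of a new point p with respect to the training data, Eq. (2.4)."""
    pairs = neighbors(p, train, k)
    reach = [max(k_dist[j], d) for d, j in pairs]
    lrd_p = len(reach) / sum(reach)
    return sum(lrd[j] for _, j in pairs) / (len(pairs) * lrd_p)

# Anomaly-free training data: a regular 5 x 5 grid of points.
train = [(float(x), float(y)) for x in range(5) for y in range(5)]
k = 3
k_dist, lrd = fit(train, k)

score_inside = lof_score((2.2, 2.3), train, k, k_dist, lrd)     # within the grid
score_outside = lof_score((10.0, 10.0), train, k, k_dist, lrd)  # far from the grid
```

A new point inside the grid obtains a score close to 1, while the distant point obtains a score well above 1. In the library, the contamination parameter then determines the threshold on such scores above which a point is labelled as anomalous.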
Regarding the number of neighbors to be considered, this should normally be greater than the minimum number of data points that a cluster contains and smaller than the maximum number of nearby data points that can potentially be anomalies. In practice, however, we generally do not possess such information [49]. Therefore, we experimentally tried different values for this parameter that seemed plausible for the datasets used. Keeping the default n_neighbors=20 worked reasonably well for the four publicly available datasets, so we kept this parameter equal to 20.

Since we cannot know the proportion of anomalies in the dataset of the company beforehand, we cannot specify the contamination parameter of the LOF model with high confidence; we can only estimate it experimentally. When working with the publicly available datasets, we already know the number of anomalous data points in the whole dataset. We could therefore set the contamination parameter equal to the ratio of this number to the total number of data points in the dataset. This would generally lead to better model performance than assigning a random value to the contamination parameter or estimating it empirically. In our implementation, however, this parameter is empirically set equal to 0.01 for the publicly available datasets as well. The reason is that we intend to check the performance of the models under realistic conditions, where we have no information about the labels or the contamination rates of the datasets. Since no such information is available for the dataset of the company, comparing the performance of the models under the same conditions seems fair and reasonable.
After all, this comparison is done in order to decide on the best performing model, which will be applied to the dataset of the company. As we describe later, the LOF model did not achieve the best results of the four models, so we did not apply it to the dataset of the company.

3.3.3 Isolation Forest

The Isolation Forest model is an ensemble model that consists of a number of isolation trees, as mentioned in 2.4.2. Each tree therefore needs to be fit with a number of samples from the train dataset. In our implementation, all samples of the train dataset are used to fit all trees. We make use of the sklearn.ensemble package of the Scikit-learn library, in which the parameter that controls the number of samples used by each tree is max_samples. Thus, we set this parameter equal to the number of samples in the train dataset. This might lead to a slightly higher execution time than if we used only a part of the train dataset for each tree. However, using the whole train dataset for each tree yields better performance results.

Another parameter of the IF model that we need to consider is contamination. It has the same role as in the Local Outlier Factor, which means that it represents the proportion of anomalies in the dataset, and it should be in the range (0, 0.5] [50]. The same logic that was described previously for the LOF model is also applied here. Therefore, we set the contamination parameter equal to 0.01 for this model as well. This is an empirical estimate of the proportion of anomalies that can be found in the different datasets. We can assume that anomalies make up a very small percentage of the whole dataset, even though this might vary from one dataset to another.

3.3.4 PCA

For the implementation of the PCA algorithm, the sklearn.decomposition package of the Scikit-learn library was used.
The PCA algorithm is used to reduce the number of components (dimensionality), as mentioned in 2.4.3, but here the model is extended with a function that computes the anomaly scores of the data points.

At first, we fit the PCA with the train data. Then, the model computes the anomaly scores of the test data based on the reconstruction error. Afterwards, a table (Pandas DataFrame) of data points along with their anomaly scores is created. This table is sorted in descending order, starting from the point with the highest anomaly score. Then, in order to compute the Precision of the model, the first 2% of the data points of the sorted table are kept and these data points are considered to be the only anomalous ones, since they have the highest anomaly scores. We refer to these first data points that are kept as the "cutoff".

The decision to select the first 2% of the data as a cutoff was based on the assumption that was previously made regarding the contamination (i.e. the proportion of anomalies in the dataset). We estimated that anomalies could make up about 1% of the total data points. It therefore seems reasonable to use a slightly larger percentage as a cutoff, since we can expect that the model will most likely not predict all the anomalies perfectly. This means that it will probably not assign the highest anomaly scores only to actual anomalous data points, and the first 1% of the sorted anomaly scores table might also include some non anomalous data points. Our estimation is that by taking a cutoff of 2%, we investigate more data points that have relatively high anomaly scores and might be actual anomalies, thus increasing the total number of TPs and providing a clearer overview of the performance of the model. On the other hand, the larger the cutoff, the higher the number of FPs, which translates into a decrease in the Precision of the model.
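The cutoff procedure can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the anomaly scores and ground-truth labels are made-up values, the helper name flag_top_fraction is hypothetical, and Recall is computed over all true anomalies in the data rather than only those inside the cutoff.

```python
def flag_top_fraction(scores, fraction=0.02):
    """Mark the `fraction` of points with the highest anomaly scores as anomalous."""
    n_flagged = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_flagged])
    return [1 if i in flagged else 0 for i in range(len(scores))]


# Hypothetical example: 100 points whose anomaly score grows with the index.
scores = [i / 100 for i in range(100)]
y_true = [0] * 100
for idx in (10, 50, 99):        # assumed positions of the true anomalies
    y_true[idx] = 1

y_pred = flag_top_fraction(scores, fraction=0.02)   # flags indices 98 and 99

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)      # computed within the cutoff
recall = tp / (tp + fn)         # computed over all true anomalies
```

In this toy case one of the two flagged points is a true anomaly and two anomalies fall outside the cutoff, which shows how a fixed cutoff can keep Precision moderate while limiting Recall.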
Finding a proper balance between additional TPs and additional FPs is therefore important, and our estimation is that a cutoff of 2% is a reasonable choice in that respect.

Regarding the evaluation of the performance of the model, apart from the Precision that was described above, we also compute the Recall and the F-Score. In order to compute the Recall, the same cutoff as previously is used and the number of TPs is measured. However, all of the data points are taken into account when calculating the sum of TPs and FNs. Recall is the number of TPs divided by this sum. Finally, the F-Score is computed as described in 2.5.3.

An important parameter of PCA that needs to be properly tuned is the number of components (n_components) that will be used by the model. However, there is no universal tuning approach that works best for every dataset. Therefore, we needed to experiment and decide on an approach that works reasonably well for the datasets that we use. In fact, we ended up using two ways of estimating the number of components that PCA takes into account.

More specifically, for the SWaT and the CCFD datasets, we plot the number of components against the cumulative variance. As the number of components increases, the corresponding cumulative variance increases, as expected. Based on [51] and [52], we can establish a threshold of about 60% to 99% of cumulative variance in order to find the optimal number of components or a number of components that is close to the optimal. In our case, we empirically used a threshold of 95%, which yielded reasonably good performance results. The decision for this threshold was based on visually locating the point above which the cumulative variance stabilizes near 1.0. An example of this approach applied to the normalized SWaT dataset is shown in Figure 3.2. A red line is drawn near the point of stabilization of the cumulative variance.
This line corresponds to 95% of the variance, and 12 components are needed to reach this percentage. This method for choosing the number of principal components is closely related to Kaiser’s stopping rule, according to which only the components that have eigenvalues over 1.00 should be retained, as described in [52].

Figure 3.2: Example of choosing the number of principal components based on the percentage of cumulative variance.

As far as the SKAB and SMD datasets are concerned, a slightly different approach was followed in order to determine the number of components. It is based on a similar logic as the previous one. However, these datasets consist of a number of subsets, so the process needed to be automated. Since we iterate through the subsets of these datasets, it would not be efficient to visually locate the stabilization point of the cumulative variance and set the threshold to an empirically reasonable percentage, as we did before. Thus, in this case, for every subset we iteratively find the point where the cumulative variance does not increase by more than 0.01 between a number of components and the next one. For instance, the algorithm goes through the cumulative variance matrix of a subset of SMD. In this matrix, it observes that the cumulative variance that corresponds to 6 components is equal to 0.9238 and the one that corresponds to 7 components is equal to 0.9289. Therefore, the increase of the cumulative variance between 6 and 7 components is less than 0.01. So, the number of components used in PCA is set equal to 6.

We decided on this threshold of a 0.01 increase following a logic similar to that of the previous method. We try to estimate the point where the cumulative variance stabilizes, which means that, above that point, adding more components will not increase the variance substantially. Therefore, the rest of the components above that point can be ignored.
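Both selection rules can be expressed as short functions over the cumulative explained variance. The sketch below is an illustration, not the thesis implementation: the function names are hypothetical, and the example lists are made up except for the two values 0.9238 and 0.9289 quoted above. In practice the input would typically be the cumulative sum of the explained variance ratios of a fitted PCA.

```python
def components_by_variance_threshold(cumulative_variance, threshold=0.95):
    """Smallest number of components whose cumulative variance reaches `threshold`."""
    for i, value in enumerate(cumulative_variance):
        if value >= threshold:
            return i + 1
    return len(cumulative_variance)


def components_by_min_gain(cumulative_variance, min_gain=0.01):
    """Number of components after which one more component adds at most `min_gain`."""
    for i in range(len(cumulative_variance) - 1):
        gain = cumulative_variance[i + 1] - cumulative_variance[i]
        if gain <= min_gain:
            return i + 1
    return len(cumulative_variance)


# Entry i holds the cumulative variance of the first i + 1 components.
# Only 0.9238 (6 components) and 0.9289 (7 components) come from the text;
# the remaining values are assumed for illustration.
smd_like = [0.40, 0.62, 0.75, 0.84, 0.89, 0.9238, 0.9289, 0.9320]
n_auto = components_by_min_gain(smd_like, min_gain=0.01)

# Made-up curve for the fixed 95% threshold rule.
swat_like = [0.35, 0.60, 0.80, 0.93, 0.96, 0.99]
n_fixed = components_by_variance_threshold(swat_like, threshold=0.95)
```

The gain-based rule needs no visual inspection, which is why it suits datasets that are split into many subsets.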
We empirically found that 0.01 is a reasonable threshold and decided to proceed with using it for the SKAB and SMD datasets.

3.3.5 LSTM-Autoencoder

When implementing the LSTM-Autoencoder, the input data has to be reshaped, as the inputs to the layers of the LSTM architecture are expected to be 3-dimensional. The reshape function of the NumPy library was used to transform the input data into a 3-dimensional array of the structure (samples, timesteps, features). The samples variable corresponds to the number of data points, and the timesteps variable defines how many past data points the model considers in addition to the current data point. Finally, the features variable is the number of features that are taken into account at each time step.

The Python libraries TensorFlow and Keras were used for building the layers of the LSTM-Autoencoder. The model’s architecture is made up of seven layers, five of which are hidden layers. The data is compressed in the first two hidden layers, which reduce the feature sizes to the defined ones. This part of the model is the encoder. The fourth and fifth hidden layers of the model decompress the data, restoring the original size of the features. These layers are the mirror image of the encoder layers, meaning that the layers of the decoder are stacked in reverse order of the layers of the encoder. The third hidden layer is a RepeatVector layer, which works as a "bridge" between the encoder and decoder modules. The purpose of this layer is to replicate the encoded feature vector as many times as the timesteps variable that was defined, so that these vectors can be passed to the next layer, which is the first of the two layers of the decoder module. The final output layer is the TimeDistributed layer, which generates a vector with the same length as the number of features. This is a reconstruction of the input that was originally received by the model.
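A possible Keras realization of this architecture is sketched below. It is a sketch under explicit assumptions rather than the thesis code: the layer widths (16 and 4 units), the L2 regularization strength, timesteps=1, the random input data and the function name build_lstm_autoencoder are all illustrative choices. The ReLU activation, Adam optimizer, MAE loss and early stopping settings follow the values reported in this subsection, and placing the regularizer on the first LSTM layer is one interpretation of regularizing the "input layer".

```python
import numpy as np
from tensorflow.keras import Sequential, callbacks, layers, regularizers


def build_lstm_autoencoder(timesteps, n_features):
    """Input + 2 encoder LSTMs + RepeatVector + 2 decoder LSTMs + output layer."""
    model = Sequential([
        layers.Input(shape=(timesteps, n_features)),
        # Encoder: compresses the features (unit counts are assumptions).
        layers.LSTM(16, activation="relu", return_sequences=True,
                    kernel_regularizer=regularizers.l2(1e-4)),
        layers.LSTM(4, activation="relu", return_sequences=False),
        # Bridge: repeats the encoded vector once per time step.
        layers.RepeatVector(timesteps),
        # Decoder: mirror image of the encoder.
        layers.LSTM(4, activation="relu", return_sequences=True),
        layers.LSTM(16, activation="relu", return_sequences=True),
        # Output: one reconstructed feature vector per time step.
        layers.TimeDistributed(layers.Dense(n_features)),
    ])
    model.compile(optimizer="adam", loss="mae")
    return model


# Hypothetical 2-D data reshaped to (samples, timesteps, features).
X_2d = np.random.default_rng(0).normal(size=(200, 5)).astype("float32")
timesteps = 1
X = X_2d.reshape((X_2d.shape[0], timesteps, X_2d.shape[1]))

model = build_lstm_autoencoder(timesteps, X.shape[2])

# Early stopping on the validation loss, keeping the best weights.
early_stopping = callbacks.EarlyStopping(
    monitor="val_loss", min_delta=1e-4, patience=15, restore_best_weights=True
)
# Training call, not executed here; validation_split is an assumption:
# model.fit(X, X, epochs=100, validation_split=0.1, callbacks=[early_stopping])
```

Because the network is trained to reproduce its own input, the training call passes X both as input and as target; the per-sample reconstruction loss then serves as the anomaly score.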
The activation function used by all the layers is the Rectified Linear Unit (ReLU). The optimizer used when compiling the model is the Adaptive Moment Estimation algorithm (Adam), and the loss function of the model is based on the mean absolute error (MAE). The model is trained for up to 100 epochs using the train dataset, and early stopping is applied. Early stopping terminates the training of the model once it reaches a point beyond which no further meaningful improvement is observed. This is achieved by monitoring the validation loss of the model over the training epochs. Once the validation loss has not decreased by more than 0.0001 for 15 consecutive epochs, the model stops training and keeps the weights that yielded the lowest validation loss. Finally, in order to reduce overfitting of the model, weight regularization is applied to the input layer.

The last step in deciding whether a data point is an anomaly is to define a classification threshold. For this, we followed two different approaches. The logic behind the two approaches is similar, and it is based on the loss distribution computed on the train set.

The first approach consists of visualizing the loss distribution by plotting the loss against the density of the data. Afterwards, we manually choose the classification threshold by analyzing the loss distribution and picking a point where the density of the data becomes very close to zero. Thus, all data points with a loss above this threshold are classified as anomalies. By using this method, we aim to set the threshold above the noise level of the data, so that we avoid classifying normal data points as anomalies. This approach is followed for the SWaT and CCFD datasets. An example of this approach is shown in Figure 3.3, where the threshold is manually set equal to 0.225.

When it comes to the SKAB and SMD datasets, we automate the process of deciding on the threshold.
It is set at the 0.98 quantile of the loss, i.e. the value below which 98% of the loss values fall. For finding this point, we use the quantile function provided by the NumPy library. The reason for following a different approach for the SKAB and SMD datasets is that they consist of a number of subsets, unlike SWaT and CCFD. Therefore, we need to automate the process of finding the threshold for each of the subsets of SKAB and SMD. We iterate through the subsets and compute the 0.98 quantile of the loss for each one of them. This automation might yield slightly lower performance results than manually picking the threshold, as was done for SWaT and CCFD.

Figure 3.3: Example of loss distribution plot for choosing the classification threshold.

3.4 Evaluation Method

After preprocessing the data, selecting the appropriate features and training the selected models, we evaluate their performance based on the evaluation metrics Precision, Recall and F-Score. As previously mentioned, we train on the train dataset, which does not contain any anomalies, and evaluate on the test dataset, which contains some anomalous data points. Therefore, the models first learn the normal behavior of the system based on the anomaly-free train dataset. Afterwards, when they encounter a data point of the test dataset that does not fit with the rest and appears abnormal, this data point is classified as anomalous.

In order to evaluate the performance of the models, we need to check how many of the data points were correctly classified and, more importantly, how many anomalous data points were correctly classified as anomalous. That is because the evaluation metrics that we use depend on the number of correctly classified anomalous data points (i.e., TPs). As we described previously, most of the datasets were divided into data windows. Therefore, when evaluating the models, we need to take that into account.
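A window-level evaluation of this kind can be sketched as follows. The sketch assumes the labelling rule used in this section, namely that a window counts as anomalous if it contains at least one anomalous row; the 20-row window size mirrors the SMD setting from Section 3.2, while the label sequences, predictions and function names are hypothetical.

```python
def window_labels(row_labels, window_size):
    """Label each window 1 if it contains at least one anomalous row, else 0."""
    return [
        int(any(row_labels[start:start + window_size]))
        for start in range(0, len(row_labels), window_size)
    ]


def precision_recall_fscore(y_true, y_pred):
    """Precision, Recall and F-Score for the anomalous (positive) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore


# Hypothetical row labels: four windows of 20 rows each.
rows = [0] * 20 + [0] * 19 + [1] + [0] * 20 + [1] * 20
y_true = window_labels(rows, window_size=20)    # [0, 1, 0, 1]
y_pred = [0, 1, 1, 0]                           # assumed model output per window

precision, recall, fscore = precision_recall_fscore(y_true, y_pred)
```

Here a single anomalous row is enough to make the second window anomalous, so a model is rewarded for flagging the window even if it cannot pinpoint the exact row.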
The labels of the data windows are determined by whether they contain any anomalies or not. Thus, a window that contains at least one anomalous row is labeled as anomalous.

Furthermore, two of the publicly available datasets (SMD and SKAB) consist of a number of subsets, so the evaluation of the models on these datasets is slightly different than on the others. The models are applied iteratively to each of the subsets and their performance is measured with the evaluation metrics mentioned above. After this iteration is completed, we compute the average of the evaluation metrics over all subsets. In that way, we end up with the final evaluation metrics representing the performance of the models on the whole dataset.

As far as the dataset of the company is concerned, we do not have a fast and automatic way of measuring the performance of the models on it. The reason for this is that we do not have the labels of the dataset of the company, in contrast to the publicly available datasets, where we were at least provided with the labels of the test set. In fact, this is a real-world scenario where thousands of log files are produced within a few hours, so manually assigning labels to all of them would require a lot of valuable time and effort from a number of people. The evaluation here is therefore done as follows. We select the model that had the most stably satisfactory performance on the publicly available datasets. By stably satisfactory, we mean the model that achieved decent evaluation metric results on all four datasets, without large fluctuations in its performance. After selecting the model, we fit it with data from the company that corresponds to a day during which no anomalies were observed. Then, we apply the model to the dataset of the company that corresponds to another period of time, in which we want to detect anomalies. The model returns the labels that were assigned to the data points of this dataset.
Afterwards, we provide the experts of Centiro with these labels, in order for them to go through the logs and manually check which of the data points flagged by the model are actual anomalies. In this way, we compute the evaluation metric results of the model’s performance on the data of the company. The model that had the most stably satisfactory performance on the publicly available datasets is the Autoencoder.

Moreover, checking the consistency of the results is needed in order to reduce the factor of randomness in the previously observed performance results of the models. Additionally, we perform this check in order to measure the performance of the models on sets of data with different anomaly rates and compare their results. The consistency check takes place on the standardized SWaT dataset, both with and without time window division. The SWaT dataset is the largest of all and the only one where we can divide the test set into ten parts and still have each part be large enough for a valid evaluation. Apart from that, the SWaT dataset is the only one where each of the ten parts of the divided test set can contain a decent number of anomalies without the need to change the order of the data. It is a time-series dataset, so preserving the order of the data is important for the proper functioning of the model. In the publicly available datasets, we can see that anomalous data points appear in bunches next to each other after periods of normal behavior. In the SWaT, SKAB and SMD datasets, the data was gathered continuously for some period of time. For instance, in the case of SWaT, the data was gathered continuously for 11 days. Therefore, changing the order of the data could decrease the models’ chances of finding abnormalities.
This is also the case in the log dataset of the company, where the order of the data needs to be preserved as well.

After dividing the test set into ten parts, we train the models on the train set and evaluate their performance iteratively on each of the ten parts. A visualization of this division of the test set and the iterative evaluation is shown in Figure 3.4.

Figure 3.4: Division of the test set into 10 subsets and iterative evaluation on each one of them.

It is observed that, generally, the performance of the models increases as the number of anomalies in the test set increases. However, this is not a universal rule but rather an observation, which can also be seen later in the Results section and specifically in Figure 4.5. By performing this consistency check, we confirmed that the AE model has the most stably satisfactory performance, and therefore it is used for detecting anomalies in the data of the company.

3.5 Used Software and Hardware

For the needs of this thesis, Python 3.7 was used, and the main integrated development environment was Jupyter Notebook. The Python libraries that we used are NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn, TensorFlow and Keras. NumPy is a library that includes high-level mathematical functions for the manipulation of multidimensional arrays and matrices [53]. Pandas offers support for the manipulation of numerical tables and time series [54]. Matplotlib is a library used for data visualization in Python, and it is well integrated with NumPy and Pandas. Seaborn extends Matplotlib with additional methods that support the creation of explanatory graphics, and it is well integrated with Pandas DataFrames [55]. Scikit-learn (or sklearn) is a machine learning library that includes various classification, regression and clustering algorithms [56]. TensorFlow is an open-source library that is used for machine learning and artificial intelligence related tasks.
It is mainly used for creating and training deep neural networks [57]. Keras is a library that supports the building of deep neural networks and runs on top of TensorFlow [58].

The training and testing of the models were performed on laptops provided by Centiro. Their specifications include an Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz, 32GB of DDR4 memory and an Nvidia Quadro T1000 GPU.

4 Results

In this section, we present the results of the four models on both the publicly available datasets and the dataset of the company. The performance results vary depending on the model, the dataset and the preprocessing method (i.e. normalization, standardization, with/without time window division). At first, the results of applying the models to the publicly available datasets are presented. Afterwards, the performance of the selected Autoencoder model on the dataset of the company is described.

The following tables present the performance results of the models on each publicly available dataset. The histograms that follow visualize these results in order to ease the comparison of the performance of the models. As previously mentioned, Precision, Recall and F-Score are the evaluation metrics used in this project. The most important of the three metrics that we take into account in order to evaluate the stability of the performance of the models is the F-Score. The reason is that the F-Score is the harmonic mean of Precision and Recall, as described in 2.5.3, and we therefore found it to be the most suitable metric for comparing performance among the models.

As described in Section 3.4 regarding the evaluation method, we need to check the consistency of the performance results that we get in order to reduce the factor of randomness. In Subsection 4.5, we provide an example of this consistency check performed on the AE model on the standardized SWaT dataset divided into time windows.
Later, we describe the evaluation of the AE model on the dataset of the company after dividing it into time windows. The process of its evaluation differs from the one performed on the publicly available datasets, as we are not provided with any labels for the dataset of the company.

Table 4.1: Performance results on SWaT dataset.

LOF
Case                              Precision   Recall   F-Score
Standardized & No time windows    12.2%       99.5%    21.7%
Normalized & No time windows      12.2%       99.9%    21.8%
Standardized & time windows       12.4%       99.2%    22.1%
Normalized & time windows         12.4%       98.9%    22.1%

IF
Case                              Precision   Recall   F-Score
Standardized & No time windows    16.4%       88.3%    27.7%
Normalized & No time windows      23.6%       81%      36.5%
Standardized & time windows       29.2%       75.1%    42%
Normalized & time windows         23%         81.3%    35.6%

PCA
Case                              Precision   Recall   F-Score
Standardized & No time windows    98.7%       16.3%    27.9%
Normalized & No time windows      97.8%       16.1%    27.7%
Standardized & time windows       97.3%       16%      27.5%
Normalized & time windows         98.7%       16.2%    27.8%

LSTM AE
Case                              Precision   Recall   F-Score
Standardized & No time windows    81.4%       62.7%    71%
Normalized & No time windows      30.1%       72.2%    42.5%
Standardized & time windows       82.1%       65%      72.6%
Normalized & time windows         62.5%       63%      62.8%

Table 4.2: Performance results on SKAB dataset.
LOF
Case                              Precision   Recall   F-Score
Standardized & No time windows    48.4%       78.9%    51.7%
Normalized & No time windows      50.6%       79.2%    53.7%
Standardized & time windows       49.5%       75.2%    52.1%
Normalized & time windows         53.8%       78.4%    56.6%

IF
Case                              Precision   Recall   F-Score
Standardized & No time windows    25.7%       72.9%    32.2%
Normalized & No time windows      27.3%       74.3%    34.4%
Standardized & time windows       20.8%       46.2%    24.6%
Normalized & time windows         23.1%       49.4%    27.3%

PCA
Case                              Precision   Recall   F-Score
Standardized & No time windows    71.7%       16.6%    26.9%
Normalized & No time windows      69.2%       15.9%    25.8%
Standardized & time windows       77.5%       19.4%    31%
Normalized & time windows         76.5%       19.1%    30.6%

LSTM AE
Case                              Precision   Recall   F-Score
Standardized & No time windows    68.5%       41.5%    44.4%
Normalized & No time windows      66%         41.7%    43.4%
Standardized & time windows       65.5%       48.5%    50.1%
Normalized & time windows         66%         46.5%    48%

Table 4.3: Performance results on SMD dataset.

LOF
Case                              Precision   Recall   F-Score
Standardized & No time windows    69.3%       66.7%    10.6%
Normalized & No time windows      75.8%       65.1%    10.8%
Standardized & time windows       55.3%       22.5%    20.7%
Normalized & time windows         62.1%       19.8%    20%

IF
Case                              Precision   Recall   F-Score
Standardized & No time windows    54.4%       19.1%    21.3%
Normalized & No time windows      55.8%       20%      22.1%
Standardized & time windows       53.5%       39.8%    32.3%
Normalized & time windows         55.2%       39.8%    32.9%

PCA
Case                              Precision   Recall   F-Score
Standardized & No time windows    40.9%       29.3%    34.2%
Normalized & No time windows      40.8%       29%      33.9%
Standardized & time windows       52.5%       31.5%    39.4%
Normalized & time windows         56.4%       33.7%    42.2%

LSTM AE
Case                              Precision   Recall   F-Score
Standardized & No time windows    59%         42.5%    46.5%
Normalized & No time windows      59.9%       40.9%    41.3%
Standardized & time windows       61.2%       58%      52.5%
Normalized & time windows         57.5%       55.6%    45.6%

Table 4.4: Performance results on CCFD dataset.
LOF
Case                              Precision   Recall   F-Score
Standardized & No time windows    2.7%        0.3%     0.5%
Normalized & No time windows      5.3%        0.7%     1.2%

IF
Case                              Precision   Recall   F-Score
Standardized & No time windows    69.3%       9.5%     16.8%
Normalized & No time windows      64%         9%       15.5%

PCA
Case                              Precision   Recall   F-Score
Standardized & No time windows    5%          77.3%    9.6%
Normalized & No time windows      5%          76%      9.4%

LSTM AE
Case                              Precision   Recall   F-Score
Standardized & No time windows    11.4%       45.3%    18.2%
Normalized & No time windows      22.8%       44%      30%

4.1 SWaT dataset

As can be seen in Figure 4.1, the Autoencoder model yields the highest F-Score of the four models for all preprocessing variations of the dataset. The best performance of the AE model is observed on the standardized dataset. More specifically, the standardized dataset that was divided into time windows led to slightly better results than the one that had not undergone such window division. The second best overall performing model based on the F-Scores is the Isolation Forest. This model also performs best on the standardized dataset that has been divided into time windows.

Figure 4.1: Performance results on the SWaT dataset.

4.2 SKAB dataset

From Figure 4.2, we can observe that the LOF model yields slightly better results than the AE model, which has the second best performance. More specifically, the performance of the LOF model was best on the dataset that was normalized and divided into time windows. In contrast, the AE model yielded better results on the standardized dataset. This is the only case where LOF performed better than the rest of the models. However, the results of the AE model on this dataset are still reasonably good.

Figure 4.2: Performance results on the SKAB dataset.

4.3 SMD dataset

It can be seen in Figure 4.3 that the AE model performs best of the four models for every preprocessing variation of the dataset.
The best performance of the AE model is observed on the dataset that has been standardized and divided into time windows. The model with the second best overall performance on this dataset is PCA, which achieves its best F-Score on the dataset that is normalized and divided into time windows.

Figure 4.3: Performance results on the SMD dataset.

4.4 CCFD dataset

From Figure 4.4, we can see that the models did not achieve performance results as high as those previously observed on the other datasets. The peculiarity of the CCFD dataset is that it had previously undergone a transformation through a PCA algorithm. Furthermore, it has the lowest anomaly rate of all datasets, as the anomalies make up only 0.17 percent of all data points. The model that yielded the highest F-Score on this dataset is the AE model, which achieved its best results on the normalized dataset. The model with the second best performance on this dataset is the Isolation Forest.

Figure 4.4: Performance results on the CCFD dataset.

4.5 Consistency check

As mentioned in the last part of Section 3.4 about the evaluation method, we perform a consistency check in order to ensure that the observed results are not dominated by randomness. In the example shown in Figure 4.5, we plot, in ascending order, the proportion of anomalies included in each subset of the SWaT dataset against the corresponding F-Score that the AE model yielded for that subset. The dataset that was used for this consistency check is the standardized SWaT dataset. We can observe that, generally, the larger the proportion of anomalies in the subset, the higher the F-Score achieved by the model.

In this example, this observation is valid for almost all subsets, with a few exceptions, such as the second to last one in the left graph. It includes around 10% anomalies and its corresponding F-Score is lower than that of the previous subset, which includes 7.8% anomalies. This consistency check was also performed for the rest of the models. However, the results of the other models were mostly worse than those of the AE and, in some cases, their F-Score was 0.

Figure 4.5: Performing consistency check on the AE model.

4.6 Additional Precision-Recall Curve test

In order to gain a better view of the performance of the models, we include here an additional test, which is based on the tradeoff between Precision and Recall. The area under the Precision-Recall curve represents the Average Precision of the model. By using this area, we obtain a score that does not depend on a specific value of the classification threshold. We perform this test on the standardized SWaT dataset for the same reasons that were mentioned in Section 3.4 regarding the consistency check. The test was performed once on the original standardized SWaT dataset and once on the same dataset divided into time windows.

From the definition of Precision described in Subsection 2.5.1, we can conclude that the Average Precision shows how accurate a model is when classifying actual anomalies as such without classifying many normal data points as anomalies (FPs). Therefore, the higher the Average Precision of the model, the better its ability to correctly classify positives (anomalies).

Figure 4.6: Precision-Recall Curve test on the standardized SWaT dataset, without the use of time windows.

From Figure 4.6, we can see that the model that yielded the best performance results on the standardized SWaT dataset that has not been divided into time windows is the AE model, with an Average Precision of about 0.75. The second best performance was achieved by the PCA model, which yielded a slightly lower Average Precision of about 0.72.
Similarly to the previous test, the LOF and IF models yielded much lower results, with an Average Precision of 0.08 and 0.07 respectively.

We can observe from Figure 4.7 that the model that achieved the best results on the standardized SWaT dataset that has been divided into time windows is the PCA model, with an Average Precision of about 0.71. The AE model yielded the second best results, with a lower Average Precision of about 0.62. The other two models achieved relatively low scores, namely 0.12 and 0.07 for the LOF and IF models respectively.

Figure 4.7: Precision-Recall Curve test on the standardized SWaT dataset, with the use of time windows.

4.7 Dataset of Centiro

The model with the most stable satisfactory performance was the LSTM-Autoencoder, as mentioned in Section 3.4. Therefore, this model was applied to the dataset of the company, and here we describe its results.

From the evaluation performed by the experts of the company, we obtained specific time periods during which it is valid to consider all the detected anomalies as actual anomalies. Based on that assumption, we were able to confirm the correct predictions (TPs) of the model. Additionally, we were able to spot the FPs, which are the logs that the model classified as anomalies but that are not actual anomalies in the system. However, we were unable to obtain the number of TNs and FNs. Since most of the data points were classified as non-anomalies, identifying TNs and FNs in such a large amount of data would require a great deal of time and effort from the company. As a result, we are able to compute the Precision of the model, but not the Recall or the F-Score.

More specifically, we divided the standardized dataset of the company into time windows.
The rationale for this choice is that, in the majority of cases, the models yielded better results on the standardized datasets. We first created 5-second time windows, then 30-second time windows and lastly 1-minute time windows, as mentioned in Subsection 3.2.4. Below, we refer to these time windows as data points, since they are treated as such.

The results are the following. For the 5-second time windows, the total number of data points is 8641. Of these, 63 were classified as anomalies, 39 of which were found to be actual anomalies (TPs). For the 30-second time windows, the total number of data points is 1441; 13 were classified as anomalies and, of these, 9 were found to be actual anomalies. Lastly, for the 1-minute time windows, the total number of data points is 721. There were 9 detected anomalies, 6 of which were TPs. These results are presented in Table 4.5, together with the Precision of each test.

Table 4.5: Performance results on the dataset of the company.

LSTM AE time window division   5 seconds   30 seconds   1 minute
Data points                    8641        1441         721
Detected anomalies             63          13           9
True Positives                 39          9            6
Precision                      61.9%       69.2%        66.7%

In the figures below, we present visualized model outputs, showing the computed loss and the set anomaly threshold between 9:00 and 21:00. From these graphs we can clearly see four peaks, around 10:00, 12:30, 15:00 and 19:00, which recur across the performed tests. The threshold for the dataset of the company was decided based on both the value computed by the previously mentioned quantile function and a manual investigation of the loss distribution with respect to the density of the data.

Figure 4.8: AE output on 5 seconds time windows.

Figure 4.9: AE output on 30 seconds time windows.

Figure 4.10: AE output on 1 minute time windows.
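The thresholding and Precision computation described in this section can be illustrated with a minimal sketch. It assumes that the quantile-based threshold is the (1 - anomaly rate) quantile of the reconstruction losses; the exact quantile function used in the thesis is not restated here, and the function names are hypothetical. The Precision values are recomputed from the counts in Table 4.5.

```python
import numpy as np

def quantile_threshold(losses, anomaly_rate):
    """Loss value above which roughly `anomaly_rate` of the points fall (assumed rule)."""
    return float(np.quantile(losses, 1.0 - anomaly_rate))

def flag_anomalies(losses, threshold):
    """Boolean mask of data points whose loss exceeds the threshold."""
    return np.asarray(losses) > threshold

def precision(true_positives, detected):
    """Precision = TP / (TP + FP), where `detected` = TP + FP."""
    return true_positives / detected if detected else 0.0

# Synthetic losses, only to show the thresholding step.
losses = np.arange(1000.0)
threshold = quantile_threshold(losses, anomaly_rate=0.01)
flags = flag_anomalies(losses, threshold)

# (true positives, detected anomalies) per window size, taken from Table 4.5.
table_4_5 = {"5 seconds": (39, 63), "30 seconds": (9, 13), "1 minute": (6, 9)}
precisions = {name: precision(tp, det) for name, (tp, det) in table_4_5.items()}
mean_precision = sum(precisions.values()) / len(precisions)
```

Recall and F-Score cannot be computed in the same way, because the number of FNs is unknown for the dataset of the company.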
5 Discussion

In this section, we discuss the previously presented results across the different approaches that have been used. Furthermore, future work on this project is discussed.

5.1 Approaches & Results

Based on the results provided in the section above, we can see that the LSTM-AE outperformed the rest of the models in most cases. The only case where the Autoencoder did not achieve the highest scores was the SKAB dataset, on which LOF yielded on average about 5% better results. The largest difference was seen in the case where the data was normalized and not divided into time windows. We suspect that this phenomenon is caused by the transformation of the data, which is an adjustment in the anomaly rate. For all the other datasets, Local Outlier Factor is the most poorly performing model; we therefore believe that the structure of the data after this transformation may better suit the nature of this density-based algorithm. Because the performance of LOF in this particular case drew our attention, we performed one additional test using the original structure of the SKAB dataset. In that test, LOF performed much worse than LSTM-AE. This confirmed our reasoning for choosing LSTM-AE as the final model to be used on the IIS log dataset.

In Section 3.3, we explained the reasoning behind the choice of values for the parameters and variables that are connected to the anomaly rate of the datasets. One such parameter is the contamination in the LOF and IF models. For the PCA model, the cutoff variable is connected to the estimation of the anomaly rates of the datasets. For the AE model, the classification threshold that decides whether a data point is anomalous is also connected to our estimation of the anomaly rate.
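As an illustration of how the estimated anomaly rate enters the LOF and IF models, the following is a minimal sketch using the `contamination` parameter of scikit-learn's `LocalOutlierFactor` and `IsolationForest`. The synthetic data, the helper name `flag_with_contamination`, and the `n_neighbors=20` setting (the scikit-learn default) are assumptions made for the example and are not the configuration used in the thesis.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

ASSUMED_ANOMALY_RATE = 0.01  # estimate of the anomaly rate, no labels involved

def flag_with_contamination(X, rate=ASSUMED_ANOMALY_RATE, seed=0):
    """Return boolean anomaly flags from LOF and IF for the same data."""
    lof = LocalOutlierFactor(n_neighbors=20, contamination=rate)
    iforest = IsolationForest(contamination=rate, random_state=seed)
    lof_flags = lof.fit_predict(X) == -1       # scikit-learn marks outliers with -1
    if_flags = iforest.fit_predict(X) == -1
    return lof_flags, if_flags

# Synthetic stand-in for a scaled dataset (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
lof_flags, if_flags = flag_with_contamination(X)
```

In both models, `contamination` only moves the cut on the anomaly scores, so each model flags roughly that fraction of the data points regardless of how many anomalies are actually present.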
If we wanted to measure the best possible performance of the models using the information provided by the labels of the datasets, we would compute their anomaly rates and adjust the parameters of the models accordingly. This approach would lead to better results; however, it could not be applied in a real-life scenario of unsupervised learning, where the available data is unlabeled. Experimentally estimating the anomaly rate of the dataset of the company would require relevant information from the experts, which, in most cases, is hard to obtain.

For the publicly available datasets, we empirically set the anomaly rate to 0.01. As a result, the models expect a lower number of anomalies than the actual number for three datasets and a higher number of anomalies for the remaining one. We could experimentally estimate the anomaly rates of the datasets by running the models for different values of the anomaly rate and observing their results. However, this would require the use of the available labels in order to compute the results. Since, as explained above, we do not want to make use of the labels, we excluded this option of experimentally estimating the anomaly rate individually for every dataset.

In the case of the CCFD dataset, we are dealing with a problem commonly known as extreme rare event classification. This task refers to highly unbalanced datasets, where the positive class makes up less than 1% of the total data points. This type of task is generally difficult to deal with. Deep learning has been extensively used in extreme rare event classification in recent years [59]. Nevertheless, LSTM-AE still achieved reasonably good results considering the above-mentioned circumstances.
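The effect of a fixed anomaly-rate estimate on a dataset such as CCFD can be made concrete with a short worked example. It rests on a simplifying assumption that is not part of the thesis: a detector flags exactly the assumed fraction of data points, and in the best case every flag that can be correct is correct. The function name is hypothetical.

```python
def best_case_scores(actual_rate, assumed_rate):
    """Upper bounds on Precision, Recall and F-Score when exactly
    `assumed_rate` of the points are flagged (simplifying assumption)."""
    # Best case: as many flagged points as possible are actual anomalies.
    caught = min(actual_rate, assumed_rate)
    precision = caught / assumed_rate
    recall = caught / actual_rate
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# CCFD: anomalies are 0.17 percent of the data, the assumed rate is 0.01.
ccfd_precision, ccfd_recall, ccfd_f = best_case_scores(0.0017, 0.01)

# Opposite case: a dataset with more anomalies than assumed (hypothetical 10%).
dense_precision, dense_recall, dense_f = best_case_scores(0.10, 0.01)
```

Under this assumption, Precision on CCFD cannot exceed 0.17 and the F-Score cannot exceed about 0.29, even for a perfect ranking of the data points. When the assumed rate underestimates the actual rate, the bound falls on Recall instead.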
Having gathered the results of all the performed test variations, we can see that the choice of scaling, standardization or normalization, has little impact on the models’ performance. The only significant difference in the results between normalization and standardization can be observed in the performance of LSTM-AE applied to the CCFD dataset. Nonetheless, the majority of the tests performed better when the data was standardized.

Regarding the consistency check described in Section 4.5, we can observe that the selected AE model shows a degree of unreliability, since its performance depends on the anomaly rate of the dataset. However, this observation is valid for the rest of the models as well. Even though the AE model is inconsistent in that respect, it still yielded better results than the rest of the models. Therefore, after performing this consistency check, we became more confident in our choice of model, since the AE model had the highest results when tested on the ten subsets of the test set.

In terms of the results on the dataset of the company, we can conclude that LSTM-AE was successful in detecting some of the actual anomalies of the system. The developed model was able to properly identify the time periods during which the anomalies occurred, reducing the amount of time required to identify the exact causes of the issue. Once the time windows in which anomalies emerged had been established, we were able to provide the company with a file listing all the applications that were called during those windows. Although our model was unable to identify the particular application that caused the reported anomaly, we were able to reduce the number of applications that needed to be investigated, saving the experts of the company time.
From the provided outputs of the model for the 3 different time window divisions, it can be seen that as the time window size increases, the number of peaks in the graph decreases. We can clearly observe 6 peaks that cross the defined threshold for the largest time window, that is 1 minute, but only 4 of them, between 12:00 and 16:00, are actual anomalies. We can conclude, based on our investigation of the results, that codes of levels 3, 4, and 5 have the biggest impact on the corresponding value of the computed loss, where codes 4 and 5 are the messages defined as error messages on the client and server sides, respectively.

5.2 Future Work

The work that can be done in order to improve and evolve this project includes the steps described below.

First of all, we find it reasonable to assume that the amount of data we fit the model with can affect its performance. Therefore, in the future we want to fit the model with data that corresponds to more than one day of continuously produced log material. This would require more hardware resources. Apart from that, we would need to know when a period of normal behavior of the system occurred over several consecutive days. This is because we want to train the model on a larger amount of anomaly-free data and test it on data that might include anomalies.

Furthermore, another step of future work would be to modify the model so that it identifies the exact applications that triggered the anomalies. This would ease the process of its evaluation and let the company know which applications require extra attention or updates. In order to do that, a different approach to preprocessing the data would be needed. To specify the exact application that was called in a certain log, the division of the dataset into time windows through the aggregation of multiple data rows would no longer be applicable. Therefore, the selection of new features would be necessary.
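To make concrete why aggregation into time windows discards per-row information such as the called application, the following is a minimal sketch of window aggregation with pandas. The column names `timestamp`, `sc_status` and `application`, the sample rows, and the choice of counting requests per status-code level as window features are assumptions for illustration only; the actual feature construction is described in Subsection 3.2.4.

```python
import pandas as pd

def window_status_counts(logs: pd.DataFrame, window: str) -> pd.DataFrame:
    """Count requests per sc-status level (2xx, 4xx, ...) in each time window."""
    indexed = logs.set_index("timestamp")
    status_level = indexed["sc_status"] // 100            # e.g. 404 -> 4, 500 -> 5
    one_hot = pd.get_dummies(status_level, prefix="class").astype(int)
    return one_hot.resample(window).sum()                 # empty windows become zero rows

# Hypothetical log rows, not taken from the dataset of the company.
logs = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-01-01 09:00:00", "2022-01-01 09:00:03",
        "2022-01-01 09:00:07", "2022-01-01 09:00:31",
    ]),
    "sc_status": [200, 500, 404, 200],
    "application": ["app_a", "app_b", "app_a", "app_c"],
})

counts_5s = window_status_counts(logs, "5s")
counts_30s = window_status_counts(logs, "30s")
```

Each row of the result describes a whole window, so the `application` value of the individual log rows is no longer available to the model, which is why different features would be needed to attribute an anomaly to a specific application.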
The investigation of different models and the modification of the current ones would be one more step for future work. More specifically, modifying the architecture of the AE model could lead to a change in the performance of the model. When it comes to applying different models, our next step would be to try a different architecture of the Autoencoder model that includes convolutional layers.

Lastly, the development of a User Interface (UI) would allow the model to be used by more people who are not familiar with Python programming. A user-friendly anomaly detection application could be beneficial for any organization that keeps track of the logs produced by its software, as its use would require little training for technical engineers or other employees who have little knowledge of the subject.

6 Conclusion

This project investigated the use of four unsupervised machine learning models for the task of anomaly detection, namely Local Outlier Factor, Isolation Forest, Principal Component Analysis and LSTM-Autoencoder. For the purposes of this thesis, we made use of four publicly available datasets as well as a dataset provided by Centiro, the company with which the project was carried out in collaboration. The datasets originate from various anomaly detection related backgrounds and share some common characteristics, which made it possible to compare the performance of the models across them.

The reason for using public datasets lies mainly in the fact that the dataset of the company was not annotated. Because of that, the models could not be tested on it in order to compare their performance. Therefore, the models were first tested on the four public datasets, namely SWaT, SKAB, SMD and CCFD. Then, the model with the best and most stable performance on these datasets was used to detect anomalies in the dataset of Centiro.
Two variations of data scaling were used on the datasets, namely standardization and normalization. Also, most of the datasets were further divided into time windows.

By performing the tests described above, we were able to choose the most successful model, which was the LSTM-Autoencoder. This model outperformed the rest in the majority of tests and, therefore, it was selected as the model to be applied to the dataset of the company. The evaluation scores that the model achieved were not the highest for all datasets; however, the AE model stood out from the rest in combining performance stability with good results. Thus, we found it reasonable to use it for the final experiment. In addition, based on the performed tests, the anomaly rate of the dataset is observed to have an impact on the performance of the models.

The use of the LSTM-AE model in the final tests on the dataset of the company provided us with the time periods of detected potential anomalies. After the investigation of these time periods by the experts of the company, we were able to evaluate the performance of the model. The average Precision of the model was about 66%, a result that we find reasonably good.

The sc-status feature of the IIS logs proved to be a reasonable choice for finding patterns that could lead the model to detect abnormalities in the system. However, the investigation of different features as well as preprocessing techniques would be worth considering in a future continuation of this project.

Bibliography

[1] Omar, S., Ngadi, A. and Jebur, H.H., 2013. Machine learning techniques for anomaly detection: an overview. International Journal of Computer Applications, 79(2).
URL https://www.researchgate.net/profile/Salima-Benqdara/publication/325049804_Machine_Learning_Techniques_for_Anomaly_Detection_An_Overview/links/5af3569b4585157136c919d8/Machine-Learning-Techniques-for-Anomaly-Detection-An-Overview.pdf
[2] Zhu, J., He, P., Fu, Q., Zhang, H., Lyu, M.R. and Zhang, D., 2015, May. Learning to log: Helping developers make informed logging decisions. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering (Vol. 1, pp. 415-425). IEEE. URL https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7194593
[3] Jiang, T., Gradus, J.L. and Rosellini, A.J., 2020. Supervised machine learning: a brief primer. Behavior Therapy, 51(5), pp.675-687. URL https://www.sciencedirect.com/science/article/pii/S0005789420300678#!
[4] Mahesh, B., 2020. Machine learning algorithms-a review. International Journal of Science and Research (IJSR).[Internet], 9, pp.381-386. URL https://www.researchgate.net/profile/Batta-Mahesh/publication/344717762_Machine_Learning_Algorithms_-A_Review/links/5f8b2365299bf1b53e2d243a/Machine-Learning-Algorithms-A-Review.pdf
[5] Mandagondi, L.G., 2021. Anomaly Detection in Log Files Using Machine Learning Techniques. URL https://www.diva-portal.org/smash/get/diva2:1534187/FULLTEXT02
[6] Sathya, R. and Abraham, A., 2013. Comparison of supervised and unsupervised learning algorithms for pattern classification. International Journal of Advanced Research in Artificial Intelligence, 2(2), pp.34-38. URL https://www.researchgate.net/publication/273246843_Comparison_of_Supervised_and_Unsupervised_Learning_Algorithms_for_Pattern_Classification
[7] Granlund, O., 2019. Unsupervised anomaly detection on log-based time series data. URL https://www.diva-portal.org/smash/get/diva2:1377830/FULLTEXT01.pdf
[8] Reddy, Y.C.A.P., Viswanath, P. and Reddy, B.E., 2018. Semi-supervised learning: A brief review. Int. J. Eng. Technol, 7(1.8), p.81. URL https://www.researchgate.net/profile/Eswara-B/publication/324050146_Semi-supervised_learning_a_brief_review/links/5b5c02f8458515c4b24e2b15/Semi-supervised-learning-a-brief-review.pdf
[9] Wirehed, A. and Suhren Gustafsson, A., 2021. Log Classification using NLP Techniques: Data-Driven Fault Categorization of Multimodal Logs using Natural Language Processing Techniques. URL https://odr.chalmers.se/bitstream/20.500.12380/302416/1/Masters_Thesis_Adam%20Wirehed%20och%20Adam%20Suhren%20Gustafsson%20210604.pdf
[10] Khalid, S., Abbas, H., Pasha, M. and Raza, A., 2012. Securing Internet Information Services (IIS) configuration files. International Conference for Internet Technology and Secured Transactions. URL https://ieeexplore.ieee.org/abstract/document/6470913
[11] G2 Crowd. "What is Graylog?" URL https://web.archive.org/web/20190326193941/https://www.g2crowd.com/products/graylog/details
[12] InfoWorld. "10 Splunk alternatives for log analysis". URL https://www.infoworld.com/article/3063614/10-splunk-alternatives-for-log-analysis.html
[13] Cheng, Z., Zou, C. and Dong, J., 2019, September. Outlier detection using isolation forest and local outlier factor. In Proceedings of the conference on research in adaptive and convergent systems (pp. 161-168). URL https://dl.acm.org/doi/pdf/10.1145/3338840.3355641
[14] Alghushairy, O., Alsini, R., Soule, T. and Ma, X., 2020. A Review of Local Outlier Factor Algorithms for Outlier Detection in Big Data Streams. Big Data Cogn. Comput. 2021, 5, 1. URL https://doi.org/10.3390/bdcc5010001
[15] Local outlier factor (2021) Wikipedia. URL https://en.wikipedia.org/wiki/Local_outlier_factor
[16] Liu, F.T., Ting, K.M. and Zhou, Z.H., 2008, December. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining (pp. 413-422). IEEE. URL https://ieeexplore.ieee.org/document/4781136
[17] Hariri, S., Kind, M.C. and Brunner, R.J., 2021, April. Extended Isolation Forest. In IEEE Transactions on Knowledge and Data Engineering, IEEE. URL https://ieeexplore.ieee.org/abstract/document/8888179
[18] Lee, Y.J., Yeh, Y.R. and Wang, Y.C.F., 2012. Anomaly detection via online oversampling principal component analysis. IEEE Transactions on Knowledge and Data Engineering, 25(7), pp.1460-1470. URL https://ieeexplore.ieee.org/document/6200273
[19] Principal component analysis (February 2022) Wikipedia. URL https://en.wikipedia.org/wiki/Principal_component_analysis
[20] Imayakumar, A.A., Dubey, A. and Bose, A., 2020, February. Anomaly detection for primary distribution system measurements using principal component analysis. In 2020 IEEE Texas Power and Energy Conference (TPEC) (pp. 1-6). IEEE. URL https://ieeexplore.ieee.org/document/9042509
[21] Patel, A., Hands-On Unsupervised Learning Using Python, O’Reilly. URL https://www.oreilly.com/library/view/hands-on-unsupervised-learning/9781492035633/ch04.html
[22] Powell, V. and Lehe, L., 2015. Principal component analysis. URL https://setosa.io/ev/principal-component-analysis/
[23] Lee, D., 2017, December. Anomaly detection in multivariate non-stationary time series for automatic DBMS diagnosis. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA) (pp. 412-419). IEEE. URL https://ieeexplore.ieee.org/abstract/document/8260666
[24] Sakurada, M. and Yairi, T., 2014, December. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd workshop on machine learning for sensory data analysis (pp. 4-11). URL https://dl.acm.org/doi/abs/10.1145/2689746.2689747
[25] Zhou, C. and Paffenroth, R.C., 2017, August. Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 665-674). URL https://dl.acm.org/doi/abs/10.1145/3097983.3098052
[26] Mohammadi, M., Al-Fuqaha, A., Sorour, S. and Guizani, M., 2018. Deep learning for IoT big data and streaming analytics: A survey. IEEE Communications Surveys & Tutorials, 20(4), pp.2923-2960. URL https://ieeexplore.ieee.org/abstract/document/8373692
[27] Vanishing gradient problem (February 2022) Wikipedia. URL https://en.wikipedia.org/wiki/Vanishing_gradient_problem
[28] Lindemann, B., Maschler, B., Sahlab, N. and Weyrich, M., 2021. A survey on anomaly detection for technical systems using LSTM networks. Computers in Industry, 131, p.103498. URL https://www.sciencedirect.com/science/article/pii/S0166361521001056
[29] Christopher Olah. Understanding LSTM Networks. 2015. URL http://colah.github.io/posts/2015-08-UnderstandingLSTMs/
[30] Berenji Ardestani, S., 2020. Time Series Anomaly Detection and Uncertainty Estimation using LSTM Autoencoders. URL https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1468763&dswid=997
[31] Greff, K., Srivastava, R.K., Koutník, J., Steunebrink, B.R. and Schmidhuber, J., 2016. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), pp.2222-2232. URL https://ieeexplore.ieee.org/abstract/document/7508408
[32] Lee, T.J., Gottschlich, J., Tatbul, N., Metcalf, E. and Zdonik, S., 2018. Precision and recall for range-based anomaly detection. arXiv preprint arXiv:1801.03175. URL https://arxiv.org/abs/1801.03175
[33] F-score (March 2022) Wikipedia. URL https://en.wikipedia.org/wiki/F-score
[34] Landauer, M., Wurzenberger, M., Skopik, F., Settanni, G. and Filzmoser, P., 2018. Dynamic log file analysis: An unsupervised cluster evolution approach for anomaly detection. Computers & Security, 79, pp.94-116. URL http://www.sciencedirect.com/science/article/pii/S0167404818306333
[35] Ahmad, S., Lavin, A., Purdy, S. and Agha, Z., 2017. Unsupervised real-time anomaly detection for streaming data. Neurocomputing, 262, pp.134-147. URL https://www.sciencedirect.com/science/article/pii/S0925231217309864
[36] Tharshini, M., Ragavinodini, M. and Senthilkumar, R., 2017, December. Access log anomaly detection. In 2017 Ninth International Conference on Advanced Computing (ICoAC) (pp. 375-381). IEEE. URL https://ieeexplore.ieee.org/abstract/document/8441194
[37] He, S., Zhu, J., He, P. and Lyu, M.R., 2016, October. Experience report: System log analysis for anomaly detection. In 2016 IEEE 27th international symposium on software reliability engineering (ISSRE) (pp. 207-218). IEEE. URL https://ieeexplore.ieee.org/abstract/document/7774521
[38] Audibert, J., Michiardi, P., Guyard, F., Marti, S. and Zuluaga, M.A., 2020, August. USAD: unsupervised anomaly detection on multivariate time series. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 3395-3404). URL https://dl.acm.org/doi/abs/10.1145/3394486.3403392
[39] MACHINE LEARNING GROUP - ULB, Kaggle, Credit Card Fraud Detection. URL https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
[40] Singapore University of Technology, iTrust Centre for Research in Cyber Security, Secure Water Treatment (SWaT). URL https://itrust.sutd.edu.sg/itrust-labs_datasets/dataset_info/
[41] NetManAIOps, OmniAnomaly, GitHub. URL https://github.com/NetManAIOps/OmniAnomaly
[42] Waico Inc., SKAB, GitHub. URL https://github.com/waico/SKAB
[43] Suneetha, K. and Krishnamoorthi, R., 2009. Identifying User Behavior by Analyzing Web Server Access Log File. IJCSNS International Journal of Computer Science and Network Security. URL https://www.researchgate.net/publication/255583124_Identifying_User_Behavior_by_Analyzing_Web_Server_Access_Log_File
[44] Lakshmanan, S. How, When, and Why Should You Normalize / Standardize / Rescale Your Data?. 2019. URL https://towardsai.net/p/data-science/how-when-and-why-should-you-normalize-standardize-rescale-your-data
[45] Kumpulainen, P., Kylväjä, M. and Hätönen, K., 2009, September. Importance of scaling in unsupervised distance-based anomaly detection. In Proceedings of IMEKO XIX World Congress. Fundamental and Applied Metrology (pp. 6-11).
[46] Su, Y., Zhao, Y., Niu, C., Liu, R., Sun, W. and Pei, D., 2019, July. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining (pp. 2828-2837).
[47] Shanker, M., Hu, M.Y. and Hung, M.S., 1996. Effect of data standardization on neural network training. Omega, 24(4), pp.385-397.
[48] Scikit-learn (2017) sklearn.neighbors.LocalOutlierFactor. URL https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html
[49] Xu, Z., Kakde, D. and Chaudhuri, A., 2019. Automatic Hyperparameter Tuning Method for Local Outlier Factor, with Applications to Anomaly Detection. 2019 IEEE International Conference on Big Data (Big Data), 2019, pp. 4201-4207, doi: 10.1109/BigData47090.2019.9006151. URL https://ieeexplore.ieee.org/abstract/document/9006151?casa_token=eUlbpDSojxEAAAAA:2aGAbgcd2hivYedzQ_D5ocyX8jfNmjRDx_Oc_HR2f9ZeUAKEWKXoyWt3qq7bS-dpA0yngAoZ
[50] Scikit-learn (2017) sklearn.ensemble.IsolationForest. URL https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
[51] Sharma, L. N., Dandapat, S., Mahanta, A. (2012). Multiscale PCA based quality controlled denoising of multichannel ECG signals. International Journal of Information and Electronics Engineering, 2(2), 107-111. URL http://www.ijiee.org/papers/63-I080.pdf
[52] Brown, J. D. (2009). Statistics Corner. Questions and answers about language testing statistics: Choosing the right number of components or factors in PCA and EFA. Shiken: JALT Testing & Evaluation SIG Newsletter, 13(2), 19-23. URL https://hosted.jalt.org/test/PDF/Brown30.pdf
[53] Harris, C.R., Millman, K.J., van der Walt, S.J., et al., 2020, September. Array programming with NumPy. URL https://www.nature.com/articles/s41586-020-2649-2.pdf
[54] pandas (software) (March 2022), Wikipedia. URL https://en.wikipedia.org/w/index.php?title=Pandas_(software)&oldid=1002320935
[55] Bisong, E., 2019, September. Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners. URL https://link.springer.com/chapter/10.1007/978-1-4842-4470-8_12
[56] scikit-learn (January 2022) Wikipedia. URL https://en.wikipedia.org/wiki/Scikit-learn
[57] TensorFlow (May 2022) Wikipedia. URL https://en.wikipedia.org/wiki/TensorFlow
[58] Ketkar, N., 2017. Deep Learning with Python: A Hands-on Introduction. URL https://link.springer.com/chapter/10.1007/978-1-4842-2766-4_7
[59] Ranjan, C. Extreme Rare Event Classification using Autoencoders in Keras. 2020. URL https://processminer.com/autoencoders-in-keras/