Credit Card Fraud Detection by Nearest Neighbor Algorithms
Master's thesis in Mathematical Sciences
Ramin Maghsood
Department of Mathematical Sciences
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2023

© Ramin Maghsood, 2023.
Supervisor: Prof. Holger Rootzén, Department of Mathematical Sciences
Examiner: Prof. Serik Sagitov, Department of Mathematical Sciences
Master's Thesis 2023
Department of Mathematical Sciences
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Typeset in LaTeX
Gothenburg, Sweden 2023

Abstract
As the usage of internet banking and online purchases has increased dramatically in today's world, the risk of fraudulent activities and the number of fraud cases are increasing day by day. The most frequent type of bank fraud in recent years is credit card fraud, which leads to huge financial losses on a global level. Credit card fraud happens when an unauthorized person uses another person's credit card information to make purchases. It is an important and growing problem for banks and individuals all around the world. This thesis applies supervised and unsupervised nearest neighbor algorithms for fraud detection on a Kaggle data set consisting of 284,807 credit card transactions, 492 of which are frauds, with 30 covariates per transaction.
The supervised methods are shown to be quite efficient, but they require that the user has access to labelled training data where one knows which transactions are frauds. Unsupervised detection is harder: e.g., to find 80% of the frauds, the algorithm classifies more than 50 times as many valid transactions as fraud cases. The unsupervised nearest neighbor distance method is compared to methods that use the distance to the center of the data for fraud detection, and to detection algorithms that combine the two. The L2 distance, the L2 distance to zero, and the combination of both distances are analyzed for the unsupervised method. The performance of the methods is evaluated by Precision-Recall (PR) curves. The results show that, based on both area under the curve and precision at 80% recall, the L2 distance to zero performs slightly better than the L2 distance.

Keywords: Fraud, Bank Fraud, Credit Card Fraud, Fraud Detection, Nearest Neighbor Algorithms

Acknowledgements
I would like to express my deep appreciation and sincere gratitude to my supervisor Prof. Holger Rootzén for his patience, encouragement and support. I would never have completed this thesis without his guidance and help. His great ideas and insightful comments were essential for finishing my master's thesis. I would like to thank my examiner Prof. Serik Sagitov for his support in this thesis. I would also like to thank Prof. Hjalmar Rosengren and Prof. Stefan Lemurell for their helpful support during my studies at the Department of Mathematical Sciences. Last but not least, I would like to thank my lovely parents and my great sister for their endless support and inspiration during my studies.
Ramin Maghsood, Gothenburg, March 2023

Contents
List of Figures
List of Tables
1 Introduction
2 Background
2.1 Anomalies and Anomaly Detection
2.2 Financial Fraud
2.3 Credit Card Fraud
3 Theory
3.1 Supervised Machine Learning
3.2 Unsupervised Machine Learning
3.3 Nearest Neighbor Anomaly Detection
3.4 The Supervised Nearest Neighbor Method
3.5 The Unsupervised Nearest Neighbor Method
3.6 Evaluation Metrics
4 Methods
4.1 Data set
4.2 Exploratory data analysis
4.3 Training and Test Set
4.4 Standardizing Variables
4.5 Computation of Nearest Neighbor Distances
4.6 Supervised and Unsupervised Anomaly Detection using Nearest Neighbor Distance
4.7 Combining different anomaly detection methods
5 Results
5.1 Supervised Nearest Neighbor Distance
5.1.1 L2 norm
5.1.2 L30 norm
5.1.3 Summary
5.2 Unsupervised Nearest Neighbor Distance
5.2.1 L2 distance
5.2.2 L2 distance to zero
5.2.3 Combination of L2 distance and L2 distance to zero
5.2.4 Summary
6 Conclusion
6.1 General Overview
6.2 Discussion
6.3 Future Research
Bibliography
A Appendix
A.1 Inverse L2 distance
A.2 Combination of L2 distance and L2 distance to zero without any transformation

List of Figures
4.1 Distribution of time feature
4.2 Time histogram of fraud and non-fraud transactions
4.3 Scatter plot of amount over time
4.4 Distribution of amount feature (left) and its logarithm (right) for non-fraud transactions
4.5 Distribution of amount feature (left) and its logarithm (right) for fraud transactions
5.1 Histogram of distances based on L2 norm. (a) For 20% train and 10% test sets, (b) for 80% train and 20% test sets, (c) for 20% train and 80% test sets.
5.2 Histogram of distances based on L30 norm for 20% train and 10% test sets
5.3 Scatter plot of distances based on L2 norm. (a) For 20% train and 10% test sets, (b) for 80% train and 20% test sets, (c) for 20% train and 80% test sets.
5.4 Precision-Recall curve based on L2 norm. (a) For 20% train and 10% test sets, (b) for 80% train and 20% test sets, (c) for 20% train and 80% test sets.
5.5 Scatter plot of distances based on L2 distance to zero. (a) For 20% train and 10% test sets, (b) for 80% train and 20% test sets, (c) for 20% train and 80% test sets.
5.6 Precision-Recall curve based on L2 distance to zero. (a) For 20% train and 10% test sets, (b) for 80% train and 20% test sets, (c) for 20% train and 80% test sets.
5.7 Precision-Recall curve based on the combination of standardized L2 distance and L2 distance to zero, with the first 20% as training and the next 10% as test. (a) PR curve for 'OR' condition, (b) PR curve for 'AND' condition.
5.8 Precision-Recall curve based on the combination of standardized L2 distance and L2 distance to zero, with the first 80% as training and the next 20% as test. (a) PR curve for 'OR' condition, (b) PR curve for 'AND' condition.
5.9 Precision-Recall curve based on the combination of standardized L2 distance and L2 distance to zero, with the first 20% as training and the next 80% as test. (a) PR curve for 'OR' condition, (b) PR curve for 'AND' condition.
5.10 Precision-Recall curve based on the combination of uniform transformations of L2 distance and L2 distance to zero, with the first 20% as training and the next 10% as test. (a) PR curve for 'OR' condition, (b) PR curve for 'AND' condition.
5.11 Precision-Recall curve based on the combination of uniform transformations of L2 distance and L2 distance to zero, with the first 80% as training and the next 20% as test. (a) PR curve for 'OR' condition, (b) PR curve for 'AND' condition.
5.12 Precision-Recall curve based on the combination of uniform transformations of L2 distance and L2 distance to zero, with the first 20% as training and the next 80% as test. (a) PR curve for 'OR' condition, (b) PR curve for 'AND' condition.
A.1 Scatter plot of distances based on inverse L2 norm. (a) For 20% train and 10% test sets, (b) for 80% train and 20% test sets, (c) for 20% train and 80% test sets.
A.2 Precision-Recall curve based on inverse L2 distance. (a) For 20% train and 10% test sets, (b) for 80% train and 20% test sets, (c) for 20% train and 80% test sets.
A.3 Precision-Recall curve based on the combination of L2 distance and L2 distance to zero, with the first 20% as training and the next 10% as test. (a) PR curve for 'OR' condition, (b) PR curve for 'AND' condition.
A.4 Precision-Recall curve based on the combination of L2 distance and L2 distance to zero, with the first 80% as training and the next 20% as test. (a) PR curve for 'OR' condition, (b) PR curve for 'AND' condition.
A.5 Precision-Recall curve based on the combination of L2 distance and L2 distance to zero, with the first 20% as training and the next 80% as test. (a) PR curve for 'OR' condition, (b) PR curve for 'AND' condition.

List of Tables
3.1 Confusion matrix for non-fraud and fraud
4.1 Amount details of fraud and non-fraud transactions
4.2 Running time for computing L2 nearest neighbor distances for 3 different algorithms
4.3 Running time for computing L30 nearest neighbor distances for 3 different algorithms
4.4 Hardware and software specifications for the PC
5.1 Confusion matrix of the nearest neighbor algorithm with L2 norm.
(a) For 20% train and 10% test sets, (b) for 80% train and 20% test sets, (c) for 20% train and 80% test sets.
5.2 Comparison of true positives, false positives and false negatives between the first four neighbors based on L2 norm. (a) For 20% train and 10% test sets, (b) for 80% train and 20% test sets, (c) for 20% train and 80% test sets.
5.3 Confusion matrix of the nearest neighbor algorithm with L30 norm for 20% train and 10% test sets
5.4 Comparison of true positives, false positives and false negatives between the first four neighbors based on L30 norm for 20% train and 10% test data

1 Introduction
In today's world, fraud is a global problem that leads to large financial losses. Developments in technology and digitization in recent years have led to a massive increase in the number of fraudulent activities and in the misuse of the power of new technologies [7]. On the other side, organizations and large companies are developing novel methods and effective solutions to detect fraud [4]. The recent developments in technology present new benefits and helpful methods to banks, financial institutions and credit card issuers to efficiently decrease the risk of huge losses and to monitor and detect fraud cases. Millions of credit card transactions are processed every second, and humans are unable to analyse and process such amounts of data to investigate the behavioral patterns of fraudsters [13]. This is where credit card fraud detection using machine learning algorithms plays an essential role. Credit card fraud is divided into two types, online and offline fraud [18]. We provide more details about credit card fraud in section 2.3.
Credit card fraud detection is a challenging problem that banks and credit card issuers are struggling with, and they are making huge efforts to implement fraud detection systems. In today's banking system, fraud detection is often done using rule-based methods. However, the progress of machine learning techniques gives banks and financial institutions the possibility to detect unusual situations faster in big financial data sets. The data set used in this thesis consists of credit card transactions that occurred during two days in September 2013, and was downloaded from Kaggle [12]. Further information about the data set is provided in section 4.1. This data set has been widely studied and discussed by a number of authors. In [16] the author examined three algorithms, Isolation Forest (IF), Local Outlier Factor (LOF) and One-Class SVM, to identify the best performing unsupervised algorithm. The reason why the author used an unsupervised approach for credit card fraud detection was the unavailability of labelled data in the real world. In [17] the performance of the supervised algorithms K-Nearest Neighbor (KNN), Decision Tree (DT), Logistic Regression (LR) and Random Forest (RF) was examined for credit card fraud detection, and the author concluded that RF had the best performance for detecting fraud in credit card transactions. In [18] the authors compared the performance of three classification methods, Naive Bayes, K-Nearest Neighbor and Logistic Regression, for credit card fraud detection, and their results showed that the LR method performed better than the two others. More investigations of nearest neighbor algorithms for anomaly detection are [19], [20], [21]. In this thesis, we analyze the credit card transactions to do fraud detection.
We investigate nearest neighbor distance methods for credit card fraud detection and evaluate the performance of supervised and unsupervised methods in finding fraud cases in credit card transactions. The distance metrics used in the nearest neighbor methods are the L2 distance, the L30 distance, the L2 distance to zero, and the inverse L2 distance. We conclude the thesis with the pros and cons of supervised and unsupervised nearest neighbor distance methods.

2 Background
In this chapter, we give a general overview of anomalies and anomaly detection and of the different types of financial fraud. We focus on credit card fraud and investigate the different approaches used for credit card fraud detection.

2.1 Anomalies and Anomaly Detection
An anomaly is a deviation from a normal pattern, and recognizing an anomaly hence depends on how one defines normal behavior [9]. Once normality is defined, it is much easier to define what an anomaly is. Anomalies are commonly divided into three types: point anomalies, contextual anomalies and collective anomalies [10]. A point anomaly (the simplest type of anomaly) is a point which deviates from the majority of data points, a contextual anomaly is a point that is anomalous with regard to its neighbors, and a collective anomaly is a group of data points which behaves differently from other groups [3]. Anomaly detection is the process of identifying unusual data in a data set. It is an important technique that is applied in various domains [5]. In the banking and finance area, unusual activities such as extraordinary transactions are referred to as anomalies. Anomaly detection is highly applicable and useful in banks and financial services companies, and is considered a data analysis task in financial domains [6]. Anomaly detection in data sets with high dimensions is becoming one of the main problems in finance.
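As a toy illustration of the simplest of the three anomaly types above, a point anomaly can be flagged when a value lies far from the bulk of the data. The sketch below uses a hypothetical rule (more than two standard deviations from the mean) on made-up transaction amounts; it is not the detection method used later in this thesis.

```python
# Toy point-anomaly check: flag values far from the mean.
# Hypothetical data and threshold, for illustration only.
from statistics import mean, stdev

def point_anomalies(values, k=2.0):
    # A value is flagged when it lies more than k standard
    # deviations away from the sample mean.
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > k * s]

amounts = [12.0, 9.5, 11.2, 10.8, 9.9, 10.4, 11.0, 10.1, 950.0]
print(point_anomalies(amounts))  # [950.0]
```

Note that the outlier itself inflates both the mean and the standard deviation, which is one reason simple rules like this break down and distance-based methods are of interest.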
The issue of high dimensional big data affects the performance and the accuracy of existing anomaly detection techniques [14].

2.2 Financial Fraud
Financial fraud happens when another person takes your money or other assets by means of a fraudulent activity [1]. In recent years, advancements in technology and the digitization of the real world have led to an increase in the number of financial fraud cases, and fraudsters misuse advances in technology to profit from fraudulent activities. The common categories of financial fraud are bank fraud, corporate fraud and insurance fraud [2]. Bank fraud occurs when unauthorized transactions are made in your bank account. Corporate fraud mostly happens in the form of financial statement fraud or securities and commodities fraud, and insurance fraud can occur for any kind of insurance, such as car and health care insurance [2]. The most common types of bank fraud are credit card fraud, money laundering, and mortgage fraud [10]. Many of the investigations on financial fraud detection in recent years were conducted in the area of credit card fraud. More details about credit card fraud are given in the next section.

2.3 Credit Card Fraud
Credit card fraud happens when an unauthorized person uses another person's credit card information to make purchases. It is a big problem that affects many people all around the world [18]. Credit card fraud is mainly committed by a fraudster obtaining access to the cardholder's physical credit card or card information [17]. Nowadays, payment card and credit card issuers are doing everything they can to protect their customers from payment and credit card fraud. Credit card fraud can happen in the form of online purchases or through misuse of someone else's card for shopping in stores [16]. As a credit card holder, it is normal to be concerned about the fraudulent activities that may happen.
On the other side, credit card issuers are also concerned about this matter, because in many cases they have to pay back the loss to their customers. Credit card issuers use rule-based methods and other approaches for fraud detection. One way is to employ highly advanced fraud detection software. The software checks the transactions and, based on prior information, determines whether a transaction is fraudulent or reliable. Another way used by credit card issuers is to look for the patterns of the card holders: if the card holder always uses the card in a similar manner but suddenly a transaction falls outside the card holder's normal pattern, the credit card company is triggered to investigate whether that transaction is valid or not [11]. Many people use their credit card to pay for lunch at a restaurant or to make a purchase at a store while they are out of town, and these payments fall outside their normal pattern of purchases. In some cases, they might be asked by their bank to validate the credit card transactions and to verify whether they made those transactions at these particular locations. It is normal that the banks are obliged to follow their rule-based regulations in such situations. In many cases these alerts are false positives: it really was the card holder who used the credit card at the restaurant or store they had never visited before. In these situations the customers are asked by the credit card issuer to confirm whether the transactions are accurate. If some of those transactions turn out to be fraudulent, then the card holder must let the credit card issuer know about it, ask them to shut the account down as early as possible, and have a new credit card issued [11].
3 Theory
In this chapter, we describe machine learning approaches used in credit card fraud detection, and in particular those used in this thesis. As we use a financial data set, it is highly important to find out which method is effective and appropriate for financial fraud detection, and whether the chosen method works properly with the chosen data set. We concentrate on supervised and unsupervised nearest neighbor distance algorithms and on the theory behind them. In supervised anomaly detection, one knows which points in the training set are anomalies, whereas in unsupervised anomaly detection one does not. However, to evaluate unsupervised anomaly detection methods, one still needs labeled data, i.e. data for which one knows whether each point is an anomaly or not.

3.1 Supervised Machine Learning
Supervised learning is the process of classifying a new data point when labelled data is available. In other words, it uses labeled data to train models for the classification of a new data set. For example, when classification approaches are used for anomaly detection, labeled data means that we know which instances are anomalies. Some examples of classification algorithms used in fraud detection are K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Decision Tree (DT) and Random Forest (RF). The KNN algorithm is one of the simplest machine learning classifiers. In this algorithm, a data point is classified by its nearest neighbors: the parameter k is the number of nearest neighbors, and a new point is classified based on the label of the majority of its k neighbors [29]. In paper [20] KNN was presented as a precise method for decreasing the number of false alerts and detecting fraudulent transactions, and was introduced as an efficient algorithm for credit card fraud detection. The Support Vector Machine algorithm was first presented by Vapnik [23].
The method is a discriminative algorithm that finds an optimal hyperplane (a decision boundary in the binary case) separating the data space for a given labeled data set [31]. The use of SVM as a credit card fraud detection technique in high dimensional data sets is analysed in paper [39], where the authors concluded that this algorithm gives better results for small data sets. The Decision Tree algorithm is another supervised learning method which is popular and highly used for prediction and classification problems. A decision tree is a classification tree whose main components are the root node, the decision nodes and the leaf nodes. The root node is the starting point, a decision node is a point where the tree is split, and a leaf node represents a final decision [24]. An investigation of credit card fraud detection by decision tree methods, and a comparison of their performance with SVM, is presented in [25]. The author concluded that decision tree algorithms perform better than SVM, as the number of frauds detected by decision trees was far higher. The Random Forest algorithm was introduced by L. Breiman [26]. It is a tree-based algorithm used for classification and regression problems. The algorithm is constructed from multiple decision trees and reaches a decision by majority vote [17]. A well-written investigation of random forests is presented in [27], where the author investigated the weak performance of this algorithm in detecting anomalies.

3.2 Unsupervised Machine Learning
Unsupervised learning is similar to classification but does not use labelled data. Unsupervised anomaly detection is one of the approaches used to investigate the concept of clustering for financial fraud detection, and different models of clustering techniques are examined in [6].
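Before turning to the individual unsupervised algorithms, the supervised KNN classifier of Section 3.1 can be made concrete with a minimal pure-Python sketch. The data and the value of k are hypothetical; this is an illustration, not the thesis implementation.

```python
# Minimal k-nearest-neighbor classifier (Section 3.1), pure Python.
# Toy, hypothetical data; illustration only.
from math import dist  # Euclidean (L2) distance, Python 3.8+

def knn_predict(train_x, train_y, point, k=3):
    # Sort training points by distance to the query point,
    # then take a majority vote among the k closest labels.
    neighbors = sorted(zip(train_x, train_y), key=lambda p: dist(p[0], point))
    votes = [label for _, label in neighbors[:k]]
    return max(set(votes), key=votes.count)

train_x = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (5.0, 5.1), (5.2, 4.9)]
train_y = [0, 0, 0, 1, 1]          # 1 = fraud, 0 = non-fraud
print(knn_predict(train_x, train_y, (5.1, 5.0), k=3))  # 1
```

With k = 1 this reduces to the closest-neighbor rule used in the supervised method of Section 3.4.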
The unsupervised methods do not need prior knowledge of fraudulent and non-fraudulent transactions in a historical database, but instead detect changes in behavior and patterns caused by unusual transactions. Some examples of unsupervised learning algorithms used for anomaly detection are K-means, One-Class SVM, Isolation Forest (IF), and DBSCAN. The k-means algorithm is one of the most commonly used clustering algorithms. It classifies the data points into a pre-specified number of classes, say k. The algorithm starts by choosing k points at random as the centroids of the classes and assigning each point to the closest cluster. Then, for each cluster, the mean of all points assigned to that cluster is calculated and taken as the new centroid, and the previous steps are repeated until no points change cluster membership [29]. In paper [8] the clustering techniques used for financial fraud detection were reviewed. One-Class SVM is an unsupervised machine learning algorithm used for anomaly detection. The main idea behind this algorithm is to classify the data set into binary classes [16]. The one-class SVM is highly applicable and useful for imbalanced data sets, where there is a large number of normal points and a small number of anomalies [32]. In paper [38] the authors analysed the one-class SVM as one of the best methods for binary class data sets. Isolation Forest is an unsupervised algorithm invented by Fei Tony Liu in 2007 [13]. The methodology of the IF algorithm is similar to Random Forest, and the algorithm is built on the concept of decision trees [16]. A deep analysis of IF and a comparison with other unsupervised learning methods is presented in [33]. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm.
The algorithm categorizes all data points as core points, border points and noise points: a core point has at least a specified minimum number of points within a given radius, a border point is reachable from a core point, and a noise point is neither a core nor a border point [28]. In the next section, we study the nearest neighbor distance methods and discuss why they are a good approach for credit card fraud detection.

3.3 Nearest Neighbor Anomaly Detection
The nearest neighbor methods are models that find a number of training sample points closest to a test point and use them to predict the label of the test point. The number of training points in the model can be a constant k, or can vary based on the local density of points; the most common distance metric used is the standard Euclidean distance [36]. The nearest neighbor distance method can be divided into supervised and unsupervised versions.

3.4 The Supervised Nearest Neighbor Method
In the supervised method, we looked at the label of the closest neighbor of each point in the test data set and used it as the predicted label for the test point. According to the algorithm, a data point is classified based on how its neighbors are classified. In addition to the closest neighbor, we also consider the first two, the first three and the first four closest neighbors to predict the labels, by assigning a fraud label to a test data point if at least one of its neighbors has a fraud label.

3.5 The Unsupervised Nearest Neighbor Method
In the unsupervised setting, we pretend that the true labels are not available and look for ways to classify the test data points as fraud and non-fraud cases. For this, we set a threshold as a cut-off point and classified all test points above that threshold as fraud cases. The assumption is that the fraud cases should get the highest average distances.
Assume that $d_i$ is the distance from test point $i$ to its closest neighbor in the training set. Then for a given cut-off value $c$, we predict the class label as

\[
\hat{Y}_i = \begin{cases} 1 & \text{if } d_i \geq c, \\ 0 & \text{if } d_i < c. \end{cases} \tag{3.1}
\]

The distance $d_i$ can be based on the L2 norm, the L30 norm, or the L2 distance to zero. It can also be the average distance to the $k$ closest neighbors of point $i$.

3.6 Evaluation Metrics
There exist different methods to evaluate the performance of an algorithm used for fraud and anomaly detection. In this thesis, we mainly focus on the standard evaluation metrics Precision, Recall and Accuracy. These metrics are defined in terms of the true positive, true negative, false positive and false negative counts as

\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},
\]

where
- True Positive (TP): the predicted label is fraud and the true label is fraud.
- True Negative (TN): the predicted label is non-fraud and the true label is non-fraud.
- False Positive (FP): the predicted label is fraud while the true label is non-fraud.
- False Negative (FN): the predicted label is non-fraud while the true label is fraud.

High precision corresponds to a low false positive rate, and high recall corresponds to a low false negative rate. The precision measures the ratio between the true positives and all predicted positives, and the recall measures the fraction of the actual positives correctly identified by the algorithm. The accuracy is the ratio between the total number of correct predictions and all predictions. The precision and recall values can be presented in a precision-recall curve, which is a visualization of the precision values against the recall values computed for different thresholds. It is a useful measure for a data set with imbalanced classes [37]. Another way of showing the performance of an algorithm is given by the confusion matrix, which is a cross-table comparing the predicted labels with the true ones.
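The cut-off rule (3.1) and the metrics above can be sketched in a few lines of pure Python. The distances and labels below are hypothetical, not the thesis data.

```python
# Sketch of the cut-off rule (3.1) and the evaluation metrics above.
# Hypothetical nearest-neighbor distances and true labels.
def classify(distances, c):
    # Predict fraud (1) when the nearest neighbor distance is at least c.
    return [1 if d >= c else 0 for d in distances]

def metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy

d = [0.2, 0.4, 3.1, 0.3, 0.8, 1.2]  # distances to closest training point
y = [0,   0,   1,   0,   1,   0]    # true labels (1 = fraud)
print(metrics(y, classify(d, c=1.0)))  # precision 0.5, recall 0.5, accuracy ~0.67
```

Sweeping the cut-off c over a grid of values and recording the resulting (recall, precision) pairs traces out the precision-recall curve used in Chapter 5.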
Table 3.1 shows the confusion matrix, where Non-Fraud is considered the negative class and Fraud the positive class [13], [27].

Table 3.1: Confusion matrix for non-fraud and fraud

                                Predicted
  Actual       Non-Fraud                Fraud
  Non-Fraud    True Negative (TN)       False Positive (FP)
  Fraud        False Negative (FN)      True Positive (TP)

4 Methods
In this chapter, we present an analysis of the Kaggle data set. In our analysis, we look at different proportions of training and test sets. First we consider the first 20% of the data set as training and the next 10% as test set. Then we take the first 80% of the data as training and the remaining 20% as test set. Finally, the first 20% of the data is chosen as training and the remaining 80% as test set.

4.1 Data set
The data set used in this thesis consists of 284,807 credit card transactions, of which 492 are fraudulent. It includes twenty-eight features V1, ..., V28 obtained from the original features using principal component analysis (PCA). The non-transformed features in this data set are "Time", "Amount", and "Class". The transactions are labeled as fraud (1) and non-fraud (0) cases in the feature Class, which is the target variable. There are no missing values in the data set [12].

4.2 Exploratory data analysis
As V1, ..., V28 are anonymous, we focus on the two main features Time and Amount and have a deeper look at them in this section. We present the time histogram of fraud and non-fraud transactions and visualize the scatter plot of amount over time.

Time feature
The Time feature is the number of seconds between the first transaction and the current transaction in the data set. Figure 4.1 shows the distribution of the time feature over two days. It can be seen that most of the transactions are made during the day.
Figure 4.1: Distribution of the time feature

To better understand whether the fraudulent transactions occurred at unusual times, the time histograms of fraud and non-fraud transactions are displayed separately in Figure 4.2.

Figure 4.2: Time histogram of fraud and non-fraud transactions

We can see that the time feature does not distinguish the fraud and non-fraud cases very well. However, one can say that the number of fraud transactions is higher during night hours.

Amount feature

The Amount feature is the monetary amount of each transaction in the data set. The scatter plot of amount over time is shown in Figure 4.3. It can be seen that frauds occur only among transactions with amounts less than 2500.

Figure 4.3: Scatter plot of amount over time

The distributions of the amount feature and its logarithm are shown in Figure 4.4 for non-fraud transactions and in Figure 4.5 for fraud transactions.

Figure 4.4: Distribution of amount feature (left) and its logarithm (right) for non-fraud transactions

Figure 4.5: Distribution of amount feature (left) and its logarithm (right) for fraud transactions

Table 4.1 shows the difference in the amounts of money in the two transaction classes. Looking at the table, one can see that 75% of the fraud cases have an amount of at most around 105. This indicates that fraudsters frequently make many low-amount transactions, which makes them hard for the credit card issuer to find.

Table 4.1: Amount details of fraud and non-fraud transactions

Amount    Fraud      Non-Fraud
Count     492        284315
Mean      122.21     88.29
std       256.68     250.11
min       0.00       0.00
25%       1.00       5.65
50%       9.25       22.00
75%       105.89     77.05
max       2125.87    25691.16

4.3 Training and Test Set

For further analysis, we first consider a small portion of the data set with 20% for training and 10% for test, then analyse the results for 80% as training set and 20% as test set, and finally the case where the training set is 20% and the test set is 80%.
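A minimal sketch of how the three chronological splits (20/10, 80/20 and 20/80) can be formed; the function name is ours, and a small integer array stands in for the 284,807 time-ordered transactions:

```python
import numpy as np

def chronological_split(X, train_frac, test_frac):
    """Split rows chronologically: the first train_frac of the rows become the
    training set and the following test_frac become the test set. No shuffling,
    since the data are ordered by transaction time."""
    n = len(X)
    n_train = int(n * train_frac)
    n_test = int(n * test_frac)
    return X[:n_train], X[n_train:n_train + n_test]

# Illustration on an array standing in for the ordered transactions.
data = np.arange(100)
train, test = chronological_split(data, 0.2, 0.1)   # the 20/10 split
# train covers rows 0..19, test covers rows 20..29
```

Keeping the time order matters here: shuffling before splitting would leak future transactions into the training set.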
The main reason for dividing the data set into different proportions of training and test sets is that they correspond to different practical situations. It also depends on how quickly one can get access to the data. For some investigators the time to detection matters, and it may be better for them to take an already existing 20% of labeled data as training set and make predictions for the next available 10% of the data. In other cases it may be better to wait until 80% of the labeled data is ready and then do the investigation for the rest of the data.

4.4 Standardizing Variables

In order to analyze the data set, we need appropriate distance metrics. To compute distances meaningfully, all variables should have similar ranges. Hence, all the variables are standardized to have mean 0 and standard deviation 1: we subtract the mean from each variable and then divide by the standard deviation.

Let the observations be $x_{ij}$ for $i = 1, 2, ..., N$, where $N = 284806$ is the number of credit card transactions, and $j = 1, 2, ..., n$, where $n = 30$ is the number of features (columns). For each column, the mean is calculated as

$$\bar{x}_{\cdot j} = \frac{1}{N}\sum_{i=1}^{N} x_{ij},$$

and the standard deviation as

$$\sigma_{\cdot j} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} \left(x_{ij} - \bar{x}_{\cdot j}\right)^2}.$$

The standardized column variables are then obtained as [13]

$$x^{*}_{ij} = \frac{x_{ij} - \bar{x}_{\cdot j}}{\sigma_{\cdot j}}.$$

Let $x$ be an $n$-dimensional vector, $x = (x_1, x_2, ..., x_n)$. The $L_p$ norm, written $||x||_p$ for $p > 0$, is defined as

$$||x||_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}.$$

The exponent $p$ is commonly chosen to be 2 or infinity. In our analysis, we measured the distances with $p = 2$ as the standard metric and used $p = 30$ as an approximation to $p = \infty$, since it is easier to handle computationally.
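The standardization and $L_p$ norm above can be sketched in numpy as follows (function names are ours; `ddof=1` matches the $N-1$ denominator in the standard deviation formula):

```python
import numpy as np

def standardize(X):
    """Column-wise standardization: subtract the mean and divide by the
    sample standard deviation (ddof=1, i.e. the N-1 denominator)."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def lp_norm(x, p):
    """L_p norm of a vector; p = 30 serves in the thesis as a cheap
    approximation of the L-infinity (maximum) norm."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
Z = standardize(X)                       # each column now has mean 0, sd 1
d = lp_norm(np.array([3.0, 4.0]), 2)     # the familiar Euclidean length, 5.0
```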
4.5 Computation of Nearest Neighbor Distances

In this section, we compare three algorithms for nearest neighbor search, i.e. for finding the closest training point to a given test data point: Brute Force, KD Tree and Ball Tree.

The Brute Force algorithm is the exact nearest neighbor method: it considers all data points, which is feasible for small data sets but computationally inefficient for large, high-dimensional data sets [35]. One calculates the distance between a test data point and every point in the training set, and the training point with the smallest distance to the test point is its nearest neighbor [34]. Brute force would be an ideal method, but because of its computational complexity and large memory requirements it is worth comparing its results with other nearest neighbor methods.

KD Tree is a commonly used nearest neighbor algorithm in which the data points are divided into two sets at each node; it is a binary tree whose branches end in leaf nodes [35]. The first step is to pick a feature at random (depending on the number of features in the data set), find its median, and split the data set into two equal halves; one then picks the next feature, repeats the previous steps, and continues until all the data points have been partitioned [34].

The Ball Tree algorithm is a form of metric tree in which the data points are divided into two clusters, each enclosed by a hypersphere [35].
The first step in this method is to take a random point and find the data point farthest from it; then one finds the data point farthest from this first farthest point. The data points are projected onto the line between the two farthest points, and the median of the projections splits the space into two halves. In each half one finds the centroid and the data point farthest from it, draws a sphere around the centroid with that farthest distance as radius, and then repeats the process recursively within the resulting spheres [34].

We tried the three methods Ball Tree, KD Tree and Brute Force with the two metrics L2 and L30 for the distance calculations.

Table 4.2: Running time for computing L2 nearest neighbor distances for 3 different algorithms

Time (seconds)   10%      20%       50%
Ball Tree        64.68    307.87    2014.46
KD Tree          37.55    125.33    477.23
Brute Force      0.87     2.78      21.56

Table 4.2 shows the interesting fact that Brute Force was substantially faster than the other methods.

Table 4.3: Running time for computing L30 nearest neighbor distances for 3 different algorithms

Time (seconds)   10%      20%       50%
Ball Tree        706.14   2803.28   16958.16
KD Tree          175.12   444.85    1038.55
Brute Force      103.16   395.07    2582.08

By Table 4.3, the Brute Force algorithm was fastest for the 10% and 20% splits, but KD Tree was faster for the 50% split. However, since the Brute Force algorithm does not use any approximations, and since it was usable also in the case where KD Tree was faster, we used the Brute Force algorithm throughout to find the nearest neighbor distances.

We used the Python programming language throughout for the data analysis. The main packages used are "numpy", "pandas", "scipy" and "sklearn". The main class used for the nearest neighbor search was "NearestNeighbors" [36]. Table 4.4 shows the computer specifications and the programming language used for the data analysis.
Table 4.4: Hardware and software specifications for the PC

Operating System   RAM    CPU                     Program
Windows 10         16GB   2.3GHz Intel Core i7    Python 3.10.7

4.6 Supervised and Unsupervised Anomaly Detection using Nearest Neighbor Distance

The nearest neighbor distance algorithm is one approach to anomaly detection. In this thesis, an anomaly refers to a fraudulent transaction. We used both the supervised and unsupervised nearest neighbor distance methods for fraud detection.

In the supervised nearest neighbor distance algorithm, one finds the closest point in the training set to each test data point and assigns the label of this point to the test data point. We then compared the assigned labels with the true labels to see how well the method worked. In addition to the first closest neighbor, we also considered using the first two, the first three and the first four closest neighbors to predict the test labels. For this, we assigned label 1 (fraud) to a test data point if at least one of its neighbors had label 1. We then evaluated the false positive and false negative errors for each of these methods.

In the unsupervised nearest neighbor algorithm, one pretends that the class labels are not available and instead uses the distances to the nearest neighbors to classify the data. In the unsupervised method, we take a point from the test set and find its nearest neighbor in the training set. One computes the L2 distance between the test point and its nearest neighbor in the training set to predict whether the test point is anomalous or not: a very large (or very small) distance can indicate that the point is an anomaly. As mentioned in Section 3.5, we set a threshold for the distances and predict the label as fraud if the distance of a test data point is larger than the threshold. The results from the supervised and unsupervised methods are presented in Sections 5.1 and 5.2.
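The supervised rule above can be sketched as follows. This is a self-contained numpy illustration with toy one-dimensional data; the function and variable names are ours, and the thesis itself uses sklearn's NearestNeighbors for the neighbor search:

```python
import numpy as np

def knn_fraud_labels(train_X, train_y, test_X, k=1):
    """Supervised rule of Section 4.6: predict fraud (1) for a test point
    if at least one of its k nearest training neighbors has label 1."""
    # Brute-force Euclidean distances, shape (n_test, n_train).
    dist = np.sqrt(((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2))
    # Indices of the k closest training points for every test point.
    nn = np.argsort(dist, axis=1)[:, :k]
    return (train_y[nn] == 1).any(axis=1).astype(int)

train_X = np.array([[0.0], [1.0], [10.0]])
train_y = np.array([0, 0, 1])                        # one fraud at x = 10
pred = knn_fraud_labels(train_X, train_y, np.array([[0.5], [9.0]]), k=1)
# → array([0, 1]): only the point near 10 inherits the fraud label
```

Note that the "at least one fraud neighbor" rule makes predictions more aggressive as k grows, which is exactly the TP-up/FP-up pattern reported in Table 5.2.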
4.7 Combining different anomaly detection methods

Another possibility we have considered in the unsupervised analysis is to combine the L2 distance and the L2 distance to zero to predict fraud and non-fraud. To do this, for each point $i$ from the test set we first computed the L2 distance $d_{1i}$ to the closest neighbor in the training set. Then the L2 distance to zero, $d_{2i}$, is computed for the same point $i$. For each distance, we set a range of cut-off points between the 10% and 99.9% quantiles, $c_1 = (a_1, b_1)$ for the L2 distance and $c_2 = (a_2, b_2)$ for the L2 distance to zero. Finally, we combined the cut-off points into a common range $c = (\mathrm{mean}(a_1, a_2), \mathrm{mean}(b_1, b_2))$, and predicted the class label as follows:

- If at least one of the distances is greater than the cut-off point:

$$\hat{Y}_i = \begin{cases} 1 & \text{if } d_{1i} \geq c \text{ or } d_{2i} \geq c \\ 0 & \text{otherwise} \end{cases} \qquad (4.1)$$

- If both distances are greater than the cut-off point:

$$\hat{Y}_i = \begin{cases} 1 & \text{if } d_{1i} \geq c \text{ and } d_{2i} \geq c \\ 0 & \text{otherwise} \end{cases} \qquad (4.2)$$

The two distances $d_{1i}$ and $d_{2i}$ are not directly comparable because they have different distributions. With the 'OR' condition, the larger distance dominates; with the 'AND' condition, the smaller distance dominates. It is therefore advisable to transform the distances to a common scale before combining them. In our analysis, we used standardization and a transformation to uniformity: we replaced $d_{1i}$ and $d_{2i}$ in Equations 4.1 and 4.2 once with their standardized values and once with their uniform-transformed values, and then predicted the fraud cases by combining the two distances. Standardization is subtraction of the mean and division by the standard deviation. The transformation to a uniform distribution can be defined as follows.
If $x$ is a continuous random variable with distribution function $F(x)$, then $F(x)$ can be estimated from a sample $x_1, x_2, ..., x_n$ as

$$\hat{F}(x) = \frac{1}{n}\sum_{i=1}^{n} 1_{\{x_i \leq x\}},$$

where

$$1_{\{x_i \leq x\}} = \begin{cases} 1 & \text{if } x_i \leq x \\ 0 & \text{if } x_i > x \end{cases}$$

Thus $Y_1 = \hat{F}(x_1), ..., Y_n = \hat{F}(x_n)$ have approximately a uniform distribution on $[0,1]$.

5 Results

In this chapter, we describe the results obtained with the nearest neighbor distance algorithms, for both the supervised and the unsupervised learning approaches. The supervised section analyses the L2 norm and the L30 norm, while the unsupervised section analyses the L2 distance and the L2 distance to zero. In this thesis, the L2 distance means the distance to the nearest neighbor, and the L2 norm refers to the norm used to compute this distance. The analysis is done on the three data splits of 20/10, 80/20 and 20/80 for the training and test sets.

5.1 Supervised Nearest Neighbor Distance

In this section, the results of the supervised nearest neighbor distances based on the L2 norm and the L30 norm are presented. This includes the confusion matrices of the nearest neighbor algorithm and a comparison table of true positives, false positives, and false negatives.

5.1.1 L2 norm

The nearest neighbor distance algorithm was run with the L2 norm for the three data splits of training and test sets. For each point in the test set, we found the closest point in the training set, i.e. the point at minimum distance. The histograms of the distances for the three test sets are shown in Figure 5.1.

Figure 5.1: Histogram of distances based on L2 norm. (a) For 20% train and 10% test sets, (b) For 80% train and 20% test data sets, (c) For 20% train and 80% test data sets.

To see how many of the fraud cases got a non-fraud nearest neighbor, we compared the label of the closest neighbor of each point with its true label. The confusion matrices for the three splits, using the first closest neighbor and the L2 norm, are shown in Table 5.1.
(a)
                  Neighbor Class
Class          Non-Fraud   Fraud
Non-Fraud      28422       12
Fraud          13          34

(b)
                  Neighbor Class
Class          Non-Fraud   Fraud
Non-Fraud      56880       7
Fraud          21          54

(c)
                  Neighbor Class
Class          Non-Fraud   Fraud
Non-Fraud      227230      281
Fraud          101         234

Table 5.1: Confusion matrix of the nearest neighbor algorithm with L2 norm. (a) For 20% train and 10% test sets, (b) For 80% train and 20% test data sets, (c) For 20% train and 80% test data sets.

In addition to using the first closest neighbor to predict the labels of the test data points, we also looked at the first two, the first three and the first four closest neighbors. As mentioned before, we assigned label 1 (fraud) to a test data point if at least one of its neighbors has label 1. The comparison of true positives, false positives and false negatives for the three data splits is shown in Table 5.2. As can be seen in this table, the false positive error, i.e. the number of non-fraud cases incorrectly labeled as fraud, increases as more neighbors are used.

(a)
                 First Neighbor   Second Neighbor   Third Neighbor   Fourth Neighbor
True Positive    34               36                36               36
False Positive   12               27                33               41
False Negative   13               11                11               11

(b)
                 First Neighbor   Second Neighbor   Third Neighbor   Fourth Neighbor
True Positive    54               55                57               57
False Positive   7                20                28               41
False Negative   21               20                18               18

(c)
                 First Neighbor   Second Neighbor   Third Neighbor   Fourth Neighbor
True Positive    234              241               243              247
False Positive   281              501               730              956
False Negative   101              94                92               88

Table 5.2: Comparison of true positives, false positives and false negatives for the first four neighbors based on L2 norm. (a) For 20% train and 10% test sets, (b) For 80% train and 20% test data sets, (c) For 20% train and 80% test data sets.

5.1.2 L30 norm

Since there are 30 features in our data set, we measured the distances based on the L30 norm as an approximation of L∞, to make the computations easier to handle.
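As a quick numerical aside (not from the thesis), one can check how closely the L30 norm tracks the maximum (L-infinity) norm it approximates:

```python
import numpy as np

# Illustrative check that the L30 norm is close to the L-infinity (max) norm.
x = np.array([0.5, -2.0, 1.5, 3.0])
l30 = np.sum(np.abs(x) ** 30) ** (1 / 30)
linf = np.max(np.abs(x))                 # 3.0
# l30 exceeds linf by well under 1%: the largest component dominates the sum.
```

The approximation is good whenever one component is clearly largest, since raising to the 30th power makes the maximum term dominate the sum.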
The results for L30 are quite similar to those for the L2 norm, and even a little worse. The running time for L30 was very long, so we present the outputs only for the small split with 20% for training and 10% for test. As before, for each point in the test set we found the closest point in the training set based on the minimum distance. The histogram of the distances based on the L30 norm for the 20% train and 10% test data sets is shown in Figure 5.2.

Figure 5.2: Histogram of distances based on L30 norm for 20% train and 10% test data sets

To see how many of the fraud cases got a non-fraud closest neighbor, we compared the label of the closest neighbor of each point with its true label. Table 5.3 shows the confusion matrix for the first closest neighbor based on the L30 norm.

                  Neighbor Class
Class          Non-Fraud   Fraud
Non-Fraud      28417       17
Fraud          12          35

Table 5.3: Confusion matrix of the nearest neighbor algorithm with L30 norm for 20% train and 10% test data sets

It can be seen that the performance of L30 is not as good as that of L2: there are more false positives.

                 First Neighbor   Second Neighbor   Third Neighbor   Fourth Neighbor
True Positive    35               36                36               36
False Positive   17               36                46               59
False Negative   12               11                11               11

Table 5.4: Comparison of true positives, false positives and false negatives for the first four neighbors based on L30 norm for 20% train and 10% test data

5.1.3 Summary

Looking at Tables 5.2 (a-c), we can see that as k gets larger, the false negative errors go down and the false positive errors go up. Since recall represents the true positive rate, we computed the recall values for the three data splits (20/10, 80/20 and 20/80) to be 72.3%, 72% and 69.8% respectively, and the precision values to be 73.91%, 88.52% and 45.43% respectively. For fraud detection it is more important to have high recall than high precision, as one does not want to miss fraud cases.
Comparing the recall values for the 20/10 and 80/20 splits, we can conclude that the small 20/10 split performed slightly better than the other splits. We also computed the accuracy for the three splits, obtaining 99.91%, 99.95% and 99.83% respectively. However, the accuracy says nothing about the performance here: since there are very many non-fraud cases and very few fraud cases, the values of TP, FP and FN are tiny compared with TN, so the accuracy is approximately TN/TN = 1.

In general, we can conclude that the supervised methods are quite efficient, but they require that the user has access to labelled training data where one knows which transactions are frauds.

5.2 Unsupervised Nearest Neighbor Distance

We describe here the unsupervised nearest neighbor L2 distance algorithm and the L2 distance to zero algorithm applied to the 20/10, 80/20 and 20/80 training and test set splits. We also present the combination of the L2 distance and the L2 distance to zero. In addition, the inverse L2 distance is used as another metric for fraud detection.

5.2.1 L2 distance

The nearest neighbor distance algorithm was run with the L2 norm for the different splits of training and test sets. For each point in the test set, we found the closest point in the training set based on the minimum distance. The scatter plot of the distances based on the L2 norm is shown in Figure 5.3.

Figure 5.3: Scatter plot of distances based on L2 norm. (a) For 20% train and 10% test sets, (b) For 80% train and 20% test data sets, (c) For 20% train and 80% test data sets.

As can be seen in the plots, the fraud and non-fraud cases do not separate very well, and there are some non-fraud cases with a very large distance. To classify and predict the test data labels with the unsupervised algorithm, we used Equation 3.1.
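The thresholding of Equation 3.1, and the sweep of the cut-off over quantiles that produces the Precision-Recall curves below, can be sketched as follows (synthetic distances and labels; function names are ours, for illustration only):

```python
import numpy as np

def predict_by_cutoff(d, c):
    """Equation 3.1: label 1 (fraud) when the nearest neighbor distance d_i >= c."""
    return (d >= c).astype(int)

def pr_points(d, y_true, quantiles):
    """Precision and recall for cut-offs taken at the given quantiles of d."""
    points = []
    for q in quantiles:
        y_hat = predict_by_cutoff(d, np.quantile(d, q))
        tp = np.sum((y_hat == 1) & (y_true == 1))
        fp = np.sum((y_hat == 1) & (y_true == 0))
        fn = np.sum((y_hat == 0) & (y_true == 1))
        if tp + fp > 0:  # precision undefined when nothing is flagged
            points.append((tp / (tp + fp), tp / (tp + fn)))  # (precision, recall)
    return points

rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.02).astype(int)       # ~2% "frauds", as a stand-in
d = rng.exponential(1.0, 1000) + 0.5 * y        # frauds slightly farther on average
curve = pr_points(d, y, quantiles=[0.10, 0.50, 0.90, 0.999])
```

Raising the cut-off flags fewer points, so recall can only fall while precision (one hopes) rises; this trade-off is what the curves in Figures 5.4 to 5.12 trace out.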
We varied the cut-off point between the 10% and 99.9% quantiles of the L2 distance and predicted the labels of the test data points. To see the performance of the algorithm more clearly, we computed the precision and recall values. The Precision-Recall curve based on the L2 norm is shown in Figure 5.4 for the three splits of training and test sets.

Figure 5.4: Precision-Recall Curve based on L2 norm. (a) For 20% train and 10% test sets, (b) For 80% train and 20% test data sets, (c) For 20% train and 80% test data sets.

Figure 5.4(a) indicates that for the 20% train and 10% test data sets we get a precision of around 1% at 80% recall. This means a high false positive rate. Figure 5.4(b) shows that, with the first 80% as training and the remaining 20% as test, the precision at 80% recall is around 2%. Finally, Figure 5.4(c) shows that, with the first 20% as training and the remaining 80% as test, the precision at 80% recall is around 0.2%.

5.2.2 L2 distance to zero

We have also compared the L2 distances with the distances to zero, to see whether the fraud cases are then distinguishable. The scatter plot of the L2 distances to zero is shown in Figure 5.5 for the three splits of training and test data sets.

Figure 5.5: Scatter plot of distances based on L2 distance to zero. (a) For 20% train and 10% test sets, (b) For 80% train and 20% test data sets, (c) For 20% train and 80% test data sets.

Also for the distances to zero, the fraud and non-fraud cases do not separate very well. The performance of the L2 distance to zero was assessed by its Precision-Recall curve: we set the cut-off point between the 10% and 99.9% quantiles of the distance to zero and computed the precision and recall values. The Precision-Recall curve based on the L2 distance to zero is shown in Figure 5.6.
Figure 5.6: Precision-Recall Curve based on L2 distance to zero. (a) For 20% train and 10% test sets, (b) For 80% train and 20% test data sets, (c) For 20% train and 80% test data sets.

Figure 5.6(a) shows that for the 20% train and 10% test data sets we get a precision of around 5% at 80% recall. Figure 5.6(b) shows that, with the first 80% as training and the remaining 20% as test, the precision at 80% recall is around 3%. Finally, Figure 5.6(c) shows that, with the first 20% as training and the remaining 80% as test, the precision at 80% recall is around 1%.

Additionally, we used the inverse L2 distance for fraud detection in the unsupervised method. The results are presented in Appendix A.1. Since several similar transactions occur very close together in time, the inverse L2 distance values are very large at the beginning, due to dependencies over the transaction times in the test data sets. As a consequence, the inverse distance plots looked almost the same for different cut-offs.

5.2.3 Combination of L2 distance and L2 distance to zero

Another possibility we considered for predicting the fraud and non-fraud cases is to combine the L2 distance and the L2 distance to zero, as suggested in Section 4.7. Since the distributions of the L2 distance and the L2 distance to zero differ in all cases, in a combination with the 'AND' condition the smaller distance dominates, and in a combination with the 'OR' condition the larger distance dominates. To make a more reliable comparison, we considered two cases in our analyses: first we standardized the L2 distance and the L2 distance to zero, and then we used the uniform transformation of them. We then combined the cut-off points between the 10% and 99.9% quantiles of each distance and defined a common range of cut-off points, using Equations 4.1 and 4.2 for fraud prediction.
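The two rescalings and the 'OR'/'AND' combination rules of Equations 4.1 and 4.2 can be sketched as follows (toy distances; function names are ours, for illustration only):

```python
import numpy as np

def zscore(d):
    """Standardization: put a distance on a zero-mean, unit-sd scale."""
    return (d - d.mean()) / d.std(ddof=1)

def uniform_transform(d):
    """Empirical-CDF (rank) transform of Section 4.7: values approx. Uniform[0,1]."""
    return np.searchsorted(np.sort(d), d, side="right") / len(d)

def combine(d1, d2, c, mode="or"):
    """Equations 4.1/4.2: flag fraud when one ('or') or both ('and') of the
    two transformed distances exceed the common cut-off c."""
    if mode == "or":
        return ((d1 >= c) | (d2 >= c)).astype(int)
    return ((d1 >= c) & (d2 >= c)).astype(int)

d1 = np.array([0.1, 0.2, 5.0, 0.3])     # toy L2 nearest neighbor distances
d2 = np.array([1.0, 9.0, 8.0, 1.5])     # toy L2 distances to zero
z1, z2 = zscore(d1), zscore(d2)
pred_or = combine(z1, z2, c=1.0, mode="or")
pred_and = combine(z1, z2, c=1.0, mode="and")
```

Without one of these rescalings, the raw distances live on incomparable scales, which is why the untransformed combinations in Appendix A.2 simply reproduce one of the two base methods.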
Figures 5.7, 5.8 and 5.9 show the Precision-Recall curves for the combination of standardized L2 distance and L2 distance to zero for the 20%/10%, 80%/20% and 20%/80% training and test set splits, respectively. In each figure, part (a) represents the 'OR' condition, when at least one of the L2 distance and the L2 distance to zero is greater than the cut-off point, and part (b) shows the Precision-Recall curve for the 'AND' condition, when both distances are greater than the cut-off point.

Figure 5.7: Precision-Recall Curve based on combination of standardized L2 distance and L2 distance to zero, with the first 20% as training and the next 10% as test. (a) PR curve for 'OR' condition. (b) PR curve for 'AND' condition.

Figure 5.8: Precision-Recall Curve based on combination of standardized L2 distance and L2 distance to zero, with the first 80% as training and the next 20% as test. (a) PR curve for 'OR' condition. (b) PR curve for 'AND' condition.

Figure 5.9: Precision-Recall Curve based on combination of standardized L2 distance and L2 distance to zero, with the first 20% as training and the next 80% as test. (a) PR curve for 'OR' condition. (b) PR curve for 'AND' condition.

For the three splits 20/10, 80/20 and 20/80, the precision values at 80% recall are 3%, 3% and 0.2% for the 'OR' condition and 1%, 2% and 0.1% for the 'AND' condition, respectively. Comparing the areas under the PR curves shown in Figures 5.7, 5.8 and 5.9, and considering the precision values, the 80/20 split with the 'OR' condition performs better than the others.

Next, the combination of the uniform-transformed L2 distance and L2 distance to zero is presented for the three data splits. Figures 5.10, 5.11 and 5.12 show the Precision-Recall curves for this combination.
In each figure, part (a) represents the 'OR' condition, when at least one of the L2 distance and the L2 distance to zero is greater than the cut-off point, and part (b) shows the Precision-Recall curve for the 'AND' condition, when both distances are greater than the cut-off point.

Figure 5.10: Precision-Recall Curve based on combination of uniform transformation of L2 distance and L2 distance to zero, with the first 20% as training and the next 10% as test. (a) PR curve for 'OR' condition. (b) PR curve for 'AND' condition.

Figure 5.11: Precision-Recall Curve based on combination of uniform transformation of L2 distance and L2 distance to zero, with the first 80% as training and the next 20% as test. (a) PR curve for 'OR' condition. (b) PR curve for 'AND' condition.

Figure 5.12: Precision-Recall Curve based on combination of uniform transformation of L2 distance and L2 distance to zero, with the first 20% as training and the next 80% as test. (a) PR curve for 'OR' condition. (b) PR curve for 'AND' condition.

Looking at parts (a) and (b) for the three splits in Figures 5.10, 5.11 and 5.12, one can see that the 80/20 split typically performs better than the other splits, as its area under the Precision-Recall curve is larger than in the two other cases.

The results for the combination of the distances for the three data splits without any standardization or uniform transformation are presented in Appendix A.2. There one can observe that with the 'AND' condition the smaller distance dominates, and with the 'OR' condition the larger distance dominates.

5.2.4 Summary

In the unsupervised method, the analysis was done on the three data splits of 20/10, 80/20 and 20/80 for training and test sets. The distances used in the unsupervised method are the L2 distance, the L2 distance to zero and the inverse L2 distance.
To predict whether a point from the test set is a fraud or not, we looked at the distance to its nearest neighbor in the training set and flagged it as a fraud case if the distance is larger than a predefined threshold. We assumed that the fraud cases have the largest distances on average. However, the scatter plots of the three distances in Figures 5.3, 5.5 and A.1 showed that, for all three splits, the fraud cases are located in the middle of, and mixed with, the normal cases, so it is not easy to distinguish fraud from non-fraud cases.

The performance of the unsupervised methods was evaluated by precision-recall curves. It is more important to have high recall than high precision; in practice we might accept missing at most 20 percent of the real frauds, which means that we need around 80 percent recall. The different Precision-Recall curves show that the precision then is only about 1 to 5 percent. Hence we get a lot of false positives, which have to be checked in order not to miss any fraud cases. In fact, for fraud detection the 80 to 90 percent recall region is the most important part of the Precision-Recall plots.

Based on both the area under the curve and the precision at 80% recall, the L2 distance, the L2 distance to zero, and the standardized 'OR' combination perform similarly, with the L2 distance to zero slightly better. One might still want to use the standardized 'OR' combination to be on the safe side. Combinations using the transformation to uniformity performed a little worse; combinations without transforming to the same scale just reproduced the L2 distance and the L2 distance to zero, respectively; and the inverse distance did not work at all.

6 Conclusion

In this chapter, we state our general conclusions about the supervised and unsupervised nearest neighbor distance methods. The results of the performance comparison of the methods and the examined approaches are discussed.
Finally, we provide some suggestions for possible future research.

6.1 General Overview

For a financial investigator, the time to detection of fraud matters, and it may be better to take an already available small portion of labeled data for training and make predictions for the upcoming transactions. In other cases it may be better to wait until a large portion of labeled data is ready before investigating the rest of the data. Indeed, the main reason for dividing the data set into different proportions of training and test sets is that they correspond to different practical situations. It also depends on what kind of data one has access to: even if someone has access to, say, 80% of the data, it may be better to use only 20% of it, since the computational latency of the analysis grows with the size of the training set.

In a fraud situation it is more important not to miss the fraud cases, so the algorithms should favor recall over precision. This means that there will be more false positives, which have to be checked. In real situations this of course depends on a company's capacity for checking that many non-fraud cases. Banks, for example, may prefer not to miss any fraud cases and will check all non-fraud cases flagged as fraud, to make sure that they do not put their customers at risk.

The supervised methods are quite efficient but require the availability of labeled training data, i.e. one knows which transactions are frauds. In the real world, however, labeled data are often not available, and then one has to use unsupervised methods. In other words, even though supervised learning performs better, when there is no access to labeled data one has to use unsupervised methods.
6.2 Discussion

In this thesis, we have investigated methods for the detection of fraudulent cases in credit card transactions, using both supervised and unsupervised nearest neighbor algorithms. In the supervised method, we looked at the label of the closest neighbor of each test point and used it as the predicted label for that point. In the unsupervised version, for each test point we looked at its distance to the closest point in the training set and flagged it as a fraud case if the distance was larger than the chosen threshold. The performance of the L2 distance, L2 distance to zero and inverse L2 distance methods was evaluated by means of precision-recall curves.

In the supervised nearest neighbor method, we built models that use the first, the first two, the first three, and the first four nearest neighbors to predict the labels of the test data points, assigning the label fraud to a test point if at least one of its nearest neighbors has the label fraud. In the unsupervised version of the nearest neighbor distance algorithm, the general idea is that the fraud cases should be far from the others or isolated in some region. In our analysis, however, the distance-based algorithm could not distinguish between the fraud and non-fraud cases.

The results in Tables 5.2 (a-c) showed that as the number of neighbors k gets larger, the false negative errors go down and the false positive errors go up. Regarding the precision and recall values, for a financial investigator it is more important to have high recall than high precision, since one does not want to miss fraud cases: one wants to minimize the false negatives and maximize the true positives.
Based on both the area under the curve and the precision at 80% recall, the L2 distance, the L2 distance to zero, and the standardized 'OR' combination perform similarly, with the L2 distance to zero slightly better. To be safe and not lose any information, one could still prefer the standardized combination method with the 'OR' condition. Combinations using the transformation to uniformity performed a little worse; combinations without transforming to a common scale essentially reproduced the L2 distance and the L2 distance to zero, respectively; and the inverse distance did not work at all.

If the PR curve of one method lies entirely above the PR curve of another, the method with the upper curve is the better one. If the curves cross, one curve may be better to the left and worse to the right. In credit card fraud it is often the right part of the curve that matters, because it is typically worse to miss frauds: a financial investigator does not want to lose fraud cases, even at the cost of checking more transactions.

6.3 Future Research

The papers [15] and [22] introduce an approach called the reverse nearest neighbor outlier detection method. In this approach, one looks at each point in a data set and counts how many times that point occurs among the nearest neighbors of the other points. The idea is that a point which rarely appears as a neighbor of other points may be an outlier. As future research, it would be interesting to investigate how the reverse nearest neighbor approach works for detecting fraud in credit card transactions.

Bibliography

[1] Financial Fraud, Investopedia, https://www.investopedia.com/financial-fraud-4689710
[2] J. West, M. Bhattacharya, R. Islam, Intelligent Financial Fraud Detection Practices: An Investigation, School of Computing & Mathematics, Charles Sturt University (2015).
[3] C. Mohan, K.
Mehrotra, Anomaly detection in banking operations (2017), https://www.semanticscholar.org/paper/Anomaly-detection-in-banking-operations-Mohan-Mehrotra/b48afd871d0f2d846a9d1d8ab682bc70da12dde3
[4] A. Banarescho, Detecting and Preventing Fraud with Data Analytics, Procedia Economics and Finance, Vol. 32 (2015), pp. 1827-1836.
[5] M. Vasarhelyi, H. Issa, Application of Anomaly Detection Techniques to Identify Fraudulent Refunds (2011), https://dx.doi.org/10.2139/ssrn.1910468
[6] M. Ahmed, A. Mahmood, M. Islam, A survey of anomaly detection techniques in financial domain, Future Generation Computer Systems, Vol. 55 (2016), pp. 278-288.
[7] A. Patcha, J. M. Park, An overview of anomaly detection techniques: Existing solutions and latest technological trends, Computer Networks (2007), pp. 3448-3470.
[8] A. Sabau, Survey of Clustering based Financial Fraud Detection Research, Informatica Economica Journal, Vol. 16 (2012), pp. 110-122.
[9] Sh. Agrawal, J. Agrawal, Survey on Anomaly Detection using Data Mining Techniques, Procedia Computer Science, Vol. 60 (2015), pp. 708-713.
[10] W. Hilal, S. A. Gadsden, J. Yawney, Financial Fraud: A Review of Anomaly Detection Techniques and Recent Advances, Expert Systems With Applications (2022).
[11] John Ulzheimer, How Do Credit Card Issuers Detect Fraud? - Credit Card Insider, YouTube (2015), URL: https://www.youtube.com/watch?v=N3LjCsVYQ-k
[12] Kaggle, Credit Card Fraud Detection, https://www.kaggle.com/mlg-ulb/creditcardfraud
[13] K. Muameleci, Anomaly Detection in Credit Card Transactions using Multivariate Generalized Pareto Distribution, Chalmers University of Technology (2022), https://odr.chalmers.se/handle/20.500.12380/304780
[14] S. Thudumu, P. Branch, J. Jin, J. Singh, A comprehensive survey of anomaly detection techniques for high dimensional big data, Journal of Big Data (2020).
[15] M. Radovanović, A. Nanopoulos, M.
Ivanović, Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection, IEEE Transactions on Knowledge and Data Engineering (May 2015).
[16] H. Bodepudi, Credit Card Fraud Detection Using Unsupervised Machine Learning Algorithms, International Journal of Computer Trends and Technology (2021), https://ijcttjournal.org/archives/ijctt-v69i8p101
[17] K. Vishwakarma, Credit Card Fraudulent Detection using Machine Learning, International Research Journal of Engineering and Technology (IRJET), Vol. 7, Issue 09 (Sep 2020).
[18] M. Ummul Safa, R. M. Ganga, Credit Card Fraud Detection Using Machine Learning, International Journal of Research in Engineering, Science and Management, Vol. 2, Issue 11 (Nov 2019).
[19] M. Zhao, J. Chen, Y. Li, A Review of Anomaly Detection Techniques Based on Nearest Neighbor (2018), https://www.atlantis-press.com/article/25897526.pdf
[20] C. Navamani, M. Phil, Credit Card Nearest Neighbor Based Outlier Detection Techniques, International Journal of Computer Techniques, Vol. 5, Issue 2 (2018), http://www.ijctjournal.org/Volume5/Issue2/IJCT-V5I2P8.pdf
[21] J. S. Gosavi, V. S. Wadne, Unsupervised Distance-Based Outlier Detection Using Nearest Neighbours Algorithm on Distributed Approach: Survey (2014), https://www.semanticscholar.org/paper/Unsupervised-Distance-Based-Outlier-Detection-Using-Gosavi-Wadne/d414d89a1d13c4ae7a73ad274e3ea320289d6369
[22] N. S. Altman, An introduction to kernel and nearest-neighbor nonparametric regression, The American Statistician, 46(3) (1992), pp. 175-185.
[23] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning, 20(3) (1995), pp. 273-297.
[24] E. Ileberi, Y. Sun, Z. Wang, A machine learning based credit card fraud detection using the GA algorithm for feature selection, Journal of Big Data (2022), 9:24.
[25] Y. Sahin, E. Duman, (2011).
Detecting Credit Card Fraud by Decision Trees and Support Vector Machines, International MultiConference of Engineers and Computer Scientists, Hong Kong.
[26] L. Breiman, Random forests, Machine Learning 45, 5 (2001).
[27] O. Johansson, Anomaly detection: Credit card fraud detection using Variational Autoencoder, One-class Classifier and Random Forest, Chalmers University of Technology (2022).
[28] M. Gumbao, Best clustering algorithms for anomaly detection, Towards Data Science (June 3, 2019), https://towardsdatascience.com/best-clustering-algorithms-for-anomaly-detection-d5b7412537c8
[29] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer Series in Statistics (2008).
[30] M. H. Ahmed, Credit Card Fraud Detection Techniques: A Survey, University of Management and Technology (2022).
[31] X. Niu, L. Wang, X. Yang, A Comparison Study of Credit Card Fraud Detection: Supervised versus Unsupervised (2019).
[32] M. Rezapour, Anomaly Detection using Unsupervised Methods: Credit Card Fraud Case Study, International Journal of Advanced Computer Science and Applications, Vol. 10, No. 11 (2019).
[33] V. U. Mahesh, S. K. Reddy, L. Reddy, S. Krishna, J. Dilleshwar, S. Kaur, Credit Card Fraud Detection using Unsupervised Machine Learning, International Research Journal of Engineering and Technology (IRJET), Vol. 08, Issue 04 (Apr 2021).
[34] Intuition - brute force, KD & Ball tree, learndataa, YouTube (2021), URL: https://www.youtube.com/watch?v=FnJj28u7rF0
[35] H. Marius, Tree algorithms explained: Ball Tree Algorithm vs. KD Tree vs. Brute Force, Towards Data Science (Jun 15, 2020), https://towardsdatascience.com/tree-algorithms-explained-ball-tree-algorithm-vs-kd-tree-vs-brute-force-9746debcd940
[36] Nearest Neighbors. Retrieved from Scikit Learn.
URL: https://scikit-learn.org/stable/modules/neighbors.html
[37] Precision-Recall. Retrieved from Scikit Learn. URL: https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
[38] V. N. Dornadula, S. Geetha, Credit Card Fraud Detection using Machine Learning Algorithms, Procedia Computer Science 165 (2019), pp. 631-641, https://doi.org/10.1016/j.procs.2020.01.057
[39] P. Tiwari, S. Mehta, N. Sakhuja, J. Kumar, A. K. Singh, Credit card fraud detection using machine learning: A study (2021), https://arxiv.org/abs/2108.10005

A Appendix

In this Appendix we provide additional results for the unsupervised nearest neighbor algorithm with the inverse L2 distance. Results for the combination of the L2 distance and the L2 distance to zero, without any standardization or uniform transformation, are also presented. The algorithms were applied to the three splits 20/10, 80/20, and 20/80 of the data into training and test sets.

A.1 Inverse L2 distance

The nearest neighbor distance algorithm was run with the inverse L2 distance for the different portions of training and test sets. For each point in the test set, we found the closest point in the training set and took the inverse of that distance. The scatter plots of the inverse L2 distances are shown in Figure A.1.

Figure A.1: Scatter plot of distances based on the inverse L2 norm. (a) For 20% train and 10% test sets. (b) For 80% train and 20% test sets. (c) For 20% train and 80% test sets.

The performance of the inverse L2 distances was assessed by Precision-Recall curves computed for cut-off points between the 10% and 99.9% quantiles of the inverse L2 distances. Figure A.2 shows the Precision-Recall curves based on the inverse L2 distances.

Figure A.2: Precision-Recall Curve based on the inverse L2 distance. (a) For 20% train and 10% test sets. (b) For 80% train and 20% test sets. (c) For 20% train and 80% test sets.
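The scoring and evaluation procedure just described can be sketched as follows. This is an illustrative reconstruction on synthetic data, not the thesis code; the synthetic labels and the 50-point quantile grid are our own assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 30))
X_test = rng.normal(size=(200, 30))
y_test = (rng.random(200) < 0.02).astype(int)  # synthetic fraud labels

# Inverse L2 distance: 1 / (distance to the closest training point).
nn = NearestNeighbors(n_neighbors=1).fit(X_train)
dist, _ = nn.kneighbors(X_test)
inv_scores = 1.0 / dist[:, 0]

# Precision-recall pairs for cut-off points between the 10% and 99.9%
# quantiles of the scores; a point is flagged when its score exceeds
# the cut-off.
precisions, recalls = [], []
for q in np.linspace(0.10, 0.999, 50):
    cutoff = np.quantile(inv_scores, q)
    pred = inv_scores > cutoff
    tp = np.sum(pred & (y_test == 1))
    fp = np.sum(pred & (y_test == 0))
    fn = np.sum(~pred & (y_test == 1))
    precisions.append(tp / (tp + fp) if tp + fp > 0 else 1.0)
    recalls.append(tp / (tp + fn) if tp + fn > 0 else 0.0)

print(len(recalls), len(precisions))
```

Plotting recalls against precisions gives the PR curves shown in Figure A.2.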
It can be seen in Figures A.2 (a)-(c) that the recall values in all three curves do not extend all the way to recalls of 80% or higher. In general, there are several very similar transactions made very close in time, and due to these dependencies the inverse L2 distance values are very large at the beginning of the test period. Therefore, the inverse distance plots look almost the same for different cut-off values.

A.2 Combination of L2 distance and L2 distance to zero without any transformation

As mentioned in Section 4.7, another way to predict the fraud and non-fraud cases is to combine the L2 distance and the L2 distance to zero. Here, the results are presented for the non-transformed L2 distance and L2 distance to zero. We combined cut-off points between the 10% and 99.9% quantiles of each distance and defined a common range of cut-off points for predicting the fraud cases using Equations 4.1 and 4.2. Figures A.3, A.4, and A.5 show the Precision-Recall curves for the three different portions of training and test sets. In each figure, part (a) shows the curve obtained when at least one of the L2 distance and the L2 distance to zero is greater than the cut-off point, and part (b) the curve obtained when both distances are greater than the cut-off point.

Figure A.3: Precision-Recall Curve based on the combination of L2 distance and L2 distance to zero, with the first 20% used as training and the next 10% as test. (a) PR curve for 'OR' condition. (b) PR curve for 'AND' condition.

Figure A.4: Precision-Recall Curve based on the combination of L2 distance and L2 distance to zero, with the first 80% used as training and the next 20% as test. (a) PR curve for 'OR' condition. (b) PR curve for 'AND' condition.

Figure A.5: Precision-Recall Curve based on the combination of L2 distance and L2 distance to zero, with the first 20% used as training and the next 80% as test.
(a) PR curve for 'OR' condition. (b) PR curve for 'AND' condition.

It can be seen that in all figures the curves in part (a), for the OR condition, are similar to the corresponding L2 distance to zero curves, and that the curves in part (b), for the AND condition, are almost the same as the corresponding L2 distance curves. The reason is that the distributions of the L2 distance and the L2 distance to zero are different in all cases: with the AND condition the smaller distance dominates, and with the OR condition the larger distance dominates.
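The OR and AND combination rules can be sketched as follows. This is an illustrative reconstruction with synthetic scores, not the thesis code; the assumption that Equations 4.1 and 4.2 correspond to these two rules, and the common quantile level for both cut-offs, are our own.

```python
import numpy as np

rng = np.random.default_rng(2)
# Two anomaly scores per test point: the L2 nearest neighbor distance
# and the L2 distance to zero (synthetic stand-ins here).
d_nn = rng.lognormal(mean=0.0, sigma=1.0, size=1000)
d_zero = rng.lognormal(mean=1.0, sigma=0.5, size=1000)

q = 0.99  # one common quantile level for both cut-off points
c_nn = np.quantile(d_nn, q)
c_zero = np.quantile(d_zero, q)

# 'OR' rule: flag a point if at least one distance exceeds its cut-off.
pred_or = (d_nn > c_nn) | (d_zero > c_zero)
# 'AND' rule: flag a point only if both distances exceed their cut-offs.
pred_and = (d_nn > c_nn) & (d_zero > c_zero)

# The AND rule can never flag more points than the OR rule.
print(pred_and.sum() <= pred_or.sum())
```

Because the two score distributions differ, the OR flagging is driven by whichever score is large and the AND flagging by whichever is small, which is the domination effect described above.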