BACHELOR'S THESIS IN ECONOMICS 2017

CORPORATE BANKRUPTCY PREDICTION USING MACHINE LEARNING TECHNIQUES

BJÖRN MATTSSON & OLOF STEINERT
supervised by ANDREAS DZEMSKI

Abstract

Estimating the risk of corporate bankruptcies is of great importance to creditors and investors. For this reason, bankruptcy prediction constitutes an important area of research. In recent years, artificial intelligence and machine learning methods have achieved promising results in corporate bankruptcy prediction settings. Therefore, in this study, three machine learning algorithms, namely random forest, gradient boosting and an artificial neural network, were used to predict corporate bankruptcies. Polish companies between 2000 and 2013 were studied and the predictions were based on 64 different financial ratios. The obtained results are in line with previously published findings, and it is shown that very good predictive performance can be achieved with the machine learning models. The reason for the impressive predictive performance is analysed, and it is found that the missing values in the data set play an important role: prediction models with surprisingly good performance could be obtained from information about the missing values alone, with the financial information excluded.

Department of Economics
UNIVERSITY OF GOTHENBURG, SCHOOL OF BUSINESS, ECONOMICS AND LAW
Gothenburg, Sweden 2017

Corporate Bankruptcy Prediction using Machine Learning Techniques
BJÖRN MATTSSON, OLOF STEINERT
© BJÖRN MATTSSON, OLOF STEINERT, 2017.
Supervisor: Andreas Dzemski
Bachelor's Thesis 2017
Department of Economics
UNIVERSITY OF GOTHENBURG, SCHOOL OF BUSINESS, ECONOMICS AND LAW
Box 100, S-405 30 Gothenburg
Telephone +46 31-786 0000
Typeset in LaTeX
Gothenburg, Sweden 2017

Acknowledgements

We want to express our sincere gratitude to our supervisor Andreas Dzemski for helping us throughout the process, and offering valuable guidance in how to make machine learning interesting for economists. We also want to thank Tim Falk and Alexander Piauger for offering valuable feedback during the opposition.

Björn Mattsson & Olof Steinert, London & Stockholm, August 2017

Contents

1 Introduction
  1.1 Motivation
  1.2 Problem formulation
  1.3 Scope
  1.4 Related work
2 Theory
  2.1 Introduction to machine learning
    2.1.1 Machine learning problems
    2.1.2 Machine learning in economics
    2.1.3 Model selection
      2.1.3.1 Cross validation
  2.2 Measurement metrics
    2.2.1 Accuracy
    2.2.2 Confusion matrix
    2.2.3 Sensitivity and specificity
    2.2.4 Receiver operating characteristic
    2.2.5 Area under curve
    2.2.6 Cross-entropy
  2.3 Machine learning models
    2.3.1 Decision trees
    2.3.2 Random forest
    2.3.3 Gradient boosting
    2.3.4 Artificial neural networks
      2.3.4.1 Feed forward neural networks
      2.3.4.2 Activation functions
      2.3.4.3 Stochastic gradient descent
      2.3.4.4 Weight initialisation
      2.3.4.5 Regularisation
      2.3.4.6 Oversampling
  2.4 Data pre-processing
    2.4.1 Normalising
    2.4.2 Missing value imputation
3 Data
  3.1 Data set description
  3.2 Missing values
4 Methods
  4.1 Experimental design
    4.1.1 Splitting up the data
    4.1.2 Data pre-processing
    4.1.3 Classifiers
  4.2 Hyperparameter exploration
    4.2.1 Evaluation of predictive performance
5 Results and discussion
  5.1 Algorithm comparison
    5.1.1 Performance of multilayer perceptron
  5.2 Importance of missing values
  5.3 Industry usage
6 Conclusion
A Appendix: Programming code

1 Introduction

1.1 Motivation

Estimating the risk of corporate bankruptcies is of great importance to creditors and investors. There are large indirect and direct costs associated with financial distress and bankruptcies (Altman, 1984), and for this reason bankruptcy prediction has for a long period of time constituted an extensive area of research (Linden et al., 2015). Corporate bankruptcies can have serious effects both locally and globally (De Haas and Van Horen, 2012): employees, investors, customers, suppliers and their financiers are all affected when a company disappears (Engström, 2002). In some cases a corporate bankruptcy can cause an entire industry to suffer (Yang et al., 2015).

Until recently the dominating methods for predicting corporate bankruptcies have been based on statistical modelling; lately, however, models based on machine learning have been proposed (Linden et al., 2015). Recently, machine learning models have successfully been used for many classification and regression problems, and these models have often outperformed traditional classification methods (Krizhevsky et al., 2012).

The purpose of bankruptcy prediction is to assess the financial health status and future perspectives of a company. For a given period of time this problem can be modelled as a two-class classification problem (Zięba et al., 2016): companies either survive the given time period, or go bankrupt during it. The problem is to predict which of these two possible outcomes is the more probable one.

Financial reporting laws and public expectations of transparency mean that there is a lot of data available concerning companies' financial status (Bredart, 2014). The large amount of data makes the area well suited for sophisticated, data-intensive computational methods (Qiu et al., 2016).
1.2 Problem formulation

The intention of the study is to illustrate and investigate how machine learning can be exploited in the field of economics. More specifically, the aim is to study how machine learning methods can be used to predict corporate bankruptcies. The findings of Zięba et al. (2016) serve as the starting point for this survey. Zięba et al. suggest a machine learning approach to the problem of predicting corporate bankruptcies and report impressive predictive performance. In their study they use financial ratios to address bankruptcy prediction of Polish companies in the manufacturing sector between 2000 and 2013.

The intention of this study is to reproduce, improve and extend the work of Zięba et al. (2016). The first objective is to reproduce the impressive predictive performance. As a second step, we want to investigate whether we can achieve better performance for a neural network than reported by the authors. In their paper, Zięba et al. report very poor performance for a neural network based classifier. This is surprising, since neural networks have lately proved to be very successful in classification tasks in general and have also performed well in bankruptcy prediction settings (Bredart, 2014). For this reason, another objective of the project is to propose a neural network based classifier with improved performance on the data set. The intention is to identify key success factors for the performance of neural networks in the context of corporate bankruptcy prediction.

Lastly, we want to extend the research scope by analysing underlying reasons for the outstanding performance reported for some of the algorithms used by Zięba et al. (2016). It seems unlikely that it is possible to make such accurate predictions, and for this reason we explore the importance of the missing values in the data set and the implications of their unbalanced distribution.

In summary, the results presented by Zięba et al. (2016) were reproduced.
In addition, a neural network with significantly better performance than previously reported was built. The neural network based model was almost on par with the best performing tree-based models. Finally, it was found that the missing values in the data set play an important role: they carry almost all information that is useful for the predictions. This is, of course, problematic and raises questions about the validity of the data set of Polish companies first used by Zięba et al. (2016).

1.3 Scope

In this study, corporate bankruptcy prediction using machine learning methods has been studied. To facilitate comparison, the same data set that was used in the context of bankruptcy prediction by Zięba et al. (2016) was studied. Such an approach facilitated validation and benchmarking of results. It should be underlined that the intention of the study was not to suggest an ideal prediction framework or to achieve an optimal prediction algorithm, since such objectives would make the project more about engineering than about economics.

Moreover, it should be pointed out that no attempts to add any new information to the data set have been made. It is likely, or at least possible, that new information retrieved from other data sources would be helpful. For example, inflation, growth and competition indices would possibly carry useful information; however, such work is beyond the scope of this study.

The study addresses bankruptcy prediction of Polish companies in the manufacturing sector between 2000 and 2013, but the methods used should be considered externally valid and applicable to other sectors and countries as well. An aspect that naturally would have been interesting to examine when investigating bankruptcy prediction models is how well they generalise to data collected at other time periods. However, the exploited data set did not allow for such analysis, since the individual observations were not labelled with dates.
1.4 Related work

The first documented attempts at bankruptcy prediction were carried out by Patrick (1932). At that time no statistical model was used; the predictions were based on Patrick's own interpretations of the financial ratios and the trends he could discern. Not until the 1960s were statistical models and hypothesis testing used for bankruptcy prediction. The work was initiated by Beaver (1966), and two years later Altman (1968) proposed the use of multiple discriminant analysis for bankruptcy prediction. This work was trend-setting, and in the years that followed Altman's ideas inspired many others. Another turning point in the field was the introduction of generalised linear models (Ohlson, 1980). Generalised linear models have some advantages: firstly, they allow for analysis of the certainty of the predictions and, secondly, it is possible to analyse the effect of each predictor individually.

In recent years machine learning methods have several times been successfully used to predict corporate bankruptcies. Neural networks trained with back propagation are the most common method for this type of problem (Tsai and Wu, 2008). In a study of small and medium-sized Belgian companies it was shown that relatively good results could be achieved by using only a small number of easily accessible financial ratios as inputs to an artificial neural network (Bredart, 2014).

The study of the Belgian companies was not the first of its kind. It was inspired by, among others, Shah and Murtaza (2000), who used a neural network to predict bankruptcies among US companies between 1992 and 1994, and Becerra et al. (2005), who studied British corporate bankruptcies between 1997 and 2000 using a similar method. Lately, some attention has also been devoted to ensemble classifiers. Alfaro et al. (2008) and Zięba et al. (2016) have shown that ensemble classifiers can successfully be applied to bankruptcy prediction and significantly outperform other methods.
2 Theory

This chapter aims to give a theoretical background to some technical concepts that are important for the project. The chapter gives a general introduction to machine learning and explains how machine learning techniques can be used for bankruptcy predictions. It also describes how the performance of the prediction models can be evaluated. The machine learning algorithms decision trees, random forest and gradient boosting are introduced briefly, whereas artificial neural networks are given a more detailed introduction. Lastly, some simple techniques for data pre-processing are described.

2.1 Introduction to machine learning

Machine learning is a field that focuses on drawing conclusions from large amounts of data. This is done by letting a model find structures and relationships in the data. By presenting a machine learning model with samples from a data set, the model learns to represent hidden relationships and patterns in the data. This process is commonly referred to as training. After training, the algorithm is supposed to be able to generalise what it has learned to new, unseen samples (Friedman et al., 2001). The ability of machine learning models to generalise to new data points is what engineers and researchers who choose to use machine learning methods are most interested in.

The field of machine learning combines computation and statistics (Bishop, 2006). Computational power is needed in the training procedure, which sometimes can take weeks even on computers built especially for the purpose. Statistics is used to derive the mathematics behind the machine learning models. It can be argued that the main reasons for the increased attention to machine learning during recent years are the increased processing power of modern computers and the large amounts of data easily accessible in today's digital society (Salvador et al., 2017).
2.1.1 Machine learning problems

The machine learning field can be divided into three different sub-fields: supervised learning, unsupervised learning and reinforcement learning.

Supervised learning is about finding a predictive model that maps inputs to outputs (Hastie et al., 2009). To achieve this, the learning algorithm is presented with examples of input and output pairs. The goal of the learning process is to produce a function that gives the desired output for inputs that have not been presented to the algorithm during training. This is achieved by fitting internal parameters, often referred to as θ, to the data the model tries to approximate. The optimal set of parameters found by the algorithm is often denoted θ̂. The focus in supervised learning is to find a model that maps unseen inputs to outputs as well as possible. This is clearly similar to well-known techniques such as ordinary least squares: we want to use the data to make forecasts about future data points. In machine learning the models are generally very flexible, and the internal parameters can thus recreate many different types of patterns in the data. However, the downside of this is that the values of the fitted parameters cannot easily be analysed with statistical rigour, and they typically lack structural or causal interpretation. This is in contrast with ordinary least squares, where the fitted parameters, β̂, can be used to understand causal relationships in the data.

Supervised learning can further be broken down into classification and regression problems. In classification problems the data points belong to two or more classes, and the problem is to assign unseen inputs to the correct class. In regression, on the other hand, the outputs are continuous rather than discrete and cannot be summarised into classes.

In unsupervised learning no labels are given to the learning algorithm. The problem is instead to find a descriptive model of the input (Friedman et al., 2001).
An example of a possible unsupervised learning problem is customer segmentation. In such a problem we want to assign customers to different categories that are not known to us before using the model. The customer segments can later be used for targeted marketing.

Lastly, reinforcement learning is about learning a policy: what actions are most beneficial depending on the circumstances. In reinforcement learning the learning algorithm is not presented with any supervised output but instead with a delayed reward. Instead of starting with a pre-defined data set, the model itself collects the data. Reinforcement learning is often used in a game-like framework, e.g. a robot in a maze. Reinforcement learning recently became widely popular when Mnih et al. (2015) presented an algorithm that could learn to play many Atari games with super-human performance.

This thesis considers only supervised learning. Corporate bankruptcy prediction is treated as a two-class classification problem: a company either belongs to the class of companies that face bankruptcy or to the class of companies that still operate after a given time period.

2.1.2 Machine learning in economics

Machine learning and artificial intelligence have proved to be very useful in many different fields; for example, the techniques have revolutionised computer vision (Sharif Razavian et al., 2014). However, the role of machine learning in economics has so far been limited. In economics, structural models are instead dominating. The reason for this is likely that in economics it is often more important to be able to draw statistically significant conclusions about causal effects from the model than to have a model with as good predictive performance as possible. With that said, there is still large potential for machine learning in economics: there is much data available and a need to understand it.
Many problems in the field of economics can also be regarded in terms of classification or regression (McNelis, 2005). Furthermore, linear models cannot fully exploit the information in the data, since the underlying relationships are often non-linear rather than linear (Clements et al., 2004). Neural networks are able to approximate any continuous function, and they could therefore be expected to provide more effective models in such applications. Kuan and White (1994) provide a good overview of how artificial neural networks can be used in econometrics. From these advantages and disadvantages it is clear that machine learning is suitable for problems where the interest is not necessarily to understand the governing mechanisms, but rather to have a model with as good predictive performance as possible.

Predicting inflation is an important but difficult task faced by economists and central banks. Historically, this has been done with different types of autoregressive models. Lately, artificial neural networks, or simply neural networks, have gained more attention as prediction models, and nowadays a number of different types of neural network models are commonly used in the field (Choudhary and Haider, 2012). Gupta and Kashyap (2015) provide a pedagogical description of how neural networks can be used to predict inflation.

Neural networks are extensively used in finance for predicting stock prices (Anderson and McNeill, 1992). There are also a number of attempts to predict currency exchange rates using machine learning methods (Verkooijen, 1996). However, in macroeconomics, the use of neural networks, or even of machine learning methods in general, is not very common yet.

Bankruptcy prediction is another field of economics which is compatible with machine learning.
To a bank that lends money to small companies, the possibility to use the model to draw statistically significant conclusions about the underlying behaviour is of limited interest. On the other hand, having a model with good predictive power is of the highest importance.

2.1.3 Model selection

In machine learning the data set is typically divided into a training set and a test set. The training set is used by the model to discover and unveil hidden patterns and relationships in the data. The test set is used to measure the strength and utility of the trained model (Bishop, 2006). A loss or cost function is typically defined, and the training can then be seen as an optimisation problem. The idea is that the cost function quantifies the error that the model makes in its predictions of the desired output (Nielsen, 2015).

Overfitting occurs when the model is more complex than the data relationship. The model loses its predictive power when it has started to learn random noise in the data. The opposite of overfitting is underfitting, which means that the model does not fully describe the underlying relationship between the inputs and the outputs (Wahde, 2008). The problems of over- and underfitting are summarised by the bias–variance tradeoff. The bias and the variance can be seen as two different sources of error. The error due to variance is the error caused by the sensitivity of the algorithm to small fluctuations in the training set in the model building process (overfitting), and the error due to bias is the error caused by incorrect or insufficient assumptions of the model (underfitting) (Von Luxburg and Schölkopf, 2008).

To avoid overfitting, a validation set is sometimes used in addition to the training and test sets. The errors of the predictions on both the training and validation sets can then be monitored during training. However, no feedback is given to the algorithm regarding the performance of the predictive model on the validation set.
The information about the performance on the validation set is used to retrieve the model that generalises best after training (Wahde, 2008). The test set is finally used as a final check that the model generalises well. Figure 2.1 shows how overfitting can be avoided by using a validation set.

Figure 2.1: The figure illustrates the overfitting phenomenon. The number of training epochs is plotted on the x-axis and the training error on the y-axis. As the training progresses, the training error is gradually reduced. The validation error also drops in the beginning; however, after some time the validation error begins to grow as the model starts fitting the noise in the data. When this happens it is time to stop training.

2.1.3.1 Cross validation

Another method which is commonly used to avoid overfitting is cross validation (Arlot et al., 2010). Instead of statically dividing the data set into one part used for training and another used for testing (and potentially also a validation set), a dynamic division is used. The data set is divided into x equally large subsets; x − 1 of those are used for training and the last subset is used for testing. This procedure is repeated x times, with each of the x subsets acting as the test set once. In this way we get x measurements of the scoring metric that we are concerned with, and we can calculate both the mean and the standard error of this scoring metric. This makes overfitting improbable, since it stops us from choosing a model that performs well for only one particular division of the data set. Instead, the model that works best on many different divisions is chosen. However, cross validation requires that the model is trained x times on the data subsets, which increases the required computation time. For this reason cross validation is in some cases unfeasible (Blockeel and Struyf, 2002).
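The x-fold procedure described above can be sketched with scikit-learn. This is an illustrative sketch on synthetic data, not the pipeline actually used in this thesis; the 64-feature shape and the class weights merely mimic the bankruptcy data set described later.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the data set: 64 features, heavily unbalanced classes.
X, y = make_classification(n_samples=500, n_features=64,
                           weights=[0.95, 0.05], random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# x = 5 folds: the model is trained 5 times, each fold acting as test set once.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")

# Mean and standard error of the scoring metric over the folds.
print("AUC: %.3f +/- %.3f" % (scores.mean(), scores.std() / np.sqrt(len(scores))))
```

Comparing two candidate models then amounts to comparing their mean scores, with the standard error indicating how stable each estimate is across divisions.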
2.2 Measurement metrics

To measure the performance of a machine learning algorithm we need to use some performance metric. There are many different performance metrics to choose from, and depending on the problem and the purpose, different metrics are suitable. For example, different performance metrics are generally used in regression and in classification. Since we are only concerned with classification problems, we will only present metrics that are commonly used for this type of task.

2.2.1 Accuracy

Accuracy is probably the metric that is easiest to understand. It is defined as:

Acc = t / N,

where t is the number of samples which were correctly classified and N is the total number of samples (including both correctly and incorrectly classified examples) (Bishop, 2006). Accuracy is easy to understand, as it tells us how often the classifier is correct. For example, an accuracy of 0.8 means that in 80% of the cases the classifier gives the correct prediction. However, it does not tell us anything about the cases where the classifier is incorrect. It might be that the classifier is very good at making predictions when a sample belongs to one of the classes, but very poor when the sample belongs to the other class.

Table 2.1: The table illustrates how a confusion matrix is constructed. Each classified data point ends up in one of the four cells of the matrix; which cell depends on what class is predicted and what class the data point actually belongs to.

                      | Predicted class 1    | Predicted class 0
Actual class 1 (P)    | True positive (TP)   | False negative (FN)
Actual class 0 (N)    | False positive (FP)  | True negative (TN)

Also, depending on the data set, a trivial classifier can achieve good accuracy. Let us, for example, assume that we have a machine that predicts whether or not a company will go bankrupt in a certain time period.
If 80% of the companies in the study still operate after the given time period, a machine that predicts "still operating" for all companies (including the ones that actually face bankruptcy) would be correct in 80% of the cases, and would thus have an accuracy of 0.8. This would of course be a completely useless machine; however, the reported accuracy can still appear quite impressive.

2.2.2 Confusion matrix

To better analyse the performance of a classifier we can visualise it in something called a confusion matrix, shown in Table 2.1. In a confusion matrix the numbers of true positives, false positives, true negatives and false negatives are listed (Sing et al., 2005). The machine discussed in the previous paragraph would get many true positives (companies that still operate and are also predicted to do so), but unfortunately also many false positives (companies that face bankruptcy, but that the machine predicts to be still operating). It will get no true negatives or false negatives, since it never predicts that a company will face bankruptcy.

2.2.3 Sensitivity and specificity

From the confusion matrix many informative metrics can be derived. Two such metrics are the true positive rate (TPR), also called the sensitivity, and the false positive rate (FPR), which equals one minus the specificity (Sing et al., 2005). They are defined as:

TPR = TP / P = TP / (TP + FN),
FPR = FP / N = FP / (FP + TN).

Measured with these metrics, the machine in the example above would get TPR = 1, which is very good, but also FPR = 1, which signifies poor performance.

Different classifiers will have different values of the true positive rate and the false positive rate. These two numbers are typically a good way of measuring the desired performance of a classifier. It is important to understand the application when analysing these numbers.
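The confusion-matrix cells and the derived rates are straightforward to compute by hand. The following numpy sketch uses made-up labels; taking class 1 to mean bankrupt and class 0 to mean still operating is an arbitrary convention chosen here for illustration.

```python
import numpy as np

# Hypothetical true classes and predictions (1 = bankrupt, 0 = still operating).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])

TP = np.sum((y_pred == 1) & (y_true == 1))   # bankruptcies correctly flagged
FN = np.sum((y_pred == 0) & (y_true == 1))   # bankruptcies missed
FP = np.sum((y_pred == 1) & (y_true == 0))   # healthy firms wrongly flagged
TN = np.sum((y_pred == 0) & (y_true == 0))   # healthy firms correctly passed

TPR = TP / (TP + FN)   # sensitivity: share of actual positives detected
FPR = FP / (FP + TN)   # share of actual negatives incorrectly flagged
```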
In the example above, it would in some applications be more important to have a low false positive rate than a high true positive rate, and in others the other way around. For example, if a bank seeks to invest in companies, a high true positive rate is typically more important; a high true positive rate means that few companies that will end up bankrupt are missed. Whereas for a governmental organisation that wants to discover companies that are likely to go bankrupt, in order to offer them some sort of costly support, a low false positive rate is more important, to avoid wasting time and resources on companies that would survive even without the extra help.

2.2.4 Receiver operating characteristic

Most machine learning algorithms do not return a binary prediction (e.g. still operating or bankrupt), but rather a probability. To calculate the true positive rate and the false positive rate from the returned probabilities we have to choose a threshold, which we use to decide which class we consider a sample to belong to. For example, all companies that the machine believes will face bankruptcy with more than 70% probability we consider as bankrupt. As we increase this threshold, both the true positive rate and the false positive rate will decrease. If we set the threshold to 100% we will have TPR = FPR = 0, and if we set the threshold to 0% we will get TPR = FPR = 1. In between these two endpoints we have a space of opportunity within which we can decide how we want our classifier to perform. The curve which connects these two values is called a receiver operating characteristic (ROC) curve. In Figure 2.2 we can see an example of such a ROC curve for one of the algorithms that have been developed in this study. The dashed line represents what we would get by random guessing, and the blue solid line represents what our algorithm returns.
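The threshold sweep that traces a ROC curve is available in scikit-learn as roc_curve; the probabilities below are invented purely for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])      # actual classes
y_score = np.array([0.1, 0.4, 0.35, 0.8,
                    0.2, 0.9, 0.6, 0.7])           # model's estimated P(class 1)

# Each distinct score acts as a threshold, giving one (FPR, TPR) point;
# the endpoints (0, 0) and (1, 1) correspond to thresholds 100% and 0%.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print(auc(fpr, tpr))   # area under the traced curve
```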
Values closer to the upper left corner are better, since they mean a high true positive rate as well as a low false positive rate (Bishop, 2006).

2.2.5 Area under curve

From just looking at a graph it is difficult to compare the performance of two different algorithms. To overcome this, a metric called area under curve (AUC) has been derived. It is, as the name suggests, defined as the area under the ROC curve. An algorithm that guesses randomly will obtain a value of 0.5, since the area under a triangle with legs of length one is precisely 0.5. The AUC value for the algorithm in Figure 2.2 is 0.852. AUC can be interpreted as how much opportunity our algorithm gives us for different scenarios, where potentially different measures are of importance.

Figure 2.2: Receiver operating characteristic (false positive rate on the x-axis, true positive rate on the y-axis) generated for one of the experiments that were run in this study.

2.2.6 Cross-entropy

Cross entropy is a loss function used in machine learning and classification problems (LeCun et al., 2015). A loss function, or cost function, is used to quantify how well a machine learning model performs its intended task; it is a quality-of-fit measure. In the context of classification, the loss function defines the penalty for an incorrect classification of an observation. By e.g. taking the sum over the complete training set, the operational performance of a model can be quantified as a real number. For a two-class classification problem the cross entropy is defined as:

C = −(1/N) Σ_{n=1}^{N} [ y_n ln ŷ_n + (1 − y_n) ln(1 − ŷ_n) ]   (2.1)

where N is the total number of observations, ŷ is the output of the machine learning model and y is the desired one-hot encoded output (Nielsen, 2015). The function is strictly positive, and when ŷ is close to y the penalty is close to zero.
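Equation (2.1) translates directly into numpy; the outputs below are invented for illustration.

```python
import numpy as np

y     = np.array([1, 0, 1, 1])           # desired outputs y_n
y_hat = np.array([0.9, 0.2, 0.8, 0.6])   # model outputs yhat_n, in (0, 1)

# Two-class cross entropy, eq. (2.1): the average penalty over N observations.
C = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(C)   # small when y_hat is close to y, growing without bound otherwise
```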
The cross-entropy is particularly suitable for neural networks since it solves the problem of learning slowdown (Nielsen, 2015). The cross-entropy can be recognised as the negative of the averaged log-likelihood over the data set, and minimising the loss is therefore equivalent to finding the maximum likelihood estimate (MLE).

2.3 Machine learning models

2.3.1 Decision trees

Tree-based learning algorithms are common in supervised learning; a good introduction to tree-based methods is given in Bishop (2006). A decision tree is essentially a set of splitting rules that can be summarised in a tree structure. The different split points are called nodes and at each node a decision rule is applied. A decision rule could for example be "Is the height of the object greater than 2 meters?". Starting from what is called the root node the tree is built top down. The training data is used to construct the tree: at each decision point the data is branched, dividing the data set into smaller and smaller subsets. To reach a decision for a new observation, decisions are applied repeatedly until a terminal node is reached. The intermediate nodes are often referred to as internal nodes and the terminal nodes are called leaves. The class is determined empirically, by the classes of the training data points that ended up in the same leaf. Decision trees are practical since they, unlike many other classification algorithms, can handle both numerical and categorical data. The splitting rules are found by searching through the space of all possible splitting attributes. The approach is greedy in the sense that at each node the best splitting attribute is chosen; in other words, attributes are not selected with the best final model in mind. The aim is to partition the data into subsets that, at least to some extent, contain similar data points. Ideally the data is split into homogeneous subsets.
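As a minimal illustration of the above, a decision tree can be written as nested splitting rules; the features ("height", "weight"), thresholds and class names here are invented for the example:

```python
# A tiny decision tree: internal nodes carry a splitting rule, leaves carry a class.
tree = {
    "rule": lambda s: s["height_m"] > 2.0,   # root node decision
    "yes": {"leaf": "class A"},
    "no": {
        "rule": lambda s: s["weight_kg"] > 50.0,
        "yes": {"leaf": "class A"},
        "no": {"leaf": "class B"},
    },
}

def classify(node, sample):
    """Keep applying decisions until a terminal node (leaf) is reached."""
    while "leaf" not in node:
        node = node["yes"] if node["rule"](sample) else node["no"]
    return node["leaf"]

print(classify(tree, {"height_m": 1.5, "weight_kg": 40.0}))  # class B
```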
Different criteria to determine the best split can be used. A commonly used criterion is to choose the split that maximises the information gain. The information gain is a measure of the reduction of uncertainty after a split. To measure uncertainty, entropy is used. Formally the entropy of a state S is defined as:¹

H(S) = -\sum_{i=1}^{K} p(i) \log p(i) \quad (2.2)

where K is the number of branches and p the probability of each branch (Jaynes, 1957). From the definition it can be seen that a homogeneous sample has entropy zero and an equally divided sample has entropy one.

¹Entropy is a theoretical measure of the information content, or unpredictability, of a state S. Learning the outcome of a state with high entropy gives us more information than learning the outcome of a state with low entropy. We want to store as much information as possible in the decision trees, which is why the entropy measure is suitable.

Figure 2.3: A very simple decision tree. At each node (circle) a splitting rule is applied. For example, at the root node a feature that can be either "YES" or "NO" determines the split.

2.3.2 Random forest

Random forest is an ensemble learning method, first proposed by Ho (1995). Multiple weak learners are combined to form a final model. In the case of a random forest the weak learners are decision trees; a weak learner is simply a model that performs better than chance. A random forest is essentially a forest of decision trees trained on randomly sampled training data. The main idea is that by combining multiple decision trees better predictive performance can be obtained than from any of the decision trees alone. The random forest algorithm can be used both for classification and regression tasks. The random forest training algorithm is based on bagging, which is short for bootstrap aggregating. Each decision tree is trained on a data set randomly sampled from the training data with replacement.
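The bootstrap-and-vote procedure can be sketched as below. To keep the example self-contained, the weak learner is a hand-rolled decision stump on a made-up one-feature data set, not the tree implementation used in the study:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_sample(X, y, rng):
    """Draw len(X) observations with replacement, so each tree sees a
    different sample and the same row can appear more than once."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def fit_stump(X, y):
    """Toy stand-in for a weak learner: a one-split decision stump on feature 0."""
    threshold = X[:, 0].mean()
    left, right = y[X[:, 0] <= threshold], y[X[:, 0] > threshold]
    left_class = int(round(left.mean())) if len(left) else 0
    right_class = int(round(right.mean())) if len(right) else 1
    return lambda x: left_class if x[0] <= threshold else right_class

# Bagging: train each stump on its own bootstrap sample, predict by majority vote.
X = np.array([[0.1], [0.2], [0.3], [1.1], [1.2], [1.3]])
y = np.array([0, 0, 0, 1, 1, 1])
forest = [fit_stump(*bootstrap_sample(X, y, rng)) for _ in range(25)]
votes = [stump(np.array([1.25])) for stump in forest]
print(round(np.mean(votes)))  # majority vote for a point deep in class-1 territory
```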
Such an approach means that each tree learner is shown a different subset of the training data and that the same observation can be chosen more than once in a sample (Breiman, 1996). Typically, deep decision trees have very high variance and low bias, which means that they tend to overfit. However, the variance can be reduced by including many trees trained on different data sets in the model. This is what is done when building a random forest model. Each decision tree is trained on a subset of the data points in the training set and the final prediction is decided by a vote. This way the variance is decreased at the expense of a small increase in bias. The performance of an ensemble is typically improved if the ensemble is built up by a diverse and uncorrelated set of weak learners; the submodels should in a sense complement each other. Several identical models are no better than one, and in the random forest algorithm a diverse ensemble is achieved by showing different training sets to the different submodels (or decision trees). Further diversification is achieved by introducing randomness into the node splitting: in contrast to the previously described approach for decision trees, only a random subset of the features is considered at each node when searching for the best split.

2.3.3 Gradient boosting

Gradient boosting is another very powerful classification algorithm. The idea behind the algorithm is to sequentially construct models that improve performance on data points where previous models performed badly (Friedman, 2001). The shortcomings of previous models are found using gradient descent. The residuals y - F_m correspond to the negative gradient of the squared loss function \frac{1}{2}(y - F)^2, where F is a weak learner and y the actual output. Just like random forest, the gradient boosting algorithm is an ensemble method. However, the model building process is very different.
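The residual-fitting idea can be illustrated on a toy regression problem. For brevity, the weak learner here is a deliberately crude constant predictor rather than a tree:

```python
import numpy as np

# Toy 1-D regression target.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 3.0, 4.0])

def fit_weak_learner(residuals):
    """A deliberately crude weak learner: predict the residual mean everywhere."""
    c = residuals.mean()
    return lambda X: np.full(len(X), c)

F = np.zeros(len(X))             # F_0: the initial model predicts 0
learning_rate = 0.5
for m in range(20):
    residuals = y - F            # negative gradient of the squared loss 1/2 (y - F)^2
    h = fit_weak_learner(residuals)
    F = F + learning_rate * h(X)   # add the new model to the weighted sum

print(np.round(F, 3))  # a constant learner can only reach the mean of y, here 2.5
```

With regression trees as weak learners, later models can also correct errors that vary across the input space, rather than only the overall level as in this sketch.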
Random forest can be seen as a parallel method where several weak learners (decision trees) are trained independently on different subsets of the training data; gradient boosting, on the other hand, is sequential and the final model is formed by iteratively adding models. Starting from a weak learner and some loss function, the final model is built incrementally by forming a weighted sum of models, where the models built later in the series compensate for previous models' shortcomings.

Figure 2.4: A schematic figure of the model building process. Weak learners compensating for previous models' shortcomings are sequentially summed to finally form an ensemble of models.

2.3.4 Artificial neural networks

An artificial neural network is a computational model loosely inspired by biological nervous systems. It is basically a network structure built up by a large number of interconnected processing units. Artificial neural networks are often simply referred to as neural networks. A good overview of neural networks was presented by LeCun et al. (2015), and Nielsen (2015) has pedagogically described the technique in detail. The neurons or nodes of the network can simply be seen as information-processing units. In neural networks these processing units are typically arranged in layers. Signals are passed to the input layer and then travel through the network to the output layer. Layers between the input and output layers are called hidden layers. Neural networks can be seen as very complex functions that for a given input X generate an output Y. The parameters of the function are called weights. The number of layers defines the depth of the network and the number of nodes in a layer is the width of that layer. Each interconnection carries a weight, and each node has an activation function that maps the weighted input signal to an output activation.
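The computation performed by a single unit can be sketched as follows; the weights and inputs are arbitrary illustrative numbers:

```python
import numpy as np

def neuron_output(x, w, b, activation):
    """One unit: form the weighted sum a = w·x + b (cf. Eq. 2.3),
    then apply the activation function to it."""
    a = np.dot(w, x) + b
    return activation(a)

# Hypothetical weights and inputs for a single unit with three inputs.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, 0.2])
b = 0.05
print(neuron_output(x, w, b, activation=np.tanh))
```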
One of the most important features of a neural network is its non-linearity (Bishop, 2006). Neural networks can approximate non-linear relations that typically cannot be reproduced by linear models. The non-linearity is introduced by non-linear activation functions. Given a set of inputs x_1, x_2, ..., x_N a weighted sum of the inputs is formed as

a_j = \sum_{i}^{N} w_{i,j} \cdot x_i + b_j \quad (2.3)

where w_{i,j} are the network weights, x_i the inputs and b_j the bias term. Here a_j is the activation, which is the signal that activates the neuron. The output is computed by applying the activation function to the activation.

2.3.4.1 Feed forward neural networks

The feed forward neural network (FFNN) is the simplest type. The network has only forward-pointing connections and the nodes are structured in layers. In Figure 2.5 a schematic drawing of interconnected nodes in a FFNN is shown.

Figure 2.5: A schematic drawing of a feed forward neural network. Two fully connected layers x and y are displayed. In the drawing w denotes the network weights.

2.3.4.2 Activation functions

The activation function is what makes neural networks non-linear classifiers. Until recently the sigmoid function

\sigma(x) = \frac{1}{1 + e^{-x}}, \quad (2.4)

and the hyperbolic tangent function

f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1, \quad (2.5)

have been the most commonly used activation functions. The most important difference between these two functions is the output range. As can be seen in Figure 2.6 the sigmoid function returns a value in the range [0,1] and the hyperbolic tangent function a value in the range [-1,1]. Lately, another activation function called the rectifier has become the most used. The rectifier is in the context of neural networks defined by

g(z) = \max\{0, z\} \quad (2.6)

and has range [0,∞) (Hahnloser et al., 2000). A unit in a neural network implementing the rectifier is called a rectified linear unit (ReLU) and for this reason the activation type is often referred to as simply ReLU.
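The three activation functions translate directly into code; a small sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))              # range (0, 1), Eq. (2.4)

def tanh(x):
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0  # range (-1, 1), Eq. (2.5)

def relu(x):
    return np.maximum(0.0, x)                    # range [0, inf), Eq. (2.6)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))
print(tanh(x))
print(relu(x))  # [0. 0. 2.]
```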
In Figure 2.6 the three mentioned activation functions are drawn in the interval [-5,5].

Figure 2.6: The graph shows the sigmoid, hyperbolic tangent and the rectifier drawn in the interval [-5,5].

The major advantage of the rectifier compared to both the sigmoid and hyperbolic tangent functions is that it overcomes the problem of vanishing gradients. The gradients of those functions quickly vanish as |x| grows; the gradient of the rectifier does not:

\frac{d}{dx} g(x) = \begin{cases} 0 & x < 0 \\ 1 & x > 0 \end{cases} \quad (2.7)

2.3.4.3 Stochastic gradient descent

Training a neural network means finding the set of network weights that best reproduces a given set of data. There are several algorithms that can be used to solve this optimisation problem. Stochastic gradient descent is one of the most popular and successful techniques (Bottou, 1991). Gradient descent is an optimisation algorithm where an optimum is found by incrementally taking steps in the direction of the negative gradient of the cost function. One gradient descent step is described by:

w^{(\tau+1)} = w^{(\tau)} - \eta \nabla_w E(w^{(\tau)}, d), \quad (2.8)

where η denotes the step length (sometimes referred to as the learning rate), E the cost function, w in this case the set of weights, d the data used to calculate the cost and τ the current time-step. In the gradient descent algorithm the cost function is computed for the complete training set. Hence, the weights will only be updated once for every pass over the complete training set. When working with very large data sets such a method becomes very slow and inefficient. Stochastic gradient descent is a method that seeks to solve this problem. It is a stochastic approximation of the gradient descent method that is used to reduce the required number of mathematical operations to optimise the model. The method reduces the computational cost of each update by approximating the gradient from a randomly sampled subset of the data.
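The batch-wise stochastic update of Eq. (2.8) can be sketched on a toy least-squares problem; the data and hyperparameter values are illustrative, not those used in the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: recover w_true from exact linear data.
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
y = X @ w_true

def grad(w, Xb, yb):
    """Gradient of the mean squared error cost on one batch."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(2)                     # starting weights w^(0)
eta, n_epochs, n_batches = 0.1, 30, 10
for epoch in range(n_epochs):
    order = rng.permutation(len(X))                  # shuffle, then split into batches
    for batch in np.array_split(order, n_batches):   # one weight update per batch
        w = w - eta * grad(w, X[batch], y[batch])    # Eq. (2.8) on the batch d_j

print(np.round(w, 3))  # close to w_true = [2, -1]
```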
This way the weights are updated more frequently: instead of one update per epoch there is one update per batch (a subset of examples taken from the data set). This efficiency is fundamental in large-scale machine learning problems. The algorithm can be summarised as:

Algorithm 1. Stochastic gradient descent
1. Input to algorithm: data set D, cost function E, starting weights w^{(0)}, learning rate η, max number of epochs I, number of batches J
2. Set τ ← 0
3. For each epoch i of a total of I:
   (a) Randomly divide the training data D into J batches: d_1, d_2, ..., d_J
   (b) For each batch of data d_j:
       i. Calculate the cost function E(w^{(τ)}, d_j)
       ii. Update the weights according to w^{(τ+1)} ← w^{(τ)} − η∇_w E(w^{(τ)}, d_j)
       iii. Update the time step τ ← τ + 1
4. Return the trained neural network with parameters w^{(τ)}

2.3.4.4 Weight initialisation

Before starting the training process the set of weights needs to be given initial values. There are several different methods for this, but what almost all have in common is that the weights are initialised at random. It is very important to break the symmetry, which is done effectively by assigning initial values at random. To illustrate the importance of the initialisation, imagine that all weights are initialised to the same value. At the first hidden layer the signal is

a_j = \sum_{i}^{N} w_{i,j} \cdot x_i + b_j \quad (2.9)

If all weights w_{i,j} have the same value, then all units will get the exact same input signal. The activation will be the same for all units independent of the input, and a network where all units are the same will not be able to learn anything. A common method is to draw the weights from a normal distribution with zero mean and a standard deviation of 0.1 or 0.01. The final result should not depend on the starting values; however, the time it takes to reach convergence may be affected. The biases are often initialised to 0 or a small value.²

²For more advanced neural networks Glorot and Bengio (2010) have suggested a widely used method for weight initialisation.

2.3.4.5 Regularisation

Neural networks can consist of a huge number of parameters.
This imposes the danger that the neural network uses these parameters to overfit the data, as discussed in Section 2.1.3. In that section we discussed how to detect overfitting; in this section we briefly present two techniques that are widely used to reduce the risk of overfitting.

The first one is called weight decay (Bishop, 2006). To the cost function E(w^{(τ)}, d) we add a second term, called a regularisation term, E_{l2} = \frac{\beta}{2} (w^{(\tau)})^2, where β is a constant that can be decided by the user. This cost term penalises large values of the weights w^{(τ)}. In total the cost function now becomes E_T = E(w^{(\tau)}, d) + \frac{\beta}{2} (w^{(\tau)})^2. It is this total cost that is differentiated in the backpropagation step (described in Section 2.3.4.3).

A second method used to avoid overfitting is dropout (Hinton et al., 2012). Dropout is applied to layers in neural networks. For each training step all the nodes in a layer are multiplied by a random variable D_p (drawn independently for each node) that is defined as:

D_p = \begin{cases} 1/p & \text{with probability } p, \\ 0 & \text{with probability } 1 - p. \end{cases} \quad (2.10)

Here p is the probability of keeping each node in the network. Both these methods prevent any particular node in the network from having too much impact on the prediction. Weight decay does this by preventing any of the weights from becoming too large, and dropout achieves the same result by preventing the output from relying on single inputs. Thus the network distributes the responsibility for the prediction between several nodes, which prevents the network from fitting the noise of the input.

2.3.4.6 Oversampling

Text-book examples of neural networks often make many assumptions about the data that do not hold in most real-world applications. One such assumption is that there are more or less equally many examples in the two classes.
For the data set that was used in this thesis this was not the case. This particular data set contains many more negative samples (companies that survive the time period) than positive examples (companies that go bankrupt), which means that each batch fed to the algorithm (as described in Section 2.3.4.3) contains more negative than positive examples. One consequence of this can be that the model only learns the distribution of the two classes in the data set, and not the patterns in the data which it is supposed to learn. To avoid this, one technique that can be used is oversampling of the rare class, which means that at each training time step e.g. equally many examples of the two classes are shown to the classifier. This makes the algorithm focus on the patterns in the data without being distracted by the uneven class distribution.

2.4 Data pre-processing

An important part of machine learning is data pre-processing. It can in many cases be an even more challenging task than building the machine learning model itself. In this thesis two data pre-processing techniques were used: normalisation of the data and missing value imputation.

2.4.1 Normalising

Normalisation of the data is done by replacing each feature value x_i by y_i calculated as:

y_i = \frac{x_i - \bar{x}}{s}, \quad (2.11)

where \bar{x} and s are the estimated mean and standard deviation computed on the training data set.

2.4.2 Missing value imputation

Few real-world data sets are complete; most contain missing data points. Missing values are problematic, and how best to handle them could easily be the topic of a thesis in itself. In machine learning the most important consideration when dealing with missing values is that we do not want to lose any important information. The imputation can therefore typically be divided into two separate processes. Firstly, a method is needed to replace the missing value with a valid value (Donders et al., 2006).
Otherwise the algorithm does not know how to handle the observations with missing values. Secondly, a process can optionally be used to give the algorithm information about which values were missing. For example, a new feature representing the number of features that were missing in an observation could be added. How we dealt with missing values in this study is discussed in Section 4.1.2.

3 Data

The data set used in this study was extracted and first used by Zięba et al. (2016)¹. In this chapter we describe this data set.

¹The data is available online at: https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data

3.1 Data set description

The data set used in this study consists of financial information about Polish companies in the manufacturing sector. The data set contains information about both bankrupt companies and still operating ones. Many Polish companies in the manufacturing sector went bankrupt after 2004 (Zięba et al., 2016). The financial information was extracted from the database Emerging Markets Information Service (EMIS) (Zięba et al., 2016). The financial indicators describing the health of the bankrupt companies were collected during 2007-2013 and the information about the still operating companies was gathered between 2000 and 2012. For simplicity, we will refer to the companies that went bankrupt as belonging to Class 1 and the surviving companies as belonging to Class 0. The data is divided into five different subsets, which are described in Table 3.1. The subsets are created to allow for different lengths of the forecasting period. The task in each of them is to predict whether or not a company goes bankrupt within five, four, three, two and one years respectively, based on information that can be retrieved from 64 different financial indicators (regressor variables or features). All five data sets are, as is to be expected also in real life, heavily imbalanced: there are far fewer bankrupt companies than still operating ones.
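This imbalance is exactly what the oversampling described in Section 2.3.4.6 addresses. A minimal sketch, with a made-up 95/5 class split standing in for the real data:

```python
import numpy as np

rng = np.random.default_rng(0)

def oversampled_batch(X, y, batch_size, rng):
    """Draw a batch with equally many examples from each class,
    regardless of how imbalanced the full data set is."""
    idx_pos = np.flatnonzero(y == 1)
    idx_neg = np.flatnonzero(y == 0)
    half = batch_size // 2
    idx = np.concatenate([rng.choice(idx_pos, half),   # rare class, with replacement
                          rng.choice(idx_neg, half)])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Hypothetical imbalanced data: 95% of companies survive (class 0).
y = np.array([0] * 95 + [1] * 5)
X = np.arange(len(y)).reshape(-1, 1)
Xb, yb = oversampled_batch(X, y, batch_size=32, rng=rng)
print(yb.mean())  # 0.5: the batch is balanced
```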
The 64 financial indicators describe the financial status of the companies. Almost all of the features are financial ratios constructed from information found in the companies' financial statements. The use of ratios should mean that the predictors are not too heavily correlated with the size of the companies. A complete list of the features is found in Table 3.2.

Table 3.1: Description of the data pertaining to the five different classification tasks present in this data set.

Data set | Features from | Bankruptcy after | No. bankrupt | No. not bankrupt | Sum
1stYear | 1st year | 5 years | 271 | 6,756 | 7,027
2ndYear | 2nd year | 4 years | 400 | 9,773 | 10,173
3rdYear | 3rd year | 3 years | 495 | 10,008 | 10,503
4thYear | 4th year | 2 years | 515 | 9,277 | 9,792
5thYear | 5th year | 1 year | 410 | 5,500 | 5,910

3.2 Missing values

As in many real-world data sets, some observations have missing values for some of the variables. The missing value frequencies in the different tasks are relatively low, about 1-3%. As can be seen in Figure 3.1 it is also rare that observations have large numbers of missing values. Thus, at a first look it does not seem that much attention needs to be paid to the missing values. However, a closer look reveals some problematic facts, and below some important aspects of the data set are discussed. There is no discernible difference in the distribution of missing values between the different classification tasks. For this reason we will only present graphs and numbers from one classification task, namely task "1stYear". Firstly, there is a difference in how the missing values are distributed between the different features. The features X11, X21, X27 and X37 have the highest missing value frequencies.
These four features represent: X11, the sum of gross profit, extraordinary items and financial expenses, divided by total assets; X21, sales in the current year divided by sales in the previous year; X27, profit on operating activities divided by financial expenses; and X37, current assets minus inventories, divided by long-term liabilities. In Table 3.3 the missing value frequencies for these features are reported for each of the two classes. Figure 3.1 shows histograms of the number of missing values per feature for the remaining features. By combining the information in Table 3.3 and Figure 3.1 it can be seen that the missing value frequencies for X11, X21, X27 and X37 are much higher than for the rest of the features. The fact that there is a discrepancy in how the missing values are distributed between the two classes is another important aspect of the data set. In general missing values are more frequent for companies that go bankrupt (2.4%) than for companies that survive (1.3%). In Figure 3.2 we can see the difference in the distribution of the number of missing values per company. The shapes of the two histograms are fundamentally different: most still operating companies have zero or one missing value, whereas most of the companies that went bankrupt have one or two missing values.

Table 3.2: Description of the 64 features available in this data set.

X1: net profit / total assets
X2: total liabilities / total assets
X3: working capital / total assets
X4: current assets / short-term liabilities
X5: [(cash + short-term securities + receivables - short-term liabilities) / (operating expenses - depreciation)] * 365
X6: retained earnings / total assets
X7: EBIT / total assets
X8: book value of equity / total liabilities
X9: sales / total assets
X10: equity / total assets
X11: (gross profit + extraordinary items + financial expenses) / total assets
X12: gross profit / short-term liabilities
X13: (gross profit + depreciation) / sales
X14: (gross profit + interest) / total assets
X15: (total liabilities * 365) / (gross profit + depreciation)
X16: (gross profit + depreciation) / total liabilities
X17: total assets / total liabilities
X18: gross profit / total assets
X19: gross profit / sales
X20: (inventory * 365) / sales
X21: sales (n) / sales (n-1)
X22: profit on operating activities / total assets
X23: net profit / sales
X24: gross profit (in 3 years) / total assets
X25: (equity - share capital) / total assets
X26: (net profit + depreciation) / total liabilities
X27: profit on operating activities / financial expenses
X28: working capital / fixed assets
X29: logarithm of total assets
X30: (total liabilities - cash) / sales
X31: (gross profit + interest) / sales
X32: (current liabilities * 365) / cost of products sold
X33: operating expenses / short-term liabilities
X34: operating expenses / total liabilities
X35: profit on sales / total assets
X36: total sales / total assets
X37: (current assets - inventories) / long-term liabilities
X38: constant capital / total assets
X39: profit on sales / sales
X40: (current assets - inventory - receivables) / short-term liabilities
X41: total liabilities / ((profit on operating activities + depreciation) * (12/365))
X42: profit on operating activities / sales
X43: rotation receivables + inventory turnover in days
X44: (receivables * 365) / sales
X45: net profit / inventory
X46: (current assets - inventory) / short-term liabilities
X47: (inventory * 365) / cost of products sold
X48: EBITDA (profit on operating activities - depreciation) / total assets
X49: EBITDA (profit on operating activities - depreciation) / sales
X50: current assets / total liabilities
X51: short-term liabilities / total assets
X52: (short-term liabilities * 365) / cost of products sold
X53: equity / fixed assets
X54: constant capital / fixed assets
X55: working capital
X56: (sales - cost of products sold) / sales
X57: (current assets - inventory - short-term liabilities) / (sales - gross profit - depreciation)
X58: total costs / total sales
X59: long-term liabilities / equity
X60: sales / inventory
X61: sales / receivables
X62: (short-term liabilities * 365) / sales
X63: sales / short-term liabilities
X64: sales / fixed assets

Table 3.3: Share of missing values for four features, separated by class (companies that went bankrupt pertain to Class 1). The features are defined in Table 3.2.

Feature | Missing values (Class 0) | Missing values (Class 1)
X11 | 0.04% | 13.3%
X21 | 22.4% | 40.2%
X27 | 2.8% | 44.3%
X37 | 39.0% | 38.7%

Figure 3.1: Histograms showing the number of missing values per feature for the two classes. Features X11, X21, X27 and X37 have been excluded from the histogram. Class 0 and 1 signify companies that survived and went bankrupt respectively.
Overall, the distribution of missing values for companies that went bankrupt differs markedly from that of the companies that survived. In addition, from Table 3.3 it is clear that the shares of missing values for some features are significantly different between the two classes. Since there are such large differences in how the missing values are distributed between the two classes, we can conclude that how we leverage this information will vastly affect the performance of the classifier. The implications of the missing values are discussed in more detail in Section 5.2.

Figure 3.2: Histograms showing the distributions of missing values per company. The distributions have been separated by class, where Class 0 and 1 signify companies that survived and went bankrupt respectively. Companies with more than 10 missing values were very few and have been excluded from the histogram to improve readability.

4 Methods

This chapter describes the experimental design used to generate the results. Parts of the chapter are of a rather technical nature; it could therefore be helpful to read it together with Chapter 2.

4.1 Experimental design

The goal of this study was to investigate how different machine learning techniques can be used for corporate bankruptcy prediction. An experimental bench implemented in the programming language Python was designed. The most important parts of the code are available in Appendix A¹. The idea of the design was to share infrastructure such as data set splitting, data pre-processing and result generation between the classifiers that were experimented with. This approach let us easily compare different classifiers and reduced the required number of lines of code.

¹The code is also available online together with some instructions: https://github.com/bamattsson/neural-bankruptcy
In the following sub-sections the parts that were developed are described.

4.1.1 Splitting up the data

As described in Section 2.1.3 it is important to divide the data set into one training set and one test set to avoid overfitting. This data split was built in as a part of the experimental bench. The set-up allowed for both single-run experiments and cross-validation (described in Section 2.1.3.1). To quickly get feedback about the resulting performance when experimenting with different algorithms and hyperparameters (parameters that are set before the training process is initiated), single runs were mostly used. However, when generating the final results, cross-validation was used to make the results more robust and comparable with Zięba et al. (2016).

4.1.2 Data pre-processing

In Section 2.4 some data pre-processing techniques are described. The effect of these techniques was explored and analysed in order to improve the predictive performance of the algorithms. The multilayer perceptron worked better with normalised data, so the data was always normalised before being passed to that algorithm. Normalising the data did not have a big effect on the performance of the random forest and gradient boosting algorithms; however, we still normalised the data before passing it to these algorithms as well. Missing values proved to be important and their effects were examined by creating the possibility to add extra synthetic features describing the nature of the missing values. Adding new features describing the missing values proved helpful and was not examined in the study carried out by Zięba et al. (2016). Two strategies were investigated.
The first strategy was to add an extra feature representing the total number of missing values in an observation, and the second strategy was to add dummy variables for each of the features, with the value 1 if the corresponding feature was missing and 0 otherwise.

4.1.3 Classifiers

Random forest and gradient boosting classifiers from the Scikit-learn package (Pedregosa et al., 2011) were used; see Sections 2.3.2 and 2.3.3 for an introduction to these algorithms. A wrapper was built around the algorithms to make them fit well into the experimental bench. These two algorithms were used to verify the implementation and ensure that results comparable to the ones presented by Zięba et al. (2016) could be generated. From the comparison it could be concluded that the experimental bench was designed and implemented properly and did not contain any obvious flaws or bugs. The multilayer perceptron was implemented from scratch using the neural-network library TensorFlow (Abadi et al., 2015). It was designed so that it could take advantage of the techniques discussed in Section 2.3.4.

4.2 Hyperparameter exploration

Hyperparameter values typically have a large influence on the performance and behaviour of machine learning algorithms. Therefore the hyperparameters of the implemented algorithms were tuned to some extent. For the random forest algorithm 100 trees were used, the quality of each split was measured by the information gain, and no bootstrap samples were used. For the gradient boosting classifier default parameter values were used: 100 trees, with cross-entropy as the objective function. The multilayer perceptron was implemented in TensorFlow. The most successful implementation used two hidden layers of dimension 100 and 50, respectively. As activation function on the hidden nodes the rectified linear unit was used.
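The two missing-value strategies described in Section 4.1.2 can be sketched as follows; the tiny matrix and the simple column-mean imputation are illustrative stand-ins, not the exact pipeline used in the experiments:

```python
import numpy as np

# Hypothetical feature matrix with missing values encoded as NaN.
X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0]])

missing = np.isnan(X)

# Strategy 1: one extra column with the number of missing values per observation.
n_missing = missing.sum(axis=1, keepdims=True)

# Strategy 2: one dummy column per feature, 1 where the value was missing.
dummies = missing.astype(float)

# Impute (here simply with column means) and append the synthetic features.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(missing, col_means, X)
X_augmented = np.hstack([X_imputed, n_missing, dummies])
print(X_augmented.shape)  # (3, 7): 3 original features + 1 count + 3 dummies
```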
Regularisation was key to making the model work well. Dropout with a probability of 50% was applied to all hidden units, and in addition weight decay (an l2-norm penalty on all weights) with a regularisation constant of 0.001 was used. All three algorithms performed best when extra feature columns containing dummy variables describing the missing values were added to the data set.

4.2.1 Evaluation of predictive performance

After the classifier had been trained on the training set, the test set was used to evaluate its performance. On the test set the receiver operating characteristic (ROC) curve as well as the area under the curve (AUC) measure were calculated. This was done for each cross-validation sub-sample. When using cross-validation, the individual AUC values obtained were not reported, but rather their average and the standard error of the mean.

5 Results and discussion

This chapter presents the findings of the project. Firstly, the performance of the three different classification algorithms is compared. Next, the performance of the neural network is discussed in more detail. This network significantly outperforms the multilayer perceptron used by Zięba et al. (2016). Thereafter, an analysis of the importance of missing values in the data set is presented. It is shown that surprisingly good performance can be obtained from data containing only information about the missing values of each observation. Lastly, the usefulness and implications of our findings are discussed.

5.1 Algorithm comparison

Table 5.1 shows a comparison of the three algorithms that were used in this study. Each row corresponds to one algorithm and each column to one prediction task; the values shown are the mean and standard deviation of the AUC score. We can see that the gradient boosting algorithm (GB) performed best, followed by the random forest algorithm (RF) and the multilayer perceptron (MLP).
They all achieve fairly similar results; the largest difference is for "2ndYear", where the average score of the gradient boosting algorithm is 0.07 higher than that of the multilayer perceptron. The ranking of the algorithms matches what Zięba et al. (2016) reported, although they report significantly larger performance differences between the algorithms. The reason why we observe smaller differences is discussed in Section 5.1.1.

Table 5.1: AUC results generated from the three different classification methods that have been tested. The algorithms shown are multilayer perceptron (MLP), random forest (RF) and gradient boosting (GB). Values are mean (std).

Algorithm  1stYear      2ndYear      3rdYear      4thYear      5thYear
MLP        0.92 (0.03)  0.83 (0.02)  0.87 (0.04)  0.89 (0.03)  0.91 (0.02)
RF         0.92 (0.04)  0.88 (0.04)  0.89 (0.02)  0.91 (0.02)  0.93 (0.02)
GB         0.94 (0.03)  0.90 (0.02)  0.91 (0.02)  0.92 (0.02)  0.95 (0.01)

It is not surprising that the gradient boosting algorithm comes out on top, as this algorithm typically performs well on classification tasks similar to the ones in this study. Decision-tree-based algorithms are invariant to scaling of the inputs, and therefore such algorithms work well even when the values of the features vary significantly in size, as in this data set. The gradient boosting algorithm achieves comparable or slightly worse results than the gradient boosting algorithm used by Zięba et al. (2016) on all five classification tasks. The random forest algorithm performs slightly worse than the gradient boosting algorithm, but better than what was reported for the same algorithm by Zięba et al. (2016). Overall we can see that there indeed is a difference between how the gradient boosting and random forest algorithms perform on this type of classification task.
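The mean (std) entries in Table 5.1 come from aggregating the per-fold AUC scores produced by the cross-validation described in Section 4.2.1. A minimal sketch of that aggregation, using made-up fold scores for illustration:

```python
import numpy as np

# AUC scores from, e.g., a 10-fold cross-validation run
# (the values here are made up for illustration)
fold_aucs = np.array([0.93, 0.95, 0.94, 0.92, 0.96,
                      0.94, 0.95, 0.93, 0.94, 0.96])

mean_auc = fold_aucs.mean()
std_auc = fold_aucs.std()                    # spread across folds
sem_auc = std_auc / np.sqrt(len(fold_aucs))  # standard error of the mean

# This is the "mean (std)" format used in Table 5.1
print('AUC = {:.2f} ({:.2f})'.format(mean_auc, std_auc))
```

Reporting the standard error of the mean alongside the fold standard deviation gives a rough sense of how precisely the average AUC is estimated.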
5.1.1 Performance of multilayer perceptron

As we can see in Table 5.1, the performance of the multilayer perceptron was consistently worse than that of the two other methods. The difference was, however, considerably smaller than what was reported by Zięba et al. (2016). The improvement of the multilayer perceptron is between 0.20 and 0.38. It should of course be pointed out that their multilayer perceptron uses only one hidden layer, whereas ours uses two. However, this is far from enough to explain such a big discrepancy in performance.

Generally, good results are not achieved with out-of-the-box implementations of machine learning algorithms. Some algorithms, like random forest and gradient boosting, tend to be more robust, but neural networks are known to be difficult to tune to good performance. Among other things, they require special care when pre-processing the data and often extensive hyperparameter tuning. This fact makes comparisons between different algorithms hard; the performance is very much influenced by how much time was spent optimising the algorithm. Zięba et al. (2016) achieve their best performance with the most sophisticated algorithm implementation, a highly optimised and extended version of the gradient boosting algorithm. Most of the other algorithms, on the other hand, are out-of-the-box implementations taken directly from the Weka data mining tool1, which makes the comparison not completely fair. In this thesis we have instead spent most time fine-tuning the neural network and used out-of-the-box implementations of the gradient boosting and random forest algorithms. Consequently, this led us to report results that are rather different from those reported by Zięba et al. (2016). One conclusion is therefore that one has to be extremely careful when interpreting comparisons between different machine learning algorithms, since the time spent on optimising the performance cannot be neglected.

1http://www.cs.waikato.ac.nz/ml/weka

However, in the end, the same conclusion as reported by Zięba et al. (2016) persists: tree-based algorithms tend to outperform neural networks on this classification task. This is in line with the experience of the rest of the research community. Neural networks are good at feature extraction from unstructured data, but when the data is already structured into informative features, the best bet is typically a tree-based algorithm.

5.2 Importance of missing values

As we saw in Section 3, there is a large discrepancy in the distribution of missing values between the two classes. In other words, information about the missing values should be very helpful in the classification. We wanted to explore the effect of the missing values more thoroughly than was done by Zięba et al. (2016). By imputing missing values as very large or very small values, i.e. placing them far away from all the other values, tree-based methods can easily leverage this information. It is more difficult for the multilayer perceptron to leverage this information; therefore we created new synthetic features as described in Section 4.1.2.

Table 5.2 shows how the results differ for the multilayer perceptron when no such synthetic features are added (None), when the total number of missing values is added as a synthetic feature (Sum), and when dummy variables for each possible missing value are added as synthetic features (1-hot). In the table we can see that the neural network performs much better when it can base its predictions on the synthetic features. For the "1stYear" task the neural network reaches an average score of 0.77 when no synthetic features are added, which can be compared to an average score of 0.92 when the dummy variable encoding of the missing values is added.
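The two synthetic-feature strategies from Section 4.1.2, together with the far-away imputation mentioned above, can be sketched as follows. This is a minimal illustration with made-up numbers, not the thesis code; the choice of 1e9 as the "far away" fill value is an assumption for the example.

```python
import numpy as np

# Toy feature matrix with missing values (NaN); the thesis uses 64
# financial ratios, here just 3 illustrative columns
X = np.array([[1.2, np.nan, 0.5],
              [np.nan, np.nan, 0.7],
              [0.3, 2.1, 0.9]])

missing = np.isnan(X)

# Strategy 1 ("Sum"): one feature counting missing values per observation
sum_feature = missing.sum(axis=1, keepdims=True)

# Strategy 2 ("1-hot"): one dummy per original feature, 1 if missing
dummies = missing.astype(float)

# Impute far away from the observed range so that tree-based models can
# split the missing values off easily, then append the synthetic columns
X_imputed = np.where(missing, 1e9, X)
X_augmented = np.hstack([X_imputed, dummies])
```

Dropping all original columns and keeping only `dummies` (or `sum_feature`) reproduces the missing-values-only setting analysed below.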
We can thus conclude that much information about the class labels of the examples in the data set is contained in the synthetic features, i.e. in the missing values.

To further explore the importance of the missing values in the data set, we also tried to make predictions based only on information about the missing values. Before passing the data to the algorithm we threw away all feature columns except the ones which had been synthetically generated from the missing values. In other words, no financial information was used. Table 5.3 shows the results from this analysis with the random forest algorithm used as classifier. Even with only synthetic features we get good performance; with only information about the missing values, a good prediction of whether a company will go bankrupt can be made. Figure 5.1 shows the ROC curves of three different classifiers, trained with (1) all the data and one-hot encoding of missing values, (2) only one-hot encoding of missing values and (3) a single predictor describing the total number of missing values in each observation.

Table 5.2: AUC for the multilayer perceptron with different synthetic features created from the missing values. None signifies no synthetic features, Sum corresponds to one synthetic feature representing the total number of missing values and 1-hot to a dummy variable representation of the missing values. Values are mean (std).

Synthetic feature  1stYear      2ndYear      3rdYear      4thYear      5thYear
None               0.77 (0.02)  0.72 (0.02)  0.78 (0.03)  0.77 (0.02)  0.84 (0.03)
Sum                0.83 (0.03)  0.76 (0.02)  0.82 (0.03)  0.83 (0.03)  0.87 (0.02)
1-hot              0.92 (0.03)  0.83 (0.02)  0.87 (0.04)  0.89 (0.03)  0.91 (0.02)

Table 5.3: Comparison between the performance of the random forest algorithm depending on which features were used. Standard & 1-hot means that both the original features and the one-hot encoding of missing values were used, 1-hot signifies that only the one-hot encoding of missing values was provided and Sum means that only the total number of missing values in the observation was used as feature. Values are mean (std).

Features          1stYear      2ndYear      3rdYear      4thYear      5thYear
Standard & 1-hot  0.92 (0.04)  0.88 (0.04)  0.89 (0.02)  0.91 (0.02)  0.93 (0.02)
Only 1-hot        0.85 (0.06)  0.73 (0.06)  0.74 (0.05)  0.76 (0.05)  0.76 (0.03)
Only sum          0.72 (0.05)  0.64 (0.06)  0.65 (0.04)  0.67 (0.03)  0.69 (0.04)

The ROC curves (from which the AUC scores are calculated) paint a similar picture: the algorithm performs well even with access only to the synthetic features.

Figure 5.1: The figure shows the ROC curves for three different classifiers trained on (1) all the data and one-hot encoding of missing values, (2) only one-hot encoding of missing values and (3) a single predictor describing the total number of missing values in each observation.

The performance of the algorithm on data containing only information about which features were missing is alarming. The results suggest that one can determine the health of a company solely based on which financial information is available. To some extent this can be logical: the pressure of financial distress may cause firms to fail to meet reporting standards. If there is a relation between which information is available and the type of firm (small/medium/large etc.), a possible explanation would be that some types of firms are more likely to face bankruptcy than others. Such a relation would, however, need to be very strong to explain the performance on the data set containing only missing-value information. Another, more problematic, explanation that one inevitably comes to think of is the fact that the information about bankrupt and successful companies was collected differently (as described in Section 3). It is tempting to think that this might be the reason for the predictive power of the missing values. The missing values might indicate from where and when the data was collected and in this way give away information about whether or not the firm was successful. In such a case the classification problem is reduced to predicting where the data comes from rather than whether or not a bankruptcy is imminent. In any case, the missing values are not missing at random, and any imputation that removes features or fills in random values would destroy information.

5.3 Industry usage

As described in the first chapter, the predictive power of a model is not the only thing that is of interest to an economist. From a scientific point of view it is often more interesting if a model can "explain" underlying reasons for behaviours observed in the data. This is most often difficult to do with machine learning techniques. Neural networks in particular are very difficult to interpret. Tree-based methods are better for this purpose, since they can be used to analyse the importance of different features in the prediction, which can give valuable insights into the data.

From the more pragmatic point of view of a bank that lends money to companies, what is most interesting is the quality of the predictions. Rather than simply choosing the algorithm with the highest area under curve value, the bank cares about choosing an algorithm that can identify as many as possible of the companies that are likely to go bankrupt, so that they can avoid lending money to those companies.
However, they do not want the algorithm to give too many false warnings either, since they would then miss business opportunities. To facilitate this decision the receiver operating characteristic presented in Section 2.2.4 can be used. Figure 5.2 shows one of the receiver operating characteristic curves generated in this thesis, together with the probability threshold values at which the different points are reached. Let us assume that a bank planning to use this classifier wants to detect 90% of the companies that will go bankrupt. It will then have to use a probability threshold of around 30%; in other words, all companies to which the model assigns a bankruptcy probability above 30% will be flagged. We can also see from the graph that this leads to a false positive rate slightly higher than 0.2, which means that the model also flags about 20% of the companies that will actually survive.

Let us now instead assume that a public organisation that helps companies on the brink of bankruptcy wants to use this classifier. Since the organisation has limited resources, it cannot afford to spend time and money on companies that would survive even without its help. Therefore it would, instead of aiming for a high true positive rate, favour a low false positive rate. If at most 10% of the companies it helps should be ones that could survive without its help, we can see from Figure 5.2 that it would have to choose a probability threshold of around 60%. This results in a true positive rate of around 70%, i.e. it misses 30% of the companies that go bankrupt.

Figure 5.2: The figure shows the ROC curve of the multilayer perceptron on the "1stYear" data set.
It also shows at which probability threshold the different points on the curve are reached.

6 Conclusion

The intention of this study was to illustrate how machine learning can be exploited in the field of economics. To summarise the thesis: corporate bankruptcy prediction for Polish companies between 2000 and 2013 has been studied. Three different classification algorithms have been tested and compared. The predictive performance of the random forest and gradient boosting models is in line with the results that Zięba et al. (2016) previously reported, and the performance of our implemented neural network significantly outperforms the reported results for their neural network based model.

The predictive performance achieved on the data set is very good. One reason for the strong performance is the information contained in the missing values. Models based only on information about the missing values in the data, with the financial information left out, could be trained to surprisingly good performance. Possibly this could suggest a connection between financial reporting and the risk of bankruptcy; however, it could also be the case that the distribution of missing values between the successful and unsuccessful companies differs because the financial information was collected in two different ways.

Interesting future research topics in the development of algorithms for bankruptcy prediction could be to incorporate new types of data. For example, unstructured data, such as the texts of annual reports, could be of interest. For this purpose neural networks could add much value, as they are good at leveraging this type of data. Applying the described algorithms to another data set would also be of interest, to better understand the validity of the results, especially since the importance of missing values in the data set introduced by Zięba et al. (2016) casts doubt on its validity.
Bibliography

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.

Esteban Alfaro, Noelia García, Matías Gámez, and David Elizondo. Bankruptcy forecasting: An empirical comparison of AdaBoost and neural networks. Decision Support Systems, 45(1):110–122, 2008.

Edward I Altman. Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The Journal of Finance, 23(4):589–609, 1968.

Edward I Altman. A further empirical investigation of the bankruptcy cost question. The Journal of Finance, 39(4):1067–1089, 1984.

Dave Anderson and George McNeill. Artificial neural networks technology. Kaman Sciences Corporation, 258(6):1–83, 1992.

Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.

William H Beaver. Financial ratios as predictors of failure. Journal of Accounting Research, pages 71–111, 1966.

Victor M Becerra, Roberto KH Galvão, and Magda Abou-Seada. Neural and wavelet network models for financial distress classification. Data Mining and Knowledge Discovery, 11(1):35–55, 2005.

Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

Hendrik Blockeel and Jan Struyf. Efficient algorithms for decision tree cross-validation.
Journal of Machine Learning Research, 3(Dec):621–650, 2002.

Léon Bottou. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes, 91(8), 1991.

Xavier Bredart. Bankruptcy prediction model using neural networks. Accounting and Finance Research, 3(2):124, 2014.

Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

M Ali Choudhary and Adnan Haider. Neural network models for inflation forecasting: an appraisal. Applied Economics, 44(20):2631–2635, 2012.

Michael P Clements, Philip Hans Franses, and Norman R Swanson. Forecasting economic and financial time-series with non-linear models. International Journal of Forecasting, 20(2):169–183, 2004.

Ralph De Haas and Neeltje Van Horen. International shock transmission after the Lehman Brothers collapse: Evidence from syndicated lending. The American Economic Review, 102(3):231–237, 2012.

A Rogier T Donders, Geert JMG van der Heijden, Theo Stijnen, and Karel GM Moons. A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology, 59(10):1087–1091, 2006.

Stefan Engström. Kan nyckeltal påvisa framtida betalningsoförmåga? Balans, 3:36–45, 2002.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2001.

Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

Sanjeev Gupta and Sachin Kashyap. Forecasting inflation in G-7 countries: an application of artificial neural network. Foresight, 17(1):63–73, 2015.

Richard HR Hahnloser, Rahul Sarpeshkar, Misha A Mahowald, Rodney J Douglas, and H Sebastian Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit.
Nature, 405(6789):947–951, 2000.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Overview of supervised learning. In The Elements of Statistical Learning, pages 9–41. Springer, 2009.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012. URL http://arxiv.org/abs/1207.0580.

Tin Kam Ho. Random decision forests. In Proceedings of the Third International Conference on Document Analysis and Recognition, volume 1, pages 278–282. IEEE, 1995.

Edwin T Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

Chung-Ming Kuan and Halbert White. Artificial neural networks: an econometric perspective. Econometric Reviews, 13(1):1–91, 1994.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

Heikki Linden. Synthesis of research studies examining prediction of bankruptcy. 2015.

Paul D McNelis. Neural Networks in Finance: Gaining Predictive Edge in the Market. Academic Press, 2005.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Michael A Nielsen. Neural Networks and Deep Learning, 2015.

James A Ohlson. Financial ratios and the probabilistic prediction of bankruptcy. Journal of Accounting Research, pages 109–131, 1980.

P Patrick. A comparison of ratios of successful industrial enterprises with those of failed firms. Certified Public Accountant, 2:598–605, 1932.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B.
Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Junfei Qiu, Qihui Wu, Guoru Ding, Yuhua Xu, and Shuo Feng. A survey of machine learning for big data processing. EURASIP Journal on Advances in Signal Processing, 2016(1):1–16, 2016.

Jaime Salvador, Zoila Ruiz, and Jose Garcia-Rodriguez. Big data infrastructure: A survey. In International Work-Conference on the Interplay Between Natural and Artificial Computation, pages 249–258. Springer, 2017.

Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.

Jaymeen R Shah and Mirza B Murtaza. A neural network based clustering procedure for bankruptcy prediction. American Business Review, 18(2):80, 2000.

Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806–813, 2014.

Tobias Sing, Oliver Sander, Niko Beerenwinkel, and Thomas Lengauer. ROCR: visualizing classifier performance in R. Bioinformatics, 21(20):3940–3941, 2005.

Chih-Fong Tsai and Jhen-Wei Wu. Using neural network ensembles for bankruptcy prediction and credit scoring. Expert Systems with Applications, 34(4):2639–2649, 2008.

William Verkooijen. A neural network approach to long-run exchange rate prediction. Computational Economics, 9(1):51–65, 1996.

Ulrike Von Luxburg and Bernhard Schölkopf. Statistical learning theory: models, concepts, and results. arXiv preprint arXiv:0810.4752, 2008.

Mattias Wahde. Biologically Inspired Optimization Methods: An Introduction. WIT Press, 2008.

S Alex Yang, John R Birge, and Rodney P Parker. The supply chain effects of bankruptcy.
Management Science, 61(10):2320–2338, 2015.

Maciej Zięba, Sebastian K Tomczak, and Jakub M Tomczak. Ensemble boosted trees with synthetic features generation in application to bankruptcy prediction. Expert Systems with Applications, 58:93–101, 2016.

A Appendix: Programming code

Here we include the most important parts of the code used to generate the results in this thesis. The entire code, together with instructions on how to execute it, can be found online at https://github.com/bamattsson/neural-bankruptcy.

Listing A.1: run.py

import sys
import os
import time
import pickle

import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics, model_selection

from utils import split_dataset, load_dataset, load_yaml_and_save
from algorithms import (RandomGuessAlgorithm, RandomForestAlgorithm,
        GradientBoostingAlgorithm, MultilayerPerceptron)
from data_processors import Imputer, Processor


def main(yaml_path='./config.yml', run_name=None):
    # Create output directory where experiment is saved
    if run_name is None:
        run_name = time.strftime('%Y%m%d-%H%M', time.localtime())
    run_path = os.path.join('./output', run_name)
    if not os.path.exists(run_path):
        os.makedirs(run_path)
    config = load_yaml_and_save(yaml_path, run_path)

    # Do the specified experiments
    np.random.seed(config['experiment']['np_random_seed'])
    for year in config['experiment']['years']:
        do_experiment_for_one_year(run_path, year, config)

    # Show figures if any have been generated
    if config['analysis']['plot_roc']:
        plt.show()


def do_experiment_for_one_year(run_path, year, config):
    """Performs the specified experiments for one year."""
    X, Y = load_dataset(year, shuffle=config['experiment']['shuffle_data'])
    if config['experiment']['type'] == 'single':
        X_train, Y_train, X_test, Y_test = split_dataset(
                X, Y, config['experiment']['test_share'])
        results = perform_one_experiment(X_train, Y_train, X_test, Y_test,
                config)
    elif config['experiment']['type'] == 'cv':
        results = perform_cv_runs(X, Y, config)
    results_path = os.path.join(run_path,
            'results_year{}.pkl'.format(year))
    with open(results_path, 'wb') as f:
        pickle.dump(results, f)
    show_results(results, year, **config['analysis'])


def perform_one_experiment(X_train, Y_train, X_test, Y_test, config):
    """Performs one experiment with a given data set and generates results."""
    # Prepare data
    processor = Processor(**config['processor_params'])
    X_train = processor.fit_transform(X_train)
    X_test = processor.transform(X_test)
    imputer = Imputer(**config['imputer_params'])
    X_train = imputer.fit_transform(X_train)
    X_test = imputer.transform(X_test)

    # Creates the algorithm object
    algorithm_name = config['experiment']['algorithm']
    if algorithm_name == 'random_guess':
        algorithm = RandomGuessAlgorithm(**config['algo_params'])
    elif algorithm_name == 'rf':
        algorithm = RandomForestAlgorithm(**config['algo_params'])
    elif algorithm_name == 'multilayer_perceptron':
        algorithm = MultilayerPerceptron(n_input=X_train.shape[1],
                **config['algo_params'])
    elif algorithm_name == 'gradient_boosting':
        algorithm = GradientBoostingAlgorithm(**config['algo_params'])
    else:
        raise NotImplementedError('Algorithm {} is not an available '
                'option'.format(algorithm_name))

    # Perform experiment
    results = dict()
    results['fit_info'] = algorithm.fit(X_train, Y_train)
    pred_proba = algorithm.predict_proba(X_test)
    pred = np.argmax(pred_proba, axis=1)

    # Calculate and save results
    results['log_loss'] = metrics.log_loss(Y_test, pred_proba[:, 1])
    results['accuracy'] = metrics.accuracy_score(Y_test, pred)
    results['recall'] = metrics.recall_score(Y_test, pred, labels=[0, 1])
    results['precision'] = metrics.precision_score(Y_test, pred,
            labels=[0, 1])
    fpr, tpr, thresholds = metrics.roc_curve(Y_test, pred_proba[:, 1])
    results['roc_curve'] = {'fpr': fpr, 'tpr': tpr,
            'thresholds': thresholds}
    results['roc_auc'] = metrics.auc(fpr, tpr)
    results['classification_report'] = metrics.classification_report(
            Y_test, pred, labels=[0, 1])
    return results


def perform_cv_runs(X, Y, config):
    """Performs cv test for one year."""
    # Split up in cv and perform runs
    cv_splitter = model_selection.KFold(
            n_splits=config['experiment']['n_folds'])
    result_list = []
    for train_index, test_index in cv_splitter.split(X):
        X_train, X_test = X[train_index], X[test_index]
        Y_train, Y_test = Y[train_index], Y[test_index]
        results_one_experiment = perform_one_experiment(
                X_train, Y_train, X_test, Y_test, config)
        result_list.append(results_one_experiment)

    # Loop through all the generated results and save to one place
    results = dict()
    for metric in result_list[0]:
        # Roc curve comes in a dict and is treated separately
        if type(result_list[0][metric]) == dict:
            results[metric] = dict()
            for submetric in result_list[0][metric]:
                arrays = [result_from_cv[metric][submetric]
                        for result_from_cv in result_list]
                results[metric][submetric] = arrays
        # Float values are saved as a list of all values but also as
        # mean and std
        elif isinstance(result_list[0][metric], float):
            results[metric] = dict()
            results[metric]['values'] = np.array(
                    [result_from_cv[metric]
                     for result_from_cv in result_list])
            results[metric]['mean'] = results[metric]['values'].mean()
            results[metric]['std'] = results[metric]['values'].std()
        # Other values are saved in lists
        else:
            value = [result_from_cv[metric]
                    for result_from_cv in result_list]
            results[metric] = value
    return results


def show_results(results, year, print_results=[], plot_roc=False):
    """Print and plot the results that have been generated."""
    # Print results
    if len(print_results) > 0:
        print('\nResults for year {}:'.format(year))
        for metric in print_results:
            if type(results[metric]) == dict:
                print('{} = {:.2f}, std = {:.2f}'.format(metric,
                        results[metric]['mean'], results[metric]['std']))
            elif type(results[metric]) == str:
                print(results[metric])
            elif isinstance(results[metric], float):
                print('{} = {:.2f}'.format(metric, results[metric]))
            else:
                print('{} cannot be printed.'.format(metric))

    # Plot results
    if plot_roc:
        plt.figure(year)
        plt.title('roc curve year {}'.format(year))
        plt.plot((0, 1), (0, 1), ls='--', c='k')
        if type(results['roc_curve']['fpr']) == list:
            # A CV run with multiple arrays
            for fpr, tpr in zip(results['roc_curve']['fpr'],
                    results['roc_curve']['tpr']):
                plt.plot(fpr, tpr)
        else:
            # Not a CV run
            plt.plot(results['roc_curve']['fpr'],
                    results['roc_curve']['tpr'])
        plt.xlabel('False positive rate')
        plt.ylabel('True positive rate')


if __name__ == '__main__':
    try:
        yaml_path = sys.argv[1]
    except IndexError:
        print('You have to specify the config.yaml to use as '
              '`python run.py example_config.yaml`')
        print('Exiting.')
        sys.exit()
    main(yaml_path=yaml_path)
Listing A.2: algorithms/algorithm.py

from abc import ABCMeta, abstractmethod

import numpy as np


class Algorithm(metaclass=ABCMeta):
    """Abstract algorithm class."""

    @abstractmethod
    def fit(self, samples, labels):
        """
        Args:
            samples (np.ndarray): X data, shape (n_samples, n_features)
            labels (np.ndarray): y data, shape (n_samples)
        Returns:
            fit_info (dict): information from the fit for later analysis
        """
        return None

    @abstractmethod
    def predict_proba(self, samples):
        """
        Args:
            samples (np.ndarray): X data, shape (n_samples, n_features)
        Returns:
            proba (np.ndarray): Probability of belonging to a particular
                class, shape (n_samples, n_classes)
        """
        proba = np.zeros((samples.shape[0], 2))
        return proba

    def predict(self, samples):
        """
        Args:
            samples (np.ndarray): X data, shape (n_samples, n_features)
        Returns:
            predict (np.ndarray): Predicted class, shape (n_samples)
        """
        predict_proba = self.predict_proba(samples)
        predict = np.argmax(predict_proba, axis=1)
        return predict

Listing A.3: algorithms/multilayer_perceptron.py

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

from .algorithm import Algorithm
from utils import split_dataset


class MultilayerPerceptron(Algorithm):

    def __init__(self, n_input, n_hidden, dropout_keep_prob, l2_reg_factor,
                 dev_share, num_epochs, batch_size, batch_iterator_type,
                 evaluate_every_n_steps, plot_training, tf_seed):
        # Structure of model
        self.n_input = n_input
        self.n_hidden = n_hidden
        self.dropout_keep_prob = dropout_keep_prob
        self.l2_reg_factor = l2_reg_factor
        self.n_class = 2

        # Training parameters
        self.dev_share = dev_share
        self.num_epochs = num_epochs
        self.batch_size = batch_size
        self.batch_iterator_type = batch_iterator_type
        self.evaluate_every_n_steps = evaluate_every_n_steps
        self.plot_training = plot_training

        # Create TF graph and session
        tf.reset_default_graph()
        tf.set_random_seed(tf_seed)
        self.graph_nodes = self._get_graph()
        self.sess = tf.Session()
        self.sess.run([tf.global_variables_initializer()])
        # TODO: add early stopping with tf.train.Saver

    def fit(self, samples, labels):
        """Train the model with the samples and labels provided according
        to the parameters of the model.
        """
        # Split into train and dev
        x_train, y_train, x_dev, y_dev = split_dataset(
            samples, labels, self.dev_share)

        # Create batch iterator
        if self.batch_iterator_type == 'normal':
            batch_iter = _batch_iter
        elif self.batch_iterator_type == 'oversample':
            batch_iter = _oversampling_batch_iter
        else:
            raise ValueError('{} is not a valid batch_iterator_type'.format(
                self.batch_iterator_type))

        # Train model
        train_batch_nr = []
        train_loss_val = []
        dev_batch_nr = []
        dev_loss_val = []
        for i, (x, y) in enumerate(batch_iter(x_train, y_train,
                                              self.num_epochs,
                                              self.batch_size)):
            # Train
            feed_dict = {
                self.graph_nodes['x_input']: x,
                self.graph_nodes['y_input']: y,
                self.graph_nodes['dropout_keep_prob']: self.dropout_keep_prob}
            _, loss_val = self.sess.run(
                [self.graph_nodes['optimize'], self.graph_nodes['loss']],
                feed_dict=feed_dict)
            train_batch_nr.append(i)
            train_loss_val.append(loss_val)

            if i % self.evaluate_every_n_steps == 0:
                feed_dict = {
                    self.graph_nodes['x_input']: x_dev,
                    self.graph_nodes['y_input']: y_dev,
                    self.graph_nodes['dropout_keep_prob']: 1.}
                loss_val = self.sess.run(self.graph_nodes['loss'],
                                         feed_dict=feed_dict)
                dev_batch_nr.append(i)
                dev_loss_val.append(loss_val)

        if self.plot_training:
            plt.plot(train_batch_nr, train_loss_val)
            plt.plot(dev_batch_nr, dev_loss_val)
            plt.show()

    def predict_proba(self, samples):
        """Make probability predictions with the trained model."""
        # Small model -> no need to loop over the samples
        feed_dict = {
            self.graph_nodes['x_input']: samples,
            self.graph_nodes['dropout_keep_prob']: 1.}
        proba = self.sess.run(self.graph_nodes['proba'], feed_dict=feed_dict)
        return proba

    def _get_graph(self):
        # Create placeholders for input and dropout_prob
        x_input = tf.placeholder(tf.float32, shape=(None, self.n_input))
        y_input = tf.placeholder(tf.int32, shape=(None))
        dropout_keep_prob = tf.placeholder(tf.float32)

        # Variables
        # Build the fully connected layers
        neurons = x_input
        l2_norm = tf.constant(0.)
        for i in range(len(self.n_hidden) + 1):
            input_dim = self.n_input if i == 0 else self.n_hidden[i - 1]
            output_dim = self.n_class if i == len(self.n_hidden) \
                else self.n_hidden[i]
            layer_name = i + 1 if i < len(self.n_hidden) else 'out'

            # Create weights
            W = tf.Variable(
                tf.truncated_normal([input_dim, output_dim], stddev=0.1),
                name='W_{}_layer'.format(layer_name))
            b = tf.Variable(0.1 * np.ones(output_dim, dtype=np.float32),
                            name='b_{}_layer'.format(layer_name))
            l2_norm += tf.nn.l2_loss(W)

            # Connect nodes
            neurons = tf.add(tf.matmul(neurons, W), b)
            if i < len(self.n_hidden):  # True if not last (output) layer
                neurons = tf.nn.dropout(neurons, dropout_keep_prob)
                neurons = tf.nn.relu(neurons)  # TODO: make this optional

        logits = neurons
        proba = tf.nn.softmax(logits)

        # Loss and Accuracy
        loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=y_input, logits=logits))
        regularized_loss = loss + self.l2_reg_factor * l2_norm
        correct_predictions = tf.equal(
            tf.cast(tf.argmax(logits, 1), tf.int32), y_input)
        accuracy = tf.reduce_mean(tf.cast(correct_predictions, 'float'))

        # Train operation
        # TODO: add so that we could change learning rate?
        optimize = tf.train.AdamOptimizer().minimize(regularized_loss)

        # Save important nodes to dict and return
        graph = {
            'x_input': x_input,
            'y_input': y_input,
            'dropout_keep_prob': dropout_keep_prob,
            'proba': proba,
            'loss': loss,
            'accuracy': accuracy,
            'optimize': optimize}
        return graph


def _oversampling_batch_iter(samples, labels, num_epochs, batch_size):
    """Batch iterator that oversamples the rare class so that both classes
    become equally frequent.
    """
    pos_examples = (labels == 1)
    pos_samples, pos_labels = samples[pos_examples], labels[pos_examples]
    neg_examples = np.logical_not(pos_examples)
    neg_samples, neg_labels = samples[neg_examples], labels[neg_examples]

    neg_batch_size = np.floor(batch_size / 2)
    pos_batch_size = np.ceil(batch_size / 2)
    neg_batch_iter = _batch_iter(neg_samples, neg_labels, num_epochs,
                                 neg_batch_size)
    pos_batch_iter = _batch_iter(pos_samples, pos_labels, num_epochs,
                                 pos_batch_size)
    for neg_batch, pos_batch in zip(neg_batch_iter, pos_batch_iter):
        neg_batch_samples, neg_batch_labels = neg_batch
        pos_batch_samples, pos_batch_labels = pos_batch
        batch_samples = np.concatenate(
            (neg_batch_samples, pos_batch_samples), axis=0)
        batch_labels = np.concatenate(
            (neg_batch_labels, pos_batch_labels), axis=0)
        yield batch_samples, batch_labels


def _batch_iter(samples, labels, num_epochs, batch_size):
    """A batch iterator that generates batches from the data."""
    data_size = len(labels)
    batch_num = 0
    while batch_num * batch_size // data_size < num_epochs:
        start_index = int(batch_num * batch_size % data_size)
        end_index = int((batch_num + 1) * batch_size % data_size)
        if start_index < end_index:
            samples_batch = samples[start_index:end_index]
            labels_batch = labels[start_index:end_index]
        else:
            samples_batch = np.concatenate(
                (samples[start_index:], samples[:end_index]))
            labels_batch = np.concatenate(
                (labels[start_index:], labels[:end_index]))
        yield samples_batch, labels_batch
        batch_num += 1

Listing A.4: algorithms/gradient_boosting.py

from .algorithm import Algorithm
from sklearn.ensemble import GradientBoostingClassifier


class GradientBoostingAlgorithm(Algorithm):

    def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
                 subsample=1.0, criterion='friedman_mse', min_samples_split=2,
                 min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3,
                 min_impurity_split=1e-07, init=None, random_state=None,
                 max_features=None, verbose=0, max_leaf_nodes=None,
                 warm_start=False, presort='auto'):
        self.clf = GradientBoostingClassifier(
            loss=loss, learning_rate=learning_rate, n_estimators=n_estimators,
            subsample=subsample, criterion=criterion,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            min_weight_fraction_leaf=min_weight_fraction_leaf,
            max_depth=max_depth, min_impurity_split=min_impurity_split,
            init=init, random_state=random_state, max_features=max_features,
            verbose=verbose, max_leaf_nodes=max_leaf_nodes,
            warm_start=warm_start, presort=presort)

    def fit(self, samples, labels):
        """
        Args:
            samples (np.ndarray): X data, shape (n_samples, n_features)
            labels (np.ndarray): y data, shape (n_samples)
        Returns:
            fit_info (dict): information from the fit for later analysis
        """
        self.clf.fit(samples, labels)

    def predict_proba(self, samples):
        """
        Args:
            samples (np.ndarray): X data, shape (n_samples, n_features)
        Returns:
            proba (np.ndarray): Probability of belonging to a particular
                class, shape (n_samples, n_classes)
        """
        return self.clf.predict_proba(samples)

Listing A.5: algorithms/random_forest.py

from .algorithm import Algorithm
from sklearn.ensemble import RandomForestClassifier


class RandomForestAlgorithm(Algorithm):

    def __init__(self, n_estimators=10, criterion='gini', min_samples_split=2,
                 min_samples_leaf=1, min_weight_fraction_leaf=0.0,
                 warm_start=False, bootstrap=True, oob_score=False,
                 max_features='auto', max_depth=None, max_leaf_nodes=None,
                 min_impurity_split=1e-07, class_weight=None):
        self.clf = RandomForestClassifier(
            n_estimators=n_estimators, criterion=criterion,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            min_weight_fraction_leaf=min_weight_fraction_leaf,
            warm_start=warm_start, bootstrap=bootstrap, oob_score=oob_score,
            max_features=max_features, max_depth=max_depth,
            max_leaf_nodes=max_leaf_nodes,
            min_impurity_split=min_impurity_split, class_weight=class_weight)

    def fit(self, samples, labels):
        """
        Args:
            samples (np.ndarray): X data, shape (n_samples, n_features)
            labels (np.ndarray): y data, shape (n_samples)
        Returns:
            fit_info (dict): information from the fit for later analysis
        """
        self.clf.fit(samples, labels)

    def predict_proba(self, samples):
        """
        Args:
            samples (np.ndarray): X data, shape (n_samples, n_features)
        Returns:
            proba (np.ndarray): Probability of belonging to a particular
                class, shape (n_samples, n_classes)
        """
        return self.clf.predict_proba(samples)

Listing A.6: data_processors.py

from abc import ABCMeta, abstractmethod

import numpy as np


class DataProcessor(metaclass=ABCMeta):
    """Abstract data processor class."""

    @abstractmethod
    def fit(self, data):
        """Fits the internal DataProcessor values to the data."""
        raise NotImplementedError

    @abstractmethod
    def transform(self, data):
        """Transforms the data with the DataProcessor."""
        raise NotImplementedError

    def fit_transform(self, data):
        self.fit(data)
        return self.transform(data)


class Imputer(DataProcessor):

    def __init__(self, strategy, new_features=False, only_nan_data=False):
        """Initialize object.

        Args:
            strategy (str): strategy to follow when imputing.
                Available: 'mean', 'min'
            new_features (str): whether we should create new features
                depending on the information from the missing values.
                False creates no new features, 'sum' creates one new and
                '1-hot' creates multiple new features.
            only_nan_data (bool): if True, only the features created from
                the missing values are returned by transform.
        """
        self.strategy = strategy
        self.new_features = new_features
        self.only_nan_data = only_nan_data
        if self.new_features == False and self.only_nan_data == True:
            raise ValueError(
                '`new_features` equal to {} and `only_nan_data` equal to {} '
                'is not a valid parameter combination'.format(
                    self.new_features, self.only_nan_data))

    def fit(self, data):
        if self.strategy == 'mean':
            self.imputing_values = np.nanmean(data, axis=0)
        elif self.strategy == 'min':
            self.imputing_values = np.nanmin(data, axis=0)
        else:
            raise ValueError('{} is not a valid value for `strategy`'.format(
                self.strategy))
        if self.new_features == '1-hot':
            self.contains_nan = np.any(np.isnan(data), axis=0)

    def transform(self, data):
        # Add new features from nan values
        if not self.new_features:
            extra_features = np.zeros([len(data), 0])
        elif self.new_features == 'sum':
            extra_features = np.isnan(data).sum(axis=1)[:, None]
        elif self.new_features == '1-hot':
            extra_features = np.isnan(data)[:, self.contains_nan]
            extra_features = np.atleast_2d(extra_features)
        else:
            raise ValueError(
                '{} is not a valid value for `new_features`'.format(
                    self.new_features))

        # Imputes nan values
        data = np.copy(data)
        for i in range(len(data)):
            isnan = np.isnan(data[i])
            data[i, isnan] = self.imputing_values[isnan]

        if self.only_nan_data:
            out_data = extra_features
        else:
            out_data = np.concatenate((data, extra_features), axis=1)
        return out_data


class Processor(DataProcessor):

    def __init__(self, normalize, max_nan_share):
        self.normalize = normalize
        self.max_nan_share = max_nan_share

    def fit(self, data):
        self.features_to_drop = np.zeros([len(data), 0])
        if self.normalize:
            self.mean = np.nanmean(data, axis=0)
            self.std = np.nanstd(data, axis=0)
        if self.max_nan_share < 1.0:
            nan_frequency = np.isnan(data).sum(axis=0) / len(data)
            self.features_to_drop = np.where(
                nan_frequency > self.max_nan_share)[0]

    def transform(self, data):
        data = np.copy(data)
        if self.normalize:
            data = (data - self.mean) / self.std
        if len(self.features_to_drop) > 0:
            data = np.delete(data, self.features_to_drop, axis=1)
        return data