Benchmarking Machine Learning
Methods for Peptide Activity Predictions
Master’s thesis in Computer science and engineering
Boel Knutson and Lida Meskini Moudi
Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2022

Master’s thesis 2022
Benchmarking Machine Learning
Methods for Peptide Activity Predictions
Boel Knutson and Lida Meskini Moudi
Department of Computer Science and Engineering
Chalmers University of Technology
University of Gothenburg
Gothenburg, Sweden 2022
Benchmarking Machine Learning Methods for Peptide Activity Predictions
Boel Knutson and Lida Meskini Moudi
© Boel Knutson and Lida Meskini Moudi, 2022.
Supervisors: Ola Engkvist, Department of Computer Science and Engineering
Gökçe Geylan, AstraZeneca
Examiner: Alexander Schliep, Department of Computer Science and Engineering
Master’s Thesis 2022
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Typeset in LATEX
Gothenburg, Sweden 2022
iv
Benchmarking Machine Learning Methods for Peptide Activity Predictions
Boel Knutson and Lida Meskini Moudi
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
Abstract
One of the main challenges in the drug discovery process is to find a suitable com-
pound for further analysis. The compound must affect the target relevant for the
specific disease, while at the same time have desired properties to make it a safe and
efficient drug candidate. The task of finding and optimizing these compounds is a
long and expensive process. Therefore, using machine learning algorithms to predict
the properties of compounds can speed up the process and reduce the cost. To use
the algorithms, the information about the compounds must be translated into a
numerical representation. The choice of representation and algorithm is of greatest
importance since the predictions must be reliable to avoid late-stage failures in the
drug discovery process.
The objective of this thesis was to investigate if a molecular representation together
with a machine learning model could be found to accurately predict the potency of
peptides. This was done through a benchmarking study where different sequence-
based descriptors and predictive models were combined to see if one combination
worked well for various types of peptides. The descriptors were Z-scales, pseudo
amino acid composition, and one-hot representation, and were combined with two
different machine learning models, namely support vector classifier and random
forests classifier. The results show that one-hot representation outperforms Z-scales
and pseudo amino acid composition, however, the predictive model depends on the
characteristics of peptides.
Keywords: Drug discovery, peptide, classification, molecular representation, Z-scales,
pseudo amino acid composition, one-hot representation, random forests, support
vector machines.
v

Acknowledgements
We would like to express our thanks to our supportive and helpful supervisors Gökçe
Geylan and Ola Engkvist who have supported us through this Master’s thesis. Not
only have they provided us with invaluable feedback but also shared their knowl-
edge and expertise. We would also like to thank AstraZeneca for giving us this
opportunity. Lastly, we want to mention our friends and family who have kept our
motivation and spirits high during this time.
Boel Knutson & Lida Meskini Moudi, Gothenburg, June 2022
vii

Contents
List of Figures xi
List of Tables xiii
1 Introduction 1
1.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Theory 5
2.1 Molecular representation . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 One-hot representation . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Z-scales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 Pseudo amino acid composition . . . . . . . . . . . . . . . . . 7
2.1.4 Feature exploration . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Machine learning models . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Support vector classifier . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Random forests . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.3 Hyperparameter tuning . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Methods 15
3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 Dataset A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.2 Dataset B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.3 Dataset C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Molecular representation . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Pseudo amino acid composition . . . . . . . . . . . . . . . . . 19
3.2.2 Z-scales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.3 One-hot representation . . . . . . . . . . . . . . . . . . . . . . 20
3.2.4 Feature exploration . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Machine learning models . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.1 Support vector machines . . . . . . . . . . . . . . . . . . . . . 21
3.3.2 Random forests . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4 Results 25
ix
Contents
4.1 Feature exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1.1 Z-scales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1.2 PseAAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2.1 Dataset A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.2 Dataset B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.3 Dataset C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3 Evaluation on test set . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5 Discussion 39
5.1 Discussion of results . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.1.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.1.2 Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.1.3 Machine learning models . . . . . . . . . . . . . . . . . . . . . 41
5.2 Discussion of methods . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6 Conclusion 45
Bibliography 47
A Feature exploration I
A.1 Z-scales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I
A.1.1 Dataset A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I
A.1.2 Dataset B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . II
A.1.3 Dataset C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . III
A.2 PseAAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IV
A.2.1 Dataset A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IV
A.2.2 Dataset B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VIII
A.2.3 Dataset C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VIII
B Confusion matrices XI
B.1 Dataset A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XI
B.2 Dataset B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XII
B.3 Dataset C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XIII
x
List of Figures
2.1 A schematic drawing to show one-hot matrix and one-hot vector rep-
resentation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 A schematic drawing to show (a) the 1st-tier, (b) the 2nd-tier, and
(c) the 3rd-tier sequence-order-correlation pattern along a protein
sequence, where Ri represents the residue in ith position [1]. . . . . . 8
2.3 A schematic drawing to show binary SVM classifier in linear separable
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1 Matrix of robust enrichment ratio of double amino acid substitutions
at positions 1 and 3, for the first five natural amino acids. . . . . . . 16
3.2 Evaluating different λ for (a) Dataset A, (b) Dataset B, and (c)
Dataset C, to find the optimal value. . . . . . . . . . . . . . . . . . . 24
4.1 Distributions of the average z-values for each class in dataset C. . . . 26
4.2 Summary plot for the average z-values in dataset C. . . . . . . . . . . 27
4.3 Distributions of the pseudo components for each class in dataset A
with optimum lambda for SVC. . . . . . . . . . . . . . . . . . . . . . 28
4.4 Accuracy of each combination of model and descriptor. . . . . . . . . 29
4.5 Recall of each combination of model and descriptor. . . . . . . . . . . 30
4.6 Precision of each different combination of model and descriptor. . . . 30
4.7 F1 of each combination of model and descriptor. . . . . . . . . . . . . 31
4.8 MCC of each combination of model and descriptor. . . . . . . . . . . 31
4.9 Confusion matrix for SVC and PseAAC for dataset A. . . . . . . . . 32
4.10 Confusion matrix for SVC and one-hot representation for dataset A. . 32
4.11 Confusion matrix for random forests and one-hot representation for
dataset B. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.12 Confusion matrix for SVC and one-hot representation for dataset B. . 33
4.13 Confusion matrix for random forests and one-hot representation for
dataset C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.14 Confusion matrix for best descriptor-model combination for each test
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
A.1 Distributions of the average z-values for each class in dataset A. . . . I
A.2 Summary plot for the average z-values in dataset A. . . . . . . . . . . II
A.3 Distributions of the average z-values for each class in dataset B. . . . II
A.4 Summary plot for the average z-values in dataset B. . . . . . . . . . . III
xi
List of Figures
A.5 Distributions of the average z-values for each class in dataset C. . . . III
A.6 Summary plot for the average z-values in dataset C. . . . . . . . . . . IV
A.7 Distributions of the pseudo components for each class in dataset A
with optimum lambda for SVC. . . . . . . . . . . . . . . . . . . . . . V
A.8 Distributions of the pseudo components for each class in dataset A
with optimum lambda for RFC. . . . . . . . . . . . . . . . . . . . . . VI
A.9 Summary plot for the pseudo components in dataset A. . . . . . . . . VII
A.10 Distributions of the pseudo components for each class in dataset B. . VIII
A.11 Summary plot for the pseudo components in dataset B. . . . . . . . . VIII
A.12 Distributions of the pseudo components for each class in dataset C
with optimum lambda for SVC. . . . . . . . . . . . . . . . . . . . . . IX
A.13 Distributions of the pseudo components for each class in dataset C
with optimum lambda for RFC. . . . . . . . . . . . . . . . . . . . . . IX
A.14 Summary plot for the pseudo components in dataset C. . . . . . . . . X
B.1 Confusion matrix for each descriptor-model combination for dataset A. XI
B.2 Confusion matrix for each descriptor-model combination for dataset B.XII
B.3 Confusion matrix for each descriptor-model combination for dataset C.XIII
xii
List of Tables
3.1 Class sizes within each dataset. . . . . . . . . . . . . . . . . . . . . . 17
3.2 Optimal λ for each dataset and model. . . . . . . . . . . . . . . . . . 19
3.3 Hyperparameters tuned for each model, and the input values. . . . . 21
3.4 Optimal values for each hyperparameter for each model, dataset, and
descriptor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1 Evaluation of best combination on test datasets . . . . . . . . . . . . 36
xiii
List of Tables
xiv
1
Introduction
Being able to predict the potency of peptides with the help of machine learning
can improve the drug discovery process to become both faster and cheaper. The
discovery and development of a new drug is a long and expensive process, where the
idea to the finished product can take approximately 12-15 years and cost around
USD$1 billion [2]. The discovery process begins by finding a suitable target, for
example, an enzyme or gene, that is relevant to a disease. This is followed by
finding and optimizing a compound. The compound must have properties suitable
for a drug while affecting the target of interest in a useful pharmacological context
[3]. The compound can for example be a small molecule, biological therapeutic,
or peptide, and common properties to evaluate are the absorption, distribution,
metabolism, excretion, and toxicity (ADMET) of the compound [4]. After this step,
the chosen compound goes into pre-clinical and clinical development to evaluate its
safety, efficacy, and dosing [4]. Lastly, the compound must be reviewed and approved
by certain agencies before it finally can enter the market. However, the majority of
potential drugs fail before entering these last steps of the process [3]. From a time
and cost perspective, it is better that these failures happen as early as possible,
and that is how predicting potency by using machine learning can improve the drug
discovery process.
Drugs can be categorized based on the type of active substance they rely on. For
instance, 75% of the global pharmaceutical market is accounted for by small molecule
drugs while only 5% is accounted for by peptide drugs [5]. A peptide is a sequence
of amino acids that can be found naturally in the human body or be produced
synthetically [6]. Reasons why peptide drugs only account for around 5% of the
market include that peptides have low oral bioavailability, low plasma stability, and
short circulation time [5]. However, they have several desirable drug features such
as low toxicity and high selectivity and they have been used against several serious
diseases such as diabetes, cancer, and HIV. Moreover, in recent years there has been
a lot of progress within the field of peptide drug development. In particular, in
the last 20 years, the number of peptide drugs entering clinical development has
increased considerably [5]. With this recent progress, it is desirable to explore and
improve the drug discovery process for peptide drugs further.
Traditionally, properties of compounds are analyzed using in vivo and in vitro tests,
which refers to experiments that are performed in living organisms or outside of
1
1. Introduction
organisms, such as in a test tube, respectively [7]. While these tests can provide
accurate results, they are expensive and time-consuming. However, with the devel-
opment of artificial intelligence and machine learning, a lot of work can be automated
by in silico (computer-aided) modeling [8]. With in silico methods, it is possible to
use machine learning models to predict compounds’ properties faster, cheaper, and
more sustainably than using in vivo and in vitro tests.
To use machine learning to predict the potency of peptides, information of the
peptide must be translated into a numerical representation readable by machines
[9]. In this thesis, different sequence-based encodings are explored. Sequence-based
encodings are calculated from the primary structure of the peptide, i.e. from the
sequence of amino acids [10].
Specifically, in this thesis, a benchmarking study is carried out to predict the potency
of peptides by combining different machine learning models with various represen-
tations of peptides as input. Models used for this task are support vector machines
(SVM), random forests, and encodings include one-hot representation, pseudo amino
acid composition (PseAAC), and Z-scales. As the area of peptide drugs is a rela-
tively new and expanding field, there is a need to explore which implementations
work well with peptide data.
1.1 Problem formulation
This project aims to find the best combination of machine learning model and
molecular representation in order to accurately predict the potency of peptides.
Specifically, using different datasets containing peptide sequences and binary classes
which represent high or low potency, the objective is to encode the sequences us-
ing sequence-based descriptors and use these representations in different machine
learning models. The assay of which the dependent variable has been measured, as
well as the length of peptides, differs between the datasets. A comparison between
model performances for each dataset is then conducted in order to find the best
combination of peptide representation and machine learning model.
1.2 Limitations
Rather than considering all possible machine learning algorithms, this project ex-
plores a subset of algorithms, which are selected based on their displayed perfor-
mance in the existing literature. The same holds true also for the selection of
peptide representations. The project is also limited to public datasets which need
to be clean, well-annotated, and sufficiently large. What is considered sufficiently
large is determined by the learning algorithms, as these might be more or less prone
to overfitting when exposed to datasets of different sizes.
2
1. Introduction
1.3 Thesis outline
In this report, Chapter 2 will introduce theory relevant to understanding this thesis.
First, molecular representations are explained and then three different descriptors,
one-hot representation, pseudo amino acid composition, and Z-scales, are presented
closer, followed by an explanation of "SHapley Additive exPlanations" (SHAP) for
feature exploration. Thereafter, two predictive machine learning models, support
vector machine and random forests, are described. Lastly, an overview of commonly
used evaluation metrics is presented.
In Chapter 3, the methodology used in this project is described. The chapter begins
with a presentation of the data used and then explains the steps of data processing.
This is in order for readers to be able to reproduce the same experiment. Then the
motivation and practical details of the generated descriptors are explained, before
going into the details of the chosen machine learning models and the motivation be-
hind these. Lastly, the process of evaluating the different combinations of descriptors
and predictive models is presented.
Next, in Chapter 4, the results of the feature exploration and experiment will be
presented, as well as the results of using the best combination for each dataset on
the corresponding test set. This is followed by Chapter 5, where the results will be
analyzed and discussed. This includes a comparison between results for different
combinations, as well as reflections and interpretations of the results. The report
ends with a conclusion in Chapter 6, where the thesis is summarized.
3
1. Introduction
4
2
Theory
In the following sections, the theory of relevant concepts is presented. First, the
theory behind molecular representation and the three descriptors, one-hot represen-
tation, pseudo amino acid composition (PseAAC), and Z-scales are explained. This
is followed by explanations of relevant machine learning models, namely support vec-
tor machines (SVM), and random forests. An explanation of hyperparameter tuning
within these models is also presented. Lastly, the formulas of relevant evaluation
metrics are explained.
2.1 Molecular representation
A peptide is a chain of molecules, known as amino acid residues, which are connected
by peptide bonds. To make the peptides readable for machine learning models, the
peptide sequences are converted into molecular representations by using descriptors.
The descriptors turn the sequences into numerical representations or features. Two
common types of descriptors are sequence- and structure-based [10]. The structure-
based descriptors contain properties obtained from the structures of peptides, while
the sequence-based is calculated from the sequence of the peptide. An entire amino
acid sequence is a straightforward model of peptide sample representation [11]. The
advantage of such a representation is that it represents the peptide with complete
information, containing both constituent amino acids and their order. Equation 2.1
shows the sequential representation of a given peptide sequence P with L amino
acid residues,
P = [R1R2R3.....RL], (2.1)
where Ri represents the residue in ith position.
Amino Acid Composition (AAC) is another sequential representation of peptide
sample that is not based on the order of amino acids of sequence. Equation 2.2
shows the AAC representation of the given peptide P ,
P = [f1f2f3.....f20]T , (2.2)
where fu (u = 1, 2, ..., 20) are the normalized occurrence frequencies of the 20 natural
5
2. Theory
amino acids in P , and T the transposing operator. The AAC representation is a
popular and one of the simpler descriptors to compute, but its main shortcoming is
the loss of the sequence-order information [11].
Three types of sequence-based descriptors are One-hot representation, Z-scales, and
PseAAC are explained in the following sections.
2.1.1 One-hot representation
For one-hot representation, each peptide sequence is converted into a matrix ∈
RL×20, where L is the length of the sequence. The columns represent the 20 natural
amino acids, and the rows represent the position in the sequence. For each row, all
amino acids are set to 0 except for the amino acid being encoded, which is set to 1
[12]. For instance, if the first position in a sequence is "G", the first row is a vector
where the column corresponding to "G" is 1 and all other columns are 0, see Figure
2.1. To use the matrix in machine learning models, the matrix is flattened into a
vector with length L× 20 by concatenating the columns.
Sequence :          G A A E I R Length(L) = 6 Peptide descriptor matrix 
20 Natural AAS Dimensions = L by 20:     A C   D   E F   G H   I K   L   M   N    P   Q    R S    T    V   W   Y
1 0   0   0   0   0    1 0   0   0   0    0    0    0    0    0    0    0    0    0    0
2 1 0   0   0   0    0    0   0   0   0    0    0    0    0    0    0    0    0    0    0
3 1 0   0   0   0    0    0   0   0   0    0    0    0    0    0    0    0    0    0    0
4 0   0   0   1 0    0    0   0   0   0    0    0    0    0    0    0    0    0    0    0
5 0   0   0   0   0    0   0    1 0   0    0    0    0    0    0    0    0    0    0    0
6 0   0   0   0   0    0   0    0   0   0    0    0    0    0    1 0    0    0    0    0
. . . . . .
Peptide descriptor vector 
Length = L * 20
. . . . . .
Figure 2.1: A schematic drawing to show one-hot matrix and one-hot vector
representation.
2.1.2 Z-scales
Z-scales represent five z-values related to physicochemical properties of the amino
acids in the peptide sequence [13]. Note that Z-scales and z-values are not related
to z-scores in statistics. Hellberg et al. [14] described the variation in the peptide
sequences by three principal properties, Z1, Z2, and Z3, per amino acid in each
position of the sequence. To derive these three principal properties they described
20 natural amino acids by 29 measures of various properties. The variables were
scaled to unit variance and then they used principal components analysis (PCA) on
the scaled data to capture the three components Z1, Z2, and Z3. These three values
6
2. Theory
of amino acid summarize all 29 physicochemical properties. Sandberg et al. [15]
improved the original Z-scales introduced by Hellberg et al. [14], by publishing two
additional values Z4 and Z5 to the three original z-values. They use 26 properties
derived from 87 amino acids. The z-values mainly captures lipophilicity (Z1), bulk
(Z2), polarity / charge (Z3) and Z4 and Z5 are related to properties as electronega-
tivity, heat of formation, electrophilicity, and hardness [15]. As an example, for Z1
a negative value relates to a lipophilic amino acid, and a positive value relates to a
polar, hydrophilic amino acid [15].
In the study by van Westen et al. [16], Z-scales performed well compared to other de-
scriptors such as VHSE, T-scales, ST-scales, ProtFP, and BLOSUM. The descriptor
also had a consistently good score for various datasets, such as dipeptides that have
an inhibitory effect on the angiotensin-converting enzyme, and three proteochemo-
metric datasets.
2.1.3 Pseudo amino acid composition
To overcome the disadvantage of the AAC model in providing information of se-
quence order, PseAAC was proposed [11]. Conventional AAC has 20 components
to show the occurrence frequency of each 20 natural amino acids, however, PseAAC
has more than these 20 factors. The first 20 components are the same as conven-
tional AAC while the additional factors record some sequence-order information via
various pseudo components. The additional factors can be a series of rank corre-
lation factors along a protein chain or any combination of other factors as long as
they are sequence order-correlated, as seen in Equation 2.3.
P = [p1, p2, p3, .....p20, p T20+1, p20+2, ..., p20+λ] , (2.3)
where λ is an integer factor, and assigning a different value to it leads to a different
dimension of the PseAAC [1].
As mentioned, the first 20 components, p1, p2, ..., p20 are associated with the conven-
tional AAC of peptides. The other components p20+1, ..., p20+λ are the correlation
factors that reflect the 1st tier, 2nd tier, up to the λ-th tier sequence order based on
correlation pattern, shown in Figure 2.2, and calculated by Equation 2.4, 2.5, and
2.6.
∑ fu20 ∑+ λ 1 ≤ u ≤ 20=1 fi w =1 τn k k
pu = ∑ (2.4)wτu−2∑020 λ 20 + 1 ≤ u ≤ 20 + λ
n=1 fi+w =1 τk k
where fu (u = 1, 2, ..., 20) is the normalized occurrence frequency of amino acid u,
fi is the normalized occurrence frequency of the ith amino acid of the 20 natural
amino acids in the protein sequence, w is the weight factor, and τk the kth tier
7
2. Theory
correlation factor, that reflects the sequence order correlation between all the k-th
most contiguous residues, see Figure 2.2, and according to Equation 2.5, and 2.6.
L∑−K
τk =
1
Ji,i+k, (2.5)
L−K i=1
1 ∑Γ
Ji,i+k = Γ [Φq(Ri+k)− Φq(R
2
i)] , (2.6)
q=1
where Φq(Ri) is the q-th function of the amino acid Ri, and Γ the total number of
the functions considered. For example, Chou [11] defined Φ1(Ri), Φ2(Ri) and Φ3(Ri)
as the hydrophobicity value, hydrophilicity value, and side chain mass of amino acid
Ri, respectively. Therefore, the total number of functions in this example is Γ = 3.
J1,2 J2,3 J3,4 J4,5
a) R1 R2 R3 R4 R5 RL
J1,3 J2,4 J3,5
b) R1 R2 R3 R4 R5 RL
J1,4 J2,5
c) R1 R2 R3 R4 R5 RL
Figure 2.2: A schematic drawing to show (a) the 1st-tier, (b) the 2nd-tier, and
(c) the 3rd-tier sequence-order-correlation pattern along a protein sequence, where
Ri represents the residue in ith position [1].
PseAAC has previously been used in various prediction tasks, such as predicting al-
lergenic proteins [17], predicting DNA-binding proteins [18], and predicting enzyme
subfamily classes [19]. In the study of predicting DNA-binding proteins made by
Fang et al. [18], PseAAC with w = 0.05 and λ = 20, together with SVM outper-
formed three other descriptors, namely dipeptide composition, autocross-covariance
transform, and native amino acid composition with an Matthews correlation coeffi-
cient (MCC) of 0.92, compared to scores between 0.45-0.66.
8
2. Theory
2.1.4 Feature exploration
Before training machine learning models with generated features, there are different
ways of exploring and analyzing these to extract more information about the data.
One method to use for this is SHAP. SHAP originates from the Shapley values
introduced by Lloyd Shapley in 1953 for cooperative game theory [20, 21]. SHAP
measures the importance of each feature for a specific prediction by the model [21].
SHAP does not only show feature importance but also the contribution of each
feature to the prediction, i.e., if the feature impacts the prediction positively or
negatively [22]. Higher SHAP values correspond to a higher probability for a positive
prediction, and lower SHAP values correspond to a higher probability for a negative
prediction. SHAP can be used to explain different models such as XGBoost and
Scikit-learn tree models by using different SHAP explainers, such as TreeExplainer
[22, 23].
2.2 Machine learning models
2.2.1 Support vector classifier
Support vector machine is a supervised learning algorithm that is used for both
classification (SVC) and regression (SVR) tasks. The algorithm tries to find the
optimal hyperplane based on the training data to classify new data points. In a
two-dimensional space, this hyperplane is a simple line. SVM learns similarities
between classes by finding the most similar samples between classes, which are
called support vectors [24]. Based on these support vectors, the algorithm finds
the optimal hyperplane by maximizing the margin between the hyperplane and all
support vectors of each class. It is a global margin as it considers all classes. Linearly
separable data in two dimensions is shown in Figure 2.3, where the margin (green
line) is orthogonal to the hyperplane (red line) and it is equidistant to the support
vectors.
When the data is not linearly separable and it is not possible to find a linear hyper-
plane to separate the classes, SVM transforms the data into a new representation
[25]. To generate the new representation, the space is transformed by mapping data
points to higher dimensions using non-linear functions. This transformation is called
kernel [25]. When adding additional dimensions to the data, it might be possible
to make the data separable [24]. The kernel function is used to calculate the dot
product of two vectors in a high-dimensional space while using their low-dimensional
counterpart for the computations [26]. Three examples of kernel functions are Lin-
ear, Sigmoid, and Radial basis function (RBF).
In SVM, hinge loss is used as the loss function, as seen in equation 2.7 [27].
Loss(w, x, y) = max(0, 1− y · (w · x+ b)), (2.7)
where w are the learned weights of the model, b is the learned bias, and y and x are
9
2. Theory
Support Vectors: most similar examples
between classes.
Separating boundary:
optimal line in two dimension 
Margin
Feature 1
3
Figure 2.3: A schematic drawing to show binary SVM classifier in linear
separable dataset.
the labels and features of the data, respectively.
Even though higher-dimensional features and kernel functions can reduce the num-
ber of data points being non-separable, there might still be scenarios when it is not
possible. To overcome this issue, it is possible to use soft-margin SVM. This means
that each data point has a slack parameter set to 0, and if a data point cannot
be on the right side of the margin, it is moved in the right direction and the slack
parameter increases. However, the bigger the total slack is, the more the model will
be penalized [27].
In a study by Mohabatkar et al. [28], SVM together with PseAAC performed well
for classifying gamma-aminobutyric acid receptor proteins, which had a length of
450–627 amino acids. The combination resulted in a MCC of 0.88, with λ = 1 and
weight factor = 0.05. SVM also worked well for classifying antibacterial peptides in
a study made by Lata et al. [29]. In their study SVM together with AAC resulted
in an MCC of 0.84.
10
Feature 2
2. Theory
2.2.2 Random forests
A random forest is an ensemble method, which refers to constructing a model by
combining multiple instances of a base model [30]. In random forests, depending on
the prediction task, these base models are either simple classification trees (RFC)
or regression trees (RFR). When training a random forests model, each tree trains
on a random sample of the dataset, and for each split of a node in the tree, only a
random subset of all features is considered [30]. The final prediction consists of the
aggregated output of all trees, which means for a regression task it is the average of
the predictions, and for a classification task, it is the class predicted by the majority
of trees.
In a study by Dybowski et al. [31], random forests were used for predicting Beviri-
mat resistance in HIV-1. Random forests together with a combination of sequence-
based and structural descriptors achieved competitive results for their p2 peptide se-
quences. Furthermore, as shown by Bhadra et al. [32] random forests also work well
for classifying antimicrobial peptides which resulted in an MCC of 0.9. Significant
results have also been achieved for protein sequences. For classifying protein-protein
interactions from primary protein sequences, random forests achieved better than
state-of-the-art predictors in a study made by You et al. [33].
2.2.3 Hyperparameter tuning
Models consist of two kinds of parameters: general parameters and hyperparameters.
Parameters are determined by the training of the model, while hyperparameters refer
to the structure of the model such as the number of trees or depth of trees in the
random forests model. The process of finding the optimal model architecture is
known as hyperparameter tuning. Tuning hyperparameters contain several steps;
defining a model, defining all possible values for hyperparameters, defining a method
for "searching" the hyperparameter space for the optimum values, and defining an
evaluative metric to evaluate the model. Two methods for "searching" are grid search
and random search. Grid search is the basic method of tuning hyperparameters
and creates a model for all possible combinations of the provided value set [34].
This method selects the model which produces the best result by considering the
defined evaluation metric. Random search is different from grid search in the way
of providing possible hyperparameter values. In grid search, a discrete set of values
is provided but in random search, a statistical distribution is given and the values
are sampled randomly from it [34].
As mentioned, different models have different hyperparameters. For example, hyper-
parameters to tune in SVC include kernel, regularization parameter (defined "C" in
sklearn.svm.SVC [35]), and gamma. Choosing the right kernel in SVC is important
as the wrong transformation leads to poor results. The regularization parameter de-
fines how much the model avoids misclassification [24]. A higher value means less
misclassification which leads to a smaller margin. A lower value means a wider mar-
gin but more misclassified samples. However, increasing the value of the parameter
does not mean the model is more precise as it increases the risk of overfitting [24].
11
2. Theory
Gamma is also an important hyperparameter, which modulates at which distance
a single sample has influence [35]. Models with high gamma only consider samples
close to the margin as support vectors, and models with lower gamma allow sam-
ples further away as support vectors. Common hyperparameters in random forest
are "max-depth", "max-leaf-nodes", "min-samples-leaf", "n-estimators", "bootstrap"
and "max-features". "max-depth" defines the longest path between root and node
and limits the depth of each tree in random forests [36]. "max-leaf-nodes" speci-
fies the maximum number of terminal nodes. If the number of terminal nodes is
reached after splitting a node, the tree will terminate and not grow further [36].
"min-samples-leaf" sets a condition on the minimum number of samples presented
in the leaf node after splitting a node. If the number of samples in the leaf node
is equal to or bigger than the specified minimum number, that node is considered
a leaf or terminal node. However, if the number of samples is less than specified,
the parent of that node is considered a leaf node. "n-estimators" define the number
of trees to use in the forest before making the final prediction of the aggregated
output of all trees, and is set to an integer value above 0 [35]. "bootstrap" specifies
if bootstrap samples of the dataset should be used for the decision trees, or to use
the whole dataset [35]. The parameter can either be set to "true" or "false", where
"true" means to use samples, and "false" to use the whole dataset. "max-features"
specifies the maximum number of features to consider when splitting nodes in each
tree. It can either be set to an integer or one of {sqrt, log2}, where sqrt is the
square root of the number of features in the dataset and log2 the logarithm of the
number of features to the base 2 [35].
2.3 Evaluation metrics
Evaluation metrics are used to explain the performance of predictive models. Gen-
erally, they are generated by first feeding an unused test set to the trained model
and then comparing the predicted outcome against the actual value. In classification
tasks, the predicted class is compared to the actual class, and the outcomes can be
represented in four different labels: True Positive (TP), False Positive (FP), True
Negative (TN), and False Negative (FN). These labels can be represented in a table
called confusion matrix which is shown in Figure 2.4.
TP and TN mean that the model predicts the actual correct positive or negative
class, respectively. FP means the model predicted a positive class for the sample,
but the actual class was negative, so the prediction was false/wrong. A similar
situation is given by FN, where the model predicts a negative class for the sample
but the actual class of that sample is positive.
Different metrics can be used for evaluating classification models. The choice of
evaluation metrics can be a challenging task, based on the given problem and the
need of stakeholders. However, there are standard metrics that are widely used for
evaluating classification models, such as accuracy, precision, recall, F1, and MCC.
Accuracy is a standard metric that is commonly used for most problems. It shows the
proportion of the number of correct predictions to the total number of predictions, as
12
2. Theory
Predicted
Negative Positive
TRUE NEGATIVE FALSE POSITIVE
(TN) (FP)
FALSE NEGATIVE TRUE POSITIVE
(FN) (TP)
Figure 2.4: Confusion matrix.
seen in Equation 2.8. For this metric, a score of 1 represents a perfect performance,
and 0 represents that the model performs poorly. Accuracy is defined as
TP + TN
Accuracy = + + + . (2.8)TP TN FP FN
Precision, recall, and F1 are three suitable metrics for imbalanced datasets when the
positive class is the minority class. Precision shows how many of the samples pre-
dicted as positive (TP+FP) are correctly predicted, see Equation 2.9. A maximum
score of 1 is achieved when all positive predictions are correctly classified. A score
of 0 means that all positive predictions are wrong. Precision is formulated as
TP
Precision = . (2.9)
TP + FP
Recall shows how many of the samples belonging to the actual positive class (TP+FN)
that are correctly classified, see Equation 2.10. For recall, a score of 1 means that
all samples with actual positive class are predicted correctly. A score of 0 means
that the model couldn’t assign any of the actual positive samples to the positive
class. Recall is defined as
13
Actual
Positive Negative
2. Theory
TP
Recall = + . (2.10)TP FN
F1-score is the average of precision and recall which is formulated as the harmonic
mean as in Equation 2.11. It gives equal weight to both equations and the score
ranges in [0, 1]. A score of 1 means a perfect prediction and is reached when FN =
FP = 0, and the lowest score of 0 is reached when none of the positive samples has
been classified correctly [37]. The formula for calculating F1 is
= 2× Precision×RecallF1 + . (2.11)Precision Recall
MCC is an evaluation metric for binary classification. A high MCC is only possible
when the model predicts the majority of data samples for both classes correctly
[37]. A model with perfect fit results in a score of 1, a score of 0 means that the
predictions are random, and -1 indicate that the model performs poorly [37]. The
formula can be seen in Equation 2.12. The main difference between MCC and F1,
is that F1 is independent from TN [37]. MCC is formulated as
= √ TP × TN − FP × FNMCC . (2.12)
(TP + FP )× (TP + FN)× (TN + FP )× (TN + FN)
14
3
Methods
In this chapter, the methods used to predict the potency of peptides are presented.
First, a detailed introduction to the datasets is given. Since publicly available
datasets are used, descriptions of how the experiments were conducted and how
the data were obtained will be presented. Then, the implementation of the pep-
tide representations is presented, followed by a detailed explanation of the machine
learning models carried out in this project. Lastly, the evaluation process of the
models will be presented.
3.1 Data
In this project, the modeling has been conducted on three different publicly avail-
able datasets. Each dataset has peptides with different characteristics, which is of
interest to see how well the different descriptors work on different types of peptides.
The datasets have been pre-processed so that all of them have a similar format of
[peptide sequence, class]. How each dataset has been processed can be found in the
following sections. After pre-processing, all datasets were split so that 80% were
kept for training and validating the models, and 20% was kept as test data, which
the final model was evaluated on in the final steps of this thesis. The splitting was
done by using train_test_split from scikit-learn, with random_state=100 [35].
3.1.1 Dataset A
Dataset A is provided by van Rosmalen et al. [38]. This data contains the wild-type
sequence "CQFDLSTRRLKC". Wild-type sequence refers to the sequence in its nat-
ural form, without any mutation. It is a cyclic peptide with ten amino acids, which
are cyclized by a disulfide bridge. The measurement of the data is affinity through
robust enrichment ratio, which can be regarded as the true enrichment ratio’s lower
95% confidence limit, assuming the sampling is unbiased. True enrichment ratio can
be explained as a measurement that describes to which degree a variant contributes
to a selected property [38]. In the study by van Rosmalen et al. [38], the enrich-
ment ratio indicates a positive or negative effect of the property binding affinity. It
is calculated by measuring the frequency of a variant in the selected population and
dividing it by the frequency of the variant in the unselected population. The robust
enrichment ratio is used to minimize the effect of sampling errors [38].
15
3. Methods
From the paper, five different matrices are provided, each visualizing amino acid
substitutions at two positions in the wild-type sequence, together with the corre-
sponding robust enrichment score. In each matrix, the first position is shown along
the horizontal axis and the second position along the vertical axis. What this means
is that at these two positions in the wild-type sequence, the original two amino acids
are substituted with the other 20 natural amino acids.
In Figure 3.1, an example from a matrix is shown. The example only shows the
first five natural amino acids along the vertical and horizontal axis. The horizontal
axis corresponds to position 1, and the vertical axis to position 3. This means that
in the wild-type sequence "CQFDLSTRRLKC", both position 1 (Q), and position 3
(D) are substituted with the five natural amino acids. The variants are saved from
the matrix as "Q1R, D3R" for the first row and column in the matrix, together with
the corresponding robust enrichment score (0.0). The resulting variant therefore
becomes "CRFRLSTRRLKC" with a robust enrichment ratio of 0.0.
A dataset is created where the positions in the wild-type sequence are substituted
with the variant amino acids. The other matrices are similar to Figure 3.1 but have
substitutions at positions 1 and 5, 1 and 6, 3 and 5, and 5 and 6. From the matrices
and the wild-type sequence, a total of 2000 sequences were obtained. However, a
total of 32 sequences did not have an available score, since these combinations were
counted less than 10 times in the unselected library and were therefore removed.
When combining the matrices into one dataset, each matrix was first normalized
such that the score of the wild-type sequence equaled one, i.e., all scores in each
matrix were divided by the score of the wild-type sequence in their matrix. Then 77
duplicate sequences, with a total of 193 appearances, were removed, and a dataset
with 1775 sequences remained. These sequences had a score s ∈ [0, 49.5]. A score of
1 (the score of the wild-type sequence) was used as the threshold when adding class
labels to the sequences. Every score equal to this threshold or less was classified
as the negative class and everything above was classified as the positive class. The
distribution between classes for the training and validation set can be seen in Table
3.1.
Position 1
R H K D
R 0.0 0.0 0.0 0.0
H 1.8 1.5 1.0 0.0
K 0.1 0.1 0.0 0.0
D 15.4 7.5 4.2 0.3
Figure 3.1: Matrix of robust enrichment ratio of double amino acid substitutions
at positions 1 and 3, for the first five natural amino acids.
According to Wang et al. [39], cyclic peptides have shown superior behavior in
16
Position 3
3. Methods
biological activities to linear peptides and have a large potential as therapeutic
agents. It is therefore of interest to analyze how a descriptor and model combination
for cyclic peptides might differ from the combination of linear peptides.
Dataset Set Nr. of negatives Nr. of positives Total
Training 1 117 (87%) 161 (13%) 1 278
A
Validation 115 (81%) 27 (19%) 142
Training 11 272 (87%) 1 673 (13%) 12 945
B
Validation 1 261 (88%) 178 (12%) 1 439
Training 140 (30%) 320 (70%) 460
C
Validation 15 (29%) 37 (71%) 52
Table 3.1: Class sizes within each dataset.
3.1.2 Dataset B
Dataset B consists of a wild-type sequence together with a dataset of variants and
corresponding scores [12, 40]. The measurement of the dataset is β-glucosidic link-
ages through scores calculated by Enrich2 tool [41]. One up to five variants are
provided per measured score, in the format "A104E" where the first letter repre-
sents the amino acid in the wild-type sequence (A) at a position (104) and the
substituting amino acid (E), similar to the variants in dataset A. From these, and
including the wild-type sequence, a total of 26 653 sequences with a corresponding
score s ∈ [−4.72, 3.79] were obtained. 916 of these sequences contained a ’*’-symbol,
which were removed, resulting in a dataset of 25 737 data points. When using the
Enrich2 tool, a score of 0 corresponds to the score of the wild-type sequence. There-
fore, when adding class labels to the sequences, sequences with a score equal or less
than 0 were classified as the negative class and everything above was classified as
the positive class.
The sequences are representations of longer protein sequences, consisting of 501
amino acids, compared to the peptides in dataset A, which has a length of 12
amino acids. Using two such different molecules is relevant to see how descriptors
and models differ for long and short sequences. To be able to compare the results
between dataset A and dataset B, the protein dataset was downsampled so that the
distribution between the two classes was the same as for dataset A. This was done by
removing 7 756 sequences with positive class by using the function resample from
scikit-learn module utils [35]. When implemented, random_state is set to 42, and
since a sample from the class should not appear more than once, the parameter
replace is set to "false". The sample chosen from resample is then combined
together with the negative class. The final dataset consists of 17 981 sequences,
including the test set, and the distribution of the training and validation set can be
seen in Table 3.1.
17
3. Methods
In the study by Gelman et al. [12], the dataset was used to evaluate sequence-
function mapping, i.e., predict the functions of unseen sequences, by using supervised
learning methods. Specifically, by using four different supervised neural network
architectures, they showed that the dataset could be used to design new sequence
variants. The models resulted in a Pearson’s correlation coefficient above 0.5 for all
four architectures, where a score of 1 represents a perfect linear relationship between
true and predicted scores, -1 a negative linear relationship, and 0 represents no linear
relationship.
3.1.3 Dataset C
Dataset C was downloaded from the Immune Epitope Database (IEDB) [42]. This
database includes over one million peptides which are possible to filter and search
among. Possible filters include what type of peptides, sequence length, the natural
source where the peptide can be found, what type of experiment is performed and
the outcome, the type of host mounting the immune response, and what type of
disease affecting the host. For this project, only linear peptides found in humans
with a length of 9 amino acids are considered. Of these, only peptides that are ex-
perimentally tested for binding to major histocompatibility complex (MHC) ligand
’HLA-A*02:01’, with both positive and negative outcomes are chosen. Further, only
humans as hosts with any disease state are selected. After the dataset had been
downloaded from the database, only the peptides with the assay unit nanomolar
(nM) concentration were kept to only keep those peptides that have the same end-
point measurement. After removing duplicate sequences, the dataset consisted of
641 peptides including the test set, of which 441 had a positive outcome and 200
had a negative outcome. The distribution of the training and validation set can be
seen in Table 3.1.
Dataset C differentiates from the other two datasets as the sequences don’t originate
from a wild-type sequence. This is of interest to see how sequences with very different
compositions impact the results of the descriptor and model combinations, compared
to the sequences in datasets A and B, which have a similar amino acid composition.
Furthermore, since IEDB is a widely-used public database including more than
one million peptides, it has been used to compile data for a wide range of studies
using machine learning. Data from this database have for example been used for
predicting T-cell epitopes in order to develop a vaccine against Zika Virus [43],
predicting human leukocyte antigen (HLA) class II molecules from which results
can be utilized when developing a vaccine or cancer treatment [44], and predicting
B-cell epitopes which, according to Liu et al. [45], are important in vaccine design,
clinical diagnosis, and antibody production.
3.2 Molecular representation
In order to use the sequences in predictive machine learning models, the informa-
tion from each sequence must be translated into numerical representations. In this
project, three different descriptors are generated for each dataset, PseAAC, Z-scales,
18
3. Methods
and one-hot representation. Before using the generated representations in the ma-
chine learning models, the representation for each dataset is normalized using the
function robust scaler in scikit-learn [35]. This function centers the data by re-
moving the median and then scales it according to the quantile range. The step
of normalization is to make the data more consistent as it ensures that all features
have roughly the same scale, and no feature has a very large value compared to the
other features [27].
3.2.1 Pseudo amino acid composition
As mentioned in section 2.1.3, PseAAC has previously been applied in various pre-
diction tasks, which makes it an interesting candidate for this project as well. In
this project, to generate PseAAC, the Python package propy is used to generate
the PseAAC [46]. In PseAAC there are two factors for the user to select, λ and
the weight. λ ∈ [0, N ] is a factor reflecting the rank of correlation, where N is the
length of the sequence, and 0 corresponds to the components for conventional AAC.
The sequence-order effect incorporated in the representation increases by choosing
a greater value for λ [11]. This parameter was tuned by trying different values for
λ on a subsample of each dataset together with each model. For each dataset, a
maximum of seven different values were evaluated, which were evenly distributed in
the range ∈ [0, N ]. The λs with the highest F1-score for each dataset and model
were then selected for the modeling. The results for each dataset can be seen in
Figure 3.2 and the optimal values for each combination can be found in Table 3.2.
The weight factor is a value in range ∈ [0.05, 0.7], and is used to put weight to the
additional components (20+1 to 20+λ) with respect to the conventional AAC [47].
In this thesis, the weight factor is set to the default value of 0.05, which is the same
value as Chou [11] proposed when introducing PseAAC.
Dataset Model Optimal λ
SVC 9
Dataset A
Random forests 5
SVC 1
Dataset B
Random forests 1
SVC 5
Dataset C
Random forests 1
Table 3.2: Optimal λ for each dataset and model.
3.2.2 Z-scales
In the study by van Westen et al. [16] mentioned in section 2.1.2, Z-scales performed
well for various types of datasets. This makes it a relevant descriptor to explore in
this thesis, to see if a similar conclusion can be drawn or if a particular predictive
19
3. Methods
model might be more suitable for this descriptor. In this thesis, Z-scales are gener-
ated by using the Peptides package in Python [48]. So for each amino acid in the
sequences, five z-values are calculated which results in a vector of 60 features for
each sequence in dataset A, 2 505 features for the sequences in dataset B, and 45
features for the sequences in dataset C.
3.2.3 One-hot representation
One-hot representation doesn’t capture any prior domain knowledge about amino
acids [49], compared to PseAAC and Z-scales. The descriptor only contains informa-
tion about the positions of the 20 natural amino acids. This makes it an interesting
descriptor for this project in order to analyze if the manually curated features of
PseAAC and Z-scales impact the predictions, or if the order of the amino acids in
the sequence is enough. To generate a one-hot representation, each amino acid in
the sequences is first encoded into integers. Then, for each position in the sequence,
a list with a length of 20 is created with all zeros. At the position of the integer
representing the amino acid, the zero is set to 1. These are inserted in a matrix
∈ RL×20 where L is the length of the sequence. Then, this matrix is flattened into
a vector, with length L× 20. The number of features for dataset A is therefore 240,
for dataset B 10 020, and dataset C 180.
3.2.4 Feature exploration
After generating the descriptors, an exploration of features was conducted. This was
done for Z-scales and PseAAC. Since one-hot representation is a direct representa-
tion of the sequence, where the features don’t include any prior domain knowledge
about amino acids, this descriptor was not explored further before using it in the
machine learning models. For each descriptor and dataset, first, the distribution of
the features between the two classes is analyzed, and thereafter, the SHAP values for
the features are presented. For the implementation of SHAP, only random forests
are considered and therefore the "TreeExplainer" is chosen as the SHAP explainer.
Summary plots of SHAP values give an overall picture of the features’ contributions.
The y-axis is the features in descending order, and the x-axis shows the calculated
mean absolute value of SHAP values, where the red color shows the positive corre-
lation with the target, and the blue the negative correlation [22]. The results of the
feature exploration can be found in section 4.1.
3.3 Machine learning models
The numerical representations are then used for two different machine learning mod-
els: SVC and random forests. Each model is implemented with hyperparameter
tuning, where Gridsearch from scikit-learn is used to find the optimal values of
hyperparameters [35]. The hyperparameters which are tuned and the parameter
values can be seen in Table 3.3. The procedure of defining the possible values for
the hyperparameters was done by observing that no optimal values were the same
as the border value. If an optimized model chose a value corresponding to a border,
20
3. Methods
the borders were extended and the model was optimized once again. When fitting
a model, Stratified K-Folds cross-validator with five folds is used, and the combi-
nation with the highest MCC is retained. MCC is used as the evaluation metric as
it takes correct predictions for both classes into account. The optimal values of the
hyperparameters for the model and descriptor combinations for each dataset can
be seen in Table 3.4. The models are trained on 90% of the training data and the
remaining 10 % is used as a validation set.
Model Hyperparameter Values
Kernel [linear, sigmoid, rbf]
SVC C [0.0001, 0.1, 1, 10, 20, 50, 100]
Gamma [scale, auto]
Bootstrap [True, False]
Max-depth [10, 50, 100, 500, 1000]
Random
forests Max-features [auto, sqrt]
Min-samples-leaf [1, 2, 10, 50]
n-estimators [1, 10, 100, 500, 1000, 1500]
Table 3.3: Hyperparameters tuned for each model, and the input values.
3.3.1 Support vector machines
In the study by Mohabatkar et al. [28] mentioned in section 2.2.1, SVM together
with PseAAC worked very well for long protein sequences, similar to the length of
proteins in dataset B. Considering this, and that SVM also showed a good result
classifying peptides in the study made by Lata et al. [29], it is a relevant model to
explore further to see if it also performs well for other types of peptides and proteins.
In this thesis, SVM is implemented by using SVC from the scikit-learn module svm
[35]. As all three datasets are imbalanced, when implementing SVC, the hyper-
parameter class_weight is set to "balanced". This is to assign different weights
inversely proportional to the class frequencies to the data points [35].
3.3.2 Random forests
As mentioned in section 2.2.2, random forests have shown to be a suitable model in
various studies for predicting different proteins and peptides. With these positive
results, it is a good model to explore further in this thesis.
The implementation of random forest is done by using RandomForestClassifier
from the ensemble module in scikit-learn [35]. When implementing random forests
in this project, the value of the parameter random_state is set to 2 to control the
reproducibility of the results.
21
3. Methods
Dataset Model Hyperparameter Z-scales PseAAC One-hot
Kernel rbf linear rbf
SVC C 20 100 10
Gamma scale scale scale
Bootstrap False False False
A
Max-depth 50 50 100
Random
forests Max-features auto auto auto
Min-samples-leaf 1 1 1
n-estimators 100 1000 100
Kernel linear rbf rbf
SVC C 10 0.1 0.1
Gamma scale auto scale
Bootstrap True True True
B
Max-depth 500 100 500
Random
forests Max-features auto auto auto
Min-samples-leaf 1 1 1
n-estimators 1 1 1
Kernel rbf rbf rbf
SVC C 1 10 10
Gamma scale scale auto
Bootstrap True True False
C
Max-depth 10 10 50
Random
forests Max-features auto auto auto
Min-samples-leaf 1 1 1
n-estimators 500 500 1000
Table 3.4: Optimal values for each hyperparameter for each model, dataset, and
descriptor.
3.4 Evaluation
When evaluating the performance of the descriptor and model combinations, five
different metrics are analyzed: accuracy, precision, recall, MCC, and F1. Only
22
3. Methods
looking at accuracy could be misleading when having an imbalanced dataset. As
87% of datasets A and B are of the negative class, the accuracy would result in 0.87
if the model only predicts the majority class. Therefore, it is important to consider
other metrics as well. Since a negative peptide being classified as positive (FP) can
become very costly in future steps of the drug discovery process, high precision is
necessary. High precision makes it more likely that a sample predicted as positive
is correct. Moreover, high recall is important to avoid missing potential drugable
molecules. Since both precision and recall are of importance, F1 is a suitable metric
for this project, especially for datasets A and B which have the negative class as the
majority class. High MCC can be interpreted as the model that can predict both
classes correctly, which is suitable when both the positive and negative classes are
important. This metric is therefore a suitable metric for binary classification and
this project.
To calculate the metrics, accuracy_score, precision_score, recall_score,
matthews_corrcoef, and f1_score are used from the metrics-module in scikit-
learn [35].
23
3. Methods
a 1.0
svc
0.8 rfc
0.6
0.4
0.2
0.0 2 4 6 8 10
(a) F1 score of SVC and RFC together with PseAAC calculated with different λ
for dataset A.
b 1.0
svc
0.8 rfc
0.6
0.4
0.2
0.0 0 100 200 300 400
(b) F1 score of SVC and RFC together with PseAAC calculated with different λ
for dataset B.
c 1.0
0.8
0.6
0.4
0.2 svc
rfc
0.0 1 2 3 4 5 6 7
(c) F1 score of SVC and RFC together with PseAAC calculated with different λ
for dataset C.
Figure 3.2: Evaluating different λ for (a) Dataset A, (b) Dataset B, and (c) Dataset
C, to find the optimal value.
24
F1 F1 F1
4
Results
This chapter begins by first presenting the results from the exploration of the features
generated by Z-scales and PseAAC. This is followed by a presentation of the results
from the descriptor and model combinations which are evaluated by comparing the
different evaluation metrics for each dataset. Lastly, the results of using the best
combinations on each test set are presented and described.
4.1 Feature exploration
Most features showed a similar distribution for both the negative and the positive
classes, which indicates that the features are not obvious identifiers when classifying
new, unknown samples. Therefore, only the plots showing some indication of being
differentiated and the corresponding summary plots of SHAP values are discussed,
and the other plots can be found in Appendix A.
4.1.1 Z-scales
For Z-scales, the five average z-values were used to generate the feature exploration
for each dataset. For dataset C, the distribution of the two classes for Z1, corre-
sponding to lipophilicity, are more separated, compared to the other z-values, as
seen in Figure 4.1. This means that for the peptides in dataset C, the value of Z1
can be an indicator of which class a sample belongs to. This is supported by the
SHAP value in Figure 4.2. As seen in the figure, Z1 is the feature that has the most
impact on the model output, with a SHAP value above 0.3, compared to the other
z-values which have a score around 0.1.
4.1.2 PseAAC
For PseAAC, only the features corresponding to the pseudo components are used
in the distribution plots, i.e., the value of λ for the different combinations. This
is because the first 20 components corresponding to the conventional amino acid
composition are only a direct representation of the sequence. Similar to Z-scales,
most features show no significant difference in the distribution for both classes.
However, for dataset A and SVC where optimal λ = 9, pseudo components 27, 28,
and 29 show some differences between the distribution, see Figure 4.3. It can be
25
4. Results
a b
0.6
0.4 0 01 1
0.3 0.4
0.2
0.2
0.1
0.0 4 2 0 2 0.0 2 0 2
z1 z2
(a) Z1 (b) Z2
c d
0.8 0 0.8 0
1 1
0.6 0.6
0.4 0.4
0.2 0.2
0.0 2 1 0 1 2 0.0 2 1 0 1
z3 z4
(c) Z3 (d) Z4
e
1.5 0
1
1.0
0.5
0.0 1 0 1 2
z5
(e) Z5
Figure 4.1: Distributions of the average z-values for each class in dataset C.
26
Density Density
Density
Density Density
4. Results
z1
z3
z2
z5
z4 Class 0
Class 1
0.00 0.05 0.10 0.15 0.20 0.25 0.30
mean(|SHAP value|) (average impact on model output magnitude)
Figure 4.2: Summary plot for the average z-values in dataset C.
seen that a lower value for pseudo components 27 and 29, as well as a higher value
for pseudo component 28, is more related to the negative class.
27
4. Results
a b
0.4 0 01 0.8 1
0.3 0.6
0.2 0.4
0.1 0.2
0.0 4 6 8 10 0.0 3 4 5 6 7 8
PAAC21 PAAC22
(a) PAAC21 (b) PAAC22
c d
0.6 0 0.6 0
1 1
0.4 0.4
0.2 0.2
0.0 5 6 7 8 9 10 0.0 6 8 10
PAAC23 PAAC24
(c) PAAC23 (d) PAAC24
e f
0 0.4 0
0.4 1 10.3
0.2
0.2
0.1
0.0 4 6 8 0.0 4 6 8 10
PAAC25 PAAC26
(e) PAAC25 (f) PAAC26
g h
0.6 0 0.6 0
1 1
0.4 0.4
0.2 0.2
0.0 2 4 6 8 10 0.0 6 8 10 12
PAAC27 PAAC28
(g) PAAC27 (h) PAAC28
i
1.5 0
1
1.0
0.5
0.0 0 2 4 6 8
PAAC29
(i) PAAC29
28
Figure 4.3: Distributions of the pseudo components for each class in dataset A
with optimum lambda for SVC.
Density Density Density Density
Density
Density Density Density Density
4. Results
4.2 Modeling
In this project, the comparison between the performance of each combination is made
by considering five evaluation metrics: accuracy, recall, precision, F1, and MCC. A
comparison of all results are shown in Figure 4.4-4.8 and the most relevant confusion
matrices are shown in Figure 4.9-4.13. An overview of all confusion matrices for each
dataset, model and descriptor can be found in Appendix B.
1.0
0.8 SVC - Zscale
SVC - Pseaac
SVC - Onehot
RFC - Zscale
0.6 RFC - Pseaac
RFC - Onehot
0.4
0.2
0.0 Data A Data B Data C
Dataset
Figure 4.4: Accuracy of each combination of model and descriptor.
29
Accuracy
0.9
0.85
0.93
0.89
0.85
0.87
0.71
0.52
0.42
0.83
0.81
0.84
0.77
0.67
0.81
0.81
0.75
0.88
4. Results
1.0
0.8 SVC - Zscale
SVC - Pseaac
SVC - Onehot
RFC - Zscale
0.6 RFC - Pseaac
RFC - Onehot
0.4
0.2
0.0 Data A Data B Data C
Dataset
Figure 4.5: Recall of each combination of model and descriptor.
1.0
0.8 SVC - Zscale
SVC - Pseaac
SVC - Onehot
RFC - Zscale
0.6 RFC - Pseaac
RFC - Onehot
0.4
0.2
0.0 Data A Data B Data C
Dataset
Figure 4.6: Precision of each different combination of model and descriptor.
30
Precision Recall
0.71 0.81
0.55 0.96
0.81 0.81
0.88 0.52
0.69 0.41
0.74 0.52
0.2 0.46
0.17 0.75
0.17 0.92
0.2 0.13
0.2 0.19
0.22 0.11
0.77 0.97
0.78 0.76
0.81 0.95
0.8 0.97
0.76 0.95
0.88 0.97
4. Results
1.0
0.8 SVC - Zscale
SVC - Pseaac
SVC - Onehot
RFC - Zscale
0.6 RFC - Pseaac
RFC - Onehot
0.4
0.2
0.0 Data A Data B Data C
Dataset
Figure 4.7: F1 of each combination of model and descriptor.
1.0
0.8 SVC - Zscale
SVC - Pseaac
SVC - Onehot
RFC - Zscale
0.6 RFC - Pseaac
RFC - Onehot
0.4
0.2
0.0 Data A Data B Data C
Dataset
Figure 4.8: MCC of each combination of model and descriptor.
31
MCC F1
0.7 0.76
0.65 0.7
0.77 0.81
0.62 0.65
0.45 0.51
0.55 0.61
0.15 0.28
0.16 0.28
0.19 0.28
0.07 0.16
0.09 0.2
0.07 0.14
0.37 0.86
0.22 0.77
0.49 0.88
0.5 0.88
0.3 0.84
0.71 0.92
4. Results
80
0 94 21
60
40
1 1 26 20
0 1
Predicted label
Figure 4.9: Confusion matrix for SVC and PseAAC for dataset A.
100
0 110 5
75
50
1 5 22 25
0 1
Predicted label
Figure 4.10: Confusion matrix for SVC and one-hot representation for dataset A.
1192 69 10000
750
500
1 159 19 250
0 1
Predicted label
Figure 4.11: Confusion matrix for random forests and one-hot representation for
dataset B.
32
True label True label True label
4. Results
800
0 440 821 600
400
1 15 163 200
0 1
Predicted label
Figure 4.12: Confusion matrix for SVC and one-hot representation for dataset B.
0 10 5 30
20
1 1 36 10
0 1
Predicted label
Figure 4.13: Confusion matrix for random forests and one-hot representation for
dataset C.
33
True label True label
4. Results
4.2.1 Dataset A
For dataset A, each combination has an accuracy above 0.85, as seen in Figure
4.4, with SVC together with one-hot representation having the best score of 0.93.
Since this dataset is very imbalanced with 87% and 81% of the training and the
validation set, respectively, belonging to the negative class, the accuracy is not a
suitable metric to evaluate the performance. The recall for dataset A in Figure 4.5
shows the best result for SVC together with PseAAC with a score of 0.96, followed
by SVC together with Z-scales and one-hot representation, both with a score of
0.81. This means that these models are good at predicting the samples belonging
to the positive class correctly. However, looking at the precision in Figure 4.6,
SVC together with PseAAC has the lowest score of 0.55. This indicates that a lot
of the model’s positive predictions are false, which is supported by the confusion
matrix in Figure 4.9, where almost half of the positive predictions are false. Low
precision increases the risk of late-stage failure in drug discovery. Choosing the
wrong candidate (FP) for further analysis leads to spending unnecessary costs and
time. However, both SVC together with one-hot representation and random forests
together with Z-scales show good precision scores above 0.8.
F1 takes both precision and recall into account and that is why SVC together with
one-hot representation, which had the second-best score for both these metrics, has
the highest F1 with a score of 0.81, as seen in Figure 4.7. This combination is also
the one showing the best MCC score with a value of 0.77, as seen in Figure 4.8.
This indicates that the model is relatively good at predicting both classes correctly,
and by looking at the confusion matrix in Figure 4.10, only 10 samples, or 7% of
the validation set, are predicted wrongly.
From the results, it is seen that random forests overall show lower scores than SVC,
except for accuracy and precision. Even if precision is of importance when evaluating
potential peptides for drug discovery, the score of 0.88 for random forests together
with Z-scales is not significantly higher than the score of 0.81 for SVC and one-
hot representation. Therefore, when taking all the metrics into consideration, SVC
together with one-hot representation is chosen as the best combination for dataset
A.
4.2.2 Dataset B
For dataset B, the combination of random forests and one-hot representation has
the highest accuracy score of 0.84, closely followed by random forests together with
both Z-scales and PseAAC, as seen in Figure 4.4. Even if accuracy gives a good
overview of the performance, since it shows the proportions of correct predictions,
it is not a reliable score for the evaluation of imbalanced datasets. Since dataset B
is extremely imbalanced, similar to dataset A, with 87% consisting of the negative
class in both the training and validation sets, the accuracy can be affected by the
majority class. As shown in the confusion matrix in Figure 4.11, random forests
together with one-hot representation, which had the highest accuracy, have a large
number of TNs compared to the other combinations.
34
4. Results
Recall score in Figure 4.5 shows that the combination of SVC and one-hot repre-
sentation has the highest score of 0.92, which means that this combination could
predict 92% of the samples actually belonging to the positive class correctly. The
high recall score for this combination is because of the low number of FNs (15) and
high number of TPs (163) as seen in Figure 4.12. As also shown in Figure 4.5, the
combinations of SVC have significant higher scores compared to the combinations of
random forests. This means that SVC could predict more samples actually belong-
ing to the positive class correctly. One reason for this could be that the tuned value
for hyperparameter C (regularization) for this model and dataset is 0.1, as seen in
Table 3.4. Lower C leads to a larger margin and allows more misclassification. It
increases the ability of the model to classify most of the samples belonging to the
minority class (higher recall), but it also increases the number of FP which leads to
a lower precision score, as seen in Figure 4.6. The precision of SVC together with
one-hot representation shows that only 17% of the samples predicted as positive are
correct, even if the combination could capture a majority of the positive class, as
shown by the high recall score.
The highest score of F1 presented in Figure 4.7 is 0.28, which is the same for all
SVC combinations. The highest MCC score is 0.19 for SVC together with one-
hot representation which implies this combination has better performance in both
negative and positive classes compared to the other combinations. Even if SVC
together with one-hot representation has the highest MCC, it has also the highest
number of FPs which leads to the lowest precision. This is unfavorable when ex-
ploring new molecules as peptides without the expected properties will move on for
further analysis. However, all combinations have a very low score of F1 and MCC,
indicating that the explored combinations work poorly for this dataset. Although
random forests together with one-hot representation has the lowest MCC, it also
has the lowest rate of FPs. This means that if the model predicts a molecule as
positive it is more likely to be correct and decreases the risk of failure in later stages
of the process. Therefore, to evaluate the performance of the unseen test set, this
combination is chosen.
4.2.3 Dataset C
When looking at the accuracy for dataset C in Figure 4.4, random forests together
with one-hot representation show the best result of 0.88, followed by SVC together
with one-hot representation and random forests with Z-scales, both with a score of
0.81. However, the dataset is imbalanced with around 70% of the dataset consisting
of samples belonging to the positive class, which makes accuracy a non-suitable
score. However, all combinations show a very high recall, as seen in Figure 4.5. All
combinations except SVC together with PseAAC have a score above 0.95. Also for
precision, all combinations show good results, with scores above 0.76, with random
forests together with one-hot representation having the best score of 0.88. The high
values for both recall and precision also result in high F1 scores. As seen in Figure
4.7, all combinations have a score above 0.77, again with random forests together
with one-hot representation having the best result with a score of 0.92. However,
when looking at the MCC in Figure 4.8, no combination except random forests
35
4. Results
together with one-hot representation has a score above 0.5. The lower scores of MCC
and the high scores of precision and recall indicate that the models have problems
predicting the negative class correctly. However, as random forests together with
one-hot representation has the best MCC with a score of 0.71, and it has shown
the highest scores consistently for all evaluation metrics, this is chosen as the best
combination for dataset C. As seen in the confusion matrix in Figure 4.13, the model
predicts almost all samples belonging to the positive class correctly, as expected.
However, around 10% of the samples are predicted as positives but are actually
negatives, which is the reason behind a lower MCC.
4.3 Evaluation on test set
After the best combination had been chosen for each dataset, the combinations were
used on the respective test set to evaluate the final models on unseen data. In Table
4.1, the final results are presented. For dataset A, F1 and MCC scores of SVC and
one-hot combination are better on the test set compared to the validation set. A
reason behind this could be that the data in the test set is very similar to the one
in the training set. The data in the test set is more similar to the training data,
compared to the validation data. As seen in the confusion matrix in Figure 4.14a,
this combination only predicted one FP, corresponding to 0.2% of the dataset, which
implies that this combination is a good choice for classification of unseen data with
similar characteristics to dataset A.
The result of the evaluation on test set B showed poor performance. This was
expected as the results of the validation set of dataset B showed similar results. As
seen in Figure 4.14b, the confusion matrix has a high number of FPs, which is not
desirable in the drug discovery procedure.
For dataset C, F1 and MCC scores of random forests together with one-hot repre-
sentation on the test set are not as good as on the validation set. This indicates that
there is a risk of overfitting on the training dataset when building this predictive
model. Moreover, dataset C is smaller than the two other datasets, with only around
a third of the number of samples in dataset A. A larger dataset where the model has
more samples to train on could have a positive effect on the results. However, the
model still performs well, which implies that the model and descriptor combination
work on unseen data with similar characteristics as dataset C.
Dataset Model Descriptor Accuracy F1 MCC Recall Precision
Dataset A SVC One-hot 0.98 0.92 0.91 0.86 0.97
Dataset B RFC One-hot 0.81 0.15 0.05 0.12 0.20
Dataset C RFC One-hot 0.77 0.84 0.48 0.89 0.79
Table 4.1: Evaluation of best combination on test datasets
36
4. Results
300
250
0 309 1
200
150
100
1 6 39
50
0 1
Predicted label
(a) Confusion matrix of SVC and one-hot representation on test set A
2500
0 2859 234
2000
1500
1000
1 444 60
500
0 1
Predicted label
(b) Confusion matrix of RFC and one-hot representation on test set B
70
0 25 20 60
50
40
30
1 9 75
20
10
0 1
Predicted label
(c) Confusion matrix of random forests and one-hot representation on test set C
Figure 4.14: Confusion matrix for best descriptor-model combination for each test
dataset.
37
True label True label True label
4. Results
38
5
Discussion
In this chapter, the results of this thesis are discussed. First, the possible impact
which the datasets, descriptors, and machine learning models have on the results
are analyzed. Thereafter, the choice of research methodology is discussed and how
these choices might have affected the results. Lastly, some suggestions for future
research are pointed out, which could improve the performance and therefore also
move a step closer to predicting the potency of peptides.
5.1 Discussion of results
Here the previously presented results will be discussed. This is done by first focusing
on which aspects of the datasets could affect the results, followed by a discussion of
the results on descriptor level, and lastly analyzing the performance of the machine
learning models.
5.1.1 Datasets
Looking at the MCC for the best combinations for all datasets, the selected models
for datasets A and C outperform the selected model on dataset B with scores of
0.77 and 0.71 respectively, compared to 0.19 for dataset B. Overall, the scores of
F1 and MCC show a much better performance for the models on datasets A and
C than for dataset B. A possible explanation for this could be that the sequences
in dataset B are significantly longer with a length of 501 amino acids, compared
to datasets A and C which has a length of 12 and 9 amino acids, respectively.
Longer sequences increase the complexity of modeling by enlarging the feature space
for the descriptors, except for PseAAC. Representing longer sequences could be
more complicated than representing shorter sequences. The representation for longer
sequences should provide a suitable dimension of the feature space which also is
computationally efficient and contains the most relevant information. Running the
models and tuning hyperparameters of a large dataset containing long sequences
like a dataset B is not time efficient and makes this process more challenging.
Moreover, dataset B is much larger than the other two datasets, with 14 384 samples
in the train and validation sets, compared to 1 420 samples in dataset A and 512
sequences in dataset C. With larger datasets, the machine learning models have more
samples to train on which should be an advantage, compared to smaller datasets.
39
5. Discussion
However, dataset B probably doesn’t have enough variation in the data for the size
to have an impact. As only one up to five positions in the sequences have variations,
which corresponds to a maximum of 1% of the sequence, the difference between
sequences is probably not large enough for the model to learn anything tangible.
The sequences in dataset A also originate from a wild-type sequence, however, the
variety in the sequences is sufficient for the model to reach a better performance. The
reason behind this is that the sequences are much shorter with a length of 12 amino
acids, and therefore when one or two positions mutate, that corresponds to 8.3%
or 16.7% of the sequence which leads to having a greater impact on the measured
property. Furthermore, as mentioned in section 2.2.1 and 3.3.1, SVC together with
PseAAC have shown good performance for long protein sequences similar to dataset
B in the study made by Mohabatkar et al. [28]. In this study, the same parameters
were used, i.e., weight factor = 0.05 and λ = 1. Therefore, a probable reason behind
the poor results of dataset B is the type of sequences.
In dataset C the majority class is the positive class, compared to datasets A and B,
which have the negative class as the majority class. The positive class is the more
important class to predict correctly in this work, as those represent the peptides
with high potency. However as dataset C has this as the majority class, it is easy for
machine learning models to predict these correctly, and detecting the negative class
will be harder. Therefore the recall and precision will automatically be higher for
this dataset, compared to datasets A and B. However, this difference was handled
by analyzing the performance of multiple metrics.
5.1.2 Descriptors
As seen in the results, one-hot representation worked best for all three datasets.
One reason for this could be that, even if all descriptors in this study are sequence-
based, one-hot representation is the one providing complete information about the
order of amino acids in the sequences. Z-scales don’t have this incorporated, and in
PseAAC λ defines the sequence-order effect, therefore it is not as explicit as one-hot
representation. For example, for dataset B the optimal λ was found to be one for
both SVC and random forests, which means that only the pairwise relation with the
immediate neighboring amino acid is considered, and limited information about the
sequence order is incorporated. However, different λs were investigated, and λ = 1
had the best result. For dataset A, the optimal λ was found to be nine and five
for SVC and random forests, respectively. Compared to dataset B, this means that
PseAAC considers more correlation factors and reflects more information about the
sequence order between the amino acids. However, when finding the optimal λ,
the models were not tuned which could have affected the results. With optimized
models, the optimal λs could have been different which would affect the results.
Another advantage of one-hot representation is the size of the feature space. One-
hot representation provides the most features, with a feature space of size L × 20,
where L is the length of sequences. Comparing this to Z-scales’ L×5 and PseAAC’s
20 + λ, the feature space of one-hot representation is significantly greater. As the
machine learning models are provided with more information, it is possible that the
40
5. Discussion
model can learn more complex patterns.
However, one-hot representation doesn’t capture any prior domain knowledge about
amino acids, therefore it was not expected that this descriptor would the best de-
scriptor for all three datasets. One-hot representation only contains information
about the number and the positions of the 20 natural amino acids in the peptide
sequences. Therefore, the descriptor is a very simple representation and was not
expected to get the best result. However, as seen in the feature exploration of Z-
scales and PseAAC, the majority of features followed the same distribution for both
classes and the SHAP values showed that most features had a relatively low impact
on the model’s predictions. This is a possible reason for their slightly worse perfor-
mance since only a few features give an indication of which class a sample belongs
to. Another reason behind PseAAC and Z-scales not having the best performance
could be that the incorporated domain knowledge in these two descriptors is too
general for specific tasks. Since three datasets with significantly different peptides
were used in this work, with three different assays, the descriptors might be too gen-
eral for these datasets. However, as the descriptors have shown promising results
in for example the study by van Westen et al. [16], where Z-scales performed well
for various tasks, it was expected that these would perform better than a simpler,
direct representation.
5.1.3 Machine learning models
As seen in the results, SVC was chosen as the best model for dataset A and random
forests for datasets B and C. The different characteristics of the datasets might be
the reason behind these different results. For dataset A, where all sequences originate
from the same wild-type sequence, with a length of 12 amino acids, random forests
perform overall worse than SVC for all three descriptors. Similarly, for dataset B
where all peptide sequences also originate from a wild-type sequence but are 501
amino acids long, random forests have the lowest MCC and F1 score for all three
descriptors, compared to SVC. However, as mentioned in the results, precision shows
a better score for random forests. For dataset C, where the peptides are shorter,
similar to dataset A, but much more differentiated, random forests have a similar
or higher score than SVC for all descriptors and metrics.
SVC has performed well in a lot of previous research, and for datasets A and B the
model shows the highest MCC for all three descriptors. However, looking at the
precision, random forests perform better, which is of high importance in the drug
discovery process. Establishing which model has superior performance might be a
task-specific and data-dependent decision. Regardless, further research is needed.
5.2 Discussion of methods
The research methodology used in this thesis has had a large influence on the
achieved results. Firstly, three public datasets with various formats were used,
of which two were peptide datasets with shorter sequences and one protein dataset
41
5. Discussion
with longer sequences. The reason for using such various datasets was to see if the
best combination of descriptor and machine learning model had similarities for var-
ious types of peptide sequences. In this thesis, the positive class in dataset B was
chosen to be downsampled to have the same class ratios as dataset A. A more bal-
anced dataset would probably have provided better results in this work. However,
in reality, the positive class is usually the minority class with only a few samples.
Therefore, downsampling the positive class better reflects the reality, and could make
the results more applicable. Using public available datasets have both advantages
and disadvantages. Advantages include for example that similar experiments don’t
have to be performed multiple times, the experiments can be conducted by experts,
but the results can be used by multiple people, and in this work it was time-efficient.
An example of a disadvantage is that the users can’t be sure of the quality of the
data, and there is no way of controlling this. Furthermore, finding a well-annotated
peptide dataset large enough to train a model is difficult.
Secondly, in this thesis, three descriptors, namely Z-scales, PseAAC, and one-hot
representation were implemented. These descriptors were chosen because of good
performances in previous research. From these chosen descriptors, one-hot repre-
sentation showed a better performance and could therefore be the focus of future
research as a molecular representation when using peptides to develop new, potent
drugs. The choice of descriptors certainly has an impact on the final results, and
there might be other descriptors that would have performed better. However, as
mentioned in the introduction, this work had to be limited to the descriptors that
seemed most suitable, as the work otherwise risked becoming too large and complex.
Thirdly, a binary classification task was used in this work to predict the potency
of peptides. However, the aim was first to proceed as a regression task by using
the continuous values in datasets A and B as the dependent variable. However, the
results of the descriptor and model combinations showed very poor performance,
and it seemed as the datasets were not suitable for this kind of task. Therefore, the
continuous values were converted to a positive and negative class, the models were
changed into classification models, and the results improved considerably.
Lastly, the choice of methods, for example normalizing features and tuning hyper-
parameters have also affected the results. When tuning the hyperparameters, MCC
was used as the metric when evaluating the models. This can be evaluated as a
convenient metric for dataset C, as that dataset has a positive majority class, and
therefore looking at precision, recall, or F1 would not be beneficial. However, as the
results for dataset B show, the combinations with higher MCC had a lower preci-
sion, which is disadvantageous when trying to predict new peptides. Therefore, a
possible improvement could be a more thorough investigation of the choice of the
metric by considering task-specific model optimization.
5.3 Future work
The work done in this thesis can be developed further in future research. Three dif-
ferent parts can be of interest, namely additional descriptors, other machine learning
42
5. Discussion
models, and further improvements to existing work. Since only three descriptors
have been implemented in this thesis, other existing descriptors, such as structure-
based descriptors, can be used as well. Another interesting proposal would be to
combine the descriptors used in this thesis in different ways to see if the results would
improve or get worse when using these as input to the machine learning models.
Even if SVC and random forests have shown good performance in previous research,
other models can be implemented as well. Particularly neural networks are of interest
as many studies have shown that this type of model works well for predicting different
aspects of peptides. It would therefore be of interest to investigate how well these
would perform compared to the existing results. For example, since the positional
information was seen to improve the model performance, a deep learning model such
as CNN can allow the model to capture the long-range patterns in the data.
Additional parts which might improve the results and can be considered for future
research are for example choosing another type of method when tuning hyperparam-
eters. In this work Gridsearch was used, however, there are other options as well,
such as random search. Using another type of tuning method would use other ways
of trying potential values and therefore other optimal values could be found to im-
prove the predictive results. Moreover, when generating PseAAC in this work, only
λ is tuned, however, the second parameter, weight factor, can be optimized as well.
Finding the optimal value for this parameter could also have an improvement on
the results for this descriptor. Lastly, implementing feature selection to the models
could improve the results. This means exploring which features are the most impor-
tant and have the biggest impact on the model, and only using these features when
predicting. This can improve the results of the models as unnecessary features which
might have a negative impact on the predictions would be removed. By exploring
the most important features, additional information which could help explain the
results would also be gathered. This information can then be used in other parts of
the drug discovery process and improve the exploration of future peptides.
43
5. Discussion
44
6
Conclusion
Being able to predict the potency of peptides with the help of machine learning
would be a major advantage in the discovery of new peptide drugs. In this thesis,
we have explored different combinations of machine learning models and descriptors
on different kinds of peptides and proteins found in three public datasets, to predict
the potency of peptides. In particular, we implemented three sequence-based de-
scriptors, namely Z-scales, PseAAC, and one-hot representation. To evaluate these
descriptors, they were combined with two different machine learning models, SVC
and random forests. The combinations were evaluated by five metrics, accuracy,
recall, precision, F1, and MCC.
Of the descriptors, one-hot representation was found to perform best on all three
datasets. A reason behind this could be that the position has a significant impact
in the explored datasets and can be even more prominent than the amino acid
information. Another reason could be that one-hot representation has the largest
feature space, providing the machine learning models with the most information.
From SVC and random forests, no model outperform the other since it depends
on the type of data. SVC performed better for dataset A, while random forests
were chosen as the best model for datasets B and C. This thesis concludes that
more research has to be conducted to investigate if a general descriptor and model
combination can be found to work well for various kinds of peptides.
45
6. Conclusion
46
Bibliography
[1] K.-C. Chou and H.-B. Shen, “Recent progress in protein subcellular location
prediction.” Analytical biochemistry, vol. 370, no. 1, pp. 1–16, 2007.
[2] J. P. Hughes, S. Rees, S. B. Kalindjian, and K. L. Philpott, “Principles of early
drug discovery,” British journal of pharmacology, vol. 162, no. 6, pp. 1239–1249,
2011.
[3] R. C. Mohs and N. H. Greig, “Drug discovery and development: Role of ba-
sic biological research,” Alzheimer’s & Dementia: Translational Research &
Clinical Interventions, vol. 3, no. 4, pp. 651–657, 2017.
[4] A. B. Deore, J. R. Dhumane, R. Wagh, and R. Sonawane, “The stages of drug
discovery and development process,” Asian Journal of Pharmaceutical Research
and Development, vol. 7, no. 6, pp. 62–67, 2019.
[5] M. Muttenthaler, G. F. King, D. J. Adams, and P. F. Alewood, “Trends in
peptide drug discovery,” Nature Reviews Drug Discovery, vol. 20, no. 4, pp.
309–325, 2021.
[6] (2019) Impact story: Developing the tools to evaluate complex drug
products: Peptides. https://www.fda.gov/drugs/regulatory-science-action/
impact-story-developing-tools-evaluate-complex-drug-products-peptides. Ac-
cessed on: 2022-02-01.
[7] European food safety authority: Glossary. https://www.efsa.europa.eu/en/
glossary-taxonomy-terms/i. Accessed on: 2022-06-16.
[8] H. Van De Waterbeemd and E. Gifford, “Admet in silico modelling: towards
prediction paradise?” Nature reviews Drug discovery, vol. 2, no. 3, pp. 192–204,
2003.
[9] S. Spänig and D. Heider, “Encodings and models for antimicrobial peptide
classification for multi-resistant pathogens,” BioData Mining, vol. 12, no. 1,
pp. 1–29, 2019.
[10] E. C. L. de Oliveira, K. Santana, L. Josino, L. e Lima, A. Henrique, and
C. de Souza de Sales Júnior, “Predicting cell-penetrating peptides using ma-
47
Bibliography
chine learning algorithms and navigating in their chemical space,” Scientific
reports, vol. 11, no. 1, pp. 1–15, 2021.
[11] K.-C. Chou, “Prediction of protein cellular attributes using pseudo-amino acid
composition,” Proteins: Structure, Function, and Bioinformatics, vol. 43, no. 3,
pp. 246–255, 2001.
[12] S. Gelman, S. A. Fahlberg, P. Heinzelman, P. A. Romero, and A. Gitter, “Neural
networks to learn protein sequence–function relationships from deep mutational
scanning data,” Proceedings of the National Academy of Sciences, vol. 118,
no. 48, 2021.
[13] G. J. van Westen, R. F. Swier, J. K. Wegner, A. P. IJzerman, H. W. van Vlij-
men, and A. Bender, “Benchmarking of protein descriptor sets in proteochemo-
metric modeling (part 1): comparative study of 13 amino acid descriptor sets,”
Journal of cheminformatics, vol. 5, no. 1, pp. 1–11, 2013.
[14] S. Hellberg, M. Sjoestroem, B. Skagerberg, and S. Wold, “Peptide quantitative
structure-activity relationships, a multivariate approach,” Journal of medicinal
chemistry, vol. 30, no. 7, pp. 1126–1135, 1987.
[15] M. Sandberg, L. Eriksson, J. Jonsson, M. Sjöström, and S. Wold, “New chemical
descriptors relevant for the design of biologically active peptides. a multivariate
characterization of 87 amino acids,” Journal of medicinal chemistry, vol. 41,
no. 14, pp. 2481–2491, 1998.
[16] G. J. vanWesten, R. F. Swier, I. Cortes-Ciriano, J. K. Wegner, J. P. Overington,
A. P. IJzerman, H. W. van Vlijmen, and A. Bender, “Benchmarking of protein
descriptor sets in proteochemometric modeling (part 2): modeling performance
of 13 amino acid descriptor sets,” Journal of cheminformatics, vol. 5, no. 1, pp.
1–20, 2013.
[17] H. Mohabatkar, M. Mohammad Beigi, K. Abdolahi, and S. Mohsenzadeh,
“Prediction of allergenic proteins by means of the concept of chou’s pseudo
amino acid composition and a machine learning approach,” Medicinal Chem-
istry, vol. 9, no. 1, pp. 133–137, 2013.
[18] Y. Fang, Y. Guo, Y. Feng, and M. Li, “Predicting dna-binding proteins: ap-
proached from chou’s pseudo amino acid composition and other specific se-
quence features,” Amino acids, vol. 34, no. 1, pp. 103–109, 2008.
[19] X.-B. Zhou, C. Chen, Z.-C. Li, and X.-Y. Zou, “Using chou’s amphiphilic
pseudo-amino acid composition and support vector machine for prediction of
enzyme subfamily classes,” Journal of theoretical biology, vol. 248, no. 3, pp.
546–551, 2007.
[20] L. S. Shapley, “A value for n-person games,” in Contributions to the Theory
of Games II, H. W. Kuhn and A. W. Tucker, Eds. Princeton: Princeton
University Press, 1953, pp. 307–317.
48
Bibliography
[21] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model pre-
dictions,” Advances in neural information processing systems, vol. 30, 2017.
[22] A. B. Parsa, A. Movahedi, H. Taghipour, S. Derrible, and A. K. Mohammadian,
“Toward safer highways, application of xgboost and shap for real-time accident
detection and feature analysis,” Accident Analysis & Prevention, vol. 136, p.
105405, 2020.
[23] S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair,
R. Katz, J. Himmelfarb, N. Bansal, and S.-I. Lee, “From local explanations
to global understanding with explainable ai for trees,” Nature machine intelli-
gence, vol. 2, no. 1, pp. 56–67, 2020.
[24] S. Lessmann, R. Stahlbock, and S. F. Crone, “Optimizing hyperparameters of
support vector machines by genetic algorithms.” in IC-AI, 2005, pp. 74–82.
[25] N. Cristianini, J. Shawe-Taylor et al., An introduction to support vector ma-
chines and other kernel-based learning methods. Cambridge university press,
2000.
[26] J. Shen, F. Cheng, Y. Xu, W. Li, and Y. Tang, “Estimation of adme properties
with substructure pattern recognition,” Journal of chemical information and
modeling, vol. 50, no. 6, pp. 1034–1041, 2010.
[27] H. Daumé, A course in machine learning. Hal Daumé III, 2017.
[28] H. Mohabatkar, M. M. Beigi, and A. Esmaeili, “Prediction of gabaa receptor
proteins using the concept of chou’s pseudo-amino acid composition and support
vector machine,” Journal of Theoretical Biology, vol. 281, no. 1, pp. 18–23, 2011.
[29] S. Lata, N. K. Mishra, and G. P. Raghava, “Antibp2: improved version of
antibacterial peptide prediction,” BMC bioinformatics, vol. 11, no. 1, pp. 1–7,
2010.
[30] A. Lindholm, N. Wahlström, F. Lindsten, and T. B. Schön, Machine Learning
- A First Course for Engineers and Scientists. Cambridge University Press,
2022. [Online]. Available: https://smlbook.org
[31] J. N. Dybowski, M. Riemenschneider, S. Hauke, M. Pyka, J. Verheyen, D. Hoff-
mann, and D. Heider, “Improved bevirimat resistance prediction by combina-
tion of structural and sequence-based classifiers,” BioData mining, vol. 4, no. 1,
pp. 1–13, 2011.
[32] P. Bhadra, J. Yan, J. Li, S. Fong, and S. W. Siu, “Ampep: Sequence-based
prediction of antimicrobial peptides using distribution patterns of amino acid
properties and random forest,” Scientific reports, vol. 8, no. 1, pp. 1–10, 2018.
[33] Z.-H. You, K. C. Chan, and P. Hu, “Predicting protein-protein interactions from
primary protein sequences using a novel multi-scale local feature representation
scheme and the random forest,” PloS one, vol. 10, no. 5, p. e0125811, 2015.
49
Bibliography
[34] M. Feurer and F. Hutter, “Hyperparameter optimization,” in Automated ma-
chine learning. Springer, Cham, 2019, pp. 3–33.
[35] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,
D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Ma-
chine learning in Python,” Journal of Machine Learning Research, vol. 12, pp.
2825–2830, 2011.
[36] P. Probst, M. N. Wright, and A.-L. Boulesteix, “Hyperparameters and tuning
strategies for random forest,”Wiley Interdisciplinary Reviews: data mining and
knowledge discovery, vol. 9, no. 3, p. e1301, 2019.
[37] D. Chicco and G. Jurman, “The advantages of the matthews correlation co-
efficient (mcc) over f1 score and accuracy in binary classification evaluation,”
BMC genomics, vol. 21, no. 1, pp. 1–13, 2020.
[38] M. van Rosmalen, B. M. Janssen, N. M. Hendrikse, A. J. van der Linden,
P. A. Pieters, D. Wanders, T. F. de Greef, and M. Merkx, “Affinity maturation
of a cyclic peptide handle for therapeutic antibodies using deep mutational
scanning,” Journal of Biological Chemistry, vol. 292, no. 4, pp. 1477–1489,
2017.
[39] X. S. Wang, P.-H. C. Chen, J. T. Hampton, J. M. Tharp, C. A. Reed, S. K. Das,
D.-S. Wang, H. S. Hayatshahi, Y. Shen, J. Liu et al., “A genetically encoded,
phage-displayed cyclic-peptide library,” Angewandte Chemie, vol. 131, no. 44,
pp. 16 051–16 056, 2019.
[40] P. A. Romero, T. M. Tran, and A. R. Abate, “Dissecting enzyme function
with microfluidic-based deep mutational scanning,” Proceedings of the National
Academy of Sciences, vol. 112, no. 23, pp. 7159–7164, 2015.
[41] A. F. Rubin, H. Gelman, N. Lucas, S. M. Bajjalieh, A. T. Papenfuss, T. P.
Speed, and D. M. Fowler, “A statistical framework for analyzing deep muta-
tional scanning data,” Genome biology, vol. 18, no. 1, pp. 1–15, 2017.
[42] Free epitope database and prediction resource. http://www.iedb.org/. Accessed
on: 2022-03-21.
[43] S. N. H. Bukhari, A. Jain, E. Haq, M. A. Khder, R. Neware, J. Bhola, and
M. Lari Najafi, “Machine learning-based ensemble model for zika virus t-cell
epitope prediction,” Journal of Healthcare Engineering, vol. 2021, 2021.
[44] B. Chen, M. S. Khodadoust, N. Olsson, L. E. Wagar, E. Fast, C. L. Liu,
Y. Muftuoglu, B. J. Sworder, M. Diehn, R. Levy et al., “Predicting hla class ii
antigen presentation through integrated deep learning,” Nature biotechnology,
vol. 37, no. 11, pp. 1332–1343, 2019.
[45] T. Liu, K. Shi, and W. Li, “Deep learning methods improve linear b-cell epitope
prediction,” BioData mining, vol. 13, no. 1, pp. 1–13, 2020.
50
Bibliography
[46] D.-S. Cao, Q.-S. Xu, and Y.-Z. Liang, “propy: a tool to generate various modes
of chou’s pseaac,” Bioinformatics, vol. 29, no. 7, pp. 960–962, 2013.
[47] H.-B. Shen and K.-C. Chou, “Pseaac: a flexible web server for generating vari-
ous kinds of protein pseudo amino acid composition,” Analytical biochemistry,
vol. 373, no. 2, pp. 386–388, 2008.
[48] D. Osorio, P. Rondón-Villarreal, and R. Torres, “Peptides: a package for data
mining of antimicrobial peptides,” Small, vol. 12, pp. 44–444, 2015.
[49] H. ElAbd, Y. Bromberg, A. Hoarfrost, T. Lenz, A. Franke, and M. Wendorff,
“Amino acid encoding for deep learning applications,” BMC bioinformatics,
vol. 21, no. 1, pp. 1–14, 2020.
51
Bibliography
52
A
Feature exploration
A.1 Z-scales
A.1.1 Dataset A
a b
0 01.00 1 1.0 1
0.75
0.50 0.5
0.25
0.00 1.0 0.5 0.0 0.5 1.0 1.5 0.0 1.0 0.5 0.0 0.5 1.0
z1 z2
(a) Z-scale 1 (b) Z-scale 2
c d
0 2.0 0
2 1 11.5
1.0
1
0.5
0 1.0 0.5 0.0 0.5 0.0 0.5 0.0 0.5 1.0
z3 z4
(c) Z-scale 3 (d) Z-scale 4
e
3 0
1
2
1
0 0.8 0.6 0.4 0.2 0.0 0.2
z5
(e) Z-scale 5
Figure A.1: Distributions of the average z-values for each class in dataset A.
I
Density
Density
Density
Density Density
A. Feature exploration
z3
z2
z5
z1
z4 Class 1
Class 0
0.00 0.02 0.04 0.06 0.08 0.10
mean(|SHAP value|) (average impact on model output magnitude)
Figure A.2: Summary plot for the average z-values in dataset A.
A.1.2 Dataset B
a b
0
1 60
0
40 1
40
20
20
0 0.24 0.26 0.28 0.30 0 0.51 0.50 0.49 0.48 0.47 0.46
z1 z2
(a) Z-scale 1 (b) Z-scale 2
c d
0 80 0
60 1 60 1
40 40
20 20
0 0.12 0.10 0.08 0.06 0 0.26 0.25 0.24 0.23 0.22
z3 z4
(c) Z-scale 3 (d) Z-scale 4
e
0
100 1
50
0 0.330 0.335 0.340 0.345 0.350
z5
(e) Z-scale 5
Figure A.3: Distributions of the average z-values for each class in dataset B.
II
Density Density
Density
Density Density
A. Feature exploration
z5
z1
z2
z3
z4 Class 0
Class 1
0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08
mean(|SHAP value|) (average impact on model output magnitude)
Figure A.4: Summary plot for the average z-values in dataset B.
A.1.3 Dataset C
a b
0.6
0.4 0 01 1
0.3 0.4
0.2
0.2
0.1
0.0 4 2 0 2 0.0 2 0 2
z1 z2
(a) Z-scale 1 (b) Z-scale 2
c d
0.8 0 0.8 0
1 1
0.6 0.6
0.4 0.4
0.2 0.2
0.0 2 1 0 1 2 0.0 2 1 0 1
z3 z4
(c) Z-scale 3 (d) Z-scale 4
e
1.5 0
1
1.0
0.5
0.0 1 0 1 2
z5
(e) Z-scale 5
Figure A.5: Distributions of the average z-values for each class in dataset C.
III
Density Density
Density
Density Density
A. Feature exploration
z1
z3
z2
z5
z4 Class 0
Class 1
0.00 0.05 0.10 0.15 0.20 0.25 0.30
mean(|SHAP value|) (average impact on model output magnitude)
Figure A.6: Summary plot for the average z-values in dataset C.
A.2 PseAAC
A.2.1 Dataset A
IV
A. Feature exploration
a b
0.4 0 01 0.8 1
0.3 0.6
0.2 0.4
0.1 0.2
0.0 4 6 8 10 0.0 3 4 5 6 7 8
PAAC21 PAAC22
(a) PAAC21 (b) PAAC22
c d
0.6 0 0.6 0
1 1
0.4 0.4
0.2 0.2
0.0 5 6 7 8 9 10 0.0 6 8 10
PAAC23 PAAC24
(c) PAAC23 (d) PAAC24
e f
0 0.4 0
0.4 1 10.3
0.2
0.2
0.1
0.0 4 6 8 0.0 4 6 8 10
PAAC25 PAAC26
(e) PAAC25 (f) PAAC26
g h
0.6 0 0.6 0
1 1
0.4 0.4
0.2 0.2
0.0 2 4 6 8 10 0.0 6 8 10 12
PAAC27 PAAC28
(g) PAAC27 (h) PAAC28
i
1.5 0
1
1.0
0.5
0.0 0 2 4 6 8
PAAC29
(i) PAAC29
V
Figure A.7: Distributions of the pseudo components for each class in dataset A
with optimum lambda for SVC.
Density Density Density Density
Density
Density Density Density Density
A. Feature exploration
a b
0.6
0 0
0.3 1 1
0.4
0.2
0.1 0.2
0.0 6 8 10 12 14 0.0 4 6 8 10
PAAC21 PAAC22
(a) PAAC21 (b) PAAC22
c d
0.6 0 0.6 0
1 1
0.4 0.4
0.2 0.2
0.0 8 10 12 0.0 6 8 10 12
PAAC23 PAAC24
(c) PAAC23 (d) PAAC24
e
0.4 0
1
0.3
0.2
0.1
0.0 4 6 8 10
PAAC25
(e) PAAC25
Figure A.8: Distributions of the pseudo components for each class in dataset A
with optimum lambda for RFC.
VI
Density Density
Density
Density Density
A. Feature exploration
PAAC16
PAAC7
PAAC20
PAAC10
PAAC17
PAAC24
PAAC5
PAAC25
PAAC23
PAAC21
PAAC22
PAAC2
PAAC14
PAAC13
PAAC15
PAAC11
PAAC3
PAAC12
PAAC4 Class 0
PAAC8 Class 1
0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16
mean(|SHAP value|) (average impact on model output magnitude)
Figure A.9: Summary plot for the pseudo components in dataset A.
VII
A. Feature exploration
A.2.2 Dataset B
a
14
0
12 1
10
8
6
4
2
0 9.7 9.8 9.9 10.0 10.1
(a) PAPAACA2C121
Figure A.10: Distributions of the pseudo components for each class in dataset B.
PAAC1
PAAC4
PAAC17
PAAC2
PAAC9
PAAC16
PAAC8
PAAC14
PAAC11
PAAC15
PAAC20
PAAC6
PAAC3
PAAC10
PAAC12
PAAC7
PAAC21
PAAC18
PAAC19 Class 0
PAAC13 Class 1
0.00 0.01 0.02 0.03 0.04 0.05 0.06
mean(|SHAP value|) (average impact on model output magnitude)
Figure A.11: Summary plot for the pseudo components in dataset B.
A.2.3 Dataset C
VIII
Density
A. Feature exploration
a b
0.20 0 0
1 0.20 1
0.15 0.15
0.10 0.10
0.05 0.05
0.00 0 5 10 15 0.00 0 5 10 15
PAAC21 PAAC22
(a) PAAC21 (b) PAAC22
c d
0 0.15 0
0.15 1 1
0.10
0.10
0.05 0.05
0.00 0 5 10 15 0.00 0 5 10 15
PAAC23 PAAC24
(c) PAAC23 (d) PAAC24
e
0.15 0
1
0.10
0.05
0.00 0 5 10 15
PAAC25
(e) PAAC25
Figure A.12: Distributions of the pseudo components for each class in dataset C
with optimum lambda for SVC.
a
0.15 0
1
0.10
0.05
0.00 0 10 20
(a) PAPAACA2C121
Figure A.13: Distributions of the pseudo components for each class in dataset C
with optimum lambda for RFC.
IX
Density Density
Density
Density
Density Density
A. Feature exploration
PAAC11
PAAC20
PAAC6
PAAC7
PAAC13
PAAC21
PAAC10
PAAC15
PAAC8
PAAC2
PAAC12
PAAC3
PAAC16
PAAC1
PAAC4
PAAC17
PAAC14
PAAC9
PAAC5 Class 1
PAAC19 Class 0
0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14
mean(|SHAP value|) (average impact on model output magnitude)
Figure A.14: Summary plot for the pseudo components in dataset C.
X
B
Confusion matrices
B.1 Dataset A
100 100
0 106 9 75 0 113 2 75
50 50
1 5 22 25 1 13 14 25
0 1 0 1
Predicted label Predicted label
(a) Confusion matrix: Z-scales with SVC (left) and random forests (right)
80 100
0 94 21 0 110 5
60 75
40 50
1 1 26 20 1 16 11 25
0 1 0 1
Predicted label Predicted label
(b) Confusion matrix: PseAAC with SVC (left) and random forests (right)
100 100
0 110 5 0 110 5
75 75
50 50
1 5 22 25 1 13 14 25
0 1 0 1
Predicted label Predicted label
(c) Confusion matrix: one-hot representation with SVC (left) and random forests
(right)
Figure B.1: Confusion matrix for each descriptor-model combination for dataset
A.
XI
True label True label True label
True label True label True label
B. Confusion matrices
B.2 Dataset B
1000
0 939 322
800
0 1169 92
600 750
400 500
1 96 82 1 155 23
200 250
0 1 0 1
Predicted label Predicted label
(a) Confusion matrix: Z-scales with SVC (left) and random forests (right)
600 1000
0 616 645 0 1128 133
400 750
500
1 45 133 200 1 144 34 250
0 1 0 1
Predicted label Predicted label
(b) Confusion matrix: PseAAC with SVC (left) and random forests (right)
800
0 440 821 600 0 1192 69
1000
750
400 500
1 15 163 200 1 159 19 250
0 1 0 1
Predicted label Predicted label
(c) Confusion matrix: one-hot representation with SVC (left) and random forests
(right)
Figure B.2: Confusion matrix for each descriptor-model combination for dataset
B.
XII
True label True label True label
True label True label True label
B. Confusion matrices
B.3 Dataset C
0 4 11 30 0 6 9 30
20 20
1 1 36 10 1 1 36 10
0 1 0 1
Predicted label Predicted label
(a) Confusion matrix: Z-scales with SVC (left) and random forests (right)
25 30
0 7 8 0 4 11
20 20
15
1 9 28 1 2 35 10
10
0 1 0 1
Predicted label Predicted label
(b) Confusion matrix: PseAAC with SVC (left) and random forests (right)
30
0 7 8 0 10 5 30
20 20
1 2 35 10 1 1 36 10
0 1 0 1
Predicted label Predicted label
(c) Confusion matrix: one-hot representation with SVC (left) and random forests
(right)
Figure B.3: Confusion matrix for each descriptor-model combination for dataset
C.
XIII
True label True label True label
True label True label True label