Detecting Money Laundering with Machine Learning
Abstract
This thesis investigates how well machine learning models such as logistic regression, random forest and XGBoost detect money laundering transactions, and which features are most influential in the models. The dataset is synthetic, provided by IBM, and consists of 5,073,167 transactions. A common challenge in money laundering analysis is class imbalance, where the classes of interest have significantly different sample sizes: illicit transactions are rare events, whereas legitimate transactions are overrepresented, which makes it difficult for machine learning models to detect the illicit ones. This imbalance can negatively affect both model performance and evaluation, as the models can become biased toward predicting the majority class. To address the imbalance, improve model performance and ensure a reliable evaluation, the thesis applies feature engineering, L2 regularization, resampling techniques such as stratified k-fold cross-validation, and cost-sensitive learning. For evaluation, a confusion matrix is used, providing insight into the true positives, true negatives, false positives and false negatives. These values form the foundation for evaluation metrics such as recall, precision and F1-score, which are better suited to analyses involving class imbalance. Using metrics not suited to imbalanced classification may lead to misleading results. The results show that class imbalance remains a significant challenge. The models achieve high recall at the expense of low precision, resulting in a large number of false positives. Random forest and XGBoost perform better than logistic regression in detecting illicit transactions. Random forest produces more balanced classification results, while XGBoost demonstrates stronger overall performance by correctly classifying a larger proportion of both illicit and legitimate transactions.
Despite these results, the models still lack sufficient precision for practical implementation, which highlights the importance of properly addressing imbalanced data in anti-money laundering (AML) detection. In practice, machine learning models may incorrectly flag legitimate transactions as money laundering or classify illicit transactions as legitimate, both of which can impose operational and financial costs on financial institutions and regulatory bodies.
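The point about evaluation metrics can be illustrated with a minimal Python sketch that derives accuracy, precision, recall and F1-score from the four confusion matrix cells. The class name `ConfusionMatrix` and all counts below are hypothetical and chosen only for illustration; they are not the thesis results and are not taken from the IBM dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Hypothetical helper holding the four cells of a binary confusion matrix."""
    tp: int  # illicit transactions correctly flagged
    fp: int  # legitimate transactions wrongly flagged
    fn: int  # illicit transactions missed
    tn: int  # legitimate transactions correctly passed

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0

    def precision(self) -> float:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    def recall(self) -> float:
        actual_illicit = self.tp + self.fn
        return self.tp / actual_illicit if actual_illicit else 0.0

    def f1(self) -> float:
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0


# Assumed scenario: 1,000 illicit transactions among 1,000,000 (made-up numbers).
# A model that labels every transaction as legitimate:
always_legit = ConfusionMatrix(tp=0, fp=0, fn=1_000, tn=999_000)
# A model with high recall but low precision:
high_recall = ConfusionMatrix(tp=900, fp=17_100, fn=100, tn=981_900)

for name, cm in [("always_legit", always_legit), ("high_recall", high_recall)]:
    print(
        f"{name}: accuracy={cm.accuracy():.4f} precision={cm.precision():.4f} "
        f"recall={cm.recall():.4f} f1={cm.f1():.4f}"
    )
```

Under these assumed counts, the model that never flags anything reaches an accuracy of 0.999 with a recall of 0, while the high-recall model catches 90% of illicit transactions but is correct in only 5% of its alerts. This is the pattern the abstract describes: accuracy hides the failure on the minority class, and high recall can coexist with a precision too low for practical use.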