Applying Machine Learning to High-Dimensional Proteomics Datasets for Biomarker Discovery in Neurodegenerative Disorders
Abstract
Identifying biomarkers for Alzheimer’s Disease (AD), a neurodegenerative disorder characterized by progressive cognitive decline, is crucial for early diagnosis and treatment. This thesis explores proteomic abundances along the AD continuum using lumbar and ventricular cerebrospinal fluid (CSF) samples from patients with idiopathic normal pressure hydrocephalus (iNPH) to identify potential new biomarkers. Our study emphasizes the necessity of treating lumbar and ventricular CSF samples as separate datasets due to their distinct proteomic profiles. Challenges such as high-dimensional data with missing values, small sample sizes and class imbalance were addressed through imputation, oversampling and k-fold cross-validation. We discuss the presence and consequences of batch effects, an artifact of the tandem mass tag (TMT) mass spectrometry technique. Comparative analysis based on staging by existing biomarkers highlights the uniqueness of the dataset provided by Sahlgrenska University Hospital. Through machine learning and feature selection techniques, we propose eight protein and nine peptide biomarkers for distinguishing iNPH patients along the pathological AD spectrum. One of these biomarkers is relevant in both lumbar and ventricular CSF. Despite the study’s limited cohort size, our findings contribute insights into the proteomic analysis of neurodegenerative disorders.