Unsupervised methods for integrative data analysis
Abstract
Unsupervised data analysis methods are important for data exploration to introduce structure, reduce data dimensions, or extract interpretable knowledge. Integrative analysis of two or more data sets is crucial to gain understanding of local and global effects within and across data sources. Recent technological advancements in large-scale collection of single-cell data require efficient and scalable methods to process the increasing size of available data. Integration of data sources with secondary data, also known as side information, can improve prediction of missing data and is important for recommender systems. However, many existing methods cannot accommodate the scale or complexity of available data. Therefore, there is a need for new methods for unsupervised integrative data analysis that scale well with input data size, can be applied efficiently, and provide flexible support for complex data input.
In Paper I, a novel scalable method is proposed which integrates gene clustering of single-cell data with selection of cluster-specific gene regulators that have sign-consistent correlation and therefore a well-defined effect within each cluster. An efficient alternating two-step algorithm for parameter estimation is developed, along with criteria for optimal hyperparameter and cluster count selection. Applications to single-cell data demonstrate the method's capability to identify regulators of intratumoral heterogeneity, primarily in neural cancers.
In Paper II, a low-rank matrix factorization model is proposed which allows flexible integration of input data sources and produces interpretable estimates of orthogonal latent factors. Parameter estimation is performed efficiently within an ADMM framework, and its convergence theory is extended to support embedded manifold constraints such as orthogonality. Simulation studies show that the method performs well in comparison to established methods and demonstrate the importance of supporting flexible data input layouts.
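As a rough illustration of what an orthogonality constraint involves computationally, the sketch below projects an arbitrary matrix onto the set of matrices with orthonormal columns using a thin SVD. This is a standard building block in constrained factorization and is not taken from Paper II; the function name `project_to_orthonormal_columns` and the example dimensions are assumptions made for illustration only.

```python
import numpy as np


def project_to_orthonormal_columns(A):
    """Return the matrix with orthonormal columns closest to A in Frobenius norm.

    Hypothetical helper for illustration; it is not the solrCMF update step.
    """
    # Thin SVD: A = U diag(s) Vt. Dropping the singular values yields U @ Vt,
    # the orthogonal factor of the polar decomposition of A.
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt


rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))  # arbitrary tall matrix (assumed sizes)
Q = project_to_orthonormal_columns(A)

# Deviation of Q^T Q from the identity; it should be at rounding-error level.
gram_error = np.linalg.norm(Q.T @ Q - np.eye(3))
```

A projection of this kind is one common way to keep factor estimates on the orthogonality manifold between iterations of an iterative solver; how the thesis handles the constraint inside ADMM is described in Paper II itself.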
The lack of scalable flexible matrix integration methods is addressed in Paper III by reformulating the data integration problem as a graph estimation problem. A novel algorithm is proposed, using matrix denoising and the asymptotic geometry of singular vectors in noise-perturbed low-rank matrices, to perform estimation within the graphical framework. Simulation studies demonstrate the method's high scalability in comparison to established methods.
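The geometric intuition can be sketched on simulated data: when two noisy low-rank matrices share a latent factor, their leading singular vectors along the shared dimension are nearly aligned, whereas singular vectors of unrelated factors are close to orthogonal. The example below is only a toy illustration under assumed settings (rank one, Gaussian noise, a strong signal, and the hypothetical helpers `unit`, `noisy_rank_one`, and `leading_left_singular_vector`); it is not the algorithm proposed in Paper III.

```python
import numpy as np


def unit(v):
    """Scale a vector to unit Euclidean norm."""
    return v / np.linalg.norm(v)


def leading_left_singular_vector(X):
    """Leading left singular vector of X, a simple rank-one denoised direction."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, 0]


rng = np.random.default_rng(1)
n = 200          # size of the shared dimension (assumed)
signal = 200.0   # singular value of the signal, well above the noise level


def noisy_rank_one(u, p):
    """Simulate signal * u v^T plus standard Gaussian noise, with random unit v."""
    v = unit(rng.standard_normal(p))
    return signal * np.outer(u, v) + rng.standard_normal((len(u), p))


u_shared = unit(rng.standard_normal(n))
u_other = unit(rng.standard_normal(n))

X1 = noisy_rank_one(u_shared, 100)  # shares its row factor with X2
X2 = noisy_rank_one(u_shared, 80)
X3 = noisy_rank_one(u_other, 90)    # unrelated row factor

u1, u2, u3 = (leading_left_singular_vector(X) for X in (X1, X2, X3))

# Absolute cosines, because singular vectors are only defined up to sign.
cos_shared = abs(u1 @ u2)
cos_unrelated = abs(u1 @ u3)
```

In this toy setting `cos_shared` is close to one and `cos_unrelated` is close to zero, so thresholding such cosines would link X1 and X2 but not X3. Choosing that threshold in a principled way is where asymptotic results on singular vectors of noise-perturbed matrices become relevant.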
Software packages with easy-to-use interfaces for each paper are publicly available. The methods presented in this thesis contribute to the development of efficient, flexible, and scalable unsupervised methods for integrative data analysis.
Parts of work
Paper I: Larsson I, Held F, Popova G, Koc A, Kundu S, Jörnsten R, Nelander S. Reconstructing regulatory programs underlying intratumoral heterogeneity and plasticity of cancer using scregclust. https://doi.org/10.1101/2023.03.10.532041
Paper II: Held F, Lindbäck J, Jörnsten R. Sparse and Orthogonal Low-rank Collective Matrix Factorization (solrCMF): Efficient data integration in flexible layouts. https://doi.org/10.48550/arXiv.2405.10067
Paper III: Held F. Large-scale Data Integration using Matrix Denoising and Geometric Factor Matching. https://doi.org/10.48550/arXiv.2405.10036
Degree
Doctor of Philosophy
University
University of Gothenburg. Faculty of Science.
Institution
Department of Mathematical Sciences ; Institutionen för matematiska vetenskaper
Disputation
Friday, 7 June 2024, 13:15, Lecture hall Pascal, Matematiska Vetenskaper, Hörsalsvägen 1
Date of defence
2024-06-07
felix.held@gu.se
Date
2024-05-17
Author
Held, Felix
Keywords
clustering of regression models
low-rank matrix factorization
penalized optimization
ADMM with multi-affine constraints
orthogonality constraints
flexible data layouts
graph structure estimation
scalability
Publication type
Doctoral thesis
ISBN
978-91-8069-599-2 (print)
978-91-8069-600-5 (PDF)
Language
eng