An exploratory field study on the use of data management and data quality requirements in ML-enabled software applied in environmental research
Abstract
Integrating machine learning into environmental science has shown great promise in improving research outcomes. However, the effective application of machine learning and the reliability of its results depend heavily on data quality and data management practices, which are often overlooked or addressed inconsistently. A sound data pipeline therefore needs to incorporate good practices for data quality and data management. This thesis introduces SPADES-ML (Scientific Pipeline Assessment and Data-Centric Evaluation Scorecard for Machine Learning), a structured assessment framework developed to evaluate the quality and transparency of data-related practices in machine learning-based research. SPADES-ML is demonstrated through a case study of machine learning-based environmental research.
A total of 28 research papers were analysed using SPADES-ML. The framework was applied to assess five critical areas: data selection and suitability, data quality, adherence to the FAIR principles, data preprocessing, and challenges in preprocessing. A survey targeting practitioners in machine learning-based environmental research was conducted to validate the findings. Results from the literature and survey analyses revealed recurring challenges in ensuring data quality, reproducibility, and methodological excellence. Furthermore, this study provides initial recommendations for improving data practices in machine learning-based research by adhering to software engineering principles. This thesis contributes to the emerging field of research software engineering by offering a structured evaluation and guidelines for robust methodology pipelines in interdisciplinary, machine learning-based research.
Degree
Student essay
Date
2025-10-07
Author
Mahagamarachchi, Devasinghage Sara Nirmani
Pamali Chathurika, Hikkaduwa Liyanage
Keywords
Data-Centric Evaluation
Data Management
Data Quality
Data Quality Challenges
Environmental Research
FAIR
Machine Learning
Methodological Guidelines
Software Engineering
SPADES-ML
Language
eng