An exploratory field study on the use of data management and data quality requirements in ML-enabled software applied in environmental research
Master's Thesis in Computer Science and Engineering
Devasinghage Sara Nirmani Mahagamarachchi
Hikkaduwa Liyanage Pamali Chathurika
Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2025

Master's Thesis 2025
© Devasinghage Sara Nirmani Mahagamarachchi, 2025.
© Hikkaduwa Liyanage Pamali Chathurika, 2025.
Supervisor: Hans-Martin Heyn, Department of Computer Science and Engineering
Supervisor: Yi Peng, Department of Computer Science and Engineering
Examiner: Eric Knauss, Department of Computer Science and Engineering
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Typeset in LaTeX
Gothenburg, Sweden 2025

Abstract
Integrating machine learning into environmental science has shown great promise in improving research outcomes. However, the effective application of machine learning and the reliability of the results depend heavily on data quality and management practices, which are often overlooked or addressed inconsistently. A proper data pipeline that embeds good practices for data quality and data management is therefore essential. This thesis introduces SPADES-ML (Scientific Pipeline Assessment and Data-Centric Evaluation Scorecard for Machine Learning), a structured assessment framework developed to evaluate the quality and transparency of data-related practices in machine learning-based research. SPADES-ML is demonstrated through a case study of machine learning-based environmental research. A total of 28 research papers were analysed using SPADES-ML. The framework was applied to assess five critical areas: data selection and suitability, data quality, adherence to the FAIR principles, data preprocessing, and challenges in preprocessing. A survey targeting practitioners in machine learning-based environmental research was conducted to validate the findings. The analyses of the literature and the survey revealed recurring challenges in ensuring data quality, reproducibility, and methodological excellence. Furthermore, this study provides initial recommendations to improve data practices in machine learning-based research by adhering to software engineering principles.
This thesis contributes to the emerging field of research software engineering by offering a structured evaluation and guidelines for robust methodology pipelines in interdisciplinary, machine learning-based research.

Keywords: Data-Centric Evaluation, Data Management, Data Quality, Data Quality Challenges, Environmental Research, FAIR, Machine Learning, Methodological Guidelines, Software Engineering, SPADES-ML

Acknowledgements
We would like to thank Hans-Martin Heyn and Yi Peng, our supervisors in software engineering, for their great support and assistance in guiding our work on this thesis. Both have been a great encouragement to us and have been very helpful throughout the thesis. We would also like to thank Eric Knauss, our examiner, for his valuable feedback throughout the research. Our appreciation also goes to Michelle Nerentorp for her insights into environmental research. Finally, we would like to thank our survey respondents for the invaluable time and insights they provided.
Devasinghage Sara Nirmani Mahagamarachchi, Gothenburg, June 2025
Hikkaduwa Liyanage Pamali Chathurika, Gothenburg, June 2025

Contents
List of Figures
List of Tables
1 Introduction
1.1 Problem Description
1.2 Purpose of the Study
1.3 Research Questions
1.4 Thesis Outline
2 Background
2.1 Challenges of using ML in environmental research
2.2 Data quality
2.3 FAIR Principles
2.4 Data producers and data consumers
2.5 Data Pipeline Management
3 Related Work
4 Methods
4.1 Analysis Protocol
4.2 Literature review
4.3 Scientific Pipeline Assessment and Data-Centric Evaluation Scorecard for Machine Learning (SPADES-ML)
4.4 Analysis using SPADES-ML
4.5 Survey to validate SPADES-ML
4.6 Analysis of the survey
4.7 Analyse and create recommendations
5 Results
5.1 Analysis of literature using SPADES-ML
5.2 Analysis of Survey responses
5.3 Findings on challenges and recommendations
6 Discussion
6.1 RQ1: How are data selection and data preparation done when applying ML in environmental research?
6.2 RQ2: What are the challenges of data selection and preparation for the application of ML in environmental research?
6.3 RQ3: What solutions are reported in SE that could mitigate the challenges of data selection and preparation for the application of ML in environmental research?
6.4 Contribution to Research Software Engineering
6.5 Threats to validity
6.6 Future work
7 Conclusion
Bibliography
A Survey form to validate the point system
B Thematic analysis
C Referred table from SLR that guided our literature review

List of Figures
4.1 Overview of Methodology. Purple: SPADES-ML method for pipeline assessment; Yellow: literature review; Orange: survey; Red: identification of good practices and recommendation creation; Green: fulfilled research questions
5.1 Dot plot illustrating how many referred research papers fulfilled the criteria in SPADES-ML
5.2 Score distribution based on published year. n(2017-2021): 7, n(2022): 14, n(2023-2024): 5
5.3 Score distribution based on environmental field. n(Air): 3*, n(Energy): 8, n(Climate): 3*, n(Hydrological): 3*, n(Waste): 4*
5.4 Score distribution based on ML techniques used. n(NN): 8, n(SVM): 3*, n(Image processing): 3*, n(ELT): 9, n(Other): 3*
5.5 Score distribution based on data format. n(Time Series): 14, n(Mixed): 9, n(Multimedia): 3*
5.6 Score distribution based on number of data sources. n(Single): 18, n(Multiple): 8
5.7 Distribution of environmental fields of the respondents
5.8 Distribution of the years of experience as an author in environmental research
5.9 Distribution of paper review frequency
5.10 Distribution of the years of experience as a reviewer
5.11 Distribution of data producing experience
5.12 Distribution of origin of data
5.13 Grouped bar chart: mean importance ratings of evaluation criteria by role
5.14 Importance rating as a reader
5.15 Importance of making self-collected data publicly available to increase the credibility of the data
5.16 Importance rating as an author
5.17 Importance of communicating with data producers before retrieving data
5.18 Importance rating as a reviewer
5.19 Importance rating considering all roles
A.1 Section 1(a) of the survey
A.2 Section 1(b) of the survey
A.3 Section 2 of the survey
A.4 Section 3(a) of the survey
A.5 Section 3(b) of the survey
A.6 Section 3(c) of the survey
A.7 Section 3(d) of the survey
A.8 Section 4(a) of the survey
A.9 Section 4(b) of the survey
A.10 Section 4(c) of the survey
A.11 Section 4(d) of the survey
A.12 Section 5(a) of the survey
A.13 Section 5(b) of the survey
A.14 Section 5(c) of the survey
A.15 Section 5(d) of the survey
A.16 Section 6 of the survey

List of Tables
4.1 List of papers considered for the literature review
4.2 Evaluation criteria – SPADES-ML
5.1 A summary of the analysis results
5.2 Summary of the normalised results
5.3 Comparison of normalised and not normalised approaches
5.4 Categorisation by environmental fields
5.5 Categorisation by ML techniques used
5.6 Categorisation by data format
5.7 Categorisation by the number of data sources
5.8 Challenges identified in data selection and preparation for the application of ML in environmental disciplines
5.9 Guidelines for future ML-enabled research
C.1 Table 2 - AI applications in environmental disciplines based on Konya and Nematzadeh [27]

1 Introduction
Environmental research explores different natural environments using data from, among others, soil, water, air, organisms, and other biological sources [27], with a history tracing back to the 19th century [32]. It helps us to understand the complex systems that shape our environment, the impact of human activities, and their consequences. Areas of research such as climate change, biodiversity conservation, air quality monitoring, and water resource management require a large amount of data to be collected, stored, and analysed. The longitudinal nature of many subject areas, such as climate change, deforestation, and biodiversity loss, requires the collection of data over extended periods – often spanning decades [2] – to observe meaningful patterns and trends, contributing to the rapid growth of environmental data. Further, environmental data are collected from heterogeneous sources, including remote sensing such as satellite imagery and drones, ground-based sensors like weather stations and air/water quality monitoring stations, government agencies such as NASA, and human scientists and volunteers [52]. Fast data growth is also driven by increased sensor deployment enabled by advances in the Internet of Things (IoT), as well as by global collaboration and data sharing. However, the complexity of these datasets, for example, due to various geophysical parameters, poses significant challenges for analysis and interpretation [27]. As a result, Artificial Intelligence (AI) has emerged as a useful tool for analysing and deriving insights from the vast and complex data landscape in environmental research. Essentially, AI is about developing computer systems that can perform tasks which typically require human intelligence [61].
Over the past few decades, the recurring advances and efficiency of AI techniques have made them important in various fields and research areas [27]. Machine Learning (ML) is a subset of AI that focuses primarily on learning to model and predict based on past experience [32]. With its advanced decision-making and pattern recognition capabilities, it processes large amounts of data and provides valuable insights. Maganathan et al. [32] discuss which kinds of ML algorithms are used in different types of environmental research; ML is now used in areas such as climate modelling, water quality and air quality monitoring, and natural disaster prediction.
The accuracy and reliability of the results produced by these ML models are determined by the data used for training and during operation of the ML model, and by the ML model itself1,2 [1]. While the ML community used to agree that "the more data the better", more recently the majority consensus has shifted to "garbage in, garbage out" [31]. Meng [35] further argues that "garbage in, garbage out" is not the main concern, but rather "garbage in, package out": if something is recognised as garbage, it can be fixed; however, when garbage is wrapped nicely as a package, it is sold to the uninformed and remains undetected until the model fails. Regardless of the application domain, numerous studies, such as [12], [25], [58], [23], have demonstrated the significant impact of data quality on model performance and reliability.

1.1 Problem Description
Many recent research projects in environmental science that used ML were listed in the systematic literature review (SLR) by Konya and Nematzadeh [27]. The authors examined information such as ML algorithms, performance metrics, and processing times in these research projects. However, they did not investigate how the data are prepared for the ML models and how the quality of the data used as input to these ML models is assessed and maintained throughout their journey [29] across different stakeholders. In Software Engineering (SE), however, researchers and practitioners routinely handle large volumes of data and have developed knowledge and practices for managing such complexity. For example, Munappy et al. [53], [54], [4], [41], [42] have extensively investigated data management in embedded systems, addressing challenges and proposing solutions. These include techniques and practices such as data pipelining and DataOps, which could be highly relevant for managing data in environmental ML applications. Such SE techniques can play an important role in environmental research: Easterbrook [16] showed this in relation to work on climate modelling, an area therefore referred to as environmental informatics. Almikaeel et al. [5], for example, have used ML in hydrological drought forecasting, an area referred to as hydroinformatics.
Advanced concepts, such as the data pipeline engineering proposed by Munappy et al. [53], provide the ability to automate the processing of heterogeneous data from distributed data sources, intensify data life cycle activities, and increase the productivity of data-driven operations. These concepts can be applicable and helpful when handling data for ML-supported environmental research. However, due to various technical errors that can occur during data collection, missing data are a common problem, which adds to the data quality challenges.
Furthermore, if the data are noisy and error-prone, the processing of these data can also be challenging [27]. In summary, existing data management processes for environmental research do not yet solve all data quality challenges associated with the use of ML. To identify potential solutions, such as guidelines, we need to look closely at how data are handled and managed in environmental research and what solutions from an SE perspective are available to handle the identified environmental data challenges in ML-based systems.

1 https://research.aimultiple.com/data-quality-ai/
2 https://hbr.org/2018/04/if-your-data-is-bad-your-machine-learning-tools-are-useless

1.2 Purpose of the Study
The complexity of environmental data has become a challenging obstacle for environmental scientists [27]. To address this, researchers are increasingly adopting ML-based techniques to manage and analyse large datasets. However, they often struggle with managing and selecting quality data that are suitable for training effective ML models. Currently, there is a lack of clear information regarding the data management processes used, the specific data quality requirements that need to be followed, and the challenges that environmental researchers face in managing and preparing data for ML-based applications [68]. This study aims to investigate how environmental researchers select and preprocess data, particularly when using SE techniques such as data pipelining, in the context of ML-supported environmental research. It seeks to identify key stages in data pipelines within recent ML-enabled environmental research, examine the challenges researchers encounter, and explore potential solutions from SE. As a source of potential solutions from SE, this thesis investigates the work of Munappy et al. [53], [54], [4], [41], [42] in the field of ML-supported embedded systems and tries to map these solutions onto the challenges identified in ML-supported environmental research.

1.3 Research Questions
The following research questions guide this thesis work:
1) How are data selection and data preparation done when applying ML in environmental research?
Firstly, many recent research projects using ML in environmental research were listed in an SLR by Konya and Nematzadeh [27]. The authors extracted information such as ML algorithms, performance metrics, and processing times. However, their research did not explore the role of the data used in the ML models. We intend to build on their work and extend their SLR by investigating the origin of the data used in these projects, how the data were selected, including the requirements and metrics used to assess the suitability and quality of the data, and the preprocessing performed before the data were used to train the ML algorithms.
2) What are the challenges of data selection and preparation for the application of ML in environmental research?
Once we have a good understanding of the data selection and preparation processes used by researchers in environmental research using ML, we intend to investigate the challenges researchers face in these processes. Some initial indications might be given in the literature, but we intend to conduct a qualitative study, including a survey with researchers, to understand in more detail the challenges they faced.
3) What solutions are reported in SE that could mitigate the challenges of data selection and preparation for the application of ML in environmental research?
Based on the insights gained from the answers to RQ1 and RQ2, as SE researchers, we will use the existing solutions and techniques already reported in SE and map these solutions to the challenges we have identified.

Delimitations
The study aims to provide initial recommendations for data quality and management practices, considering the challenges faced and insights gained from other SE research. Our thesis primarily focused on identifying problems related to the data pipeline in environmental research and applying our new ranking methodology to examine how the data pipeline is defined in this field. In creating our recommendations, we focused primarily on the findings of Munappy et al. [54], [4], [30], [42], as they describe thorough processes for building data pipelines. We wanted to investigate whether we could transfer some of this knowledge to a completely different field – from embedded systems to environmental research. However, this approach has not yet been validated.

1.4 Thesis Outline
This report is organised as follows:
• Background – introduces the concepts related to this study.
• Related Work – discusses the relevant literature.
• Methods – explains the methods used in this study.
• Results – presents the outcomes of this study.
• Discussion – discusses the findings based on the outcomes.
• Conclusion – provides the conclusion of the thesis.

2 Background
This section discusses the background of the challenges associated with ML-based applications in environmental research, along with the key concepts necessary to understand the analysis and proposed solutions of this thesis.

2.1 Challenges of using ML in environmental research
ML tools have become very popular in environmental research due to their ability to quickly process and analyse large amounts of data with manageable processing needs and tool complexity [27]. Environmental data come from multiple sources and are presented in different formats, increasing the overall complexity of managing and analysing the collected data. Therefore, the accuracy and reliability of these models depend heavily on the quality and relevance of the data used to train them [66]. As noted in a study by Priestley et al. [50], poor data quality and poorly defined data pipelines can affect ML systems in several ways. Inaccurate data can affect the efficiency of the solution and also discriminate against underrepresented entities in the data. This raises the need for well-defined data selection criteria.
There may also be challenges due to a lack of domain expertise. Data scientists specialising in ML may not have sufficient knowledge of environmental science, while environmental scientists may lack familiarity with ML tools and techniques. This significant knowledge gap between the two expert groups can lead to problems with data quality and selection procedures. Interdisciplinary collaboration between these parties is essential to maintain quality outcomes [27]. Furthermore, the scientists collecting the data may not fully understand what kind of data are needed to train specific ML models, which eventually degrades the quality of the ML model.
In order to ensure high-quality ML models, it is necessary to follow good data management practices that focus on data pipelines and DataOps from data collection to model deployment [41].
As data are collected from heterogeneous sources, a dataset may contain missing values and outliers. Poorly defined data pipelines can lead to inconsistencies in data processing and negatively impact the accuracy of the final output. The need for data preprocessing in ML-based applications is well recognised: maintaining the quality of the data improves the learning ability of the model [41], which is a major concern with environmental data.
Although ML offers many advantages in environmental research, a proper understanding of data management and data quality is necessary to achieve good results. Well-defined data pipelines, including suitable data selection methods, must be implemented to address these challenges and improve the credibility and reliability of ML solutions in environmental research.

2.2 Data quality
Data quality is a key factor that affects the effectiveness of ML models in environmental research in terms of the accuracy and reliability of the results. Environmental research covers many disciplines where data may be collected from different sources, leading to several challenges with the complexity of the data. Different data can be used for different purposes, and corresponding data quality requirements need to be defined and established [44]. As stated by Wang et al. [64], high-quality data should be clearly represented, intrinsically reliable, accessible to data consumers, and contextually appropriate for the intended purpose. To improve data quality, it is important to have a proper understanding of what data quality means to the respective data consumers. The same paper defines high-quality data as "data that are fit for use by data consumers". The quality of the data can vary depending on the purpose of the research and the specific environmental discipline. Some researchers may focus on historical consistency when collecting data, while others may focus on time-sensitive data.
Environmental data often have missing values or rapid fluctuations due to sensor problems or other unavoidable situations. Incomplete data can lead to inconsistencies in the results and directly affect the accuracy of the model. Missing or inconsistent data often introduce biases that reduce the performance of the model [59]. In most scenarios, the environmental data must be collected in real time, as even a small delay can compromise the objective of the research and the model's ability to provide timely predictions. Therefore, it is essential to collect quality data and maintain quality throughout the process to provide better insights.
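To make the discussion of missing sensor data concrete, the sketch below shows one minimal way such gaps can be detected and repaired with pandas. The column name, the hourly frequency, and the gap-length threshold are illustrative assumptions, not details taken from any of the reviewed studies.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly air-quality readings with sensor dropouts (NaN).
idx = pd.date_range("2024-01-01", periods=8, freq="h")
readings = pd.DataFrame(
    {"pm25": [12.0, np.nan, 14.5, np.nan, np.nan, 16.0, 15.2, 15.8]}, index=idx
)

# Completeness check: share of missing values per column.
print(readings.isna().mean())

# Simple repair: time-based interpolation over short gaps only (at most two
# consecutive readings), so that longer outages are not silently invented.
repaired = readings.interpolate(method="time", limit=2)
print(repaired)
```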
2.3 FAIR Principles
Environmental researchers also define good data quality through the FAIR principles. The FAIR principles (findable, accessible, interoperable, and reusable) were introduced to increase the value and utility of research data by ensuring that they are managed and shared in a way that maximises their potential for reuse and collaboration [65]. By promoting transparency, reproducibility, and efficiency in research, these principles have become a cornerstone of the open science movement. In environmental research, it is important to ensure that datasets can be located and accessed by researchers worldwide to facilitate long-term monitoring and analysis, especially in areas such as climate change, biodiversity, and pollution [28]. In addition, when working with climate models, it is essential to be able to combine datasets from different sources, such as satellite imagery and field observations, to gain a comprehensive understanding [21]. Findability requires that data and metadata be assigned unique and persistent identifiers so that they can be easily discovered. Accessibility requires that the data be retrievable through standardised protocols, which may include authentication and authorisation procedures that ensure that appropriate levels of access are maintained. Interoperability facilitates data integration across diverse systems by focusing on the use of standardised vocabularies and formats. Finally, reusability focuses on the need for clear licenses for data use and detailed provenance information, so that data can be reused in future research contexts.4,5

4 https://www.vr.se/english/mandates/open-science/open-access-to-research-data/support-and-tools-/making-research-data-accessible-and-fair.html
5 https://www.hb.se/en/about-ub/current/news-archive/2024/october/fair-research-whats-that/

2.4 Data producers and data consumers
In environmental research, two critical stakeholders are data producers and data consumers, each playing a different role in the data ecosystem. Data producers are entities such as government agencies, research institutions, and citizens that generate, collect, and share environmental data [20]. These stakeholders often make use of advanced technologies such as remote sensing, IoT sensors, and field surveys to collect high-quality data on variables such as air quality, water quality, biodiversity, and waste management [51]. For example, government agencies such as the Environmental Protection Agency (EPA), the National Oceanic and Atmospheric Administration (NOAA), and the European Space Agency (ESA) provide satellite-derived environmental data on air quality, climate change, and ocean conditions used in environmental research worldwide.
Data consumers include researchers, policymakers, and industry professionals who are the end users of environmental data. These consumers use the data to analyse, model, and make informed decisions. For example, Nichol et al. [45] use data from multiple sources, including NOAA, the National Snow and Ice Data Center (NSIDC), and the Pan-Arctic Ice Ocean Modelling and Assimilation System (PIOMAS), to calculate the discrepancy between E3SM climate models (the Energy Exascale Earth System Model developed by the United States Department of Energy (DOE)) and observed climate change.
The relationship between data producers and data consumers is critical to the progress of environmental research. Data producers must adhere to standards such as the FAIR principles3 to ensure that the data they provide are accurate, comprehensive, and accessible. This, in turn, enables data consumers to find, understand, and integrate the data into their work in an efficient manner, thereby improving the quality and impact of their research. For example, the Environmental Data Initiative (EDI) emphasises the importance of making data FAIR to support reproducibility and collaboration in environmental science [21]. Most of the research papers reviewed in the literature review of this study play the role of data consumers, while a few have collected data themselves.

3 https://snd.se/en/manage-data/prepare-and-share/FAIR-data-principles
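As a concrete illustration of the FAIR adherence discussed above, the snippet below sketches a minimal dataset metadata record touching each FAIR aspect. The field names loosely follow common dataset-description vocabularies but are simplified, and all values are hypothetical placeholders.

```python
import json

# Hypothetical metadata record; each field is annotated with the FAIR aspect
# it supports.
record = {
    "identifier": "https://doi.org/10.xxxx/example-dataset",  # F: persistent identifier
    "title": "Hourly PM2.5 readings, example monitoring station",
    "access_url": "https://data.example.org/pm25.csv",  # A: standard protocol (HTTPS)
    "format": "text/csv",  # I: common, open file format
    "license": "CC-BY-4.0",  # R: clear reuse terms
    "provenance": "Calibrated against a reference instrument, January 2024",  # R
}
print(json.dumps(record, indent=2))
```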
2.5 Data Pipeline Management
With the continued development of ML models and the use of large datasets, the need for well-established data pipelines has increased significantly [4]. A data pipeline consists of a complex chain of interconnected processes that starts at a data source and ends at a destination with processed data; the destination can be either a storage location or a visualisation tool [54]. By automating the flow of data, a pipeline removes many time-consuming manual steps. The automated operations of a data pipeline can include data selection, extraction, transformation, aggregation, validation, and loading for further analysis and visualisation [53]. A pipeline can vary according to the research area and the type of data being processed. For some data pipelines, the order of the processes is critical, while for others it may be flexible, depending on the processing logic and the data dependencies.
Modern data pipelines emphasise scalability, reliability, fault tolerance, and correctness. Studies highlight that well-designed pipelines reduce the complexity of data preparation processes and improve data accessibility for ML applications [53]. For ML-based environmental research, it is very important to identify a specialised and refined data pipeline. Some parts of the data pipeline, such as data annotation, cannot be fully automated, as annotation techniques do not yet support all types of datasets.
While data pipelines provide an effective tool for managing data by automating its flow, DataOps introduces a process-oriented approach with principles that emphasise collaboration, automation, and monitoring in the management of data workflows [42]. DataOps does not focus only on tools but also on people, as it requires a combination of collaboration and innovation. Studies have shown that it has improved team productivity and reduced error rates in pipelines.
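The stages listed above can be made tangible with a small sketch. Assuming a CSV source with a timestamp column, each stage below is a self-contained function, so stages can be reordered, replaced, or tested in isolation; this is only an illustration of the idea, not a pipeline taken from the cited work.

```python
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    # Extraction from a source file; the column name is an assumption.
    return pd.read_csv(path, parse_dates=["timestamp"])

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Example transformation/aggregation: daily means of all numeric readings.
    return df.set_index("timestamp").resample("D").mean(numeric_only=True)

def validate(df: pd.DataFrame) -> pd.DataFrame:
    # Fail fast instead of passing bad data downstream.
    if df.empty:
        raise ValueError("validation failed: no rows after transformation")
    return df

def load(df: pd.DataFrame, dest: str) -> None:
    # Loading: here simply writing the processed data to its destination.
    df.to_csv(dest)

def run_pipeline(src: str, dest: str) -> None:
    load(validate(transform(extract(src))), dest)
```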
3 Related Work
This section discusses previous studies in ML-enabled environmental science, as well as data quality and related principles in ML-enabled environmental research and data pipelines.

Challenges of applying ML in environmental research due to data quality challenges
The SLR by Konya and Nematzadeh [27] provides a valuable synthesis of recent applications of AI, in particular ML and Deep Learning (DL), in the environmental disciplines. They referred to 26 research papers in their SLR and provided insights into several environmental domains, including air quality monitoring, water quality assessment, energy management, and waste management. The complexity and volume of data are common to many of the studies reviewed, which often include time series data, sensor readings, and geospatial information. These studies demonstrate the use of a variety of techniques, including shallow neural networks, support vector machines, and DL architectures such as Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs).
In order to monitor air quality using IoT-based multisensor systems, De Vito et al. [15] used shallow neural networks (SNNs) and extreme learning machines (ELMs). In their study, data imputation and normalisation techniques were used to mitigate challenges related to sensor calibration and noise. Similarly, Fabregat et al. [17] used multilayer perceptron regressors to predict air quality, achieving high correlation coefficients and demonstrating the effectiveness of ML in environmental monitoring. The study highlighted the importance of preprocessing steps, such as outlier removal and feature selection, to improve model accuracy. These examples highlight the critical role that data preprocessing plays in ensuring the reliability of environmental datasets.
In the field of water quality monitoring, Fijani et al. [18] developed a hybrid model combining extreme learning machines (ELMs) with decomposition algorithms for real-time monitoring of water quality parameters. Their approach demonstrated the potential of hybrid ML frameworks to handle complex datasets, with significant reductions in mean square and mean absolute errors. Their preprocessing pipeline decomposed non-stationary water quality signals into intrinsic mode functions (IMFs) to reduce noise and improve feature extraction. Saboe et al. [57] further advanced the field by using LSTM networks for multivariate time series forecasting, achieving accurate predictions of water quality parameters with minimal errors. They used temporal alignment and normalisation in their preprocessing pipeline and addressed missing data through linear interpolation. These studies highlight the importance of tailored preprocessing techniques in the management of complex environmental datasets.
Bonofiglio et al. [9] focused on ocean monitoring, using DL algorithms for marine species detection and tracking from video data. Their study faced challenges due to the large amount of video data and the need for real-time processing. To address these issues, they used YOLOv5, a state-of-the-art object detection model that significantly reduced processing time while maintaining high accuracy. One of the key challenges was dealing with the poor quality of underwater images, which often suffer from low visibility, colour distortion, and obstructions due to turbid water conditions. They addressed this issue by applying preprocessing techniques, such as contrast-limited adaptive histogram equalisation (CLAHE), to enhance the images. This example demonstrates the importance of choosing the right techniques to efficiently process large amounts of environmental data and overcome data quality challenges.
In the field of waste management, several studies ([3], [63], [39], [22]) demonstrate the application of ML algorithms to predict waste generation patterns, optimise collection routes, and identify different types of waste for efficient sorting. They rely significantly on data preprocessing to manage diverse inputs, including socioeconomic datasets, Internet of Things (IoT) sensor readings, and waste container images. For example, these studies address data quality challenges, such as missing values in municipal waste records or mislabelled waste categories, using techniques like interpolation and synthetic minority oversampling (SMOTE) to address class imbalance. Computer vision approaches use preprocessing pipelines that involve image normalisation and augmentation to improve the accuracy of waste detection. These works highlight the importance of structured data pipelines in transforming raw, noisy data into actionable insights for route optimisation and automated sorting, which ultimately reduces operational costs and environmental impact.
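As a brief illustration of the oversampling technique mentioned above, the sketch below rebalances a synthetic two-class dataset with SMOTE; it stands in for the waste-category data of the cited studies, whose actual features are not reproduced here.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for an imbalanced waste-classification dataset
# (90% majority class, 10% minority class).
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# SMOTE synthesises new minority-class samples by interpolating between
# existing minority-class neighbours.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))
```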
The review by Konya and Nematzadeh [27] provides an overview of the current state of ML applications in environmental disciplines. Through an analysis of the strengths and limitations of different ML techniques, their work provides insights into how ML can be effectively integrated into environmental research through data pipelines, data preprocessing techniques, and domain-specific data handling, paving the way for future advances in the field. However, the review also highlights several challenges, including the requirement for large amounts of labelled data and the lack of standardised data management practices. These results indicate that further research is needed on data quality and management practices in ML-enabled environmental research.

Approaches towards data quality management in environmental research
When considering aspects of data quality in ML-enabled environmental research, the quality of the data is critical for the validity of the results. Wang et al. [64] proposed a multidimensional framework consisting of intrinsic data quality (including accuracy and consistency), contextual data quality (including relevance and timeliness), representational data quality (including interpretability and ease of understanding), and accessibility data quality (including availability and security). Their empirical approach highlighted the complexity of data quality from the perspective of data consumers, identifying 20 dimensions and 118 attributes. This work is particularly relevant to environmental research, where data not only need to be accurate but also contextually appropriate and accessible to a variety of stakeholders.
Meng [35] introduced the concept of data minding, a proactive approach to ensuring data quality throughout the data lifecycle. Meng [35] emphasises the need for rigorous data quality practices, arguing that poor data quality leads to "garbage in, package out". This is in line with environmental research, where data collection and preprocessing are often error-prone and complex, and robust quality assurance mechanisms are required.
The ethical and practical implications of poor data quality are highlighted by Paullada et al. [47]. Their findings are relevant to environmental research, where datasets are often reused across different studies, thereby amplifying the impact of data quality problems.
Data smells, which are indicators of potential data quality issues in public datasets, have been discussed by Shome et al. [59]. These include missing values, inconsistencies, and outdated information. Their work is important for environmental research, which often relies on publicly available datasets, as it provides a practical lens for diagnosing and addressing data quality issues.
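A scan for such smells can be automated with a few lines of code. The sketch below checks a tabular dataset for three of the smells just mentioned; the staleness threshold and the time-column argument are illustrative assumptions, not part of Shome et al.'s tooling.

```python
import pandas as pd

def scan_for_smells(df: pd.DataFrame, time_col: str, max_age_days: int = 365) -> dict:
    """Report simple indicators of missing, duplicated, or outdated data."""
    age = pd.Timestamp.now() - pd.to_datetime(df[time_col]).max()
    return {
        "missing_ratio": float(df.isna().mean().mean()),  # overall share of NaNs
        "duplicate_rows": int(df.duplicated().sum()),     # exact duplicate records
        "stale": bool(age.days > max_age_days),           # newest record too old?
    }
```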
The Candidate Framework for Data Quality Assessment and Maintenance (CaFDaQAM) proposed by Pradhan et al. [49] provides a structured approach to managing data quality. The framework includes four components: a data quality workflow, a data quality challenge list, a data quality attribute list, and solution candidates. This framework can be adapted to environmental research, where data quality challenges often arise from heterogeneous data sources and varying methods of data collection.
The ISO/IEC 25012 quality model [19] used by Nguyen et al. [44] provides a robust framework for assessing data quality, focusing in particular on the following inherent data quality dimensions: accuracy, timeliness, completeness, consistency, and credibility. These dimensions are critical to ensuring that data are not only reliable but also fit for purpose. Accuracy ensures that data are an accurate representation of real-world phenomena, which is essential for environmental research. Timeliness ensures that the data are up to date, a critical consideration for time-sensitive applications such as climate monitoring or disaster response. Completeness reduces the risk of gaps that could undermine analyses by ensuring that all required data are present. Consistency is especially important when integrating datasets from multiple sources, as it ensures that the data are free of contradictions. Finally, credibility ensures that data are trustworthy, an essential attribute for informing policy decisions or long-term environmental strategies. These dimensions are particularly important in environmental research, where data are often used for long-term monitoring, ecosystem management, and policymaking. By following the ISO/IEC 25012 quality model, researchers can systematically address data quality challenges and ensure that their datasets are robust, reliable, and reusable for multiple applications.

Towards data pipelines for ML in environmental research
As data flow through the various stages of a data pipeline, such as collection, preprocessing, and analysis, integrating data quality practices into the pipeline is essential to ensure the reliability and usability of environmental data.
Data management has become increasingly important in ML-based applications over time, as the success of an ML model depends mainly on the quality of the data and how they are processed. Data management has a significant impact on model accuracy and, to eliminate the challenges that come with poor data management, practices such as data pipelines and DataOps should be followed [41]. As noted by Munappy et al. [53], data pipelines can process multiple streams of data simultaneously, making them essential for applications that require real-time data processing. The authors also discuss the practical aspects of data pipeline management, highlighting the challenges and opportunities in maintaining efficient and reliable data flows. Among the issues discussed are:
• Complexity: data pipelines often involve multiple stages and dependencies, making them difficult to design, implement, and maintain.
• Data quality: ensuring accuracy, consistency, and completeness throughout the pipeline remains an ongoing challenge.
• Scalability: pipelines must handle increasing volumes of data without compromising performance.
• Integration: combining data from heterogeneous sources (e.g., sensors, databases, APIs) requires robust mechanisms.
• Monitoring and troubleshooting: identifying and fixing pipeline problems can be time-consuming and resource-intensive.
These challenges are particularly evident in environmental research, where data are often noisy, incomplete, or collected at different temporal and spatial resolutions. They highlight the importance of robust pipeline management in handling the complexities of integrating and processing data, which is particularly relevant in environmental research, where data come from diverse sources such as sensors, satellites, and field observations.
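One lightweight response to the monitoring challenge is to wrap every pipeline stage so that its effect on the data is logged. The decorator below is a minimal sketch of this idea (it assumes stages that take and return a DataFrame); it is our illustration, not an implementation from the cited papers.

```python
import functools
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)

def monitored(stage):
    """Log row counts before and after a stage, making silent data loss visible."""
    @functools.wraps(stage)
    def wrapper(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
        out = stage(df, *args, **kwargs)
        logging.info("%s: %d rows in, %d rows out", stage.__name__, len(df), len(out))
        return out
    return wrapper

@monitored
def drop_invalid(df: pd.DataFrame) -> pd.DataFrame:
    # Example stage: remove physically impossible negative concentrations
    # (the column name is an assumption).
    return df[df["pm25"] >= 0]
```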
In their papers, Munappy et al. [54], [4], [30], [42] propose several solutions and best practices to address these challenges. Modular pipeline architectures are explored to address the challenges of designing, implementing, and maintaining complex data pipelines [54]. A modular pipeline architecture breaks down the pipeline into smaller, self-contained components or modules, each performing a specific task such as data ingestion, data transformation, data validation, or data storage. This provides significant advantages, including flexibility, as modules can be easily adapted to new data sources or requirements, which is important in environmental research, where new sensors or data types are frequently introduced. It also improves reusability, allowing modules to be reused across different pipelines or projects and reducing redundancy. In addition, modular architectures improve scalability, interoperability, and maintainability, all of which are particularly important in environmental research, for example, to facilitate integration with different data sources and systems, such as combining satellite imagery, ground-based sensor data, and climate models. Further, it is important to automate and orchestrate pipelines, using tools such as Apache Airflow and Kubernetes, to streamline execution and reduce manual intervention [4]. Munappy et al. also highlighted the need for continuous data quality monitoring at each stage of the pipeline to detect and correct errors early [30]. Finally, the value of collaborative DataOps practices, such as iterative development and cross-functional teaming, for improving pipeline efficiency and reliability was highlighted [42]. Taken together, these solutions provide a robust framework for overcoming the challenges associated with data pipelines in complex domains such as embedded systems.

Significance of the Study
It is clear from the literature that environmental research relies on large amounts of data, and that data quality and the selection process are important to the success of the research. Researchers must understand what kind of data they need and how best to use it. To this end, the data management and preparation processes in environmental disciplines must be carefully examined and addressed. There should be an effective communication channel between data producers and consumers to identify actual needs. This thesis aims to reduce the current data challenges faced by environmental researchers in ML-based research and to propose solutions from SE, while bridging the gap between these two fields. The study aims to facilitate mutual benefits for software engineering and environmental science while mitigating data management challenges by bringing these two fields closer together.

4 Methods
The methodology followed in this research is exploratory, focusing on understanding and analysing the existing approaches taken by environmental researchers to ensure data quality requirements and data management in ML-based research. This is followed by constructive research in which we suggest SE-based solutions to overcome the current problems.
An overview of the techniques used and their relation to the RQs is shown in Figure 4.1. Here, we provide a brief summary of the methods and techniques applied, followed by a more detailed discussion of each step in this study.
The study began by defining an analysis protocol and conducting a literature review to identify current data selection and preparation practices. We then analysed the papers found through the literature review using a scorecard called SPADES-ML (Scientific Pipeline Assessment and Data-Centric Evaluation Scorecard for Machine Learning). We conducted an online survey with environmental researchers and authors of the papers included in the literature review. Using the information collected from all steps, we identified current challenges in the ML-enabled environmental research data pipeline. Then, we studied the literature on pipeline design in the field of SE and formulated suggestions on how environmental researchers could apply techniques from SE to mitigate those challenges.

Figure 4.1: Overview of Methodology. Purple: SPADES-ML method for pipeline assessment; Yellow: literature review; Orange: survey; Red: identification of good practices and recommendation creation; Green: fulfilled research questions.

4.1 Analysis Protocol
We defined an analysis protocol to capture all the necessary details of the data from the selected papers to meet our purpose. Based on a brainstorming session conducted by the two of us, we identified the main areas we needed to analyse in terms of data quality and management. This was done after referring to a few studies and trying to identify important aspects to describe and confirm in the literature. The following points guided the extraction of the necessary information from the selected papers:
• Origin of the data – This records where the data originate: some researchers collect the data themselves (self-created), and some collect data from external sources.
• How do data producers and data consumers communicate? – Data consumers can either contact data producers and acquire all the data they need, or they can produce the data themselves. The other common scenario is that the data are publicly available, so that anyone can download and access them from a data portal or data repository.
• Availability of the dataset – Can the data be downloaded from a public portal or repository? Were the data published alongside a paper, e.g., as a replication package?
• How did the authors evaluate that the data were suitable? – We checked different forms of argument used to assess suitability: for example, arguments provided in the form of text and the necessary mathematical argumentation mentioned in the paper. We also checked for references to other publications that validate the selection.
• Did the authors mention explicit data requirements?
– Data quality requirements – We focused on the five main data quality requirements defined in the ISO/IEC 25012 quality model: accuracy, currentness, completeness, consistency, and credibility [19].
– Data selection requirements
• Training data – How did they split the dataset into training and validation data? Did they use cross-validation, where the dataset is randomly shuffled before training an ML model?
• Did the authors provide an overview of their data pipeline, such as a diagram or textual explanation, including preprocessing steps?
• Did the authors explain any challenges they encountered with the data or data processing?
• How did they develop preprocessing steps to mitigate the challenges?
• How did the authors handle missing data?
– What did they do with invalid data? Were any data imputation techniques used? How did they handle obstacles or problems with the data?
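For illustration, the protocol headings above can be mirrored in a small machine-readable record, which is convenient when notes are later aggregated; the structure and field names below are our own hypothetical shorthand, not part of the published protocol.

```python
from dataclasses import dataclass, field

@dataclass
class PaperRecord:
    """Per-paper notes following the analysis protocol (hypothetical structure)."""
    paper_id: str
    data_origin: str                 # "self-created" or "external"
    dataset_available: bool          # downloadable or shipped as replication package
    suitability_argued: bool         # textual/mathematical argument or reference
    quality_requirements: list = field(default_factory=list)  # e.g. ["Accuracy"]
    pipeline_described: bool = False
    challenges_reported: bool = False
    missing_data_handling: str = ""  # e.g. "linear interpolation"
```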
4.2 Literature review
The study continued with a literature review, based on the paper by Konya and Nematzadeh [27], Recent Applications of AI in Environmental Disciplines. Their paper focuses on systematically collected information, such as the ML algorithms used, performance metrics, and processing times, in research across a variety of environmental disciplines. It does not examine the details of the data, their quality, or how they are managed in ML-based applications. Table 2 of their paper details the 26 reviewed studies, categorising them by environmental domain and listing the ML techniques used, performance metrics, processing times, and the techniques that performed best. This is illustrated in Appendix C, Table C.1. As our first step, we extended their literature review by examining each research paper individually and investigating areas related to data quality requirements and data management pipelines.
In addition, we used forward and backward snowballing to find other relevant research projects in different environmental disciplines that have used ML-based techniques and that contain more detail on data quality and data management aspects. After selecting the articles, we read them and extracted information according to the analysis protocol we defined. All the papers that were reviewed are listed in Table 4.1.
We based our literature review on the paper by Konya and Nematzadeh [27] because it was published in 2024 and included a variety of environmental subdomains from 2017 to 2024. Furthermore, working with Table 2 in their paper was a good fit for us. First, we screened the first six research papers together and discussed our findings in order to find a common approach. We then divided the remaining papers between us. We each analysed one paper from our allocated set, and the other reviewed the analysis. Once we reached an agreement, we continued to analyse the remaining papers individually, providing each other with updates on a weekly basis. In addition, three of the records we analysed were reviewed by one of our supervisors to ensure that we applied our analysis steps correctly. All relevant information found in each paper was recorded in a spreadsheet under the headings defined in the analysis protocol.

Table 4.1: List of papers considered for the literature review (ID – Environmental Field – Publication)
I – Hydrological modelling – Natel de Moura et al., 2020 [43]
II – Energy forecasting and management – Condemi et al., 2021 [14]
III – Climate modelling – Obahoundje et al., 2024 [46]
IV – Biodiversity & ecosystem monitoring – Sittaro et al., 2023 [60]
V – Energy forecasting and management – Ramos et al., 2022 [55]
VI – Air quality and atmospheric science – De Vito et al., 2020 [15]
VII – Climate modelling – Migallón et al., 2022 [36]
VIII – Hydrological modelling – Almikaeel et al., 2022 [5]
IX – Waste management – Agnew et al., 2023 [3]
X – Hydrological modelling – Almikaeel et al., 2024 [6]
XI – Biodiversity & ecosystem monitoring – Tuia et al., 2022 [62]
XII – Energy forecasting and management – Zhou et al., 2022 [67]
XIII – Air quality and atmospheric science – Fabregat et al., 2022 [17]
XIV – Marine and aquatic science – Bonofiglio et al., 2022 [9]
XV – Energy forecasting and management – Miller et al., 2022 [37]
XVI – Biodiversity & ecosystem monitoring – Moran et al., 2017 [34]
XVII – Waste management – Moral et al., 2022 [39]
XVIII – Hydrological modelling – Miro et al., 2021 [38]
XIX – Water quality monitoring – Saboe et al., 2021 [57]
XX – Energy forecasting and management – Da Costa Alves Basílio et al., 2022 [7]
XXI – Waste management – Gue et al., 2022 [22]
XXII – Air quality and atmospheric science – Chojer et al., 2022 [13]
XXIII – Energy forecasting and management – Pombo et al., 2022 [48]
XXIV – Water quality monitoring – Fijani et al., 2019 [18]
XXV – Waste management – Velis et al., 2023 [63]
XXVI – Energy forecasting and management – Manfren et al., 2022 [34]
XXVII – Climate modelling – Nichol et al., 2021 [45]
XXVIII – Energy forecasting and management – Maltais and Gosselin, 2022 [33]

4.3 Scientific Pipeline Assessment and Data-Centric Evaluation Scorecard for Machine Learning (SPADES-ML)
We applied the next stage of our analysis to the notes collected during the literature analysis. Specifically, we wanted to systematically evaluate and categorise the research papers we reviewed in order to present the results in a structured way. This would show how data quality and data management practices are currently applied or reported in research papers. The purpose of SPADES-ML is to evaluate the methodology in ML-enabled environmental research. Besides the aforementioned analysis criteria, we also incorporated the FAIR principles, given their significant role in environmental research data sharing [65].
As shown in Table 4.2, we considered a total of five categories that take into account different aspects and stages of data in a scientific ML-driven research project:
1. Data selection requirements and suitability: These relate to requirements describing why particular data sources were selected for a given purpose. Data suitability reasoning may include statistical analysis performed to verify certain conditions, results of a previous study using the same data source, and so on.
2. Data quality requirements: We selected the five inherent data quality attributes from the ISO/IEC 25012 quality model; their definitions are given below [19].
• Accuracy: the degree to which data has attributes that correctly represent the true value of the intended attribute of a concept or event in a specific context of use.
• Completeness: the degree to which subject data associated with an entity has values for all expected attributes and related entity instances in a specific context of use.
• Credibility: the degree to which data has attributes that are regarded as true and believable by users in a specific context of use. Credibility includes the concept of authenticity (the truthfulness of origins, attributions, commitments).
• Currentness: the degree to which data has attributes that are of the right age in a specific context of use.
• Consistency: the degree to which data has attributes that are free from contradiction and are coherent with other data in a specific context of use. It can apply either or both among data regarding one entity and across similar data for comparable entities.
3. FAIR principles: As mentioned earlier in Chapter 2, the FAIR principles are widely used in environmental science research. We added several criteria to check that the FAIR principles are followed. For findability, we looked at the availability of metadata about the data, which allows readers to find and understand the data. We looked at the accessibility of both raw data and processed data; where authors do not have the right to share data, it is still possible to make the metadata available. The use of common file formats such as CSV, MS Excel, MS Word, PDF, JSON, XML, SQL, PNG, JPEG, etc., was considered under interoperability. The choice of a common file format can be determined on the basis of the data type (https://snd.se/en/manage-data/guides/choosing-file-format). For reusability, all shared resources should be available in the long term; GitHub would be a good example of long-term reuse. The replication package could include data, data cleaning code, analysis code if applicable, and relevant documentation that would help replicate the work (https://www.econometricsociety.org/publications/es-data-editor-website/package).

4. Data preprocessing: Real-world data are often incomplete and inconsistent and may contain errors, so preprocessing the data is an inevitable step in the entire process.

5. Challenges encountered: It is highly unlikely that one will not face any challenges during preprocessing. Numerous issues can arise during the data cleaning and imputation processes.

Table 4.2: Evaluation criteria - SPADES-ML

C1    Data selection requirements and arguments of data suitability       0-2
  1.1   Mention of data selection criteria or requirements                1
  1.2   Reason for suitability of data                                    1
C2    Data quality requirements argued / validated                        0-2.5
  2.1   Accuracy                                                          0.5
  2.2   Completeness                                                      0.5
  2.3   Credibility                                                       0.5
  2.4   Currentness                                                       0.5
  2.5   Consistency                                                       0.5
C3    FAIR                                                                0-2
  3.1   F: Metadata linked or available otherwise                         0.5
  3.2   A: Available to download
    3.2.1   Raw data                                                      0.25
    3.2.2   Processed data                                                0.25
  3.3   I: Common file formats used                                       0.5
  3.4   R: Possibility for future use
    3.4.1   Data storage location: long-term                              0.25
    3.4.2   Replication package                                           0.25
C4    Data Preprocessing                                                  0-2
  4.1   Preprocessing described / reasons given                           1
  4.2   Reproducible (e.g., code available or detailed steps)             1
C5    Challenges during data processing                                   0-2
  5.1   Mention of any challenges encountered during data processing     1
  5.2   Provide reproducible solutions to these challenges                1

Scoring for each element: A great deal of thought went into assigning points to the criteria and requirements in SPADES-ML. As we cannot argue why any one category should be more important than another, we decided that each category receives the same maximum of 2 points. The one exception is the category of data quality requirements, where we allow 2.5 points, for two reasons. First, it was convenient to assign 0.5 points to each of the five quality attributes. Second, we consider the definition of clear data quality requirements a critical element when describing data preparation for ML, and therefore decided to weight this category slightly more. To balance the scores and make the comparison fair, we normalised the scores after the analysis, as shown in Chapter 5.

As part of the design of SPADES-ML, we also examined how similar scoring systems are applied in SE. Carrillo de Gea et al. [11] used a ranking system to evaluate RE tools in different categories of RE; for this purpose, a plus (+), zero, and minus (-) based scoring system was used. We decided to use numerical scores rather than +/- because we were interested in performing a statistical analysis to further interpret the results. In addition, a minus (-) is not relevant to our evaluation scenario, because we are only evaluating whether a criterion is fulfilled or not. For example, not mentioning the data selection criteria does not make the situation worse; rather, it should simply receive zero points. However, we do mimic this system of assigning ++/-- by assigning values within the range, such as 0.5, when it is not quite possible to give full points.

Several other papers used a similar scoring system. Achimugu et al. [2] used a quality assessment (QA) to identify and analyse existing software requirement prioritisation techniques. The selected primary studies were assigned scores based on five QA questions, with binary or partial scores assigned to each (e.g., "Yes" = 1, "Partly" = 0.5, and "No" = 0). The quality score for a particular study is computed by summing the scores of the answers to all QA questions. In their SLR on the practices and challenges of agile RE, Inayat et al. [24] used a similar QA method with four criteria to assess the quality of the evidence present in the studies; to improve the rating and categorisation, the authors used an ordinal scale similar to that of Achimugu et al. [2] instead of a dichotomous scale, and normalised scores were considered for further steps. Kitchenham and Charters [26] used a checklist of twelve questions to identify, evaluate, and synthesise the SR methodology in SE. The questions in the checklist were assigned the following values: "Yes" = 1, "Partly" = 0.5, "No" = 0, and "Not Applicable" = NA. In-between values were possible, as interpolation of numerical values was permitted. Furthermore, "NA" values were assigned a value of 0 for the statistical analysis. SPADES-ML was built on these established practices.
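To make the scoring procedure concrete, the following minimal sketch in Python shows how Yes/Partly/No judgements on the sub-criteria of Table 4.2 can be aggregated into the five category scores. The judgements in the example are purely illustrative, and the sketch is not the tooling we actually used:

# Maximum points per sub-criterion, following Table 4.2.
MAX_POINTS = {
    ("C1", "1.1"): 1.0, ("C1", "1.2"): 1.0,
    ("C2", "2.1"): 0.5, ("C2", "2.2"): 0.5, ("C2", "2.3"): 0.5,
    ("C2", "2.4"): 0.5, ("C2", "2.5"): 0.5,
    ("C3", "3.1"): 0.5, ("C3", "3.2.1"): 0.25, ("C3", "3.2.2"): 0.25,
    ("C3", "3.3"): 0.5, ("C3", "3.4.1"): 0.25, ("C3", "3.4.2"): 0.25,
    ("C4", "4.1"): 1.0, ("C4", "4.2"): 1.0,
    ("C5", "5.1"): 1.0, ("C5", "5.2"): 1.0,
}
# Judgement scale in the spirit of Achimugu et al. [2] and Kitchenham and
# Charters [26]: "yes" earns full points, "partly" half, "no" none.
FRACTION = {"yes": 1.0, "partly": 0.5, "no": 0.0}

def score_paper(judgements):
    """Aggregate sub-criterion judgements into the five category scores."""
    totals = {c: 0.0 for c in ("C1", "C2", "C3", "C4", "C5")}
    for (category, item), judgement in judgements.items():
        totals[category] += MAX_POINTS[(category, item)] * FRACTION[judgement]
    return totals

# Hypothetical judgements for a single paper:
example = {key: ("yes" if key[0] in ("C1", "C2") else "no") for key in MAX_POINTS}
print(score_paper(example))  # {'C1': 2.0, 'C2': 2.5, 'C3': 0.0, 'C4': 0.0, 'C5': 0.0}

The same structure makes it straightforward to later divide each category total by its maximum, which is exactly the normalisation applied in Chapter 5.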
4.4 Analysis using SPADES-ML

Using the notes taken during our literature review according to our analysis protocol, we analysed each paper against each of the criteria in Table 4.2. We conducted a pair analysis, i.e., both researchers worked together, in order to reduce personal bias in the evaluation. Most of the time, we referred back to the research papers and had to read between the lines to identify some criteria, such as data quality requirements and data processing challenges. A summary of the scores for each paper is shown in Table 5.1 in Chapter 5.

The following two papers were disregarded during the analysis using SPADES-ML. The paper by Natel de Moura et al. [43] was unfortunately unavailable to us, even with the support of our university library; only a conference presentation of the paper could be found, which provides some details but not enough to allow a fair judgement. We also excluded the paper by Tuia et al. [62] because it provides only a generic perspective, rather than a specific case study using a data source: it gives an overview of how ML can be used in wildlife conservation. As both papers do not provide sufficient insight into data quality and data management, they were disregarded.

Further, several factors were considered when interpreting the results, including the publication year, environmental field, ML techniques used, data format type, and number of data sources. Our goal was to determine how data quality and data management vary with each of these factors. We grouped the research papers based on the aforementioned factors. A group was considered only if it contained at least three papers; groups that were not considered are marked with an "*".

As for the statistical analysis, the mean, median, variance, standard deviation, and 90% and 95% confidence intervals were calculated for each of the categories C1 to C5 in SPADES-ML and for the total column. The 90% confidence interval was considered in line with common conventions, and the 95% confidence interval was also considered because it is the standard in scientific research. Both intervals convey the uncertainty around the mean; the 95% confidence interval is the wider and therefore more conservative of the two. Box plots are used to visualise the results of the statistical analysis because of their clarity and simplicity. The results for each factor are discussed in the respective subsections of Chapter 5.
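The sketch below (in Python, with illustrative score values grouped by the publication-year grouping used in Chapter 5) makes this statistical step concrete: it computes the descriptive statistics and t-based confidence intervals per group and draws the corresponding box plots. It is a simplified illustration of the analysis, not the exact script we used:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Illustrative normalised total scores (out of 5) per publication-year group.
groups = {
    "2017-2021": [0.5, 2.3, 1.8, 2.05, 2.8, 1.95, 3.05],
    "2022": [2.55, 2.1, 2.75, 2.3, 4.875, 1.6, 3.425,
             4.75, 2.35, 2.275, 1.85, 2.0, 2.8, 4.25],
    "2023-2024": [1.6, 4.5, 3.725, 2.8, 1.7],
}

for name, values in groups.items():
    x = np.asarray(values, dtype=float)
    mean, median = x.mean(), np.median(x)
    var, sd = x.var(ddof=1), x.std(ddof=1)   # sample variance and std. deviation
    sem = stats.sem(x)                        # standard error of the mean
    ci90 = stats.t.interval(0.90, df=len(x) - 1, loc=mean, scale=sem)
    ci95 = stats.t.interval(0.95, df=len(x) - 1, loc=mean, scale=sem)
    print(f"{name}: mean={mean:.2f} median={median:.2f} sd={sd:.2f} "
          f"90% CI={ci90} 95% CI={ci95}")

plt.boxplot(list(groups.values()), labels=list(groups.keys()))
plt.ylabel("Total score (out of 5)")
plt.show()

Note that the 95% interval computed this way is always the wider of the two, since a higher confidence level requires a larger margin of error.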
4.5 Survey to validate SPADES-ML

We designed a survey to be sent to the authors of the papers included in our literature review. Our aim was to obtain their expert opinion on the criteria we had included in SPADES-ML. However, to reduce bias in the participants' responses, no details about SPADES-ML or the inclusion of their paper in our case study were disclosed. Participants were asked to give their opinion from the perspective of the three academic roles most researchers take: reader of a paper, author of a paper, and reviewer of a paper. This allowed the importance of the criteria to be gathered from three perspectives, so that the overall points assigned to the criteria could be discussed.

The survey consisted of four main sections. The first section collected demographic information in the form of closed-ended questions. The following questions were included in this section:
• Which field of environmental research are you specialised in?
• How many years of experience do you have as an author in environmental research?
• Do you regularly review papers for journals or conferences?
• If yes: How many years of experience do you have in reviewing research papers?
• Would you consider yourself a "data producer", i.e., someone who creates datasets through e.g., field measurements?
• Would you consider yourself a "data consumer", i.e., someone who uses data for e.g., modelling tasks or other aspects of your research?
• If yes to data consumer: Where does the data come from that you use in your research?

In each of the remaining three sections, we asked the participants to judge the importance of the different criteria of SPADES-ML from the perspective of being a reader, author, and reviewer of a scientific paper in environmental research. These sections consisted of Likert-scale questions designed with four options, two negative and two positive, so that a neutral option was excluded. We provided definitions whenever necessary to clarify the terms we were looking for and included links directing respondents to the sources of those definitions. Follow-up open-ended questions were added for each criterion to capture further ideas and thoughts from the respondents. Before the survey was sent to the authors, it was validated by two independent researchers, including one with an environmental science background. The validation provided by the independent researchers, together with feedback from our thesis supervisors, helped refine the survey into its final version. The survey form can be found in Appendix A.
Survey population selection: The survey was initially sent to 55 authors of the original papers listed in Table 4.1. This group included authors who had provided their contact information and the corresponding authors of the papers included in our literature review. Due to the very low initial response rate of 5.45%, even after sending reminders, we decided to extend the survey population.

Extending the survey population: The survey was sent to different groups of practitioners. A new survey link, duplicating the original, was sent to each group so that the response rates could be tracked separately. We found the contact details of the remaining 49 authors who had contributed to the original papers in the literature review and distributed the survey to them; this group had the lowest response rate, at 4.08%. A reminder email was sent to all groups halfway through. To extend the survey population even further, we applied forward snowballing: we searched for papers citing our original set of papers and contacted the corresponding authors. This group included 123 practitioners, with a response rate of 11.38%. Finally, we manually searched the websites of universities and research institutes to identify environmental researchers who apply ML. This group included 24 practitioners, and the response rate was 16.67%. The overall response rate was 9.16%, with 23 responses after the survey had been sent to 251 practitioners.

4.6 Analysis of the survey

The data collected from the survey were analysed to extract information and conceptualise knowledge in order to answer the research questions. Our study followed a mixed approach with both qualitative and quantitative questions.

Qualitative analysis

Thematic analysis (TA) was used to analyse the qualitative data collected from the survey. TA is a flexible method used to identify, analyse, and report patterns or themes within qualitative data [10]. It was used to organise and describe the data collected with the open-ended questions, which were presented as follow-up questions to each criterion.

Familiarisation: The first step was to familiarise ourselves with the responses. To do this, we extracted the data into an Excel spreadsheet and cleaned it. We removed responses that contained only placeholders such as "No comments", "No", or "None", as well as irrelevant data. The responses were then grouped according to the academic roles we defined. We went through them as a pair to understand and identify valuable insights, returning to the responses several times to familiarise ourselves with the content and grasp the initial ideas.

Initial coding: As the second step, codes were created manually for each response; we went through the responses iteratively a number of times before finalising the coding. This was done to capture the meaningful ideas behind the comments. Some of the codes we identified are "preprocessing crucial to explain", "Data to confirm credibility", and "quality for accuracy".

Defining themes: The initial codes were then reviewed, and similar codes were grouped to identify appropriate themes. This is a bottom-up approach in which themes were identified from the data we collected and the criteria we defined. The identified themes are listed together with the findings in Chapter 5.

Reviewing themes: The themes were reviewed together to check that they were distinct from each other, merging any overlapping themes or redefining them where necessary.

Defining and naming themes: The identified themes were labelled alongside the codes and the respective quotes. Both the coding and theming processes were iterative, and we managed them for each role separately to identify any role-specific patterns.

Report: The final step was to formulate the findings of the themes and link them with the findings of the literature review and the survey for each criterion. This approach allowed us to extract the essence of the findings and present the formal results in Chapter 5.
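As an illustration of the familiarisation step, the following sketch in Python/pandas (with hypothetical file and column names) removes placeholder answers and groups the remaining free-text comments by academic role, mirroring the cleaning we performed manually in the spreadsheet:

import pandas as pd

# Hypothetical export of the survey's open-ended answers with the columns
# "role", "criterion" and "comment".
responses = pd.read_excel("survey_responses.xlsx")

# Drop placeholder answers such as "No", "None" or "No comments".
placeholders = {"no", "none", "no comments", "n/a", "-"}
mask = responses["comment"].fillna("").str.strip().str.lower().isin(placeholders)
cleaned = responses[~mask & responses["comment"].notna()]

# Group the remaining comments by role for the coding step.
for role, comments in cleaned.groupby("role")["comment"]:
    print(role, len(comments), "comments kept")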
Quantitative analysis

The quantitative data generated by the Likert-scale questions were passed through a data preparation stage, where the responses were converted into numerical values (1-4). Responses were grouped by role for in-depth analysis. There were some missing values, especially among the reviewer responses, as some participants had no experience in reviewing papers and chose to skip those questions. Missing values were excluded so that they would not affect the final results.

Initially, we calculated the mean value of each criterion over the 23 collected responses and visualised the findings using a grouped bar chart, assuming that this would provide the right insights for the roles we had defined. We were able to gather some interesting information, but we then realised that a mean-based analysis could produce misleading conclusions that would distort the actual findings. Considering the number of responses and the need for a role-based analysis, we instead performed a descriptive analysis that highlights frequency distributions. With this approach, we were able to preserve the original values and use them to derive more insights.

A diverging stacked bar chart was the appropriate visualisation for this scenario. We created a chart for each role containing all the criteria defined in Table 4.2, including the number of responses for each, and another chart to evaluate the overall results across all responses regardless of role. In these charts, the responses are split, with positive responses diverging to the right and negative responses to the left. Each response option was assigned a different colour to visualise the spread; as no neutral point was included in the survey design, the charts consist of four colours. Shades of red represent the "Not important at all" and "Somewhat unimportant" options, and shades of green represent the "Very important" and "Somewhat important" options. Based on the generated output, we were able to identify patterns in the findings and relationships between different aspects and roles. The charts and their interpretations, including the role-based observations, are explained in detail in Chapter 5.
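The following sketch in Python/matplotlib (with illustrative counts for three criteria; the full charts are shown in Chapter 5) demonstrates the construction of such a diverging stacked bar chart, with the negative options stacked to the left of zero and the positive options to the right:

import numpy as np
import matplotlib.pyplot as plt

criteria = ["Data selection criteria", "Data accuracy", "Metadata"]
# Counts per option: [not important at all, somewhat unimportant,
#                     somewhat important, very important] (illustrative).
counts = {
    "Data selection criteria": [0, 0, 3, 20],
    "Data accuracy": [0, 0, 2, 21],
    "Metadata": [0, 0, 10, 13],
}
labels = ["Not important at all", "Somewhat unimportant",
          "Somewhat important", "Very important"]
colors = ["#b2182b", "#ef8a62", "#a1d76a", "#4d9221"]  # red to green shades

fig, ax = plt.subplots()
y = np.arange(len(criteria))
for i, criterion in enumerate(criteria):
    neg1, neg2, pos1, pos2 = counts[criterion]
    lab = labels if i == 0 else [None] * 4        # one legend entry per option
    ax.barh(y[i], -neg2, color=colors[1], label=lab[1])            # left of zero
    ax.barh(y[i], -neg1, left=-neg2, color=colors[0], label=lab[0])
    ax.barh(y[i], pos1, color=colors[2], label=lab[2])             # right of zero
    ax.barh(y[i], pos2, left=pos1, color=colors[3], label=lab[3])
ax.axvline(0, color="black", linewidth=0.8)
ax.set_yticks(y)
ax.set_yticklabels(criteria)
ax.set_xlabel("Number of responses")
ax.legend()
plt.show()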
4.7 Analyse and create recommendations

Through the analysis carried out using the literature review and SPADES-ML, we identified some of the challenges faced during the data pipeline process. Additional input was gathered from the qualitative analysis of the survey data, and we identified challenges from the perspectives of readers, authors, and reviewers. We selected a few high-scoring papers on SPADES-ML and identified the approaches and good practices used by their authors. This provided insight for recommendations on avoiding some of the identified challenges; thus, a connection was made between SPADES-ML, the challenges, and the recommendations. It is also important to note that the challenges were rarely discussed in detail in the reviewed papers.

Having gained a sufficient understanding of the challenges, we looked for mitigations among the solutions discussed by Munappy et al. [54] and in [4], [30], [42]. We focused on these works because they provide deep insight into how SE techniques can be applied to build a thorough methodology for data pipelines for ML in a specific application, and we were interested in investigating whether these methods can also be applied in environmental research. These insights were used to generate the initial recommendations, together with findings gathered from other literature. The recommendations were structured to address existing problems in data quality and data pipelines in environmental ML research, using the insights we can provide from the SE perspective.

5 Results

This chapter presents the results, including the analysis of the reviewed research papers using SPADES-ML, the outcome of the survey study, and the set of recommendations based on the identified challenges and findings.

5.1 Analysis of literature using SPADES-ML

As mentioned in Chapter 4, a total of 28 research papers were considered, including 26 from the paper by Konya and Nematzadeh [27] and two further studies included through forward and backward snowballing. The results for each paper, obtained after reviewing the literature based on the analysis protocol, were systematically evaluated using SPADES-ML. As stated in Table 4.2, we defined five categories in SPADES-ML, and points were allocated according to the categories and sub-criteria defined for each. Table 5.1 provides a summary of the analysis of the reviewed papers using SPADES-ML; it shows the points allocated for each category, and the last column sums the total points scored by each paper. The total column is intended for further statistical analysis only. It is not meant for comparing one study to another and drawing conclusions such as "study A is better than study B": each study has its own strengths and weaknesses, so such comparisons based on the total score are not meaningful.

The maximum score differs between the categories: C1, C3, C4, and C5 can reach 2.0, while C2 can reach 2.5. To create balance among the categories, we decided to normalise the scores to the range 0-1. Since all categories except C2 have a maximum score of 2.0, we divided their scores by 2.0, and we divided the C2 scores by 2.5. We wanted to check whether direct comparisons between the categories would otherwise lead to discrepancies. Table 5.2 shows the results after normalisation. To compare the two approaches, we sorted the research papers in ascending order of total score, as shown in Table 5.3. When we compared the two results, there were no significant differences in the order; both appear consistent with each other. We therefore decided to use the normalised scores for further analysis.
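As a small worked example of this normalisation, the sketch below (in Python/pandas, showing three papers with values taken from Table 5.1) divides each category by its maximum and sums the result. The agreement of the two orderings, which we checked by inspection, could also be made explicit with a rank correlation such as Spearman's rho:

import pandas as pd
from scipy.stats import spearmanr

# Raw SPADES-ML scores for three papers (taken from Table 5.1).
raw = pd.DataFrame(
    {"C1": [2, 2, 1], "C2": [2.5, 2.5, 2], "C3": [2, 1.75, 0],
     "C4": [2, 2, 1], "C5": [1, 2, 1]},
    index=["IV", "XIV", "XIII"],
)
max_points = pd.Series({"C1": 2.0, "C2": 2.5, "C3": 2.0, "C4": 2.0, "C5": 2.0})

normalised = raw / max_points          # each category mapped to the range 0-1
raw_total = raw.sum(axis=1)            # out of 10.5
norm_total = normalised.sum(axis=1)    # out of 5

rho, _ = spearmanr(raw_total, norm_total)   # 1.0 if the orderings agree fully
print(normalised.round(3))
print(f"Spearman rho = {rho:.2f}")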
After careful consideration, as noted in Chapter 4, we excluded papers I and XI because we were not able to use them for a detailed analysis due to the missing or generic details they contain. These exclusions did not affect the results and helped us analyse the content correctly.

Table 5.1: A summary of the analysis results

Paper ID   C1 (max 2 p.)   C2 (max 2.5 p.)   C3 (max 2 p.)   C4 (max 2 p.)   C5 (max 2 p.)   Total (out of 10.5 p.)
I          1      0      0      0      0      1
II         2      2      0      1      1.5    6.5
III        1      1.5    0      1      0      3.5
IV         2      2.5    2      2      1      9.5
V          2      2      1      0      0      5
VI         1      2      0      0.5    1      4.5
VII        2      2      1      0      0.5    5.5
VIII       1      2      0      0      1      4
IX         2      2      0      1      1      6
X          2      2.5    1.5    2      1      9
XI         0      0.5    0.75   0      1      2.25
XII        1.5    2.5    0      1      1      6
XIII       1      2      0      1      1      5
XIV        2      2.5    1.75   2      2      10.25
XV         1      0.5    1.5    1      0      4
XVI        1      2      0      1      1      5
XVII       2      2      0      1      1      6
XVIII      1      1.5    0      0.5    1      4
XIX        2      2.5    0      0      0      4.5
XX         1      1      1.75   1      0      4.75
XXI        2      2      0.25   1      2      7.25
XXII       1      1.5    0      1      1      4.5
XXIII      2      2.5    1.5    2      2      10
XXIV       1.5    0.5    0      0.5    1      3.5
XXV        2      1.5    1.25   2      1      7.75
XXVI       2      1.5    0      1      0.5    5
XXVII      0.5    2      1.5    1      1      6
XXVIII     1      1.5    0      0      1      3.5
C1 = Data selection and data suitability; C2 = Data quality requirements; C3 = FAIR; C4 = Preprocessing; C5 = Challenges during data processing.

Overview of the fulfilment of the SPADES-ML criteria

To assess how well each criterion was fulfilled by the papers we reviewed, we created a dot plot, shown in Figure 5.1, from the results of our analysis. The aim was to quantify the extent to which these papers addressed the key aspects of SPADES-ML. The y-axis lists all the elements considered in SPADES-ML, while the x-axis shows how many papers fulfilled each element. The "Fully fulfilled" column shows how many papers received the maximum score for each criterion, the papers that received half points are shown in the "Partially fulfilled" column, and the total combines the two.

[Figure 5.1: Dot plot illustrating how many of the reviewed research papers fulfilled each criterion in SPADES-ML, with one panel each for the total, fully fulfilled, and partially fulfilled counts.]

Table 5.2: Summary of the normalised results

Paper ID   C1      C2      C3      C4      C5      Total (out of 5)
I          0.5     0       0       0       0       0.5
II         1       0.8     0       0.5     0.75    3.05
III        0.5     0.6     0       0.5     0       1.6
IV         1       1       1       1       0.5     4.5
V          1       0.8     0.5     0       0       2.3
VI         0.5     0.8     0       0.25    0.5     2.05
VII        1       0.8     0.5     0.25    0       2.55
VIII       0.5     0.8     0       0.5     0       1.8
IX         1       0.8     0       0.5     0.5     2.8
X          1       1       0.75    1       0.5     4.25
XI         0       0.2     0.375   0       0.5     1.075
XII        0.75    1       0       0.5     0.5     2.75
XIII       0.5     0.8     0       0.5     0.5     2.3
XIV        1       1       0.875   1       1       4.875
XV         0.5     0.2     0.75    0.5     0       1.95
XVI        0.5     0.8     0       0.5     0.5     2.3
XVII       1       0.8     0       0.5     0.5     2.8
XVIII      0.5     0.6     0       0.25    0.5     1.85
XIX        1       1       0       0       0       2
XX         0.5     0.4     0.875   0.5     0       2.275
XXI        1       0.8     0.125   0.5     1       3.425
XXII       0.5     0.6     0       0.5     0.5     2.1
XXIII      1       1       0.75    1       1       4.75
XXIV       0.75    0.2     0       0.25    0.5     1.7
XXV        1       0.6     0.625   1       0.5     3.725
XXVI       1       0.6     0       0.5     0.25    2.35
XXVII      0.25    0.8     0.75    0.5     0.5     2.8
XXVIII     0.5     0.6     0       0       0.5     1.6
C1 = Data selection and data suitability; C2 = Data quality requirements; C3 = FAIR; C4 = Preprocessing; C5 = Challenges during data processing.

The papers focused most on criteria such as the mention of data selection criteria and the provision of reasons for data suitability. All attributes were prioritised when considering data quality, but completeness, currentness, and consistency received the most attention. Most of the papers poorly address aspects of the FAIR principles: only 11 papers out of 26 provided raw data. Although the majority of papers provide details of the preprocessing stage and the challenges encountered, fewer papers provide reproducible preprocessing steps or solutions to the problems faced. Overall, the analysis revealed a significant shortfall in the provision of reproducible information to other researchers. These findings emphasise the need for clearer guidelines for ML-based environmental research.
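For transparency, the fulfilment counts behind the dot plot can be derived mechanically from the recorded scores. The sketch below (in Python/pandas, with illustrative scores for one sub-criterion and hypothetical paper labels) counts how many papers received full points and how many received half points:

import pandas as pd

# Illustrative scores for the sub-criterion "Mention of data selection
# criteria" (maximum 1 point) for five hypothetical papers.
scores = pd.Series([1.0, 1.0, 0.5, 0.0, 1.0],
                   index=["P1", "P2", "P3", "P4", "P5"])
max_points = 1.0

fully = int((scores == max_points).sum())
partially = int((scores == max_points / 2).sum())
print(f"Fully fulfilled: {fully}, partially fulfilled: {partially}, "
      f"total: {fully + partially}")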
Table 5.3: Comparison of the normalised and non-normalised approaches (papers sorted in ascending order of total score)

ID        Normalised total (out of 5 p.)   Not normalised total (out of 10.5 p.)
I         0.5        1
XI        1.075      2.25
XXVIII    1.6        3.5
III       1.6        3.5
XXIV      1.7        3.5
VIII      1.8        4
XVIII     1.85       4
XV        1.95       4
XIX       2          4.5
VI        2.05       4.5
XXII      2.1        4.5
XX        2.28       4.75
V         2.3        5
XVI       2.3        5
XXVI      2.35       5
VII       2.55       5.5
XIII      2.55       5.5
XII       2.75       6
IX        2.8        6
XVII      2.8        6
XXVII     2.8        6
II        3.05       6.5
XXI       3.43       7.25
XXV       3.73       7.75
X         4.25       9
IV        4.5        9.5
XXIII     4.75       10
XIV       4.88       10.25

Analysis based on published year

The research papers considered in the literature review were published between 2017 and 2024. In our analysis, we wanted to explore whether there is an association between the time of publication and the score in each category of SPADES-ML. To investigate whether the research papers have evolved over time, we analysed the scores with respect to their publication year. As 2022 has the highest number of published papers, we treated it as the middle point and formed year groups containing roughly the same number of papers as the year 2022. Figure 5.2 shows the distribution of scores for the total score and for the individual criteria (C1 to C5) of SPADES-ML across three publication-year groups: 2017-2021, 2022, and 2023-2024.

There is a clear upward trend in total scores for more recent publications: both the median and the mean have increased from the earlier years (2017-2021) to the most recent period (2023-2024), although the uncertainty is high for 2023-2024. The overall quality also appears high in 2022. These results suggest an improvement in publication quality over time.

FAIR principles (C3) scores have improved over time: the years 2017-2021 show low median and mean scores, which have since increased. This suggests that researchers have recently started to pay more attention to data accessibility and reproducibility. However, the mention of challenges and the provision of reproducible solutions (C5) shows no improvement over time, suggesting that, although other aspects have improved, researchers have not devoted much attention to preprocessing challenges.
[Figure 5.2: Score distribution based on published year. Box plots (a)-(f) show the total score and the C1-C5 score distributions per publication-year group, each with mean, median, and 90% and 95% confidence intervals. n(2017-2021): 7, n(2022): 14, n(2023-2024): 5. C1 = Data selection and data suitability; C2 = Data quality requirements; C3 = FAIR; C4 = Preprocessing; C5 = Challenges during data processing.]

Analysis based on environmental field

The papers were categorised by environmental field to observe any patterns. There were eight categories, as shown in Table 5.4. These environmental fields were initially taken from Table 2 (Appendix C, Table C.1) of Konya and Nematzadeh [27]. Furthermore, to maintain consistency, we refined the environmental fields identified in the literature review to align them with those used in the "Demographics" section of the survey, shown in Figure 5.7.

Table 5.4: Categorisation by environmental fields

Environmental field                       Paper ID
Air Quality and Atmospheric Science       VI, XIII, XXII
Water Quality Monitoring *                XXIV, XIX
Marine and Aquatic Science *              XIV
Energy Forecasting and Management         V, XXVIII, XXIII, XII, II, XXVI, XV, XX
Climate modelling                         VII, III, XXVII
Biodiversity & Ecosystem Monitoring *     IV, XVI
Hydrological Modeling                     XVIII, VIII, X
Waste management                          XXV, XXI, XVII, IX

Figure 5.3 shows the distribution of scores across the different environmental fields. As stated earlier, groups were considered for the statistical analysis only if they contained at least three papers. We therefore had to remove the categories "Water Quality Monitoring", "Marine and Aquatic Science", and "Biodiversity & Ecosystem Monitoring", as they contained two, one, and two papers, respectively.

The distribution of total scores in waste management shows higher median and mean scores, suggesting stronger overall quality. In contrast, air quality and atmospheric science shows the lowest mean and the least variability, indicating consistently low scores among these papers. Hydrological modelling shows more variability with a lower median, indicating that most papers score low, although a few papers scored high across all categories.
[Figure 5.3: Score distribution based on environmental field. Box plots (a)-(f) show the total score and the C1-C5 score distributions per field, each with mean, median, and 90% and 95% confidence intervals. n(Air): 3*, n(Energy): 8, n(Climate): 3*, n(Hydrological): 3*, n(Waste): 4*. C1 = Data selection and data suitability; C2 = Data quality requirements; C3 = FAIR; C4 = Preprocessing; C5 = Challenges during data processing. *Below the standard threshold of 5.]

Energy forecasting and management and waste management scored highly in the C1 category, data selection and suitability, which indicates the presence of clearly defined data selection criteria and proper arguments for data suitability. All environmental fields performed well in fulfilling the data quality attributes (C2), which highlights the researchers' focus on addressing these attributes.

Air quality and atmospheric science and waste management score low on the C3 criteria of the FAIR principles, which may be due to accessibility concerns relating to security. The higher variation in energy forecasting and management indicates greater uncertainty in that field. Most fields exhibit a similar scoring pattern for C4 (data preprocessing and reproducibility), while C5 (mention of challenges identified in preprocessing and provision of solutions) scores lower across all fields.
Analysis based on ML techniques used

Out of curiosity, we also categorised the papers based on the ML techniques used in the studies. The ML techniques used in each paper were identified from the literature review and then grouped into general ML categories. As shown in Table 5.5, most papers use several techniques belonging to different general categories. We considered two options: duplicating a paper across all relevant categories, or identifying a primary technique and sorting the paper into that category. Option one would increase the sample size for each ML category but might introduce confounding bias; if a primary technique could be identified for each study, option two would be preferable. After careful consideration, we selected the primary technique based on the algorithm with the best performance in each study. The selected primary technique of each study is marked in bold in Table 5.5. The "Other" category was used when the primary technique did not belong to any of the other categories.

Table 5.5: Categorisation by ML techniques used

ML Category                      ML Technique                                                   Paper ID
Neural Networks (NN)             SNN, ELM                                                       VI
                                 ANN                                                            XXVIII
                                 RNN+CNN                                                        X
                                 MLPR                                                           XIII
                                 LSTM                                                           XIX
                                 PGNN, NN, HPML                                                 XII
                                 RSML                                                           XXI
                                 SVM, ANN                                                       VIII
Support Vector Machines (SVM)    SVM, MLR, GBR, XGB                                             XXII
                                 SVM, MLP, ELM                                                  II
                                 SVM, ANN, KNN, MLR                                             VII
Image processing                 YOLOv5                                                         XIV
                                 YOLOv5, Computer vision                                        XVII
                                 YOLOF, SOLOv, Computer vision                                  IX
Ensemble Learning                GTB, SVM, Random Forest                                        V
Techniques (ELT)                 ET, DT                                                         XVI
                                 BRT, SVM                                                       IV
                                 Random Forest, SVM, ANN                                        XXIII
                                 Random Forest                                                  XXVII
                                 Random Forest, SVM, ANN                                        XVIII
                                 Conditional Random Forest, Univariate Non-linear Regression    XXV
                                 Random Forest                                                  III
                                 NN, XGBoost, LightGBM                                          XV
Other                            SVM, VMD, ELM, CEEMDAN, VMD-CEEMDAN-ELM                        XXIV
                                 ML-based regression model                                      XXVI
                                 SVM, XGBoost, MARS, RR                                         XX

Figure 5.4 shows the distribution of the total and criterion-wise (C1-C5) scores across the different ML-technique categories. Most ML techniques achieve relatively high C1 scores, indicating sound data selection and suitability arguments. Image processing has a high mean and median with minimal variability, reflecting clearly defined data selection criteria. The overall score for C2 is comparatively high, and the ML techniques in the "Other" category appear to address data quality requirements less effectively than the other four categories.

[Figure 5.4: Score distribution based on ML techniques used. Box plots (a)-(f) show the total score and the C1-C5 score distributions per ML category, each with mean, median, and 90% and 95% confidence intervals. n(NN): 8, n(SVM): 3*, n(Image processing): 3*, n(ELT): 9, n(Other): 3*. C1 = Data selection and data suitability; C2 = Data quality requirements; C3 = FAIR; C4 = Preprocessing; C5 = Challenges during data processing. *Below the standard threshold of 5.]

Analysis based on type of data format

During our literature review, we identified several common data formats used in the studies. The majority of the papers used time-series data, while many used a mix of time-series, text, numerical, and multimedia data; a few studies used only multimedia data. The categorisation by data format is shown in Table 5.6.

Figure 5.5 shows box plots of the total scores and the scores for the individual categories, grouped by data format. The total score distribution shows that the multimedia data format has the highest mean and median score, indicating better performance. This may be because the papers using multimedia data (e.g., in marine and aquatic science and waste management) properly address the criteria we defined in SPADES-ML. In C1, the time-series and mixed data formats performed similarly, while multimedia data showed more promising performance in this data suitability category, with a higher median. In C2, the median values of all three data formats are similar, but the variability is low for multimedia, indicating that the data quality attributes have been well addressed.
The median values are quite low in C3, with high variability across all data formats. Although the median is low for both time-series and multimedia data, the variability is lower for multimedia data, suggesting more consistent scores on the FAIR criteria. The median values in C4 (preprocessing) are similar, but multimedia data performs better than the other two formats. The results for C5 are similar across formats, with the mixed data format showing more variability and therefore more uncertainty.

Table 5.6: Categorisation by data format

Data format    Paper ID
Time-Series    VI, XXVIII, X, XXII, XX, II, XXIV, XXVI, XIII, XIX, XXIII, XXVII, XII, VII
Mixed          III, XXI, XV, V, XVI, IV, VIII, XVIII, XXV
Multimedia     XIV, XVII, IX
[Figure 5.5: Score distribution based on data format. Box plots (a)-(f) show the total score and the C1-C5 score distributions per data format, each with mean, median, and 90% and 95% confidence intervals. n(Time-Series): 14, n(Mixed): 9, n(Multimedia): 3*. C1 = Data selection and data suitability; C2 = Data quality requirements; C3 = FAIR; C4 = Preprocessing; C5 = Challenges during data processing. *Below the standard threshold of 5.]

Analysis based on the number of data sources

The majority of studies used a single data source. A few others used multiple data sources, including public portals and private data collected from various organisations. The categorisation is shown in Table 5.7.

Table 5.7: Categorisation by the number of data sources

Number of data sources   Paper ID
Single                   VI, XXII, XXIV, XIX, XIV, V, XXIII, XII, II, XXVI, VII, XX, XXVIII, VIII, XXV, XVII, IX, X
Multiple                 XV, XXVII, IV, XVI, XVIII, XIII, XXI, III

As Figure 5.6 shows, the distribution of total scores is quite similar for single and multiple data sources, with only a slight advantage for single data sources. However, for the C1 score there is a considerable difference: single data sources have higher median and mean values, indicating good performance, but with large variability, indicating uncertainty. Papers with multiple data sources scored lower on data selection and suitability because several sources had to be validated, and some did not include the required arguments; validating a single source was more straightforward. In C2, the medians of both groups are quite similar, but the uncertainty for single data sources is high. C3 has a low median for both groups, with high variability; multiple sources show more variability than single sources, indicating higher uncertainty. Variability is low for C4 among papers with multiple data sources, where most papers received partial scores. For C5, the variability is also low among papers with multiple data sources, but the uncertainty is higher compared with single data sources.
[Figure 5.6: Score distribution based on number of data sources. Box plots (a)-(f) show the total score and the C1-C5 score distributions for single and multiple data sources, each with mean, median, and 90% and 95% confidence intervals. n(Single): 18, n(Multiple): 8. C1 = Data selection and data suitability; C2 = Data quality requirements; C3 = FAIR; C4 = Preprocessing; C5 = Challenges during data processing.]

5.2 Analysis of Survey responses

As the survey consisted of both closed- and open-ended questions, the survey data were analysed using both quantitative and qualitative methods. Not all of the 23 respondents answered the open-ended questions, as these were added as follow-up questions to a series of closed-ended questions targeting the SPADES-ML categories. As we asked participants to answer the questions from the perspective of three different academic roles (reader, author, and reviewer), a few participants also chose not to answer some closed-ended questions that were not relevant to them.

Survey respondent demographics

This subsection presents the demographic information we collected from the survey respondents. It helped us to identify their professional diversity, their contribution to ML-based environmental research, and how they usually obtain data for their research.

[Figure 5.7: Distribution of environmental fields of the respondents.]

Figure 5.7 shows the variety of experience of the respondents across different environmental fields. We used eight environmental fields for categorisation; the remaining three fields (urban water management, agriculture, and petroleum modelling) were provided by the respondents themselves.

Figure 5.8 illustrates the distribution of years of experience the participants have as authors of environmental research. All participants have worked for more than one year, and the majority have more than six years of experience. Another aspect we wanted to investigate was whether these authors regularly review research papers for journals or conferences. Figure 5.9 shows the distribution: a significant number of participants, 19 in total, regularly review papers.

[Figure 5.8: Distribution of the years of experience as an author in environmental research.]
[Figure 5.9: Distribution of paper review frequency.]

Of these 19 participants, 8 have more than 10 years of reviewing experience; the distribution can be found in Figure 5.10.

[Figure 5.10: Distribution of the years of experience as a reviewer.]

These demographic questions were designed to highlight the participants' experience of, and active involvement in, authoring and reviewing environmental research. We then examined how participants access the data they use for their research. The main objective was to understand whether they primarily produce their own data, as this would minimise communication issues with external data producers and ensure a thorough understanding of the requirements and capabilities of their data sources. Out of 23 participants, 17 identified as data producers. Figure 5.11 presents the distribution of these responses.

[Figure 5.11: Distribution of data-producing experience.]

Figure 5.12 analyses this aspect further by distinguishing between participants who primarily generate their own data and those who mostly rely on data produced by external parties. The majority of participants indicated that they regularly use both self-produced and externally produced data, suggesting that they explore all possibilities for obtaining high-quality data.

[Figure 5.12: Distribution of the origin of data.]

Analysis of survey results - closed-ended questions

Quantitative analysis was performed on the responses to the survey's closed-ended questions. The Likert-scale questions designed around SPADES-ML were analysed according to the different roles we defined. The 23 responses were analysed based on the importance ratings provided by the respondents, and the mean value of each criterion was calculated to obtain an overall value. Figure 5.13 shows a grouped bar chart of the mean importance of each criterion per role.

The importance of mentioning data selection and suitability criteria was highlighted in the responses. Data accuracy, consistency, and credibility were identified as the main data quality requirements prioritised across all three roles. The author role gave higher scores than the reader and reviewer roles across several criteria, such as the importance of metadata, using common file formats, providing a replication package, and providing reproducible preprocessing steps. This may be due to authors' awareness of, and responsibility for, managing the data and reporting the details of the research carried out. The reviewer role tends to show slightly lower interest in some areas than the other roles: access to metadata and raw data, and details of reproducible solutions to the challenges encountered during the research, are criteria that reviewers rated comparatively low.
Upon closer inspection, we concluded that visualising Likert-scale data with mean scores alone would not be an effective approach in our scenario: important information and insights would be lost, and some findings could be misinterpreted.

[Figure 5.13: Grouped bar chart of the mean importance ratings of the evaluation criteria by role.]

Instead, we chose a visualisation that shows the actual ratings for each criterion, allowing us to see their relative importance. A diverging stacked bar chart is used, with "important" responses diverging to the right and "unimportant" responses diverging to the left; as there is no neutral value, no middle ground is visualised. We visualised the response data for each role and for all roles combined. For this visualisation, we considered only the set of questions common to all three roles, which were repeated per role.

Figure 5.14 shows the importance of each criterion from the reader's perspective. Readers mainly focused their attention on the following:
• Mention of data selection criteria
• The need for accurate data
• The importance of providing metadata
• Long-term availability of research data, to ensure that it can be accessed and reused in future research
• Details about the preprocessing steps, including the reasons for them

[Figure 5.14: Importance rating as a reader. Diverging stacked bar chart over all criteria, with the response options "Not important at all", "Somewhat unimportant", "Somewhat important", and "Very important".]

In the readers' section, we included a question specific to the readers' views, to gauge the importance of making self-collected data publicly available to increase credibility. Many researchers agreed with this approach, either fully or partially.
However, only one researcher thought this was somewhat unimportant for the credibility of the data. Figure 5.15 shows the visualisation of this result.

[Figure 5.15: Importance of making self-collected data publicly available to increase the credibility of the data.]

Figure 5.16 shows the importance of each criterion from the author's perspective. Authors focused their attention mainly on the data quality attributes of accuracy, currentness, consistency, and credibility. They also fully agree on the importance of providing reasons for the suitability of the data used in the research, as well as providing access to the raw data, and they strongly favour providing a replication package to ensure reproducibility and providing reproducible preprocessing steps. They slightly disagreed on the importance of using commonly supported data/file formats and of data completeness.

[Figure 5.16: Importance rating as an author. Diverging stacked bar chart over all criteria, with the response options "Not important at all", "Somewhat unimportant", "Somewhat important", and "Very important".]

We also included a specific question in the authors' section to gauge the importance of being able to communicate with data producers before retrieving data. Such communication allows authors to gain useful insights into the fields necessary for their work and to check whether the data matches their requirements. Twenty researchers identified this as useful, and only three replied that it was somewhat unimportant. Figure 5.17 shows the visualisation of this result.

[Figure 5.17: Importance of communicating with data producers before retrieving data.]

From a reviewer's perspective, the emphasis is on data quality attributes similar to those of the authors, but reviewers mainly consider data accuracy, currentness, completeness, and credibility. As they need to assess the relevance and temporal resolution of the data when reviewing papers, these attributes are prioritised. They also emphasised the importance of metadata. Figure 5.18 displays the distribution of importance ratings assigned by the participants when considering their role as reviewers.
[Figure 5.18: Importance rating as a reviewer. Diverging stacked bar chart over all criteria, with the response options "Not important at all", "Somewhat unimportant", "Somewhat important", and "Very important".]

[Figure 5.19: Importance rating considering all roles. Diverging stacked bar chart over all criteria and all responses, regardless of role.]

Figure 5.19 shows the responses from all participants, regardless of their role, providing a general view of the aspects in SPADES-ML. The highest importance is given to data accuracy, which every respondent considers important. Data credibility and the importance of metadata were also prioritised; only one person disagreed about the importance of these aspects. Providing reproducible solutions to processing challenges, and highlighting these challenges, was still seen as less of a priority than the other criteria.

Analysis of survey results - open-ended questions

Thematic analysis of the responses to the open-ended questions revealed six main themes related to the categories defined in SPADES-ML. These themes reflect the priorities of environmental researchers and the challenges they encounter when processing environmental science data.
Analysis of survey results - open-ended questions

Thematic analysis of the responses to the open-ended questions revealed six main themes related to the categories defined in SPADES-ML. These themes reflect the priorities of environmental researchers and the challenges they encounter when processing environmental science data. The following themes were derived from participant responses, along with supporting evidence. Only the most important quotes are listed here; all quotes relating to each theme can be found in Appendix B.

Theme 1: Transparency and justification of data selection

This theme emphasises the importance of researchers justifying their data selection criteria, particularly with regard to suitability, to ensure relevance and quality. The risk of introducing bias was also raised, underlining the importance of transparency and justification in data selection.

Reader perspective
R3: "Very important to select/use data that is acceptable and reliable since some data may misguide the prediction tool"
R9: "Data availability is often the only factor for selecting a data source due to limited availability."

Author perspective
A3: "Important to ensure model accuracy, improves model interpretability, supports reproducibility"
A6: "Sometimes, data availability is the problem, so we don't have the luxury of choosing"

Reviewer perspective
V3: "Clear data selection criteria, relevant to the problem being solved. Important to check target population, format, quality, and completeness."
V7: "As a reviewer, I'm always looking for an answer to 'why'. If data selection is anything but random, I look for possible sources of biases and what was done to mitigate them or how the authors acknowledge it later on."

From the reader's perspective, researchers should focus on providing clarity regarding the origin of the dataset and offering a proper explanation in order to build trust. Since the accuracy of the model depends on the suitability and relevance of the data, readers consider it important to have details of the data selection process and an assessment of the data's suitability. From both the reader's and the author's perspectives, it was highlighted that some researchers may have to prioritise availability when selecting a data source due to limitations in their field of research. From a reviewer's perspective, the reason for selecting the dataset is often investigated to ensure that no bias has occurred and that the reasoning is justified.

Theme 2: Data quality and reliability

The importance of data quality requirements was emphasised in this theme, with participants focusing on data trustworthiness. They emphasised that achieving quality attributes such as accuracy and credibility is essential, since poor-quality data can negatively impact modelling outcomes.

Reader perspective
R6: "The data should be recorded by a credible source, the uncertainty of any type while collecting the data should be provided. Removing outliers should be done by experts not automated. Consistent measurement is crucial"
R9: "Data accuracy and consistency are the most important factors, especially when using ML, as gaps can be filled, but bad data makes this tough to do."

Author perspective
A3: "High-quality data leads to more accurate predictions and saves time and resources. Very important to check some data that has incomplete information, which may lead to inaccurate models."
A6: "Bad data, bad results."

Reviewer perspective
V3: "Important to see high-quality data, such as accuracy, completeness, consistency"

Data quality remains a common concern for all roles, especially since negligent handling of it directly affects the results.
The main quality attributes highlighted are accuracy, completeness, consistency, and credibility. However, in some fields, such as waste management, data from even the most credible sources can be unreliable. Readers highlight the importance of clear definitions and standardisation, along with the need for a credible source. Authors focus on the practical implications and consequences that arise when data quality is not addressed properly.

Theme 3: Data accessibility and constraints

This theme highlights the importance of data access and discusses practical accessibility challenges. In order to verify research findings or use research data as the basis for new research, other researchers need to be able to access the data and information about it.

Reader perspective
R3: "Many real-time application does not have data accessibility"
R6: "It's one of the most important steps for the credibility of the research. For the research to be conducted correctly, the same results should be obtained by any researcher who is following the same steps as in the paper, thus without data, that cannot be confirmed"

Author perspective
A3: "Easy access to data allows to quickly explore, preprocess, and train models without delays"
A6: "Data is the only way to verify the creditability of the research"

Reviewer perspective
V5: "Usually, I have no time to review a paper to the level of analysis replication"
V7: "This is quite an idealistic perspective. I don't personally believe that all data needs to be accessible at the time of review, but it is needed for reproducibility. I deem access very important, only to ensure reproducibility."

From a reader's perspective, access to raw or processed data is important, as readers need to be able to validate the results before using them in their own research. Without access to the data, nothing can be confirmed or reproduced. However, as mentioned, data accessibility is often not guaranteed in real-time applications. To ensure the credibility of their research, authors tend to make their data accessible. The reviewers emphasised that they do not have enough time to replicate the findings within the limited time available for reviewing a paper.

Theme 4: Transparency in pre-processing

Transparent documentation of the preprocessing steps was considered essential. Across all roles, the importance of clear, detailed and reproducible preprocessing steps for ensuring data quality and model accuracy was emphasised.

Reader perspective
R6: "Sometimes, data preprocessing is more important than the modeling step itself. It is crucial to explain the process"
R9: "Its important to provide these reproducible steps / code, but I find contacting the authors and starting a dialogue is often the best way to truly understand what they've done."

Author perspective
A3: "Very important in cleaning, transforming, and organizing raw data into a format that can be used by ML models"
A9: "Much of the data in waste management is poor and we need to provide justification as to why these incorrect data points should be removed from the analysis"

Reviewer perspective
V3: "Important to make sure the raw data is clean and usable format"

The readers highlight the critical role that transparency plays in data preprocessing, considering it to be more important than the modelling itself. They emphasise that preprocessing is the most important part and that, without it, the entire objective would be missed.
Readers also value direct communication with the authors in order to gain deeper insights and understand the research properly. The authors' perspective provides practical guidance on preparing data for ML models, emphasising its importance for achieving accurate results. They highlight the importance of justifying preprocessing steps to give readers a clear understanding of the reasons behind them. The reviewers focus on ensuring that the data used for modelling is in a usable condition.

Theme 5: Challenges in data processing

This theme highlights that researchers face challenges in the preprocessing phase, arising from various causes such as noise, missing values, and difficulties in integrating data.

Reader perspective
R3: "Incomplete datasets with missing values can skew the model's understanding, data might contain errors, outliers, or irrelevant information, ML models can't directly handle non-numeric data"

Author perspective
A7: "There are way too many challenges to be document everything. It could be useful to include, maybe as comments in code, but not sure how else one would include these."

Reviewer perspective
V3: "Some issues should be mentioned e.g., missing or inconsistent data, noise, high dimensionality, class imbalance, and difficulties in integrating data from multiple sources, etc."

The focus in all three roles is on the challenges encountered in preprocessing, such as missing values, outliers, and incorrect data. Although the authors recognised that including every detail in a paper would not be practical, they emphasised the importance of providing information on how these issues were resolved. The recurring challenges they face underline the need for a proper data pipeline in ML-based environmental research.

Theme 6: Institutional standards and practices

This theme highlights how adherence to formal regulations and committees shapes data practices and the measures taken to ensure data quality.

Reader perspective
R1: "Remember that 'Monitoring data' always need to meet criteria defined by the agencies / EU. Therefore the selection of monitoring data is not a big issue because they already (mostly) meet quality criteria."
R5: "Embargo timing could be needed to protect ongoing publishing actions and MSc-PhD Theses"

Author perspective
A4: "Scientific evaluation committees must take into consideration the value of publishing datasets. Otherwise, the scientific community won't publish them."

This theme shows that readers can rely on trustworthy validation and need not worry about certain aspects, as these are automatically fulfilled by adhering to the standards. Readers are aware that some research data will be made available with a delay in order to protect future interests. The authors emphasised the importance of publishing datasets and of reaching an agreement on this within the scientific community.

5.3 Findings on challenges and recommendations

This section presents the challenges identified during the study and initial recommendations that could mitigate those challenges. Table 5.8 provides an overview of the challenges we identified in the data selection and preparation stages of ML-based environmental research. The "Source" column indicates the step at which the challenge was identified. The "Severity" column indicates how severe each identified challenge is, based on the findings gathered from the SPADES-ML analysis. If a challenge was barely addressed in the analysed research papers, meaning in fewer than 10 papers, we classified it as high severity. If it was addressed in more than 20 papers, we classified it as low severity. Anything in between falls under medium severity. The "Notes" column explains the challenge.
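This classification rule can be stated compactly as code. The following is a minimal sketch with the thresholds taken directly from the rule above; the function name is ours and purely illustrative.

```python
# Minimal sketch of the severity rule: a challenge addressed in fewer than
# 10 of the analysed papers is High severity, in more than 20 papers Low,
# and anything in between Medium.
def challenge_severity(papers_addressing: int) -> str:
    if papers_addressing < 10:
        return "High"
    if papers_addressing > 20:
        return "Low"
    return "Medium"

assert challenge_severity(5) == "High"
assert challenge_severity(15) == "Medium"
assert challenge_severity(25) == "Low"
```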
Table 5.8: Challenges identified in data selection and preparation for the application of ML in environmental disciplines

A1: Unclear handling of missing/null data (Source: Literature review / Survey; Severity: Medium)
Notes: Author 9: "Much of the data in waste management is poor and we need to provide justification as to why these incorrect data points should be removed from the analysis"

A2: Lack of clarity in handling noisy data (Source: Literature review / Survey; Severity: Medium)
Notes: "High-frequency fluctuations in time-series data (e.g., meteorological and pollutant concentration data) could introduce noise." [17]; Author 3: "Some data might have incomplete information, irrelevant, incorrect, or random data points, or multiple records of the same data"

A3: Unclear handling of high dimensionality and redundancy (Source: Literature review; Severity: Medium)
Notes: "Dataset having a large number of input variables (plants and meteorological data), leading to high dimensionality and potential redundancy due to correlations among variables" [14]

A4: Lack of clarity in handling datasets that use different scales and metrics (Source: Literature review; Severity: Medium)

B: Difficulty locating metadata about data and documentation (Source: SPADES-ML; Severity: High)
Notes: Many datasets lacked sufficient metadata, making it difficult for researchers to understand important details about the data.

C: Lack of details for replication and future use (Source: SPADES-ML; Severity: High)
Notes: Many research papers do not provide all the information needed to replicate the results. Access to raw data and scripts with all the necessary code, including documentation, is needed.

D: Difficulty locating data used when it is available on public portals (Source: SPADES-ML; Severity: Low)
Notes: Sometimes, only the names of the public portals from which data were acquired were mentioned, without specific details.

E: Difficulty locating proper details about the data pipeline (Source: SPADES-ML; Severity: Medium)
Notes: The data pipeline process was rarely described in full, including all the necessary steps and transformations.

F: Unclarity of data quality requirements (Source: SPADES-ML; Severity: Medium)
Notes: Details about data quality were often not mentioned, while attention focused on model performance.

G: Lack of details to ensure data suitability when multiple data sources were used (Source: SPADES-ML; Severity: Low)
Notes: Sometimes, when data were collected from multiple sources, the same level of detail was not given for data suitability.

H: Ensuring credibility of a dataset (Source: SPADES-ML / Survey; Severity: Medium)
Notes: Reader 6: "Data should be recorded by a credible source, the uncertainty of any type while collecting data should be provided"

I: Restricted access to data in real-time applications (Source: Survey; Severity: Medium)
Notes: Special permission may be needed to access some data, and such restrictions often demotivate researchers. Reader 3: "Many real-time applications do not have data accessibility"
J: The most credible data in the worst quality (Source: Survey; Severity: Low)
Notes: Reader 9: "In waste management, often the most credible data (eg., from national governments) is the worst quality as they have no means to measure it at a national level. Data accuracy and consistency are the most important factors, especially when using ML, as gaps can be filled, but bad data makes this tough to do"

K: Limited data availability in some environmental fields (Source: Survey; Severity: Medium)
Notes: Reader 9: "Data availability is often the only factor for selecting a data source due to limited availability."; Author 6: "Sometimes, data availability is the problem, so we don't have the luxury of choosing"

L: Dirty data: noisy, missing, or invalid entries (Source: Literature review / Survey; Severity: High)
Notes: Reader 3: "Incomplete datasets with missing values can skew the model's understanding, data might contain errors, outliers, or irrelevant information, ML models can't directly handle non-numeric data"; Author 3: "Very important to check some data that has incomplete information, which may lead to inaccurate models. Can be overcome by removal, or using algorithms that handle missing values"

We have provided, in Table 5.9, some initial recommendations for possible methods and practices to tackle the challenges identified above. These recommendations are based on four sources: good practices identified in our analysis of papers using SPADES-ML (L), analysis results from the survey (S), the work of Munappy et al. (M) [53], [54], [4], [41], [42], [40], and other literature (O). The guidelines are discussed following the structure of SPADES-ML, and the "Code" column references the corresponding SPADES-ML criterion.

All of the proposed recommendations are valuable in addressing one or more of the identified challenges, but they differ in the effort required for implementation. The "Effort" column was therefore roughly determined based on the time required to complete the task. Recommendations that introduce new practices or tools unfamiliar to the environmental research domain are categorised as "High" effort, as more time will be required for training and onboarding. "Low" effort was allocated to recommendations that could be implemented within a few hours. Anything requiring more time was allocated "Medium" effort. The "Reference to challenge" column lists the challenges that are partially or fully addressed by each recommendation.

Table 5.9: Guidelines for future ML-enabled research (format per row: Code | Guideline | Examples in environmental science | Effort | Reference to challenge | Source)

C1: Data selection & data suitability
1.1 | Provide a clear study scope, i.e. the reason for selecting the data source(s) | Location, depth | Low | G (partially) | L, S, M [40]
1.1 | Provide justification when particular data selections are made | Time range, frequency | Low | G (partially) | L, S
1.1 | Define clear criteria for ground truth data sampling | Manually verified or directly observed data | Medium | G (partially) | L
1.1 | Mention any thresholds or criteria that were used to include/exclude samples | | Low | G (partially) | L
1.1 | Manage proper communication with data collectors when applicable | | Medium | A4 (partially) | S, O [44], M [40]
1.2 | Clearly state the practical or scientific question the data are intended to address | | Low | G (partially) | L
1.2 | Explain how the data represents a diversity of conditions | Different locations, lightings | Medium | G (partially) | L
1.2 | Justify the relevance of the dataset to the specific case | | Low | G (partially) | L
1.2 | Mention the data source and its credibility | | Low | H (fully) | L, S
1.2 | Ensure measurements and variables capture relevant details | | Medium | G (partially) | L

C2: Data quality requirements
2.1 | Clearly describe methods used to identify and remove erroneous or unlikely values | | Medium | F (partially) | L
2.1 | Use expert-verified annotations where possible | | Medium | F (partially) | L
2.2 | Consider multiple states/conditions to ensure coverage | Sites, periods | Medium | F (partially) | L
2.2 | State the proportion of missing data and how it was handled | Deletion, imputation | Medium | F (partially) | L, S, M [4]
2.2 | Justify if certain variables were excluded and explain their absence | | Medium | F (partially) | L
2.3 | Use reputable data sources where possible | | Medium | F (partially) | L, S
2.3 | Make the data openly accessible when possible | | Medium | F (partially) | L, S, O [44]
2.3 | Describe data collection instruments | Sensors | Low | F (partially) | L
2.4 | Clearly define the data collection period | | Low | F (partially) | L
2.4 | Justify how the data's age matches the research objective and clearly indicate if parts are outdated | Seasonal tracking | Low | F (partially) | L
2.4 | Mention any temporal gaps and their relevance to the study | | Low | F (partially) | L
2.5 | Maintain uniform settings, annotation protocols, formatting and categorisation standards throughout the dataset | Sensor configurations | Medium | F (partially) | L
2.5 | Describe how inconsistencies (e.g., contradictory values from different sources) were handled | | Medium | F (partially) | L
2.1/2.2/2.5 | Data Linter: a new class of tools that automatically inspects ML datasets to identify potential issues in the data (see the sketch after this table) | | High | J (partially) | M [41]

C3: FAIR
3.1 | Provide structured metadata describing variables, timeframes, equipment, and locations | | High | B (partially) | L, S, M [41]
3.1 | Link datasets to persistent identifiers (DOIs, ORCIDs) | | Medium | C (partially) | L
3.1 | Link metadata in the paper or in supplementary material | | Low | B (partially) | L
3.2.1 | Clearly state where raw data can be accessed (include DOI, URL, or API) | | Low | D (fully) | L
3.2.2 | Share processed outputs | | Medium | C (partially) | L
3.2 | Clearly state access conditions (open, restricted, institutional) | | Low | J (partially) | L
3.3 | Use open, non-proprietary file formats (CSV, JSON) | | Low | C (partially) | L, S
3.4.1 | Use institutional or domain-specific repositories that ensure long-term access | | Low | C (partially) | L
3.4.1 | Use version-controlled platforms (e.g., GitHub) | | Medium | C (partially) | L
3.4.2 | Provide a replication package including data, scripts, models, and instructions to reproduce | | High | C (partially) | L, S

C4: Data Preprocessing
4.1 | Data cleaning and scrubbing: a procedure to modify or remove incomplete, incorrect, inaccurately formatted, or repeated data in a dataset, and data imputation to solve the missing data problem | | High | L (fully) | L, S, M [41]
4.1 | Clearly document all preprocessing steps, including removed data (nulls, duplicates, outliers, temporal exclusions), transformed or filtered data (normalisation, scaling), and feature engineering techniques (derived or composite variables) | | Medium | A1, A2, A3 (fully); A4, E (partially) | L, S, M [41]
4.1 | Describe why each step was necessary (according to the domain knowledge) | | Low | C (partially) | L, S
4.2 | Provide open access to source code or notebooks, annotation tools and formats, and step-by-step documentation to replicate results | | Medium | C, E (partially) | L, S
4.2 | Include enough detail to rerun the pipeline from raw input to final output | | Medium | E (partially) | L
4.2 | Include software versions, libraries, and environment setup | | Low | C (partially) | L

C5: Challenges during data processing
5.1 | Clearly state technical or relevant challenges during preprocessing or analysis | Poor lighting, noisy data, hardware limits, missing values, misclassifications, anomalies, sensor errors, data gaps, misalignment in multi-source datasets (e.g., temporal) | Medium | C, E (partially) | L, S
5.1 | Describe how issues were detected (e.g., thresholds, visual inspection, statistical tests) | | Medium | C, E (partially) | L
5.2 | Document how each challenge was solved, including code snippets or links to GitHub, threshold adjustments or parameter tuning, algorithmic workarounds, the use of external tools (e.g., Colab for scaling up processing), and justifications for exclusions or imputation | | Medium | E (partially) | L, S
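To make the Data Linter guideline (rows 2.1/2.2/2.5 above) concrete, the sketch below shows the kind of automated checks such a tool performs. This is not the tool discussed by Munappy et al. [41]; it is a minimal illustration, and the column names and valid ranges are hypothetical.

```python
# Minimal illustration of automated dataset inspection in the spirit of the
# "Data Linter" guideline. Not the tool from Munappy et al. [41]; column
# names and valid ranges below are hypothetical.
import pandas as pd

def lint_dataset(df: pd.DataFrame, valid_ranges: dict) -> list:
    """Return human-readable warnings about potential data issues."""
    warnings = []
    # Guideline 2.2: report the proportion of missing data per column.
    for col, frac in df.isna().mean().items():
        if frac > 0:
            warnings.append(f"{col}: {frac:.1%} missing values")
    # Guideline 2.5: duplicated records suggest inconsistent merging of sources.
    n_dups = int(df.duplicated().sum())
    if n_dups > 0:
        warnings.append(f"{n_dups} duplicated rows")
    # Guideline 2.1: flag values outside physically plausible ranges.
    for col, (lo, hi) in valid_ranges.items():
        vals = df[col].dropna()
        n_bad = int(((vals < lo) | (vals > hi)).sum())
        if n_bad > 0:
            warnings.append(f"{col}: {n_bad} values outside [{lo}, {hi}]")
    return warnings

# Hypothetical usage on a tiny air quality sample:
sample = pd.DataFrame({"temperature_c": [12.1, None, 95.0],
                       "pm25": [8.0, 9.5, 8.0]})
print(lint_dataset(sample, {"temperature_c": (-60, 60), "pm25": (0, 500)}))
```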
6 Discussion

In this chapter, we discuss the results of our research and address our research questions. We take a closer look at the challenges and recommendations. This chapter also presents a discussion of the contribution of this study to SE. Lastly, we address threats to validity and future work.

6.1 RQ1: How are data selection and data preparation done when applying ML in environmental research?

Based on the findings of the analysis through SPADES-ML and the survey analysis, we concluded that aspects of data quality and management are significantly underreported in ML-based applications in environmental research. Papers primarily focus on providing details about ML algorithms and performance metrics, while neglecting to address data quality and preprocessing. However, for a model to perform well, high-quality data must be acquired, and the quality of the data must be ensured through appropriate processing. Keeping this in mind, we analysed how data were selected and prepared in ML-based environmental research applications.

Data selection

Most papers collected their data from public datasets, pre-existing scientific databases, or other reputable third parties. Researchers also tend to collect their own data to ensure reliability. Most researchers use a combination of data produced by themselves and by other researchers, while some strictly use data produced by themselves. Since data accuracy is essential, finding reliable sources is important for ensuring an accurate model. Approximately 85% of the papers in our literature review mentioned their data selection criteria, which is positive because it gives the reader a good idea of the argumentation for why a dataset was used.

When it came to justifying why a selected dataset is applicable to the given research, around 80% of the papers managed to do so, either fully or partially. While most of the papers provide descriptive reasoning, some applied mathematical or statistical justification to validate the reasoning. Environmental researchers mainly focused on spatial and temporal resolution and sensor reliability when selecting data. We would like to acknowledge that some authors highlighted known biases in certain datasets and accounted for these in their analysis.

Managing data quality

The papers contained few references to the ISO/IEC 25012 data quality attributes [19], despite the researchers' emphasis on this issue in their survey responses. Although data accuracy was highlighted as the main data quality attribute, only 57% of the papers emphasised the need to maintain data accuracy in relation to existing literature. While nearly all the papers focus on model accuracy, the manner in which data accuracy was handled is not properly documented. The reviewed papers mainly highlight the quality attributes of data completeness, currentness, and consistency. While Wang et al. [64] have proposed a multidimensional framework covering intrinsic data quality, and Pradhan et al. [49] provide a structured approach to managing data quality, our analysis found that approximately 69% of the studies address data quality. The survey analysis revealed that, regardless of their roles, all participants generally agree on the importance of data quality.

Based on the information we gathered, environmental researchers tend to eliminate anomalous data to ensure accuracy. This contrasts with the approach to ensuring completeness, in which researchers focus on covering large timeframes and considering multiple variables when acquiring data. To maintain consistency, researchers standardise variables or sometimes perform cross-validation. They ensure credibility by sourcing data from trusted organisations. Many researchers focus on acquiring recent data because environmental conditions can change rapidly, and results depend on how recent the data are.

Data preprocessing

As shown in Munappy et al. [53], [41], ensuring the reproducibility and reliability of ML results begins with transparent and well-documented data pipelines. Although preprocessing steps such as data cleaning, normalisation, and handling missing values were mentioned in the studies we analysed, they often lacked sufficient detail to support reproducibility. Only 30% of the papers provided reproducible preprocessing steps, either by describing them in detail or by providing the codebase, while 70% of the studies mentioned preprocessing, though not in detail. This finding corroborates the observation by Munappy et al. that current ML studies in applied domains often lack the engineering standards necessary for tackling data-related challenges.

The survey findings, however, highlighted the importance of preprocessing and of mentioning preprocessing actions with reasoning. Providing reproducible preprocessing steps was not prioritised in the same way: 58% of respondents said that mentioning preprocessing steps is very important, while only 44% said that providing reproducible preprocessing steps is very important. Nevertheless, providing reproducible preprocessing steps was considered somewhat important, yet this is often not reflected in practice. The discrepancy between the perceived importance and the actual practice suggests the need for methodological changes across the field.
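Guidelines 4.1 and 4.2 in Table 5.9 target exactly this gap. The following is a minimal sketch of what a documented, rerunnable preprocessing step can look like; the variable names, thresholds and reasons are illustrative and not taken from any analysed paper.

```python
# Minimal sketch of guidelines 4.1/4.2: each preprocessing action is logged
# with its reason and effect, and the environment is recorded so the pipeline
# can be rerun from raw input to final output. All names and thresholds are
# illustrative.
import sys
import pandas as pd

steps: list = []

def log_step(action: str, reason: str, before: int, after: int) -> None:
    """Record what was done, why, and how many rows it affected."""
    steps.append(f"{action} ({reason}): {before} -> {after} rows")

def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    n0 = len(raw)
    df = raw.dropna(subset=["pm25"])                       # removal (nulls)
    log_step("drop rows with missing pm25", "target variable required", n0, len(df))
    n1 = len(df)
    df = df[df["pm25"].between(0, 500)].copy()             # outlier filtering
    log_step("drop pm25 outside [0, 500]", "sensor's valid range", n1, len(df))
    df["pm25_z"] = (df["pm25"] - df["pm25"].mean()) / df["pm25"].std()  # scaling
    log_step("z-score pm25", "comparable scale across sites", len(df), len(df))
    return df

clean = preprocess(pd.DataFrame({"pm25": [8.0, None, 12.0, 900.0]}))
print("\n".join(steps))
# Guideline 4.2: record software versions alongside the replication package.
print("python", sys.version.split()[0], "| pandas", pd.__version__)
```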
6.2 RQ2: What are the challenges of data selection and preparation for the application of ML in environmental research?

SPADES-ML was specifically designed to uncover hidden information relating to the challenges of selecting and preparing data for ML-based environmental research. It allowed us to systematically evaluate data pipeline practices across five key areas: data selection and suitability, data quality requirements, adherence to the FAIR principles, data preprocessing, and challenges encountered during preprocessing. The analysis enabled us to detect patterns that would have been difficult to identify through qualitative reading alone. With the help of the literature review and its analysis through SPADES-ML, we were able to gather both qualitative insights and quantitative comparisons across papers. Several challenges were identified through the literature review and the SPADES-ML analysis; these are listed in Table 5.8.

Furthermore, the analysis through SPADES-ML helped us systematically identify data-related methodological issues in ML-based environmental research. These issues include the lack of adherence to the FAIR principles and the minimal provision of information on reproducibility in preprocessing. We found that many papers do not provide access to raw datasets, metadata, or detailed documentation of data preprocessing steps. These issues have not previously been the focus of research, yet they pose a threat to the reusability and credibility of data.

The analysis performed using SPADES-ML was further validated by the survey, in which respondents confirmed the identified findings and flaws in relation to real-world scenarios. Many of the challenges identified through SPADES-ML were confirmed by researchers as genuine obstacles in ML-based environmental science. The approach can be applied in other ML-based domains without limiting its ability to thoroughly identify and evaluate data practices. It can serve as a checklist to help researchers justify their work in their research papers, enhancing the reusability and credibility of their work.

6.3 RQ3: What solutions are reported in SE that could mitigate the challenges of data selection and preparation for the application of ML in environmental research?

In RQ1, we explored current data quality and data management practices in ML-enabled environmental science research. Through RQ2, we identified the challenges one could face during this process. RQ1 and RQ2 were the main focus of our research. For RQ3, we made some initial recommendations to help mitigate the challenges identified in RQ2. These recommendations are presented in Table 5.9. The recommendations follow the SPADES-ML structure because analysing the papers with SPADES-ML made identification straightforward. Some recommendations were identified from the analysis of the survey, from literature in SE, and specifically from the work of Munappy et al. [53], [54], [4], [41], [42], [40]. Throughout Munappy's PhD thesis, many challenges in data pipelines were identified and thoroughly investigated in an industrial setting within the field of embedded systems. However, some of the solutions discussed in the aforementioned papers could not be mapped because they are not relevant to this domain.
The recommendations presented in Table 5.9 are intended as constructive guidance. While we acknowledge the need to validate them, we present these recommendations as a set of guidelines and a methodology pipeline for future environmental science and SE research that utilises ML.

The recommendations primarily focus on addressing the gaps identified through the SPADES-ML analysis and the survey. Researchers highlighted their main concerns in their survey feedback, concerns that are not properly addressed in research papers in practice. The identified gaps (low reproducibility, lack of data quality requirements) therefore mapped naturally onto the proposed SE approaches, and our recommendations focus on filling these gaps. These recommendations reflect concerns raised by environmental researchers and are grounded in widely validated SE methods. They are supported by our findings from the literature review, the SPADES-ML analysis and the survey analysis. Following these guidelines in ML-enabled research has the potential to improve data management practices and reduce the challenges encountered in this process.

6.4 Contribution to Research Software Engineering

Research Software Engineering (RSE) is an emerging discipline which bridges the gap between software engineering and academic research. RSE focuses on applying software engineering principles to research contexts. Research software engineers often work in research environments and address domain-specific challenges such as algorithmic efficiency, data-intensive workflows, and high-performance computing. This role has gained importance as software has become increasingly central across disciplines, from computational biology to climate modelling [8]. The discipline is critical for advancing scientific and technological innovation by creating efficient, reliable and sustainable software solutions tailored to researchers' needs.

Due to the increasing complexity and scale of research data and computational requirements, the importance of RSE has grown significantly in recent years. Research software engineers play a critical role in the design, implementation, and maintenance of research software tools, ensuring that the tools meet the high standards of performance, reliability, and usability required for research applications. Balancing the flexibility and adaptability of software solutions with the rigour and discipline of software engineering practices is a key challenge for RSE. Research projects often make it difficult to adhere to traditional software development processes because of rapidly changing requirements and the need to explore new methodologies. It is also important to validate and discuss the methodologies used in RSE in order to improve the reliability of research projects.

By considering ML-based environmental research as a case study, we explored how data quality and data management are handled and reported. We were able to provide structured guidelines for use in RSE. These guidelines can be used to evaluate the quality of data and the data pipeline across ML projects. The findings highlight the need for reproducibility and transparency for other researchers, as well as how these can be achieved. Our findings also suggest that, even when the code and models are well documented and available for public use, a lack of transparency in the data pipeline reduces opportunities for reproducibility.
Throughout the process, we identified a discrepancy between what was expected and what was practically available. With the challenges and recommendations identified, this work can act as a checklist and guideline for SE researchers to validate their approach when conducting research using ML. They can also identify areas for improvement and invest time in refining the data pipeline. Proper version control systems that ensure the long-term availability of data and the necessary code would encourage contributions from other researchers.

6.5 Threats to validity

This section discusses threats to validity, focusing on how the research questions were answered and how these threats could undermine the reliability and validity of the findings.

Internal validity

Internal validity concerns factors that can affect the causal relations examined in a study without the researchers' knowledge [56]. When conducting the literature review, we were concerned that the data acquired might depend on the reader's point of view. This could result in bias, which we wanted to eliminate from the outset. Therefore, we analysed the first six research papers and discussed them, taking notes as necessary. This was done to maintain a common approach. We agreed on the aspects to be checked and the extent to which they should be checked. To confirm our approach, each of us analysed one paper, which the other then reviewed. We also discussed our findings on a weekly basis. To validate our approach, we asked one of our supervisors to review three of the papers we had analysed.

When analysing our literature review findings using SPADES-ML, we aimed to minimise bias, as providing scores individually can be subjective. Therefore, we analysed all the findings from the research papers together and provided scores jointly. Regarding the design of the survey, the phrasing of the questions could have influenced the responses, since our focus was on validating SPADES-ML. To address this issue, the survey questions were reviewed and validated by two independent researchers, one of whom had an environmental science background. The survey was also pilot tested by the supervisor and refined based on the feedback received. We invited the authors of the papers referred to in our literature review to participate in the survey, as we required their expertise in order to evaluate SPADES-ML from a generic, study-independent perspective. The survey was conducted anonymously to reduce bias and to prevent us from mapping responses to the studies we had referred to. Also, when conducting the thematic analysis, to reduce bias and subjective judgement in the coding and interpretation of themes, we discussed each other's coding and iteratively checked and grouped the codes.

External validity

External validity mainly concerns whether the results can be generalised from the specific environment in which the study was conducted to other environments [56]. For our literature review, we limited our analysis to the papers presented by Konya and Nematzadeh [27], as well as two other papers identified through snowballing. This approach may have overlooked valuable insights presented in other relevant research papers. However, all of the articles were published after 2017, making them recent. Only 23 of the 251 researchers to whom we sent our survey responded. Nevertheless, we covered experts from a few environmental fields, exploring their knowledge and experience as readers, authors and reviewers.
Through forward snowballing, we identified many other environmental researchers, not just the corresponding and other involved authors of the selected research papers. We also sent our survey to environmental researchers who use ML-enabled techniques and who were found on university websites and other relevant websites. Thanks to this, we were able to analyse diverse environmental research data, mitigate limitations, and offer meaningful insights.

Construct validity

Construct validity concerns the generalisation of the experimental results to the concept or theory behind them [56]. To ensure an accurate interpretation of the concepts addressed in SPADES-ML and the survey, we provided the survey participants with definitions and relevant sources to prevent misinterpretation and to help them understand the ideas behind the concepts. We also used the ISO/IEC 25012 data quality framework [19] and the FAIR data principles in our analysis. These were supported by prior literature and by knowledge among the participants. We kept these aspects in mind when designing the survey, which helped us to reach a shared understanding of the areas that needed clarification. The survey consisted of five main sections, three of which contained repeated questions designed to capture the perspectives of readers, authors and reviewers. Every term was explained repeatedly in all sections, with extra resources provided where necessary. We ensured proper interpretation and understandability by reviewing and validating the survey with two independent researchers before sending it to the participants.

Conclusion validity

Conclusion validity refers to threats to the validity of conclusions derived from analyses. Sample size is one consideration here. Since we only referred to 26 papers for our literature review, there were situations during the case study in which we had to accept a minimum of three data points for statistical analysis and visualisation. Similarly, we received only 23 responses to our survey, despite sending it to 251 practitioners. Due to the small sample size, the number of usable data points was limited, which may affect reliability. For comparative purposes, we had to accept the minimum threshold of three data points, which is below the standard of five points required for box plots. To ensure transparency, we added warnings to the figures that fall below this threshold. We used matplotlib and numpy to generate the box plots; seaborn, pandas, and matplotlib were used to generate the dot plot; and the stacked bar chart was generated using pandas and matplotlib. As we only included open-ended follow-up questions to allow for further comments on the aspects we checked with closed-ended questions, the overall number of comments was low. Consequently, we were only able to gain a few insights from the thematic analysis, so we had to rely more on statistical analysis to interpret the results in detail. Nevertheless, we were able to draw some interesting conclusions from the details we gathered regarding the three roles.
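The warning convention mentioned above can be implemented directly in the plotting code. The following is a minimal sketch with matplotlib and numpy; the data values are made up and the annotation style is ours, not the thesis's actual plotting script.

```python
# Minimal sketch: draw box plots but flag any group with fewer than five
# data points, mirroring the transparency warnings described above. The
# data values are illustrative.
import numpy as np
import matplotlib.pyplot as plt

groups = {"Criterion A": np.array([0.7, 0.8, 0.9]),            # n = 3
          "Criterion B": np.array([0.5, 0.6, 0.6, 0.7, 0.9])}  # n = 5

fig, ax = plt.subplots()
ax.boxplot(list(groups.values()))
ax.set_xticks(range(1, len(groups) + 1), labels=list(groups))
for i, vals in enumerate(groups.values(), start=1):
    if len(vals) < 5:  # below the usual minimum for a meaningful box plot
        ax.annotate(f"warning: n={len(vals)} < 5", (i, float(vals.max())),
                    textcoords="offset points", xytext=(0, 8),
                    ha="center", color="red")
plt.show()
```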
To increase the reliability of the findings, we individually read and discussed the first six papers in order to find a common interpretation. After agreeing on a set of common terms, we continued to extract information from the other research papers. We also performed the analysis using SPADES-ML in pairs to increase reliability. This approach enabled us to reduce bias. Information extraction was carried out carefully to minimise miscalculations, so that anyone using the same methods to derive the results would ultimately reach the same conclusions. As we used a systematic approach to allocate points as either total or partial in SPADES-ML, the results can be reproduced.

6.6 Future work

Future work on this study will include validating and extending the proposed findings and recommendations. We have provided recommendations from an SE perspective, considering ML-based environmental research in terms of data selection, quality, and management. The proposed guidelines need to be validated in practical scenarios to check their acceptability in a real-world research context, and their effectiveness must be measured against how practical they are in actual use. We can collaborate with environmental researchers specialising in ML or SE to validate and analyse the guidelines. From this, we will be able to see what, in practice, prevents researchers from adhering to these recommendations from their point of view. They may require additional resources or training in certain areas, and through proper collaboration we will be able to identify what should change in real-world scenarios. We have adopted a generalised approach to providing recommendations that considers all environmental fields; these recommendations could be improved further by tailoring the findings to specific environmental fields.

7 Conclusion

This research aimed to investigate how data are selected and prepared for use in ML applications within environmental research, and to identify key challenges and potential solutions to improve data quality and management. Building on the literature review conducted by Konya and Nematzadeh [27], we adopted a data quality and management perspective and used SPADES-ML to analyse the findings, which were then validated through a survey involving environmental science experts. From our analysis of 26 research papers, as well as from the survey findings, we observed that current ML-based research in the environmental field neglects some critical aspects of data quality, preprocessing, and the provision of reproducible solutions. The survey highlighted that experts consider these aspects to be important, but when a paper is actually published, they are not properly reported. While most researchers describe data sources and mention quality attributes, they provide limited justification for preprocessing steps and the challenges encountered. Furthermore, only a small number of papers provided details regarding reproducibility and ensured that their findings aligned with the FAIR principles.

Through the survey, we were able to validate the importance of the criteria we defined in SPADES-ML, with input from environmental researchers who actively incorporate ML into their work. There was a clear consensus among them about the importance of justifiable data selection, maintaining data quality, adhering to the FAIR principles and providing reproducible preprocessing details. Some participants also highlighted data accessibility issues and other challenges relating to missing and incomplete data in their area of research. To address the challenges identified, we also examined the solutions proposed by Munappy et al. [54], [4], [30], [42], which allowed us to gain deeper insight into SE techniques for identifying proper data pipelines and to ground our recommendations in them.
We have proposed a set of actionable recommendations that take into account SE practices and the specific needs and concerns raised by environmental researchers. However, these recommendations need to be validated in real-world projects by assessing their impact on model performance and reproducibility. The recommendations can be refined through ongoing feedback from environmental researchers and validated as they are implemented. By incorporating good data practices into ML-enabled environmental research, we can develop more reliable solutions to address these challenges.

Declaration of Generative AI

Generative AI tools were used to check for grammatical errors and to polish the language. They were not used to generate content.

Bibliography

[1] Data quality and artificial intelligence – mitigating bias and error to protect fundamental rights. In Helping to make fundamental rights a reality for everyone in the European Union, 2019.
[2] Philip Achimugu, Ali Selamat, Roliana Ibrahim, and Mohd Naz'ri Mahrin. A systematic literature review of software requirements prioritization research. Inf. Softw. Technol., 56(6):568–585, June 2014.
[3] Cathaoir Agnew, Dishant Mewada, Eoin M. Grua, Ciarán Eising, Patrick Denny, Mark Heffernan, Ken Tierney, Pepijn Van de Ven, and Anthony Scanlan. Detecting the overfilled status of domestic and commercial bins using computer vision. Intelligent Systems with Applications, 18:200229, 2023.
[4] M Aiswarya Raj, Jan Bosch, Helena Holmström Olsson, and Anders Jansson. On the impact of ML use cases on industrial data pipelines. In 2021 28th Asia-Pacific Software Engineering Conference (APSEC), pages 463–472, 2021.
[5] Wael Almikaeel, Lea Čubanová, and Andrej Šoltész. Hydrological drought forecasting using machine learning: Gidra River case study. Water, 14(3), 2022.
[6] Wael Almikaeel, Andrej Šoltész, Lea Čubanová, and Dana Baroková. Hydroinformer: A deep learning model for accurate water level and flood predictions, 07 2024.
[7] Samuel da Costa Alves Basílio, Camila Martins Saporetti, Zaher Mundher Yaseen, and Leonardo Goliatt. Global horizontal irradiance modeling from environmental inputs using machine learning with automatic model selection. Environ. Dev., 44(100766):100766, December 2022.
[8] Susan M Baxter, Steven W Day, Jacquelyn S Fetrow, and Stephanie J Reisinger. Scientific software development is not an oxymoron. PLOS Computational Biology, 2(9):1–4, 09 2006.
[9] Federico Bonofiglio, Fabio C. De Leo, Connor Yee, Damianos Chatzievangelou, Jacopo Aguzzi, and Simone Marini. Machine learning applied to big data from marine cabled observatories: A case study of sablefish monitoring in the NE Pacific. Frontiers in Marine Science, 9, 2022.
[10] Virginia Braun and Victoria Clarke. Using thematic analysis in psychology. Qualitative Research in Psychology, 3:77–101, 01 2006.
[11] Juan Manuel Carrillo de Gea, Joaquín Nicolás Ros, José Fernández-Alemán, Ambrosio Toval, and Christof Ebert. Requirements engineering tools. Software, IEEE, 28:86–91, 09 2011.
[12] L. Cattaneo, A. Polenghi, M. Macchi, and V. Pesenti. On the role of data quality in AI-based prognostics and health management. IFAC-PapersOnLine, 55(19):61–66, 2022. 5th IFAC Workshop on Advanced Maintenance Engineering, Services and Technologies AMEST 2022.
[13] H Chojer, P T B S Branco, F G Martins, M C M Alvim-Ferraz, and S I V Sousa. Can data reliability of low-cost sensor devices for indoor air particulate matter monitoring be improved? – An approach using machine learning. Atmos. Environ. (1994), 286(119251):119251, October 2022.
[14] C Condemi, D Casillas-Pérez, L Mastroeni, S Jiménez-Fernández, and S Salcedo-Sanz. Hydro-power production capacity prediction based on machine learning regression techniques. Knowl. Based Syst., 222(107012):107012, June 2021.
[15] Saverio De Vito, Girolamo Di Francia, Elena Esposito, Sergio Ferlito, Fabrizio Formisano, and Ettore Massera. Adaptive machine learning strategies for network calibration of IoT smart air quality monitoring devices. Pattern Recognit. Lett., 136:264–271, August 2020.
[16] Steve Easterbrook. Climate change: A grand software challenge. pages 99–104, 11 2010.
[17] Alexandre Fabregat, Anton Vernet, Marc Vernet, Lluís Vázquez, and Josep A. Ferré. Using machine learning to estimate the impact of different modes of transport and traffic restriction strategies on urban air quality. Urban Climate, 45:101284, 2022.
[18] Elham Fijani, Rahim Barzegar, Ravinesh Deo, Evangelos Tziritis, and Konstantinos Skordas. Design and implementation of a hybrid model based on two-layer decomposition method coupled with extreme learning machines to support real-time environmental monitoring of water quality parameters. Science of The Total Environment, 648:839–853, 2019.
[19] International Organization for Standardization and International Electrotechnical Commission. ISO/IEC 25012: Software Engineering: Software Product Quality Requirements and Evaluation (SQuaRE): Data Quality Model. ISO/IEC, 2008.
[20] Cristina Gouveia, Alexandra Fonseca, António Câmara, and Francisco Ferreira. Promoting the use of environmental data collected by concerned citizens through information and communication technologies. Journal of Environmental Management, 71(2):135–154, 2004.
[21] Corinna Gries, Mark Servilla, Margaret O'Brien, Kristin Vanderbilt, Colin Smith, Duane Costa, and Susanne Grossman-Clarke. Achieving FAIR data principles at the Environmental Data Initiative, the US-LTER data repository. Biodiversity Information Science and Standards, 3, 06 2019.
[22] Ivan Henderson V. Gue, Neil Stephen A. Lopez, Anthony S.F. Chiu, Aristotle T. Ubando, and Raymond R. Tan. Predicting waste management system performance from city and country attributes. Journal of Cleaner Production, 366:132951, 2022.
[23] M. Hino, E. Benami, and Nina Brooks. Machine learning for environmental monitoring. Nature Sustainability, 1, 10 2018.
[24] Irum Inayat, Siti Salwah Salim, Sabrina Marczak, Maya Daneva, and Shahaboddin Shamshirband. A systematic literature review on agile requirements engineering practices and challenges. Comput. Human Behav., 51:915–929, October 2015.
[25] Loso Judijanto, Donny Priyangan, Hanifah Muthmainah, and I Jata. The influence of data quality and machine learning algorithms on AI prediction performance in business analysis in Indonesia. The Eastasouth Journal of Information System and Computer Science, 1:75–86, 12 2023.
[26] Barbara Kitchenham and Pearl Brereton. A systematic review of systematic review process research in software engineering. Inf. Softw. Technol., 55(12):2049–2075, December 2013.
[27] Aniko Konya and Peyman Nematzadeh. Recent applications of AI to environmental disciplines: A review. The Science of The Total Environment, 906:167705, 01 2024.
[28] Larry Lannom, Dimitris Koureas, and Alex R. Hardisty. FAIR data and services in biodiversity science and geoscience. Data Intelligence, 2(1-2):122–130, 01 2020.
[29] Sabina Leonelli and Niccolò Tempini. Data Journeys in the Sciences. Springer, 07 2020.
[30] Aiswarya M, Jan Bosch, and Helena Olsson. Maturity assessment model for industrial data pipelines. pages 503–513, 12 2023.
[31] Kilkenny M F and Robinson K M. Data quality: "garbage in – garbage out", May 2018.
[32] Tharsanee Maganathan, Soundariya Senthilkumar, and Vishnupriya Balakrishnan. Machine learning and data analytics for environmental science: A review, prospects and challenges. IOP Conference Series: Materials Science and Engineering, 955(1):012107, November 2020.
[33] Louis-Gabriel Maltais and Louis Gosselin. Energy management of domestic hot water systems with model predictive control and demand forecast based on machine learning. Energy Conversion and Management: X, 15(100254):100254, August 2022.
[34] Massimiliano Manfren, Patrick AB. James, and Lamberto Tronchin. Data-driven building energy modelling – an analysis of the potential for generalisation through interpretable machine learning. Renewable and Sustainable Energy Reviews, 167:112686, 2022.
[35] Xiao-Li Meng. Enhancing (publications on) data quality: Deeper data minding and fuller data confession. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184, 10 2021.
[36] Violeta Migallón, Francisco J. Navarro-González, Héctor Penadés, José Penadés, and Yolanda Villacampa. A parallel methodology using radial basis functions versus machine learning approaches applied to environmental modelling. Journal of Computational Science, 63:101817, 2022.
[37] Clayton Miller, Bianca Picchetti, Chun Fu, and Jovan Pantelic. Limitations of machine learning for building energy prediction: ASHRAE Great Energy Predictor III Kaggle competition error analysis. Sci. Technol. Built Environ., 28(5):610–627, May 2022.
[38] Michelle E Miro, David Groves, Bob Tincher, James Syme, Stephanie Tanverakul, and David Catt. Adaptive water management in the face of uncertainty: Integrating machine learning, groundwater modeling and robust decision making. Clim. Risk Manag., 34(100383):100383, 2021.
[39] Paula Moral, Álvaro García-Martín, Marcos Escudero-Viñolo, José M. Martínez, Jesús Bescós, Jesús Peñuela, Juan Carlos Martínez, and Gonzalo Alvis. Towards automatic waste containers management in cities via computer vision: containers localization and geo-positioning in city maps. Waste Management, 152:59–68, 2022.
[40] Aiswarya Munappy, Jan Bosch, Helena Holmström Olsson, Anders Arpteg, and Björn Brinne. Data management challenges for deep learning. In 2019 45th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE, August 2019.
[41] Aiswarya Raj Munappy, Jan Bosch, Helena Holmström Olsson, Anders Arpteg, and Björn Brinne. Data management for production quality deep learning models: Challenges and solutions. Journal of Systems and Software, 191:111359, 2022.
[42] Aiswarya Raj Munappy, David Issa Mattos, Jan Bosch, Helena Holmström Olsson, and Anas Dakkak. From ad-hoc data analytics to DataOps. In Proceedings of the International Conference on Software and System Processes, ICSSP '20, pages 165–174, New York, NY, USA, 2020. Association for Computing Machinery.
[43] Carolina Natel de Moura, Jan Seibert, Miriam Rita Moro Mine, and Ricardo Carvalho de Almeida. Are machine learning methods robust enough for hydrological modeling under changing conditions? EGU General Assembly 2020, 2019.
[44] Ngoc-Thanh Nguyen, Keila Lima, Astrid Marie Skålvik, Rogardt Heldal, Eric Knauss, Tosin Daniel Oyetoyan, Patrizio Pelliccione, and Camilla Sætre. Synthesized data quality requirements and roadmap for improving reusability of in-situ marine data. In 2023 IEEE 31st International Requirements Engineering Conference (RE), pages 65–76, 2023.
[45] J Jake Nichol, Matthew G Peterson, Kara J Peterson, G Matthew Fricke, and Melanie E Moses. Machine learning feature analysis illuminates disparity between E3SM climate models and observed climate change. J. Comput. Appl. Math., 395(113451):113451, October 2021.
[46] Salomon Obahoundje, Arona Diedhiou, Komlavi Akpoti, Kouakou Kouassi, Eric Ofosu, and Didier Kouame. Predicting climate-driven changes in reservoir inflows and hydropower in Côte d'Ivoire using machine learning modeling. Energy, 302:131849, 05 2024.
[47] Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, and Alex Hanna. Data and its (dis)contents: A survey of dataset development and use in machine learning research. Patterns, 2(11):100336, November 2021.
[48] Daniel Vazquez Pombo, Oliver Gehrke, and Henrik W Bindner. SOLETE, a 15-month long holistic dataset including: meteorology, co-located wind and solar PV power from Denmark with various resolutions. Data in Brief, 42, 2022.
[49] Shameer Pradhan, Hans-Martin Heyn, and Eric Knauss. Identifying and managing data quality requirements: a design science study in the field of automated driving. Software Quality Journal, 32:1–48, 05 2023.
[50] Maria Priestley, Fionntán O'Donnell, and Elena Simperl. A survey of data quality requirements that matter in ML development pipelines. J. Data and Information Quality, 15(2), June 2023.
[51] Martí Puig and Rosa Mari Darbra. Innovations and insights in environmental monitoring and assessment in port areas. Current Opinion in Environmental Sustainability, 70:101472, 2024.
[52] Tharsanee R M, Soundariya Senthilkumar, and Vishnupriya Balakrishnan. Machine learning and data analytics for environmental science: A review, prospects and challenges. IOP Conference Series: Materials Science and Engineering, 955:012107, 11 2020.
[53] Aiswarya Raj, Jan Bosch, and Helena Olsson. Data Pipeline Management in Practice: Challenges and Opportunities, pages 168–184. Springer, 11 2020.
[54] Aiswarya Raj, Jan Bosch, Helena Holmström Olsson, and Tian J. Wang. Modelling data pipelines. In 2020 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pages 13–20, 2020.
[55] Lucas Ramos, Marilaine Colnago, and Wallace Casaca. Data-driven analysis and machine learning for energy prediction in distributed photovoltaic generation plants: A case study in Queensland, Australia. Energy Rep., 8:745–751, April 2022.
[56] Per Runeson and Martin Höst. Guidelines for conducting and reporting case study research in software engineering. Empirical Softw. Engg., 14(2):131–164, April 2009.
[57] Daniel Saboe, Hamidreza Ghasemi, Ming Ming Gao, Mirjana Samardzic, Kiril D. Hristovski, Dragan Boscovic, Scott R. Burge, Russell G. Burge, and David A. Hoffman. Real-time monitoring and prediction of water quality parameters and algae concentrations using microbial potentiometric sensor signals and machine learning tools. Science of The Total Environment, 764:142876, 2021.
The METRIC-framework for assessing data quality for trustworthy AI in medicine: a systematic review. npj Digital Medicine, 7, August 2024.
[59] Arumoy Shome, Luís Cruz, and Arie van Deursen. Data smells in public datasets. In Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI, CAIN ’22, pages 205–216, New York, NY, USA, 2022. Association for Computing Machinery.
[60] Fabian Sittaro, Christopher Hutengs, and Michael Vohland. Which factors determine the invasion of plant species? Machine learning based habitat modelling integrating environmental factors and climate scenarios. Int. J. Appl. Earth Obs. Geoinf., 116(103158):103158, February 2023.
[61] Amrish Solanki. Advancements in artificial intelligence: A comprehensive review and future prospects, April 2024.
[62] Devis Tuia, Benjamin Kellenberger, Sara Beery, Blair R. Costelloe, Silvia Zuffi, Benjamin Risse, Alexander Mathis, Mackenzie W. Mathis, Frank van Langevelde, Tilo Burghardt, Roland Kays, Holger Klinck, Martin Wikelski, Iain D. Couzin, Grant van Horn, Margaret C. Crofoot, Charles V. Stewart, and Tanya Berger-Wolf. Perspectives in machine learning for wildlife conservation. Nat. Commun., 13(1):792, February 2022.
[63] Costas A. Velis, David C. Wilson, Yoni Gavish, Sue M. Grimes, and Andrew Whiteman. Socio-economic development drives solid waste management performance in cities: A global analysis using machine learning. Science of The Total Environment, 872:161913, 2023.
[64] Richard Y. Wang and Diane M. Strong. Beyond accuracy: what data quality means to data consumers. J. Manage. Inf. Syst., 12(4):5–33, March 1996.
[65] Mark Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gaby Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Olavo Bonino da Silva Santos, Philip Bourne, Jildau Bouwman, Anthony Brookes, Tim Clark, Merce Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris Evelo, Richard Finkers, and Barend Mons. The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3, March 2016.
[66] Huanyu Zhou, Yingning Qiu, Yanhui Feng, and Jing Liu. Power prediction of wind turbine in the wake using hybrid physical process and machine learning models. Renewable Energy, 198:568–586, 2022.
[67] Huanyu Zhou, Yingning Qiu, Yanhui Feng, and Jing Liu. Power prediction of wind turbine in the wake using hybrid physical process and machine learning models. Renew. Energy, 198:568–586, October 2022.
[68] Jun-Jie Zhu, Meiqi Yang, and Zhiyong Jason Ren. Machine learning in environmental research: Common pitfalls and best practices. Environmental Science & Technology, 57(46):17671–17689, 2023. PMID: 37384597.

A Survey form to validate the point system

This appendix contains the SPADES-ML validation survey form, reproduced in the following figures.

Figure A.1: Section 1(a) of the survey
Figure A.2: Section 1(b) of the survey
Figure A.3: Section 2 of the survey
Figure A.4: Section 3(a) of the survey
Figure A.5: Section 3(b) of the survey
Figure A.6: Section 3(c) of the survey
Figure A.7: Section 3(d) of the survey
Figure A.8: Section 4(a) of the survey
Figure A.9: Section 4(b) of the survey
Figure A.10: Section 4(c) of the survey
Figure A.11: Section 4(d) of the survey
Figure A.12: Section 5(a) of the survey
Figure A.13: Section 5(b) of the survey
Figure A.14: Section 5(c) of the survey
Figure A.15: Section 5(d) of the survey
Figure A.16: Section 6 of the survey

B Thematic analysis

This appendix presents the results of the thematic analysis performed on the qualitative data. All the quotes under each theme are listed here.

Theme 1: Transparency and justification of data selection

Reader perspective
R1: “Remember that ‘Monitoring data’ always needs to meet criteria defined by the agencies / EU. Therefore, the selection of monitoring data is not a big issue because they already (mostly) meet quality criteria.”
R2: “The purpose and samling design should be presented.”
R3: “Very important to select/use data that is acceptable and reliable since some data may misguide the prediction tool”
R7: “It’s very important as it affects the model accuracy, and also whether your model can accurately capture what’s relevant through the noisy data. However, this is also a source of bias. The reader should be made aware of this source of bias.”
R9: “Data availability is often the only factor for selecting a data source due to limited availability.”

Author perspective
A3: “Important to ensure model accuracy, improves model interpretability, supports reproducibility”
A6: “Sometimes, data availability is the problem, so we don’t have the luxury of choosing”

Reviewer perspective
V3: “Clear data selection criteria, relevant with the problem being solved. Important to check target population, format, quality, and completeness.”
V7: “As a reviewer, I’m always looking for an answer to ’why’. If data selection is anything but random, I look for possible sources of biases and what was done to mitigate them or how the authors acknowledge it later on.”

Theme 2: Data quality and reliability

Reader perspective
R1: “Why is not precision mentioned? That is an important criterion.”
R2: “Clear definition of units and standardisation is of importance.”
R6: “The data should be recorded by a credible source, the uncertainty of any type while collecting the data should be provided. Removing outliers should be done by experts not automated. Consistent measurement is crucial”
R6: “It depends on the purpose of the research, however, the correctness of the data is more important than the quantity”
R9: “In waste management, often the most credible data (e.g. from national governments) is of the worst quality as they have no means to measure it at a national level. Data accuracy and consistency are the most important factors, especially when using ML, as gaps can be filled, but bad data makes this tough to do.”

Author perspective
A3: “High-quality data leads to more accurate predictions and saves time and resources. Very important to check some data that has incomplete information, which may lead to inaccurate models. Can be overcome by removal, or using algorithms that handle missing values”
A6: “Bad data, bad results.”

Reviewer perspective
V3: “Important to see high-quality data, such as accuracy, completeness, consistency”
V7: “As ML models can deal with some amount of noise, these details are not that relevant, but it’s nice to include anyway”
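The checks that recur under this theme (completeness, consistency, and outlier handling reviewed by experts rather than fully automated) can be made concrete in a few lines of analysis code. The following Python sketch is illustrative only and is not part of the survey material; the file name, column names, and plausibility range are hypothetical:

    import pandas as pd

    # Hypothetical hourly sensor readings with a timestamp and a temperature column.
    df = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])

    # Completeness: fraction of non-missing values per column.
    print(df.notna().mean())

    # Consistency: duplicate timestamps indicate multiple records of the same measurement.
    duplicates = df[df["timestamp"].duplicated(keep=False)]

    # Plausibility: flag values outside a domain-defined range for expert review,
    # rather than deleting them automatically (cf. R6 above).
    flagged = df[(df["temperature_c"] < -5) | (df["temperature_c"] > 40)]
    flagged.to_csv("flagged_for_expert_review.csv", index=False)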
Theme 3: Data accessibility and constraints

Reader perspective
R3: “Many real-time application does not have data accessibility”
R5: “Without we cannot validate published methods/results independently”
R6: “It’s one of the most important steps for the credibility of the research. For the research to be conducted correctly, the same results should be obtained by any researcher who is following the same steps as in the paper, thus without data, that cannot be confirmed”
R7: “It really depends on the application. If I’m continuing the research, or testing a different model or applying different techniques, then I can work with the processed data, but otherwise, I would need the raw data. In my opinion, raw data along with the script(s) to process the data is sufficient. Tools such as Large Language Models make working with others’ code easier.”
R9: “more for metadata than data accessibility - The method is usually the most important thing to me as the manner in which a data point was measured impacts how we can interpret the data, Unfortunately this is rarely provided in my field. For data accessibility, both are very important as you often want to go beyond what analysis the paper reported.”

Author perspective
A3: “Easy access to data allows to quickly explore, preprocess, and train models without delays”
A6: “Data is the only way to verify the creditability of the research”
A9: “I work a lot with countries in the Global South. Data formats have to be readily accessible otherwise they will not be able to access (e.g. if expensive / technical software is required)”

Reviewer perspective
V3: “Data accessibility is important and easily accessible if possible”
V5: “Usually, I have no time to review a paper to the level of analysis replication”
V5: “If I need to re-use the data (gaining access) then is not for revision but for additional analyses plus new data”
V7: “This is quite an idealistic perspective. I don’t personally believe that all data needs to be accessible at the time of review, but it is needed for reproducibility. I deem access very important, only to ensure reproducibility.”

Theme 4: Transparency in pre-processing

Reader perspective
R3: “Very important when used for predicting models”
R4: “It would be interesting, if possible, to include also raw data in order to allow another scientist to provide additional or alternative methods for data preprocessing.”
R6: “Sometimes, data preprocessing is more important than the modeling step itself. It is crucial to explain the process”
R7: “It’s very critical and can either help train accurate models or make the model completely miss the target.”
R9: “Its important to provide these reproducible steps / code, but I find contacting the authors and starting a dialogue is often the best way to truly understand what they’ve done.”

Author perspective
A3: “Very important in cleaning, transforming, and organizing raw data into a format that can be used by ML models”
A3: “Very important to check some data that has incomplete information, which may lead to inaccurate models. Can be overcome by removal, or using algorithms that handle missing values”
A9: “Much of the data in waste management is poor and we need to provide justification as to why these incorrect data points should be removed from the analysis”

Reviewer perspective
V3: “Important to make sure the raw data is clean and usable format”

Theme 5: Challenges in data processing

Reader perspective
R3: “Incomplete datasets with missing values can skew the model’s understanding, data might contain errors, outliers, or irrelevant information, ML models can’t directly handle non-numeric data”

Author perspective
A3: “Some data might have incomplete information, irrelevant, incorrect, or random data points, or multiple records of the same data”
A7: “There are way too many challenges to be document everything. It could be useful to include, maybe as comments in code, but not sure how else one would include these.”

Reviewer perspective
V3: “Some issues should be mentioned e.g. missing or inconsistent data, noise, high dimensionality, class imbalance, and difficulties in integrating data from multiple sources, etc.”
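Several quotes under Themes 4 and 5 name the same concrete preprocessing operations: removing or imputing missing values, dropping duplicate records, and encoding non-numeric data. A minimal sketch of such a step is given below for illustration; it is not taken from any of the reviewed papers, and the file and column names are hypothetical:

    import pandas as pd

    df = pd.read_csv("raw_environmental_data.csv")

    # Missing values, option 1: removal (A3: "can be overcome by removal").
    df_removed = df.dropna()

    # Missing values, option 2: imputation, here linear interpolation of the
    # numeric columns, suitable for time-ordered sensor data.
    numeric_cols = df.select_dtypes(include="number").columns
    df_imputed = df.copy()
    df_imputed[numeric_cols] = df_imputed[numeric_cols].interpolate(method="linear")

    # Multiple records of the same data (A3, Theme 5): drop exact duplicates.
    df_imputed = df_imputed.drop_duplicates()

    # "ML models can't directly handle non-numeric data" (R3, Theme 5):
    # one-hot encode a hypothetical categorical column.
    df_encoded = pd.get_dummies(df_imputed, columns=["land_use"])

Publishing such a script alongside the raw data is one way to meet R7’s criterion under Theme 3 that raw data plus the processing script(s) suffices for reproducibility.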
Theme 6: Institutional standards and practices

Reader perspective
R1: “Remember that ’Monitoring data’ always need to meet criteria defined by the agencies / EU. Therefore the selection of monitoring data is not a big issue because they already (mostly) meet quality criteria.”
R5: “Embargo timing could be needed to protect ongoing publishing actions and MSc-PhD Theses”

Author perspective
A4: “Scientific evaluation committees must take into consideration the value of publishing datasets. Otherwise, the scientific community won’t publish them.”

C Referred table from SLR that guided our literature review

This appendix presents Table 2 of the SLR by Konya and Nematzadeh [27], which contains details of the papers referred to in our literature review.

Table C.1: AI applications in environmental disciplines, based on Table 2 of Konya and Nematzadeh [27]. Each entry is headed by the targeted environmental field and the publication, followed by the AI tools applied, the performance metrics used, the processing time (n.a. where not reported), and the best-performing approach.

Air quality monitoring (De Vito et al., 2020 [15])
AI tools: Shallow Neural Networks (SNN), Extreme Learning Machine (ELM)
Performance metrics: mean absolute error (MAE), mean relative error, normalized MAE, root mean square error (RMSE), nRMSE
Processing time: 4 weeks calibration set, 4 weeks offline training
Best performance: Shallow Neural Networks (SNN)

Air quality prediction (Fabregat et al., 2022 [17])
AI tools: Multilayer Perceptron Regressor (MLPR)
Performance metrics: correlation coefficient (R2), factor of two of observations (FAC2), geometric mean bias (GeoMean), geometric standard deviation (GeoSTD), RMSE, mean bias
Processing time: 2 h training on CPU (Central Processing Unit)
Best performance: n.a.

Indoor air particulate matter (PM) monitoring (Chojer et al., 2022 [13])
AI tools: Multiple Linear Regression (MLR), Support Vector Machine (SVM), Gradient Boosting Regression (GBR), Extreme Gradient Boosting (XGB)
Performance metrics: R2, RMSE, mean bias error (MBE)
Processing time: n.a.
Best performance: Support Vector Machine (SVM)

Real-time monitoring of water quality (Fijani et al., 2019 [18])
AI tools: Least Square Support Vector Machine, complete ensemble empirical mode decomposition algorithm with adaptive noise (CEEMDAN), Variational Mode Decomposition (VMD), Extreme Learning Machine, and their combination
Performance metrics: R2, RMSE, MAE, normalized RMSE, normalized MAE, bias error
Processing time: n.a.
Best performance: VMD-CEEMDAN-ELM
Real-time monitoring and prediction of water quality (Saboe et al., 2021 [57])
AI tools: Long short-term memory (LSTM) neural network for multivariate single-step and multivariate multi-step time series forecasting
Performance metrics: RMSE, normalized RMSE
Processing time: n.a.
Best performance: n.a.

Ocean monitoring: animal tracking (Bonofiglio et al., 2022 [9])
AI tools: YOLOv5 (deep learning)
Performance metrics: average precision (AP)
Processing time: reduced video processing time from 20 min to 3 min
Best performance: n.a.

Energy prediction (Ramos et al., 2022 [55])
AI tools: Random Forest, Support Vector Machine, Gradient Tree Boosting
Performance metrics: MAE, RMSE, mean absolute percentage error (MAPE), prediction accuracy
Processing time: n.a.
Best performance: Gradient Tree Boosting

Solar photovoltaic (PV) power forecasting (Pombo et al., 2022 [48])
AI tools: Random Forest, Support Vector Machine, Artificial Neural Networks
Performance metrics: MAE, RMSE, residual sum of squares (RSS), R2
Processing time: n.a.
Best performance: Random Forest

Power prediction of wind turbines (Zhou et al., 2022 [67])
AI tools: Deep Neural Network, Transfer Learning
Performance metrics: RMSE, MAE, MAPE, R2
Processing time: n.a.
Best performance: Physical Guided Neural Network (PGNN)

Hydro-power production capacity prediction (Condemi et al., 2021 [14])
AI tools: Multilayer perceptron, Extreme Learning Machine, Support Vector Machine
Performance metrics: RMSE, MAE, correlation
Processing time: n.a.
Best performance: Support Vector Machine

Building energy modeling (Manfren et al., 2022 [34])
AI tools: Machine Learning-based regression model
Performance metrics: R2, MAPE, normalized mean bias error (NMBE)
Processing time: n.a.
Best performance: n.a.

Building energy prediction (Miller et al., 2022 [37])
AI tools: Neural Networks, XGBoost
Performance metrics: RMSE, MBE
Processing time: n.a.
Best performance: n.a.

Environmental modeling (Migallón et al., 2022 [36])
AI tools: Multiple Linear Regression, K-nearest neighbor, Artificial Neural Networks, Support Vector Machine
Performance metrics: relative error, R2, MSE, RMSE, MAE, MAPE
Processing time: n.a.
Best performance: Support Vector Machine

Hydrological modeling (Natel de Moura et al., 2020 [43])
AI tools: Long short-term memory, transfer learning
Performance metrics: Nash–Sutcliffe efficiency (NSE)
Processing time: n.a.
Best performance: Long short-term memory

Solar radiation prediction (Da Costa Alves Basílio et al., 2022 [7])
AI tools: Multivariate Adaptive Regression Spline, Support Vector Machine, XGBoost, polynomial Ridge Regression
Performance metrics: R2, RMSE, MAE, MAPE, mean absolute deviation (MAD), uncertainty
Processing time: n.a.
Best performance: all of them

Climate models (Nichol et al., 2021 [45])
AI tools: Random Forest
Performance metrics: R2, MAE, average test anomaly correlation coefficient (ACC)
Processing time: n.a.
Best performance: n.a.

Invasive plant species (Sittaro et al., 2023 [60])
AI tools: Support Vector Machine, Boosted Regression Trees
Performance metrics: area under the curve (AUC), RMSE
Processing time: n.a.
Best performance: Boosted Regression Trees

Wildlife conservation (Tuia et al., 2022 [62])
AI tools: Computer Vision, Bayesian estimation, Decision Tree, Random Forest, Support Vector Machine; deep learning: Artificial Neural Network, Convolutional Neural Network, Vision Transformers, Long short-term memory, Gated Recurrent Unit
Performance metrics: for machine learning: n.a.; for deep learning: accuracy, recall, precision
Processing time: n.a.
Best performance: deep learning algorithms: Convolutional Neural Network, Vision Transformers, Long short-term memory, Gated Recurrent Unit

Nature conservation (Moran et al., 2017 [34])
AI tools: Decision Tree Classifier, Extra Tree Classifier
Performance metrics: accuracy, precision, recall, F-score
Processing time: n.a.
Best performance: Extra Tree Classifier

Energy management on domestic hot water (Maltais and Gosselin, 2022 [33])
AI tools: Artificial Neural Networks
Performance metrics: R2
Processing time: n.a.
Best performance: n.a.

Water management, groundwater modeling (Miro et al., 2021 [38])
AI tools: Random Forest, Support Vector Machine, Artificial Neural Networks
Performance metrics: n.a.
Processing time: n.a.
Best performance: Random Forest
Hydrological drought forecasting (Almikaeel et al., 2022 [5])
AI tools: deep learning: Artificial Neural Networks; machine learning: Support Vector Machine
Performance metrics: confusion matrix
Processing time: n.a.
Best performance: all of them

Waste management (Velis et al., 2023 [63])
AI tools: Conditional Random Forest, Univariate Non-linear Regression
Performance metrics: RMSE, symmetric mean absolute percentage error (SMAPE), Akaike information criteria corrected (AICc)
Processing time: n.a.
Best performance: all of them provided different insights

Waste management (Gue et al., 2022 [22])
AI tools: Rough Set-based Machine Learning
Performance metrics: accuracy
Processing time: n.a.
Best performance: n.a.

Waste management (Moral et al., 2022 [39])
AI tools: Computer Vision; deep learning: EfficientDet, YOLOv5
Performance metrics: intersection over union (IoU), precision, recall, average precision (AP)
Processing time: n.a.
Best performance: YOLOv5

Waste management (Agnew et al., 2023 [3])
AI tools: Computer Vision; deep learning: Faster-RCNN, RetinaNet, Grid-RCNN, YOLOF, Mask-RCNN, Cascade Mask-RCNN, YOLACT, SOLOv2
Performance metrics: mean average precision (mAP), average precision, normalized confusion matrix
Processing time: n.a.
Best performance: YOLOF, SOLOv2
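For reference, the regression metrics that recur throughout Table C.1 have the following standard definitions (these are not taken from Konya and Nematzadeh [27]; here $y_i$ denotes an observed value, $\hat{y}_i$ the corresponding prediction, and $\bar{y}$ the mean of the $n$ observations):

\[ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \]
\[ \mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left\lvert\frac{y_i - \hat{y}_i}{y_i}\right\rvert, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}. \]

The Nash–Sutcliffe efficiency (NSE) reported for the hydrological entries has the same form as $R^2$, with observed and simulated streamflow taking the roles of $y_i$ and $\hat{y}_i$.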