An exploratory field study on the use of data management and data quality requirements in ML-enabled software applied in environmental research
Master's Thesis in Computer Science and Engineering
Devasinghage Sara Nirmani Mahagamarachchi
Hikkaduwa Liyanage Pamali Chathurika
Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2025

Master's Thesis 2025
© Devasinghage Sara Nirmani Mahagamarachchi, 2025.
© Hikkaduwa Liyanage Pamali Chathurika, 2025.
Supervisor: Hans-Martin Heyn, Department of Computer Science and Engineering
Supervisor: Yi Peng, Department of Computer Science and Engineering
Examiner: Eric Knauss, Department of Computer Science and Engineering
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Typeset in LaTeX
Gothenburg, Sweden 2025

Abstract
Integrating machine learning into environmental science has shown great promise in improving research outcomes. However, the effective application of machine learning and the reliability of the results depend heavily on data quality and management practices, which are often overlooked or addressed inconsistently. A proper data pipeline that embeds good practices for data quality and data management is therefore essential. This thesis introduces SPADES-ML (Scientific Pipeline Assessment and Data-Centric Evaluation Scorecard for Machine Learning), a structured assessment framework developed to evaluate the quality and transparency of data-related practices in machine learning-based research. SPADES-ML is demonstrated through a case study of machine learning-based environmental research. A total of 28 research papers were analysed using SPADES-ML. The framework was applied to assess five critical areas: data selection and suitability, data quality, adherence to the FAIR principles, data preprocessing, and challenges in preprocessing. A survey targeting practitioners in machine learning-based environmental research was conducted to validate the findings. The analyses of the literature and the survey revealed recurring challenges in ensuring data quality, reproducibility, and methodological excellence. Furthermore, this study provides initial recommendations to improve data practices in machine learning-based research by adhering to software engineering principles.
This thesis contributes to the emerging field of research software engineering by offering a structured evaluation and guidelines for robust methodology pipelines in interdisciplinary, machine learning-based research.

Keywords: Data-Centric Evaluation, Data Management, Data Quality, Data Quality Challenges, Environmental Research, FAIR, Machine Learning, Methodological Guidelines, Software Engineering, SPADES-ML

Acknowledgements
We would like to thank Hans-Martin Heyn and Yi Peng, our supervisors in software engineering, for their great support and assistance in guiding our work on this thesis. Both have been a great encouragement to us and have been very helpful throughout the thesis. We would also like to thank Eric Knauss, our examiner, for his valuable feedback throughout the research. Our appreciation also goes to Michelle Nerentorp for her insights into environmental research. Finally, we would like to thank our survey respondents for the invaluable time and insights they provided.
Devasinghage Sara Nirmani Mahagamarachchi, Gothenburg, June 2025
Hikkaduwa Liyanage Pamali Chathurika, Gothenburg, June 2025

Contents
List of Figures
List of Tables
1 Introduction
1.1 Problem Description
1.2 Purpose of the Study
1.3 Research Questions
1.4 Thesis Outline
2 Background
2.1 Challenges of using ML in environmental research
2.2 Data quality
2.3 FAIR Principles
2.4 Data producers and data consumers
2.5 Data Pipeline Management
3 Related Work
4 Methods
4.1 Analysis Protocol
4.2 Literature review
4.3 Scientific Pipeline Assessment and Data-Centric Evaluation Scorecard for Machine Learning (SPADES-ML)
4.4 Analysis using SPADES-ML
4.5 Survey to validate SPADES-ML
4.6 Analysis of the survey
4.7 Analyse and create recommendations
5 Results
5.1 Analysis of literature using SPADES-ML
5.2 Analysis of Survey responses
5.3 Findings on challenges and recommendations
6 Discussion
6.1 RQ1: How are data selection and data preparation done when applying ML in environmental research?
6.2 RQ2: What are the challenges of data selection and preparation for the application of ML in environmental research?
6.3 RQ3: What solutions are reported in SE that could mitigate the challenges of data selection and preparation for the application of ML in environmental research?
6.4 Contribution to Research Software Engineering
6.5 Threats to validity
6.6 Future work
7 Conclusion
Bibliography
A Survey form to validate the point system
B Thematic analysis
C Referred table from SLR that guided our literature review

List of Figures
4.1 Overview of Methodology. Purple: SPADES-ML method for pipeline assessment; Yellow: literature review; Orange: survey; Red: identification of good practices and recommendation creation; Green: fulfilled research questions
5.1 Dot plot illustrating how many referred research papers fulfilled the criteria in SPADES-ML
5.2 Score distribution based on published year. n(2017-2021): 7, n(2022): 14, n(2023-2024): 5
5.3 Score distribution based on environmental field. n(Air): 3*, n(Energy): 8, n(Climate): 3*, n(Hydrological): 3*, n(Waste): 4*
5.4 Score distribution based on ML techniques used. n(NN): 8, n(SVM): 3*, n(Image processing): 3*, n(ELT): 9, n(Other): 3*
5.5 Score distribution based on data format. n(Time Series): 14, n(Mixed): 9, n(Multimedia): 3*
5.6 Score distribution based on number of data sources. n(Single): 18, n(Multiple): 8
5.7 Distribution of environmental fields of the respondents
5.8 Distribution of the years of experience as an author in environmental research
5.9 Distribution of paper review frequency
5.10 Distribution of the years of experience as a reviewer
5.11 Distribution of data producing experience
5.12 Distribution of origin of data
5.13 Grouped bar chart: mean importance ratings of evaluation criteria by role
5.14 Importance rating as a reader
5.15 Importance of making self-collected data publicly available to increase the credibility of the data
5.16 Importance rating as an author
5.17 Importance of communicating with data producers before retrieving data
5.18 Importance rating as a reviewer
5.19 Importance rating considering all roles
A.1 Section 1(a) of the survey
A.2 Section 1(b) of the survey
A.3 Section 2 of the survey
A.4 Section 3(a) of the survey
A.5 Section 3(b) of the survey
A.6 Section 3(c) of the survey
A.7 Section 3(d) of the survey
A.8 Section 4(a) of the survey
A.9 Section 4(b) of the survey
A.10 Section 4(c) of the survey
A.11 Section 4(d) of the survey
A.12 Section 5(a) of the survey
A.13 Section 5(b) of the survey
A.14 Section 5(c) of the survey
A.15 Section 5(d) of the survey
A.16 Section 6 of the survey

List of Tables
4.1 List of papers considered for the literature review
4.2 Evaluation criteria – SPADES-ML
5.1 A summary of the analysis results
5.2 Summary of the normalised results
5.3 Comparison of normalised and not normalised approaches
5.4 Categorisation by environmental fields
5.5 Categorisation by ML techniques used
5.6 Categorisation by data format
5.7 Categorisation by the number of data sources
5.8 Challenges identified in data selection and preparation for the application of ML in environmental disciplines
5.9 Guidelines for future ML-enabled research
C.1 Table 2 - AI applications in environmental disciplines based on Konya and Nematzadeh [27]

1 Introduction
Environmental research explores different natural environments using data from, among others, soil, water, air, organisms, and other biological sources [27], with a history tracing back to the 19th century [32]. It helps us to understand the complex systems that shape our environment, the impact of human activities, and their consequences. Areas of research such as climate change, biodiversity conservation, air quality monitoring, and water resource management require a large amount of data to be collected, stored, and analysed. The longitudinal nature of many subject areas, such as climate change, deforestation, and biodiversity loss, requires the collection of data over extended periods – often spanning decades [2] – to observe meaningful patterns and trends, contributing to the rapid growth of environmental data. Further, environmental data are collected from heterogeneous sources, including remote sensing such as satellite imagery and drones, ground-based sensors like weather stations and air/water quality monitoring stations, government agencies such as NASA, and human scientists and volunteers [52]. Fast data growth is also driven by increased sensor deployment enabled by advances in the Internet of Things (IoT), as well as by global collaboration and data sharing. However, the complexity of these datasets, for example, due to various geophysical parameters, poses significant challenges for analysis and interpretation [27]. As a result, Artificial Intelligence (AI) has emerged as a useful tool for analysing and deriving insights from the vast and complex data landscape in environmental research. Essentially, AI is about developing computer systems that can perform tasks which typically require human intelligence [61].
Over the past few decades, the recurring advances and efficiency of AI techniques have made them important in various fields and research areas [27]. Machine Learning (ML) is a subset of AI that focuses primarily on learning to model and predict based on past experience [32]. With its advanced decision-making and pattern recognition capabilities, it processes large amounts of data and provides valuable insights. Maganathan et al. [32] discuss which kinds of ML algorithms are used in different types of environmental research; ML is now used in areas such as climate modelling, water quality and air quality monitoring, and natural disaster prediction.
The accuracy and reliability of the results produced by these ML models are determined by the data used for training and during operation of the ML model, and by the ML model itself1,2 [1]. While the ML community used to agree that "the more data the better", more recently the majority consensus has shifted to "garbage in, garbage out" [31]. Meng [35] further argues that "garbage in, garbage out" is not the main concern, but rather "garbage in, package out": if something is recognised as garbage, it can be fixed; however, when garbage is wrapped nicely as a package, it is sold to the uninformed and remains undetected until the model fails. Regardless of the application domain, numerous studies, such as [12], [25], [58], [23], have demonstrated the significant impact of data quality on model performance and reliability.

1.1 Problem Description
Many recent research projects in environmental science that used ML were listed in the systematic literature review (SLR) by Konya and Nematzadeh [27]. The authors examined information such as ML algorithms, performance metrics, and processing times in these research projects. However, they did not investigate how the data are prepared for the ML models and how the quality of the data used as input to these ML models is assessed and maintained throughout their journey [29] across different stakeholders. In Software Engineering (SE), however, researchers and practitioners routinely handle large volumes of data and have developed knowledge and practices for managing such complexity. For example, Munappy et al. [53], [54], [4], [41], [42] have extensively investigated data management in embedded systems, addressing challenges and proposing solutions. These include techniques and practices such as data pipelining and DataOps, which could be highly relevant for managing data in environmental ML applications. Such SE techniques can play an important role in environmental research: Easterbrook [16] showed this in relation to work on climate modelling, an area therefore referred to as environmental informatics. Almikaeel et al. [5], for example, have used ML in hydrological drought forecasting, an area referred to as hydroinformatics.
Advanced concepts, such as the data pipeline engineering proposed by Munappy et al. [53], provide the ability to automate the processing of heterogeneous data from distributed data sources, intensify data life cycle activities, and increase the productivity of data-driven operations. These concepts can be applicable and helpful when handling data for ML-supported environmental research. However, due to various technical errors that can occur during data collection, missing data are a common problem, which adds to the data quality challenges.
Furthermore, if the data are noisy and error-prone, the processing of these data can also be challenging [27]. In summary, existing data management processes for environmental research do not yet solve all data quality challenges associated with the use of ML. To identify potential solutions, such as guidelines, we need to look closely at how data are handled and managed in environmental research and what solutions from an SE perspective are available to handle the identified environmental data challenges in ML-based systems.

1 https://research.aimultiple.com/data-quality-ai/
2 https://hbr.org/2018/04/if-your-data-is-bad-your-machine-learning-tools-are-useless

1.2 Purpose of the Study
The complexity of environmental data has become a challenging obstacle for environmental scientists [27]. To address this, researchers are increasingly adopting ML-based techniques to manage and analyse large datasets. However, they often struggle with managing and selecting quality data that are suitable for training effective ML models. Currently, there is a lack of clear information regarding the data management processes used, the specific data quality requirements that need to be followed, and the challenges that environmental researchers face in managing and preparing data for ML-based applications [68]. This study aims to investigate how environmental researchers select and preprocess data, particularly when using SE techniques such as data pipelining, in the context of ML-supported environmental research. It seeks to identify key stages in data pipelines within recent ML-enabled environmental research, examine the challenges researchers encounter, and explore potential solutions from SE. As a source of potential solutions from SE, this thesis investigates the work of Munappy et al. [53], [54], [4], [41], [42] in the field of ML-supported embedded systems and tries to map these solutions onto the challenges identified in ML-supported environmental research.

1.3 Research Questions
The following research questions guide this thesis work:
1) How are data selection and data preparation done when applying ML in environmental research?
Firstly, many recent research projects using ML in environmental research were listed in an SLR by Konya and Nematzadeh [27]. The authors extracted information such as ML algorithms, performance metrics, and processing times. However, their research did not explore the role of the data used in the ML models. We intend to build on their work and extend their SLR by investigating the origin of the data used in these projects, how the data were selected, including the requirements and metrics used to assess the suitability and quality of the data, and the preprocessing performed before the data were used to train the ML algorithms.
2) What are the challenges of data selection and preparation for the application of ML in environmental research?
Once we have a good understanding of the data selection and preparation processes used by researchers in environmental research using ML, we intend to investigate the challenges researchers face in these processes. Some initial indications might be given in the literature, but we intend to conduct a qualitative study, including a survey with researchers, to understand in more detail the challenges they faced.
3) What solutions are reported in SE that could mitigate the challenges of data selection and preparation for the application of ML in environmental research?
Based on the insights gained from the answers to RQ1 and RQ2, as SE researchers, we will use the existing solutions and techniques already reported in SE and map these solutions to the challenges we have identified.

Delimitations
The study aims to provide initial recommendations for data quality and management practices, considering the challenges faced and insights gained from other SE research. Our thesis primarily focused on identifying problems related to the data pipeline in environmental research and applying our new ranking methodology to examine how the data pipeline is defined in this field. In creating our recommendations, we focused primarily on the findings of Munappy et al. [54], [4], [30], [42], as they describe thorough processes for building data pipelines. We wanted to investigate whether we could transfer some of this knowledge to a completely different field – from embedded systems to environmental research. However, this approach has not yet been validated.

1.4 Thesis Outline
This report is organised as follows:
• Background – introduces the concepts related to this study.
• Related Work – discusses the relevant literature.
• Methods – explains the methods used in this study.
• Results – presents the outcomes of this study.
• Discussion – discusses the findings based on the outcomes.
• Conclusion – provides the conclusion of the thesis.

2 Background
This section discusses the background of the challenges associated with ML-based applications in environmental research, along with the key concepts necessary to understand the analysis and proposed solutions of this thesis.

2.1 Challenges of using ML in environmental research
ML tools have become very popular in environmental research due to their ability to quickly process and analyse large amounts of data with manageable processing needs and tool complexity [27]. Environmental data come from multiple sources and are presented in different formats, increasing the overall complexity of managing and analysing the collected data. Therefore, the accuracy and reliability of these models depend heavily on the quality and relevance of the data used to train them [66]. As noted in a study by Priestley et al. [50], poor data quality and poorly defined data pipelines can affect ML systems in several ways. Inaccurate data can affect the efficiency of the solution and also discriminate against underrepresented entities in the data. This raises the need for well-defined data selection criteria.
There may also be challenges due to a lack of domain expertise. Data scientists specialising in ML may not have sufficient knowledge of environmental science, while environmental scientists may lack familiarity with ML tools and techniques. This significant knowledge gap between the two expert groups can lead to problems with data quality and selection procedures. Interdisciplinary collaboration between these parties is essential to maintain quality outcomes [27]. Furthermore, the scientists collecting the data may not fully understand what kind of data are needed to train specific ML models, which eventually degrades the quality of the ML model.
In order to ensure high-quality ML models, it is necessary to follow good data management practices that focus on data pipelines and DataOps from data collection to model deployment [41].
As data are collected from heterogeneous sources, a dataset may contain missing values and outliers. Poorly defined data pipelines can lead to inconsistencies in data processing and negatively impact the accuracy of the final output. The need for data preprocessing in ML-based applications is well recognised: maintaining the quality of the data improves the learning ability of the model [41], which is a major concern with environmental data.
Although ML offers many advantages in environmental research, a proper understanding of data management and data quality is necessary to achieve good results. Well-defined data pipelines, including suitable data selection methods, must be implemented to address these challenges and improve the credibility and reliability of ML solutions in environmental research.

2.2 Data quality
Data quality is a key factor that affects the effectiveness of ML models in environmental research in terms of the accuracy and reliability of the results. Environmental research covers many disciplines where data may be collected from different sources, leading to several challenges with the complexity of the data. Different data can be used for different purposes, and corresponding data quality requirements need to be defined and established [44]. As stated by Wang et al. [64], high-quality data should be clearly represented, intrinsically reliable, accessible to data consumers, and contextually appropriate for the intended purpose. To improve data quality, it is important to have a proper understanding of what data quality means to the respective data consumers. The same paper defines high-quality data as "data that are fit for use by data consumers". The quality of the data can vary depending on the purpose of the research and the specific environmental discipline. Some researchers may focus on historical consistency when collecting data, while others may focus on time-sensitive data.
Environmental data often have missing values or rapid fluctuations due to sensor problems or other unavoidable situations. Incomplete data can lead to inconsistencies in the results and directly affect the accuracy of the model. Missing or inconsistent data often introduce biases that reduce the performance of the model [59]. In most scenarios, the environmental data must be collected in real time, as even a small delay can compromise the objective of the research and the model's ability to provide timely predictions. Therefore, it is essential to collect quality data and maintain quality throughout the process to provide better insights.
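To make the discussion of missing sensor data concrete, the sketch below shows one minimal way such gaps can be detected and repaired with pandas. The column name, the hourly frequency, and the gap-length threshold are illustrative assumptions, not details taken from any of the reviewed studies.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly air-quality readings with sensor dropouts (NaN).
idx = pd.date_range("2024-01-01", periods=8, freq="h")
readings = pd.DataFrame(
    {"pm25": [12.0, np.nan, 14.5, np.nan, np.nan, 16.0, 15.2, 15.8]}, index=idx
)

# Completeness check: share of missing values per column.
print(readings.isna().mean())

# Simple repair: time-based interpolation over short gaps only (at most two
# consecutive readings), so that longer outages are not silently invented.
repaired = readings.interpolate(method="time", limit=2)
print(repaired)
```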
2.3 FAIR Principles
Environmental researchers also define good data quality through the FAIR principles. The FAIR principles (findable, accessible, interoperable, and reusable) were introduced to increase the value and utility of research data by ensuring that they are managed and shared in a way that maximises their potential for reuse and collaboration [65]. By promoting transparency, reproducibility, and efficiency in research, these principles have become a cornerstone of the open science movement. In environmental research, it is important to ensure that datasets can be located and accessed by researchers worldwide to facilitate long-term monitoring and analysis, especially in areas such as climate change, biodiversity, and pollution [28]. In addition, when working with climate models, it is essential to be able to combine datasets from different sources, such as satellite imagery and field observations, to gain a comprehensive understanding [21]. Findability requires that data and metadata be assigned unique and persistent identifiers so that they can be easily discovered. Accessibility requires that the data be retrievable through standardised protocols, which may include authentication and authorisation procedures that ensure that appropriate levels of access are maintained. Interoperability facilitates data integration across diverse systems by focusing on the use of standardised vocabularies and formats. Finally, reusability focuses on the need for clear licenses for data use and detailed provenance information, so that data can be reused in future research contexts.4,5

4 https://www.vr.se/english/mandates/open-science/open-access-to-research-data/support-and-tools-/making-research-data-accessible-and-fair.html
5 https://www.hb.se/en/about-ub/current/news-archive/2024/october/fair-research-whats-that/

2.4 Data producers and data consumers
In environmental research, two critical stakeholders are data producers and data consumers, each playing a different role in the data ecosystem. Data producers are entities such as government agencies, research institutions, and citizens that generate, collect, and share environmental data [20]. These stakeholders often make use of advanced technologies such as remote sensing, IoT sensors, and field surveys to collect high-quality data on variables such as air quality, water quality, biodiversity, and waste management [51]. For example, government agencies such as the Environmental Protection Agency (EPA), the National Oceanic and Atmospheric Administration (NOAA), and the European Space Agency (ESA) provide satellite-derived environmental data on air quality, climate change, and ocean conditions used in environmental research worldwide.
Data consumers include researchers, policymakers, and industry professionals who are the end users of environmental data. These consumers use the data to analyse, model, and make informed decisions. For example, Nichol et al. [45] use data from multiple sources, including NOAA, the National Snow and Ice Data Center (NSIDC), and the Pan-Arctic Ice Ocean Modelling and Assimilation System (PIOMAS), to calculate the discrepancy between E3SM climate models (the Energy Exascale Earth System Model developed by the United States Department of Energy (DOE)) and observed climate change.
The relationship between data producers and data consumers is critical to the progress of environmental research. Data producers must adhere to standards such as the FAIR principles3 to ensure that the data they provide are accurate, comprehensive, and accessible. This, in turn, enables data consumers to find, understand, and integrate the data into their work in an efficient manner, thereby improving the quality and impact of their research. For example, the Environmental Data Initiative (EDI) emphasises the importance of making data FAIR to support reproducibility and collaboration in environmental science [21]. Most of the research papers reviewed in the literature review of this study play the role of data consumers, while a few have collected data themselves.

3 https://snd.se/en/manage-data/prepare-and-share/FAIR-data-principles
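As a concrete illustration of the FAIR adherence discussed above, the snippet below sketches a minimal dataset metadata record touching each FAIR aspect. The field names loosely follow common dataset-description vocabularies but are simplified, and all values are hypothetical placeholders.

```python
import json

# Hypothetical metadata record; each field is annotated with the FAIR aspect
# it supports.
record = {
    "identifier": "https://doi.org/10.xxxx/example-dataset",  # F: persistent identifier
    "title": "Hourly PM2.5 readings, example monitoring station",
    "access_url": "https://data.example.org/pm25.csv",  # A: standard protocol (HTTPS)
    "format": "text/csv",  # I: common, open file format
    "license": "CC-BY-4.0",  # R: clear reuse terms
    "provenance": "Calibrated against a reference instrument, January 2024",  # R
}
print(json.dumps(record, indent=2))
```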
2.5 Data Pipeline Management
With the continued development of ML models and the use of large datasets, the need for well-established data pipelines has increased significantly [4]. A data pipeline consists of a complex chain of interconnected processes that starts at a data source and ends at a destination with processed data; the destination can be either a storage location or a visualisation tool [54]. By automating the flow of data, a pipeline removes many time-consuming manual steps. The automated operations of a data pipeline can include data selection, extraction, transformation, aggregation, validation, and loading for further analysis and visualisation [53]. A pipeline can vary according to the research area and the type of data being processed. For some data pipelines, the order of the processes is critical, while for others it may be flexible, depending on the processing logic and the data dependencies.
Modern data pipelines emphasise scalability, reliability, fault tolerance, and correctness. Studies highlight that well-designed pipelines reduce the complexity of data preparation processes and improve data accessibility for ML applications [53]. For ML-based environmental research, it is very important to identify a specialised and refined data pipeline. Some parts of the data pipeline, such as data annotation, cannot be fully automated, as annotation techniques do not yet support all types of datasets.
While data pipelines provide an effective tool for managing data by automating its flow, DataOps introduces a process-oriented approach with principles that emphasise collaboration, automation, and monitoring in the management of data workflows [42]. DataOps does not focus only on tools but also on people, as it requires a combination of collaboration and innovation. Studies have shown that it has improved team productivity and reduced error rates in pipelines.
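The stages listed above can be made tangible with a small sketch. Assuming a CSV source with a timestamp column, each stage below is a self-contained function, so stages can be reordered, replaced, or tested in isolation; this is only an illustration of the idea, not a pipeline taken from the cited work.

```python
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    # Extraction from a source file; the column name is an assumption.
    return pd.read_csv(path, parse_dates=["timestamp"])

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Example transformation/aggregation: daily means of all numeric readings.
    return df.set_index("timestamp").resample("D").mean(numeric_only=True)

def validate(df: pd.DataFrame) -> pd.DataFrame:
    # Fail fast instead of passing bad data downstream.
    if df.empty:
        raise ValueError("validation failed: no rows after transformation")
    return df

def load(df: pd.DataFrame, dest: str) -> None:
    # Loading: here simply writing the processed data to its destination.
    df.to_csv(dest)

def run_pipeline(src: str, dest: str) -> None:
    load(validate(transform(extract(src))), dest)
```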
3 Related Work
This section discusses previous studies in ML-enabled environmental science, as well as data quality and related principles in ML-enabled environmental research and data pipelines.

Challenges of applying ML in environmental research due to data quality challenges
The SLR by Konya and Nematzadeh [27] provides a valuable synthesis of recent applications of AI, in particular ML and Deep Learning (DL), in the environmental disciplines. They referred to 26 research papers in their SLR and provided insights into several environmental domains, including air quality monitoring, water quality assessment, energy management, and waste management. The complexity and volume of data are common to many of the studies reviewed, which often include time series data, sensor readings, and geospatial information. These studies demonstrate the use of a variety of techniques, including shallow neural networks, support vector machines, and DL architectures such as Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs).
In order to monitor air quality using IoT-based multisensor systems, De Vito et al. [15] used shallow neural networks (SNNs) and extreme learning machines (ELMs). In their study, data imputation and normalisation techniques were used to mitigate challenges related to sensor calibration and noise. Similarly, Fabregat et al. [17] used multilayer perceptron regressors to predict air quality, achieving high correlation coefficients and demonstrating the effectiveness of ML in environmental monitoring. The study highlighted the importance of preprocessing steps, such as outlier removal and feature selection, to improve model accuracy. These examples highlight the critical role that data preprocessing plays in ensuring the reliability of environmental datasets.
In the field of water quality monitoring, Fijani et al. [18] developed a hybrid model combining extreme learning machines (ELMs) with decomposition algorithms for real-time monitoring of water quality parameters. Their approach demonstrated the potential of hybrid ML frameworks to handle complex datasets, with significant reductions in mean square and mean absolute errors. Their preprocessing pipeline decomposed non-stationary water quality signals into intrinsic mode functions (IMFs) to reduce noise and improve feature extraction. Saboe et al. [57] further advanced the field by using LSTM networks for multivariate time series forecasting, achieving accurate predictions of water quality parameters with minimal errors. They used temporal alignment and normalisation in their preprocessing pipeline and addressed missing data through linear interpolation. These studies highlight the importance of tailored preprocessing techniques in the management of complex environmental datasets.
Bonofiglio et al. [9] focused on ocean monitoring, using DL algorithms for marine species detection and tracking from video data. Their study faced challenges due to the large amount of video data and the need for real-time processing. To address these issues, they used YOLOv5, a state-of-the-art object detection model that significantly reduced processing time while maintaining high accuracy. One of the key challenges was dealing with the poor quality of underwater images, which often suffer from low visibility, colour distortion, and obstructions due to turbid water conditions. They addressed this issue by applying preprocessing techniques, such as contrast-limited adaptive histogram equalisation (CLAHE), to enhance the images. This example demonstrates the importance of choosing the right techniques to efficiently process large amounts of environmental data and overcome data quality challenges.
In the field of waste management, several studies ([3], [63], [39], [22]) demonstrate the application of ML algorithms to predict waste generation patterns, optimise collection routes, and identify different types of waste for efficient sorting. They rely significantly on data preprocessing to manage diverse inputs, including socioeconomic datasets, Internet of Things (IoT) sensor readings, and waste container images. For example, these studies address data quality challenges, such as missing values in municipal waste records or mislabelled waste categories, using techniques like interpolation and synthetic minority oversampling (SMOTE) to address class imbalance. Computer vision approaches use preprocessing pipelines that involve image normalisation and augmentation to improve the accuracy of waste detection. These works highlight the importance of structured data pipelines in transforming raw, noisy data into actionable insights for route optimisation and automated sorting, which ultimately reduces operational costs and environmental impact.
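As a brief illustration of the oversampling technique mentioned above, the sketch below rebalances a synthetic two-class dataset with SMOTE; it stands in for the waste-category data of the cited studies, whose actual features are not reproduced here.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for an imbalanced waste-classification dataset
# (90% majority class, 10% minority class).
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# SMOTE synthesises new minority-class samples by interpolating between
# existing minority-class neighbours.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))
```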
The review by Konya and Nematzadeh [27] provides an overview of the current state of ML applications in environmental disciplines. Through an analysis of the strengths and limitations of different ML techniques, their work provides insights into how ML can be effectively integrated into environmental research through data pipelines, data preprocessing techniques, and domain-specific data handling, paving the way for future advances in the field. However, the review also highlights several challenges, including the requirement for large amounts of labelled data and the lack of standardised data management practices. These results indicate that further research is needed on data quality and management practices in ML-enabled environmental research.

Approaches towards data quality management in environmental research
When considering aspects of data quality in ML-enabled environmental research, the quality of the data is critical for the validity of the results. Wang et al. [64] proposed a multidimensional framework consisting of intrinsic data quality (including accuracy and consistency), contextual data quality (including relevance and timeliness), representational data quality (including interpretability and ease of understanding), and accessibility data quality (including availability and security). Their empirical approach highlighted the complexity of data quality from the perspective of data consumers, identifying 20 dimensions and 118 attributes. This work is particularly relevant to environmental research, where data not only need to be accurate but also contextually appropriate and accessible to a variety of stakeholders.
Meng [35] introduced the concept of data minding, a proactive approach to ensuring data quality throughout the data lifecycle. Meng [35] emphasises the need for rigorous data quality practices, arguing that poor data quality leads to "garbage in, package out". This is in line with environmental research, where data collection and preprocessing are often error-prone and complex, and robust quality assurance mechanisms are required.
The ethical and practical implications of poor data quality are highlighted by Paullada et al. [47]. Their findings are relevant to environmental research, where datasets are often reused across different studies, thereby amplifying the impact of data quality problems.
Data smells, which are indicators of potential data quality issues in public datasets, have been discussed by Shome et al. [59]. These include missing values, inconsistencies, and outdated information. Their work is important for environmental research, which often relies on publicly available datasets, as it provides a practical lens for diagnosing and addressing data quality issues.
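A scan for such smells can be automated with a few lines of code. The sketch below checks a tabular dataset for three of the smells just mentioned; the staleness threshold and the time-column argument are illustrative assumptions, not part of Shome et al.'s tooling.

```python
import pandas as pd

def scan_for_smells(df: pd.DataFrame, time_col: str, max_age_days: int = 365) -> dict:
    """Report simple indicators of missing, duplicated, or outdated data."""
    age = pd.Timestamp.now() - pd.to_datetime(df[time_col]).max()
    return {
        "missing_ratio": float(df.isna().mean().mean()),  # overall share of NaNs
        "duplicate_rows": int(df.duplicated().sum()),     # exact duplicate records
        "stale": bool(age.days > max_age_days),           # newest record too old?
    }
```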
The Candidate Framework for Data Quality Assessment and Maintenance (CaFDaQAM) proposed by Pradhan et al. [49] provides a structured approach to managing data quality. The framework includes four components: a data quality workflow, a data quality challenge list, a data quality attribute list, and solution candidates. This framework can be adapted to environmental research, where data quality challenges often arise from heterogeneous data sources and varying methods of data collection.
The ISO/IEC 25012 quality model [19] used by Nguyen et al. [44] provides a robust framework for assessing data quality, focusing in particular on the following inherent data quality dimensions: accuracy, timeliness, completeness, consistency, and credibility. These dimensions are critical to ensuring that data are not only reliable but also fit for purpose. Accuracy ensures that data are an accurate representation of real-world phenomena, which is essential for environmental research. Timeliness ensures that the data are up to date, a critical consideration for time-sensitive applications such as climate monitoring or disaster response. Completeness reduces the risk of gaps that could undermine analyses by ensuring that all required data are present. Consistency is especially important when integrating datasets from multiple sources, as it ensures that the data are free of contradictions. Finally, credibility ensures that data are trustworthy, an essential attribute for informing policy decisions or long-term environmental strategies. These dimensions are particularly important in environmental research, where data are often used for long-term monitoring, ecosystem management, and policymaking. By following the ISO/IEC 25012 quality model, researchers can systematically address data quality challenges and ensure that their datasets are robust, reliable, and reusable for multiple applications.

Towards data pipelines for ML in environmental research
As data flow through the various stages of a data pipeline, such as collection, preprocessing, and analysis, integrating data quality practices into the pipeline is essential to ensure the reliability and usability of environmental data.
Data management has become increasingly important in ML-based applications over time, as the success of an ML model depends mainly on the quality of the data and how they are processed. Data management has a significant impact on model accuracy and, to eliminate the challenges that come with poor data management, practices such as data pipelines and DataOps should be followed [41]. As noted by Munappy et al. [53], data pipelines can process multiple streams of data simultaneously, making them essential for applications that require real-time data processing. The authors also discuss the practical aspects of data pipeline management, highlighting the challenges and opportunities in maintaining efficient and reliable data flows. Among the issues discussed are:
• Complexity: data pipelines often involve multiple stages and dependencies, making them difficult to design, implement, and maintain.
• Data quality: ensuring accuracy, consistency, and completeness throughout the pipeline remains an ongoing challenge.
• Scalability: pipelines must handle increasing volumes of data without compromising performance.
• Integration: combining data from heterogeneous sources (e.g., sensors, databases, APIs) requires robust mechanisms.
• Monitoring and troubleshooting: identifying and fixing pipeline problems can be time-consuming and resource-intensive.
These challenges are particularly evident in environmental research, where data are often noisy, incomplete, or collected at different temporal and spatial resolutions. They highlight the importance of robust pipeline management in handling the complexities of integrating and processing data, which is particularly relevant in environmental research, where data come from diverse sources such as sensors, satellites, and field observations.
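One lightweight response to the monitoring challenge is to wrap every pipeline stage so that its effect on the data is logged. The decorator below is a minimal sketch of this idea (it assumes stages that take and return a DataFrame); it is our illustration, not an implementation from the cited papers.

```python
import functools
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)

def monitored(stage):
    """Log row counts before and after a stage, making silent data loss visible."""
    @functools.wraps(stage)
    def wrapper(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
        out = stage(df, *args, **kwargs)
        logging.info("%s: %d rows in, %d rows out", stage.__name__, len(df), len(out))
        return out
    return wrapper

@monitored
def drop_invalid(df: pd.DataFrame) -> pd.DataFrame:
    # Example stage: remove physically impossible negative concentrations
    # (the column name is an assumption).
    return df[df["pm25"] >= 0]
```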
In their papers, Munappy et al. [54], [4], [30], [42] propose several solutions and best practices to address these challenges. Modular pipeline architectures are explored to address the challenges of designing, implementing, and maintaining complex data pipelines [54]. A modular pipeline architecture breaks down the pipeline into smaller, self-contained components or modules, each performing a specific task such as data ingestion, data transformation, data validation, or data storage. This provides significant advantages, including flexibility, as modules can be easily adapted to new data sources or requirements, which is important in environmental research, where new sensors or data types are frequently introduced. It also improves reusability, allowing modules to be reused across different pipelines or projects and reducing redundancy. In addition, modular architectures improve scalability, interoperability, and maintainability, all of which are particularly important in environmental research, for example, to facilitate integration with different data sources and systems, such as combining satellite imagery, ground-based sensor data, and climate models. Further, it is important to automate and orchestrate pipelines, using tools such as Apache Airflow and Kubernetes, to streamline execution and reduce manual intervention [4]. Munappy et al. also highlighted the need for continuous data quality monitoring at each stage of the pipeline to detect and correct errors early [30]. Finally, the value of collaborative DataOps practices, such as iterative development and cross-functional teaming, for improving pipeline efficiency and reliability was highlighted [42]. Taken together, these solutions provide a robust framework for overcoming the challenges associated with data pipelines in complex domains such as embedded systems.

Significance of the Study
It is clear from the literature that environmental research relies on large amounts of data, and that data quality and the selection process are important to the success of the research. Researchers must understand what kind of data they need and how best to use it. To this end, the data management and preparation processes in environmental disciplines must be carefully examined and addressed. There should be an effective communication channel between data producers and consumers to identify actual needs. This thesis aims to reduce the current data challenges faced by environmental researchers in ML-based research and to propose solutions from SE, while bridging the gap between these two fields. The study aims to facilitate mutual benefits for software engineering and environmental science while mitigating data management challenges by bringing these two fields closer together.

4 Methods
The methodology followed in this research is exploratory, focusing on understanding and analysing the existing approaches taken by environmental researchers to ensure data quality requirements and data management in ML-based research. This is followed by constructive research in which we suggest SE-based solutions to overcome the current problems.
An overview of the techniques used and their relation to the RQs is shown in Figure 4.1. Here, we provide a brief summary of the methods and techniques applied, followed by a more detailed discussion of each step in this study.
The study began by defining an analysis protocol and conducting a literature review to identify current data selection and preparation practices. We then analysed the papers found through the literature review using a scorecard called SPADES-ML (Scientific Pipeline Assessment and Data-Centric Evaluation Scorecard for Machine Learning). We conducted an online survey with environmental researchers and authors of the papers included in the literature review. Using the information collected from all steps, we identified current challenges in the ML-enabled environmental research data pipeline. Then, we studied the literature on pipeline design in the field of SE and formulated suggestions on how environmental researchers could apply techniques from SE to mitigate those challenges.

Figure 4.1: Overview of Methodology. Purple: SPADES-ML method for pipeline assessment; Yellow: literature review; Orange: survey; Red: identification of good practices and recommendation creation; Green: fulfilled research questions.

4.1 Analysis Protocol
We defined an analysis protocol to capture all the necessary details of the data from the selected papers to meet our purpose. Based on a brainstorming session conducted by the two of us, we identified the main areas we needed to analyse in terms of data quality and management. This was done after referring to a few studies and trying to identify important aspects to describe and confirm in the literature. The following points guided the extraction of the necessary information from the selected papers:
• Origin of the data – This records where the data originate: some researchers collect the data themselves (self-created), and some collect data from external sources.
• How do data producers and data consumers communicate? – Data consumers can either contact data producers and acquire all the data they need, or they can produce the data themselves. The other common scenario is that the data are publicly available, so that anyone can download and access them from a data portal or data repository.
• Availability of the dataset – Can the data be downloaded from a public portal or repository? Were the data published alongside a paper, e.g., as a replication package?
• How did the authors evaluate that the data were suitable? – We checked different forms of argument used to assess suitability: for example, arguments provided in the form of text and the necessary mathematical argumentation mentioned in the paper. We also checked for references to other publications that validate the selection.
• Did the authors mention explicit data requirements?
– Data quality requirements – We focused on the five main data quality requirements defined in the ISO/IEC 25012 quality model: accuracy, currentness, completeness, consistency, and credibility [19].
– Data selection requirements
• Training data – How did they split the dataset into training and validation data? Did they use cross-validation, where the dataset is randomly shuffled before training an ML model?
• Did the authors provide an overview of their data pipeline, such as a diagram or textual explanation, including preprocessing steps?
• Did the authors explain any challenges they encountered with the data or data processing?
• How did they develop preprocessing steps to mitigate the challenges?
• How did the authors handle missing data?
– What did they do with invalid data? Were any data imputation techniques used? How did they handle obstacles or problems with the data?
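For illustration, the protocol headings above can be mirrored in a small machine-readable record, which is convenient when notes are later aggregated; the structure and field names below are our own hypothetical shorthand, not part of the published protocol.

```python
from dataclasses import dataclass, field

@dataclass
class PaperRecord:
    """Per-paper notes following the analysis protocol (hypothetical structure)."""
    paper_id: str
    data_origin: str                 # "self-created" or "external"
    dataset_available: bool          # downloadable or shipped as replication package
    suitability_argued: bool         # textual/mathematical argument or reference
    quality_requirements: list = field(default_factory=list)  # e.g. ["Accuracy"]
    pipeline_described: bool = False
    challenges_reported: bool = False
    missing_data_handling: str = ""  # e.g. "linear interpolation"
```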
4.2 Literature review
The study continued with a literature review, based on the paper by Konya and Nematzadeh [27], Recent Applications of AI in Environmental Disciplines. Their paper focuses on systematically collected information, such as the ML algorithms used, performance metrics, and processing times, in research across a variety of environmental disciplines. It does not examine the details of the data, their quality, or how they are managed in ML-based applications. Table 2 of their paper details the 26 reviewed studies, categorising them by environmental domain and listing the ML techniques used, performance metrics, processing times, and the techniques that performed best. This is illustrated in Appendix C, Table C.1. As our first step, we extended their literature review by examining each research paper individually and investigating areas related to data quality requirements and data management pipelines.
In addition, we used forward and backward snowballing to find other relevant research projects in different environmental disciplines that have used ML-based techniques and that contain more detail on data quality and data management aspects. After selecting the articles, we read them and extracted information according to the analysis protocol we defined. All the papers that were reviewed are listed in Table 4.1.
We based our literature review on the paper by Konya and Nematzadeh [27] because it was published in 2024 and included a variety of environmental subdomains from 2017 to 2024. Furthermore, working with Table 2 in their paper was a good fit for us. First, we screened the first six research papers together and discussed our findings in order to find a common approach. We then divided the remaining papers between us. We each analysed one paper from our allocated set, and the other reviewed the analysis. Once we reached an agreement, we continued to analyse the remaining papers individually, providing each other with updates on a weekly basis. In addition, three of the records we analysed were reviewed by one of our supervisors to ensure that we applied our analysis steps correctly. All relevant information found in each paper was recorded in a spreadsheet under the headings defined in the analysis protocol.

Table 4.1: List of papers considered for the literature review (ID – Environmental Field – Publication)
I – Hydrological modelling – Natel de Moura et al., 2020 [43]
II – Energy forecasting and management – Condemi et al., 2021 [14]
III – Climate modelling – Obahoundje et al., 2024 [46]
IV – Biodiversity & ecosystem monitoring – Sittaro et al., 2023 [60]
V – Energy forecasting and management – Ramos et al., 2022 [55]
VI – Air quality and atmospheric science – De Vito et al., 2020 [15]
VII – Climate modelling – Migallón et al., 2022 [36]
VIII – Hydrological modelling – Almikaeel et al., 2022 [5]
IX – Waste management – Agnew et al., 2023 [3]
X – Hydrological modelling – Almikaeel et al., 2024 [6]
XI – Biodiversity & ecosystem monitoring – Tuia et al., 2022 [62]
XII – Energy forecasting and management – Zhou et al., 2022 [67]
XIII – Air quality and atmospheric science – Fabregat et al., 2022 [17]
XIV – Marine and aquatic science – Bonofiglio et al., 2022 [9]
XV – Energy forecasting and management – Miller et al., 2022 [37]
XVI – Biodiversity & ecosystem monitoring – Moran et al., 2017 [34]
XVII – Waste management – Moral et al., 2022 [39]
XVIII – Hydrological modelling – Miro et al., 2021 [38]
XIX – Water quality monitoring – Saboe et al., 2021 [57]
XX – Energy forecasting and management – Da Costa Alves Basílio et al., 2022 [7]
XXI – Waste management – Gue et al., 2022 [22]
XXII – Air quality and atmospheric science – Chojer et al., 2022 [13]
XXIII – Energy forecasting and management – Pombo et al., 2022 [48]
XXIV – Water quality monitoring – Fijani et al., 2019 [18]
XXV – Waste management – Velis et al., 2023 [63]
XXVI – Energy forecasting and management – Manfren et al., 2022 [34]
XXVII – Climate modelling – Nichol et al., 2021 [45]
XXVIII – Energy forecasting and management – Maltais and Gosselin, 2022 [33]

4.3 Scientific Pipeline Assessment and Data-Centric Evaluation Scorecard for Machine Learning (SPADES-ML)
We applied the next stage of our analysis to the notes collected during the literature analysis. Specifically, we wanted to systematically evaluate and categorise the research papers we reviewed in order to present the results in a structured way. This would show how data quality and data management practices are currently applied or reported in research papers. The purpose of SPADES-ML is to evaluate the methodology in ML-enabled environmental research. Besides the aforementioned analysis criteria, we also incorporated the FAIR principles, given their significant role in environmental research data sharing [65].
As shown in Table 4.2, we considered a total of five categories that take into account different aspects and stages of data in a scientific ML-driven research project:
1. Data selection requirements and suitability: These relate to requirements describing why particular data sources were selected for a given purpose. Data suitability reasoning may include statistical analysis performed to verify certain conditions, results of a previous study using the same data source, and so on.
2. Data quality requirements: We selected the five inherent data quality attributes from the ISO/IEC 25012 quality model; their definitions are given below [19].
• Accuracy: the degree to which data has attributes that correctly represent the true value of the intended attribute of a concept or event in a specific context of use.
• Completeness: the degree to which subject data associated with an entity has values for all expected attributes and related entity instances in a specific context of use.
• Credibility: the degree to which data has attributes that are regarded as true and believable by users in a specific context of use. Credibility includes the concept of authenticity (the truthfulness of origins, attributions, commitments).
• Currentness: the degree to which data has attributes that are of the right age in a specific context of use.
• Consistency: the degree to which data has attributes that are free from contradiction and are coherent with other data in a specific context of use. It can apply either or both among data regarding one entity and across similar data for comparable entities.
3. FAIR principles: As mentioned earlier in Chapter 2, the FAIR principles are widely used in environmental science research. We added several criteria to check that the FAIR principles are followed. For findability, we looked at the availability of metadata about the data, which allows readers to find and understand the data. We looked at the accessibility of both raw data and processed data; where authors do not have the right to share data, it is still possible to make the metadata available. The use of common file formats such as CSV, MS Excel, MS Word, PDF, JSON, XML, SQL, PNG, JPEG, etc., was considered under interoperability. The choice of a common file format can be determined on the basis of the data type (https://snd.se/en/manage-data/guides/choosing-file-format). For reusability, all shared resources should be available in the long term; GitHub would be a good example of long-term reuse. The replication package could include data, data cleaning code, analysis code if applicable, and relevant documentation that would help replicate the work (https://www.econometricsociety.org/publications/es-data-editor-website/package).

4. Data preprocessing: Real-world data are often incomplete and inconsistent and may contain errors, so preprocessing the data is an inevitable step in the entire process.

5. Challenges encountered: It is highly unlikely that one will not face any challenges during preprocessing. Numerous issues can arise during the data cleaning and imputation processes.

Table 4.2: Evaluation criteria - SPADES-ML

C1    Data selection requirements and arguments of data suitability       0-2
  1.1   Mention of data selection criteria or requirements                1
  1.2   Reason for suitability of data                                    1
C2    Data quality requirements argued / validated                        0-2.5
  2.1   Accuracy                                                          0.5
  2.2   Completeness                                                      0.5
  2.3   Credibility                                                       0.5
  2.4   Currentness                                                       0.5
  2.5   Consistency                                                       0.5
C3    FAIR                                                                0-2
  3.1   F: Metadata linked or available otherwise                         0.5
  3.2   A: Available to download
    3.2.1   Raw data                                                      0.25
    3.2.2   Processed data                                                0.25
  3.3   I: Common file formats used                                       0.5
  3.4   R: Possibility for future use
    3.4.1   Data storage location: long-term                              0.25
    3.4.2   Replication package                                           0.25
C4    Data Preprocessing                                                  0-2
  4.1   Preprocessing described / reasons given                           1
  4.2   Reproducible (e.g., code available or detailed steps)             1
C5    Challenges during data processing                                   0-2
  5.1   Mention of any challenges encountered during data processing     1
  5.2   Provide reproducible solutions to these challenges                1

Scoring for each element: A great deal of thought went into assigning points to the criteria and requirements in SPADES-ML. As we cannot argue why any one category should be more important than another, we decided that each category receives the same maximum of 2 points. The one exception is the category of data quality requirements, where we allow 2.5 points, for two reasons. First, it was convenient to assign 0.5 points to each of the five quality attributes. Second, we consider the definition of clear data quality requirements a critical element when describing data preparation for ML, and therefore decided to weight this category slightly more. To balance the scores and make the comparison fair, we normalised the scores after the analysis, as shown in Chapter 5.

As part of the design of SPADES-ML, we also examined how similar scoring systems are applied in SE. Carrillo de Gea et al. [11] used a ranking system to evaluate RE tools in different categories of RE; for this purpose, a plus (+), zero, and minus (-) based scoring system was used. We decided to use numerical scores rather than +/- because we were interested in performing a statistical analysis to further interpret the results. In addition, a minus (-) is not relevant to our evaluation scenario, because we are only evaluating whether a criterion is fulfilled or not. For example, not mentioning the data selection criteria does not make the situation worse; rather, it should simply receive zero points. However, we do mimic this system of assigning ++/-- by assigning values within the range, such as 0.5, when it is not quite possible to give full points.

Several other papers used a similar scoring system. Achimugu et al. [2] used a quality assessment (QA) to identify and analyse existing software requirement prioritisation techniques. The selected primary studies were assigned scores based on five QA questions, with binary or partial scores assigned to each (e.g., "Yes" = 1, "Partly" = 0.5, and "No" = 0). The quality score for a particular study is computed by summing the scores of the answers to all QA questions. In their SLR on the practices and challenges of agile RE, Inayat et al. [24] used a similar QA method with four criteria to assess the quality of the evidence present in the studies; to improve the rating and categorisation, the authors used an ordinal scale similar to that of Achimugu et al. [2] instead of a dichotomous scale, and normalised scores were considered for further steps. Kitchenham and Charters [26] used a checklist of twelve questions to identify, evaluate, and synthesise the SR methodology in SE. The questions in the checklist were assigned the following values: "Yes" = 1, "Partly" = 0.5, "No" = 0, and "Not Applicable" = NA. In-between values were possible, as interpolation of numerical values was permitted. Furthermore, "NA" values were assigned a value of 0 for the statistical analysis. SPADES-ML was built on these established practices.
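To make the scoring procedure concrete, the following minimal sketch in Python shows how Yes/Partly/No judgements on the sub-criteria of Table 4.2 can be aggregated into the five category scores. The judgements in the example are purely illustrative, and the sketch is not the tooling we actually used:

# Maximum points per sub-criterion, following Table 4.2.
MAX_POINTS = {
    ("C1", "1.1"): 1.0, ("C1", "1.2"): 1.0,
    ("C2", "2.1"): 0.5, ("C2", "2.2"): 0.5, ("C2", "2.3"): 0.5,
    ("C2", "2.4"): 0.5, ("C2", "2.5"): 0.5,
    ("C3", "3.1"): 0.5, ("C3", "3.2.1"): 0.25, ("C3", "3.2.2"): 0.25,
    ("C3", "3.3"): 0.5, ("C3", "3.4.1"): 0.25, ("C3", "3.4.2"): 0.25,
    ("C4", "4.1"): 1.0, ("C4", "4.2"): 1.0,
    ("C5", "5.1"): 1.0, ("C5", "5.2"): 1.0,
}
# Judgement scale in the spirit of Achimugu et al. [2] and Kitchenham and
# Charters [26]: "yes" earns full points, "partly" half, "no" none.
FRACTION = {"yes": 1.0, "partly": 0.5, "no": 0.0}

def score_paper(judgements):
    """Aggregate sub-criterion judgements into the five category scores."""
    totals = {c: 0.0 for c in ("C1", "C2", "C3", "C4", "C5")}
    for (category, item), judgement in judgements.items():
        totals[category] += MAX_POINTS[(category, item)] * FRACTION[judgement]
    return totals

# Hypothetical judgements for a single paper:
example = {key: ("yes" if key[0] in ("C1", "C2") else "no") for key in MAX_POINTS}
print(score_paper(example))  # {'C1': 2.0, 'C2': 2.5, 'C3': 0.0, 'C4': 0.0, 'C5': 0.0}

The same structure makes it straightforward to later divide each category total by its maximum, which is exactly the normalisation applied in Chapter 5.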
4.4 Analysis using SPADES-ML

Using the notes taken during our literature review according to our analysis protocol, we analysed each paper against each of the criteria in Table 4.2. We conducted a pair analysis, i.e., both researchers worked together, in order to reduce personal bias in the evaluation. Most of the time, we referred back to the research papers and had to read between the lines to identify some criteria, such as data quality requirements and data processing challenges. A summary of the scores for each paper is shown in Table 5.1 in Chapter 5.

The following two papers were disregarded during the analysis using SPADES-ML. The paper by Natel de Moura et al. [43] was unfortunately unavailable to us, even with the support of our university library; only a conference presentation of the paper could be found, which provides some details but not enough to allow a fair judgement. We also excluded the paper by Tuia et al. [62] because it provides only a generic perspective, rather than a specific case study using a data source: it gives an overview of how ML can be used in wildlife conservation. As both papers do not provide sufficient insight into data quality and data management, they were disregarded.

Further, several factors were considered when interpreting the results, including the publication year, environmental field, ML techniques used, data format type, and number of data sources. Our goal was to determine how data quality and data management vary with each of these factors. We grouped the research papers based on the aforementioned factors. A group was considered only if it contained at least three papers; groups that were not considered are marked with an "*".

As for the statistical analysis, the mean, median, variance, standard deviation, and 90% and 95% confidence intervals were calculated for each of the categories C1 to C5 in SPADES-ML and for the total column. The 90% confidence interval was considered in line with common conventions, and the 95% confidence interval was also considered because it is the standard in scientific research. Both intervals convey the uncertainty around the mean; the 95% confidence interval is the wider and therefore more conservative of the two. Box plots are used to visualise the results of the statistical analysis because of their clarity and simplicity. The results for each factor are discussed in the respective subsections of Chapter 5.
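The sketch below (in Python, with illustrative score values grouped by the publication-year grouping used in Chapter 5) makes this statistical step concrete: it computes the descriptive statistics and t-based confidence intervals per group and draws the corresponding box plots. It is a simplified illustration of the analysis, not the exact script we used:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Illustrative normalised total scores (out of 5) per publication-year group.
groups = {
    "2017-2021": [0.5, 2.3, 1.8, 2.05, 2.8, 1.95, 3.05],
    "2022": [2.55, 2.1, 2.75, 2.3, 4.875, 1.6, 3.425,
             4.75, 2.35, 2.275, 1.85, 2.0, 2.8, 4.25],
    "2023-2024": [1.6, 4.5, 3.725, 2.8, 1.7],
}

for name, values in groups.items():
    x = np.asarray(values, dtype=float)
    mean, median = x.mean(), np.median(x)
    var, sd = x.var(ddof=1), x.std(ddof=1)   # sample variance and std. deviation
    sem = stats.sem(x)                        # standard error of the mean
    ci90 = stats.t.interval(0.90, df=len(x) - 1, loc=mean, scale=sem)
    ci95 = stats.t.interval(0.95, df=len(x) - 1, loc=mean, scale=sem)
    print(f"{name}: mean={mean:.2f} median={median:.2f} sd={sd:.2f} "
          f"90% CI={ci90} 95% CI={ci95}")

plt.boxplot(list(groups.values()), labels=list(groups.keys()))
plt.ylabel("Total score (out of 5)")
plt.show()

Note that the 95% interval computed this way is always the wider of the two, since a higher confidence level requires a larger margin of error.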
4.5 Survey to validate SPADES-ML

We designed a survey to be sent to the authors of the papers included in our literature review. Our aim was to obtain their expert opinion on the criteria we had included in SPADES-ML. However, to reduce bias in the participants' responses, no details about SPADES-ML or the inclusion of their paper in our case study were disclosed. Participants were asked to give their opinion from the perspective of the three academic roles most researchers take: reader of a paper, author of a paper, and reviewer of a paper. This allowed the importance of the criteria to be gathered from three perspectives, so that the overall points assigned to the criteria could be discussed.

The survey consisted of four main sections. The first section collected demographic information in the form of closed-ended questions. The following questions were included in this section:
• Which field of environmental research are you specialised in?
• How many years of experience do you have as an author in environmental research?
• Do you regularly review papers for journals or conferences?
• If yes: How many years of experience do you have in reviewing research papers?
• Would you consider yourself a "data producer", i.e., someone who creates datasets through e.g., field measurements?
• Would you consider yourself a "data consumer", i.e., someone who uses data for e.g., modelling tasks or other aspects of your research?
• If yes to data consumer: Where does the data come from that you use in your research?

In each of the remaining three sections, we asked the participants to judge the importance of the different criteria of SPADES-ML from the perspective of being a reader, author, and reviewer of a scientific paper in environmental research. These sections consisted of Likert-scale questions designed with four options, two negative and two positive, so that a neutral option was excluded. We provided definitions whenever necessary to clarify the terms we were looking for and included links directing respondents to the sources of those definitions. Follow-up open-ended questions were added for each criterion to capture further ideas and thoughts from the respondents. Before the survey was sent to the authors, it was validated by two independent researchers, including one with an environmental science background. The validation provided by the independent researchers, together with feedback from our thesis supervisors, helped refine the survey into its final version. The survey form can be found in Appendix A.
Survey population selection: The survey was initially sent to 55 authors of the original papers listed in Table 4.1. This group included authors who had provided their contact information and the corresponding authors of the papers included in our literature review. Due to the very low initial response rate of 5.45%, even after sending reminders, we decided to extend the survey population.

Extending the survey population: The survey was sent to different groups of practitioners. A new survey link, duplicating the original, was sent to each group so that the response rates could be tracked separately. We found the contact details of the remaining 49 authors who had contributed to the original papers in the literature review and distributed the survey to them; this group had the lowest response rate, at 4.08%. A reminder email was sent to all groups halfway through. To extend the survey population even further, we applied forward snowballing: we searched for papers citing our original set of papers and contacted the corresponding authors. This group included 123 practitioners, with a response rate of 11.38%. Finally, we manually searched the websites of universities and research institutes to identify environmental researchers who apply ML. This group included 24 practitioners, and the response rate was 16.67%. The overall response rate was 9.16%, with 23 responses after the survey had been sent to 251 practitioners.

4.6 Analysis of the survey

The data collected from the survey were analysed to extract information and conceptualise knowledge in order to answer the research questions. Our study followed a mixed approach with both qualitative and quantitative questions.

Qualitative analysis

Thematic analysis (TA) was used to analyse the qualitative data collected from the survey. TA is a flexible method used to identify, analyse, and report patterns or themes within qualitative data [10]. It was used to organise and describe the data collected with the open-ended questions, which were presented as follow-up questions to each criterion.

Familiarisation: The first step was to familiarise ourselves with the responses. To do this, we extracted the data into an Excel spreadsheet and cleaned it. We removed responses that contained only placeholders such as "No comments", "No", or "None", as well as irrelevant data. The responses were then grouped according to the academic roles we defined. We went through them as a pair to understand and identify valuable insights, returning to the responses several times to familiarise ourselves with the content and grasp the initial ideas.

Initial coding: As the second step, codes were created manually for each response; we went through the responses iteratively a number of times before finalising the coding. This was done to capture the meaningful ideas behind the comments. Some of the codes we identified are "preprocessing crucial to explain", "Data to confirm credibility", and "quality for accuracy".

Defining themes: The initial codes were then reviewed, and similar codes were grouped to identify appropriate themes. This is a bottom-up approach in which themes were identified from the data we collected and the criteria we defined. The identified themes are listed together with the findings in Chapter 5.

Reviewing themes: The themes were reviewed together to check that they were distinct from each other, merging any overlapping themes or redefining them where necessary.

Defining and naming themes: The identified themes were labelled alongside the codes and the respective quotes. Both the coding and theming processes were iterative, and we managed them for each role separately to identify any role-specific patterns.

Report: The final step was to formulate the findings of the themes and link them with the findings of the literature review and the survey for each criterion. This approach allowed us to extract the essence of the findings and present the formal results in Chapter 5.
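As an illustration of the familiarisation step, the following sketch in Python/pandas (with hypothetical file and column names) removes placeholder answers and groups the remaining free-text comments by academic role, mirroring the cleaning we performed manually in the spreadsheet:

import pandas as pd

# Hypothetical export of the survey's open-ended answers with the columns
# "role", "criterion" and "comment".
responses = pd.read_excel("survey_responses.xlsx")

# Drop placeholder answers such as "No", "None" or "No comments".
placeholders = {"no", "none", "no comments", "n/a", "-"}
mask = responses["comment"].fillna("").str.strip().str.lower().isin(placeholders)
cleaned = responses[~mask & responses["comment"].notna()]

# Group the remaining comments by role for the coding step.
for role, comments in cleaned.groupby("role")["comment"]:
    print(role, len(comments), "comments kept")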
Quantitative analysis

The quantitative data generated by the Likert-scale questions were passed through a data preparation stage, where the responses were converted into numerical values (1-4). Responses were grouped by role for in-depth analysis. There were some missing values, especially among the reviewer responses, as some participants had no experience in reviewing papers and chose to skip those questions. Missing values were excluded so that they would not affect the final results.

Initially, we calculated the mean value of each criterion over the 23 collected responses and visualised the findings using a grouped bar chart, assuming that this would provide the right insights for the roles we had defined. We were able to gather some interesting information, but we then realised that a mean-based analysis could produce misleading conclusions that would distort the actual findings. Considering the number of responses and the need for a role-based analysis, we instead performed a descriptive analysis that highlights frequency distributions. With this approach, we were able to preserve the original values and use them to derive more insights.

A diverging stacked bar chart was the appropriate visualisation for this scenario. We created a chart for each role containing all the criteria defined in Table 4.2, including the number of responses for each, and another chart to evaluate the overall results across all responses regardless of role. In these charts, the responses are split, with positive responses diverging to the right and negative responses to the left. Each response option was assigned a different colour to visualise the spread; as no neutral point was included in the survey design, the charts consist of four colours. Shades of red represent the "Not important at all" and "Somewhat unimportant" options, and shades of green represent the "Very important" and "Somewhat important" options. Based on the generated output, we were able to identify patterns in the findings and relationships between different aspects and roles. The charts and their interpretations, including the role-based observations, are explained in detail in Chapter 5.
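The following sketch in Python/matplotlib (with illustrative counts for three criteria; the full charts are shown in Chapter 5) demonstrates the construction of such a diverging stacked bar chart, with the negative options stacked to the left of zero and the positive options to the right:

import numpy as np
import matplotlib.pyplot as plt

criteria = ["Data selection criteria", "Data accuracy", "Metadata"]
# Counts per option: [not important at all, somewhat unimportant,
#                     somewhat important, very important] (illustrative).
counts = {
    "Data selection criteria": [0, 0, 3, 20],
    "Data accuracy": [0, 0, 2, 21],
    "Metadata": [0, 0, 10, 13],
}
labels = ["Not important at all", "Somewhat unimportant",
          "Somewhat important", "Very important"]
colors = ["#b2182b", "#ef8a62", "#a1d76a", "#4d9221"]  # red to green shades

fig, ax = plt.subplots()
y = np.arange(len(criteria))
for i, criterion in enumerate(criteria):
    neg1, neg2, pos1, pos2 = counts[criterion]
    lab = labels if i == 0 else [None] * 4        # one legend entry per option
    ax.barh(y[i], -neg2, color=colors[1], label=lab[1])            # left of zero
    ax.barh(y[i], -neg1, left=-neg2, color=colors[0], label=lab[0])
    ax.barh(y[i], pos1, color=colors[2], label=lab[2])             # right of zero
    ax.barh(y[i], pos2, left=pos1, color=colors[3], label=lab[3])
ax.axvline(0, color="black", linewidth=0.8)
ax.set_yticks(y)
ax.set_yticklabels(criteria)
ax.set_xlabel("Number of responses")
ax.legend()
plt.show()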
4.7 Analyse and create recommendations

Through the analysis carried out using the literature review and SPADES-ML, we identified some of the challenges faced during the data pipeline process. Additional input was gathered from the qualitative analysis of the survey data, and we identified challenges from the perspectives of readers, authors, and reviewers. We selected a few high-scoring papers on SPADES-ML and identified the approaches and good practices used by their authors. This provided insight for recommendations on avoiding some of the identified challenges; thus, a connection was made between SPADES-ML, the challenges, and the recommendations. It is also important to note that the challenges were rarely discussed in detail in the reviewed papers.

Having gained a sufficient understanding of the challenges, we looked for mitigations among the solutions discussed by Munappy et al. [54] and in [4], [30], [42]. We focused on these works because they provide deep insight into how SE techniques can be applied to build a thorough methodology for data pipelines for ML in a specific application, and we were interested in investigating whether these methods can also be applied in environmental research. These insights were used to generate the initial recommendations, together with findings gathered from other literature. The recommendations were structured to address existing problems in data quality and data pipelines in environmental ML research, using the insights we can provide from the SE perspective.

5 Results

This chapter presents the results, including the analysis of the reviewed research papers using SPADES-ML, the outcome of the survey study, and the set of recommendations based on the identified challenges and findings.

5.1 Analysis of literature using SPADES-ML

As mentioned in Chapter 4, a total of 28 research papers were considered, including 26 from the paper by Konya and Nematzadeh [27] and two further studies included through forward and backward snowballing. The results for each paper, obtained after reviewing the literature based on the analysis protocol, were systematically evaluated using SPADES-ML. As stated in Table 4.2, we defined five categories in SPADES-ML, and points were allocated according to the categories and sub-criteria defined for each. Table 5.1 provides a summary of the analysis of the reviewed papers using SPADES-ML; it shows the points allocated for each category, and the last column sums the total points scored by each paper. The total column is intended for further statistical analysis only. It is not meant for comparing one study to another and drawing conclusions such as "study A is better than study B": each study has its own strengths and weaknesses, so such comparisons based on the total score are not meaningful.

The maximum score differs between the categories: C1, C3, C4, and C5 can reach 2.0, while C2 can reach 2.5. To create balance among the categories, we decided to normalise the scores to the range 0-1. Since all categories except C2 have a maximum score of 2.0, we divided their scores by 2.0, and we divided the C2 scores by 2.5. We wanted to check whether direct comparisons between the categories would otherwise lead to discrepancies. Table 5.2 shows the results after normalisation. To compare the two approaches, we sorted the research papers in ascending order of total score, as shown in Table 5.3. When we compared the two results, there were no significant differences in the order; both appear consistent with each other. We therefore decided to use the normalised scores for further analysis.
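As a small worked example of this normalisation, the sketch below (in Python/pandas, showing three papers with values taken from Table 5.1) divides each category by its maximum and sums the result. The agreement of the two orderings, which we checked by inspection, could also be made explicit with a rank correlation such as Spearman's rho:

import pandas as pd
from scipy.stats import spearmanr

# Raw SPADES-ML scores for three papers (taken from Table 5.1).
raw = pd.DataFrame(
    {"C1": [2, 2, 1], "C2": [2.5, 2.5, 2], "C3": [2, 1.75, 0],
     "C4": [2, 2, 1], "C5": [1, 2, 1]},
    index=["IV", "XIV", "XIII"],
)
max_points = pd.Series({"C1": 2.0, "C2": 2.5, "C3": 2.0, "C4": 2.0, "C5": 2.0})

normalised = raw / max_points          # each category mapped to the range 0-1
raw_total = raw.sum(axis=1)            # out of 10.5
norm_total = normalised.sum(axis=1)    # out of 5

rho, _ = spearmanr(raw_total, norm_total)   # 1.0 if the orderings agree fully
print(normalised.round(3))
print(f"Spearman rho = {rho:.2f}")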
After careful consideration, as noted in Chapter 4, we excluded papers I and XI because we were not able to use them for a detailed analysis due to the missing or generic details they contain. These exclusions did not affect the results and helped us analyse the content correctly.

Table 5.1: A summary of the analysis results

Paper ID   C1 (max 2 p.)   C2 (max 2.5 p.)   C3 (max 2 p.)   C4 (max 2 p.)   C5 (max 2 p.)   Total (out of 10.5 p.)
I          1      0      0      0      0      1
II         2      2      0      1      1.5    6.5
III        1      1.5    0      1      0      3.5
IV         2      2.5    2      2      1      9.5
V          2      2      1      0      0      5
VI         1      2      0      0.5    1      4.5
VII        2      2      1      0      0.5    5.5
VIII       1      2      0      0      1      4
IX         2      2      0      1      1      6
X          2      2.5    1.5    2      1      9
XI         0      0.5    0.75   0      1      2.25
XII        1.5    2.5    0      1      1      6
XIII       1      2      0      1      1      5
XIV        2      2.5    1.75   2      2      10.25
XV         1      0.5    1.5    1      0      4
XVI        1      2      0      1      1      5
XVII       2      2      0      1      1      6
XVIII      1      1.5    0      0.5    1      4
XIX        2      2.5    0      0      0      4.5
XX         1      1      1.75   1      0      4.75
XXI        2      2      0.25   1      2      7.25
XXII       1      1.5    0      1      1      4.5
XXIII      2      2.5    1.5    2      2      10
XXIV       1.5    0.5    0      0.5    1      3.5
XXV        2      1.5    1.25   2      1      7.75
XXVI       2      1.5    0      1      0.5    5
XXVII      0.5    2      1.5    1      1      6
XXVIII     1      1.5    0      0      1      3.5
C1 = Data selection and data suitability; C2 = Data quality requirements; C3 = FAIR; C4 = Preprocessing; C5 = Challenges during data processing.

Overview of the fulfilment of the SPADES-ML criteria

To assess how well each criterion was fulfilled by the papers we reviewed, we created a dot plot, shown in Figure 5.1, from the results of our analysis. The aim was to quantify the extent to which these papers addressed the key aspects of SPADES-ML. The y-axis lists all the elements considered in SPADES-ML, while the x-axis shows how many papers fulfilled each element. The "Fully fulfilled" column shows how many papers received the maximum score for each criterion, the papers that received half points are shown in the "Partially fulfilled" column, and the total combines the two.

[Figure 5.1: Dot plot illustrating how many of the reviewed research papers fulfilled each criterion in SPADES-ML, with one panel each for the total, fully fulfilled, and partially fulfilled counts.]

Table 5.2: Summary of the normalised results

Paper ID   C1      C2      C3      C4      C5      Total (out of 5)
I          0.5     0       0       0       0       0.5
II         1       0.8     0       0.5     0.75    3.05
III        0.5     0.6     0       0.5     0       1.6
IV         1       1       1       1       0.5     4.5
V          1       0.8     0.5     0       0       2.3
VI         0.5     0.8     0       0.25    0.5     2.05
VII        1       0.8     0.5     0.25    0       2.55
VIII       0.5     0.8     0       0.5     0       1.8
IX         1       0.8     0       0.5     0.5     2.8
X          1       1       0.75    1       0.5     4.25
XI         0       0.2     0.375   0       0.5     1.075
XII        0.75    1       0       0.5     0.5     2.75
XIII       0.5     0.8     0       0.5     0.5     2.3
XIV        1       1       0.875   1       1       4.875
XV         0.5     0.2     0.75    0.5     0       1.95
XVI        0.5     0.8     0       0.5     0.5     2.3
XVII       1       0.8     0       0.5     0.5     2.8
XVIII      0.5     0.6     0       0.25    0.5     1.85
XIX        1       1       0       0       0       2
XX         0.5     0.4     0.875   0.5     0       2.275
XXI        1       0.8     0.125   0.5     1       3.425
XXII       0.5     0.6     0       0.5     0.5     2.1
XXIII      1       1       0.75    1       1       4.75
XXIV       0.75    0.2     0       0.25    0.5     1.7
XXV        1       0.6     0.625   1       0.5     3.725
XXVI       1       0.6     0       0.5     0.25    2.35
XXVII      0.25    0.8     0.75    0.5     0.5     2.8
XXVIII     0.5     0.6     0       0       0.5     1.6
C1 = Data selection and data suitability; C2 = Data quality requirements; C3 = FAIR; C4 = Preprocessing; C5 = Challenges during data processing.

The papers focused most on criteria such as the mention of data selection criteria and the provision of reasons for data suitability. All attributes were prioritised when considering data quality, but completeness, currentness, and consistency received the most attention. Most of the papers poorly address aspects of the FAIR principles: only 11 papers out of 26 provided raw data. Although the majority of papers provide details of the preprocessing stage and the challenges encountered, fewer papers provide reproducible preprocessing steps or solutions to the problems faced. Overall, the analysis revealed a significant shortfall in the provision of reproducible information to other researchers. These findings emphasise the need for clearer guidelines for ML-based environmental research.
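For transparency, the fulfilment counts behind the dot plot can be derived mechanically from the recorded scores. The sketch below (in Python/pandas, with illustrative scores for one sub-criterion and hypothetical paper labels) counts how many papers received full points and how many received half points:

import pandas as pd

# Illustrative scores for the sub-criterion "Mention of data selection
# criteria" (maximum 1 point) for five hypothetical papers.
scores = pd.Series([1.0, 1.0, 0.5, 0.0, 1.0],
                   index=["P1", "P2", "P3", "P4", "P5"])
max_points = 1.0

fully = int((scores == max_points).sum())
partially = int((scores == max_points / 2).sum())
print(f"Fully fulfilled: {fully}, partially fulfilled: {partially}, "
      f"total: {fully + partially}")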
Table 5.3: Comparison of the normalised and non-normalised approaches (papers sorted in ascending order of total score)

ID        Normalised total (out of 5 p.)   Not normalised total (out of 10.5 p.)
I         0.5        1
XI        1.075      2.25
XXVIII    1.6        3.5
III       1.6        3.5
XXIV      1.7        3.5
VIII      1.8        4
XVIII     1.85       4
XV        1.95       4
XIX       2          4.5
VI        2.05       4.5
XXII      2.1        4.5
XX        2.28       4.75
V         2.3        5
XVI       2.3        5
XXVI      2.35       5
VII       2.55       5.5
XIII      2.55       5.5
XII       2.75       6
IX        2.8        6
XVII      2.8        6
XXVII     2.8        6
II        3.05       6.5
XXI       3.43       7.25
XXV       3.73       7.75
X         4.25       9
IV        4.5        9.5
XXIII     4.75       10
XIV       4.88       10.25

Analysis based on published year

The research papers considered in the literature review were published between 2017 and 2024. In our analysis, we wanted to explore whether there is an association between the time of publication and the score in each category of SPADES-ML. To investigate whether the research papers have evolved over time, we analysed the scores with respect to their publication year. As 2022 has the highest number of published papers, we treated it as the middle point and formed year groups containing roughly the same number of papers as the year 2022. Figure 5.2 shows the distribution of scores for the total score and for the individual criteria (C1 to C5) of SPADES-ML across three publication-year groups: 2017-2021, 2022, and 2023-2024.

There is a clear upward trend in total scores for more recent publications: both the median and the mean have increased from the earlier years (2017-2021) to the most recent period (2023-2024), although the uncertainty is high for 2023-2024. The overall quality also appears high in 2022. These results suggest an improvement in publication quality over time.

FAIR principles (C3) scores have improved over time: the years 2017-2021 show low median and mean scores, which have since increased. This suggests that researchers have recently started to pay more attention to data accessibility and reproducibility. However, the mention of challenges and the provision of reproducible solutions (C5) shows no improvement over time, suggesting that, although other aspects have improved, researchers have not devoted much attention to preprocessing challenges.
[Figure 5.2: Score distribution based on published year. Box plots (a)-(f) show the total score and the C1-C5 score distributions per publication-year group, each with mean, median, and 90% and 95% confidence intervals. n(2017-2021): 7, n(2022): 14, n(2023-2024): 5. C1 = Data selection and data suitability; C2 = Data quality requirements; C3 = FAIR; C4 = Preprocessing; C5 = Challenges during data processing.]

Analysis based on environmental field

The papers were categorised by environmental field to observe any patterns. There were eight categories, as shown in Table 5.4. These environmental fields were initially taken from Table 2 (Appendix C, Table C.1) of Konya and Nematzadeh [27]. Furthermore, to maintain consistency, we refined the environmental fields identified in the literature review to align them with those used in the "Demographics" section of the survey, shown in Figure 5.7.

Table 5.4: Categorisation by environmental fields

Environmental field                       Paper ID
Air Quality and Atmospheric Science       VI, XIII, XXII
Water Quality Monitoring *                XXIV, XIX
Marine and Aquatic Science *              XIV
Energy Forecasting and Management         V, XXVIII, XXIII, XII, II, XXVI, XV, XX
Climate modelling                         VII, III, XXVII
Biodiversity & Ecosystem Monitoring *     IV, XVI
Hydrological Modeling                     XVIII, VIII, X
Waste management                          XXV, XXI, XVII, IX

Figure 5.3 shows the distribution of scores across the different environmental fields. As stated earlier, groups were considered for the statistical analysis only if they contained at least three papers. We therefore had to remove the categories "Water Quality Monitoring", "Marine and Aquatic Science", and "Biodiversity & Ecosystem Monitoring", as they contained two, one, and two papers, respectively.

The distribution of total scores in waste management shows higher median and mean scores, suggesting stronger overall quality. In contrast, air quality and atmospheric science shows the lowest mean and the least variability, indicating consistently low scores among these papers. Hydrological modelling shows more variability with a lower median, indicating that most papers score low, although a few papers scored high across all categories.
[Figure 5.3: Score distribution based on environmental field. Box plots (a)-(f) show the total score and the C1-C5 score distributions per field, each with mean, median, and 90% and 95% confidence intervals. n(Air): 3*, n(Energy): 8, n(Climate): 3*, n(Hydrological): 3*, n(Waste): 4*. C1 = Data selection and data suitability; C2 = Data quality requirements; C3 = FAIR; C4 = Preprocessing; C5 = Challenges during data processing. *Below the standard threshold of 5.]

Energy forecasting and management and waste management scored highly in the C1 category, data selection and suitability, which indicates the presence of clearly defined data selection criteria and proper arguments for data suitability. All environmental fields performed well in fulfilling the data quality attributes (C2), which highlights the researchers' focus on addressing these attributes.

Air quality and atmospheric science and waste management score low on the C3 criteria of the FAIR principles, which may be due to accessibility concerns relating to security. The higher variation in energy forecasting and management indicates greater uncertainty in that field. Most fields exhibit a similar scoring pattern for C4 (data preprocessing and reproducibility), while C5 (mention of challenges identified in preprocessing and provision of solutions) scores lower across all fields.
Analysis based on ML techniques used

Out of curiosity, we also categorised the papers based on the ML techniques used in the studies. The ML techniques used in each paper were identified from the literature review and then grouped into general ML categories. As shown in Table 5.5, most papers use several techniques belonging to different general categories. We considered two options: duplicating a paper across all relevant categories, or identifying a primary technique and sorting the paper into that category. Option one would increase the sample size for each ML category but might introduce confounding bias; if a primary technique could be identified for each study, option two would be preferable. After careful consideration, we selected the primary technique based on the algorithm with the best performance in each study. The selected primary technique of each study is marked in bold in Table 5.5. The "Other" category was used when the primary technique did not belong to any of the other categories.

Table 5.5: Categorisation by ML techniques used

ML Category                      ML Technique                                                   Paper ID
Neural Networks (NN)             SNN, ELM                                                       VI
                                 ANN                                                            XXVIII
                                 RNN+CNN                                                        X
                                 MLPR                                                           XIII
                                 LSTM                                                           XIX
                                 PGNN, NN, HPML                                                 XII
                                 RSML                                                           XXI
                                 SVM, ANN                                                       VIII
Support Vector Machines (SVM)    SVM, MLR, GBR, XGB                                             XXII
                                 SVM, MLP, ELM                                                  II
                                 SVM, ANN, KNN, MLR                                             VII
Image processing                 YOLOv5                                                         XIV
                                 YOLOv5, Computer vision                                        XVII
                                 YOLOF, SOLOv, Computer vision                                  IX
Ensemble Learning                GTB, SVM, Random Forest                                        V
Techniques (ELT)                 ET, DT                                                         XVI
                                 BRT, SVM                                                       IV
                                 Random Forest, SVM, ANN                                        XXIII
                                 Random Forest                                                  XXVII
                                 Random Forest, SVM, ANN                                        XVIII
                                 Conditional Random Forest, Univariate Non-linear Regression    XXV
                                 Random Forest                                                  III
                                 NN, XGBoost, LightGBM                                          XV
Other                            SVM, VMD, ELM, CEEMDAN, VMD-CEEMDAN-ELM                        XXIV
                                 ML-based regression model                                      XXVI
                                 SVM, XGBoost, MARS, RR                                         XX

Figure 5.4 shows the distribution of the total and criterion-wise (C1-C5) scores across the different ML-technique categories. Most ML techniques achieve relatively high C1 scores, indicating sound data selection and suitability arguments. Image processing has a high mean and median with minimal variability, reflecting clearly defined data selection criteria. The overall score for C2 is comparatively high, and the ML techniques in the "Other" category appear to address data quality requirements less effectively than the other four categories.

[Figure 5.4: Score distribution based on ML techniques used. Box plots (a)-(f) show the total score and the C1-C5 score distributions per ML category, each with mean, median, and 90% and 95% confidence intervals. n(NN): 8, n(SVM): 3*, n(Image processing): 3*, n(ELT): 9, n(Other): 3*. C1 = Data selection and data suitability; C2 = Data quality requirements; C3 = FAIR; C4 = Preprocessing; C5 = Challenges during data processing. *Below the standard threshold of 5.]

Analysis based on type of data format

During our literature review, we identified several common data formats used in the studies. The majority of the papers used time-series data, while many used a mix of time-series, text, numerical, and multimedia data; a few studies used only multimedia data. The categorisation by data format is shown in Table 5.6.

Figure 5.5 shows box plots of the total scores and the scores for the individual categories, grouped by data format. The total score distribution shows that the multimedia data format has the highest mean and median score, indicating better performance. This may be because the papers using multimedia data (e.g., in marine and aquatic science and waste management) properly address the criteria we defined in SPADES-ML. In C1, the time-series and mixed data formats performed similarly, while multimedia data showed more promising performance in this data suitability category, with a higher median. In C2, the median values of all three data formats are similar, but the variability is low for multimedia, indicating that the data quality attributes have been well addressed.
The median values are quite low in C3, with high variability across all data formats. Although the median is low for both time-series and multimedia data, the variability is lower for multimedia data, suggesting more consistent scores on the FAIR criteria. The median values in C4 (preprocessing) are similar, but multimedia data performs better than the other two formats. The results for C5 are similar across formats, with the mixed data format showing more variability and therefore more uncertainty.

Table 5.6: Categorisation by data format

Data format    Paper ID
Time-Series    VI, XXVIII, X, XXII, XX, II, XXIV, XXVI, XIII, XIX, XXIII, XXVII, XII, VII
Mixed          III, XXI, XV, V, XVI, IV, VIII, XVIII, XXV
Multimedia     XIV, XVII, IX
[Figure 5.5: Score distribution based on data format. Box plots (a)-(f) show the total score and the C1-C5 score distributions per data format, each with mean, median, and 90% and 95% confidence intervals. n(Time-Series): 14, n(Mixed): 9, n(Multimedia): 3*. C1 = Data selection and data suitability; C2 = Data quality requirements; C3 = FAIR; C4 = Preprocessing; C5 = Challenges during data processing. *Below the standard threshold of 5.]

Analysis based on the number of data sources

The majority of studies used a single data source. A few others used multiple data sources, including public portals and private data collected from various organisations. The categorisation is shown in Table 5.7.

Table 5.7: Categorisation by the number of data sources

Number of data sources   Paper ID
Single                   VI, XXII, XXIV, XIX, XIV, V, XXIII, XII, II, XXVI, VII, XX, XXVIII, VIII, XXV, XVII, IX, X
Multiple                 XV, XXVII, IV, XVI, XVIII, XIII, XXI, III

As Figure 5.6 shows, the distribution of total scores is quite similar for single and multiple data sources, with only a slight advantage for single data sources. However, for the C1 score there is a considerable difference: single data sources have higher median and mean values, indicating good performance, but with large variability, indicating uncertainty. Papers with multiple data sources scored lower on data selection and suitability because several sources had to be validated, and some did not include the required arguments; validating a single source was more straightforward. In C2, the medians of both groups are quite similar, but the uncertainty for single data sources is high. C3 has a low median for both groups, with high variability; multiple sources show more variability than single sources, indicating higher uncertainty. Variability is low for C4 among papers with multiple data sources, where most papers received partial scores. For C5, the variability is also low among papers with multiple data sources, but the uncertainty is higher compared with single data sources.
[Figure 5.6: Score distribution based on number of data sources. Box plots (a)-(f) show the total score and the C1-C5 score distributions for single and multiple data sources, each with mean, median, and 90% and 95% confidence intervals. n(Single): 18, n(Multiple): 8. C1 = Data selection and data suitability; C2 = Data quality requirements; C3 = FAIR; C4 = Preprocessing; C5 = Challenges during data processing.]

5.2 Analysis of Survey responses

As the survey consisted of both closed- and open-ended questions, the survey data were analysed using both quantitative and qualitative methods. Not all of the 23 respondents answered the open-ended questions, as these were added as follow-up questions to a series of closed-ended questions targeting the SPADES-ML categories. As we asked participants to answer the questions from the perspective of three different academic roles (reader, author, and reviewer), a few participants also chose not to answer some closed-ended questions that were not relevant to them.

Survey respondent demographics

This subsection presents the demographic information we collected from the survey respondents. It helped us to identify their professional diversity, their contribution to ML-based environmental research, and how they usually obtain data for their research.

[Figure 5.7: Distribution of environmental fields of the respondents.]

Figure 5.7 shows the variety of experience of the respondents across different environmental fields. We used eight environmental fields for categorisation; the remaining three fields (urban water management, agriculture, and petroleum modelling) were provided by the respondents themselves.

Figure 5.8 illustrates the distribution of years of experience the participants have as authors of environmental research. All participants have worked for more than one year, and the majority have more than six years of experience. Another aspect we wanted to investigate was whether these authors regularly review research papers for journals or conferences. Figure 5.9 shows the distribution: a significant number of participants, 19 in total, regularly review papers.

[Figure 5.8: Distribution of the years of experience as an author in environmental research.]
[Figure 5.9: Distribution of paper review frequency.]

Of these 19 participants, 8 have more than 10 years of reviewing experience; the distribution can be found in Figure 5.10.

[Figure 5.10: Distribution of the years of experience as a reviewer.]

These demographic questions were designed to highlight the participants' experience of, and active involvement in, authoring and reviewing environmental research. We then examined how participants access the data they use for their research. The main objective was to understand whether they primarily produce their own data, as this would minimise communication issues with external data producers and ensure a thorough understanding of the requirements and capabilities of their data sources. Out of 23 participants, 17 identified as data producers. Figure 5.11 presents the distribution of these responses.

[Figure 5.11: Distribution of data-producing experience.]

Figure 5.12 analyses this aspect further by distinguishing between participants who primarily generate their own data and those who mostly rely on data produced by external parties. The majority of participants indicated that they regularly use both self-produced and externally produced data, suggesting that they explore all possibilities for obtaining high-quality data.

[Figure 5.12: Distribution of the origin of data.]

Analysis of survey results - closed-ended questions

Quantitative analysis was performed on the responses to the survey's closed-ended questions. The Likert-scale questions designed around SPADES-ML were analysed according to the different roles we defined. The 23 responses were analysed based on the importance ratings provided by the respondents, and the mean value of each criterion was calculated to obtain an overall value. Figure 5.13 shows a grouped bar chart of the mean importance of each criterion per role.

The importance of mentioning data selection and suitability criteria was highlighted in the responses. Data accuracy, consistency, and credibility were identified as the main data quality requirements prioritised across all three roles. The author role gave higher scores than the reader and reviewer roles across several criteria, such as the importance of metadata, using common file formats, providing a replication package, and providing reproducible preprocessing steps. This may be due to authors' awareness of, and responsibility for, managing the data and reporting the details of the research carried out. The reviewer role tends to show slightly lower interest in some areas than the other roles: access to metadata and raw data, and details of reproducible solutions to the challenges encountered during the research, are criteria that reviewers rated comparatively low.
Upon closer inspection, we concluded that visualising Likert-scale data with mean scores alone would not be an effective approach in our scenario: important information and insights would be lost, and some findings could be misinterpreted.

[Figure 5.13: Grouped bar chart of the mean importance ratings of the evaluation criteria by role.]

Instead, we chose a visualisation that shows the actual ratings for each criterion, allowing us to see their relative importance. A diverging stacked bar chart is used, with "important" responses diverging to the right and "unimportant" responses diverging to the left; as there is no neutral value, no middle ground is visualised. We visualised the response data for each role and for all roles combined. For this visualisation, we considered only the set of questions common to all three roles, which were repeated per role.

Figure 5.14 shows the importance of each criterion from the reader's perspective. Readers mainly focused their attention on the following:
• Mention of data selection criteria
• The need for accurate data
• The importance of providing metadata
• Long-term availability of research data, to ensure that it can be accessed and reused in future research
• Details about the preprocessing steps, including the reasons for them

[Figure 5.14: Importance rating as a reader. Diverging stacked bar chart over all criteria, with the response options "Not important at all", "Somewhat unimportant", "Somewhat important", and "Very important".]

In the readers' section, we included a question specific to the readers' views, to gauge the importance of making self-collected data publicly available to increase credibility. Many researchers agreed with this approach, either fully or partially.
However, only one researcher thought this was somewhat unimportant for the credibility of the data. Figure 5.15 shows the visualisation of this result.

[Figure 5.15: Importance of making self-collected data publicly available to increase the credibility of the data.]

Figure 5.16 shows the importance of each criterion from the author's perspective. Authors focused their attention mainly on the data quality attributes of accuracy, currentness, consistency, and credibility. They also fully agree on the importance of providing reasons for the suitability of the data used in the research, as well as providing access to the raw data, and they strongly favour providing a replication package to ensure reproducibility and providing reproducible preprocessing steps. They slightly disagreed on the importance of using commonly supported data/file formats and of data completeness.

[Figure 5.16: Importance rating as an author. Diverging stacked bar chart over all criteria, with the response options "Not important at all", "Somewhat unimportant", "Somewhat important", and "Very important".]

We also included a specific question in the authors' section to gauge the importance of being able to communicate with data producers before retrieving data. Such communication allows authors to gain useful insights into the fields necessary for their work and to check whether the data matches their requirements. Twenty researchers identified this as useful, and only three replied that it was somewhat unimportant. Figure 5.17 shows the visualisation of this result.

[Figure 5.17: Importance of communicating with data producers before retrieving data.]

From a reviewer's perspective, the emphasis is on data quality attributes similar to those of the authors, but reviewers mainly consider data accuracy, currentness, completeness, and credibility. As they need to assess the relevance and temporal resolution of the data when reviewing papers, these attributes are prioritised. They also emphasised the importance of metadata. Figure 5.18 displays the distribution of importance ratings assigned by the participants when considering their role as reviewers.
[Figure 5.18: Importance rating as a reviewer. Diverging stacked bar chart over all criteria, with the response options "Not important at all", "Somewhat unimportant", "Somewhat important", and "Very important".]

[Figure 5.19: Importance rating considering all roles. Diverging stacked bar chart over all criteria and all responses, regardless of role.]

Figure 5.19 shows the responses from all participants, regardless of their role, providing a general view of the aspects in SPADES-ML. The highest importance is given to data accuracy, which every respondent considers important. Data credibility and the importance of metadata were also prioritised; only one person disagreed about the importance of these aspects. Providing reproducible solutions to processing challenges, and highlighting these challenges, was still seen as less of a priority than the other criteria.

Analysis of survey results - open-ended questions

Thematic analysis of the responses to the open-ended questions revealed six main themes related to the categories defined in SPADES-ML. These themes reflect the priorities of environmental researchers and the challenges they encounter when processing environmental science data.
Analysis of survey results - open-ended questions

Thematic analysis of the responses to the open-ended questions revealed six main themes related to the categories defined in SPADES-ML. These themes reflect the priorities of environmental researchers and the challenges they encounter when processing environmental science data. The following themes were derived from participant responses, along with supporting evidence. Only the most important quotes are listed here; all quotes relating to each theme can be found in Appendix B.

Theme 1: Transparency and justification of data selection

This theme emphasises the importance of researchers justifying their data selection criteria, particularly with regard to suitability, to ensure relevance and quality. The risk of introducing bias was also raised, underlining the importance of transparency and justification in data selection.

Reader perspective
R3: "Very important to select/use data that is acceptable and reliable since some data may misguide the prediction tool"
R9: "Data availability is often the only factor for selecting a data source due to limited availability."

Author perspective
A3: "Important to ensure model accuracy, improves model interpretability, supports reproducibility"
A6: "Sometimes, data availability is the problem, so we don't have the luxury of choosing"

Reviewer perspective
V3: "Clear data selection criteria, relevant to the problem being solved. Important to check target population, format, quality, and completeness."
V7: "As a reviewer, I'm always looking for an answer to 'why'. If data selection is anything but random, I look for possible sources of biases and what was done to mitigate them or how the authors acknowledge it later on."

From the reader's perspective, researchers should focus on providing clarity regarding the origin of the dataset and offering a proper explanation in order to build trust. Since the accuracy of the model depends on the suitability and relevance of the data, readers consider it important to have details of the data selection process and an assessment of the data's suitability. From both the reader's and the author's perspectives, it was highlighted that some researchers may have to prioritise availability when selecting a data source due to limitations in their field of research. From a reviewer's perspective, the reason for selecting the dataset is often investigated to ensure that no bias has occurred and that the reasoning is justified.

Theme 2: Data quality and reliability

The importance of data quality requirements was emphasised in this theme, with participants focusing on data trustworthiness. They emphasised that achieving quality attributes such as accuracy and credibility is essential, since poor-quality data can negatively impact modelling outcomes.

Reader perspective
R6: "The data should be recorded by a credible source, the uncertainty of any type while collecting the data should be provided. Removing outliers should be done by experts not automated. Consistent measurement is crucial"
R9: "Data accuracy and consistency are the most important factors, especially when using ML, as gaps can be filled, but bad data makes this tough to do."

Author perspective
A3: "High-quality data leads to more accurate predictions and saves time and resources. Very important to check some data that has incomplete information, which may lead to inaccurate models."
A6: "Bad data, bad results."

Reviewer perspective
V3: "Important to see high-quality data, such as accuracy, completeness, consistency"

Data quality remains a common concern for all roles, especially since negligent handling of it directly affects the results.
The main quality attributes highlighted are accuracy, completeness, consistency, and credibility. However, in some fields, such as waste management, data from even the most credible sources can be unreliable. Readers highlight the importance of clear definitions and standardisation, along with the need for a credible source. Authors focus on the practical implications and consequences that arise when data quality is not addressed properly.

Theme 3: Data accessibility and constraints

This theme highlights the importance of data access and discusses practical accessibility challenges. In order to verify research findings or use research data as the basis for new research, other researchers need to be able to access the data and information about it.

Reader perspective
R3: "Many real-time application does not have data accessibility"
R6: "It's one of the most important steps for the credibility of the research. For the research to be conducted correctly, the same results should be obtained by any researcher who is following the same steps as in the paper, thus without data, that cannot be confirmed"

Author perspective
A3: "Easy access to data allows to quickly explore, preprocess, and train models without delays"
A6: "Data is the only way to verify the creditability of the research"

Reviewer perspective
V5: "Usually, I have no time to review a paper to the level of analysis replication"
V7: "This is quite an idealistic perspective. I don't personally believe that all data needs to be accessible at the time of review, but it is needed for reproducibility. I deem access very important, only to ensure reproducibility."

From a reader's perspective, access to raw or processed data is important, as readers need to be able to validate the results before using them in their own research. Without access to the data, nothing can be confirmed or reproduced. However, as mentioned, data accessibility is often not guaranteed in real-time applications. To ensure the credibility of their research, authors tend to make their data accessible. The reviewers emphasised that they do not have enough time to replicate the findings within the limited time available for reviewing a paper.

Theme 4: Transparency in pre-processing

Transparent documentation of the preprocessing steps was considered essential. Across all roles, the importance of clear, detailed and reproducible preprocessing steps for ensuring data quality and model accuracy was emphasised.

Reader perspective
R6: "Sometimes, data preprocessing is more important than the modeling step itself. It is crucial to explain the process"
R9: "Its important to provide these reproducible steps / code, but I find contacting the authors and starting a dialogue is often the best way to truly understand what they've done."

Author perspective
A3: "Very important in cleaning, transforming, and organizing raw data into a format that can be used by ML models"
A9: "Much of the data in waste management is poor and we need to provide justification as to why these incorrect data points should be removed from the analysis"

Reviewer perspective
V3: "Important to make sure the raw data is clean and usable format"

The readers highlight the critical role that transparency plays in data preprocessing, considering it to be more important than the modelling itself. They emphasise that preprocessing is the most important part and that, without it, the entire objective would be missed.
Readers also value direct communication with the authors in order to gain deeper insights and understand the research properly. The authors' perspective provides practical guidance on preparing data for ML models, emphasising its importance for achieving accurate results. They highlight the importance of justifying preprocessing steps to give readers a clear understanding of the reasons behind them. The reviewers focus on ensuring that the data used for modelling is in a usable condition.

Theme 5: Challenges in data processing

This theme highlights that researchers face challenges in the preprocessing phase, arising from various causes such as noise, missing values, and difficulties in integrating data.

Reader perspective
R3: "Incomplete datasets with missing values can skew the model's understanding, data might contain errors, outliers, or irrelevant information, ML models can't directly handle non-numeric data"

Author perspective
A7: "There are way too many challenges to be document everything. It could be useful to include, maybe as comments in code, but not sure how else one would include these."

Reviewer perspective
V3: "Some issues should be mentioned e.g., missing or inconsistent data, noise, high dimensionality, class imbalance, and difficulties in integrating data from multiple sources, etc."

The focus in all three roles is on the challenges encountered in preprocessing, such as missing values, outliers, and incorrect data. Although the authors recognised that including every detail in a paper would not be practical, they emphasised the importance of providing information on how these issues were resolved. The recurring challenges they face underline the need for a proper data pipeline in ML-based environmental research.

Theme 6: Institutional standards and practices

This theme highlights how adherence to formal regulations and committees shapes data practices and the measures taken to ensure data quality.

Reader perspective
R1: "Remember that 'Monitoring data' always need to meet criteria defined by the agencies / EU. Therefore the selection of monitoring data is not a big issue because they already (mostly) meet quality criteria."
R5: "Embargo timing could be needed to protect ongoing publishing actions and MSc-PhD Theses"

Author perspective
A4: "Scientific evaluation committees must take into consideration the value of publishing datasets. Otherwise, the scientific community won't publish them."

This theme shows that readers can rely on trustworthy validation and need not worry about certain aspects, as these are automatically fulfilled by adhering to the standards. Readers are aware that some research data will be made available with a delay in order to protect future interests. The authors emphasised the importance of publishing datasets and of reaching an agreement on this within the scientific community.

5.3 Findings on challenges and recommendations

This section presents the challenges identified during the study and initial recommendations that could mitigate those challenges. Table 5.8 provides an overview of the challenges we identified in the data selection and preparation stages of ML-based environmental research. The "Source" column indicates the step at which the challenge was identified. The "Severity" column indicates how severe each identified challenge is, based on the findings gathered from the SPADES-ML analysis. If a challenge was barely addressed in the analysed research papers, meaning in fewer than 10 papers, we classified it as high severity. If it was addressed in more than 20 papers, we classified it as low severity. Anything in between falls under medium severity. The "Notes" column explains the challenge.
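This classification rule can be stated compactly as code. The following is a minimal sketch with the thresholds taken directly from the rule above; the function name is ours and purely illustrative.

```python
# Minimal sketch of the severity rule: a challenge addressed in fewer than
# 10 of the analysed papers is High severity, in more than 20 papers Low,
# and anything in between Medium.
def challenge_severity(papers_addressing: int) -> str:
    if papers_addressing < 10:
        return "High"
    if papers_addressing > 20:
        return "Low"
    return "Medium"

assert challenge_severity(5) == "High"
assert challenge_severity(15) == "Medium"
assert challenge_severity(25) == "Low"
```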
Table 5.8: Challenges identified in data selection and preparation for the application of ML in environmental disciplines

A1: Unclear handling of missing/null data (Source: Literature review / Survey; Severity: Medium)
Notes: Author 9: "Much of the data in waste management is poor and we need to provide justification as to why these incorrect data points should be removed from the analysis"

A2: Lack of clarity in handling noisy data (Source: Literature review / Survey; Severity: Medium)
Notes: "High-frequency fluctuations in time-series data (e.g., meteorological and pollutant concentration data) could introduce noise." [17]; Author 3: "Some data might have incomplete information, irrelevant, incorrect, or random data points, or multiple records of the same data"

A3: Unclear handling of high dimensionality and redundancy (Source: Literature review; Severity: Medium)
Notes: "Dataset having a large number of input variables (plants and meteorological data), leading to high dimensionality and potential redundancy due to correlations among variables" [14]

A4: Lack of clarity in handling datasets that use different scales and metrics (Source: Literature review; Severity: Medium)

B: Difficulty locating metadata about data and documentation (Source: SPADES-ML; Severity: High)
Notes: Many datasets lacked sufficient metadata, making it difficult for researchers to understand important details about the data.

C: Lack of details for replication and future use (Source: SPADES-ML; Severity: High)
Notes: Many research papers do not provide all the information needed to replicate the results. Access to raw data and scripts with all the necessary code, including documentation, is needed.

D: Difficulty locating data used when it is available on public portals (Source: SPADES-ML; Severity: Low)
Notes: Sometimes, only the names of the public portals from which data were acquired were mentioned, without specific details.

E: Difficulty locating proper details about the data pipeline (Source: SPADES-ML; Severity: Medium)
Notes: The data pipeline process was rarely described in full, including all the necessary steps and transformations.

F: Unclarity of data quality requirements (Source: SPADES-ML; Severity: Medium)
Notes: Details about data quality were often not mentioned, while attention focused on model performance.

G: Lack of details to ensure data suitability when multiple data sources were used (Source: SPADES-ML; Severity: Low)
Notes: Sometimes, when data were collected from multiple sources, the same level of detail was not given for data suitability.

H: Ensuring credibility of a dataset (Source: SPADES-ML / Survey; Severity: Medium)
Notes: Reader 6: "Data should be recorded by a credible source, the uncertainty of any type while collecting data should be provided"

I: Restricted access to data in real-time applications (Source: Survey; Severity: Medium)
Notes: Special permission may be needed to access some data, and such restrictions often demotivate researchers. Reader 3: "Many real-time applications do not have data accessibility"
J: The most credible data in the worst quality (Source: Survey; Severity: Low)
Notes: Reader 9: "In waste management, often the most credible data (eg., from national governments) is the worst quality as they have no means to measure it at a national level. Data accuracy and consistency are the most important factors, especially when using ML, as gaps can be filled, but bad data makes this tough to do"

K: Limited data availability in some environmental fields (Source: Survey; Severity: Medium)
Notes: Reader 9: "Data availability is often the only factor for selecting a data source due to limited availability."; Author 6: "Sometimes, data availability is the problem, so we don't have the luxury of choosing"

L: Dirty data: noisy, missing, or invalid entries (Source: Literature review / Survey; Severity: High)
Notes: Reader 3: "Incomplete datasets with missing values can skew the model's understanding, data might contain errors, outliers, or irrelevant information, ML models can't directly handle non-numeric data"; Author 3: "Very important to check some data that has incomplete information, which may lead to inaccurate models. Can be overcome by removal, or using algorithms that handle missing values"

We have provided, in Table 5.9, some initial recommendations for possible methods and practices to tackle the challenges identified above. These recommendations are based on four sources: good practices identified in our analysis of papers using SPADES-ML (L), analysis results from the survey (S), the work of Munappy et al. (M) [53], [54], [4], [41], [42], [40], and other literature (O). The guidelines are discussed following the structure of SPADES-ML, and the "Code" column references the corresponding SPADES-ML criterion.

All of the proposed recommendations are valuable in addressing one or more of the identified challenges, but they differ in the effort required for implementation. The "Effort" column was therefore roughly determined based on the time required to complete the task. Recommendations that introduce new practices or tools unfamiliar to the environmental research domain are categorised as "High" effort, as more time will be required for training and onboarding. "Low" effort was allocated to recommendations that could be implemented within a few hours. Anything requiring more time was allocated "Medium" effort. The "Reference to challenge" column lists the challenges that are partially or fully addressed by each recommendation.

Table 5.9: Guidelines for future ML-enabled research (format per row: Code | Guideline | Examples in environmental science | Effort | Reference to challenge | Source)

C1: Data selection & data suitability
1.1 | Provide a clear study scope, i.e. the reason for selecting the data source(s) | Location, depth | Low | G (partially) | L, S, M [40]
1.1 | Provide justification when particular data selections are made | Time range, frequency | Low | G (partially) | L, S
1.1 | Define clear criteria for ground truth data sampling | Manually verified or directly observed data | Medium | G (partially) | L
1.1 | Mention any thresholds or criteria that were used to include/exclude samples | | Low | G (partially) | L
1.1 | Manage proper communication with data collectors when applicable | | Medium | A4 (partially) | S, O [44], M [40]
1.2 | Clearly state the practical or scientific question the data are intended to address | | Low | G (partially) | L
1.2 | Explain how the data represents a diversity of conditions | Different locations, lightings | Medium | G (partially) | L
1.2 | Justify the relevance of the dataset to the specific case | | Low | G (partially) | L
1.2 | Mention the data source and its credibility | | Low | H (fully) | L, S
1.2 | Ensure measurements and variables capture relevant details | | Medium | G (partially) | L

C2: Data quality requirements
2.1 | Clearly describe methods used to identify and remove erroneous or unlikely values | | Medium | F (partially) | L
2.1 | Use expert-verified annotations where possible | | Medium | F (partially) | L
2.2 | Consider multiple states/conditions to ensure coverage | Sites, periods | Medium | F (partially) | L
2.2 | State the proportion of missing data and how it was handled | Deletion, imputation | Medium | F (partially) | L, S, M [4]
2.2 | Justify if certain variables were excluded and explain their absence | | Medium | F (partially) | L
2.3 | Use reputable data sources where possible | | Medium | F (partially) | L, S
2.3 | Make the data openly accessible when possible | | Medium | F (partially) | L, S, O [44]
2.3 | Describe data collection instruments | Sensors | Low | F (partially) | L
2.4 | Clearly define the data collection period | | Low | F (partially) | L
2.4 | Justify how the data's age matches the research objective and clearly indicate if parts are outdated | Seasonal tracking | Low | F (partially) | L
2.4 | Mention any temporal gaps and their relevance to the study | | Low | F (partially) | L
2.5 | Maintain uniform settings, annotation protocols, formatting and categorisation standards throughout the dataset | Sensor configurations | Medium | F (partially) | L
2.5 | Describe how inconsistencies (e.g., contradictory values from different sources) were handled | | Medium | F (partially) | L
2.1/2.2/2.5 | Data Linter: a new class of tools that automatically inspects ML datasets to identify potential issues in the data (see the sketch after this table) | | High | J (partially) | M [41]

C3: FAIR
3.1 | Provide structured metadata describing variables, timeframes, equipment, and locations | | High | B (partially) | L, S, M [41]
3.1 | Link datasets to persistent identifiers (DOIs, ORCIDs) | | Medium | C (partially) | L
3.1 | Link metadata in the paper or in supplementary material | | Low | B (partially) | L
3.2.1 | Clearly state where raw data can be accessed (include DOI, URL, or API) | | Low | D (fully) | L
3.2.2 | Share processed outputs | | Medium | C (partially) | L
3.2 | Clearly state access conditions (open, restricted, institutional) | | Low | J (partially) | L
3.3 | Use open, non-proprietary file formats (CSV, JSON) | | Low | C (partially) | L, S
3.4.1 | Use institutional or domain-specific repositories that ensure long-term access | | Low | C (partially) | L
3.4.1 | Use version-controlled platforms (e.g., GitHub) | | Medium | C (partially) | L
3.4.2 | Provide a replication package including data, scripts, models, and instructions to reproduce | | High | C (partially) | L, S

C4: Data Preprocessing
4.1 | Data cleaning and scrubbing: a procedure to modify or remove incomplete, incorrect, inaccurately formatted, or repeated data in a dataset, and data imputation to solve the missing data problem | | High | L (fully) | L, S, M [41]
4.1 | Clearly document all preprocessing steps, including removed data (nulls, duplicates, outliers, temporal exclusions), transformed or filtered data (normalisation, scaling), and feature engineering techniques (derived or composite variables) | | Medium | A1, A2, A3 (fully); A4, E (partially) | L, S, M [41]
4.1 | Describe why each step was necessary (according to the domain knowledge) | | Low | C (partially) | L, S
4.2 | Provide open access to source code or notebooks, annotation tools and formats, and step-by-step documentation to replicate results | | Medium | C, E (partially) | L, S
4.2 | Include enough detail to rerun the pipeline from raw input to final output | | Medium | E (partially) | L
4.2 | Include software versions, libraries, and environment setup | | Low | C (partially) | L

C5: Challenges during data processing
5.1 | Clearly state technical or relevant challenges during preprocessing or analysis | Poor lighting, noisy data, hardware limits, missing values, misclassifications, anomalies, sensor errors, data gaps, misalignment in multi-source datasets (e.g., temporal) | Medium | C, E (partially) | L, S
5.1 | Describe how issues were detected (e.g., thresholds, visual inspection, statistical tests) | | Medium | C, E (partially) | L
5.2 | Document how each challenge was solved, including code snippets or links to GitHub, threshold adjustments or parameter tuning, algorithmic workarounds, the use of external tools (e.g., Colab for scaling up processing), and justifications for exclusions or imputation | | Medium | E (partially) | L, S
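To make the Data Linter guideline (rows 2.1/2.2/2.5 above) concrete, the sketch below shows the kind of automated checks such a tool performs. This is not the tool discussed by Munappy et al. [41]; it is a minimal illustration, and the column names and valid ranges are hypothetical.

```python
# Minimal illustration of automated dataset inspection in the spirit of the
# "Data Linter" guideline. Not the tool from Munappy et al. [41]; column
# names and valid ranges below are hypothetical.
import pandas as pd

def lint_dataset(df: pd.DataFrame, valid_ranges: dict) -> list:
    """Return human-readable warnings about potential data issues."""
    warnings = []
    # Guideline 2.2: report the proportion of missing data per column.
    for col, frac in df.isna().mean().items():
        if frac > 0:
            warnings.append(f"{col}: {frac:.1%} missing values")
    # Guideline 2.5: duplicated records suggest inconsistent merging of sources.
    n_dups = int(df.duplicated().sum())
    if n_dups > 0:
        warnings.append(f"{n_dups} duplicated rows")
    # Guideline 2.1: flag values outside physically plausible ranges.
    for col, (lo, hi) in valid_ranges.items():
        vals = df[col].dropna()
        n_bad = int(((vals < lo) | (vals > hi)).sum())
        if n_bad > 0:
            warnings.append(f"{col}: {n_bad} values outside [{lo}, {hi}]")
    return warnings

# Hypothetical usage on a tiny air quality sample:
sample = pd.DataFrame({"temperature_c": [12.1, None, 95.0],
                       "pm25": [8.0, 9.5, 8.0]})
print(lint_dataset(sample, {"temperature_c": (-60, 60), "pm25": (0, 500)}))
```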
6 Discussion

In this chapter, we discuss the results of our research and address our research questions. We take a closer look at the challenges and recommendations. This chapter also presents a discussion of the contribution of this study to SE. Lastly, we address threats to validity and future work.

6.1 RQ1: How are data selection and data preparation done when applying ML in environmental research?

Based on the findings of the analysis through SPADES-ML and the survey analysis, we concluded that aspects of data quality and management are significantly underreported in ML-based applications in environmental research. Papers primarily focus on providing details about ML algorithms and performance metrics, while neglecting to address data quality and preprocessing. However, for a model to perform well, high-quality data must be acquired, and the quality of the data must be ensured through appropriate processing. Keeping this in mind, we analysed how data were selected and prepared in ML-based environmental research applications.

Data selection

Most papers collected their data from public datasets, pre-existing scientific databases, or other reputable third parties. Researchers also tend to collect their own data to ensure reliability. Most researchers use a combination of data produced by themselves and by other researchers, while some strictly use data produced by themselves. Since data accuracy is essential, finding reliable sources is important for ensuring an accurate model. Approximately 85% of the papers in our literature review mentioned their data selection criteria, which is positive because it gives the reader a good idea of the argumentation for why a dataset was used.

When it came to justifying why a selected dataset is applicable to the given research, around 80% of the papers managed to do so, either fully or partially. While most of the papers provide descriptive reasoning, some applied mathematical or statistical justification to validate the reasoning. Environmental researchers mainly focused on spatial and temporal resolution and sensor reliability when selecting data. We would like to acknowledge that some authors highlighted known biases in certain datasets and accounted for these in their analysis.

Managing data quality

The papers contained few references to the ISO/IEC 25012 data quality attributes [19], despite the researchers' emphasis on this issue in their survey responses. Although data accuracy was highlighted as the main data quality attribute, only 57% of the papers emphasised the need to maintain data accuracy in relation to existing literature. While nearly all the papers focus on model accuracy, the manner in which data accuracy was handled is not properly documented. The reviewed papers mainly highlight the quality attributes of data completeness, currentness, and consistency. While Wang et al. [64] have proposed a multidimensional framework covering intrinsic data quality, and Pradhan et al. [49] provide a structured approach to managing data quality, our analysis found that approximately 69% of the studies address data quality. The survey analysis revealed that, regardless of their roles, all participants generally agree on the importance of data quality.

Based on the information we gathered, environmental researchers tend to eliminate anomalous data to ensure accuracy. This contrasts with the approach to ensuring completeness, in which researchers focus on covering large timeframes and considering multiple variables when acquiring data. To maintain consistency, researchers standardise variables or sometimes perform cross-validation. They ensure credibility by sourcing data from trusted organisations. Many researchers focus on acquiring recent data because environmental conditions can change rapidly, and results depend on how recent the data are.

Data preprocessing

As shown in Munappy et al. [53], [41], ensuring the reproducibility and reliability of ML results begins with transparent and well-documented data pipelines. Although preprocessing steps such as data cleaning, normalisation, and handling missing values were mentioned in the studies we analysed, they often lacked sufficient detail to support reproducibility. Only 30% of the papers provided reproducible preprocessing steps, either by describing them in detail or by providing the codebase, while 70% of the studies mentioned preprocessing, though not in detail. This finding corroborates the observation by Munappy et al. that current ML studies in applied domains often lack the engineering standards necessary for tackling data-related challenges.

The survey findings, however, highlighted the importance of preprocessing and of mentioning preprocessing actions with reasoning. Providing reproducible preprocessing steps was not prioritised in the same way: 58% of respondents said that mentioning preprocessing steps is very important, while only 44% said that providing reproducible preprocessing steps is very important. Nevertheless, providing reproducible preprocessing steps was considered somewhat important, yet this is often not reflected in practice. The discrepancy between the perceived importance and the actual practice suggests the need for methodological changes across the field.
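Guidelines 4.1 and 4.2 in Table 5.9 target exactly this gap. The following is a minimal sketch of what a documented, rerunnable preprocessing step can look like; the variable names, thresholds and reasons are illustrative and not taken from any analysed paper.

```python
# Minimal sketch of guidelines 4.1/4.2: each preprocessing action is logged
# with its reason and effect, and the environment is recorded so the pipeline
# can be rerun from raw input to final output. All names and thresholds are
# illustrative.
import sys
import pandas as pd

steps: list = []

def log_step(action: str, reason: str, before: int, after: int) -> None:
    """Record what was done, why, and how many rows it affected."""
    steps.append(f"{action} ({reason}): {before} -> {after} rows")

def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    n0 = len(raw)
    df = raw.dropna(subset=["pm25"])                       # removal (nulls)
    log_step("drop rows with missing pm25", "target variable required", n0, len(df))
    n1 = len(df)
    df = df[df["pm25"].between(0, 500)].copy()             # outlier filtering
    log_step("drop pm25 outside [0, 500]", "sensor's valid range", n1, len(df))
    df["pm25_z"] = (df["pm25"] - df["pm25"].mean()) / df["pm25"].std()  # scaling
    log_step("z-score pm25", "comparable scale across sites", len(df), len(df))
    return df

clean = preprocess(pd.DataFrame({"pm25": [8.0, None, 12.0, 900.0]}))
print("\n".join(steps))
# Guideline 4.2: record software versions alongside the replication package.
print("python", sys.version.split()[0], "| pandas", pd.__version__)
```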
6.2 RQ2: What are the challenges of data selection and preparation for the application of ML in environmental research?

SPADES-ML was specifically designed to uncover hidden information relating to the challenges of selecting and preparing data for ML-based environmental research. It allowed us to systematically evaluate data pipeline practices across five key areas: data selection and suitability, data quality requirements, adherence to the FAIR principles, data preprocessing, and challenges encountered during preprocessing. The analysis enabled us to detect patterns that would have been difficult to identify through qualitative reading alone. With the help of the literature review and its analysis through SPADES-ML, we were able to gather both qualitative insights and quantitative comparisons across papers. Several challenges were identified through the literature review and the SPADES-ML analysis; these are listed in Table 5.8.

Furthermore, the analysis through SPADES-ML helped us systematically identify data-related methodological issues in ML-based environmental research. These issues include the lack of adherence to the FAIR principles and the minimal provision of information on reproducibility in preprocessing. We found that many papers do not provide access to raw datasets, metadata, or detailed documentation of data preprocessing steps. These issues have not previously been the focus of research, yet they pose a threat to the reusability and credibility of data.

The analysis performed using SPADES-ML was further validated by the survey, in which respondents confirmed the identified findings and flaws in relation to real-world scenarios. Many of the challenges identified through SPADES-ML were confirmed by researchers as genuine obstacles in ML-based environmental science. The approach can be applied in other ML-based domains without limiting its ability to thoroughly identify and evaluate data practices. It can serve as a checklist to help researchers justify their work in their research papers, enhancing the reusability and credibility of their work.

6.3 RQ3: What solutions are reported in SE that could mitigate the challenges of data selection and preparation for the application of ML in environmental research?

In RQ1, we explored current data quality and data management practices in ML-enabled environmental science research. Through RQ2, we identified the challenges one could face during this process. RQ1 and RQ2 were the main focus of our research. For RQ3, we made some initial recommendations to help mitigate the challenges identified in RQ2. These recommendations are presented in Table 5.9. The recommendations follow the SPADES-ML structure because analysing the papers with SPADES-ML made identification straightforward. Some recommendations were identified from the analysis of the survey, from literature in SE, and specifically from the work of Munappy et al. [53], [54], [4], [41], [42], [40]. Throughout Munappy's PhD thesis, many challenges in data pipelines were identified and thoroughly investigated in an industrial setting within the field of embedded systems. However, some of the solutions discussed in the aforementioned papers could not be mapped because they are not relevant to this domain.
The recommendations presented in Table 5.9 are intended as constructive guidance. While we acknowledge the need to validate them, we present these recommendations as a set of guidelines and a methodology pipeline for future environmental science and SE research that utilises ML.

The recommendations primarily focus on addressing the gaps identified through the SPADES-ML analysis and the survey. Researchers highlighted their main concerns in their survey feedback, concerns that are not properly addressed in research papers in practice. The identified gaps (low reproducibility, lack of data quality requirements) therefore mapped naturally onto the proposed SE approaches, and our recommendations focus on filling these gaps. These recommendations reflect concerns raised by environmental researchers and are grounded in widely validated SE methods. They are supported by our findings from the literature review, the SPADES-ML analysis and the survey analysis. Following these guidelines in ML-enabled research has the potential to improve data management practices and reduce the challenges encountered in this process.

6.4 Contribution to Research Software Engineering

Research Software Engineering (RSE) is an emerging discipline which bridges the gap between software engineering and academic research. RSE focuses on applying software engineering principles to research contexts. Research software engineers often work in research environments and address domain-specific challenges such as algorithmic efficiency, data-intensive workflows, and high-performance computing. This role has gained importance as software has become increasingly central across disciplines, from computational biology to climate modelling [8]. The discipline is critical for advancing scientific and technological innovation by creating efficient, reliable and sustainable software solutions tailored to researchers' needs.

Due to the increasing complexity and scale of research data and computational requirements, the importance of RSE has grown significantly in recent years. Research software engineers play a critical role in the design, implementation, and maintenance of research software tools, ensuring that the tools meet the high standards of performance, reliability, and usability required for research applications. Balancing the flexibility and adaptability of software solutions with the rigour and discipline of software engineering practices is a key challenge for RSE. Research projects often make it difficult to adhere to traditional software development processes because of rapidly changing requirements and the need to explore new methodologies. It is also important to validate and discuss the methodologies used in RSE in order to improve the reliability of research projects.

By considering ML-based environmental research as a case study, we explored how data quality and data management are handled and reported. We were able to provide structured guidelines for use in RSE. These guidelines can be used to evaluate the quality of data and the data pipeline across ML projects. The findings highlight the need for reproducibility and transparency for other researchers, as well as how these can be achieved. Our findings also suggest that, even when the code and models are well documented and available for public use, a lack of transparency in the data pipeline reduces opportunities for reproducibility.
Throughout the process, we identified a discrepancy between what was expected and what was practically available. With the challenges and recommendations identified, this work can act as a checklist and guideline for SE researchers to validate their approach when conducting research using ML. They can also identify areas for improvement and invest time in refining the data pipeline. Proper version control systems that ensure the long-term availability of data and the necessary code would encourage contributions from other researchers.

6.5 Threats to validity

This section discusses threats to validity, focusing on how the research questions were answered and how these threats could undermine the reliability and validity of the findings.

Internal validity

Internal validity concerns factors that can affect the causal relations examined in a study without the researchers' knowledge [56]. When conducting the literature review, we were concerned that the data acquired might depend on the reader's point of view. This could result in bias, which we wanted to eliminate from the outset. Therefore, we analysed the first six research papers and discussed them, taking notes as necessary. This was done to maintain a common approach. We agreed on the aspects to be checked and the extent to which they should be checked. To confirm our approach, each of us analysed one paper, which the other then reviewed. We also discussed our findings on a weekly basis. To validate our approach, we asked one of our supervisors to review three of the papers we had analysed.

When analysing our literature review findings using SPADES-ML, we aimed to minimise bias, as providing scores individually can be subjective. Therefore, we analysed all the findings from the research papers together and provided scores jointly. Regarding the design of the survey, the phrasing of the questions could have influenced the responses, since our focus was on validating SPADES-ML. To address this issue, the survey questions were reviewed and validated by two independent researchers, one of whom had an environmental science background. The survey was also pilot tested by the supervisor and refined based on the feedback received. We invited the authors of the papers referred to in our literature review to participate in the survey, as we required their expertise in order to evaluate SPADES-ML from a generic, study-independent perspective. The survey was conducted anonymously to reduce bias and to prevent us from mapping responses to the studies we had referred to. Also, when conducting the thematic analysis, to reduce bias and subjective judgement in the coding and interpretation of themes, we discussed each other's coding and iteratively checked and grouped the codes.

External validity

External validity mainly concerns whether the results can be generalised from the specific environment in which the study was conducted to other environments [56]. For our literature review, we limited our analysis to the papers presented by Konya and Nematzadeh [27], as well as two other papers identified through snowballing. This approach may have overlooked valuable insights presented in other relevant research papers. However, all of the articles were published after 2017, making them recent. Only 23 of the 251 researchers to whom we sent our survey responded. Nevertheless, we covered experts from a few environmental fields, exploring their knowledge and experience as readers, authors and reviewers.
Through forward snowballing, we identified many other environmental researchers, not just the corresponding and other involved authors of the selected research papers. We also sent our survey to environmental researchers who use ML-enabled techniques and who were found on university websites and other relevant websites. Thanks to this, we were able to analyse diverse environmental research data, mitigate limitations, and offer meaningful insights.

Construct validity

Construct validity concerns the generalisation of the experimental results to the concept or theory behind them [56]. To ensure an accurate interpretation of the concepts addressed in SPADES-ML and the survey, we provided the survey participants with definitions and relevant sources to prevent misinterpretation and to help them understand the ideas behind the concepts. We also used the ISO/IEC 25012 data quality framework [19] and the FAIR data principles in our analysis. These were supported by prior literature and by knowledge among the participants. We kept these aspects in mind when designing the survey, which helped us to reach a shared understanding of the areas that needed clarification. The survey consisted of five main sections, three of which contained repeated questions designed to capture the perspectives of readers, authors and reviewers. Every term was explained repeatedly in all sections, with extra resources provided where necessary. We ensured proper interpretation and understandability by reviewing and validating the survey with two independent researchers before sending it to the participants.

Conclusion validity

Conclusion validity refers to threats to the validity of conclusions derived from analyses. Sample size is one consideration here. Since we only referred to 26 papers for our literature review, there were situations during the case study in which we had to accept a minimum of three data points for statistical analysis and visualisation. Similarly, we received only 23 responses to our survey, despite sending it to 251 practitioners. Due to the small sample size, the number of usable data points was limited, which may affect reliability. For comparative purposes, we had to accept the minimum threshold of three data points, which is below the standard of five points required for box plots. To ensure transparency, we added warnings to the figures that fall below this threshold. We used matplotlib and numpy to generate the box plots; seaborn, pandas, and matplotlib were used to generate the dot plot; and the stacked bar chart was generated using pandas and matplotlib. As we only included open-ended follow-up questions to allow for further comments on the aspects we checked with closed-ended questions, the overall number of comments was low. Consequently, we were only able to gain a few insights from the thematic analysis, so we had to rely more on statistical analysis to interpret the results in detail. Nevertheless, we were able to draw some interesting conclusions from the details we gathered regarding the three roles.
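The warning convention mentioned above can be implemented directly in the plotting code. The following is a minimal sketch with matplotlib and numpy; the data values are made up and the annotation style is ours, not the thesis's actual plotting script.

```python
# Minimal sketch: draw box plots but flag any group with fewer than five
# data points, mirroring the transparency warnings described above. The
# data values are illustrative.
import numpy as np
import matplotlib.pyplot as plt

groups = {"Criterion A": np.array([0.7, 0.8, 0.9]),            # n = 3
          "Criterion B": np.array([0.5, 0.6, 0.6, 0.7, 0.9])}  # n = 5

fig, ax = plt.subplots()
ax.boxplot(list(groups.values()))
ax.set_xticks(range(1, len(groups) + 1), labels=list(groups))
for i, vals in enumerate(groups.values(), start=1):
    if len(vals) < 5:  # below the usual minimum for a meaningful box plot
        ax.annotate(f"warning: n={len(vals)} < 5", (i, float(vals.max())),
                    textcoords="offset points", xytext=(0, 8),
                    ha="center", color="red")
plt.show()
```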
To increase the reliability of the findings, we individually read and discussed the first six papers in order to find a common interpretation. After agreeing on a set of common terms, we continued to extract information from the other research papers. We also performed the analysis using SPADES-ML in pairs to increase reliability. This approach enabled us to reduce bias. Information extraction was carried out carefully to minimise miscalculations, so that anyone using the same methods to derive the results would ultimately reach the same conclusions. As we used a systematic approach to allocate points as either total or partial in SPADES-ML, the results can be reproduced.

6.6 Future work

Future work on this study will include validating and extending the proposed findings and recommendations. We have provided recommendations from an SE perspective, considering ML-based environmental research in terms of data selection, quality, and management. The proposed guidelines need to be validated in practical scenarios to check their acceptability in a real-world research context, and their effectiveness must be measured against how practical they are in actual use. We can collaborate with environmental researchers specialising in ML or SE to validate and analyse the guidelines. From this, we will be able to see what, in practice, prevents researchers from adhering to these recommendations from their point of view. They may require additional resources or training in certain areas, and through proper collaboration we will be able to identify what should change in real-world scenarios. We have adopted a generalised approach to providing recommendations that considers all environmental fields; these recommendations could be improved further by tailoring the findings to specific environmental fields.

7 Conclusion

This research aimed to investigate how data are selected and prepared for use in ML applications within environmental research, and to identify key challenges and potential solutions to improve data quality and management. Building on the literature review conducted by Konya and Nematzadeh [27], we adopted a data quality and management perspective and used SPADES-ML to analyse the findings, which were then validated through a survey involving environmental science experts. From our analysis of 26 research papers, as well as from the survey findings, we observed that current ML-based research in the environmental field neglects some critical aspects of data quality, preprocessing, and the provision of reproducible solutions. The survey highlighted that experts consider these aspects to be important, but when a paper is actually published, they are not properly reported. While most researchers describe data sources and mention quality attributes, they provide limited justification for preprocessing steps and the challenges encountered. Furthermore, only a small number of papers provided details regarding reproducibility and ensured that their findings aligned with the FAIR principles.

Through the survey, we were able to validate the importance of the criteria we defined in SPADES-ML, with input from environmental researchers who actively incorporate ML into their work. There was a clear consensus among them about the importance of justifiable data selection, maintaining data quality, adhering to the FAIR principles and providing reproducible preprocessing details. Some participants also highlighted data accessibility issues and other challenges relating to missing and incomplete data in their area of research. To address the challenges identified, we also examined the solutions proposed by Munappy et al. [54], [4], [30], [42], which allowed us to gain deeper insight into SE techniques for identifying proper data pipelines and to ground our recommendations in them.
We have proposed a set of actionable recommendations that take into account SE practices and the specific needs and concerns raised by environmental researchers. However, these recommendations need to be validated in real-world projects by assessing their impact on model performance and reproducibility. The recommendations can be refined through ongoing feedback from environmental researchers and validated as they are implemented. By incorporating good data practices into ML-enabled environmental research, we can develop more reliable solutions to address these challenges.

Declaration of Generative AI

Generative AI tools were used to check for grammatical errors and to polish the language. They were not used to generate content.

Bibliography

[1] Data quality and artificial intelligence – mitigating bias and error to protect fundamental rights. In Helping to make fundamental rights a reality for everyone in the European Union, 2019.
[2] Philip Achimugu, Ali Selamat, Roliana Ibrahim, and Mohd Naz'ri Mahrin. A systematic literature review of software requirements prioritization research. Inf. Softw. Technol., 56(6):568–585, June 2014.
[3] Cathaoir Agnew, Dishant Mewada, Eoin M. Grua, Ciarán Eising, Patrick Denny, Mark Heffernan, Ken Tierney, Pepijn Van de Ven, and Anthony Scanlan. Detecting the overfilled status of domestic and commercial bins using computer vision. Intelligent Systems with Applications, 18:200229, 2023.
[4] M Aiswarya Raj, Jan Bosch, Helena Holmström Olsson, and Anders Jansson. On the impact of ML use cases on industrial data pipelines. In 2021 28th Asia-Pacific Software Engineering Conference (APSEC), pages 463–472, 2021.
[5] Wael Almikaeel, Lea Čubanová, and Andrej Šoltész. Hydrological drought forecasting using machine learning: Gidra River case study. Water, 14(3), 2022.
[6] Wael Almikaeel, Andrej Šoltész, Lea Čubanová, and Dana Baroková. Hydroinformer: A deep learning model for accurate water level and flood predictions, 07 2024.
[7] Samuel da Costa Alves Basílio, Camila Martins Saporetti, Zaher Mundher Yaseen, and Leonardo Goliatt. Global horizontal irradiance modeling from environmental inputs using machine learning with automatic model selection. Environ. Dev., 44(100766):100766, December 2022.
[8] Susan M Baxter, Steven W Day, Jacquelyn S Fetrow, and Stephanie J Reisinger. Scientific software development is not an oxymoron. PLOS Computational Biology, 2(9):1–4, 09 2006.
[9] Federico Bonofiglio, Fabio C. De Leo, Connor Yee, Damianos Chatzievangelou, Jacopo Aguzzi, and Simone Marini. Machine learning applied to big data from marine cabled observatories: A case study of sablefish monitoring in the NE Pacific. Frontiers in Marine Science, 9, 2022.
[10] Virginia Braun and Victoria Clarke. Using thematic analysis in psychology. Qualitative Research in Psychology, 3:77–101, 01 2006.
[11] Juan Manuel Carrillo de Gea, Joaquín Nicolás Ros, José Fernández-Alemán, Ambrosio Toval, and Christof Ebert. Requirements engineering tools. Software, IEEE, 28:86–91, 09 2011.
[12] L. Cattaneo, A. Polenghi, M. Macchi, and V. Pesenti. On the role of data quality in AI-based prognostics and health management. IFAC-PapersOnLine, 55(19):61–66, 2022. 5th IFAC Workshop on Advanced Maintenance Engineering, Services and Technologies AMEST 2022.
[13] H Chojer, P T B S Branco, F G Martins, M C M Alvim-Ferraz, and S I V Sousa. Can data reliability of low-cost sensor devices for indoor air particulate matter monitoring be improved? – An approach using machine learning. Atmos. Environ. (1994), 286(119251):119251, October 2022.
[14] C Condemi, D Casillas-Pérez, L Mastroeni, S Jiménez-Fernández, and S Salcedo-Sanz. Hydro-power production capacity prediction based on machine learning regression techniques. Knowl. Based Syst., 222(107012):107012, June 2021.
[15] Saverio De Vito, Girolamo Di Francia, Elena Esposito, Sergio Ferlito, Fabrizio Formisano, and Ettore Massera. Adaptive machine learning strategies for network calibration of IoT smart air quality monitoring devices. Pattern Recognit. Lett., 136:264–271, August 2020.
[16] Steve Easterbrook. Climate change: A grand software challenge. pages 99–104, 11 2010.
[17] Alexandre Fabregat, Anton Vernet, Marc Vernet, Lluís Vázquez, and Josep A. Ferré. Using machine learning to estimate the impact of different modes of transport and traffic restriction strategies on urban air quality. Urban Climate, 45:101284, 2022.
[18] Elham Fijani, Rahim Barzegar, Ravinesh Deo, Evangelos Tziritis, and Konstantinos Skordas. Design and implementation of a hybrid model based on two-layer decomposition method coupled with extreme learning machines to support real-time environmental monitoring of water quality parameters. Science of The Total Environment, 648:839–853, 2019.
[19] International Organization for Standardization and International Electrotechnical Commission. ISO/IEC 25012: Software Engineering: Software Product Quality Requirements and Evaluation (SQuaRE): Data Quality Model. ISO/IEC, 2008.
[20] Cristina Gouveia, Alexandra Fonseca, António Câmara, and Francisco Ferreira. Promoting the use of environmental data collected by concerned citizens through information and communication technologies. Journal of Environmental Management, 71(2):135–154, 2004.
[21] Corinna Gries, Mark Servilla, Margaret O'Brien, Kristin Vanderbilt, Colin Smith, Duane Costa, and Susanne Grossman-Clarke. Achieving FAIR data principles at the Environmental Data Initiative, the US-LTER data repository. Biodiversity Information Science and Standards, 3, 06 2019.
[22] Ivan Henderson V. Gue, Neil Stephen A. Lopez, Anthony S.F. Chiu, Aristotle T. Ubando, and Raymond R. Tan. Predicting waste management system performance from city and country attributes. Journal of Cleaner Production, 366:132951, 2022.
[23] M. Hino, E. Benami, and Nina Brooks. Machine learning for environmental monitoring. Nature Sustainability, 1, 10 2018.
[24] Irum Inayat, Siti Salwah Salim, Sabrina Marczak, Maya Daneva, and Shahaboddin Shamshirband. A systematic literature review on agile requirements engineering practices and challenges. Comput. Human Behav., 51:915–929, October 2015.
[25] Loso Judijanto, Donny Priyangan, Hanifah Muthmainah, and I Jata. The influence of data quality and machine learning algorithms on AI prediction performance in business analysis in Indonesia. The Eastasouth Journal of Information System and Computer Science, 1:75–86, 12 2023.
[26] Barbara Kitchenham and Pearl Brereton. A systematic review of systematic review process research in software engineering. Inf. Softw. Technol., 55(12):2049–2075, December 2013.
[27] Aniko Konya and Peyman Nematzadeh. Recent applications of AI to environmental disciplines: A review. The Science of The Total Environment, 906:167705, 01 2024.
[28] Larry Lannom, Dimitris Koureas, and Alex R. Hardisty. FAIR data and services in biodiversity science and geoscience. Data Intelligence, 2(1-2):122–130, 01 2020.
[29] Sabina Leonelli and Niccolò Tempini. Data Journeys in the Sciences. Springer, 07 2020.
[30] Aiswarya M, Jan Bosch, and Helena Olsson. Maturity assessment model for industrial data pipelines. pages 503–513, 12 2023.
[31] Kilkenny M F and Robinson K M. Data quality: "garbage in – garbage out", May 2018.
[32] Tharsanee Maganathan, Soundariya Senthilkumar, and Vishnupriya Balakrishnan. Machine learning and data analytics for environmental science: A review, prospects and challenges. IOP Conference Series: Materials Science and Engineering, 955(1):012107, November 2020.
[33] Louis-Gabriel Maltais and Louis Gosselin. Energy management of domestic hot water systems with model predictive control and demand forecast based on machine learning. Energy Conversion and Management: X, 15(100254):100254, August 2022.
[34] Massimiliano Manfren, Patrick AB. James, and Lamberto Tronchin. Data-driven building energy modelling – an analysis of the potential for generalisation through interpretable machine learning. Renewable and Sustainable Energy Reviews, 167:112686, 2022.
[35] Xiao-Li Meng. Enhancing (publications on) data quality: Deeper data minding and fuller data confession. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184, 10 2021.
[36] Violeta Migallón, Francisco J. Navarro-González, Héctor Penadés, José Penadés, and Yolanda Villacampa. A parallel methodology using radial basis functions versus machine learning approaches applied to environmental modelling. Journal of Computational Science, 63:101817, 2022.
[37] Clayton Miller, Bianca Picchetti, Chun Fu, and Jovan Pantelic. Limitations of machine learning for building energy prediction: ASHRAE Great Energy Predictor III Kaggle competition error analysis. Sci. Technol. Built Environ., 28(5):610–627, May 2022.
[38] Michelle E Miro, David Groves, Bob Tincher, James Syme, Stephanie Tanverakul, and David Catt. Adaptive water management in the face of uncertainty: Integrating machine learning, groundwater modeling and robust decision making. Clim. Risk Manag., 34(100383):100383, 2021.
[39] Paula Moral, Álvaro García-Martín, Marcos Escudero-Viñolo, José M. Martínez, Jesús Bescós, Jesús Peñuela, Juan Carlos Martínez, and Gonzalo Alvis. Towards automatic waste containers management in cities via computer vision: containers localization and geo-positioning in city maps. Waste Management, 152:59–68, 2022.
[40] Aiswarya Munappy, Jan Bosch, Helena Holmström Olsson, Anders Arpteg, and Björn Brinne. Data management challenges for deep learning. In 2019 45th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE, August 2019.
[41] Aiswarya Raj Munappy, Jan Bosch, Helena Holmström Olsson, Anders Arpteg, and Björn Brinne. Data management for production quality deep learning models: Challenges and solutions. Journal of Systems and Software, 191:111359, 2022.
[42] Aiswarya Raj Munappy, David Issa Mattos, Jan Bosch, Helena Holmström Olsson, and Anas Dakkak. From ad-hoc data analytics to DataOps. In Proceedings of the International Conference on Software and System Processes, ICSSP '20, pages 165–174, New York, NY, USA, 2020. Association for Computing Machinery.
[43] Carolina Natel de Moura, Jan Seibert, Miriam Rita Moro Mine, and Ricardo Carvalho de Almeida. Are machine learning methods robust enough for hydrological modeling under changing conditions? EGU General Assembly 2020, 2019.
[44] Ngoc-Thanh Nguyen, Keila Lima, Astrid Marie Skålvik, Rogardt Heldal, Eric Knauss, Tosin Daniel Oyetoyan, Patrizio Pelliccione, and Camilla Sætre. Synthesized data quality requirements and roadmap for improving reusability of in-situ marine data. In 2023 IEEE 31st International Requirements Engineering Conference (RE), pages 65–76, 2023.
[45] J Jake Nichol, Matthew G Peterson, Kara J Peterson, G Matthew Fricke, and Melanie E Moses. Machine learning feature analysis illuminates disparity between E3SM climate models and observed climate change. J. Comput. Appl. Math., 395(113451):113451, October 2021.
[46] Salomon Obahoundje, Arona Diedhiou, Komlavi Akpoti, Kouakou Kouassi, Eric Ofosu, and Didier Kouame. Predicting climate-driven changes in reservoir inflows and hydropower in Côte d'Ivoire using machine learning modeling. Energy, 302:131849, 05 2024.
[47] Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, and Alex Hanna. Data and its (dis)contents: A survey of dataset development and use in machine learning research. Patterns, 2(11):100336, November 2021.
[48] Daniel Vazquez Pombo, Oliver Gehrke, and Henrik W Bindner. SOLETE, a 15-month long holistic dataset including: meteorology, co-located wind and solar PV power from Denmark with various resolutions. Data in Brief, 42, 2022.
[49] Shameer Pradhan, Hans-Martin Heyn, and Eric Knauss. Identifying and managing data quality requirements: a design science study in the field of automated driving. Software Quality Journal, 32:1–48, 05 2023.
[50] Maria Priestley, Fionntán O'Donnell, and Elena Simperl. A survey of data quality requirements that matter in ML development pipelines. J. Data and Information Quality, 15(2), June 2023.
[51] Martí Puig and Rosa Mari Darbra. Innovations and insights in environmental monitoring and assessment in port areas. Current Opinion in Environmental Sustainability, 70:101472, 2024.
[52] Tharsanee R M, Soundariya Senthilkumar, and Vishnupriya Balakrishnan. Machine learning and data analytics for environmental science: A review, prospects and challenges. IOP Conference Series: Materials Science and Engineering, 955:012107, 11 2020.
[53] Aiswarya Raj, Jan Bosch, and Helena Olsson. Data Pipeline Management in Practice: Challenges and Opportunities, pages 168–184. Springer, 11 2020.
[54] Aiswarya Raj, Jan Bosch, Helena Holmström Olsson, and Tian J. Wang. Modelling data pipelines. In 2020 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pages 13–20, 2020.
[55] Lucas Ramos, Marilaine Colnago, and Wallace Casaca. Data-driven analysis and machine learning for energy prediction in distributed photovoltaic generation plants: A case study in Queensland, Australia. Energy Rep., 8:745–751, April 2022.
[56] Per Runeson and Martin Höst. Guidelines for conducting and reporting case study research in software engineering. Empirical Softw. Engg., 14(2):131–164, April 2009.
[57] Daniel Saboe, Hamidreza Ghasemi, Ming Ming Gao, Mirjana Samardzic, Kiril D. Hristovski, Dragan Boscovic, Scott R. Burge, Russell G. Burge, and David A. Hoffman. Real-time monitoring and prediction of water quality parameters and algae concentrations using microbial potentiometric sensor signals and machine learning tools. Science of The Total Environment, 764:142876, 2021.
The METRIC-framework for assessing data quality for trustworthy AI in medicine: a systematic review. npj Digital Medicine, 7, August 2024.
[59] Arumoy Shome, Luís Cruz, and Arie van Deursen. Data smells in public datasets. In Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI, CAIN ’22, pages 205–216, New York, NY, USA, 2022. Association for Computing Machinery.
[60] Fabian Sittaro, Christopher Hutengs, and Michael Vohland. Which factors determine the invasion of plant species? Machine learning based habitat modelling integrating environmental factors and climate scenarios. Int. J. Appl. Earth Obs. Geoinf., 116(103158):103158, February 2023.
[61] Amrish Solanki. Advancements in artificial intelligence: A comprehensive review and future prospects, April 2024.
[62] Devis Tuia, Benjamin Kellenberger, Sara Beery, Blair R. Costelloe, Silvia Zuffi, Benjamin Risse, Alexander Mathis, Mackenzie W. Mathis, Frank van Langevelde, Tilo Burghardt, Roland Kays, Holger Klinck, Martin Wikelski, Iain D. Couzin, Grant van Horn, Margaret C. Crofoot, Charles V. Stewart, and Tanya Berger-Wolf. Perspectives in machine learning for wildlife conservation. Nat. Commun., 13(1):792, February 2022.
[63] Costas A. Velis, David C. Wilson, Yoni Gavish, Sue M. Grimes, and Andrew Whiteman. Socio-economic development drives solid waste management performance in cities: A global analysis using machine learning. Science of The Total Environment, 872:161913, 2023.
[64] Richard Y. Wang and Diane M. Strong. Beyond accuracy: what data quality means to data consumers. J. Manage. Inf. Syst., 12(4):5–33, March 1996.
[65] Mark Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gaby Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Olavo Bonino da Silva Santos, Philip Bourne, Jildau Bouwman, Anthony Brookes, Tim Clark, Merce Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris Evelo, Richard Finkers, and Barend Mons. The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3, March 2016.
[66] Huanyu Zhou, Yingning Qiu, Yanhui Feng, and Jing Liu. Power prediction of wind turbine in the wake using hybrid physical process and machine learning models. Renewable Energy, 198:568–586, 2022.
[67] Huanyu Zhou, Yingning Qiu, Yanhui Feng, and Jing Liu. Power prediction of wind turbine in the wake using hybrid physical process and machine learning models. Renew. Energy, 198:568–586, October 2022.
[68] Jun-Jie Zhu, Meiqi Yang, and Zhiyong Jason Ren. Machine learning in environmental research: Common pitfalls and best practices. Environmental Science & Technology, 57(46):17671–17689, 2023. PMID: 37384597.

A Survey form to validate the point system

This appendix contains the SPADES-ML validation survey form, reproduced in the following figures.

Figure A.1: Section 1(a) of the survey
Figure A.2: Section 1(b) of the survey
Figure A.3: Section 2 of the survey
Figure A.4: Section 3(a) of the survey
Figure A.5: Section 3(b) of the survey
Figure A.6: Section 3(c) of the survey
Figure A.7: Section 3(d) of the survey
Figure A.8: Section 4(a) of the survey
Figure A.9: Section 4(b) of the survey
Figure A.10: Section 4(c) of the survey
Figure A.11: Section 4(d) of the survey
Figure A.12: Section 5(a) of the survey
Figure A.13: Section 5(b) of the survey
Figure A.14: Section 5(c) of the survey
Figure A.15: Section 5(d) of the survey
Figure A.16: Section 6 of the survey

B Thematic analysis

This appendix presents the results of the thematic analysis performed on the qualitative data. All the quotes under each theme are listed here.

Theme 1: Transparency and justification of data selection

Reader perspective
R1: “Remember that ‘Monitoring data’ always needs to meet criteria defined by the agencies / EU. Therefore, the selection of monitoring data is not a big issue because they already (mostly) meet quality criteria.”
R2: “The purpose and samling design should be presented.”
R3: “Very important to select/use data that is acceptable and reliable since some data may misguide the prediction tool”
R7: “It’s very important as it affects the model accuracy, and also whether your model can accurately capture what’s relevant through the noisy data. However, this is also a source of bias. The reader should be made aware of this source of bias.”
R9: “Data availability is often the only factor for selecting a data source due to limited availability.”

Author perspective
A3: “Important to ensure model accuracy, improves model interpretability, supports reproducibility”
A6: “Sometimes, data availability is the problem, so we don’t have the luxury of choosing”

Reviewer perspective
V3: “Clear data selection criteria, relevant with the problem being solved. Important to check target population, format, quality, and completeness.”
V7: “As a reviewer, I’m always looking for an answer to ’why’. If data selection is anything but random, I look for possible sources of biases and what was done to mitigate them or how the authors acknowledge it later on.”

Theme 2: Data quality and reliability

Reader perspective
R1: “Why is not precision mentioned? That is an important criterion.”
R2: “Clear definition of units and standardisation is of importance.”
R6: “The data should be recorded by a credible source, the uncertainty of any type while collecting the data should be provided. Removing outliers should be done by experts not automated. Consistent measurement is crucial”
R6: “It depends on the purpose of the research, however, the correctness of the data is more important than the quantity”
R9: “In waste management, often the most credible data (e.g. from national governments) is of the worst quality as they have no means to measure it at a national level. Data accuracy and consistency are the most important factors, especially when using ML, as gaps can be filled, but bad data makes this tough to do.”

Author perspective
A3: “High-quality data leads to more accurate predictions and saves time and resources. Very important to check some data that has incomplete information, which may lead to inaccurate models. Can be overcome by removal, or using algorithms that handle missing values”
A6: “Bad data, bad results.”

Reviewer perspective
V3: “Important to see high-quality data, such as accuracy, completeness, consistency”
V7: “As ML models can deal with some amount of noise, these details are not that relevant, but it’s nice to include anyway”
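The checks that recur under this theme (completeness, consistency, and outlier handling reviewed by experts rather than fully automated) can be made concrete in a few lines of analysis code. The following Python sketch is illustrative only and is not part of the survey material; the file name, column names, and plausibility range are hypothetical:

    import pandas as pd

    # Hypothetical hourly sensor readings with a timestamp and a temperature column.
    df = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])

    # Completeness: fraction of non-missing values per column.
    print(df.notna().mean())

    # Consistency: duplicate timestamps indicate multiple records of the same measurement.
    duplicates = df[df["timestamp"].duplicated(keep=False)]

    # Plausibility: flag values outside a domain-defined range for expert review,
    # rather than deleting them automatically (cf. R6 above).
    flagged = df[(df["temperature_c"] < -5) | (df["temperature_c"] > 40)]
    flagged.to_csv("flagged_for_expert_review.csv", index=False)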
Theme 3: Data accessibility and constraints

Reader perspective
R3: “Many real-time application does not have data accessibility”
R5: “Without we cannot validate published methods/results independently”
R6: “It’s one of the most important steps for the credibility of the research. For the research to be conducted correctly, the same results should be obtained by any researcher who is following the same steps as in the paper, thus without data, that cannot be confirmed”
R7: “It really depends on the application. If I’m continuing the research, or testing a different model or applying different techniques, then I can work with the processed data, but otherwise, I would need the raw data. In my opinion, raw data along with the script(s) to process the data is sufficient. Tools such as Large Language Models make working with others’ code easier.”
R9: “more for metadata than data accessibility - The method is usually the most important thing to me as the manner in which a data point was measured impacts how we can interpret the data, Unfortunately this is rarely provided in my field. For data accessibility, both are very important as you often want to go beyond what analysis the paper reported.”

Author perspective
A3: “Easy access to data allows to quickly explore, preprocess, and train models without delays”
A6: “Data is the only way to verify the creditability of the research”
A9: “I work a lot with countries in the Global South. Data formats have to be readily accessible otherwise they will not be able to access (e.g. if expensive / technical software is required)”

Reviewer perspective
V3: “Data accessibility is important and easily accessible if possible”
V5: “Usually, I have no time to review a paper to the level of analysis replication”
V5: “If I need to re-use the data (gaining access) then is not for revision but for additional analyses plus new data”
V7: “This is quite an idealistic perspective. I don’t personally believe that all data needs to be accessible at the time of review, but it is needed for reproducibility. I deem access very important, only to ensure reproducibility.”

Theme 4: Transparency in pre-processing

Reader perspective
R3: “Very important when used for predicting models”
R4: “It would be interesting, if possible, to include also raw data in order to allow another scientist to provide additional or alternative methods for data preprocessing.”
R6: “Sometimes, data preprocessing is more important than the modeling step itself. It is crucial to explain the process”
R7: “It’s very critical and can either help train accurate models or make the model completely miss the target.”
R9: “Its important to provide these reproducible steps / code, but I find contacting the authors and starting a dialogue is often the best way to truly understand what they’ve done.”

Author perspective
A3: “Very important in cleaning, transforming, and organizing raw data into a format that can be used by ML models”
A3: “Very important to check some data that has incomplete information, which may lead to inaccurate models. Can be overcome by removal, or using algorithms that handle missing values”
A9: “Much of the data in waste management is poor and we need to provide justification as to why these incorrect data points should be removed from the analysis”

Reviewer perspective
V3: “Important to make sure the raw data is clean and usable format”

Theme 5: Challenges in data processing

Reader perspective
R3: “Incomplete datasets with missing values can skew the model’s understanding, data might contain errors, outliers, or irrelevant information, ML models can’t directly handle non-numeric data”

Author perspective
A3: “Some data might have incomplete information, irrelevant, incorrect, or random data points, or multiple records of the same data”
A7: “There are way too many challenges to be document everything. It could be useful to include, maybe as comments in code, but not sure how else one would include these.”

Reviewer perspective
V3: “Some issues should be mentioned e.g. missing or inconsistent data, noise, high dimensionality, class imbalance, and difficulties in integrating data from multiple sources, etc.”
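Several quotes under Themes 4 and 5 name the same concrete preprocessing operations: removing or imputing missing values, dropping duplicate records, and encoding non-numeric data. A minimal sketch of such a step is given below for illustration; it is not taken from any of the reviewed papers, and the file and column names are hypothetical:

    import pandas as pd

    df = pd.read_csv("raw_environmental_data.csv")

    # Missing values, option 1: removal (A3: "can be overcome by removal").
    df_removed = df.dropna()

    # Missing values, option 2: imputation, here linear interpolation of the
    # numeric columns, suitable for time-ordered sensor data.
    numeric_cols = df.select_dtypes(include="number").columns
    df_imputed = df.copy()
    df_imputed[numeric_cols] = df_imputed[numeric_cols].interpolate(method="linear")

    # Multiple records of the same data (A3, Theme 5): drop exact duplicates.
    df_imputed = df_imputed.drop_duplicates()

    # "ML models can't directly handle non-numeric data" (R3, Theme 5):
    # one-hot encode a hypothetical categorical column.
    df_encoded = pd.get_dummies(df_imputed, columns=["land_use"])

Publishing such a script alongside the raw data is one way to meet R7’s criterion under Theme 3 that raw data plus the processing script(s) suffices for reproducibility.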
Theme 6: Institutional standards and practices

Reader perspective
R1: “Remember that ’Monitoring data’ always need to meet criteria defined by the agencies / EU. Therefore the selection of monitoring data is not a big issue because they already (mostly) meet quality criteria.”
R5: “Embargo timing could be needed to protect ongoing publishing actions and MSc-PhD Theses”

Author perspective
A4: “Scientific evaluation committees must take into consideration the value of publishing datasets. Otherwise, the scientific community won’t publish them.”

C Referred table from SLR that guided our literature review

This appendix presents Table 2 of the SLR by Konya and Nematzadeh [27], which contains details of the papers referred to in our literature review.

Table C.1: AI applications in environmental disciplines, based on Table 2 of Konya and Nematzadeh [27]. Each entry is headed by the targeted environmental field and the publication, followed by the AI tools applied, the performance metrics used, the processing time (n.a. where not reported), and the best-performing approach.

Air quality monitoring (De Vito et al., 2020 [15])
AI tools: Shallow Neural Networks (SNN), Extreme Learning Machine (ELM)
Performance metrics: mean absolute error (MAE), mean relative error, normalized MAE, root mean square error (RMSE), nRMSE
Processing time: 4 weeks calibration set, 4 weeks offline training
Best performance: Shallow Neural Networks (SNN)

Air quality prediction (Fabregat et al., 2022 [17])
AI tools: Multilayer Perceptron Regressor (MLPR)
Performance metrics: correlation coefficient (R2), factor of two of observations (FAC2), geometric mean bias (GeoMean), geometric standard deviation (GeoSTD), RMSE, mean bias
Processing time: 2 h training on CPU (Central Processing Unit)
Best performance: n.a.

Indoor air particulate matter (PM) monitoring (Chojer et al., 2022 [13])
AI tools: Multiple Linear Regression (MLR), Support Vector Machine (SVM), Gradient Boosting Regression (GBR), Extreme Gradient Boosting (XGB)
Performance metrics: R2, RMSE, mean bias error (MBE)
Processing time: n.a.
Best performance: Support Vector Machine (SVM)

Real-time monitoring of water quality (Fijani et al., 2019 [18])
AI tools: Least Square Support Vector Machine, complete ensemble empirical mode decomposition algorithm with adaptive noise (CEEMDAN), Variational Mode Decomposition (VMD), Extreme Learning Machine, and their combination
Performance metrics: R2, RMSE, MAE, normalized RMSE, normalized MAE, bias error
Processing time: n.a.
Best performance: VMD-CEEMDAN-ELM
Real-time monitoring and prediction of water quality (Saboe et al., 2021 [57])
AI tools: Long short-term memory (LSTM) neural network for multivariate single-step and multivariate multi-step time series forecasting
Performance metrics: RMSE, normalized RMSE
Processing time: n.a.
Best performance: n.a.

Ocean monitoring: animal tracking (Bonofiglio et al., 2022 [9])
AI tools: YOLOv5 (deep learning)
Performance metrics: average precision (AP)
Processing time: reduced video processing time from 20 min to 3 min
Best performance: n.a.

Energy prediction (Ramos et al., 2022 [55])
AI tools: Random Forest, Support Vector Machine, Gradient Tree Boosting
Performance metrics: MAE, RMSE, mean absolute percentage error (MAPE), prediction accuracy
Processing time: n.a.
Best performance: Gradient Tree Boosting

Solar photovoltaic (PV) power forecasting (Pombo et al., 2022 [48])
AI tools: Random Forest, Support Vector Machine, Artificial Neural Networks
Performance metrics: MAE, RMSE, residual sum of squares (RSS), R2
Processing time: n.a.
Best performance: Random Forest

Power prediction of wind turbines (Zhou et al., 2022 [67])
AI tools: Deep Neural Network, Transfer Learning
Performance metrics: RMSE, MAE, MAPE, R2
Processing time: n.a.
Best performance: Physical Guided Neural Network (PGNN)

Hydro-power production capacity prediction (Condemi et al., 2021 [14])
AI tools: Multilayer perceptron, Extreme Learning Machine, Support Vector Machine
Performance metrics: RMSE, MAE, correlation
Processing time: n.a.
Best performance: Support Vector Machine

Building energy modeling (Manfren et al., 2022 [34])
AI tools: Machine Learning-based regression model
Performance metrics: R2, MAPE, normalized mean bias error (NMBE)
Processing time: n.a.
Best performance: n.a.

Building energy prediction (Miller et al., 2022 [37])
AI tools: Neural Networks, XGBoost
Performance metrics: RMSE, MBE
Processing time: n.a.
Best performance: n.a.

Environmental modeling (Migallón et al., 2022 [36])
AI tools: Multiple Linear Regression, K-nearest neighbor, Artificial Neural Networks, Support Vector Machine
Performance metrics: relative error, R2, MSE, RMSE, MAE, MAPE
Processing time: n.a.
Best performance: Support Vector Machine

Hydrological modeling (Natel de Moura et al., 2020 [43])
AI tools: Long short-term memory, transfer learning
Performance metrics: Nash–Sutcliffe efficiency (NSE)
Processing time: n.a.
Best performance: Long short-term memory

Solar radiation prediction (Da Costa Alves Basílio et al., 2022 [7])
AI tools: Multivariate Adaptive Regression Spline, Support Vector Machine, XGBoost, polynomial Ridge Regression
Performance metrics: R2, RMSE, MAE, MAPE, mean absolute deviation (MAD), uncertainty
Processing time: n.a.
Best performance: all of them

Climate models (Nichol et al., 2021 [45])
AI tools: Random Forest
Performance metrics: R2, MAE, average test anomaly correlation coefficient (ACC)
Processing time: n.a.
Best performance: n.a.

Invasive plant species (Sittaro et al., 2023 [60])
AI tools: Support Vector Machine, Boosted Regression Trees
Performance metrics: area under the curve (AUC), RMSE
Processing time: n.a.
Best performance: Boosted Regression Trees

Wildlife conservation (Tuia et al., 2022 [62])
AI tools: Computer Vision, Bayesian estimation, Decision Tree, Random Forest, Support Vector Machine; deep learning: Artificial Neural Network, Convolutional Neural Network, Vision Transformers, Long short-term memory, Gated Recurrent Unit
Performance metrics: for machine learning: n.a.; for deep learning: accuracy, recall, precision
Processing time: n.a.
Best performance: deep learning algorithms: Convolutional Neural Network, Vision Transformers, Long short-term memory, Gated Recurrent Unit

Nature conservation (Moran et al., 2017 [34])
AI tools: Decision Tree Classifier, Extra Tree Classifier
Performance metrics: accuracy, precision, recall, F-score
Processing time: n.a.
Best performance: Extra Tree Classifier

Energy management on domestic hot water (Maltais and Gosselin, 2022 [33])
AI tools: Artificial Neural Networks
Performance metrics: R2
Processing time: n.a.
Best performance: n.a.

Water management, groundwater modeling (Miro et al., 2021 [38])
AI tools: Random Forest, Support Vector Machine, Artificial Neural Networks
Performance metrics: n.a.
Processing time: n.a.
Best performance: Random Forest
Hydrological drought forecasting (Almikaeel et al., 2022 [5])
AI tools: deep learning: Artificial Neural Networks; machine learning: Support Vector Machine
Performance metrics: confusion matrix
Processing time: n.a.
Best performance: all of them

Waste management (Velis et al., 2023 [63])
AI tools: Conditional Random Forest, Univariate Non-linear Regression
Performance metrics: RMSE, symmetric mean absolute percentage error (SMAPE), Akaike information criteria corrected (AICc)
Processing time: n.a.
Best performance: all of them provided different insights

Waste management (Gue et al., 2022 [22])
AI tools: Rough Set-based Machine Learning
Performance metrics: accuracy
Processing time: n.a.
Best performance: n.a.

Waste management (Moral et al., 2022 [39])
AI tools: Computer Vision; deep learning: EfficientDet, YOLOv5
Performance metrics: intersection over union (IoU), precision, recall, average precision (AP)
Processing time: n.a.
Best performance: YOLOv5

Waste management (Agnew et al., 2023 [3])
AI tools: Computer Vision; deep learning: Faster-RCNN, RetinaNet, Grid-RCNN, YOLOF, Mask-RCNN, Cascade Mask-RCNN, YOLACT, SOLOv2
Performance metrics: mean average precision (mAP), average precision, normalized confusion matrix
Processing time: n.a.
Best performance: YOLOF, SOLOv2
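For reference, the regression metrics that recur throughout Table C.1 have the following standard definitions (these are not taken from Konya and Nematzadeh [27]; here $y_i$ denotes an observed value, $\hat{y}_i$ the corresponding prediction, and $\bar{y}$ the mean of the $n$ observations):

\[ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \]
\[ \mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left\lvert\frac{y_i - \hat{y}_i}{y_i}\right\rvert, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}. \]

The Nash–Sutcliffe efficiency (NSE) reported for the hydrological entries has the same form as $R^2$, with observed and simulated streamflow taking the roles of $y_i$ and $\hat{y}_i$.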